data-formats
This Claude Code skill provides practical guidance for inspecting and parsing unknown or diverse data formats including binary files, structured text formats, machine learning checkpoints, and databases. Use it when encountering unfamiliar file types, needing to detect file formats programmatically, or requiring strategies for safely parsing large or complex datasets without making incorrect assumptions about encoding or structure.
git clone --depth 1 https://github.com/vstorm-co/pydantic-deepagents /tmp/data-formats && cp -r /tmp/data-formats/pydantic_deep/bundled_skills/data-formats ~/.claude/skills/data-formatsSKILL.md
# Data Formats
How to work with diverse and unknown data formats.
## Format Detection
Always inspect before parsing:
```bash
file <filename> # MIME type detection
xxd <filename> | head -5 # hex dump (first bytes)
head -3 <filename> # text preview
python3 -c "
with open('<filename>', 'rb') as f:
h = f.read(16)
print(h, h.hex())
"
```
## Common Formats
### Binary
- **Magic bytes**: Most binary formats start with a signature (ELF: `\x7fELF`, PNG: `\x89PNG`)
- **Endianness**: Check if little-endian or big-endian (`struct.unpack('<I', ...)` vs `'>I'`)
- **Alignment**: Fields are often aligned to 4 or 8 bytes
- **Offsets**: Binary headers often contain offsets to other sections
### Structured text
- **CSV/TSV**: Check delimiter (comma, tab, pipe), quoting, header row
- **JSON**: `python3 -c "import json; json.load(open('f'))"`
- **YAML**: Check indentation, anchors/aliases
- **TOML**: `python3 -c "import tomllib; ..."`
- **XML**: Check encoding declaration, namespaces
### Checkpoints / Model files
- **PyTorch**: `.pt`, `.pth` → `torch.load(f, map_location='cpu')`
- **TensorFlow**: `.ckpt` → index + data files, use `tf.train.load_checkpoint()`
- **NumPy**: `.npy`, `.npz` → `numpy.load()`
- **HuggingFace**: `config.json` + `model.safetensors`
- **ONNX**: `onnx.load()`
### Database files
- **SQLite**: `file` says "SQLite 3.x database" → `sqlite3 <file> ".tables"`
- **WAL files**: SQLite write-ahead log — recover with `sqlite3` PRAGMA
- **CSV dumps**: Often need schema inference
## Parsing Strategies
### Unknown binary format
1. Hex dump first 256 bytes: `xxd file | head -16`
2. Look for magic bytes, version numbers, string tables
3. Check file size — does it suggest a pattern? (e.g., N * record_size)
4. Look for documentation of the format online
5. Write a minimal parser, test on known values
### Large structured files
1. Never load entirely — sample first: `head`, `tail`, `shuf -n 10`
2. Check consistency: are all lines the same format?
3. Count fields: `head -1 file | awk -F',' '{print NF}'`
4. Watch for: mixed types, missing values, encoding issues
### Multi-file datasets
1. List all files and sizes
2. Look for manifest/index files (often JSON or CSV)
3. Check naming patterns — timestamps, sequence numbers, shards
4. Process one file first, then generalize
## Common Pitfalls
- Assuming UTF-8 when the file is Latin-1 or binary
- Assuming CSV when it's TSV (or vice versa)
- Ignoring the header row
- Not handling quoted fields with embedded delimiters
- Reading binary files as text (corrupts data)
- Endianness mismatch (x86 is little-endian, network byte order is big-endian)Building, compiling, and resolving dependency issues across languages
Systematic code review for bugs, security, style, and performance
Systematic exploration of unknown environments before starting work
Git operations: commits, branches, PRs, and conflict resolution
Writing efficient code that handles large data and tight constraints
Refactor code to improve structure and maintainability
Create new reusable skills from conversation context
Systematic approach to diagnosing and fixing errors