Skill1k estrellas del repoactualizado today

data-formats

This Claude Code skill provides practical guidance for inspecting and parsing unknown or diverse data formats including binary files, structured text formats, machine learning checkpoints, and databases. Use it when encountering unfamiliar file types, needing to detect file formats programmatically, or requiring strategies for safely parsing large or complex datasets without making incorrect assumptions about encoding or structure.

Ver fuente Repositorio: pydantic-deepagents

Instalar en Claude Code

Copiar

git clone --depth 1 https://github.com/vstorm-co/pydantic-deepagents /tmp/data-formats && cp -r /tmp/data-formats/pydantic_deep/bundled_skills/data-formats ~/.claude/skills/data-formats

Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

Definición

SKILL.md

# Data Formats

How to work with diverse and unknown data formats.

## Format Detection

Always inspect before parsing:

```bash
file <filename>                    # MIME type detection
xxd <filename> | head -5           # hex dump (first bytes)
head -3 <filename>                 # text preview
python3 -c "
with open('<filename>', 'rb') as f:
    h = f.read(16)
    print(h, h.hex())
"
```

## Common Formats

### Binary
- **Magic bytes**: Most binary formats start with a signature (ELF: `\x7fELF`, PNG: `\x89PNG`)
- **Endianness**: Check if little-endian or big-endian (`struct.unpack('<I', ...)` vs `'>I'`)
- **Alignment**: Fields are often aligned to 4 or 8 bytes
- **Offsets**: Binary headers often contain offsets to other sections

### Structured text
- **CSV/TSV**: Check delimiter (comma, tab, pipe), quoting, header row
- **JSON**: `python3 -c "import json; json.load(open('f'))"`
- **YAML**: Check indentation, anchors/aliases
- **TOML**: `python3 -c "import tomllib; ..."`
- **XML**: Check encoding declaration, namespaces

### Checkpoints / Model files
- **PyTorch**: `.pt`, `.pth` → `torch.load(f, map_location='cpu')`
- **TensorFlow**: `.ckpt` → index + data files, use `tf.train.load_checkpoint()`
- **NumPy**: `.npy`, `.npz` → `numpy.load()`
- **HuggingFace**: `config.json` + `model.safetensors`
- **ONNX**: `onnx.load()`

### Database files
- **SQLite**: `file` says "SQLite 3.x database" → `sqlite3 <file> ".tables"`
- **WAL files**: SQLite write-ahead log — recover with `sqlite3` PRAGMA
- **CSV dumps**: Often need schema inference

## Parsing Strategies

### Unknown binary format
1. Hex dump first 256 bytes: `xxd file | head -16`
2. Look for magic bytes, version numbers, string tables
3. Check file size — does it suggest a pattern? (e.g., N * record_size)
4. Look for documentation of the format online
5. Write a minimal parser, test on known values

### Large structured files
1. Never load entirely — sample first: `head`, `tail`, `shuf -n 10`
2. Check consistency: are all lines the same format?
3. Count fields: `head -1 file | awk -F',' '{print NF}'`
4. Watch for: mixed types, missing values, encoding issues

### Multi-file datasets
1. List all files and sizes
2. Look for manifest/index files (often JSON or CSV)
3. Check naming patterns — timestamps, sequence numbers, shards
4. Process one file first, then generalize

## Common Pitfalls

- Assuming UTF-8 when the file is Latin-1 or binary
- Assuming CSV when it's TSV (or vice versa)
- Ignoring the header row
- Not handling quoted fields with embedded delimiters
- Reading binary files as text (corrupts data)
- Endianness mismatch (x86 is little-endian, network byte order is big-endian)

Del mismo repositorio

build-and-compileSkill

Building, compiling, and resolving dependency issues across languages

code-reviewSkill

Systematic code review for bugs, security, style, and performance

environment-discoverySkill

Systematic exploration of unknown environments before starting work

git-workflowSkill

Git operations: commits, branches, PRs, and conflict resolution

performant-codeSkill

Writing efficient code that handles large data and tight constraints

refactorSkill