Skill · 1.5k repo stars · updated yesterday

PDF Processing Pro

PDF Processing Pro is a production-ready toolkit for complex PDF workflows: form field extraction and filling, table extraction, OCR processing, and batch operations. Use it when you process large volumes of PDFs in production and need robust error handling, validation, logging, and CLI interfaces for automation.

Install in Claude Code
git clone --depth 1 https://github.com/anbeime/skill /tmp/pdf-processing-pro && cp -r /tmp/pdf-processing-pro/skills/pdf-processing-pro/pdf-processing-pro ~/.claude/skills/pdf-processing-pro
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# PDF Processing Pro

Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.

## Quick start

### Extract text from PDF

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
```

### Analyze PDF form (using included script)

```bash
python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions
```

### Fill PDF form with validation

```bash
python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting
```

### Extract tables from PDF

```bash
python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection
```

## Features

### ✅ Production-ready scripts

All scripts include:
- **Error handling**: Graceful failures with detailed error messages
- **Validation**: Input validation and type checking
- **Logging**: Configurable logging with timestamps
- **Type hints**: Full type annotations for IDE support
- **CLI interface**: `--help` flag for all scripts
- **Exit codes**: Proper exit codes for automation
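
The properties listed above can be illustrated with a small script skeleton. This is a minimal sketch, not the code of the bundled scripts: the logger name, the `process` placeholder, and the `.pdf` suffix check are assumptions made for illustration, and only the exit-code values follow the table in the "Error handling" section.

```python
import argparse
import logging
from pathlib import Path
from typing import Optional, Sequence

# Exit codes follow the table in the "Error handling" section.
EXIT_OK = 0
EXIT_FILE_NOT_FOUND = 1
EXIT_INVALID_INPUT = 2
EXIT_PROCESSING_ERROR = 3

log = logging.getLogger("pdf_tool")  # hypothetical logger name


def build_parser() -> argparse.ArgumentParser:
    """CLI definition; argparse supplies the --help flag automatically."""
    parser = argparse.ArgumentParser(description="Hypothetical PDF script skeleton")
    parser.add_argument("input", type=Path, help="path to the input PDF")
    parser.add_argument("--verbose", action="store_true", help="enable debug logging")
    return parser


def process(path: Path) -> int:
    """Placeholder for the real work: returns the file size in bytes."""
    return path.stat().st_size


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Validate the input, run the work, and return an exit code."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",  # timestamped log lines
    )
    if not args.input.is_file():
        log.error("File not found: %s", args.input)
        return EXIT_FILE_NOT_FOUND
    if args.input.suffix.lower() != ".pdf":
        log.error("Not a PDF file: %s", args.input)
        return EXIT_INVALID_INPUT
    try:
        size = process(args.input)
    except OSError as exc:
        log.error("Processing failed: %s", exc)
        return EXIT_PROCESSING_ERROR
    log.info("Processed %s (%d bytes)", args.input, size)
    return EXIT_OK
```

A real script would end by passing the result of `main()` to `sys.exit` so that the shell sees the exit code.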

### ✅ Comprehensive workflows

- **PDF Forms**: Complete form processing pipeline
- **Table Extraction**: Advanced table detection and extraction
- **OCR Processing**: Scanned PDF text extraction
- **Batch Operations**: Process multiple PDFs efficiently
- **Validation**: Pre and post-processing validation

## Advanced topics

### PDF Form Processing

For complete form workflows including:
- Field analysis and detection
- Dynamic form filling
- Validation rules
- Multi-page forms
- Checkbox and radio button handling

See [FORMS.md](FORMS.md)

### Table Extraction

For complex table extraction:
- Multi-page tables
- Merged cells
- Nested tables
- Custom table detection
- Export to CSV/Excel

See [TABLES.md](TABLES.md)

### OCR Processing

For scanned PDFs and image-based documents:
- Tesseract integration
- Language support
- Image preprocessing
- Confidence scoring
- Batch OCR

See [OCR.md](OCR.md)

## Included scripts

### Form processing

**analyze_form.py** - Extract form field information
```bash
python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]
```

**fill_form.py** - Fill PDF forms with data
```bash
python scripts/fill_form.py input.pdf data.json output.pdf [--validate]
```

**validate_form.py** - Validate form data before filling
```bash
python scripts/validate_form.py data.json schema.json
```
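
The schema format that `validate_form.py` expects is not documented here, so the following is only a sketch of the general idea of checking submission data against field rules before filling. The schema shape (`{field_name: {"type": ..., "required": ...}}`), the type names, and the `validate` function are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical schema shape: {field_name: {"type": "text" | "checkbox", "required": bool}}
TYPE_CHECKS = {"text": str, "checkbox": bool}


def validate(data: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for name, rule in schema.items():
        if name not in data:
            if rule.get("required", False):
                errors.append(f"missing required field: {name}")
            continue
        # Unknown type names fall back to str in this sketch.
        expected = TYPE_CHECKS.get(rule.get("type", "text"), str)
        if not isinstance(data[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    # Flag fields the form does not define, which usually indicates a typo.
    for name in data:
        if name not in schema:
            errors.append(f"unknown field: {name}")
    return errors
```

Collecting every problem in one pass, rather than stopping at the first, lets a caller report all issues with a submission at once.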

### Table extraction

**extract_tables.py** - Extract tables to CSV/Excel
```bash
python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]
```

### Text extraction

**extract_text.py** - Extract text with formatting preservation
```bash
python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]
```

### Utilities

**merge_pdfs.py** - Merge multiple PDFs
```bash
python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf
```

**split_pdf.py** - Split PDF into individual pages
```bash
python scripts/split_pdf.py input.pdf --output-dir pages/
```

**validate_pdf.py** - Validate PDF integrity
```bash
python scripts/validate_pdf.py input.pdf
```

## Common workflows

### Workflow 1: Process form submissions

```bash
# 1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json

# 2. Validate submission data
python scripts/validate_form.py submission.json schema.json

# 3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf

# 4. Validate output
python scripts/validate_pdf.py completed.pdf
```
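
The four steps above can also be driven from Python so that the pipeline stops at the first failing step. This is a sketch under the assumption that each script signals failure with a non-zero exit code, as described in the "Error handling" section; `run_pipeline` and `form_steps` are hypothetical helper names.

```python
import subprocess
import sys
from typing import List, Sequence


def run_pipeline(steps: Sequence[Sequence[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, or 0."""
    for cmd in steps:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Step failed ({result.returncode}): {' '.join(cmd)}", file=sys.stderr)
            print(result.stderr, file=sys.stderr)
            return result.returncode
    return 0


def form_steps(template: str, submission: str, output: str) -> List[List[str]]:
    """Build the four Workflow 1 commands, in order."""
    py = sys.executable  # the interpreter running this script
    return [
        [py, "scripts/analyze_form.py", template, "--output", "schema.json"],
        [py, "scripts/validate_form.py", submission, "schema.json"],
        [py, "scripts/fill_form.py", template, submission, output],
        [py, "scripts/validate_pdf.py", output],
    ]
```

Calling `run_pipeline(form_steps("template.pdf", "submission.json", "completed.pdf"))` would then return 0 only if all four steps succeed.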

### Workflow 2: Extract data from reports

```bash
# 1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv

# 2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt
```

### Workflow 3: Batch processing

```python
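import glob
import subprocess
import sys
from pathlib import Path

# Make sure the output directory exists before writing to it
output_dir = Path("processed")
output_dir.mkdir(exist_ok=True)

# Process all PDFs in directory
for pdf_file in glob.glob("invoices/*.pdf"):
    # Extracted text goes to a .txt file named after the source PDF
    output_file = output_dir / Path(pdf_file).with_suffix(".txt").name

    result = subprocess.run([
        sys.executable, "scripts/extract_text.py",
        pdf_file,
        "--output", str(output_file)
    ], capture_output=True, text=True)  # text=True decodes stderr to str

    if result.returncode == 0:
        print(f"✓ Processed: {pdf_file}")
    else:
        print(f"✗ Failed: {pdf_file} - {result.stderr.strip()}")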
```
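
The loop above handles one file at a time. Because each file is processed in its own subprocess, a thread pool is enough to run several at once. The sketch below assumes `extract_text.py` accepts the arguments shown earlier; `extract_one`, `extract_many`, and the `workers` default are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def extract_one(pdf: Path, out_dir: Path, script: str) -> int:
    """Run the extraction script on one PDF and return its exit code."""
    out = out_dir / pdf.with_suffix(".txt").name
    result = subprocess.run(
        [sys.executable, script, str(pdf), "--output", str(out)],
        capture_output=True,
        text=True,
    )
    return result.returncode


def extract_many(
    pdfs: List[Path],
    out_dir: Path,
    script: str = "scripts/extract_text.py",
    workers: int = 4,
) -> Dict[str, int]:
    """Process PDFs concurrently; map each input path to its exit code."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads only wait on child processes, so the GIL is not a bottleneck.
        codes = pool.map(lambda p: extract_one(p, out_dir, script), pdfs)
        return {str(p): code for p, code in zip(pdfs, codes)}
```

The returned mapping makes it straightforward to retry or report only the files whose exit code is non-zero.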

## Error handling

All scripts follow consistent error patterns:

```python
import subprocess

# Exit codes
# 0 - Success
# 1 - File not found
# 2 - Invalid input
# 3 - Processing error
# 4 - Validation error

# Example usage in automation
result = subprocess.run(
    ["python", "scripts/fill_form.py", "input.pdf", "data.json", "output.pdf"]
)

if result.returncode == 0:
    print("Success")
elif result.returncode == 4:
    print("Validation failed - check input data")
else:
    print(f"Error occurred: {result.returncode}")
```
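
In automation code, the numeric codes listed above are easier to read when given names. The following sketch wraps them in an `IntEnum`; the `ExitCode` class and `describe` helper are illustrative and are not part of the toolkit.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Named versions of the exit codes documented above."""
    SUCCESS = 0
    FILE_NOT_FOUND = 1
    INVALID_INPUT = 2
    PROCESSING_ERROR = 3
    VALIDATION_ERROR = 4


def describe(returncode: int) -> str:
    """Turn a process return code into a short message for logs."""
    try:
        return ExitCode(returncode).name.replace("_", " ").lower()
    except ValueError:
        # Anything outside the documented range (e.g. a crash or signal).
        return f"unknown exit code {returncode}"
```

Because `IntEnum` members compare equal to plain integers, `result.returncode == ExitCode.VALIDATION_ERROR` works directly on a `subprocess.run` result.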

## Dependencies

All scripts require:

```bash
pip install pdfplumber pypdf pillow pytesseract pandas
```

Optional for OCR:
```bash
# Install tesseract-ocr system package
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases
```
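
Before running a batch, it can help to confirm that the Python packages and the Tesseract binary are actually available. This is a small sketch using only the standard library; the `check_environment` name is hypothetical, and the module list is derived from the `pip install` line above (Pillow is imported as `PIL`).

```python
import importlib.util
import shutil
from typing import Dict

# Import names for the pip packages listed above (pillow is imported as PIL).
REQUIRED_MODULES = ["pdfplumber", "pypdf", "PIL", "pytesseract", "pandas"]


def check_environment() -> Dict[str, bool]:
    """Report which Python modules and system binaries can be found."""
    status = {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED_MODULES
    }
    # pytesseract is only a wrapper; OCR also needs the tesseract executable on PATH.
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status
```

Printing the entries of `check_environment()` whose value is `False` gives a quick list of what still needs installing.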

## Performance tips

- **Use batch processing** for multiple PDFs
- **Enable multiprocessing** with `--parallel` flag (where supported)
- **Cache extracted data** to avoid re-processing
- **Validate inputs early** to fail fast
- **Use streaming** for large PDFs (>50MB)
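
As one illustration of the caching tip, the sketch below stores extracted text keyed by a SHA-256 digest of the file contents, so an unchanged PDF is never processed twice. The cache layout, the `cached_extract` name, and the `extract` callback are assumptions; any extraction function that takes a path and returns a string could be plugged in.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, read in 1 MiB chunks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_extract(pdf: Path, cache_dir: Path, extract: Callable[[Path], str]) -> str:
    """Return cached text for this exact file content, extracting only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{file_digest(pdf)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = extract(pdf)
    entry.write_text(json.dumps({"source": pdf.name, "text": text}), encoding="utf-8")
    return text
```

Keying on content rather than file name means a renamed file still hits the cache, while an edited file with the same name does not.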

## Best practices

1. **Always validate inputs** before processing
2. **Use try-except** in custom scripts
3. **Log all operations** for debugging
4. **Test with sample PDFs** before production
5. **Set timeouts** for long-running operations
6. **Check exit codes** in automation
7. **Backup originals** before modification
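
Two of these practices, setting timeouts and backing up originals, take only a few lines with the standard library. The helpers below are a sketch; the `.bak` naming convention and the function names are assumptions rather than anything the toolkit prescribes.

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Optional, Sequence


def backup(path: Path) -> Path:
    """Copy the file next to itself with a '.bak' suffix and return the copy."""
    target = path.with_name(path.name + ".bak")
    shutil.copy2(path, target)  # copy2 also preserves timestamps
    return target


def run_with_timeout(cmd: Sequence[str], seconds: float) -> Optional[int]:
    """Return the command's exit code, or None if it exceeded the timeout."""
    try:
        result = subprocess.run(list(cmd), capture_output=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return None
    return result.returncode
```

Returning `None` on timeout keeps that case distinct from every real exit code, including the negative values POSIX systems report for signal-terminated processes.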

## Troubleshooting

### Common issues

**"Module not found" errors**:
```bash
pip install -r requirements.txt
```

**Tessera