ClaudeWave

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

MCP Servers · Official registry · 69 stars · 10 forks · Python · MIT · Updated 7d ago
ClaudeWave Trust Score
87/100
Trusted
Passed
  • Open-source license (MIT)
  • Actively maintained (<30d)
  • Clear description
  • Topics declared
Last scanned: 6/11/2026
Install in Claude Code / Claude Desktop
Method: pip / Python · pdfmux
Claude Code CLI
claude mcp add pdfmux -- python -m pdfmux
claude_desktop_config.json (Claude Desktop)
{
  "mcpServers": {
    "pdfmux": {
      "command": "python",
      "args": ["-m", "pdfmux"]
    }
  }
}
1. Run the command above in your terminal (Claude Code), or paste the JSON config into claude_desktop_config.json (Claude Desktop).
2. Replace any <placeholder> values with your API keys or paths.
3. Restart Claude. The MCP server and its tools appear automatically.
💡 Install first: pip install pdfmux
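
If you would rather script step 1 than edit the file by hand, the merge can be sketched in Python. This is a minimal sketch, not part of pdfmux or ClaudeWave: `add_mcp_server` and `update_config_file` are hypothetical helpers, and the location of `claude_desktop_config.json` is left to the caller because it differs by operating system.

```python
import json
from pathlib import Path


def add_mcp_server(config: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of a Claude Desktop config with one MCP server entry added.

    Existing servers are preserved; an entry with the same name is overwritten.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the JSON config at `path` (if present), add pdfmux, and write it back."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = add_mcp_server(existing, "pdfmux", "python", ["-m", "pdfmux"])
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
```

Merging into the existing `mcpServers` object, rather than overwriting the file, keeps any servers you have already configured.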

# pdfmux

[![CI](https://github.com/NameetP/pdfmux/actions/workflows/ci.yml/badge.svg)](https://github.com/NameetP/pdfmux/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/pdfmux)](https://pypi.org/project/pdfmux/)
[![Python 3.11+](https://img.shields.io/pypi/pyversions/pdfmux)](https://pypi.org/project/pdfmux/)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Downloads](https://img.shields.io/pypi/dm/pdfmux)](https://pypi.org/project/pdfmux/)

**Self-healing PDF extraction with per-page confidence scoring.** Open-source LlamaParse alternative for RAG pipelines, MCP server for Claude Desktop, LangChain + LlamaIndex loaders. Ranked #2 on opendataloader-bench (0.900).

The only PDF extractor that audits its own output. Catches blank pages, scrambled columns, broken tables — re-extracts them with a stronger backend. So your LLM gets clean data, not silent garbage. Routes each page to the best of 5 rule-based backends + BYOK LLM fallback (Gemini / Claude / GPT-4o / Ollama). One CLI. One API. Zero config.

<p align="center">
  <img src="demo.svg" alt="pdfmux terminal demo" width="700" />
</p>

```
PDF ──> pdfmux router ──> best extractor per page ──> audit ──> re-extract failures ──> Markdown / JSON / chunks
            |
            ├─ PyMuPDF         (digital text, 0.01s/page)
            ├─ OpenDataLoader  (complex layouts, 0.05s/page)
            ├─ RapidOCR        (scanned pages, CPU-only)
            ├─ Docling         (tables, 97.9% TEDS)
            ├─ Surya           (heavy OCR fallback)
            ├─ Marker          (academic papers, neural)
            ├─ Mistral OCR     ($0.002/page, 96.6% tables)
            └─ YOUR LLM        (Gemini / Gemma 4 / Claude / GPT-4o / Ollama / Mistral — BYOK via YAML)
```

## Install

```bash
pip install pdfmux
```

That handles digital PDFs. **For any real-world batch, install `pdfmux[ocr]` too** — almost every directory of PDFs has at least one scan, and without OCR those pages return empty text:

```bash
pip install "pdfmux[ocr]"             # ⭐ recommended — RapidOCR for scanned pages (~200MB, CPU)
```

Other backends, by document type:

```bash
pip install "pdfmux[tables]"          # Docling — table-heavy docs (~500MB)
pip install "pdfmux[opendataloader]"  # OpenDataLoader — complex layouts (Java 11+)
pip install "pdfmux[marker]"          # Marker — neural extraction for academic papers
pip install "pdfmux[llm]"             # Gemini fallback (default LLM)
pip install "pdfmux[llm-claude]"      # Claude (Sonnet / Opus)
pip install "pdfmux[llm-openai]"      # GPT-4o family
pip install "pdfmux[llm-ollama]"      # Ollama (any local model)
pip install "pdfmux[llm-mistral]"     # Mistral OCR API ($0.002/page)
pip install "pdfmux[llm-all]"         # all LLM providers (incl. Gemma 4 via Gemini key)
pip install "pdfmux[watch]"           # `pdfmux watch <dir>` auto-convert on change
pip install "pdfmux[all]"             # everything
```

Requires Python 3.11+.

## Quick Start

### CLI

```bash
# zero config — just works
pdfmux convert invoice.pdf
# invoice.pdf -> invoice.md (2 pages, 95% confidence, via pymupdf4llm)

# RAG-ready chunks with token limits
pdfmux convert report.pdf --chunk --max-tokens 500

# cost-aware extraction with budget cap
pdfmux convert report.pdf --mode economy --budget 0.50

# schema-guided structured extraction (5 built-in presets)
pdfmux convert invoice.pdf --schema invoice

# BYOK any LLM for hardest pages
pdfmux convert scan.pdf --llm-provider claude

# use a built-in or saved profile (invoices, receipts, papers, contracts, bulk-rag)
pdfmux convert invoice.pdf --profile invoices

# predict cost before running anything
pdfmux estimate big-report.pdf --llm-provider gemini

# stream pages as NDJSON as they finish (great for long documents)
pdfmux stream report.pdf --quality high

# auto-convert any new PDFs that land in a folder
pdfmux watch ./inbox/ -o ./output/

# diff two extractions side-by-side
pdfmux diff old.pdf new.pdf

# batch a directory — writes manifest.json with per-doc confidence
pdfmux convert ./docs/ -o ./output/

# CI mode: fail the run if any document is below 0.20 confidence
pdfmux convert ./docs/ -o ./output/ --strict --min-confidence 0.20

# pre-flight a directory: which extras do you actually need for THIS batch?
pdfmux doctor --check ./docs/

# results are cached by file hash — re-runs are instant; bypass with --no-cache
pdfmux convert report.pdf --no-cache
pdfmux convert report.pdf --clear-cache
```
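
The last comment says results are cached by file hash. The README does not show how pdfmux derives that hash, so the following is only a sketch of the general idea: a content-derived key means a renamed copy still hits the cache, while any byte-level change misses it. `file_cache_key` is a hypothetical name, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from pathlib import Path


def file_cache_key(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in chunks.

    Assumption: this illustrates content-addressed caching in general,
    not pdfmux's actual cache key.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so large PDFs are never loaded whole.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```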

### Python

For batch processing, use `batch_extract()` — not a `subprocess.run(['pdfmux', ...])` loop. Same pipeline, no per-file process spawn, handles non-ASCII filenames:

```python
import pdfmux
from pathlib import Path

# Batch extract — yields (path, result) tuples as each PDF completes.
pdfs = list(Path("./inbox").glob("*.pdf"))
for path, result in pdfmux.batch_extract(pdfs, quality="standard"):
    if isinstance(result, Exception):
        print(f"FAILED {path.name}: {result}")
        continue
    if result.confidence < 0.50:
        print(f"REVIEW {path.name} ({result.confidence:.2f})")
    else:
        print(f"OK     {path.name} ({result.confidence:.2f})")

# Single-file helpers.
text   = pdfmux.extract_text("report.pdf")             # markdown string
data   = pdfmux.extract_json("report.pdf")             # locked schema dict
chunks = pdfmux.chunk("report.pdf", max_tokens=500)    # RAG-ready chunks
```
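
To make the `max_tokens` limit concrete, here is a sketch of one way token-capped chunking can work. It is not pdfmux's chunker: `pack_paragraphs` is a hypothetical function, paragraphs are assumed to be separated by blank lines, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
def pack_paragraphs(text: str, max_tokens: int) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks under a token cap.

    Tokens are approximated by whitespace-separated words. A single paragraph
    longer than the cap becomes its own oversized chunk rather than being split.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        tokens = len(paragraph.split())
        # Flush the running chunk if adding this paragraph would exceed the cap.
        if current and current_tokens + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Packing whole paragraphs keeps each chunk readable on its own, at the cost of chunks that fall somewhat short of the cap.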

> **Don't wrap pdfmux with your own pypdf/pdfplumber fallback.** pdfmux already routes per page through PyMuPDF → RapidOCR → vision LLM. PyMuPDF tolerates malformed PDFs that pypdf rejects ("Stream has ended unexpectedly"), so a downstream pypdf fallback turns recoverable PDFs into failures. Trust the router; check the confidence score on the result.
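
Checking the confidence score can be as simple as bucketing documents by threshold. The sketch below works on plain `(name, confidence)` pairs so it runs without pdfmux installed. `triage` is a hypothetical helper, and the 0.50 and 0.20 cut-offs are borrowed from the examples above purely for illustration.

```python
from typing import Iterable


def triage(scores: Iterable[tuple[str, float]], review_below: float = 0.50,
           fail_below: float = 0.20) -> dict[str, list[str]]:
    """Bucket documents into ok / review / fail by confidence score."""
    buckets: dict[str, list[str]] = {"ok": [], "review": [], "fail": []}
    for name, confidence in scores:
        if confidence < fail_below:
            buckets["fail"].append(name)
        elif confidence < review_below:
            buckets["review"].append(name)
        else:
            buckets["ok"].append(name)
    return buckets
```

In a pipeline, the pairs could come from the `batch_extract()` loop shown earlier, for example `(path.name, result.confidence)` for each non-exception result.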

## Architecture

```
                           ┌─────────────────────────────┐
                           │     Segment Detector         │
                           │  text / tables / images /    │
                           │  formulas / headers per page │
                           └─────────────┬───────────────┘
                                         │
                    ┌────────────────────────────────────────┐
                    │            Router Engine                │
                    │                                        │
                    │   economy ── balanced ── premium        │
                    │   (minimize $)  (default)  (max quality)│
                    │   budget caps: --budget 0.50            │
                    └────────────────────┬───────────────────┘
                                         │
          ┌──────────┬──────────┬────────┴────────┬──────────┐
          │          │          │                  │          │
     PyMuPDF   OpenData    RapidOCR           Docling     LLM
     digital   Loader      scanned            tables    (BYOK)
     0.01s/pg  complex     CPU-only           97.9%    any provider
               layouts                        TEDS
          │          │          │                  │          │
          └──────────┴──────────┴────────┬────────┴──────────┘
                                         │
                    ┌────────────────────────────────────────┐
                    │           Quality Auditor               │
                    │                                        │
                    │   4-signal dynamic confidence scoring   │
                    │   per-page: good / bad / empty          │
                    │   if bad -> re-extract with next backend│
                    └────────────────────┬───────────────────┘
                                         │
                    ┌────────────────────────────────────────┐
                    │           Output Pipeline               │
                    │                                        │
                    │   heading injection (font-size analysis)│
                    │   table extraction + normalization      │
                    │   text cleanup + merge                  │
                    │   confidence score (honest, not inflated)│
                    └────────────────────────────────────────┘
```
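
The "if bad -> re-extract with next backend" step in the diagram can be sketched as an escalation loop. This is an illustration under stated assumptions, not pdfmux's internals: `extract_with_escalation`, the `Backend` and `Auditor` callables, and the `good_enough` value of 0.8 are all hypothetical.

```python
from typing import Callable

Backend = Callable[[bytes], str]   # page bytes -> extracted text
Auditor = Callable[[str], float]   # extracted text -> confidence in [0, 1]


def extract_with_escalation(page: bytes, backends: list[tuple[str, Backend]],
                            audit: Auditor, good_enough: float = 0.8) -> tuple[str, str, float]:
    """Try backends in order, keeping the best-scoring result.

    Stops early once a result reaches `good_enough`.
    Returns (backend_name, text, confidence).
    """
    if not backends:
        raise ValueError("at least one backend is required")
    best: tuple[str, str, float] | None = None
    for name, backend in backends:
        text = backend(page)
        score = audit(text)
        if best is None or score > best[2]:
            best = (name, text, score)
        if score >= good_enough:
            break  # cheap backend was sufficient; skip the more expensive ones
    assert best is not None  # guaranteed by the non-empty check above
    return best
```

Ordering the list from cheapest to most expensive means the slow backends only run on pages the fast ones could not handle.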

### Key design decisions

- **Router, not extractor.** pdfmux does not compete with PyMuPDF or Docling. It picks the best one per page.
- **Agentic multi-pass.** Extract, audit confidence, re-extract failures with a stronger backend. Bad pages get retried automatically.
- **Segment-level detection.** Each page is classified by content type (text, tables, images, formulas, headers) before routing.
- **4-signal confidence.** Dynamic quality scoring from character density, OCR noise ratio, table integrity, and heading structure. Not hardcoded thresholds.
- **Document cache.** Each PDF is opened once, not once per extractor. Shared across the full pipeline.
- **Data flywheel.** Local telemetry tracks which extractors win per document type. Routing improves with usage.
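
The data flywheel in the last bullet amounts to keeping a tally of which backend wins for each document type. The README does not describe how pdfmux stores or uses its telemetry, so this is only a sketch: `WinTracker` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict


class WinTracker:
    """Count which backend produced the accepted result for each document type."""

    def __init__(self) -> None:
        self._wins: defaultdict[str, Counter[str]] = defaultdict(Counter)

    def record(self, doc_type: str, backend: str) -> None:
        """Note that `backend` produced the accepted output for a `doc_type` page."""
        self._wins[doc_type][backend] += 1

    def preferred(self, doc_type: str, default: str) -> str:
        """Return the backend with the most wins, or `default` if none recorded."""
        wins = self._wins.get(doc_type)
        if not wins:
            return default
        return wins.most_common(1)[0][0]
```

A router could consult `preferred()` to decide which backend to try first, falling back to a fixed default for document types it has not seen yet.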

## Features

| Feature | What it does | Command |
|---------|-------------|---------|
| Zero-config extraction | Routes to best backend automatically | `pdfmux convert file.pdf` |
| RAG chunking | Section-aware chunks with token estimates | `pdfmux convert file.pdf --chunk --max-tokens 500` |
| Cost modes | economy / balanced / premium with budget caps | `pdfmux convert file.pdf --mode economy --budget 0.50` |
| Schema extraction | 5 built-in presets (invoice, receipt, contract, resume, paper) | `pdfmux convert file.pdf --schema invoice` |
| Profiles | Save and re-use config; built-ins for invoices/receipts/papers/contracts/bulk-rag | `pdfmux convert file.pdf --profile invoices` |
| BYOK LLM | Gemini, Gemma 4, Claude, GPT-4o, Ollama, Mistral, any OpenAI-compatible API | `pdfmux convert file.pdf --llm-provider claude` |
| Cost estimate | Predict spend before running | `pdfmux estimate file.pdf --llm-provider gemini` |
| Streaming output | NDJSON events page-by-page for long docs | `pdfmux stre

Topics: ai-agent, docling, document-parsing, llm, mcp, ocr, opendataloader, pdf, pdf-extraction, pdf-to-json, pdf-to-markdown, python, rag, self-healing, structured-extraction

What people ask about pdfmux

What is NameetP/pdfmux?

NameetP/pdfmux is an MCP server for the Claude AI ecosystem. PDF extraction that checks its own work. #2 reading order accuracy: zero AI, zero GPU, zero cost. It has 69 stars on GitHub and was last updated 7d ago.

How do you install pdfmux?

You can install pdfmux by cloning the repository (https://github.com/NameetP/pdfmux) or by following the instructions in the README on GitHub. ClaudeWave also provides quick-install blocks on this page.

Is NameetP/pdfmux safe to use?

Our security agent has analyzed NameetP/pdfmux and assigned it a Trust Score of 87/100 (tier: Trusted). Review the full breakdown of passed checks and flags on this page.

Who maintains NameetP/pdfmux?

NameetP/pdfmux is maintained by NameetP. The latest activity recorded on GitHub was 7d ago, with 4 open issues.

Are there alternatives to pdfmux?

Yes. On ClaudeWave you can explore similar MCP servers at /categories/mcp, sorted by popularity or recent activity.

Deploy pdfmux to your cloud

Take this repo to production in minutes. Each platform generates its own environment with editable environment variables.

Maintain this repo? Add a badge to your README

Paste the badge into your GitHub README to show that it is audited by ClaudeWave. Each badge links back to this page and shows the current Trust Score.

Featured on ClaudeWave: NameetP/pdfmux
[![Featured on ClaudeWave](https://claudewave.com/api/badge/nameetp-pdfmux)](https://claudewave.com/repo/nameetp-pdfmux)
<a href="https://claudewave.com/repo/nameetp-pdfmux"><img src="https://claudewave.com/api/badge/nameetp-pdfmux" alt="Featured on ClaudeWave: NameetP/pdfmux" width="320" height="64" /></a>
