Skill · 564 repo stars · updated 2mo ago

pdf

This Claude Code skill performs comprehensive PDF operations including text and table extraction, merging, splitting, page rotation, watermarking, form filling, encryption, image extraction, and OCR processing. Use it whenever users need to manipulate PDF files in any way, whether reading content, combining documents, or creating new PDFs from scratch.

Repository: ScienceClaw

Install in Claude Code

git clone --depth 1 https://github.com/AgentTeam-TaichuAI/ScienceClaw /tmp/pdf && cp -r /tmp/pdf/ScienceClaw/backend/builtin_skills/pdf ~/.claude/skills/pdf

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# PDF Processing Guide

## Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.

## Default Output Format for Complex Tasks

When a user's task involves multiple tool calls (web search, data analysis, tool invocations, etc.) and produces substantial research results, but the user has **NOT specified an output format**, you should **default to generating a PDF report** using the template below. This applies to:

- Research tasks that gather information from multiple sources
- Analysis tasks that produce structured findings
- Any multi-step task where the final deliverable is a comprehensive answer

**Do NOT** default to PDF for simple Q&A, quick lookups, or tasks where the user clearly expects a chat response. Use your judgment: if the task took 5+ tool calls and produced rich, structured content, a PDF report is the appropriate default.

When defaulting to PDF output, follow the "Generate PDF Reports" workflow below — use `generate_report.py` with `report_data.json`, NOT the markdown-to-PDF approach.

---

## Quick Start — Choosing the Right Tool for Text Extraction

**Not all extractors are equal.** Pick the right one based on your PDF type:

| PDF Type | Best Tool | Why |
|---|---|---|
| Academic papers (two-column, conference/journal) | `pdftotext -layout` (poppler) | Handles column detection and character spacing reconstruction |
| Simple single-column documents | pdfplumber or pypdf | Good enough, easier to script |
| Scanned PDFs (image-based) | pytesseract + pdf2image | Needs OCR |
| Tables / structured data | pdfplumber | Best table extraction |
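
For the tables row above, a minimal sketch using pdfplumber might look like the following. The helper names `clean_table` and `extract_tables` are illustrative assumptions, not part of this skill; the only library calls relied on are `pdfplumber.open` and `page.extract_tables()`. pdfplumber is imported inside the function so the cleaning helper can be used without it.

```python
from typing import List, Optional

Row = List[Optional[str]]


def clean_table(rows: List[Row]) -> List[List[str]]:
    """Replace empty cells (None) with "" and trim whitespace in every cell."""
    return [[(cell or "").strip() for cell in row] for row in rows]


def extract_tables(pdf_path: str) -> List[List[List[str]]]:
    """Return every table found in the PDF as a list of cleaned rows."""
    import pdfplumber  # third-party; assumed to be installed

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() yields one list of rows per detected table
            for table in page.extract_tables():
                tables.append(clean_table(table))
    return tables
```

Calling `extract_tables("document.pdf")` (a placeholder path) returns one list of rows per detected table, which can then be written to CSV or loaded into a DataFrame.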

### Academic Papers — Use pdftotext First

Most academic PDFs (arXiv, IEEE, ACM, etc.) use two-column layouts and custom font encodings where spaces are implicit (encoded as character spacing, not space characters). Python libraries like `pypdf` and basic `pdfplumber` often produce **merged words** (e.g. `"TheConferenceonAI"` instead of `"The Conference on AI"`).

**Always prefer `pdftotext` from poppler-utils for academic papers:**

```bash
# Best option — preserves layout and column structure
pdftotext -layout input.pdf output.txt

# Alternative — raw text without layout (still handles spacing correctly)
pdftotext input.pdf output.txt
```

If `pdftotext` is not available, use PyMuPDF with sort mode:

```python
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for page in doc:
    # sort=True reorders text by reading position (top-to-bottom, left-to-right)
    text = page.get_text("text", sort=True)
    print(text)
```
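
The preference order described above (pdftotext first, PyMuPDF as the fallback) can be sketched as a single helper. This is an illustration under stated assumptions: the function names `build_pdftotext_cmd` and `extract_text` are hypothetical, and passing `-` as the output file is used to make `pdftotext` write to stdout.

```python
import shutil
import subprocess
from typing import List


def build_pdftotext_cmd(pdf_path: str, layout: bool = True) -> List[str]:
    """Build the pdftotext command line; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str) -> str:
    """Prefer pdftotext; fall back to PyMuPDF when poppler is not installed."""
    if shutil.which("pdftotext"):
        result = subprocess.run(
            build_pdftotext_cmd(pdf_path),
            capture_output=True,
            encoding="utf-8",
            check=True,  # raise CalledProcessError if pdftotext fails
        )
        return result.stdout

    import fitz  # PyMuPDF, imported only when the fallback is needed

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text("text", sort=True) for page in doc)
```

Checking for the binary with `shutil.which` keeps the decision explicit, so a missing poppler install leads to the fallback rather than an exception.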

### Simple Documents — pypdf Quick Start

```python
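from pypdf import PdfReader

reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Join pages with newlines so the last word of one page does not fuse
# with the first word of the next; image-only pages yield no text.
text = "\n".join(page.extract_text() or "" for page in reader.pages)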
```

> **Warning**: pypdf's text extraction is basic. If the output has merged words or garbled column text, switch to `pdftotext -layout`.
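
One rough way to notice the merged-word symptom automatically is to measure how many alphabetic tokens are implausibly long. The sketch below is a heuristic of our own, not something pypdf provides; the names `merged_word_ratio` and `looks_merged` and the default thresholds (20 characters, 5% of tokens) are assumptions to tune per corpus.

```python
import re


def merged_word_ratio(text: str, max_word_len: int = 20) -> float:
    """Fraction of alphabetic tokens longer than max_word_len characters."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    long_tokens = [t for t in tokens if len(t) > max_word_len]
    return len(long_tokens) / len(tokens)


def looks_merged(text: str, threshold: float = 0.05, max_word_len: int = 20) -> bool:
    """Heuristic: True when unusually many tokens look like fused words."""
    return merged_word_ratio(text, max_word_len) > threshold
```

If `looks_merged(text)` returns `True` for pypdf output, that is a reasonable signal to retry the file with `pdftotext -layout`. The heuristic only considers Latin-alphabet tokens, so it says nothing about CJK text.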

## Generate PDF Reports (Data-Driven Template)

**MANDATORY: When creating ANY structured report (research, analysis, summary, etc.), you MUST use the pre-built professional template.** Do NOT write your own PDF generation code using reportlab, fpdf2, or any other library from scratch. The template already handles all styling, CJK fonts, layout, and pagination correctly.

This template produces business-professional quality PDFs with:
- Cover page with metadata table and disclaimer
- **Auto-generated Table of Contents with page numbers**
- Numbered section headings, dense detailed content
- Tables with auto column widths and smart alignment
- References section with numbered citations

### Step 1: Copy the generator to your workspace

**CRITICAL**: You MUST use shell `cp` to copy the script exactly as shown below. Do NOT:
- Write your own PDF generator from scratch
- Use `read_file` + `write_file` (risks stale cached version)
- Modify the generator script in any way

```bash
cp /builtin-skills/pdf/scripts/generate_report.py ./generate_report.py
```

### Step 2: Build `report_data.json`

**Two phases: write section text files, then assemble into JSON.**

**Phase 1 — Write each section as a plain text file** using `write_file`:

For each major section, `read_file` the relevant research_data, then `write_file` the section content directly:
```
read_file("research_data/literature.md")           # refresh data in context
write_file("sections/sec_01_intro.txt", "...")      # write section content
write_file("sections/sec_02_mutations.txt", "...")   # next section
...
```
Each section file should be 1,000-2,000+ words with specific data, citations, and analysis.
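
The length target above can be checked with a small script before assembling the JSON. This is a hedged sketch: the helper names `count_words` and `short_sections` are hypothetical, the `sections/` directory follows the example paths shown above, and words are counted by splitting on whitespace, which undercounts CJK text. The script only reads the section files and contains no section content itself.

```python
from pathlib import Path
from typing import Dict


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def short_sections(section_dir: str, min_words: int = 1000) -> Dict[str, int]:
    """Map each .txt file below min_words to its word count."""
    result = {}
    for path in sorted(Path(section_dir).glob("*.txt")):
        n = count_words(path.read_text(encoding="utf-8"))
        if n < min_words:
            result[path.name] = n
    return result
```

Running `short_sections("sections")` returns an empty dict when every section meets the minimum; otherwise it lists the files that need more content.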

**NEVER write a Python script that contains section text as string literals.** The section content goes directly into .txt files via `write_file`, not into Python code. Do NOT write scripts named "generate_sections", "create_content", "build_report" etc. that embed text in Python strings. If a sandbox script fails twice, switch to direct `write_file` calls.

**Writing style — academic research report (CRITICAL):**
- Write continuous flowing prose. Each paragraph should run 8-10 sentences and follow the pattern: topic sentence → supporting evidence with specific data → analysis/comparison → transition to next point.
- Use in-text citations [1], [2] when referencing data. These render as superscript links in the PDF. Do NOT add a "References" list at the end of each chapter — all references go in ONE final `references` section.
- Synthesize across sources: "Study A [1] reported X, while Study B [2] found Y, suggesting that Z."
- Use academic connectives: "Furthermore", "In contrast", "These findings indicate", "Notably", "Taken together".
- NEVER use numbered-point structure (e.g. "1. Topic Title\n\nParagraph. 2. Topic Title\n\nParagraph."). Instead, use `##` subheadings for structure and

More from this repository

docx (Skill)

Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include: any mention of 'Word doc', 'word document', '.docx', or requests to produce professional documents with formatting like tables of contents, headings, page numbers, or letterheads. Also use when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a 'report', 'memo', 'letter', 'template', or similar deliverable as a Word or .docx file, use this skill. Do NOT use for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation.

feishu-setup (Skill)

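Automatically configures a Feishu bot application. Trigger this skill when the user asks to configure Feishu, create a Feishu bot, connect Lark/Feishu, set the Feishu app_id/app_secret, or asks how to configure Feishu IM. The skill uses the sandbox's built-in browser to complete app creation, permission configuration, event subscription, and publishing on the Feishu Open Platform automatically; the user only needs to scan a QR code to log in.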

find-skills (Skill)

MANDATORY: When a user asks to install, find, search, or add ANY skill (e.g. 'install hello-world skill', 'find a skill for X', 'add a skill'), you MUST first run `skills find <query>` to search the skills ecosystem. NEVER create a skill from scratch without searching first. Even if the name sounds simple, always search — it may already exist as a published skill.

pptx (Skill)

Use this skill any time a .pptx file is involved — as input, output, or both. This includes: creating slide decks, pitch decks, or presentations; reading or extracting text from .pptx files; editing or updating existing presentations; combining or splitting slide files; working with templates, layouts, speaker notes, or comments. Trigger whenever the user mentions 'deck', 'slides', 'presentation', or references a .pptx filename. If a .pptx file needs to be opened, created, or touched, use this skill.

skill-creator (Skill)

Create new skills, modify and improve existing skills, and measure skill performance. MANDATORY: Use this skill whenever the user wants to create a custom skill from scratch, design a workflow as a skill, write their own SKILL.md, update or optimize an existing skill, run evals to test a skill, benchmark skill performance, or asks questions like 'how do I make a skill', 'create a skill for X', 'turn this into a skill', 'I want to build a skill'. Even if the user doesn't use the word 'skill' explicitly, trigger this if they want to capture a reusable workflow or set of instructions for the agent.

tool-creator (Skill)

Create new tools or upgrade existing tools for the agent. MANDATORY: Use this skill whenever the user wants to create a custom tool, convert a script into a reusable tool, write a new tool function, upgrade or modify an existing tool, test and improve a tool in the sandbox, or asks things like 'make a tool for X', 'create a tool that does Y', 'improve the X tool', 'upgrade my tool', 'turn this script into a tool'. Even if the user doesn't use the word 'tool' explicitly, trigger this if they want to add a new callable capability to the agent or modify an existing one.

tooluniverse (Skill)

Access 1000+ scientific tools through ToolUniverse for drug discovery, protein analysis, genomics, literature search, clinical data, ADMET prediction, molecular docking, and more. Use when the user needs biomedical or scientific research capabilities.

xlsx (Skill)

Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path, even casually (like "the xlsx in my downloads"), and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.