ClaudeWave
Skill · 128 repo stars · updated 17d ago

paper-fetch

Use whenever the user wants to obtain, download, or fetch a paper's PDF — given a DOI, an arXiv id, a paper title, a citation, or a list of DOIs. Trigger on phrases like "download this paper", "find the PDF for [DOI]", "grab me the [Nature/bioRxiv/arXiv] paper on X", "get the open-access version", "I need this article", or any bulk/batch paper download request, even when the user doesn't explicitly say "PDF" or "DOI". Resolves via Unpaywall → Semantic Scholar → arXiv → PubMed Central → bioRxiv/medRxiv → publisher direct (institutional opt-in) → Sci-Hub mirrors as last-resort fallback.

Install in Claude Code
git clone --depth 1 https://github.com/Agents365-ai/paper-fetch /tmp/paper-fetch && cp -r /tmp/paper-fetch/skills/paper-fetch ~/.claude/skills/paper-fetch
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# paper-fetch

Fetch the PDF for a paper given a DOI (or title). Tries multiple sources in priority order and stops at the first hit.

## Resolution order

1. **Unpaywall** — `https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL`, read `best_oa_location.url_for_pdf` (skipped if `UNPAYWALL_EMAIL` not set)
2. **Semantic Scholar** — `https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds`
3. **arXiv** — if `externalIds.ArXiv` present, `https://arxiv.org/pdf/{arxiv_id}.pdf`
4. **PubMed Central OA** — if PMCID present, `https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/`
5. **bioRxiv / medRxiv** — if DOI prefix is `10.1101`, query `https://api.biorxiv.org/details/{server}/{doi}` for the latest version PDF URL
6. **Publisher direct** *(institutional mode only — `PAPER_FETCH_INSTITUTIONAL=1`)* — DOI-prefix → publisher PDF template (Nature / Science / Wiley / Springer / ACS / PNAS / NEJM / Sage / T&F / Elsevier). The caller's own subscription IP / cookies / EZproxy are what authorize the fetch; unauthorized responses fail the `%PDF` check and fall through to step 7.
7. **Sci-Hub mirrors** *(on by default; disable with `PAPER_FETCH_NO_SCIHUB=1`)* — last-resort fallback. Tries the mirror list in `PAPER_FETCH_SCIHUB_MIRRORS` (or built-in defaults `sci-hub.ru`, `sci-hub.st`, `sci-hub.su`, `sci-hub.box`, `sci-hub.red`, `sci-hub.al`, `sci-hub.mk`, `sci-hub.ee`) in order; on full miss, scrapes `https://www.sci-hub.pub/` once per process for fresh mirrors. CAPTCHA / missing-paper pages have no PDF iframe and fall through silently.
8. Otherwise → report failure with title/authors so the user can request the paper via interlibrary loan (ILL)
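
The "first hit wins" ordering above can be sketched as follows. This is a minimal illustration, not the code in `scripts/fetch.py`: the names `Resolver`, `first_hit`, `unpaywall_api_url`, and `semantic_scholar_api_url` are hypothetical, and only the two API URL templates are taken from steps 1 and 2.

```python
from typing import Callable, Iterable, Optional
from urllib.parse import quote

# Hypothetical type: a resolver maps a DOI to a candidate PDF URL, or None on a miss.
Resolver = Callable[[str], Optional[str]]


def unpaywall_api_url(doi: str, email: str) -> str:
    """Build the Unpaywall lookup URL described in step 1."""
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?email={quote(email)}"


def semantic_scholar_api_url(doi: str) -> str:
    """Build the Semantic Scholar lookup URL described in step 2."""
    return (
        "https://api.semanticscholar.org/graph/v1/paper/"
        f"DOI:{quote(doi, safe='/')}?fields=openAccessPdf,externalIds"
    )


def first_hit(
    doi: str, resolvers: Iterable[tuple[str, Resolver]]
) -> Optional[tuple[str, str]]:
    """Try each (name, resolver) pair in order and stop at the first URL found."""
    for name, resolve in resolvers:
        url = resolve(doi)
        if url:
            return name, url  # later resolvers are never called
    return None  # full miss: the caller reports failure (step 8)
```

Keeping each source behind the same callable shape means the priority order is just the order of the list, so reordering or skipping a source (for example when `UNPAYWALL_EMAIL` is unset) does not touch the loop.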

**CloakBrowser fallback (download layer, opt-in — `PAPER_FETCH_CLOAK=1`).** This is not a separate source: it sits at the download chokepoint, so it applies to *any* of the sources above. When a resolved PDF URL is blocked by Cloudflare — HTTP 403/429, or a "Just a moment…" HTML interstitial served in place of the file — and the operator opted in, the URL is retried through [CloakBrowser](https://github.com/CloakHQ/CloakBrowser) (a stealth Chromium that passes the JS challenge) via the `cloak_pdf.py` companion. Bytes it returns are re-validated through the same `%PDF` magic-byte + 50 MB checks; on success the result carries `via: "cloak"`. Off by default, fails closed (missing CloakBrowser → silent fall-through), and the agent cannot opt in — see *CloakBrowser access* below.
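
The validation and block-detection rules in this paragraph can be sketched as two small predicates. This is an illustrative reading of the text, not the project's implementation: the function names are hypothetical, "50 MB" is assumed to mean 50 MiB, and the 2 KiB window scanned for the interstitial marker is an arbitrary choice.

```python
PDF_MAGIC = b"%PDF"
MAX_PDF_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB


def looks_like_pdf(data: bytes, max_bytes: int = MAX_PDF_BYTES) -> bool:
    """Accept only payloads that start with the PDF magic bytes and fit the size cap."""
    return data.startswith(PDF_MAGIC) and len(data) <= max_bytes


def is_cloudflare_block(status: int, body: bytes) -> bool:
    """Flag HTTP 403/429, or a 'Just a moment' HTML page served instead of a PDF."""
    if status in (403, 429):
        return True
    head = body[:2048].lower()  # assumption: the marker appears near the top of the page
    return not body.startswith(PDF_MAGIC) and b"just a moment" in head
```

Because the same `looks_like_pdf` check would run on bytes from every source, an HTML login or CAPTCHA page fails it the same way regardless of where it came from.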

If only a title is given, pass it directly via `--title "<title>"`. Resolution chain:

1. **Crossref** `query.title` — primary; covers all major journal/conference DOIs
2. **Semantic Scholar `/paper/search/match`** — fallback when Crossref's top match is low-confidence (`match_score < 40`) or the gap to the runner-up is `< 3`. Critically, S2 covers arXiv-only preprints (no Crossref DOI). When S2 surfaces a paper that has only an arXiv id, the canonical `10.48550/arXiv.<id>` is synthesized so the download chain stays uniform.
3. **Crossref's best guess (low-confidence)** — used only when both resolvers struggled. The result envelope sets `meta.title_resolution.low_confidence: true` plus a `low_confidence_reason` (`score_below_threshold` / `ambiguous_runner_up`) so an agent can either bail or confirm via `--dry-run`.

Either way, the resolved DOI, the winning resolver, the full `resolvers_tried` list, and the top candidate matches are all surfaced under `meta.title_resolution`.
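
The low-confidence rule for Crossref matches can be sketched as a single function. The thresholds (`40` and `3`) and the two reason strings come from the text above; the function name and the choice to report `score_below_threshold` first when both conditions hold are assumptions.

```python
from typing import Optional

SCORE_THRESHOLD = 40  # from the text: match_score < 40 is low-confidence
RUNNER_UP_GAP = 3     # from the text: a gap to the runner-up < 3 is ambiguous


def low_confidence_reason(
    top_score: float, runner_up_score: Optional[float]
) -> Optional[str]:
    """Return a low_confidence_reason string, or None when the match is trusted."""
    if top_score < SCORE_THRESHOLD:
        return "score_below_threshold"
    if runner_up_score is not None and top_score - runner_up_score < RUNNER_UP_GAP:
        return "ambiguous_runner_up"
    return None  # confident match: no fallback resolver needed
```

A non-`None` result is the condition under which the chain would fall back to Semantic Scholar, and, if that also struggles, the condition under which `meta.title_resolution.low_confidence` would be set.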

**If `semanticscholar-skill` is registered**, it can serve as a richer pre-step for title → DOI resolution — useful when you also need relevance ranking, snippet search, or citation context, not just a DOI. The agent writes a Python script using the skill's `match_title()` to read `externalIds.DOI`, then runs `paper-fetch <doi>`. When the result has only an `ArXiv` id (no DOI), synthesize `10.48550/arXiv.<ArXiv>` and pass that to paper-fetch.
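
As a rough sketch of the last step, the DOI can be picked out of a Semantic Scholar `externalIds` mapping like this. The helper names are hypothetical, and stripping an `arXiv:` prefix is a defensive assumption rather than documented behavior.

```python
from typing import Optional


def synthesize_arxiv_doi(arxiv_id: str) -> str:
    """Build the canonical 10.48550/arXiv.<id> DOI for an arXiv-only paper."""
    bare = arxiv_id.strip()
    if bare.lower().startswith("arxiv:"):  # assumption: tolerate an "arXiv:" prefix
        bare = bare[len("arxiv:"):]
    return f"10.48550/arXiv.{bare}"


def doi_from_external_ids(external_ids: dict) -> Optional[str]:
    """Prefer a real DOI; fall back to a synthesized arXiv DOI; else None."""
    doi = external_ids.get("DOI")
    if doi:
        return doi
    arxiv_id = external_ids.get("ArXiv")
    return synthesize_arxiv_doi(arxiv_id) if arxiv_id else None
```

The returned string is what would be passed to `paper-fetch <doi>`.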

When only the DOI is needed, `--title` is the single-command path — paper-fetch's built-in Crossref → S2 chain handles most cases.

## Usage

```bash
python scripts/fetch.py <DOI> [options]
python scripts/fetch.py --title "<paper title>" [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema           # machine-readable self-description
```

### Flags

The flags below are the ones an agent composes in normal use. For the complete contract — including `--dry-run`, `--pretty`, `--stream`, `--overwrite`, `--timeout`, `--version`, plus parameter types and exit-code mappings — run `python scripts/fetch.py schema` (machine-readable, drift-checked via `schema_version`).

| Flag | Default | Description |
|------|---------|-------------|
| `doi` | — | DOI to fetch (positional). Use `-` to read a single DOI from stdin |
| `--title TITLE` | — | Paper title; resolved to a DOI via Crossref, with Semantic Scholar as fallback, before download. Mutually exclusive with positional DOI / `--batch` |
| `--batch FILE` | — | File with one DOI per line for bulk download. Use `-` to read from stdin |
| `--out DIR` | `pdfs` | Output directory |
| `--format` | auto | `json` for agents, `text` for humans. Auto-detects: `json` when stdout is not a TTY, `text` when it is |
| `--idempotency-key KEY` | — | Safe-retry key. Re-running with the same key replays the original envelope from `<out>/.paper-fetch-idem/` without network I/O |
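
The replay behavior of `--idempotency-key` can be pictured with a small sketch. Only the directory name `<out>/.paper-fetch-idem/` comes from the table; the one-file-per-key layout, the SHA-256 file naming, and the `run_idempotent` helper are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def run_idempotent(out_dir: Path, key: str, do_fetch: Callable[[], dict]) -> dict:
    """Replay a stored envelope for `key`, or run the fetch once and store it."""
    cache_dir = out_dir / ".paper-fetch-idem"
    # Assumption: one JSON file per key, named by the key's SHA-256 digest.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{digest}.json"

    if cache_file.exists():
        # Replay path: no network I/O, the original envelope is returned as stored.
        return json.loads(cache_file.read_text(encoding="utf-8"))

    envelope = do_fetch()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(envelope), encoding="utf-8")
    return envelope
```

Hashing the key keeps arbitrary caller-supplied strings out of file names, which is one reasonable way to avoid path-traversal surprises.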

### Agent discovery: `schema` subcommand

```bash
python scripts/fetch.py schema
```

Emits a complete machine-readable description of the CLI on stdout (no network). Includes `cli_version`, `schema_version`, parameter types, exit codes, error codes, envelope shapes, and environment variables. Agents should read this once, cache it against `schema_version`, and re-read when the cached version drifts.
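
An agent-side cache along these lines might look like the sketch below. It assumes the `schema` output is JSON with a top-level `schema_version` field, and that the agent has some way to observe the current version (`observed_version`); neither detail is confirmed by the text, and both function names are hypothetical.

```python
import json
import subprocess
import sys
from typing import Callable, Optional


def run_schema_command() -> dict:
    """Run the documented `schema` subcommand and parse stdout (assumed to be JSON)."""
    proc = subprocess.run(
        [sys.executable, "scripts/fetch.py", "schema"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)


def current_schema(
    cached: Optional[dict],
    observed_version: Optional[str],
    fetch: Callable[[], dict] = run_schema_command,
) -> dict:
    """Reuse the cached schema while its version matches; otherwise re-read it."""
    # Assumption: `schema_version` is a top-level key of the schema document.
    if cached is not None and cached.get("schema_version") == observed_version:
        return cached
    return fetch()
```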

### Output contract

**stdout** emits a single JSON envelope. Every envelope carries a `meta` slot.
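
On the consuming side, an agent could parse the envelope roughly as follows. The field names `ok`, `data.results`, `doi`, `success`, and `meta` are the ones shown in this section; the helper names and the decision to raise on a missing `meta` slot are assumptions.

```python
import json


def parse_envelope(stdout_text: str) -> dict:
    """Parse captured stdout as one JSON envelope and check for the `meta` slot."""
    envelope = json.loads(stdout_text)
    if not isinstance(envelope, dict) or "meta" not in envelope:
        raise ValueError("not a paper-fetch envelope: missing 'meta'")
    return envelope


def succeeded_dois(envelope: dict) -> list[str]:
    """List the DOIs whose result entries report success."""
    results = (envelope.get("data") or {}).get("results") or []
    return [r["doi"] for r in results if r.get("success")]
```

Checking `ok` first and only then walking `data.results` keeps failure envelopes, whose exact shape is best read from the `schema` subcommand, from being mistaken for empty successes.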

**Success** (all DOIs resolved):

```json
{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-021-03819-2",
        "success": true,