Skill · 163 repo stars · updated 12d ago

paper-fetch

Paper-fetch retrieves PDF copies of academic papers by DOI, arXiv ID, title, or citation through a cascading search across open-access repositories and publishers. Use this when users request paper downloads, PDFs, or bulk retrieval across any academic discipline, checking Unpaywall, Semantic Scholar, arXiv, PubMed Central, bioRxiv, and publisher endpoints before attempting Sci-Hub as a fallback option.

Repository: paper-fetch

Install in Claude Code

git clone --depth 1 https://github.com/Agents365-ai/paper-fetch /tmp/paper-fetch && cp -r /tmp/paper-fetch/skills/paper-fetch ~/.claude/skills/paper-fetch

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# paper-fetch

Fetch the PDF for a paper given a DOI (or title). Tries multiple sources in priority order and stops at the first hit.

## Resolution order

1. **Unpaywall** — `https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL`, read `best_oa_location.url_for_pdf` (skipped if `UNPAYWALL_EMAIL` not set)
2. **Semantic Scholar** — `https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds`
3. **arXiv** — if `externalIds.ArXiv` present, `https://arxiv.org/pdf/{arxiv_id}.pdf`
4. **PubMed Central OA** — if PMCID present, `https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/`
5. **bioRxiv / medRxiv** — if DOI prefix is `10.1101`, query `https://api.biorxiv.org/details/{server}/{doi}` for the latest version PDF URL
6. **Publisher direct** *(institutional mode only — `PAPER_FETCH_INSTITUTIONAL=1`)* — DOI-prefix → publisher PDF template (Nature / Science / Wiley / Springer / ACS / PNAS / NEJM / Sage / T&F / Elsevier). The caller's own subscription IP / cookies / EZproxy are what authorize the fetch; unauthorized responses fail the `%PDF` check and fall through to step 7.
7. **Sci-Hub mirrors** *(on by default; disable with `PAPER_FETCH_NO_SCIHUB=1`)* — last-resort fallback. Tries the mirror list in `PAPER_FETCH_SCIHUB_MIRRORS` (or built-in defaults `sci-hub.ru`, `sci-hub.st`, `sci-hub.su`, `sci-hub.box`, `sci-hub.red`, `sci-hub.al`, `sci-hub.mk`, `sci-hub.ee`) in order; on full miss, scrapes `https://www.sci-hub.pub/` once per process for fresh mirrors. CAPTCHA / missing-paper pages have no PDF iframe and fall through silently.
8. Otherwise → report failure with title/authors so the user can request the paper via interlibrary loan (ILL)
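
The "first hit wins" cascade above can be sketched as a loop over resolvers. This is a minimal illustration, not the code in `scripts/fetch.py`: the names `first_hit`, `Resolver`, and `Downloader` are hypothetical, and each resolver is assumed to be a plain callable that maps a DOI to a candidate PDF URL or `None`. The only behavior taken from the list is the ordering and the `%PDF` check that lets unauthorized or CAPTCHA responses fall through.

```python
from typing import Callable, Iterable, Optional, Tuple

Resolver = Callable[[str], Optional[str]]  # DOI -> candidate PDF URL, or None
Downloader = Callable[[str], bytes]        # URL -> raw response bytes


def first_hit(
    doi: str,
    resolvers: Iterable[Tuple[str, Resolver]],
    download: Downloader,
) -> Optional[Tuple[str, str, bytes]]:
    """Return (source_name, url, pdf_bytes) for the first source yielding a real PDF."""
    for name, resolve in resolvers:
        url = resolve(doi)
        if url is None:
            continue  # this source has nothing for the DOI
        data = download(url)
        if data.startswith(b"%PDF"):
            return name, url, data  # stop at the first hit
        # Login walls, CAPTCHA pages, and similar HTML fail the check and fall through.
    return None  # the caller reports failure with title/authors
```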

**CloakBrowser fallback (download layer, opt-in — `PAPER_FETCH_CLOAK=1`).** This is not a separate source: it sits at the download chokepoint, so it applies to *any* of the sources above. When a resolved PDF URL is blocked by Cloudflare — HTTP 403/429, or a "Just a moment…" HTML interstitial served in place of the file — and the operator opted in, the URL is retried through [CloakBrowser](https://github.com/CloakHQ/CloakBrowser) (a stealth Chromium that passes the JS challenge) via the `cloak_pdf.py` companion. Bytes it returns are re-validated through the same `%PDF` magic-byte + 50 MB checks; on success the result carries `via: "cloak"`. Off by default, fails closed (missing CloakBrowser → silent fall-through), and the agent cannot opt in — see *CloakBrowser access* below.
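
The validation and block-detection rules described above are simple enough to sketch. The function names below are hypothetical, "50 MB" is read as 50 MiB (an assumption), and the interstitial is detected by a substring match on "Just a moment", which is only one plausible heuristic. The opt-in flag is passed in as a parameter rather than read from `PAPER_FETCH_CLOAK`, to keep the sketch free of side effects.

```python
MAX_PDF_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB


def is_valid_pdf(data: bytes, max_bytes: int = MAX_PDF_BYTES) -> bool:
    """Magic-byte check plus size cap, applied to every download."""
    return data.startswith(b"%PDF") and len(data) <= max_bytes


def looks_cloudflare_blocked(status: int, body: bytes) -> bool:
    """HTTP 403/429, or an HTML interstitial served in place of the file."""
    if status in (403, 429):
        return True
    if body.startswith(b"%PDF"):
        return False
    head = body[:4096].decode("utf-8", errors="ignore")
    return "Just a moment" in head


def should_retry_via_cloak(status: int, body: bytes, opted_in: bool) -> bool:
    """Retry through the stealth browser only when the operator opted in."""
    return opted_in and looks_cloudflare_blocked(status, body)
```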

If only a title is given, pass it directly via `--title "<title>"`. Resolution chain:

1. **Crossref** `query.title` — primary; covers all major journal/conference DOIs
2. **Semantic Scholar `/paper/search/match`** — fallback when Crossref's top match is low-confidence (`match_score < 40`) or the gap to the runner-up is `< 3`. Critically, S2 covers arXiv-only preprints (no Crossref DOI). When S2 surfaces a paper that has only an arXiv id, the canonical `10.48550/arXiv.<id>` is synthesized so the download chain stays uniform.
3. **Crossref's best guess (low-confidence)** — used only when both resolvers struggled. The result envelope sets `meta.title_resolution.low_confidence: true` plus a `low_confidence_reason` (`score_below_threshold` / `ambiguous_runner_up`) so an agent can either bail or confirm via `--dry-run`.

Either way the resolved DOI, the winning resolver, the full `resolvers_tried` list, and the top candidate matches are all surfaced under `meta.title_resolution`.
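
As a rough sketch of the decision logic in this chain: the thresholds (`40` and `3`) and the reason strings come from the description above, while the function names, the `resolver` key, and the resolver labels `crossref` / `semantic_scholar` are assumptions made for illustration. For simplicity the Semantic Scholar result is passed in as a value; a real implementation would presumably query it only when Crossref is low-confidence.

```python
from typing import Dict, List, Optional, Tuple

MIN_SCORE = 40.0  # a Crossref top match below this is low-confidence
MIN_GAP = 3.0     # the top match must lead the runner-up by at least this


def crossref_low_confidence_reason(scores: List[float]) -> Optional[str]:
    """Return None when Crossref is confident, else the reason string."""
    if not scores or scores[0] < MIN_SCORE:
        return "score_below_threshold"
    if len(scores) > 1 and scores[0] - scores[1] < MIN_GAP:
        return "ambiguous_runner_up"
    return None


def resolve_title(
    crossref: List[Tuple[str, float]],  # (doi, match_score), best first
    s2_doi: Optional[str],              # Semantic Scholar match, or None
) -> Optional[Dict]:
    reason = crossref_low_confidence_reason([score for _, score in crossref])
    if reason is None:
        return {"doi": crossref[0][0], "resolver": "crossref",
                "resolvers_tried": ["crossref"], "low_confidence": False}
    tried = ["crossref", "semantic_scholar"]
    if s2_doi is not None:
        return {"doi": s2_doi, "resolver": "semantic_scholar",
                "resolvers_tried": tried, "low_confidence": False}
    if not crossref:
        return None  # neither resolver produced a candidate
    # Both resolvers struggled: fall back to Crossref's best guess, flagged.
    return {"doi": crossref[0][0], "resolver": "crossref",
            "resolvers_tried": tried, "low_confidence": True,
            "low_confidence_reason": reason}
```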

**If `semanticscholar-skill` is registered**, it can serve as a richer pre-step for title → DOI resolution — useful when you also need relevance ranking, snippet search, or citation context, not just a DOI. The agent writes a Python script using the skill's `match_title()` to read `externalIds.DOI`, then runs `paper-fetch <doi>`. When the result has only an `ArXiv` id (no DOI), synthesize `10.48550/arXiv.<ArXiv>` and pass that to paper-fetch.
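
The DOI-or-arXiv rule can be captured in a small helper. The helper name `doi_from_external_ids` is hypothetical; the `DOI` and `ArXiv` keys are those of the Semantic Scholar `externalIds` object mentioned above, and the synthesized form follows the `10.48550/arXiv.<ArXiv>` pattern stated in the text.

```python
from typing import Mapping, Optional


def doi_from_external_ids(external_ids: Mapping[str, str]) -> Optional[str]:
    """Pick a DOI to hand to paper-fetch from an S2 externalIds mapping."""
    doi = external_ids.get("DOI")
    if doi:
        return doi
    arxiv_id = external_ids.get("ArXiv")
    if arxiv_id:
        # arXiv-only preprint: synthesize the canonical arXiv DOI.
        return f"10.48550/arXiv.{arxiv_id}"
    return None  # nothing usable; the caller has to report a miss
```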

When only the DOI is needed, `--title` is the single-command path — paper-fetch's built-in Crossref → S2 chain handles most cases.

## Usage

```bash
python scripts/fetch.py <DOI> [options]
python scripts/fetch.py --title "<paper title>" [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema           # machine-readable self-description
```

### Flags

The flags below are the ones an agent composes in normal use. For the complete contract — including `--dry-run`, `--pretty`, `--stream`, `--overwrite`, `--timeout`, `--version`, plus parameter types and exit-code mappings — run `python scripts/fetch.py schema` (machine-readable, drift-checked via `schema_version`).

| Flag | Default | Description |
|------|---------|-------------|
| `doi` | — | DOI to fetch (positional). Use `-` to read a single DOI from stdin |
| `--title TITLE` | — | Paper title; resolved to a DOI via Crossref, with Semantic Scholar as fallback, before download. Mutually exclusive with positional DOI / `--batch` |
| `--batch FILE` | — | File with one DOI per line for bulk download. Use `-` to read from stdin |
| `--out DIR` | `pdfs` | Output directory |
| `--format` | auto | `json` for agents, `text` for humans. Auto-detects: `json` when stdout is not a TTY, `text` when it is |
| `--idempotency-key KEY` | — | Safe-retry key. Re-running with the same key replays the original envelope from `<out>/.paper-fetch-idem/` without network I/O |
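
To illustrate what replaying an envelope by idempotency key could look like, here is a minimal sketch. Only the `<out>/.paper-fetch-idem/` directory comes from the table above; the function name `run_idempotent`, the hashed file name, and the JSON-on-disk layout are assumptions, and the actual storage format may differ.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict


def run_idempotent(out_dir: Path, key: str, do_fetch: Callable[[], Dict]) -> Dict:
    """Run do_fetch once per key; later calls replay the stored envelope."""
    store = out_dir / ".paper-fetch-idem"
    # Hash the key so arbitrary strings map to safe file names (assumed scheme).
    record = store / (hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json")
    if record.exists():
        # Replay the original envelope without touching the network.
        return json.loads(record.read_text(encoding="utf-8"))
    envelope = do_fetch()
    store.mkdir(parents=True, exist_ok=True)
    record.write_text(json.dumps(envelope), encoding="utf-8")
    return envelope
```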

### Agent discovery: `schema` subcommand

```bash
python scripts/fetch.py schema
```

Emits a complete machine-readable description of the CLI on stdout (no network). Includes `cli_version`, `schema_version`, parameter types, exit codes, error codes, envelope shapes, and environment variables. Agents should read this once, cache it against `schema_version`, and re-read when the cached version drifts.
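
One way an agent might cache the schema is sketched below. The command line is the one shown above; everything else is an assumption: the names `load_schema` and `read_schema_from_cli`, the cache file location, and the idea that the agent learns the current version from somewhere at runtime and passes it in as `observed_version`.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Callable, Dict, Optional


def read_schema_from_cli() -> Dict:
    """Run the schema subcommand and parse its stdout (no network needed)."""
    proc = subprocess.run(
        [sys.executable, "scripts/fetch.py", "schema"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


def load_schema(
    cache_file: Path,
    read_schema: Callable[[], Dict],
    observed_version: Optional[str] = None,
) -> Dict:
    """Reuse the cached schema unless its schema_version has drifted."""
    if cache_file.exists():
        cached = json.loads(cache_file.read_text(encoding="utf-8"))
        if observed_version is None or cached.get("schema_version") == observed_version:
            return cached
    schema = read_schema()  # cache miss or drift: re-read and overwrite
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(schema), encoding="utf-8")
    return schema
```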

### Output contract

**stdout** emits a single JSON envelope. Every envelope carries a `meta` slot.

**Success** (all DOIs resolved):

```json
{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-021-03819-2",
        "success": true,