paper-writing-bench
This Claude Code skill reverse-engineers the three core components of an AI research paper (sparse idea, dense idea, experimental log) to construct benchmark cases for evaluating paper-writing pipelines. Use it when you need to extract raw materials from an existing paper PDF or markdown text to replicate the PaperWritingBench dataset construction procedure, enabling systematic comparison of generated papers against originals through automated evaluation.
git clone --depth 1 https://github.com/Ar9av/PaperOrchestra /tmp/paper-writing-bench && cp -r /tmp/paper-writing-bench/skills/paper-writing-bench ~/.claude/skills/paper-writing-bench
# PaperWritingBench (§3)
Faithful implementation of the PaperWritingBench dataset construction
procedure from PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §3 and
App. C, F.2).
The original benchmark contains 200 papers (100 CVPR 2025 + 100 ICLR 2025).
For each paper, the authors reverse-engineer the (I, E) tuple by stripping
narrative flow from the original PDF using the three prompts in App. F.2.
You can use this skill to reverse-engineer your own benchmark cases from
any paper PDF.
## What this skill does
Given an existing AI research paper (PDF or markdown extract), produce:
- `idea_sparse.md` (Sparse variant): high-level concept note, no math, no
experimental results
- `idea_dense.md` (Dense variant): detailed technical proposal with LaTeX
equations and variable definitions, but still no experimental results
- `experimental_log.md`: exhaustive raw experimental setup, numeric data,
and qualitative observations, with all narrative references stripped
Either idea variant, paired with the experimental log, forms a complete
(I, E) input pair for the paper-orchestra pipeline. You can then run the
pipeline and compare its output to the original paper using `paper-autoraters`.
## Inputs
- A paper PDF or extracted markdown text. The PaperOrchestra authors use
MinerU (Wang et al., 2024) for PDF→markdown extraction; you (the host agent)
should use whatever PDF extractor your environment provides.
- For controlled experiments, you may also extract figures separately
(the PaperOrchestra authors use PDFFigures 2.0).
## Outputs
- `bench/<paper_id>/idea_sparse.md` — Sparse variant
- `bench/<paper_id>/idea_dense.md` — Dense variant
- `bench/<paper_id>/experimental_log.md` — Experimental log
## Workflow
For each paper, run three independent LLM calls, one per verbatim prompt
file in `references/`:
### 1. Sparse idea generation
Load `references/sparse-idea-prompt.md`. Pass the paper text (or
markdown extract) as `{paper_content}`. The prompt instructs the model to:
- Stop extracting at empirical verification (no Experiments / Results / Comparisons)
- Use first-person future tense ("We propose to explore...")
- Avoid LaTeX math; describe components by function
- Anonymize authors and titles
Output: `idea_sparse.md` with four sections: Problem Statement, Core
Hypothesis, Proposed Methodology (high-level), and Expected Contribution.
### 2. Dense idea generation
Load `references/dense-idea-prompt.md`. Same input. The prompt instructs
the model to:
- Preserve mathematical formulations using LaTeX
- Define every variable used in equations
- Include specific architectural choices and dimensions
- Observe the same exclusion zone as the Sparse variant (no Experiments /
Results / Comparisons)
Output: `idea_dense.md` with four sections: Problem Statement, Core
Hypothesis, Proposed Methodology (detailed), and Expected Contribution.
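Both idea variants share the same four section names, and the Sparse variant must avoid LaTeX math. A minimal sketch of a post-generation check follows. The helper names `missing_sections` and `sparse_has_math` are hypothetical, and the checks are heuristics: the exact heading format is set by the prompts in `references/`, and a literal `$` used as a currency sign would be flagged as math.

```python
import re

# Section names as listed in the Sparse and Dense output descriptions.
IDEA_SECTIONS = (
    "Problem Statement",
    "Core Hypothesis",
    "Proposed Methodology",
    "Expected Contribution",
)

# Heuristic markers of LaTeX math: $, \(, \[, \begin{equation or \begin{align.
MATH_MARKERS = re.compile(r"\$|\\\(|\\\[|\\begin\{(?:equation|align)")


def missing_sections(idea_text: str) -> list[str]:
    """Return the expected section names that do not appear in the text."""
    lowered = idea_text.lower()
    return [name for name in IDEA_SECTIONS if name.lower() not in lowered]


def sparse_has_math(idea_text: str) -> bool:
    """Return True if the text appears to contain LaTeX math."""
    return MATH_MARKERS.search(idea_text) is not None
```

An empty list from `missing_sections` and `False` from `sparse_has_math` are necessary but not sufficient: they do not detect experimental leakage, which still needs a read-through.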
### 3. Experimental log generation
Load `references/experimental-log-prompt.md`. Same input. The prompt
instructs the model to:
- Use a past-tense persona ("We ran...", "The results were...")
- Strip all references to figure/table numbers
- Deconstruct tables into raw numeric data
- Log figure findings as factual observations
- Anonymize authors
Output: `experimental_log.md` with sections for Setup, Raw Numeric Data,
and Qualitative Observations.
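The three steps above can be sketched as one small driver. This is an illustration under stated assumptions, not the skill's actual implementation: `build_case` and `call_llm` are hypothetical names, `call_llm` stands in for whatever model client the host agent provides, and `skill_dir` is assumed to be the directory that contains `references/`.

```python
from pathlib import Path
from typing import Callable

# Prompt file -> output file, as listed in the three workflow steps.
STEPS = {
    "references/sparse-idea-prompt.md": "idea_sparse.md",
    "references/dense-idea-prompt.md": "idea_dense.md",
    "references/experimental-log-prompt.md": "experimental_log.md",
}


def build_case(
    paper_text: str,
    paper_id: str,
    call_llm: Callable[[str], str],  # hypothetical: prompt in, completion out
    skill_dir: Path = Path("."),
    bench_root: Path = Path("bench"),
) -> list[Path]:
    """Run the three independent calls and write bench/<paper_id>/*.md."""
    out_dir = bench_root / paper_id
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for prompt_rel, out_name in STEPS.items():
        template = (skill_dir / prompt_rel).read_text(encoding="utf-8")
        # str.replace instead of str.format: prompts and papers may contain
        # literal braces (LaTeX), which str.format would misinterpret.
        prompt = template.replace("{paper_content}", paper_text)
        out_path = out_dir / out_name
        # Each call sees only its own prompt, so the calls stay independent.
        out_path.write_text(call_llm(prompt), encoding="utf-8")
        written.append(out_path)
    return written
```

Passing the model call in as a parameter keeps the sketch independent of any particular provider and makes it easy to exercise with a stub.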
## Critical rules from the prompts
These are excerpted from App. F.2. The host agent MUST honor them:
- **No citations.** None of the three outputs may contain `\cite`,
reference numbers, or author names from the source paper.
- **No URLs.** Strip all hyperlinks.
- **Anonymize.** Author identities, affiliations, acknowledgements all
removed.
- **Self-contained.** Each file must make sense without the original paper.
- **No experimental leakage in idea files.** The Sparse and Dense ideas
must stop where empirical verification begins. They describe what will
be done, not what was done.
- **No table/figure references in experimental log.** No "as shown in
Table 1", "see Fig. 5". The downstream paper-orchestra pipeline will
generate its own figures and tables — the log must not assume any
particular ones exist.
- **100% numeric accuracy in experimental log.** This becomes the ground
truth for the section-writing-agent and content-refinement-agent's
hallucination check.
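Some of these rules lend themselves to a mechanical pre-check before a human or model review. The following is a rough sketch, not part of App. F.2: the function names and regular expressions are assumptions, and they are heuristics. For example, the bracketed-number pattern would also flag an interval such as `[0, 1]` in a Dense idea, and the numeric check only compares number strings, so `1,000` and `1000` would not match.

```python
import re

# Rules that apply to all three outputs (heuristic patterns).
COMMON_PATTERNS = {
    "citation": re.compile(r"\\cite[a-zA-Z]*\s*\{|\[\d+(?:\s*[,\-]\s*\d+)*\]"),
    "url": re.compile(r"https?://\S+|www\.\S+"),
}
# Rule that applies only to experimental_log.md.
LOG_ONLY_PATTERNS = {
    "table_or_figure_ref": re.compile(
        r"\b(?:Table|Tab\.|Figure|Fig\.)\s*\d+", re.IGNORECASE
    ),
}
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def lint_output(text: str, is_experimental_log: bool) -> list[tuple[str, str]]:
    """Return (rule, matched_text) pairs for every suspected violation."""
    patterns = dict(COMMON_PATTERNS)
    if is_experimental_log:
        patterns.update(LOG_ONLY_PATTERNS)
    hits = []
    for rule, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((rule, match.group(0)))
    return hits


def unsupported_numbers(log_text: str, source_text: str) -> list[str]:
    """Return numbers in the log that never appear in the source paper text."""
    source_numbers = set(NUMBER.findall(source_text))
    return sorted({n for n in NUMBER.findall(log_text) if n not in source_numbers})
```

A clean result does not prove compliance. Anonymization, self-containment, and experimental leakage in the idea files are not covered by these patterns.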
## How the bench is used
After producing `(idea_sparse.md, idea_dense.md, experimental_log.md)` for
a paper:
1. Pick a variant (Sparse or Dense). The PaperOrchestra paper ablates both:
Dense produces more rigorous methodology, while Sparse exercises the system's
robustness on under-specified inputs.
2. Drop the chosen idea file (saved as `idea.md`), `experimental_log.md`, a
`template.tex` for the target conference, and a
`conference_guidelines.md` into a paper-orchestra workspace.
3. Run the pipeline.
4. Compare the generated paper against the original using
`paper-autoraters` (citation F1, lit review quality, SxS paper quality).
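Steps 1 and 2 can be sketched as a small staging helper. Treat the details as assumptions: `stage_workspace` is a hypothetical name, and placing all four files under `<workspace>/inputs/` is an assumed layout that should be checked against the paper-orchestra skill before relying on it.

```python
import shutil
from pathlib import Path


def stage_workspace(
    bench_case: Path,
    variant: str,
    template_tex: Path,
    guidelines_md: Path,
    workspace: Path,
) -> Path:
    """Copy one (I, E) pair plus template and guidelines into a workspace."""
    if variant not in ("sparse", "dense"):
        raise ValueError(f"variant must be 'sparse' or 'dense', got {variant!r}")
    inputs = workspace / "inputs"  # assumed layout, not confirmed by this skill
    inputs.mkdir(parents=True, exist_ok=True)
    # The chosen variant is stored under the generic name idea.md.
    shutil.copyfile(bench_case / f"idea_{variant}.md", inputs / "idea.md")
    shutil.copyfile(bench_case / "experimental_log.md", inputs / "experimental_log.md")
    shutil.copyfile(template_tex, inputs / "template.tex")
    shutil.copyfile(guidelines_md, inputs / "conference_guidelines.md")
    return inputs
```

Staging the Sparse and Dense variants into separate workspaces keeps the two runs from overwriting each other's `idea.md`.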
## Resources
- `references/bench-overview.md` — the 200-paper bench, venue cutoffs, sizes
- `references/sparse-idea-prompt.md` — verbatim from App. F.2
- `references/dense-idea-prompt.md` — verbatim from App. F.2
- `references/experimental-log-prompt.md` — verbatim from App. F.2