Skill · 4.9k repo stars · updated 11d ago

results-analysis

The results-analysis skill performs rigorous statistical validation, descriptive and inferential analysis, and scientific figure generation for experimental data in ML/AI research. Use this skill when asked to analyze experimental results, run statistical tests, compare model performance, generate figures, check significance, or perform ablation analysis. It produces analysis bundles (reports, statistics appendices, and figure catalogs) rather than manuscript prose.

Repository: claude-scholar

Install in Claude Code

git clone --depth 1 https://github.com/Galaxy-Dawn/claude-scholar /tmp/results-analysis && cp -r /tmp/results-analysis/skills/results-analysis ~/.claude/skills/results-analysis

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Results Analysis

Run **strict, evidence-first experimental analysis** for ML/AI research.

Use this skill to produce a **strict analysis bundle**:
- `analysis-report.md`
- `stats-appendix.md`
- `figure-catalog.md`
- `figures/`

When the user asks for review, audit, no-write, dry-run, or when inputs are incomplete, use **read-only audit mode** instead of producing files or figures. In that mode, output only valid/invalid statistics, blockers, claim candidates, and what evidence is missing. If invoked by `/analyze-results`, the command layer may write a blocker summary, but this skill should not create figures, reports, or polished conclusions from incomplete evidence.

Do **not** use this skill to draft a paper `Results` section or a full experiment wrap-up report. Those belong to `ml-paper-writing` and `results-report`, respectively.

## Core contract

### This skill is responsible for
- validating experiment artifacts and comparison units,
- running rigorous descriptive and inferential statistics,
- generating **real scientific figures** when data/logs are available,
- writing figure purposes, caption requirements, and interpretation checklists,
- surfacing limits, blockers, and missing evidence explicitly.

### This skill is not responsible for
- paper-ready `Results` prose,
- manuscript narrative polishing,
- paper-ready figure/table packaging with `pubfig` / `pubtab`,
- project-level experiment retrospectives.

If the user wants the complete post-experiment summary report, hand off to `results-report` after this bundle is ready. If the user wants publication-grade figures/tables, export parameters, publication QA, or figure/table redesign, hand off to `publication-chart-skill`.

## Non-negotiable quality bar

1. **Prefer real figures over figure specs.**
   If the data can be read, generate real figures. Do not stop at “recommended visualization”.
   Exception: in read-only audit mode, do not generate figures; describe what figure would be valid after evidence is complete.
2. **Never fabricate statistics.**
   If sample size, seeds, or raw metrics are missing, state the blocker clearly.
3. **Report complete statistics.**
   Do not report only best scores or only p-values.
4. **Interpret every main figure.**
   Every major figure must have purpose, caption requirements, and post-figure interpretation notes.
5. **Separate evidence from prose.**
   This skill produces analysis artifacts; it does not write manuscript sections.

## Standard workflow

### 1. Inventory and validate artifacts

Start by identifying:
- metric tables (`csv`, `json`, `tsv`, logs),
- training curves and checkpoints,
- seeds / repeated runs,
- baselines, ablations, and comparison families,
- evaluation protocol metadata.

Validate:
- metric direction (higher/lower is better),
- unit of analysis (run, subject, fold, dataset, seed),
- number of runs / seeds,
- missing values or silent failures,
- comparability across methods.

If the comparison is not statistically valid, say so before continuing. Do not treat repeated `subject × task` rows, folds, windows, trials, or seeds as independent units unless the design justifies it.
Common blocker: a `subject × task` summary table is usually a repeated-measure summary, not an independent subject-level sample. If subjects have multiple task rows or missing task cells, state that before any significance or winner claim.
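
The unit-of-analysis check above can be sketched as a small audit over a long-format results table. This is a minimal illustration, not part of the skill itself: the function name `audit_units`, the column names `subject` and `task`, and the example rows are all hypothetical. Passing the check is necessary but not sufficient for independence, since the experimental design still has to justify it.

```python
from collections import defaultdict


def audit_units(rows, unit_key="subject", cell_key="task"):
    """Report repeated units, duplicate rows, and missing cells.

    `rows` is a list of dicts in long format (one row per unit x cell).
    The key names are illustrative assumptions, not a fixed schema.
    """
    cells = defaultdict(set)
    duplicates = []
    for row in rows:
        unit, cell = row[unit_key], row[cell_key]
        if cell in cells[unit]:
            duplicates.append((unit, cell))  # same unit x cell seen twice
        cells[unit].add(cell)

    all_cells = set().union(*cells.values()) if cells else set()
    repeated = sorted(u for u, seen in cells.items() if len(seen) > 1)
    missing = {
        u: sorted(all_cells - seen)
        for u, seen in cells.items()
        if all_cells - seen
    }
    return {
        "n_rows": len(rows),
        "n_units": len(cells),
        "repeated_units": repeated,   # units contributing several rows
        "missing_cells": missing,     # unbalanced design
        "duplicate_rows": duplicates,
        "one_row_per_unit": not repeated and not duplicates,
    }


# Hypothetical example: s1 has two task rows, s2 is missing task B.
rows = [
    {"subject": "s1", "task": "A", "acc": 0.81},
    {"subject": "s1", "task": "B", "acc": 0.77},
    {"subject": "s2", "task": "A", "acc": 0.79},
]
report = audit_units(rows)
```

Here `report["n_rows"]` is 3 but `report["n_units"]` is 2, so treating the three rows as three independent samples would overstate the sample size. The repeated unit and the missing cell are both blockers to state before any significance claim.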

### 2. Lock the comparison questions

Before running statistics, define the exact comparison questions:
- Which method is compared to which baseline?
- What is the primary metric?
- What is the repeated-measure unit?
- Which ablation or robustness questions matter?
- Which findings are decision-changing?

Do not mix unrelated comparisons into one undifferentiated table.

### 3. Run strict statistics

Always produce:
- descriptive statistics: `mean ± std` when appropriate,
- `95% CI` or another clearly justified interval,
- run/seed counts,
- significance tests with assumptions stated,
- effect sizes,
- multiple-comparison handling when several contrasts are reported.
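
As a rough sketch of the descriptive part of this list, the snippet below summarizes a paired, per-seed comparison using only the Python standard library. It assumes both methods were run on the same seeds. The percentile bootstrap stands in for "another clearly justified interval" and is only one possible choice; with very few seeds any interval is fragile and should be reported as such. The function names and all scores are hypothetical.

```python
import math
import random
import statistics


def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (an assumption, not the only valid interval)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def paired_summary(method, baseline):
    """Descriptive statistics for scores paired by seed."""
    if len(method) != len(baseline):
        raise ValueError("paired comparison needs the same seeds in both arms")
    diffs = [m - b for m, b in zip(method, baseline)]
    sd_diff = statistics.stdev(diffs)
    return {
        "n": len(diffs),
        "method_mean": statistics.fmean(method),
        "method_std": statistics.stdev(method),
        "baseline_mean": statistics.fmean(baseline),
        "baseline_std": statistics.stdev(baseline),
        "mean_diff": statistics.fmean(diffs),
        "ci95_diff": bootstrap_ci(diffs),
        # Paired effect size (Cohen's d_z): mean difference / std of differences.
        "cohens_dz": statistics.fmean(diffs) / sd_diff if sd_diff > 0 else math.nan,
    }


# Hypothetical accuracies over five shared seeds.
method = [0.84, 0.86, 0.83, 0.87, 0.85]
baseline = [0.80, 0.83, 0.81, 0.82, 0.82]
summary = paired_summary(method, baseline)
```

The summary keeps the seed count next to every estimate, so a reader can see the mean difference, its interval, and the effect size together rather than a best score alone.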

Default expectation:
- check parametric assumptions first,
- use non-parametric fallback when assumptions fail,
- state exactly what was tested and on what samples.
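
One possible non-parametric fallback for paired runs is an exact sign-flip permutation test on the per-seed differences, followed by a Holm correction when several contrasts are reported. The sketch below is illustrative only. Neither method is stated as the skill's prescribed test (the referenced files may specify others), and the function names and numbers are hypothetical.

```python
from itertools import product


def sign_flip_pvalue(diffs, max_n=20):
    """Exact two-sided sign-flip permutation test on the sum of paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    n = len(diffs)
    if n > max_n:
        raise ValueError("too many pairs for exhaustive enumeration")
    observed = abs(sum(diffs))
    count = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for float round-off
            count += 1
    return count / 2 ** n


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Hypothetical per-seed differences (method minus baseline), five seeds.
diffs = [0.04, 0.03, 0.02, 0.05, 0.03]
p = sign_flip_pvalue(diffs)  # 2 / 32 = 0.0625

# Hypothetical raw p-values for three contrasts.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

With five pairs that all favor the method, the smallest achievable two-sided p-value is 2/32 = 0.0625. The example therefore also shows why seed count has to be stated alongside the test: five seeds cannot reach p < 0.05 under this test, however consistent the improvement.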

See:
- `references/statistical-methods.md`
- `references/statistical-reporting.md`

### 4. Generate real scientific figures

Produce actual figures whenever artifacts are available.

Minimum expectation for a non-trivial analysis bundle:
- **one main comparison figure**,
- **one supporting figure** (training dynamics / ablation / breakdown / error analysis),
- **one exact numeric summary table** in markdown.
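
The exact numeric summary table could be produced along these lines. This is a minimal sketch with a hypothetical column layout, function name, and scores; the skill does not prescribe a table schema in this section.

```python
import statistics


def markdown_summary(results, metric="accuracy", digits=3):
    """Render per-method run counts, means, and sample std as a markdown table.

    `results` maps a method name to its list of per-run scores.
    """
    lines = [
        f"| Method | n | {metric} mean | {metric} std |",
        "|---|---|---|---|",
    ]
    for name, values in results.items():
        mean = statistics.fmean(values)
        # Sample std is undefined for a single run; show nan instead of inventing a value.
        std = statistics.stdev(values) if len(values) > 1 else float("nan")
        lines.append(
            f"| {name} | {len(values)} | {mean:.{digits}f} | {std:.{digits}f} |"
        )
    return "\n".join(lines)


# Hypothetical scores.
results = {
    "baseline": [0.80, 0.82, 0.81],
    "single-run variant": [0.83],
}
table = markdown_summary(results)
print(table)
```

Keeping `n` as its own column makes a single-run row visibly weaker than a multi-seed row, instead of hiding that behind a bare mean.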

Every main figure must define:
- figure purpose,
- plotted variables,
- error bar meaning,
- caption requirements,
- interpretation checklist.

See:
- `references/visualization-best-practices.md`
- `references/figure-interpretation.md`

### 5. Write analysis artifacts

#### `analysis-report.md`
Summarize:
- the analysis question,
- key findings,
- strongest supported comparisons,
- main caveats,
- what changed in the experimental understanding,
- claim candidates that may later be used in reports or manuscript writing.

Each claim candidate should use this shape:

```md
## Claim Candidates

- Claim:
  - Source evidence:
  - Allowed wording:
  - Forbidden stronger wording:
  - Uncertainty:
  - Next check:
  - Decision: keep | weaken | revise | discard
```

#### `stats-appendix.md`
Record:
- descriptive statistics,
- test choices,
- assumptions checked,
- effect sizes,
- confidence intervals,
- multiple comparison corrections,
- explicit blockers and limitations.

#### `figure-catalog.md`
For each figure, record:
- filename,
- purpose,
- data source,
- caption draft requirements,
- key observation,
- interpretation checklist,
- known caveats.
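
One way to keep catalog entries complete is to model the fields above as a small record that refuses to hide empty fields. This is a sketch under stated assumptions: the class name `FigureEntry`, the rendered layout, and the example content are all hypothetical. The skill lists the fields but does not fix a file format here.

```python
from dataclasses import dataclass, fields


@dataclass
class FigureEntry:
    filename: str
    purpose: str
    data_source: str
    caption_requirements: str
    key_observation: str
    interpretation_checklist: list
    known_caveats: list

    def missing_fields(self):
        """Names of fields left empty; these should be filled or reported as blockers."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def to_markdown(self):
        checklist = "\n".join(f"  - [ ] {item}" for item in self.interpretation_checklist)
        caveats = "\n".join(f"  - {item}" for item in self.known_caveats)
        return "\n".join([
            f"### `{self.filename}`",
            f"- Purpose: {self.purpose}",
            f"- Data source: {self.data_source}",
            f"- Caption draft requirements: {self.caption_requirements}",
            f"- Key observation: {self.key_observation}",
            "- Interpretation checklist:",
            checklist,
            "- Known caveats:",
            caveats,
        ])


# Hypothetical entry with the caveats not yet written.
entry = FigureEntry(
    filename="figures/main_comparison.png",
    purpose="Compare the method against the baseline on the primary metric",
    data_source="results/seeds.csv",
    caption_requirements="State n seeds and what the error bars represent",
    key_observation="To be filled after the figure is generated",
    interpretation_checklist=["Error bar meaning stated", "Axis units labeled"],
    known_caveats=[],
)
gaps = entry.missing_fields()
```

Here `gaps` is `["known_caveats"]`, which flags the entry as incomplete before it is written into `figure-catalog.md`.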

### 6. Final QA gate

Do not finish until all are true:
- [ ] the primary comparison question is explicit,
- [ ] sample size / seed count is stated,
- [ ] inferential tests are justified,
- [ ] effec