results-analysis
The results-analysis skill performs rigorous statistical validation, descriptive and inferential analysis, and scientific figure generation for experimental data in ML/AI research. Use this skill when asked to analyze experimental results, run statistical tests, compare model performance, generate figures, check significance, or perform ablation analysis; it produces analysis bundles including reports, statistics appendices, and figure catalogs rather than manuscript prose.
git clone --depth 1 https://github.com/Galaxy-Dawn/claude-scholar /tmp/results-analysis && cp -r /tmp/results-analysis/skills/results-analysis ~/.claude/skills/results-analysis
SKILL.md
# Results Analysis

Run **strict, evidence-first experimental analysis** for ML/AI research.

Use this skill to produce a **strict analysis bundle**:

- `analysis-report.md`
- `stats-appendix.md`
- `figure-catalog.md`
- `figures/`

When the user asks for review, audit, no-write, dry-run, or when inputs are incomplete, use **read-only audit mode** instead of producing files or figures. In that mode, output only valid/invalid statistics, blockers, claim candidates, and what evidence is missing. If invoked by `/analyze-results`, the command layer may write a blocker summary, but this skill should not create figures, reports, or polished conclusions from incomplete evidence.

Do **not** use this skill to draft a paper `Results` section or a full experiment wrap-up report. Those belong to `ml-paper-writing` or `results-report`.

## Core contract

### This skill is responsible for

- validating experiment artifacts and comparison units,
- running rigorous descriptive and inferential statistics,
- generating **real scientific figures** when data/logs are available,
- writing figure purposes, caption requirements, and interpretation checklists,
- surfacing limits, blockers, and missing evidence explicitly.

### This skill is not responsible for

- paper-ready `Results` prose,
- manuscript narrative polishing,
- paper-ready figure/table packaging with `pubfig` / `pubtab`,
- project-level experiment retrospectives.

If the user wants the complete post-experiment summary report, hand off to `results-report` after this bundle is ready.

If the user wants publication-grade figures/tables, export parameters, publication QA, or figure/table redesign, hand off to `publication-chart-skill`.

## Non-negotiable quality bar

1. **Prefer real figures over figure specs.** If the data can be read, generate real figures. Do not stop at “recommended visualization”. Exception: in read-only audit mode, do not generate figures; describe what figure would be valid after evidence is complete.
2. **Never fabricate statistics.** If sample size, seeds, or raw metrics are missing, state the blocker clearly.
3. **Report complete statistics.** Do not report only best scores or only p-values.
4. **Interpret every main figure.** Every major figure must have purpose, caption requirements, and post-figure interpretation notes.
5. **Separate evidence from prose.** This skill produces analysis artifacts; it does not write manuscript sections.

## Standard workflow

### 1. Inventory and validate artifacts

Start by identifying:

- metric tables (`csv`, `json`, `tsv`, logs),
- training curves and checkpoints,
- seeds / repeated runs,
- baselines, ablations, and comparison families,
- evaluation protocol metadata.

Validate:

- metric direction (higher/lower is better),
- unit of analysis (run, subject, fold, dataset, seed),
- number of runs / seeds,
- missing values or silent failures,
- comparability across methods.

If the comparison is not statistically valid, say so before continuing.

Do not treat repeated `subject × task` rows, folds, windows, trials, or seeds as independent units unless the design justifies it.

Common blocker: a `subject × task` summary table is usually a repeated-measure summary, not an independent subject-level sample. If subjects have multiple task rows or missing task cells, state that before any significance or winner claim.

### 2. Lock the comparison questions

Before running statistics, define the exact comparison questions:

- Which method is compared to which baseline?
- What is the primary metric?
- What is the repeated-measure unit?
- Which ablation or robustness questions matter?
- Which findings are decision-changing?

Do not mix unrelated comparisons into one undifferentiated table.

### 3. Run strict statistics

Always produce:

- descriptive statistics: `mean ± std` when appropriate,
- `95% CI` or another clearly justified interval,
- run/seed counts,
- significance tests with assumptions stated,
- effect sizes,
- multiple-comparison handling when several contrasts are reported.

Default expectation:

- check parametric assumptions first,
- use non-parametric fallback when assumptions fail,
- state exactly what was tested and on what samples.

See:

- `references/statistical-methods.md`
- `references/statistical-reporting.md`

### 4. Generate real scientific figures

Produce actual figures whenever artifacts are available.

Minimum expectation for a non-trivial analysis bundle:

- **one main comparison figure**,
- **one supporting figure** (training dynamics / ablation / breakdown / error analysis),
- **one exact numeric summary table** in markdown.

Every main figure must define:

- figure purpose,
- plotted variables,
- error bar meaning,
- caption requirements,
- interpretation checklist.

See:

- `references/visualization-best-practices.md`
- `references/figure-interpretation.md`

### 5. Write analysis artifacts

#### `analysis-report.md`

Summarize:

- the analysis question,
- key findings,
- strongest supported comparisons,
- main caveats,
- what changed in the experimental understanding,
- claim candidates that may later be used in reports or manuscript writing.

Each claim candidate should use this shape:

```md
## Claim Candidates

- Claim:
- Source evidence:
- Allowed wording:
- Forbidden stronger wording:
- Uncertainty:
- Next check:
- Decision: keep | weaken | revise | discard
```

#### `stats-appendix.md`

Record:

- descriptive statistics,
- test choices,
- assumptions checked,
- effect sizes,
- confidence intervals,
- multiple comparison corrections,
- explicit blockers and limitations.
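As a rough illustration of the quantities that the strict-statistics step asks for and that `stats-appendix.md` records, the following is a minimal sketch using only the Python standard library. It is not the skill's own implementation, and the repository's `references/statistical-methods.md` may prescribe different tests. The function names `paired_summary`, `bootstrap_ci`, and `holm_adjust` are hypothetical, the demo scores are made up, and the sketch assumes that per-seed scores for a method and its baseline are paired by seed. It uses an exact sign-flip permutation test as the non-parametric option, a paired Cohen's d as the effect size, a percentile bootstrap interval (which tends to be too narrow when there are only a few seeds), and Holm's step-down correction for multiple contrasts.

```python
import itertools
import math
import random
import statistics


def bootstrap_ci(diffs, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed so the reported interval is reproducible
    n = len(diffs)
    means = sorted(
        statistics.mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_summary(method, baseline):
    """Descriptive and inferential summary for scores paired by seed."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need at least two paired runs of equal length")
    if len(method) > 20:
        raise ValueError("exact enumeration is only practical for small n")
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)

    # Exact two-sided sign-flip permutation test: under the null hypothesis,
    # each paired difference is equally likely to be positive or negative.
    observed = abs(sum(diffs))
    hits = sum(
        1
        for signs in itertools.product((1, -1), repeat=n)
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12
    )

    return {
        "n": n,
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "cohens_dz": mean_diff / sd_diff if sd_diff > 0 else math.nan,
        "ci95": bootstrap_ci(diffs),
        "p_value": hits / 2 ** n,
    }


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Made-up scores for five seeds, for illustration only.
    method = [0.81, 0.83, 0.80, 0.84, 0.82]
    baseline = [0.78, 0.80, 0.79, 0.80, 0.79]
    print(paired_summary(method, baseline))
    print(holm_adjust([0.01, 0.04, 0.03]))
```

With only five paired seeds, the smallest two-sided p-value this exact test can return is 2/32, so a result above 0.05 in that setting reflects a limit of the design and is worth recording as a limitation rather than as evidence of no difference.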
#### `figure-catalog.md`

For each figure, record:

- filename,
- purpose,
- data source,
- caption draft requirements,
- key observation,
- interpretation checklist,
- known caveats.

### 6. Final QA gate

Do not finish until all are true:

- [ ] the primary comparison question is explicit,
- [ ] sample size / seed count is stated,
- [ ] inferential tests are justified,
- [ ] effec
Expert code review specialist. Proactively reviews code for quality, security, and maintainability. Use immediately after writing or modifying code. MUST BE USED for all code changes.
Use this agent when the user provides a Kaggle competition URL or asks to learn from Kaggle winning solutions. Examples:
Use this agent when the user asks to "conduct literature review", "search for papers", "analyze research papers", "identify research gaps", "review related work", or mentions starting a research project. This agent integrates with Zotero for automated paper collection, organization, and full-text analysis. Examples:
Use this agent when the user provides a research paper (PDF/DOCX/arXiv link) or asks to learn writing patterns from papers, extract venue-specific writing signals, study paper structure, or mine rebuttal strategies. The agent writes extracted knowledge into the active installed paper-miner writing memory for ml-paper-writing. It does not maintain project-specific writing memory.
Use this agent when the user asks to "write rebuttal", "respond to reviewers", "analyze review comments", or needs help with academic paper review response. This agent specializes in systematic rebuttal writing with professional tone and structured responses.
Test-driven development guide for writing tests first, implementing the smallest passing change, and keeping verification tight. Use when the user explicitly wants TDD or when a task should be driven by failing tests before code.
Run a blocker-first post-experiment workflow: validate evidence, produce strict statistical analysis when possible, and generate a decision-oriented results report only when the analysis bundle is sufficient. Uses results-analysis + results-report as a gated two-stage workflow.
Commit changes following Conventional Commits format (local only, no push).