benchmark-synthesis
Benchmark Synthesis aggregates findings from all prior benchmark archaeology analyses into a structured audit report containing cross-cutting themes, prioritized recommendations, and actionable conclusions. Use this skill after completing comprehensive analysis across audit scores, saturation data, validity assessments, coverage maps, and protocol forensics to produce a final synthesis report suitable for publication or stakeholder decision-making.
git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/benchmark-synthesis && cp -r /tmp/benchmark-synthesis/skills/benchmark-synthesis ~/.claude/skills/benchmark-synthesis
SKILL.md
# Benchmark Synthesis SOP

Synthesize all analysis results from a benchmark archaeology campaign into a final structured report with cross-cutting findings, prioritized recommendations, and actionable conclusions.

## Input

- **all_analysis_results**: Combined outputs from all prior analysis SOPs and tactics (audit scores, saturation data, validity assessments, coverage maps, protocol forensics)

## Procedure

1. Aggregate findings across all analysis dimensions
2. Identify cross-cutting themes and systemic issues
3. Prioritize findings by impact and actionability
4. Produce executive summary and detailed report
5. Generate specific recommendations for benchmark users and creators

## Output

Final synthesis report suitable for publication or decision-making.

<!-- BEGIN available-tables (generated) -->
## Available SOPs

Optional, no fixed order; the final leaf is always a sop.

| SOP | When to use |
| --- | --- |
| spawn-agent | Spawn a customized CC subagent with full MCP tool access. Used by SOPs that declare execution: subagent. |
<!-- END available-tables (generated) -->
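
The first three procedure steps (aggregate, identify cross-cutting themes, prioritize) could be sketched roughly as follows. This is a minimal illustration, not the SOP's actual implementation: the `Finding` record, the 1-5 `impact` and `actionability` scales, the "theme appears in at least two dimensions" rule, and the `impact * actionability` score are all assumptions made for the example, and the sample findings are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical record shape; the SOP does not define a schema.
    dimension: str      # which analysis produced it, e.g. "saturation"
    theme: str          # free-text tag assigned during analysis
    impact: int         # assumed 1-5 scale
    actionability: int  # assumed 1-5 scale
    note: str


def cross_cutting_themes(findings, min_dimensions=2):
    """Return themes that show up in at least `min_dimensions` distinct dimensions."""
    dims_by_theme = defaultdict(set)
    for f in findings:
        dims_by_theme[f.theme].add(f.dimension)
    return {t: sorted(d) for t, d in dims_by_theme.items() if len(d) >= min_dimensions}


def prioritize(findings):
    """Sort by impact * actionability, highest first (an assumed scoring rule)."""
    return sorted(findings, key=lambda f: f.impact * f.actionability, reverse=True)


# Made-up findings, for illustration only.
findings = [
    Finding("saturation", "ceiling effects", 5, 3, "Scores cluster near the maximum."),
    Finding("validity", "ceiling effects", 4, 4, "Score gaps stop tracking capability."),
    Finding("coverage", "missing domains", 3, 5, "Some task categories have no items."),
    Finding("protocol", "prompt sensitivity", 2, 2, "Results shift with the template."),
]

themes = cross_cutting_themes(findings)
ranked = prioritize(findings)

for f in ranked:
    print(f.impact * f.actionability, f.dimension, f.theme)
```

A multiplicative score is only one option; it favors findings that are both severe and fixable, while a weighted sum would let a very high impact offset low actionability.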
Experiment-specific - summarizes the DARE executor's research design into a clean research_result report, which must be written back into the spec file produced by formated-specs.
Experiment-specific - replaces writing-specs and emits DARE's 4-layer call plan as a clean research_graph schema. The last step forces loading formated-result.
loss-1 judge - read a sample's full dialogue and decide whether the user simulator semantically enacted its Policy Card. check-blind.
loss-2 judge - pairwise quality comparison across the n rungs within one topic; decide monotonicity and endpoint separation. check-blind, D1-D5 only.
Strategy: Inference to the best explanation in the face of anomalies
Remove components one by one, observe system changes to reveal hidden
Map system architecture to ablatable units for ablation studies
Design ablation studies to isolate component contributions in ML systems
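
The "remove components one by one" idea in the ablation entries above can be illustrated with a leave-one-out loop. This is a sketch under stated assumptions rather than any of these skills' actual procedure: `leave_one_out_ablation`, `toy_evaluate`, and the component names are hypothetical, and the toy scoring function stands in for a real evaluation run.

```python
def leave_one_out_ablation(components, evaluate):
    """Score the full system, then re-score with each component removed in turn.

    Returns the baseline score and, per component, the drop caused by removing it.
    """
    full = frozenset(components)
    baseline = evaluate(full)
    deltas = {}
    for name in components:
        deltas[name] = baseline - evaluate(full - {name})
    return baseline, deltas


# Toy stand-in for a real evaluation: each component adds a fixed amount.
# Names and values are made up for illustration.
TOY_CONTRIBUTION = {"retriever": 3, "reranker": 1, "cache": 0}


def toy_evaluate(active):
    return sum(TOY_CONTRIBUTION[c] for c in active)


baseline, deltas = leave_one_out_ablation(list(TOY_CONTRIBUTION), toy_evaluate)
print(baseline, deltas)
```

In this toy setup the contributions are additive, so each delta equals the component's own contribution. In a real system components interact, and a leave-one-out delta only measures the marginal effect given that everything else is present.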