ladder-quality-order
The ladder-quality-order skill performs pairwise quality comparisons of research designs within a topic using five substantive dimensions (D1–D5: meaningfulness, skill-research value, DARE usability, layer respect, and prerequisite firmness). It ranks n samples against an intended quality order, outputs a Kendall tau correlation score and endpoint stability metrics, and flags when top and bottom designs become indistinguishable. Use this to validate whether interpolated research quality gradients maintain monotonic separation and resist framing-based confounds.
git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/ladder-quality-order && cp -r /tmp/ladder-quality-order/self-iteration/2026-06-06-probe-pretrain/skills/ladder-quality-order ~/.claude/skills/ladder-quality-order
SKILL.md
# ladder-quality-order (loss-2)
You receive the n samples under one topic, each with (research_graph, research_result),
plus their intended_order: the id order from the interpolator, where id0 should be the best
and idN-1 should be the worst.
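The input can be pictured as a small typed container. The sketch below is only illustrative: the class names `Sample` and `TopicBatch` are hypothetical, and it treats `research_graph` and `research_result` as opaque because their schemas are not given here. It simply checks that `intended_order` is a permutation of the sample ids.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Sample:
    """One interpolated research design (hypothetical container)."""
    id: int               # interpolator position: 0 = intended best
    research_graph: Any   # DARE 4-layer call plan; schema treated as opaque
    research_result: Any  # clean research report for the same design


@dataclass(frozen=True)
class TopicBatch:
    """All n samples for one topic plus the interpolator's intended order."""
    samples: tuple[Sample, ...]
    intended_order: tuple[int, ...]  # id0 (best) first, idN-1 (worst) last

    def __post_init__(self) -> None:
        ids = sorted(s.id for s in self.samples)
        # Reject duplicate ids or an order that does not cover exactly the sample ids
        if len(set(ids)) != len(ids) or sorted(self.intended_order) != ids:
            raise ValueError("intended_order must be a permutation of the sample ids")
```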
## Task (pairwise ranking, no absolute scores)
1. Enumerate all i<j pairs; for each pair ask: **in the D1-D5 sense, which research design
is more substantive?** (D1 more meaningful / D2 more skill-research value / D3 more
usable to DARE / D4 better respects the 4 layers / D5 firmer prerequisites). Output
winner + a one-line reason.
2. Aggregate into an induced order; compute Kendall tau against intended_order.
3. Endpoints: directly compare id0 vs idN-1; across K repeats, check whether id0 wins stably.
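The three steps above can be sketched as follows, under stated assumptions. The judge is a hypothetical callable `judge(i, j) -> (winner_id, reason)` that would wrap a D1-D5-only prompt. The skill does not say how pairwise wins become an induced order, so this sketch uses a Copeland-style win count as one plausible choice. Tau is Kendall's tau-a, computed on two strict rankings.

```python
from itertools import combinations
from typing import Callable

# Hypothetical judge signature: judge(i, j) -> (winner_id, one_line_reason)
Judge = Callable[[int, int], tuple[int, str]]


def induced_order(ids: list[int], judge: Judge) -> tuple[list[int], list[dict]]:
    """Steps 1-2: judge every i<j pair, then rank ids by pairwise win count."""
    wins = {i: 0 for i in ids}
    log = []
    for i, j in combinations(sorted(ids), 2):
        winner, reason = judge(i, j)
        wins[winner] += 1
        log.append({"i": i, "j": j, "winner": winner, "reason": reason})
    # Copeland-style aggregation (an assumption): more wins ranks higher;
    # ties fall back to the id so the result is deterministic.
    order = sorted(ids, key=lambda x: (-wins[x], x))
    return order, log


def kendall_tau(order: list[int], reference: list[int]) -> float:
    """Kendall tau-a between two strict rankings of the same distinct ids."""
    if len(order) < 2 or len(set(order)) != len(order) or sorted(order) != sorted(reference):
        raise ValueError("need two rankings of the same >= 2 distinct ids")
    pos_a = {x: k for k, x in enumerate(order)}
    pos_b = {x: k for k, x in enumerate(reference)}
    concordant = discordant = 0
    for x, y in combinations(order, 2):
        # Distinct positions mean the product is never zero
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def endpoint_wins(best: int, worst: int, judge: Judge, k: int) -> int:
    """Step 3: count how many of K repeated direct comparisons id0 wins."""
    return sum(1 for _ in range(k) if judge(best, worst)[0] == best)
```

Suppose a toy judge such as `lambda i, j: (min(i, j), "toy")` always prefers the lower id. It reproduces the intended order exactly, so tau is 1.0. If the judge flips a single adjacent pair out of four ids, tau drops to 4/6.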
## Output (JSON)
{"tau": float, "monotonicity_pass": bool, // tau>=0.7 and no endpoint inversion
 "endpoint_separation_pass": bool, // id0 wins at least K - allowance of the K repeats
 "rigor_floor_flag": bool, // true if id0 and idN-1 collapse into indistinguishable endpoints (feed to the risk register)
 "pairwise_log": [{"i": id, "j": id, "winner": id, "reason": str}]}
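Here is a hedged sketch of how the flags could be assembled from the quantities above. Several pieces are assumptions, not values defined by the skill:

- The knobs `allowance` and `collapse_band` are hypothetical, and so are their defaults.
- An "endpoint inversion" is read as idN-1 ranking above id0 in the induced order.
- Endpoint "collapse" is read as id0's win rate over the K repeats sitting close to a coin flip.

```python
def build_output(tau: float, induced: list[int], intended: list[int],
                 id0_wins: int, k: int, pairwise_log: list[dict],
                 allowance: int = 1, collapse_band: float = 0.2) -> dict:
    """Assemble the loss-2 JSON; allowance and collapse_band are hypothetical knobs."""
    if k <= 0:
        raise ValueError("k must be positive")
    best, worst = intended[0], intended[-1]
    # Assumption: an "endpoint inversion" means idN-1 is ranked above id0.
    inverted = induced.index(worst) < induced.index(best)
    win_rate = id0_wins / k
    return {
        "tau": tau,
        "monotonicity_pass": tau >= 0.7 and not inverted,
        "endpoint_separation_pass": id0_wins >= k - allowance,
        # Assumption: endpoints have collapsed when id0's win rate over the
        # K repeats lies within collapse_band of a coin flip (0.5).
        "rigor_floor_flag": abs(win_rate - 0.5) <= collapse_band,
        "pairwise_log": pairwise_log,
    }
```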
## check-blind contract (hard constraint)
- The judge prompt may use **only** D1-D5 wording.
- **Forbidden**: 32-check vocabulary, 6-primitive, "pseudo-good/novel-good" categories,
any detection signature.
- z-perp-C: on the B1 confound triplet (same substance, different framing) your order
**must stay invariant**; if it varies with framing -> you were dragged by the confound,
tighten back to D1-D5 substance.
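The contract can be backed by two cheap guards. The sketch below is a minimal version with clear limits:

- The deny-list contains only the terms the contract names literally. The full 32-check vocabulary and the detection signatures are not reproduced, so callers would pass them through the hypothetical `extra_terms` parameter.
- The invariance check assumes the B1 triplet is ranked once per framing, with the resulting induced orders keyed by a framing label.

```python
# Only the terms named literally in the contract; the full 32-check vocabulary
# and detection signatures are NOT included and must be supplied by the caller.
FORBIDDEN_TERMS = ("32-check", "6-primitive", "pseudo-good", "novel-good")


def prompt_violations(prompt: str, extra_terms: tuple[str, ...] = ()) -> list[str]:
    """Return every forbidden term found (case-insensitively) in a judge prompt."""
    text = prompt.lower()
    return [t for t in (*FORBIDDEN_TERMS, *extra_terms) if t.lower() in text]


def framing_invariant(orders_by_framing: dict[str, list[int]]) -> bool:
    """True when every framing of the same-substance triplet yields the same induced order."""
    orders = list(orders_by_framing.values())
    return all(o == orders[0] for o in orders[1:])
```

If `framing_invariant` returns False, that is the signal the contract describes: the ranking was dragged by framing and should be tightened back to D1-D5 substance.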