artifact-detection
The artifact-detection skill systematically identifies annotation shortcuts and dataset biases in machine learning benchmarks by searching for evidence of partial-input baselines, contrast set performance drops, and format sensitivity. Use this when evaluating whether benchmark scores reflect genuine model capabilities or exploitation of spurious correlations and labeling artifacts that inflate performance metrics.
git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/artifact-detection && cp -r /tmp/artifact-detection/skills/artifact-detection ~/.claude/skills/artifact-detection
SKILL.md
# Artifact Detection Tactic
Systematically probe benchmarks for annotation artifacts, dataset shortcuts, and spurious correlations that allow models to achieve high scores without the intended capability.
## Stages
### Stage 1: Hypothesis-Only Baseline Test
Search literature for evidence that partial-input baselines achieve unexpectedly high performance:
- Hypothesis-only baselines (NLI without premise)
- Question-only baselines (QA without context)
- Label-word frequency baselines
- Majority-class and surface-pattern baselines
**Search queries**: "[benchmark] annotation artifacts", "[benchmark] hypothesis only", "[benchmark] spurious correlations", "[benchmark] dataset bias"
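If published partial-input results exist, record the performance gap between the partial-input baseline and the full-input model. A gap of less than 10 points indicates severe artifacts, because the model gains little from the input the task is supposed to require.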
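The gap check can be sketched in a few lines of Python. This is a minimal illustration, assuming the 10-point threshold applies to the difference between full-input and partial-input scores (both in accuracy points); the class name, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass

# Threshold from the stage description; interpreting it as a
# full-minus-partial gap is an assumption of this sketch.
SEVERE_GAP_POINTS = 10.0


@dataclass
class PartialInputResult:
    input_type: str                # e.g. "hypothesis only"
    performance: float             # partial-input score, in points
    full_model_performance: float  # full-input score, in points

    @property
    def gap(self) -> float:
        """Points gained by giving the model the full input."""
        return self.full_model_performance - self.performance

    def indicates_severe_artifacts(self) -> bool:
        """True when the full input adds fewer than 10 points."""
        return self.gap < SEVERE_GAP_POINTS


# Illustrative numbers only, not taken from any published result.
result = PartialInputResult("hypothesis only", 67.0, 72.0)
print(result.gap, result.indicates_severe_artifacts())
```

A small gap is only meaningful when the partial-input score is itself well above chance, so it is worth recording the chance rate alongside these fields.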
### Stage 2: Contrast Set Construction
Identify whether contrast sets or adversarial evaluations exist:
- Search for "[benchmark] contrast sets", "[benchmark] adversarial examples"
- Check if CheckList-style behavioral tests have been applied
- Look for counterfactual data augmentation studies
Record performance drops on contrast sets. Drops > 20 points indicate reliance on surface patterns.
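As a rough sketch, the drop and the 20-point check could be computed as follows. The function names are hypothetical, and the drop is assumed to be original score minus contrast-set score, both in points.

```python
# Threshold from the stage description above.
SURFACE_PATTERN_DROP_POINTS = 20.0


def contrast_drop(original: float, contrast: float) -> float:
    """Points lost when moving from the original test set to the contrast set."""
    return original - contrast


def relies_on_surface_patterns(original: float, contrast: float) -> bool:
    """True when the drop strictly exceeds 20 points."""
    return contrast_drop(original, contrast) > SURFACE_PATTERN_DROP_POINTS


# Illustrative numbers only.
print(contrast_drop(90.0, 65.0), relies_on_surface_patterns(90.0, 65.0))
```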
### Stage 3: Format Manipulation Probes
Search for evidence of format sensitivity:
- Prompt template sensitivity studies
- Label name/ordering effects
- Verbalization effects in classification
- Input length correlations with labels
Record whether minor format changes cause disproportionate score changes.
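One way to quantify this is the spread between the best and worst score across format variants. The sketch below assumes scores are reported in points per variant; the function name and the variant labels are hypothetical, and the stage itself does not define what counts as "disproportionate".

```python
from typing import Mapping, Tuple


def format_spread(scores: Mapping[str, float]) -> Tuple[float, str]:
    """Return (max - min, "min-max") over the scores of each format variant."""
    if not scores:
        raise ValueError("at least one format variant is required")
    lowest = min(scores.values())
    highest = max(scores.values())
    return highest - lowest, f"{lowest:.1f}-{highest:.1f}"


# Illustrative numbers only: one benchmark scored under three prompt templates.
spread, score_range = format_spread(
    {"template_a": 61.0, "template_b": 74.5, "template_c": 70.0}
)
print(spread, score_range)
```

The string form matches the `score_range: string` field of the output schema, while the numeric spread is what a points-based threshold would be compared against.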
### Stage 4: Conclusion Synthesis
Aggregate evidence into artifact severity assessment:
| Severity | Criteria |
|----------|----------|
| Critical | Partial-input baseline within 5 points of full model |
| High | Contrast set drop >20 points OR format sensitivity >10 points |
| Medium | Known artifacts documented but partial mitigations exist |
| Low | Minor artifacts, full-input still required for high performance |
| None | No evidence of artifacts (may indicate insufficient probing) |
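The table can be read as a most-severe-first decision rule. The sketch below is one possible encoding, not a definitive one: the function and parameter names are hypothetical, "within 5 points" is taken as a gap of at most 5, and the qualitative Medium and Low rows are reduced to two boolean flags, which is an assumption of this example.

```python
from typing import Optional


def classify_severity(
    min_partial_gap: Optional[float] = None,    # smallest full-minus-partial gap
    max_contrast_drop: Optional[float] = None,  # largest contrast set drop
    max_format_spread: Optional[float] = None,  # largest format-induced spread
    known_artifacts: bool = False,              # artifacts documented in literature
    minor_only: bool = False,                   # full input still needed for high scores
) -> str:
    """Map collected evidence to a severity label, checking the worst case first."""
    if min_partial_gap is not None and min_partial_gap <= 5.0:
        return "critical"
    if (max_contrast_drop is not None and max_contrast_drop > 20.0) or (
        max_format_spread is not None and max_format_spread > 10.0
    ):
        return "high"
    if known_artifacts:
        return "low" if minor_only else "medium"
    # No evidence found; per the table, this may reflect insufficient probing.
    return "none"


print(classify_severity(min_partial_gap=4.0))
print(classify_severity(max_contrast_drop=25.0))
print(classify_severity(known_artifacts=True, minor_only=True))
```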
## Output
```yaml
artifact_report:
  benchmark: string
  overall_severity: critical|high|medium|low|none
  partial_input_baselines:
    - input_type: string  # e.g., "hypothesis only"
      performance: float
      full_model_performance: float
      gap: float
      source: string
  contrast_set_results:
    - contrast_set: string
      original_performance: float
      contrast_performance: float
      drop: float
      source: string
  format_sensitivity:
    - manipulation: string
      score_range: string
      source: string
  shortcuts_identified:
    - shortcut: string
      mechanism: string
      exploitability: high|medium|low
  evidence_completeness: thorough|partial|minimal
```
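A report in this shape can be sanity-checked once it has been loaded into a Python dict (for example with a YAML parser). The sketch below is a hypothetical helper, not part of the skill; it assumes `gap` means full-model minus partial-input performance and `drop` means original minus contrast performance, neither of which the schema states explicitly.

```python
from typing import Any, Dict, List

SEVERITIES = {"critical", "high", "medium", "low", "none"}
EXPLOITABILITY = {"high", "medium", "low"}
COMPLETENESS = {"thorough", "partial", "minimal"}
TOLERANCE = 1e-6  # allowance for floating-point rounding


def _check_difference(rows, name, hi_key, lo_key, diff_key, errors):
    """Verify that row[diff_key] equals row[hi_key] - row[lo_key] for each row."""
    for i, row in enumerate(rows or []):
        try:
            expected = row[hi_key] - row[lo_key]
            if abs(row[diff_key] - expected) > TOLERANCE:
                errors.append(f"{name}[{i}]: {diff_key} is inconsistent")
        except (KeyError, TypeError):
            errors.append(f"{name}[{i}]: missing or non-numeric field")


def validate_report(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    report = doc.get("artifact_report")
    if not isinstance(report, dict):
        return ["artifact_report: missing or not a mapping"]
    errors: List[str] = []
    if report.get("overall_severity") not in SEVERITIES:
        errors.append("overall_severity: not an allowed value")
    if report.get("evidence_completeness") not in COMPLETENESS:
        errors.append("evidence_completeness: not an allowed value")
    _check_difference(report.get("partial_input_baselines"), "partial_input_baselines",
                      "full_model_performance", "performance", "gap", errors)
    _check_difference(report.get("contrast_set_results"), "contrast_set_results",
                      "original_performance", "contrast_performance", "drop", errors)
    for i, row in enumerate(report.get("shortcuts_identified") or []):
        if not isinstance(row, dict) or row.get("exploitability") not in EXPLOITABILITY:
            errors.append(f"shortcuts_identified[{i}]: exploitability not allowed")
    return errors
```

Calling `validate_report(report_dict)` returns human-readable messages rather than raising, so a reviewer can see every problem in one pass.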
## Yield Report
| Metric | Minimum |
|--------|---------|
| Literature sources checked | 5 |
| Artifact categories probed | 3 |
| Evidence items collected | 4 |
| Severity classification produced | 1 |
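The minimums in this table lend themselves to a simple completeness check before a report is finalized. The following is a minimal sketch; the key names and the helper are hypothetical, and only the numeric minimums come from the table.

```python
from typing import Dict, Mapping, Tuple

# Minimums copied from the Yield Report table; key names are illustrative.
YIELD_MINIMUMS: Dict[str, int] = {
    "literature_sources_checked": 5,
    "artifact_categories_probed": 3,
    "evidence_items_collected": 4,
    "severity_classifications_produced": 1,
}


def yield_shortfalls(counts: Mapping[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {metric: (actual, minimum)} for every metric below its minimum."""
    return {
        metric: (counts.get(metric, 0), minimum)
        for metric, minimum in YIELD_MINIMUMS.items()
        if counts.get(metric, 0) < minimum
    }


# Illustrative counts only.
print(yield_shortfalls({
    "literature_sources_checked": 6,
    "artifact_categories_probed": 2,
    "evidence_items_collected": 4,
    "severity_classifications_produced": 1,
}))
```

An empty result means every minimum in the table has been met.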