Skill · 389 repo stars · updated 20d ago

artifact-detection

The artifact-detection skill systematically identifies annotation shortcuts and dataset biases in machine learning benchmarks by searching for evidence of partial-input baselines, contrast set performance drops, and format sensitivity. Use this when evaluating whether benchmark scores reflect genuine model capabilities or exploitation of spurious correlations and labeling artifacts that inflate performance metrics.

Repository: de-anthropocentric-research-engine

Install in Claude Code

git clone --depth 1 https://github.com/yogsoth-ai/de-anthropocentric-research-engine /tmp/artifact-detection && cp -r /tmp/artifact-detection/skills/artifact-detection ~/.claude/skills/artifact-detection

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Artifact Detection Tactic

Systematically probe benchmarks for annotation artifacts, dataset shortcuts, and spurious correlations that allow models to achieve high scores without the intended capability.

## Stages

### Stage 1: Hypothesis-Only Baseline Test

Search literature for evidence that partial-input baselines achieve unexpectedly high performance:
- Hypothesis-only baselines (NLI without premise)
- Question-only baselines (QA without context)
- Label-word frequency baselines
- Majority-class and surface-pattern baselines

**Search queries**: "[benchmark] annotation artifacts", "[benchmark] hypothesis only", "[benchmark] spurious correlations", "[benchmark] dataset bias"
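The query templates above can be expanded mechanically for a given benchmark. The following is a minimal sketch; the function name `expand_queries` is hypothetical, and "SNLI" is used only as an example benchmark name.

```python
# Query templates copied from the Stage 1 search list; "[benchmark]" is the placeholder.
STAGE1_TEMPLATES = [
    "[benchmark] annotation artifacts",
    "[benchmark] hypothesis only",
    "[benchmark] spurious correlations",
    "[benchmark] dataset bias",
]


def expand_queries(benchmark: str, templates: list[str] = STAGE1_TEMPLATES) -> list[str]:
    """Substitute a benchmark name into each query template (hypothetical helper)."""
    name = benchmark.strip()
    if not name:
        raise ValueError("benchmark name must be non-empty")
    return [template.replace("[benchmark]", name) for template in templates]


queries = expand_queries("SNLI")
for query in queries:
    print(query)
```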

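If published partial-input results exist, record the gap between full-input and partial-input performance, together with the chance-level (random) baseline. A partial-input baseline that scores above chance and falls within 10 points of the full-input score indicates severe artifacts.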
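One way to read the 10-point rule is sketched below, assuming the gap is defined as full-input score minus partial-input score and that the partial-input baseline must also beat chance. The `chance_level` field and the helper name are hypothetical, and the numbers in the example are invented for illustration, not taken from any paper.

```python
from dataclasses import dataclass


@dataclass
class PartialInputResult:
    input_type: str                # e.g. "hypothesis only"
    performance: float             # partial-input score, in points (0-100)
    full_model_performance: float  # full-input score, in points (0-100)
    chance_level: float            # hypothetical field: random-guess score for the task

    @property
    def gap(self) -> float:
        """Assumed definition: full-input score minus partial-input score."""
        return self.full_model_performance - self.performance

    @property
    def above_chance(self) -> float:
        """How far the partial-input baseline sits above random guessing."""
        return self.performance - self.chance_level


def indicates_severe_artifacts(result: PartialInputResult, max_gap: float = 10.0) -> bool:
    """Flag a baseline that beats chance yet stays within max_gap of the full model."""
    return result.above_chance > 0 and result.gap < max_gap


# Invented numbers, for illustration only.
example = PartialInputResult("hypothesis only", 62.0, 70.0, 33.3)
print(example.gap, indicates_severe_artifacts(example))
```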

### Stage 2: Contrast Set Construction

Identify whether contrast sets or adversarial evaluations exist:
- Search for "[benchmark] contrast sets", "[benchmark] adversarial examples"
- Check if CheckList-style behavioral tests have been applied
- Look for counterfactual data augmentation studies

Record performance drops on contrast sets. Drops > 20 points indicate reliance on surface patterns.
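The drop calculation and the 20-point threshold can be sketched as follows. The field names mirror the `contrast_set_results` entries in the output schema; the `surface_reliance` flag and the function names are hypothetical, and the scores are invented for illustration.

```python
def contrast_drop(original_performance: float, contrast_performance: float) -> float:
    """Drop in points when moving from the original test set to the contrast set."""
    return original_performance - contrast_performance


def flag_surface_reliance(results: list[dict], threshold: float = 20.0) -> list[dict]:
    """Attach the computed drop and a threshold flag to each contrast-set record."""
    flagged = []
    for record in results:
        drop = contrast_drop(record["original_performance"], record["contrast_performance"])
        flagged.append({**record, "drop": drop, "surface_reliance": drop > threshold})
    return flagged


# Invented scores, for illustration only.
records = flag_surface_reliance([
    {"contrast_set": "example-a", "original_performance": 88.0, "contrast_performance": 61.5},
    {"contrast_set": "example-b", "original_performance": 88.0, "contrast_performance": 80.0},
])
for record in records:
    print(record["contrast_set"], record["drop"], record["surface_reliance"])
```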

### Stage 3: Format Manipulation Probes

Search for evidence of format sensitivity:
- Prompt template sensitivity studies
- Label name/ordering effects
- Verbalization effects in classification
- Input length correlations with labels

Record whether minor format changes cause disproportionate score changes.
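A simple way to quantify format sensitivity is the spread between the best and worst score across format variants of the same task. The sketch below assumes that definition; the function name, the variant labels, and the scores are all hypothetical.

```python
def format_sensitivity(scores_by_variant: dict[str, float]) -> dict:
    """Summarize scores obtained under different prompt or label formats.

    Returns the min-max range as a string (matching the schema's score_range
    field) and the spread in points.
    """
    if len(scores_by_variant) < 2:
        raise ValueError("need at least two format variants to measure sensitivity")
    low = min(scores_by_variant.values())
    high = max(scores_by_variant.values())
    return {"score_range": f"{low:.1f}-{high:.1f}", "spread": high - low}


# Invented scores, for illustration only.
summary = format_sensitivity({"template_a": 71.0, "template_b": 64.5, "template_c": 58.0})
print(summary)
```

The resulting spread can then be compared against the 10-point format-sensitivity criterion used in the severity table.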

### Stage 4: Conclusion Synthesis

Aggregate the evidence into an artifact severity assessment:

| Severity | Criteria |
|----------|----------|
| Critical | Partial-input baseline within 5 points of full model |
| High | Contrast set drop >20 points OR format sensitivity >10 points |
| Medium | Known artifacts documented but partial mitigations exist |
| Low | Minor artifacts, full-input still required for high performance |
| None | No evidence of artifacts (may indicate insufficient probing) |
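The table can be turned into a small decision function. The sketch below makes two assumptions that the table does not state: levels are checked from most to least severe and the first match wins, and "within 5 points" is read as a gap of at most 5. The parameter names are hypothetical.

```python
from typing import Iterable


def classify_severity(
    partial_gaps: Iterable[float] = (),    # full-input minus partial-input score, per baseline
    contrast_drops: Iterable[float] = (),  # original minus contrast-set score, per contrast set
    format_spreads: Iterable[float] = (),  # max minus min score across format variants
    documented_with_mitigations: bool = False,
    minor_artifacts: bool = False,
) -> str:
    """Return the most severe level whose criterion is met (assumed precedence)."""
    if any(gap <= 5 for gap in partial_gaps):
        return "critical"
    if any(drop > 20 for drop in contrast_drops) or any(s > 10 for s in format_spreads):
        return "high"
    if documented_with_mitigations:
        return "medium"
    if minor_artifacts:
        return "low"
    # No evidence found; per the table this may also mean probing was insufficient.
    return "none"


print(classify_severity(partial_gaps=[4.0], contrast_drops=[25.0]))
print(classify_severity(format_spreads=[13.0]))
print(classify_severity())
```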

## Output

```yaml
artifact_report:
  benchmark: string
  overall_severity: critical|high|medium|low|none
  partial_input_baselines:
    - input_type: string  # e.g., "hypothesis only"
      performance: float
      full_model_performance: float
      gap: float
      source: string
  contrast_set_results:
    - contrast_set: string
      original_performance: float
      contrast_performance: float
      drop: float
      source: string
  format_sensitivity:
    - manipulation: string
      score_range: string
      source: string
  shortcuts_identified:
    - shortcut: string
      mechanism: string
      exploitability: high|medium|low
  evidence_completeness: thorough|partial|minimal
```
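A report in this shape can be sanity-checked once it has been loaded into a Python dict. The validator below is a hypothetical sketch: it checks only the enumerated fields and the arithmetic of `gap` and `drop`, and it assumes both are "higher score minus lower score", a sign convention the schema does not spell out. The example report uses placeholder values.

```python
SEVERITIES = ("critical", "high", "medium", "low", "none")
COMPLETENESS = ("thorough", "partial", "minimal")
EXPLOITABILITY = ("high", "medium", "low")


def _check_difference(item, total_key, part_key, diff_key, label, errors):
    """Check item[diff_key] == item[total_key] - item[part_key] (assumed sign)."""
    try:
        expected = float(item[total_key]) - float(item[part_key])
        actual = float(item[diff_key])
    except (KeyError, TypeError, ValueError):
        errors.append(f"{label}: missing or non-numeric field")
        return
    if abs(actual - expected) > 1e-6:
        errors.append(f"{label}: {diff_key} is {actual}, expected {expected}")


def validate_report(report: dict) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    body = report.get("artifact_report")
    if not isinstance(body, dict):
        return ["missing top-level 'artifact_report' mapping"]
    errors: list[str] = []
    if not isinstance(body.get("benchmark"), str):
        errors.append("benchmark must be a string")
    if body.get("overall_severity") not in SEVERITIES:
        errors.append("overall_severity has an unknown value")
    if body.get("evidence_completeness") not in COMPLETENESS:
        errors.append("evidence_completeness has an unknown value")
    for i, item in enumerate(body.get("partial_input_baselines") or []):
        _check_difference(item, "full_model_performance", "performance", "gap",
                          f"partial_input_baselines[{i}]", errors)
    for i, item in enumerate(body.get("contrast_set_results") or []):
        _check_difference(item, "original_performance", "contrast_performance", "drop",
                          f"contrast_set_results[{i}]", errors)
    for i, item in enumerate(body.get("shortcuts_identified") or []):
        if not isinstance(item, dict) or item.get("exploitability") not in EXPLOITABILITY:
            errors.append(f"shortcuts_identified[{i}]: unknown exploitability")
    return errors


# Placeholder values, for illustration only.
report = {"artifact_report": {
    "benchmark": "ExampleBench",
    "overall_severity": "high",
    "evidence_completeness": "partial",
    "contrast_set_results": [{"contrast_set": "example", "original_performance": 90.0,
                              "contrast_performance": 65.0, "drop": 25.0,
                              "source": "placeholder"}],
}}
print(validate_report(report))
```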

## Yield Report

| Metric | Minimum |
|--------|---------|
| Literature sources checked | 5 |
| Artifact categories probed | 3 |
| Evidence items collected | 4 |
| Severity classification produced | 1 |
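The minimums in the table can be checked against the counts from a run. This is a minimal sketch; the snake_case keys and the function name are hypothetical stand-ins for the four metrics, and the counts in the example are invented.

```python
# Minimums copied from the Yield Report table; key names are hypothetical.
YIELD_MINIMUMS = {
    "literature_sources_checked": 5,
    "artifact_categories_probed": 3,
    "evidence_items_collected": 4,
    "severity_classification_produced": 1,
}


def yield_shortfalls(counts: dict[str, int]) -> dict[str, int]:
    """Return, for each metric below its minimum, how many items are still missing."""
    return {
        metric: minimum - counts.get(metric, 0)
        for metric, minimum in YIELD_MINIMUMS.items()
        if counts.get(metric, 0) < minimum
    }


# Invented counts, for illustration only.
shortfalls = yield_shortfalls({
    "literature_sources_checked": 6,
    "artifact_categories_probed": 3,
    "evidence_items_collected": 2,
    "severity_classification_produced": 1,
})
print(shortfalls)
```

An empty result means every minimum in the table was met.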

From the same repository

formated-result (Skill)

Experiment-specific - summarize the DARE executor's research design into a clean research_result report, forced to write back into the spec file produced by formated-specs.

formated-specs (Skill)

Experiment-specific - replaces writing-specs, emits DARE's 4-layer call plan as a clean research_graph schema. The last step forces a load of formated-result.

injection-fidelity (Skill)

loss-1 judge - read a sample's full dialogue and decide whether the user simulator semantically enacted its Policy Card. check-blind.

ladder-quality-order (Skill)

loss-2 judge - pairwise quality comparison across the n rungs within one topic; decide monotonicity and endpoint separation. check-blind, D1-D5 only.

abductive-hypothesis-generation (Skill)

Strategy: Inference to the best explanation in the face of anomalies

ablation-brainstorm (Skill)

Remove components one by one, observe system changes to reveal hidden

ablation-component-mapping (Skill)

Map system architecture to ablatable units for ablation studies

ablation-design (Skill)

Design ablation studies to isolate component contributions in ML systems