Skip to main content
ClaudeWave
Skill570 repo starsupdated today

bernstein-quality

Bernstein Quality Metrics analyzes and compares code generation reliability across AI models by executing quality assessment scripts and displaying success rates, pass rates for linting and testing, and completion time distributions. Use this skill when users ask about agent reliability, model performance comparisons, test failure analysis, or want to see quality metrics dashboards for decision-making on model routing and optimization.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/sipyourdrink-ltd/bernstein /tmp/bernstein-quality && cp -r /tmp/bernstein-quality/packages/cursor-plugin/skills/bernstein-quality ~/.claude/skills/bernstein-quality
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Bernstein Quality Metrics

Analyze quality and reliability of agent-generated code.

## When to Use

- User asks "how reliable are the agents?" or "which model is best?"
- User wants success rates, pass rates, or completion time stats
- User asks about test failures or lint issues across models
- User says "show me quality metrics"

## Instructions

1. Run `scripts/quality.sh metrics` for overall quality metrics.
2. Run `scripts/quality.sh pass-rates` for lint/typecheck/test pass rates by model.
3. Run `scripts/quality.sh times` for completion time distributions.

4. Present a quality dashboard:

```
## Quality Dashboard

### Success Rate by Model
| Model | Tasks | Success | Fail | Rate |
|-------|-------|---------|------|------|
| claude-sonnet-4 | 24 | 22 | 2 | 91.7% |
| gpt-4.1 | 12 | 10 | 2 | 83.3% |

### Pass Rates
| Check | Overall | claude-sonnet-4 | gpt-4.1 |
|-------|---------|-----------------|---------|
| Lint | 96% | 98% | 92% |
| Type-check | 88% | 91% | 83% |
| Tests | 85% | 89% | 75% |

### Completion Times
| Percentile | Time |
|------------|------|
| p50 | 3m 20s |
| p90 | 8m 45s |
| p99 | 15m 12s |
```

5. Highlight any models with significantly lower pass rates.
6. Recommend model routing adjustments if one model consistently underperforms.