mantis-scorecard
The mantis-scorecard slash command provides research and operations teams with visibility into per-model reliability across decision classes by tracking how often each LLM model has been overruled by authoritative signals like full analysis comparisons, judge reviews, and multi-model consensus. Use it to inspect miss-rates, compare model performance side-by-side, examine disagreement reasoning samples, and manage trust overrides for specific model-decision-class combinations through filtering, sorting, and reset operations.
mkdir -p ~/.claude/commands && curl -fsSL https://raw.githubusercontent.com/deonmenezes/mantishack/HEAD/.claude/commands/mantis-scorecard.md -o ~/.claude/commands/mantis-scorecard.md
# /scorecard
Read and maintain the **model scorecard** — a per-model track-record of how often each LLM model has been overruled by an authoritative signal (full ANALYSE comparison, judge review, multi-model consensus, tool evidence, operator feedback). The scorecard is what powers fast-tier short-circuit decisions: cells with a Wilson 95% upper-bound miss-rate at or below 5% are trusted; everything else falls through to full analysis.
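The trust test described above can be sketched with the standard Wilson score interval. This is a minimal illustration, not the scorecard's actual code: the function names, the two-sided `z = 1.96`, and the handling of empty cells are assumptions.

```python
import math

TRUST_THRESHOLD = 0.05  # "at or below 5%" from the paragraph above


def wilson_upper_bound(misses: int, n: int, z: float = 1.96) -> float:
    """Upper limit of the Wilson score interval for a miss-rate of misses/n.

    z = 1.96 (two-sided 95%) is an assumption; the real CLI may use a
    different constant or a correction term.
    """
    if n <= 0:
        return 1.0  # no data yet: assume the worst case
    p = misses / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return min(1.0, (centre + margin) / denom)


def is_trusted(misses: int, n: int) -> bool:
    """A cell is trusted when its Wilson upper bound is at or below 5%."""
    return wilson_upper_bound(misses, n) <= TRUST_THRESHOLD


print(f"{wilson_upper_bound(0, 10):.3f}")    # 0.278
print(is_trusted(0, 72), is_trusted(0, 73))  # False True
```

Under these assumptions a cell with zero misses needs 73 comparisons before its upper bound drops to 5%, which is why small cells cannot be trusted on a clean record alone.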
The slash command is for **research and ops**, not a routing API. The actual routing happens automatically inside `LLMClient.generate_structured` once the codeql consumer (and future consumers) are wired.
## Usage
```
/scorecard # default: list all cells with derived columns
/scorecard list [flags] # filtered / sorted views
/scorecard compare <model-a> <model-b> # side-by-side on shared decision_classes
/scorecard samples <decision_class> # disagreement-reasoning samples (the "why was it wrong?" view)
/scorecard pin <decision_class> --model <m> --as <override>
/scorecard unpin <decision_class> --model <m>
/scorecard reset [<decision_class>] [--model <m>] [--older-than-days <n>] [--all]
```
`list` flags: `--by-savings` `--by-miss-rate` `--untrusted` `--learning` `--consumer <prefix>` `--since <Nd|Nh>`.
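As a rough sketch of how a `--since <Nd|Nh>` value could be interpreted, assuming `N` is a non-negative integer and the suffix selects days or hours (the helper name `parse_since` is hypothetical, not part of the CLI):

```python
import re
from datetime import timedelta


def parse_since(value: str) -> timedelta:
    """Turn a string such as '7d' or '12h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dh])", value.strip())
    if match is None:
        raise ValueError(f"expected <N>d or <N>h, got {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(days=amount) if unit == "d" else timedelta(hours=amount)
```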
The CLI lives at `libexec/mantishack-llm-scorecard`. Output is markdown so it pastes cleanly into notebooks / issues / chat.
### Friendly model aliases (when handling user input)
The CLI takes canonical model names. When the user types something shorter, resolve to canonical before invoking:
| user types | canonical |
|---|---|
| `haiku` | `claude-haiku-4-5` |
| `sonnet` | `claude-sonnet-4-6` |
| `opus` | `claude-opus-4-7` (or whatever's in `LLMConfig.primary_model`) |
| `flash` / `flash-lite` | `gemini-2.5-flash-lite` |
| `4o-mini` | `gpt-4o-mini` |
| `mistral-small` | `mistral-small-latest` |
If unsure, ask the user which canonical name they meant. Don't guess silently.
## What the scorecard tracks
Per `(model, decision_class)` cell:
- `events.cheap_short_circuit` — `{correct, incorrect}`. Recorded when a cheap-tier verdict ("clear FP") is later compared against full ANALYSE. Producer wired in `core/llm/client.py` via `LLMClient.generate_structured`; consumers include `packages/codeql/dataflow_validator` and `packages/codeql/autonomous_analyzer`.
- `events.multi_model_consensus` — agreed-with-majority vs dissented. Producer wired at `packages/llm_analysis/orchestrator.py` (`record_consensus_outcomes`); fires on multi-model agentic runs for disputed findings only (100%-agreement runs correctly produce no events).
- `events.judge_review` — judge upheld vs overruled this model. Producer wired at `packages/llm_analysis/orchestrator.py` (`record_judge_outcomes`); fires when a judge model is configured via `--judge`.
- `events.tool_evidence` — tool agreed with vs contradicted this model's claim. Producer at `core/llm/scorecard/tool_evidence.py`; recorded via `mantishack-llm-scorecard tool-evidence` CLI subcommand (operator-driven; automated wiring from validators is per-consumer).
- `events.operator_feedback` — operator's marking matched vs contradicted this model's verdict. Recorded via `mantishack-llm-scorecard mark` CLI subcommand. No automated producer — explicitly operator-driven by design (the loop-closing ground-truth signal).
- `events.reasoning_divergence` — sister event for agreed-verdict findings whose reasoning text diverged. Producer at `packages/llm_analysis/orchestrator.py` (`record_reasoning_divergence`), companion to consensus.
- `disagreement_samples` — bounded log (max 5) of reasoning text from incorrect outcomes; truncated at 500 chars per side; reasoning only, never the prompt.
- `policy_override` — `auto` (data-driven) | `force_short_circuit` | `force_fall_through`.
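To make the shape of a cell concrete, here is a hypothetical in-memory model of the fields listed above. The class, method, and key names are assumptions, as are normalising every event type to `correct`/`incorrect` counters and dropping the oldest sample once the cap of 5 is reached; the real storage format may differ.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_SAMPLES = 5         # bounded log size from the list above
MAX_REASON_CHARS = 500  # per-side truncation from the list above


@dataclass
class Cell:
    model: str
    decision_class: str
    # event type -> {"correct": int, "incorrect": int}
    events: dict = field(default_factory=dict)
    # deque(maxlen=...) silently discards the oldest entry when full
    disagreement_samples: deque = field(
        default_factory=lambda: deque(maxlen=MAX_SAMPLES)
    )
    policy_override: str = "auto"

    def record(self, event_type: str, correct: bool,
               model_reasoning: str = "", authority_reasoning: str = "") -> None:
        """Count one outcome; keep truncated reasoning for incorrect ones."""
        counts = self.events.setdefault(event_type, {"correct": 0, "incorrect": 0})
        counts["correct" if correct else "incorrect"] += 1
        if not correct:
            # Reasoning only, never the prompt.
            self.disagreement_samples.append({
                "event": event_type,
                "model_reasoning": model_reasoning[:MAX_REASON_CHARS],
                "authority_reasoning": authority_reasoning[:MAX_REASON_CHARS],
            })
```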
**Re-shadowing.** Even when a cell is in `short-circuit` policy, `LLMConfig.scorecard_shadow_rate` (default 5%) of trusted calls still run full ANALYSE. This keeps fresh ground-truth comparison data flowing and detects drift if cheap-model behaviour changes (model upgrade, prompt refinement, etc.). Operator pins (`force_short_circuit`) bypass re-shadowing — explicit intent is never sampled away. Set `scorecard_shadow_rate=0.0` to disable.
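The re-shadowing rule can be sketched as a single decision function. This is an assumption-laden illustration: the function name, the `"short-circuit"` policy label, and the injectable `rng` are hypothetical, and only `scorecard_shadow_rate` and the override values come from the text.

```python
import random
from typing import Callable


def should_run_full_analysis(policy: str, override: str = "auto",
                             shadow_rate: float = 0.05,
                             rng: Callable[[], float] = random.random) -> bool:
    """Decide whether a call still goes through full ANALYSE."""
    if policy != "short-circuit":
        return True   # untrusted or learning cells always fall through
    if override == "force_short_circuit":
        return False  # operator pins bypass re-shadowing entirely
    # Trusted, data-driven cell: sample a fraction of calls for fresh ground truth.
    # random.random() is in [0, 1), so shadow_rate=0.0 never samples.
    return rng() < shadow_rate
```

Passing `rng` in makes the sampling deterministic in tests while leaving the default behaviour random.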
## decision_class anatomy
Format: `<consumer>:<rule_or_subject>`. Examples:
| consumer | example |
|---|---|
| codeql | `codeql:py/sql-injection`, `codeql:cpp/uncontrolled-format-string` |
| sca | `sca:major_bump:PyPI`, `sca:hygiene:gha_action_ref_drift` *(future)* |
| hypothesis | `hypothesis:taint_flow` *(future)* |
| crash | `crash:control_flow_hijack:x86_64` *(future)* |
Prefix-filter on `--consumer <prefix>` to scope a query to one consumer's data.
For codeql, the rule_id already encodes the language (`py/...`, `cpp/...`, `js/...`) so there's no separate language axis on the cell — `codeql:py/sql-injection` IS the per-language bucket.
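A small sketch of splitting and prefix-filtering decision classes, assuming the consumer is everything before the first colon and that `--consumer` is a plain string-prefix match (both helper names are hypothetical):

```python
def split_decision_class(decision_class: str) -> tuple[str, str]:
    """Split '<consumer>:<rule_or_subject>' on the first colon only."""
    consumer, sep, subject = decision_class.partition(":")
    if not sep or not consumer or not subject:
        raise ValueError(f"not a decision_class: {decision_class!r}")
    return consumer, subject


def filter_by_consumer(decision_classes: list[str], prefix: str) -> list[str]:
    """Keep the decision classes that start with the given prefix."""
    return [dc for dc in decision_classes if dc.startswith(prefix)]
```

Splitting on the first colon only keeps multi-part subjects such as `major_bump:PyPI` intact.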
## Interpretation rules — apply when answering questions
When the user asks anything about a cell or model, follow these rules. Don't draw conclusions outside what the rules permit; say so explicitly when data is thin.
- **n < 10 → learning mode.** Don't claim the model is "good" or "bad" at this decision_class. The Wilson upper bound is too wide. Tell the user: "still in learning mode (n=X<10) — no reliable verdict yet."
- **Wilson 95% upper-bound on miss-rate is the trust metric, not the point estimate.** A cell with 0/10 wrong has a Wilson UB of ~28% (two-sided 95%, z = 1.96), not 0%. Always report Wilson UB when comparing or claiming reliability.
- **Policy = derived, not stored:** `auto` cells get policy from Wilson + n; `force_*` cells override. Always show the policy, not just the raw counts.
- **`calls_saved` = `cheap_short_circuit.correct` count.** Each is a full-tier call avoided. Multiply by the operator's per-call cost delta to estimate $. Do not invent a $ number unless the user gave one.
- **"Trust" measures cheap-vs-full agreement, not correctness.**
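The interpretation rules above can be read as a small derivation, sketched here under stated assumptions: the policy labels (`"learning"`, `"short-circuit"`, `"fall-through"`) and both function names are hypothetical, and the Wilson upper bound is taken as an input that has already been computed.

```python
def derive_policy(n: int, wilson_ub: float, override: str = "auto",
                  min_n: int = 10, threshold: float = 0.05) -> str:
    """Derive the effective policy for one (model, decision_class) cell."""
    # force_* overrides win over anything the data says.
    if override == "force_short_circuit":
        return "short-circuit"
    if override == "force_fall_through":
        return "fall-through"
    # Too few observations: no reliable verdict yet.
    if n < min_n:
        return "learning"
    return "short-circuit" if wilson_ub <= threshold else "fall-through"


def estimated_savings(calls_saved: int, cost_delta_per_call: float) -> float:
    """calls_saved times the operator-supplied per-call cost delta."""
    return calls_saved * cost_delta_per_call
```

`estimated_savings` deliberately has no default cost: the per-call delta has to come from the operator, never from a guess.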