h-verify
h-verify enforces the FPF verification loop (baseline, measure, evidence, record) by identifying a decision, reading its predictions, optionally baselining affected files for drift detection, gathering evidence for each prediction through tests or metrics, and attaching verdict-tagged evidence artifacts. Use it when you need to validate whether previously declared predictions still hold, detect file drift, or surface stale decisions past their valid_until date.
Install in Claude Code
git clone --depth 1 https://github.com/m0n0x41d/haft /tmp/h-verify && cp -r /tmp/h-verify/internal/cli/skill/h-verify ~/.claude/skills/h-verify

Then start a new Claude Code session; the skill loads automatically.
Definition
SKILL.md
# h-verify: Verify a decision still holds

You are running the FPF verification loop: baseline → measure → evidence → record. Drift detection compares current state against baselined affected_files; evidence decay reports surface when valid_until passes; measure verdict is recorded for the predictions the decide step declared.

## Step 1: Identify the decision

If `decision_ref` is given, use it. Otherwise:

- `mcp__haft__haft_query(action="status")`: surfaces stale/refresh-due decisions
- `mcp__haft__haft_query(action="list", kind="DecisionRecord")`: full list
- Ask the operator which decision to verify

## Step 2: Read the decision's predictions

`mcp__haft__haft_query(action="search", query="<decision_ref>")` returns the DecisionRecord including its `predictions` field. Each prediction has:

- `claim`: the falsifiable statement
- `observable`: what to measure
- `threshold`: pass/fail boundary
- `verify_after`: when async evidence should be available (if any)

If predictions are empty (the decision was recorded tactical with `_skips: ["predictions"]`), there's nothing to measure. Report that to operator and recommend either:

- `/h-refresh` action=reopen to add predictions and re-decide properly
- Just attach evidence directly via `haft_decision(action="evidence", ...)`

## Step 3: Baseline (if drift detection wanted)

If the decision has `affected_files` and you want drift comparison:

```
mcp__haft__haft_decision(
  action="baseline",
  decision_ref="<dec-...>"
  // affected_files optional; kernel uses the decision's list
)
```

The kernel snapshots file content hashes. Subsequent comparisons detect drift. Call once after each commit cycle if you want continuous drift signal.
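To make the hash-based drift check in Step 3 concrete, here is a minimal Python sketch of the idea. It is not the kernel's implementation: the choice of SHA-256 and the helper names `snapshot` and `drifted` are assumptions made for illustration only.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 hex digest of its content (None if missing)."""
    hashes = {}
    for p in paths:
        path = Path(p)
        hashes[str(p)] = (
            hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None
        )
    return hashes


def drifted(baseline, current):
    """Return the sorted paths whose current hash differs from the baseline."""
    return sorted(p for p in baseline if baseline[p] != current.get(p))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "config.yaml"  # hypothetical affected file
        target.write_text("retries: 3\n")
        baseline = snapshot([target])       # taken at baseline time
        target.write_text("retries: 5\n")   # later edit to the affected file
        # Prints a one-element list holding the changed path.
        print(drifted(baseline, snapshot([target])))
```

A deleted file also counts as drift here, because its hash becomes `None` and no longer matches the baseline. Whether a change is cosmetic, incidental, or material still needs the classification described in Step 7.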
## Step 4: Gather evidence per prediction

For each prediction:

- Run the observable (test, metric query, log scan, code grep)
- Compare to threshold
- Capture the actual measurement value

Tools available depending on the observable:

- `Bash` for test runners, metric queries, log scans
- `Read` / `Grep` / `Glob` for code-level invariant checks
- For external metrics: kernel has no special integration; agent describes the source

## Step 5: Attach evidence to the artifact

For each material evidence item:

```
mcp__haft__haft_decision(
  action="evidence",
  artifact_ref="<dec-...>",
  evidence_type="measurement | test | research | benchmark | audit",
  evidence_content="<what you observed, with concrete numbers>",
  evidence_verdict="supports | weakens | refutes",
  carrier_ref="<file path or URL where the evidence lives>",
  claim_refs=["<prediction id or scope label>"],
  congruence_level=3,  // 3=same context, 2=similar, 1=different, 0=opposed
  valid_until="<RFC3339 or YYYY-MM-DD; when this evidence expires>",
  causal_support_basis="observational | interventional | realized_counterfactual | identified_estimate | simulation_only"
)
```

**congruence_level** (CL) defaults per FPF B.3.5:

- 3: same-context evidence (own production system, own tests)
- 2: similar-context (related project, similar load)
- 1: different-context (external docs, vendor benchmarks)
- 0: opposed-context (rare; conflicting framework)

CL impacts R_eff per FPF B.3:3; never average across CL.
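Because the evidence call in Step 5 takes several enumerated fields, it can help to check the arguments before sending them. The sketch below is a hypothetical client-side helper, not part of haft: `build_evidence_args` and `_check_valid_until` are invented names, the allowed values are copied from the call template above, and the date check only approximates RFC3339 through Python's ISO 8601 parser.

```python
from datetime import date, datetime

# Allowed values copied from the Step 5 call template.
EVIDENCE_TYPES = {"measurement", "test", "research", "benchmark", "audit"}
EVIDENCE_VERDICTS = {"supports", "weakens", "refutes"}
CAUSAL_BASES = {
    "observational",
    "interventional",
    "realized_counterfactual",
    "identified_estimate",
    "simulation_only",
}


def _check_valid_until(value):
    """Accept YYYY-MM-DD or an RFC3339-style timestamp; raise ValueError otherwise."""
    if len(value) == 10:
        date.fromisoformat(value)
    else:
        # Older Python versions do not parse a trailing "Z", so normalize it.
        datetime.fromisoformat(value.replace("Z", "+00:00"))


def build_evidence_args(*, artifact_ref, evidence_type, evidence_content,
                        evidence_verdict, carrier_ref, claim_refs,
                        congruence_level, valid_until, causal_support_basis):
    """Validate the enumerated fields and return the argument dict for the call."""
    if evidence_type not in EVIDENCE_TYPES:
        raise ValueError(f"unknown evidence_type: {evidence_type!r}")
    if evidence_verdict not in EVIDENCE_VERDICTS:
        raise ValueError(f"unknown evidence_verdict: {evidence_verdict!r}")
    if causal_support_basis not in CAUSAL_BASES:
        raise ValueError(f"unknown causal_support_basis: {causal_support_basis!r}")
    if congruence_level not in (0, 1, 2, 3):
        raise ValueError("congruence_level must be 0, 1, 2, or 3")
    _check_valid_until(valid_until)
    return {
        "action": "evidence",
        "artifact_ref": artifact_ref,
        "evidence_type": evidence_type,
        "evidence_content": evidence_content,
        "evidence_verdict": evidence_verdict,
        "carrier_ref": carrier_ref,
        "claim_refs": list(claim_refs),
        "congruence_level": congruence_level,
        "valid_until": valid_until,
        "causal_support_basis": causal_support_basis,
    }
```

Catching a mistyped verdict or congruence level locally is cheaper than discovering it after the tool call; the kernel remains the authority on what it actually accepts.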
## Step 6: Record the measurement verdict

After all evidence is attached, record the overall verdict:

```
mcp__haft__haft_decision(
  action="measure",
  decision_ref="<dec-...>",
  verdict="accepted | partial | failed",
  findings="<what actually happened compared to predictions>",
  measurements=["p99 latency: 42ms (predicted <50ms; accepted)", "..."],
  criteria_met=["<criterion that was met>"],
  criteria_not_met=["<criterion that was NOT met>"]
)
```

Kernel ties the verdict back to the predictions and surfaces:

- Accepted → decision health remains good
- Partial → some predictions held, some didn't → consider reopen or supersede
- Failed → decision invalidated → consider supersede or rollback per the decision's rollback spec

If any verified prediction carried a `probability` forecast (set at `/h-decide`), the measure response also appends a **Calibration** read: the decomposed-Brier profile (Brier = reliability − resolution + uncertainty) over all verified forecasts, plus a directional over/under-confidence bias. Below ~15 accumulated forecasts it reports cold-start and is not yet actionable; surface it to the operator but do not over-read a sparse profile.
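For readers unfamiliar with the decomposition named in the Calibration read, the following Python sketch computes the standard Murphy decomposition of the Brier score. It is an illustration under assumptions, not the kernel's code: forecasts are grouped by exact probability value, the bias is taken as mean forecast minus observed frequency, and the cold-start cutoff of 15 simply mirrors the "~15" mentioned above. The function name `brier_decomposition` is hypothetical.

```python
from collections import defaultdict

COLD_START_MIN = 15  # assumption: mirrors the "~15 forecasts" guidance


def brier_decomposition(forecasts, outcomes):
    """Return Brier score and its reliability/resolution/uncertainty terms.

    forecasts: probabilities in [0, 1]; outcomes: 0 or 1, same length.
    """
    n = len(forecasts)
    if n == 0 or n != len(outcomes):
        raise ValueError("need equally sized, non-empty forecasts and outcomes")

    base_rate = sum(outcomes) / n

    # Group outcomes by distinct forecast value so the identity holds exactly.
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0
    resolution = 0.0
    for f, obs in groups.items():
        hit_rate = sum(obs) / len(obs)
        reliability += len(obs) * (f - hit_rate) ** 2
        resolution += len(obs) * (hit_rate - base_rate) ** 2
    reliability /= n
    resolution /= n

    return {
        "brier": sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n,
        "reliability": reliability,   # lower is better calibrated
        "resolution": resolution,     # higher means forecasts discriminate
        "uncertainty": base_rate * (1 - base_rate),
        "bias": sum(forecasts) / n - base_rate,  # > 0: forecasts ran high
        "cold_start": n < COLD_START_MIN,
    }


if __name__ == "__main__":
    profile = brier_decomposition([0.9, 0.9, 0.7, 0.7, 0.2], [1, 1, 1, 0, 0])
    for name, value in profile.items():
        print(name, value)
```

With only five forecasts the example profile is flagged as cold-start, which matches the guidance above: report it, but do not draw conclusions from it.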
## Step 7: Handle stale or drifted decisions

If verification reveals:

- **Evidence decayed** (valid_until passed): `mcp__haft__haft_refresh(action="waive", artifact_ref=..., evidence="<new evidence>", new_valid_until="...")` to extend validity, OR `action=reopen` to start a new problem cycle
- **Drift detected** (affected_files changed since baseline): classify drift as cosmetic / incidental / material via `haft_query(action="status")` and decide whether to re-baseline or reopen
- **Verdict failed**: `mcp__haft__haft_refresh(action="supersede", artifact_ref=<old>, new_artifact_ref=<replacement>)` after recording the replacement decision via `/h-decide`

## Step 8: Present to operator

Surface:

- Predictions vs actual measurements
- Verdict (accepted / partial / failed)
- Evidence attached with CL
- Drift status if baseline existed
- Recommended next action (waive / reopen / supersede / nothing: decision still good)

**Re-grounding discipline (FPF A.7).** When you reference decision IDs (`dec-20260525-...`), prediction labels, or evidence refs in the verdict summary and recommendation paragraphs, pair each with its human-readable title or claim: `dec-20260525-abc (NATS over Kafka for ops simplicity): verdict accepted`, not bare `dec-20260525-abc verdict accepted`. Bare IDs accumulate cognitive debt across long sessions. Keep IDs for traceability but never let them stand alone in summaries. See CLAUDE.md Critical Reminders for the project-wide rule.

## What NOT to do

- Do NOT call `action="measure"` without first gathering evidence; kernel rejects measure-from-m