h-verify

h-verify enforces the FPF verification loop (baseline, measure, evidence, record). It identifies a decision, reads its predictions, optionally baselines affected files for drift detection, gathers evidence for each prediction through tests or metrics, and attaches verdict-tagged evidence artifacts. Use it when you need to validate whether previously declared predictions still hold, detect file drift, or surface stale decisions past their valid_until date.

Repository: haft

Install in Claude Code

git clone --depth 1 https://github.com/m0n0x41d/haft /tmp/h-verify && cp -r /tmp/h-verify/internal/cli/skill/h-verify ~/.claude/skills/h-verify

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# h-verify — Verify a decision still holds

You are running the FPF verification loop: baseline → measure → evidence → record. Drift detection compares the current state against the baselined `affected_files`; evidence-decay reports surface once `valid_until` has passed; the measure verdict is recorded against the predictions declared in the decide step.

## Step 1 — Identify the decision

If `decision_ref` is given, use it. Otherwise:
- `mcp__haft__haft_query(action="status")` — surfaces stale/refresh-due decisions
- `mcp__haft__haft_query(action="list", kind="DecisionRecord")` — full list
- Ask the operator which decision to verify

## Step 2 — Read the decision's predictions

`mcp__haft__haft_query(action="search", query="<decision_ref>")` returns the DecisionRecord including its `predictions` field. Each prediction has:
- `claim` — the falsifiable statement
- `observable` — what to measure
- `threshold` — pass/fail boundary
- `verify_after` — when async evidence should be available (if any)

If predictions are empty (the decision was recorded as tactical with `_skips: ["predictions"]`), there is nothing to measure. Report that to the operator and recommend either:
- `/h-refresh` with `action=reopen` to add predictions and re-decide properly
- Attaching evidence directly via `mcp__haft__haft_decision(action="evidence", ...)`
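The prediction fields listed in this step can be modelled as a small record type. The sketch below is illustrative only: it assumes the DecisionRecord arrives as a plain dictionary with a `predictions` list, which the source does not confirm, and the helper names `parse_predictions` and `next_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    claim: str                          # the falsifiable statement
    observable: str                     # what to measure
    threshold: str                      # pass/fail boundary
    verify_after: Optional[str] = None  # when async evidence should be available


def parse_predictions(record: dict) -> List[Prediction]:
    """Extract predictions from a record (assumed dict shape), ignoring extra keys."""
    parsed = []
    for item in record.get("predictions") or []:
        parsed.append(
            Prediction(
                claim=item["claim"],
                observable=item["observable"],
                threshold=item["threshold"],
                verify_after=item.get("verify_after"),
            )
        )
    return parsed


def next_step(record: dict) -> str:
    """Return 'measure' if there is something to verify, else 'report-empty'."""
    return "measure" if parse_predictions(record) else "report-empty"
```

A record with no predictions yields `report-empty`, which corresponds to reporting to the operator and offering the two options above.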

## Step 3 — Baseline (if drift detection wanted)

If the decision has `affected_files` and you want drift comparison:

```
mcp__haft__haft_decision(
  action="baseline",
  decision_ref="<dec-...>"
  // affected_files optional — kernel uses the decision's list
)
```

The kernel snapshots file content hashes. Subsequent comparisons detect drift. Call once after each commit cycle if you want a continuous drift signal.
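To make the idea of hash-based drift detection concrete, here is a minimal sketch. It is not the kernel's implementation: the choice of SHA-256, the dictionary layout, and the function names `snapshot` and `detect_drift` are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each existing file path to the SHA-256 hex digest of its content."""
    hashes = {}
    for p in paths:
        path = Path(p)
        if path.is_file():
            hashes[str(p)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


def detect_drift(baseline, current):
    """Return sorted paths whose hash changed, appeared, or disappeared."""
    all_paths = set(baseline) | set(current)
    return sorted(p for p in all_paths if baseline.get(p) != current.get(p))
```

Comparing a stored snapshot with a fresh one gives an empty list when nothing changed and the affected paths otherwise, including files deleted since the baseline.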

## Step 4 — Gather evidence per prediction

For each prediction:
- Run the observable (test, metric query, log scan, code grep)
- Compare to threshold
- Capture the actual measurement value

Tools available depending on the observable:
- `Bash` for test runners, metric queries, log scans
- `Read` / `Grep` / `Glob` for code-level invariant checks
- For external metrics: the kernel has no special integration; the agent describes the source
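The compare-and-capture part of this step can be sketched as follows. The source does not specify how thresholds are encoded, so this example assumes a numeric limit plus a comparison operator; the `check` helper is hypothetical.

```python
import operator

# Supported comparisons (an assumption; real thresholds may be free-form text).
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def check(measured: float, op: str, limit: float) -> dict:
    """Compare one measurement to its threshold and keep the raw value."""
    if op not in _OPS:
        raise ValueError(f"unsupported comparison: {op!r}")
    return {
        "measured": measured,
        "threshold": f"{op}{limit}",
        "passed": _OPS[op](measured, limit),
    }
```

Keeping the measured value alongside the pass/fail flag matters because the evidence and measure calls later ask for concrete numbers, not just a boolean.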

## Step 5 — Attach evidence to the artifact

For each material evidence item:

```
mcp__haft__haft_decision(
  action="evidence",
  artifact_ref="<dec-...>",
  evidence_type="measurement | test | research | benchmark | audit",
  evidence_content="<what you observed, with concrete numbers>",
  evidence_verdict="supports | weakens | refutes",
  carrier_ref="<file path or URL where the evidence lives>",
  claim_refs=["<prediction id or scope label>"],
  congruence_level=3,  // 3=same context, 2=similar, 1=different, 0=opposed
  valid_until="<RFC3339 or YYYY-MM-DD — when this evidence expires>",
  causal_support_basis="observational | interventional | realized_counterfactual | identified_estimate | simulation_only"
)
```

**congruence_level** (CL) defaults per FPF B.3.5:
- 3: same-context evidence (own production system, own tests)
- 2: similar-context (related project, similar load)
- 1: different-context (external docs, vendor benchmarks)
- 0: opposed-context (rare; conflicting framework)

CL impacts R_eff per FPF B.3:3 — never average across CL.
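Because the evidence call takes several enumerated fields, a pre-flight check can catch typos before the call is made. The allowed values below are copied from the call template in this step; which fields the kernel actually requires is not stated in the source, so treating `causal_support_basis` as optional is an assumption, and `validate_evidence` is a hypothetical helper.

```python
EVIDENCE_TYPES = {"measurement", "test", "research", "benchmark", "audit"}
VERDICTS = {"supports", "weakens", "refutes"}
CAUSAL_BASES = {
    "observational",
    "interventional",
    "realized_counterfactual",
    "identified_estimate",
    "simulation_only",
}


def validate_evidence(args: dict) -> list:
    """Return a list of problems; an empty list means the payload looks well-formed."""
    problems = []
    if args.get("evidence_type") not in EVIDENCE_TYPES:
        problems.append("evidence_type")
    if args.get("evidence_verdict") not in VERDICTS:
        problems.append("evidence_verdict")
    cl = args.get("congruence_level")
    if type(cl) is not int or not 0 <= cl <= 3:
        problems.append("congruence_level")
    basis = args.get("causal_support_basis")
    if basis is not None and basis not in CAUSAL_BASES:  # assumed optional
        problems.append("causal_support_basis")
    if not args.get("evidence_content"):
        problems.append("evidence_content")
    return problems
```

The function returns the names of offending fields rather than raising, so an agent can report every problem in one pass.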

## Step 6 — Record the measurement verdict

After all evidence is attached, record the overall verdict:

```
mcp__haft__haft_decision(
  action="measure",
  decision_ref="<dec-...>",
  verdict="accepted | partial | failed",
  findings="<what actually happened compared to predictions>",
  measurements=["p99 latency: 42ms (predicted <50ms — accepted)", "..."],
  criteria_met=["<criterion that was met>"],
  criteria_not_met=["<criterion that was NOT met>"]
)
```

The kernel ties the verdict back to the predictions and surfaces:
- Accepted → decision health remains good
- Partial → some predictions held, some didn't → consider reopen or supersede
- Failed → decision invalidated → consider supersede or rollback per the decision's rollback spec
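One way to derive the overall verdict from per-prediction results is sketched below. The mapping of "some held, some did not" to `partial` follows the text above; mapping "none held" to `failed` is an assumption, and the operator may judge differently when a single critical prediction fails. The `overall_verdict` helper is hypothetical.

```python
def overall_verdict(results: dict) -> dict:
    """Summarise per-prediction outcomes.

    results maps a criterion label to True (held) or False (did not hold).
    """
    if not results:
        raise ValueError("no measured predictions; nothing to record")
    met = [name for name, held in results.items() if held]
    not_met = [name for name, held in results.items() if not held]
    if not not_met:
        verdict = "accepted"
    elif met:
        verdict = "partial"
    else:
        verdict = "failed"  # assumption: no prediction held
    return {"verdict": verdict, "criteria_met": met, "criteria_not_met": not_met}
```

The returned keys mirror the `verdict`, `criteria_met`, and `criteria_not_met` arguments of the measure call, so the result can be passed along with hand-written `findings`.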

If any verified prediction carried a `probability` forecast (set at `/h-decide`),
the measure response also appends a **Calibration** read: the decomposed-Brier
profile (Brier = reliability − resolution + uncertainty) over all verified
forecasts, plus a directional over/under-confidence bias. Below ~15 accumulated
forecasts it reports cold-start and is not yet actionable — surface it to the
operator but do not over-read a sparse profile.
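For readers unfamiliar with the decomposition, the following sketch computes the standard Murphy decomposition of the Brier score, grouping forecasts by their exact probability value. It illustrates the identity quoted above and is not the kernel's code; the grouping strategy, the function name, and the exact cold-start cutoff of 15 are assumptions.

```python
from collections import defaultdict

COLD_START_MIN = 15  # the text says "~15"; the exact cutoff is an assumption


def brier_decomposition(forecasts, outcomes):
    """Return (brier, reliability, resolution, uncertainty).

    forecasts: probabilities in [0, 1]; outcomes: 0 or 1 for each forecast.
    """
    n = len(forecasts)
    if n == 0 or n != len(outcomes):
        raise ValueError("need equally many forecasts and outcomes")
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)  # forecast value -> observed outcomes
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    reliability = sum(
        len(outs) * (f - sum(outs) / len(outs)) ** 2 for f, outs in groups.items()
    ) / n
    resolution = sum(
        len(outs) * (sum(outs) / len(outs) - base_rate) ** 2 for outs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty


def is_cold_start(forecasts) -> bool:
    """True while too few forecasts have accumulated to act on the profile."""
    return len(forecasts) < COLD_START_MIN
```

Lower reliability means better calibration, and higher resolution means the forecasts separate outcomes better than the base rate does. With only a handful of forecasts, both terms are dominated by noise, which is why a cold-start profile should be surfaced but not acted on.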

## Step 7 — Handle stale or drifted decisions

If verification reveals:
- **Evidence decayed** (valid_until passed): `mcp__haft__haft_refresh(action="waive", artifact_ref=..., evidence="<new evidence>", new_valid_until="...")` to extend validity, OR `action=reopen` to start a new problem cycle
- **Drift detected** (affected_files changed since baseline): classify drift as cosmetic / incidental / material via `mcp__haft__haft_query(action="status")` and decide whether to re-baseline or reopen
- **Verdict failed**: `mcp__haft__haft_refresh(action="supersede", artifact_ref=<old>, new_artifact_ref=<replacement>)` after recording the replacement decision via `/h-decide`
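The three conditions above can be checked mechanically before choosing an action. This sketch assumes `valid_until` uses one of the two formats named in Step 5 and treats a date-only value as midnight UTC, which the source does not specify; `parse_valid_until`, `is_decayed`, and `recommend` are hypothetical helpers.

```python
from datetime import datetime, timezone


def parse_valid_until(value: str) -> datetime:
    """Parse an RFC3339 timestamp or a YYYY-MM-DD date into an aware UTC datetime."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc)


def is_decayed(valid_until: str, now: datetime) -> bool:
    """True once the evidence's validity window has passed."""
    return now >= parse_valid_until(valid_until)


def recommend(decayed: bool, drifted: bool, verdict: str) -> list:
    """List candidate next actions; the operator makes the final call."""
    actions = []
    if decayed:
        actions.append("waive or reopen")
    if drifted:
        actions.append("classify drift, then re-baseline or reopen")
    if verdict == "failed":
        actions.append("supersede")
    return actions or ["nothing"]
```

Passing `now` explicitly keeps the check deterministic and easy to test; in a live session it would typically be `datetime.now(timezone.utc)`.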

## Step 8 — Present to operator

Surface:
- Predictions vs actual measurements
- Verdict (accepted / partial / failed)
- Evidence attached with CL
- Drift status if baseline existed
- Recommended next action (waive / reopen / supersede / nothing — decision still good)

**Re-grounding discipline (FPF A.7).** When you reference decision IDs
(`dec-20260525-...`), prediction labels, or evidence refs in the verdict
summary and recommendation paragraphs, pair each with its
human-readable title or claim — `dec-20260525-abc (NATS over Kafka for
ops simplicity) — verdict accepted` not bare `dec-20260525-abc verdict
accepted`. Bare IDs accumulate cognitive debt across long sessions. Keep
IDs for traceability but never let them stand alone in summaries. See
CLAUDE.md Critical Reminders for the project-wide rule.
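The re-grounding rule is easy to enforce with a small formatter that refuses to emit a bare ID. The helper below is hypothetical and only shows the pairing of an ID with its title.

```python
def grounded(ref: str, title: str) -> str:
    """Pair an artifact ID with its human-readable title or claim."""
    if not title or not title.strip():
        raise ValueError(f"refusing to emit bare ID {ref!r}; supply a title")
    return f"{ref} ({title.strip()})"
```

For example, `grounded("dec-20260525-abc", "NATS over Kafka for ops simplicity")` produces `dec-20260525-abc (NATS over Kafka for ops simplicity)`, to which the verdict can then be appended in the summary.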

## What NOT to do

- Do NOT call `action="measure"` without first gathering evidence — kernel rejects measure-from-m