docs-grounding-verifier
docs-grounding-verifier audits specific documentation pages for factual accuracy by evaluating individual claims against source code at the claim level, returning verdicts (GROUNDED, PARTIAL, CONTRADICTED, UNSUPPORTED) with evidence citations. Use this when verifying particular pages for drift from implementation or confirming prose changes match actual code, not for sweeping entire documentation corpora or triaging PR diffs.
Install in Claude Code
`git clone --depth 1 https://github.com/microsoft/apm /tmp/docs-grounding-verifier && cp -r /tmp/docs-grounding-verifier/.apm/skills/docs-grounding-verifier ~/.claude/skills/docs-grounding-verifier`
Then open a new Claude Code session; the skill loads automatically.
Definition
SKILL.md
# docs-grounding-verifier
CLAIM-LEVEL grounding verification. Adapts the RAGAS faithfulness-eval
pattern (proven in RAG literature) to docs/code instead of generated-
answers/retrieved-context. Source code is the ground truth; docs
paragraphs are the candidate text under audit.
[python-architect persona](../../personas/python-architect.persona.md)
[doc-writer persona](../../personas/doc-writer.persona.md)
## Sibling contract
This skill is a SIBLING of `docs-corpus-audit` and `docs-sync`. The
boundary is load-bearing:
| Skill | Trigger | Scope | Granularity |
| ---------------------- | -------------------------------------- | ---------------- | --------------- |
| docs-sync | PR opened/synchronized | PR diff only | Page-level |
| docs-corpus-audit | Maintainer asks for whole-corpus pass | Entire corpus | Page-level |
| **docs-grounding-verifier** | **Verify specific pages factually** | **1..N pages** | **CLAIM-level** |
`docs-corpus-audit` invokes this skill in its VERIFY phase on the
highest-risk pages of each wave. `docs-sync` can invoke it on the
specific pages in a PR diff. The skill is also runnable standalone.
## When to activate
- Maintainer says "verify <page> against the code".
- An audit wave wants per-claim grounding scores for its highest-risk pages.
- A PR review wants to confirm that prose changes are not just plausible
but actually consistent with the implementation.
- A "fact-check" or "grounding" or "drift hunt" request.
## When NOT to activate
- Whole-corpus sweep with no specific page list -> use `docs-corpus-audit`.
- PR review with mixed code+docs diff -> use `docs-sync`.
- Editorial / tone review -> use `editorial-owner` persona directly.
## Architecture (PIPELINE-of-PANELS)
```
PARENT
  -> [Stage 1: EXTRACT claims, fan-out PANEL]
       per page -> LLM extracts atomic factual claims as JSON
       script: scripts/extract-claims.py
  -> [Stage 2: RETRIEVE evidence, deterministic S7]
       per claim -> grep over src/ via keywords + hints
       script: scripts/retrieve-evidence.sh (NO LLM)
  -> [Stage 3: JUDGE grounding, adversarial A7]
       per (claim, evidence) -> LLM rules GROUNDED|PARTIAL|CONTRADICTED|UNSUPPORTED
       asset: assets/judge-prompt.md
  -> [Stage 4: SYNTHESIZE]
       aggregate ungrounded -> doc-writer for fix
       re-verify after fix (A8 ALIGNMENT LOOP)
```
Stage 2 is the load-bearing design choice: evidence retrieval is
DETERMINISTIC (grep + AST hints), not LLM. The judge in Stage 3 can
only rule on evidence it actually receives -- it cannot hallucinate
support that the retriever did not find. This is the structural
guard against the failure mode "the LLM convinces itself the docs
match the code."
## Phase 1: SCOPE
Input: list of page paths to verify (1..N). If a `risk_class` is
attached (e.g. "high-stakes"), prefer it; otherwise treat all as equal.
Out-of-scope:
- Pages outside `docs/src/content/docs/` or
`packages/apm-guide/.apm/skills/apm-usage/`.
- Pages with no factual claims (pure editorial / landing). Skip
rather than force-extract.
## Phase 2: EXTRACT (parallel)
For each page, dispatch ONE claim-extractor agent:
- Prompt template: `scripts/extract-claims.py <page>` produces the
prompt and embeds the page content.
- Returns: JSON `{"page", "claims":[{"id","text","section","keywords",
"expected_source_areas"}]}` capped at 15 claims per page.
Parallel safe; no shared state between extractors.
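The documented return shape can be checked before a page's claims move on to retrieval. The following is a minimal sketch, not the skill's own code: the function name `parse_extractor_output` is hypothetical, and truncating an over-long claim list (rather than rejecting it) is an assumption.

```python
import json

# Keys taken from the return shape documented above.
REQUIRED_CLAIM_KEYS = {"id", "text", "section", "keywords", "expected_source_areas"}
MAX_CLAIMS_PER_PAGE = 15  # cap stated in this phase


def parse_extractor_output(raw: str) -> dict:
    """Parse one extractor reply and enforce the documented shape (hypothetical helper)."""
    payload = json.loads(raw)
    if "page" not in payload or "claims" not in payload:
        raise ValueError("payload must contain 'page' and 'claims'")
    for claim in payload["claims"]:
        missing = REQUIRED_CLAIM_KEYS - claim.keys()
        if missing:
            raise ValueError(f"claim {claim.get('id', '?')} is missing {sorted(missing)}")
    # Assumption: an over-long list is truncated to the cap, not rejected.
    payload["claims"] = payload["claims"][:MAX_CLAIMS_PER_PAGE]
    return payload
```

Validating at this boundary keeps malformed extractor output from reaching the retriever, where a missing `keywords` field would otherwise look like an empty search.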
## Phase 3: RETRIEVE (deterministic, batched)
For each claim, pipe to `scripts/retrieve-evidence.sh`:
- Uses keywords + expected_source_areas to grep src/.
- Returns one-line JSON: `{"claim_id","claim_text","evidence":[...],
"evidence_count"}`.
Sequential is fine (grep is fast). No LLM. Diagnostics on stderr,
data on stdout.
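To illustrate what deterministic retrieval means here, the sketch below does a plain keyword scan in Python. It is a stand-in for illustration only, not the contents of `scripts/retrieve-evidence.sh`. The evidence item fields (`file`, `line`, `text`), the `max_hits` cap, and resolving `expected_source_areas` relative to the source root are all assumptions.

```python
import json
import sys
from itertools import islice
from pathlib import Path


def iter_hits(roots, keywords):
    """Yield one record per source line that contains any keyword."""
    for base in roots:
        for path in sorted(base.rglob("*")):  # sorted -> stable output order
            if not path.is_file():
                continue
            try:
                lines = path.read_text(encoding="utf-8").splitlines()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(lines, start=1):
                if any(k in line.lower() for k in keywords):
                    # Assumed evidence shape; the real script may differ.
                    yield {"file": str(path), "line": lineno, "text": line.strip()}


def retrieve_evidence(claim: dict, src_root: str = "src", max_hits: int = 20) -> dict:
    """Case-insensitive keyword scan; no LLM is involved at any point."""
    keywords = [k.lower() for k in claim.get("keywords", []) if k]
    root = Path(src_root)
    # Assumption: hinted areas are sub-directories of src_root; fall back to the root.
    hinted = [root / area for area in claim.get("expected_source_areas", [])]
    roots = [p for p in hinted if p.is_dir()] or [root]
    evidence = list(islice(iter_hits(roots, keywords), max_hits))
    return {
        "claim_id": claim["id"],
        "claim_text": claim["text"],
        "evidence": evidence,
        "evidence_count": len(evidence),
    }


def emit(result: dict) -> None:
    """Diagnostics go to stderr, the one-line JSON record to stdout."""
    print(f"claim {result['claim_id']}: {result['evidence_count']} hit(s)", file=sys.stderr)
    print(json.dumps(result))
```

Because the scan is a pure function of the claim and the source tree, two runs on the same commit return the same evidence, so the judge's input is reproducible.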
## Phase 4: JUDGE (parallel)
For each (claim, evidence) tuple, dispatch ONE grounding-judge agent:
- Load `assets/judge-prompt.md`.
- Send the prompt + the tuple.
- Returns: JSON verdict per the schema in `judge-prompt.md`.
Batching across claims-of-one-page into a single judge call is fine
(prompt with all tuples at once). Across pages, fan out.
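The batching rule (one judge call per page, pages in parallel) could be sketched as follows. `call_judge` is a hypothetical callable standing in for whatever dispatches the grounding-judge agent. The record fields `page`, `claim`, and `evidence`, and the thread-pool fan-out, are assumptions made for illustration.

```python
import json
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def batch_by_page(records):
    """Group (claim, evidence) tuples so each page gets a single judge call."""
    batches = defaultdict(list)
    for item in records:
        batches[item["page"]].append({"claim": item["claim"], "evidence": item["evidence"]})
    return dict(batches)


def judge_all(records, judge_prompt, call_judge, max_workers=4):
    """Fan out across pages; within a page, send all tuples in one message."""
    batches = batch_by_page(records)

    def run(page):
        # The prompt text comes first, then the page's tuples as JSON.
        message = judge_prompt + "\n\n" + json.dumps(batches[page])
        return page, call_judge(message)  # call_judge is hypothetical

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, batches))
```

Here `judge_prompt` would be the loaded text of `assets/judge-prompt.md`, and the result maps each page to whatever `call_judge` returned for it.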
## Phase 5: SYNTHESIZE
Aggregate verdicts. Materialize the report:
```
{
  "summary": {
    "pages_verified": N,
    "claims_total": N,
    "grounded": N, "partial": N, "contradicted": N, "unsupported": N,
    "grounding_rate": N/total
  },
  "actionable": [
    {"page", "claim", "verdict", "evidence_cited", "fix_suggestion"}
  ]
}
```
CONTRADICTED and PARTIAL verdicts are doc-writer work items. UNSUPPORTED
is handled in two steps: if `retrieval_fix_suggestion` is plausible, retry
retrieval with the suggested keywords; if the retry still returns no
evidence, treat the claim as CONTRADICTED.
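The summary block of the report can be computed from the per-page verdicts roughly like this. The sketch assumes that `grounding_rate` means GROUNDED claims divided by all judged claims, and that each verdict record carries a `verdict` field; the function names are hypothetical.

```python
from collections import Counter

VERDICTS = ("GROUNDED", "PARTIAL", "CONTRADICTED", "UNSUPPORTED")


def summarize(verdicts_by_page: dict) -> dict:
    """Build the 'summary' block from {page: [verdict records]}."""
    counts = Counter()
    for records in verdicts_by_page.values():
        counts.update(r["verdict"] for r in records)
    total = sum(counts[v] for v in VERDICTS)
    return {
        "pages_verified": len(verdicts_by_page),
        "claims_total": total,
        "grounded": counts["GROUNDED"],
        "partial": counts["PARTIAL"],
        "contradicted": counts["CONTRADICTED"],
        "unsupported": counts["UNSUPPORTED"],
        # Assumption: rate = grounded / total, defined as 0.0 for an empty run.
        "grounding_rate": counts["GROUNDED"] / total if total else 0.0,
    }


def doc_writer_items(verdicts_by_page: dict) -> list:
    """Collect the PARTIAL and CONTRADICTED records as doc-writer work items."""
    return [
        {"page": page, **record}
        for page, records in verdicts_by_page.items()
        for record in records
        if record["verdict"] in ("PARTIAL", "CONTRADICTED")
    ]
```

UNSUPPORTED records are left out of `doc_writer_items` on purpose: under the rule above they first go back through retrieval, and only join the work items once they have been reclassified.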
## Phase 6: ALIGNMENT LOOP (A8)
Hand actionable items to doc-writer (one subagent per page). After
edits, RE-RUN the pipeline on the same pages. The grounding_rate
must MONOTONICALLY INCREASE between iterations or the loop has
diverged -- stop and escalate to the operator.
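The divergence check can be sketched as a small driver loop. `run_pipeline` and `apply_fixes` are hypothetical callables (the first returns the current grounding_rate, the second hands actionable items to doc-writer). The sketch reads "monotonically increase" as strictly increasing, which is an interpretation, and the `max_iterations` bound is an added assumption.

```python
class AlignmentDiverged(RuntimeError):
    """Raised when an iteration fails to improve the grounding rate."""


def alignment_loop(run_pipeline, apply_fixes, target=0.9, max_iterations=5):
    """Re-run the pipeline after each round of fixes; return the rate history."""
    rate = run_pipeline()
    history = [rate]
    iterations = 0
    while rate < target and iterations < max_iterations:
        apply_fixes()              # doc-writer edits the pages
        new_rate = run_pipeline()  # re-verify the same pages
        history.append(new_rate)
        if new_rate <= rate:
            # No improvement: stop and escalate instead of looping further.
            raise AlignmentDiverged(f"grounding_rate went {rate} -> {new_rate}")
        rate = new_rate
        iterations += 1
    return history
```

A caller would catch `AlignmentDiverged` and surface the history to the operator. A history that ends below `target` without raising means the iteration bound was reached first.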
## Ship gate
- grounding_rate >= 0.9 on each verified page after the alignment loop.
- Every CONTRADICTED claim cites a specific code file:line that
disproves it -- not a vague "the code doesn't say that".
- The eval-runner (see `evals/`) passes on the trigger evals and
the content evals before the skill is treated as production-ready.
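The first two gate conditions lend themselves to a mechanical check. In the sketch below the per-page report shape (`grounding_rate` plus a `verdicts` list) and the idea that `evidence_cited` is a string containing a `path:line` reference are assumptions. The third condition, the eval-runner passing, is outside its scope.

```python
import re

# Matches references such as "some/path.py:42" (assumed citation format).
FILE_LINE = re.compile(r"[\w./-]+:\d+")


def ship_gate(page_reports: dict, threshold: float = 0.9) -> list:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for page, report in page_reports.items():
        if report["grounding_rate"] < threshold:
            failures.append(f"{page}: grounding_rate {report['grounding_rate']} < {threshold}")
        for record in report["verdicts"]:
            if record["verdict"] != "CONTRADICTED":
                continue
            if not FILE_LINE.search(record.get("evidence_cited", "")):
                failures.append(f"{page}: CONTRADICTED claim lacks a file:line citation")
    return failures
```

Returning the failures instead of a bare boolean lets the operator see which page or claim blocked the gate.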
## Bundled assets
- `scripts/extract-claims.py` -- Stage 1 prompt builder. `--help`, `--schema`.
- `scripts/retrieve-evidence.sh` -- Stage 2 retriever. Deterministic. `--help`.
- `scripts/verify-page.sh` -- end-to-end orchestrator. `--help`.
- `assets/judge-prompt.md` -- Stage 3 adversarial judge prompt.
- `evals/trigger-evals.json` -- 20 dispatch queries (10 that should trigger the skill, 10 that should not).
- `evals/content-evals.json` -- seeded-drift recall scenarios.
- `evals/run-evals.sh` -- the eval-runner that turns JSON into metrics.
## Failure modes guarded against
- **Hallucinated grounding**: Stage 2 i