mantis-agentic
The `/agentic` command executes a fully autonomous security analysis workflow that scans code with static analysis tools, deduplicates findings, validates each discovery through multi-stage exploitation analysis, optionally gathers consensus from multiple models, generates exploit proofs-of-concept and secure patches, and groups related vulnerabilities by structural patterns. Use it for comprehensive security reviews where findings are analyzed in depth and documented with actionable remediation guidance, optionally enriched with architectural mapping and post-analysis validation.
mkdir -p ~/.claude/commands && curl -fsSL https://raw.githubusercontent.com/deonmenezes/mantishack/HEAD/.claude/commands/mantis-agentic.md -o ~/.claude/commands/mantis-agentic.md
mantis-agentic.md
# /agentic - MANTISHACK Full Autonomous Workflow

🤖 **AGENTIC MODE** - This will autonomously:

1. Scan code with Semgrep/CodeQL (parallel)
2. Deduplicate findings
2.5. **Run auth + logging audit** automatically (JWT misuse, cookie security, audit-log coverage) - rules tagged `mantis_capability: auth-audit`; see `/mantis-auth-audit` for the standalone command
3. Prep findings (read code, extract dataflow)
4. **Validate + analyse** each finding (exploitation-validator methodology, Stages A-D)
5. **Self-review**: catch contradictions, retry low confidence (Stage F)
6. **Consensus**: multi-model second opinion (if `--consensus`)
7. **Judge**: non-blind review of primary reasoning (if `--judge`)
8. **Aggregate**: synthesize multi-model results for downstream use (if `--aggregate`)
9. **Generate exploit PoCs** for exploitable findings
10. **Generate secure patches** for confirmed vulnerabilities
11. **Cross-finding analysis** (structural grouping, shared root causes)

Nothing will be applied to your code - only generated in the out/ directory.

Execute: `libexec/mantishack-agentic --repo <path>`

## Optional enrichment flags

By default, `/agentic` scans and analyses findings in isolation. Two optional flags add richer context for more thorough results. They are opt-in because they add time and cost, but if you are doing a proper security review rather than a quick scan, they are well worth it.

| Flag | What it does |
|------|-------------|
| `--understand` | Runs `/understand --map` as a proper sibling run, producing `context-map.json` (entry points, trust boundaries, sinks). Two consumers: (a) the agentic checklist gets priority markers, so per-finding analysis prompts say things like *"Architectural role: entry_point"* - improving in-run analysis; (b) any `/validate` against the same target - including this run's `--validate` post-pass - picks the map up via the bridge. |
| `--validate` | After the agentic pipeline completes, runs `/validate` on findings flagged `is_exploitable: true` or `confidence: "high"`. Creates a sibling validate run; the bridge auto-discovers any `/understand` sibling produced by `--understand`. |

You can use either flag on its own or combine them:

```
# Recommended for thorough reviews - pair both flags
/agentic --understand --validate

# Just enrich this run's analysis with architectural priority markers
/agentic --understand

# Just validate the findings that look exploitable (no pre-mapping)
/agentic --validate
```

Pass both flags straight through to `libexec/mantishack-agentic`. The Python layer owns all orchestration and selection logic; you don't need to filter findings or invoke other skills yourself.

## How analysis works

Findings are dispatched for parallel analysis via one of two paths:

- **Claude Code on PATH**: dispatches `claude -p` sub-agents (separate processes)
- **External LLM configured**: dispatches via `generate_structured()` API calls
- **Both available**: uses external LLM, falls back to Claude Code if it fails

Model roles determine which model analyses (analysis), writes code (code), provides second opinions (consensus), reviews reasoning (judge), and synthesizes multi-model output for downstream use (aggregate). See the "Multi-model analysis" section below.

If **neither** is available, the pipeline produces prep-only output. In that case, **YOU (Claude Code) are the LLM** - the user may ask you to analyse the findings directly in conversation. See the prep_only report mode below for instructions.

Analysis follows the exploitation-validator methodology (Stages A-D):

- **Stage A**: One-shot verification - is the vulnerability pattern real?
- **Stage B**: Attack path analysis - what are the preconditions and blockers?
- **Stage C**: Sanity check - does the code match? is the flow real? is it reachable?
- **Stage D**: Ruling - test code? unrealistic preconditions? hedging?
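
The dispatch rules listed at the start of this section (external LLM preferred, Claude Code as the sole path or the fallback, prep-only when neither exists) can be pictured as a small decision function. This is a minimal sketch and not the project's actual code: `dispatch_order`, `detect_claude_on_path`, and the labels `external_llm` and `claude_code` are hypothetical names chosen for illustration.

```python
import shutil
from typing import List


def detect_claude_on_path() -> bool:
    """Report whether a `claude` executable can be found on PATH."""
    # shutil.which returns None when the executable is not found.
    return shutil.which("claude") is not None


def dispatch_order(claude_on_path: bool, external_llm_configured: bool) -> List[str]:
    """Return the dispatch paths to try, in order.

    An empty list stands for the prep-only case, where no analysis
    backend is available. The labels are hypothetical.
    """
    order: List[str] = []
    if external_llm_configured:
        # Preferred whenever it is configured.
        order.append("external_llm")
    if claude_on_path:
        # Either the only path, or the fallback if the external LLM fails.
        order.append("claude_code")
    return order


if __name__ == "__main__":
    print(dispatch_order(detect_claude_on_path(), external_llm_configured=False))
```

Returning an ordered list rather than a single choice keeps the fallback behaviour explicit: a caller would try each entry in turn and treat an empty list as the prep-only mode.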
If `--binary` is provided, Stage E (binary feasibility analysis) runs before scanning and its results (chain_breaks, mitigations) are included in each finding's analysis prompt.

The dispatch pipeline runs these tasks in sequence:

1. **AnalysisTask** - Stages A-D per finding (validation + analysis in one call)
2. **CrossFamilyCheckTask** - re-check suspicious responses via a different model family
3. **RetryTask** - Stage F: self-consistency check, retry contradictions + low confidence
4. **ConsensusTask** - blind second model votes on true positives (if `--consensus`)
5. **JudgeTask** - non-blind review of primary reasoning (if `--judge`)
6. **Correlation** - multi-model agreement matrix + confidence signals (if 2+ `--model`)
7. **AggregationTask** - final synthesis into `aggregation.json`, consumed by `agentic-report.md` (if `--aggregate`)
8. **ExploitTask** - PoCs for final-verdict exploitable findings
9. **PatchTask** - secure fixes for exploitable findings
10. **GroupAnalysisTask** - cross-finding patterns (shared root cause, attack chaining)

Cost tracking is real-time with adaptive budget cutoff.

## Multi-model analysis

By default, the primary model is auto-detected from `~/.config/mantishack/models.json` or API key env vars (GEMINI_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY). Use `--model` to override.

`--model` is repeatable. Multiple models each independently analyse every finding (Stages A-D), then results are correlated - agreement matrix, confidence signals, clusters, unique insights. With 3+ analysis models, `--consensus` is auto-skipped (redundant).

| Flag | Role | What it does |
|------|------|-------------|
| `--model MODEL` (repeatable) | Analysis | Each model independently analyses every finding. Multiple = multi-model correlation. |
| `--consensus MODEL` | Blind second opinion | Re-analyses each finding independently (doesn't see the primary verdict). Majority vote decides the final ruling. Auto-skipped with 3+ `--model`. |
| `--judge MODEL` | Non-blind review | Sees the primary analysis reasoning and critiques it. Flags missed attack paths
Use this agent when the target is a LIVE REST or GraphQL API you are authorized to test and the question is "can I tamper request bodies, headers, ids, and tokens to read or act on data that isn't mine?" - active, request-driven abuse of the API contract, not static code review. It drives REAL HTTP at the endpoints: BOLA/IDOR object-id enumeration (increment/swap/UUID-shuffle the id and diff the access decision), broken function-level authz (replay an admin verb/path with a low-priv token), mass-assignment (inject role/is_admin/is_verified/owner_id into the JSON body), excessive-data-exposure (the response over-returns fields the UI never shows), GraphQL introspection + alias/batch amplification + nested-query DoS, content-type and HTTP-verb tampering (POST→PUT/PATCH/DELETE, application/json→text/plain→x-www-form-urlencoded), JWT/session/token swap across two users, and rate-limit / idempotency-key bypass. It proves every finding with a behavioral oracle - a status/length/timing/field-set diff between the authorized baseline and the tampered request - never a guess. Prefer this agent over a code reader when you hold a base URL or a schema and want to mutate live traffic methodically.

<example>
Context: The user has a running API with numeric resource ids and two test accounts.
user: "Here's our staging API at https://api.staging.acme.test and tokens for user A and user B - can user A read user B's orders?"
assistant: "That's textbook BOLA: same endpoint, swap the object id (or the bearer token) and diff the access decision. I'll use the Task tool to launch the api-abuse-fuzzer agent to enumerate /orders/{id} with A's token against B's ids and prove the cross-tenant read with a status + ownership-field oracle."
<agent_launch>
Delegating to api-abuse-fuzzer: a live authorized API + two tokens + object-id enumeration is its core BOLA/IDOR mission.
</agent_launch>
</example>

<example>
Context: The user exposes a GraphQL endpoint and isn't sure introspection or query batching is locked down.
user: "Our /graphql is behind auth but I want to know if a low-priv user can pull admin fields, brute force via aliases, or knock it over with a deep nested query."
assistant: "GraphQL abuse surface: introspect the schema, alias-batch a login/lookup to bypass per-request rate limits, and send a bounded cyclic nested query as a timing oracle. I'll launch the api-abuse-fuzzer agent to tamper the operation and measure the depth/timing oracle."
<agent_launch>
Delegating to api-abuse-fuzzer for GraphQL introspection, alias/batch amplification, and nested-query DoS against the live endpoint.
</agent_launch>
</example>

Proactively suggest using this agent when: a live base URL + an OpenAPI/Swagger/GraphQL schema (or a captured request) is in hand and the target is authorized in-scope; endpoints take a resource identifier in the path/query/body (/users/{id}, ?account=, {"order_id": ...}) - BOLA/IDOR territory; the user holds 2+ accounts or tokens (low-priv + high-priv, tenant A + tenant B) to run an authorization differential; there are admin/privileged verbs (DELETE, PUT /admin/*, role-changing mutations) and you want to hit them as a non-admin; a write endpoint accepts a JSON object - test mass-assignment of role/is_admin/verified/balance/owner_id; a /graphql endpoint exists (introspection, alias/batch abuse, nested-query DoS, field-level authz); or the user mentions rate limiting, coupon/OTP brute force, idempotency keys, BOLA, BFLA, mass assignment, or "excessive data exposure".
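
The behavioral oracle described above (a status/length/field-set diff between the authorized baseline and the tampered request) can be sketched as a pure comparison over two recorded responses. This is an illustrative sketch only, not the agent's implementation: `Observation`, `bola_oracle`, and the default `owner_field` name are assumptions, and no HTTP traffic is sent here.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Observation:
    """A recorded response: status code, parsed JSON body, raw body length."""
    status: int
    body: Dict[str, Any]
    length: int


def bola_oracle(
    baseline: Observation,
    tampered: Observation,
    requester_id: str,
    owner_field: str = "owner_id",  # hypothetical ownership field name
) -> Dict[str, Any]:
    """Diff the owner's baseline response against the tampered request.

    `baseline` is the response the legitimate owner receives; `tampered`
    is the response to the same object id sent with another user's token.
    """
    owner = tampered.body.get(owner_field)
    foreign_owner = owner is not None and owner != requester_id
    return {
        "status_match": tampered.status == baseline.status,
        "length_delta": tampered.length - baseline.length,
        "shared_fields": sorted(set(tampered.body) & set(baseline.body)),
        # Proven only when the call succeeds AND returns someone else's object.
        "cross_tenant_read": 200 <= tampered.status < 300 and foreign_owner,
    }
```

Requiring both a 2xx status and a foreign owner value avoids counting a 200 response that carries only an error payload or the requester's own data as a finding.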
Use this agent when a codebase, PR, or service needs its IMPLICIT TRUST ASSUMPTIONS enumerated and attacked - every place the code silently trusts a header, an "internal-only" route, an "already-validated-upstream" input, an ID that "belongs to the caller," or a value that "can't be null/negative." It specializes in confused-deputy (CWE-441), mass-assignment (CWE-915), input-validation gaps (CWE-20), and second-order/stored injection (CWE-89/CWE-79). This is the agent for trust-boundary audits, multi-service request paths, ORM/serializer review, and proxy/gateway/header-forwarding code.

<example>
Context: User has a multi-service backend where an edge proxy forwards auth context downstream.
user: "Our gateway sets X-User-Id from the JWT and the internal services read it to authorize. Can you check this?"
assistant: "This is a classic confused-deputy / trust-boundary surface - the downstream services trust a header they can't verify. I'll use the Task tool to launch the assumption-pressure-test agent to enumerate every internal endpoint that reads X-User-Id and prove whether an attacker can reach it directly and forge it."
<agent_launch>
Delegating to assumption-pressure-test: the request hinges on an implicit 'this header is trusted because the gateway set it' assumption that must be attacked.
</agent_launch>
</example>

<example>
Context: User just merged an ORM update endpoint.
user: "Here's the new profile update handler: User.objects.filter(id=request.user.id).update(**request.data)"
assistant: "That `**request.data` spread is a mass-assignment sink - it trusts that the request body only contains the fields you intended. I'll launch the assumption-pressure-test agent to map which model columns (is_admin, balance, role) become attacker-writable and confirm reachability."
<agent_launch>
Delegating to assumption-pressure-test for the CWE-915 mass-assignment and the implicit 'the body only has safe fields' assumption.
</agent_launch>
</example>

Proactively suggest using this agent when:
- Code reads request headers (X-Forwarded-For, X-User-Id, X-Real-IP, X-Internal-*, Host) for trust or authorization decisions
- A serializer/ORM uses bulk binding: `**req.body`, `Object.assign`, `ModelMapper`, `BeanUtils.copyProperties`, `update_attributes`, `params.permit!`
- Comments or names assert trust: "internal only", "already validated", "trusted", "comes from gateway", "sanitized upstream"
- Data is stored then later concatenated into SQL/HTML/shell (second-order injection)
- An endpoint takes an `id`/`uuid`/`account`/`order` param that maps to a resource (IDOR / object ownership)
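
The mass-assignment example above (`update(**request.data)`) trusts that the body holds only intended fields. A minimal sketch of how that assumption can be checked follows, using plain dictionaries so that it runs without Django. The function name `split_update_fields` and the column names `display_name` and `email` are hypothetical; `is_admin`, `balance`, and `role` are taken from the example text.

```python
from typing import Any, Dict, List, Set, Tuple


def split_update_fields(
    body: Dict[str, Any],
    model_columns: Set[str],
    allowed: Set[str],
) -> Tuple[Dict[str, Any], List[str]]:
    """Partition a request body for a bulk update.

    Returns (safe, exposed): `safe` holds only allowlisted fields, and
    `exposed` names real model columns that the body tried to set but
    that are not on the allowlist (the attacker-writable columns if the
    raw body were spread into the update).
    """
    safe = {key: value for key, value in body.items() if key in allowed}
    exposed = sorted(
        key for key in body if key in model_columns and key not in allowed
    )
    return safe, exposed


if __name__ == "__main__":
    columns = {"display_name", "email", "is_admin", "balance", "role"}
    allowlist = {"display_name", "email"}
    request_body = {"display_name": "mallory", "is_admin": True, "unknown": 1}
    safe, exposed = split_update_fields(request_body, columns, allowlist)
    print(safe)     # fields that are safe to pass to the update call
    print(exposed)  # columns the raw spread would have let the caller set
```

Passing `safe` to the update call instead of the raw body removes the implicit assumption, while a non-empty `exposed` list is the evidence that the original handler was writable beyond its intent.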
Generate gcov coverage data for a code repository.
Analyze security bugs from any C/C++ project with full root-cause tracing.
Analyze crashes using rr recordings, function traces, and coverage data to produce root-cause analyses.
Carefully review root-cause analysis reports for crashes to verify that they are correct.
Multi-stage pipeline to validate that vulnerability findings are real, reachable, and exploitable.