Skill · 469 repo stars · updated 10d ago

github-archive

The GitHub Archive skill queries BigQuery's immutable record of all public GitHub events since 2011 to provide forensic evidence for security investigations. Use it to verify repository activity claims, recover deleted pull requests, branches, tags, and repositories, attribute actions to specific actors, reconstruct attack timelines with definitive timestamps, and analyze patterns across repositories for supply chain reconnaissance or incident response.

Repository: mantishack

Install in Claude Code

git clone --depth 1 https://github.com/deonmenezes/mantishack /tmp/github-archive && cp -r /tmp/github-archive/.claude/skills/oss-forensics/github-archive ~/.claude/skills/github-archive

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# GitHub Archive

**Purpose**: Query immutable GitHub event history via BigQuery to obtain tamper-proof forensic evidence for security investigations.

## When to Use This Skill

- Investigating security incidents involving GitHub repositories
- Building threat actor attribution profiles
- Verifying claims about repository activity (media reports, incident reports)
- Reconstructing attack timelines with definitive timestamps
- Analyzing automation system compromises
- Detecting supply chain reconnaissance
- Cross-repository behavioral analysis
- Workflow execution verification (legitimate vs API abuse)
- Pattern-based anomaly detection
- **Recovering deleted content**: PRs, issues, branches, tags, entire repositories

GitHub Archive analysis should be your **FIRST step** in any GitHub-related security investigation. Start with the immutable record, then enrich with additional sources.

## Core Principles

**ALWAYS PREFER GitHub Archive as forensic evidence over**:
- Local git command outputs (`git log`, `git show`) - commits can be backdated/forged
- Unverified claims from articles or reports - require independent confirmation
- GitHub web interface screenshots - can be manipulated
- Single-source evidence - always cross-verify

**GitHub Archive IS your ground truth for**:
- Actor attribution (who performed actions)
- Timeline reconstruction (when events occurred)
- Event verification (what actually happened)
- Pattern analysis (behavioral fingerprinting)
- Cross-repository activity tracking
- **Deleted content recovery** (issues, PRs, tags, commit references remain in archive)
- **Repository deletion forensics** (commit SHAs persist even after repo deletion and history rewrites)

### What Persists After Deletion

**Deleted Issues & PRs**:
- Issue creation events (`IssuesEvent`) remain in archive
- Issue comments (`IssueCommentEvent`) remain accessible
- PR open/close/merge events (`PullRequestEvent`) persist
- **Forensic Value**: Recover deleted evidence of social engineering, reconnaissance, or coordination

**Deleted Tags & Branches**:
- `CreateEvent` records for tag/branch creation persist
- `DeleteEvent` records document when deletion occurred
- **Forensic Value**: Reconstruct attack staging infrastructure (e.g., malicious payload delivery tags)

**Deleted Repositories**:
- All `PushEvent` records to the repository remain queryable
- Commit SHAs are permanently recorded in archive
- Fork relationships (`ForkEvent`) survive deletion
- **Forensic Value**: Access commit metadata even after threat actor deletes evidence

**Deleted User Accounts**:
- All activity events remain attributed to deleted username
- Timeline reconstruction remains possible
- **Limitation**: Direct code access lost, but commit SHAs can be searched elsewhere
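
The tag and branch records described above can be pulled with a single query over `CreateEvent` and `DeleteEvent` rows. The following is a minimal sketch and not part of the original skill: the helper name `build_ref_history_query` is hypothetical, and it assumes the `ref` and `ref_type` payload fields that GitHub documents for these two event types. The repository name is left as an `@repo` query parameter rather than being interpolated into the SQL.

```python
import re


def build_ref_history_query(table_suffix: str) -> str:
    """Return SQL listing branch/tag creations and deletions for one repo.

    table_suffix is a daily-table pattern such as '202506*' (hypothetical
    helper; the repo name is supplied separately as the @repo parameter).
    """
    # Table names cannot be bound as query parameters, so validate the
    # suffix before interpolating it into the SQL text.
    if not re.fullmatch(r"\d{4,8}\*?", table_suffix):
        raise ValueError(f"unexpected table suffix: {table_suffix!r}")
    return f"""
SELECT
    created_at,
    type,
    actor.login AS actor,
    JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type,
    JSON_EXTRACT_SCALAR(payload, '$.ref') AS ref
FROM `githubarchive.day.{table_suffix}`
WHERE
    repo.name = @repo
    AND type IN ('CreateEvent', 'DeleteEvent')
ORDER BY created_at
"""


sql = build_ref_history_query("202506*")
print(sql)
```

With the client from the Setup section, `@repo` could be bound by passing `bigquery.ScalarQueryParameter("repo", "STRING", "target/repository")` in `QueryJobConfig(query_parameters=[...])`. A `CreateEvent` row followed later by a `DeleteEvent` row for the same `ref` indicates a tag or branch that existed and was then removed.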

## Quick Start

**Check whether a user opened PRs in June 2025:**

```python
from google.cloud import bigquery
from google.oauth2 import service_account

# Initialize client (see Setup section for credentials)
credentials = service_account.Credentials.from_service_account_file(
    'path/to/credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery']
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# Query for PR events
query = """
SELECT
    created_at,
    repo.name,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as pr_title,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.202506*`
WHERE
    actor.login = 'suspected-actor'
    AND repo.name = 'target/repository'
    AND type = 'PullRequestEvent'
ORDER BY created_at
"""

results = client.query(query)
for row in results:
    print(f"{row.created_at}: PR #{row.pr_number} - {row.action}")
    print(f"  Title: {row.pr_title}")
```

**Expected Output (if PR exists)**:
```
2025-06-15 14:23:11+00:00: PR #123 - opened
  Title: Add new feature
2025-06-20 09:45:22+00:00: PR #123 - closed
  Title: Add new feature
```

**Interpretation**:
- **No results** → Claim disproven for this actor, repository, and time window (no PR activity found)
- **Results found** → Claim verified, proceed with detailed analysis
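
The Quick Start query covers a whole month through the `githubarchive.day.202506*` wildcard. When an investigation only concerns a few days, BigQuery's `_TABLE_SUFFIX` pseudo-column can narrow a wildcard to a date range. The sketch below is an illustration under that assumption; the helper name `day_tables_for_range` is hypothetical and not part of the original skill.

```python
from datetime import date


def day_tables_for_range(start: date, end: date) -> tuple[str, str]:
    """Return (table_reference, suffix_filter) covering start..end inclusive."""
    if end < start:
        raise ValueError("end date precedes start date")
    first, last = start.strftime("%Y%m%d"), end.strftime("%Y%m%d")
    if first == last:
        # A single day needs no wildcard and no suffix filter.
        return f"`githubarchive.day.{first}`", ""
    # The longest shared prefix of the two YYYYMMDD strings becomes the
    # wildcard stem; the remaining characters bound _TABLE_SUFFIX.
    n = 0
    while first[n] == last[n]:
        n += 1
    suffix_filter = f"_TABLE_SUFFIX BETWEEN '{first[n:]}' AND '{last[n:]}'"
    return f"`githubarchive.day.{first[:n]}*`", suffix_filter


table, suffix_filter = day_tables_for_range(date(2025, 6, 10), date(2025, 6, 20))
print(table)          # `githubarchive.day.202506*`
print(suffix_filter)  # _TABLE_SUFFIX BETWEEN '10' AND '20'
```

The returned filter would be added to the query's `WHERE` clause alongside the actor and repository conditions, so that only the daily tables in the range are read.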

## Setup

### Prerequisites

1. **Google Cloud Project**:
   - Log in to the [Google Cloud Console](https://console.cloud.google.com/)
   - Create a project and enable the BigQuery API
   - Create a service account with the `BigQuery User` role
   - Download the JSON credentials file

2. **Install BigQuery Client**:
```bash
pip install google-cloud-bigquery google-auth
```

### Initialize Client

```python
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'path/to/credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery']
)

client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id
)
```

**Free Tier**: Google provides 1 TiB of query data processed per month free.

## Cost Management & Query Optimization

### Understanding GitHub Archive Costs

BigQuery charges **$6.25 per TiB** of data scanned (after the 1 TiB free tier). GitHub Archive tables are **large** - a single month table can be 50-100 GB, and yearly wildcards can scan multiple TiBs. **Unoptimized queries can cost $10-100+**, while optimized versions of the same query cost $0.10-1.00.
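
The pricing arithmetic above can be written out as a small function. This is a sketch using the $6.25 per TiB rate and the 1 TiB free tier quoted in this section; the function name `estimate_cost_usd` and its parameters are hypothetical, and the actual rate may differ by region or pricing plan.

```python
TIB = 2**40  # bytes in one tebibyte


def estimate_cost_usd(bytes_scanned: int,
                      price_per_tib: float = 6.25,
                      free_tib_remaining: float = 0.0) -> float:
    """Convert scanned bytes to an on-demand cost in USD.

    free_tib_remaining is whatever is left of the monthly free tier
    (assumed to be tracked by the caller).
    """
    billable_tib = max(bytes_scanned / TIB - free_tib_remaining, 0.0)
    return billable_tib * price_per_tib


print(estimate_cost_usd(2 * TIB))                              # 12.5
print(estimate_cost_usd(2 * TIB, free_tib_remaining=1.0))      # 6.25
print(estimate_cost_usd(100 * 10**9, free_tib_remaining=1.0))  # 0.0
```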

**Key Cost Principle**: BigQuery uses columnar storage - you pay for ALL data in every column the query references (in SELECT, WHERE, and elsewhere), not just matching rows. A query with `SELECT *` on one day of data scans ~3 GB even with LIMIT 10.
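
Because `SELECT *` reads every column regardless of `LIMIT`, it can help to reject such queries before they reach BigQuery. The check below is a rough heuristic sketch, not part of the original skill: the name `uses_select_star` is hypothetical, and the regular expression only catches a star that directly follows `SELECT` (optionally qualified, as in `t.*`), so it can miss other forms and can misfire on SQL comments.

```python
import re

# Matches "SELECT *" and "SELECT alias.*", but not "SELECT COUNT(*)".
_SELECT_STAR = re.compile(r"\bSELECT\s+(?:\w+\.)?\*", re.IGNORECASE)


def uses_select_star(query: str) -> bool:
    """Heuristically report whether a query selects all columns."""
    return _SELECT_STAR.search(query) is not None


print(uses_select_star("SELECT * FROM `githubarchive.day.20250615` LIMIT 10"))  # True
print(uses_select_star("SELECT created_at, repo.name FROM `githubarchive.day.20250615`"))  # False
```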

### ALWAYS Estimate Costs Before Querying

**CRITICAL RULE**: Run a dry run to estimate costs before executing any query against GitHub Archive production tables.

```python
from google.cloud import bigquery

def estimate_gharchive_cost(query: str) -> dict:
    """Estimate cost before running GitHub Archive query."""
    client = bigquery.Client()

    # Dry run - validates query and returns bytes to scan
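    dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(query, job_config=dry_run_config)
    # Source is truncated from here on; the return shape is an assumed completion.
    tib = job.total_bytes_processed / 2**40
    return {
        "bytes_scanned": job.total_bytes_processed,
        "tib_scanned": tib,
        "estimated_cost_usd": tib * 6.25,  # $6.25/TiB on-demand rate, free tier ignored
    }
```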
