Skill · 468 repo stars · updated 9d ago

github-wayback-recovery

The github-wayback-recovery skill enables recovery of deleted GitHub content such as README files, issues, pull requests, and wiki pages by querying the Internet Archive's Wayback Machine and CDX API. Use this skill when GitHub repositories or specific content have been permanently deleted but may exist in web archive snapshots, particularly for recovering documentation, issue discussions, and repository metadata that were publicly crawled before deletion.

Repository: mantishack

Install in Claude Code

git clone --depth 1 https://github.com/deonmenezes/mantishack /tmp/github-wayback-recovery && cp -r /tmp/github-wayback-recovery/.claude/skills/oss-forensics/github-wayback-recovery ~/.claude/skills/github-wayback-recovery

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# GitHub Wayback Recovery

**Purpose**: Recover deleted GitHub content (README files, issues, PRs, wiki pages, repository metadata) from the Internet Archive's Wayback Machine when content is no longer available on GitHub.

## When to Use This Skill

- Repository has been deleted and you need README, wiki, or metadata
- Issues or PRs were deleted by author, maintainer, or moderation
- Need to recover file contents that may have been archived
- Investigating historical state of a repository
- Finding forks of deleted repositories via archived network pages
- Recovering release notes or documentation from deleted projects

**Complementary Skills**:
- **github-archive**: For structured event data (who did what, when) - always check first
- **github-commit-recovery**: For accessing commits when you have SHAs
- **github-wayback-recovery** (this skill): For web page snapshots when content is fully deleted

## Core Principles

**Wayback Machine Archives Web Pages, Not Git Repositories**:
- Cannot `git clone` from archived content
- Cannot reconstruct full commit history
- Recovery success depends on whether specific URLs were crawled

**What CAN Be Recovered**:
- README files and repository descriptions
- Issue titles, bodies, and comments (Archive Team prioritizes these)
- PR conversations and descriptions (Files Changed tab often fails)
- Wiki pages (especially wiki home)
- Release notes and descriptions
- Repository metadata (stars, language, license visible on homepage)
- Commit SHAs from archived commit list pages (use with **github-commit-recovery** skill to access actual content)

**What CANNOT Be Recovered**:
- Private repository content (never crawled)
- Complete git history or repository clone
- Content behind authentication

## Quick Start

**Check if a repository page was archived**:
```bash
curl -s "https://archive.org/wayback/available?url=github.com/owner/repo" | jq .
```

**Search for all archived URLs under a repository**:
```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/*&output=json&collapse=urlkey" | head -50
```

**Access an archived snapshot**:
```
https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo
```
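
The snapshot URL is plain string concatenation, so it is easy to script. The sketch below uses a hypothetical helper name, `wayback_snapshot_url`, that is not part of the skill. Its optional `raw` argument appends the `id_` modifier to the timestamp, which asks the Wayback Machine for the capture as originally served, without its toolbar and rewritten links.

```bash
#!/usr/bin/env bash
# wayback_snapshot_url TIMESTAMP URL [raw]
# Hypothetical helper (not part of the skill): prints the Wayback Machine URL
# for one capture. A third argument of "raw" appends the id_ modifier, which
# requests the capture without the Wayback toolbar and rewritten links.
wayback_snapshot_url() {
  local timestamp="$1" url="$2" mode="${3:-}"
  local modifier=""
  if [ "$mode" = "raw" ]; then
    modifier="id_"
  fi
  printf 'https://web.archive.org/web/%s%s/%s\n' "$timestamp" "$modifier" "$url"
}

# Rendered view, then the unmodified capture of the same page.
wayback_snapshot_url 20230615142311 "https://github.com/owner/repo"
wayback_snapshot_url 20230615142311 "https://github.com/owner/repo" raw
```

The raw form is usually the better input for text extraction, while the rendered form is easier to read in a browser.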

## GitHub URL Patterns for Archive Searches

Understanding GitHub's URL structure is essential for constructing archive queries.

### Repository-Level URLs

| Content Type | URL Pattern |
|--------------|-------------|
| Homepage | `github.com/{owner}/{repo}` |
| Commits list | `github.com/{owner}/{repo}/commits/{branch}` |
| Individual commit | `github.com/{owner}/{repo}/commit/{full-sha}` |
| Fork network | `github.com/{owner}/{repo}/network/members` |

### File and Directory URLs

| Content Type | URL Pattern |
|--------------|-------------|
| File view | `github.com/{owner}/{repo}/blob/{branch}/{path/to/file}` |
| Directory view | `github.com/{owner}/{repo}/tree/{branch}/{directory}` |
| File history | `github.com/{owner}/{repo}/commits/{branch}/{path/to/file}` |
| Raw file | `raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}` |

**Note**: `blob` = files, `tree` = directories. Raw URLs are rarely archived compared to rendered views.

### Collaboration Artifacts

| Content Type | URL Pattern |
|--------------|-------------|
| Pull request | `github.com/{owner}/{repo}/pull/{number}` |
| PR files | `github.com/{owner}/{repo}/pull/{number}/files` |
| PR commits | `github.com/{owner}/{repo}/pull/{number}/commits` |
| Issue | `github.com/{owner}/{repo}/issues/{number}` |
| Wiki page | `github.com/{owner}/{repo}/wiki/{page-name}` |
| Release | `github.com/{owner}/{repo}/releases/tag/{tag-name}` |
| All PRs | `github.com/{owner}/{repo}/pulls?state=all` |
| All issues | `github.com/{owner}/{repo}/issues?state=all` |
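
Most of the patterns above map directly onto CDX prefix searches. As a rough sketch, the hypothetical helper `repo_cdx_patterns` below prints one prefix pattern per content type for a given owner and repository. The choice of path segments is an assumption drawn from the tables, not an exhaustive list.

```bash
#!/usr/bin/env bash
# repo_cdx_patterns OWNER REPO
# Hypothetical helper: prints one CDX prefix pattern per content type from the
# URL tables, ready to use as the url= parameter of a CDX query.
repo_cdx_patterns() {
  local base="github.com/$1/$2"
  local path
  for path in issues pull wiki releases/tag blob tree commit commits; do
    printf '%s/%s/*\n' "$base" "$path"
  done
}

repo_cdx_patterns owner repo
```

List pages such as `pulls?state=all` are single URLs, so they are better queried exactly rather than as a prefix.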

## CDX API Reference

The Capture Index (CDX) API provides structured search across all archived URLs.

### Basic Query Structure

```
https://web.archive.org/cdx/search/cdx?url={URL}&output=json
```

### Essential Parameters

| Parameter | Effect | Example |
|-----------|--------|---------|
| `matchType=exact` | Exact URL only (default) | Single page |
| `matchType=prefix` | All URLs starting with path | All repo content |
| `url=.../*` | Wildcard (same as prefix) | `github.com/owner/repo/*` |
| `from=YYYY` | Start date filter | `from=2023` |
| `to=YYYY` | End date filter | `to=2024` |
| `filter=statuscode:200` | Only successful captures | Skip redirects/errors |
| `collapse=timestamp:8` | One capture per day | Reduce duplicates |
| `collapse=urlkey` | Unique URLs only | List all archived pages |
| `limit=N` | Limit results | `limit=100` |
| `output=json` | JSON format | Machine-readable |
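
These parameters combine freely in one query string. The sketch below assumes a hypothetical helper, `cdx_query_url`, that joins a URL pattern and any number of `name=value` pairs. It does no URL encoding, which should be adequate for the simple values in the table but not for arbitrary input.

```bash
#!/usr/bin/env bash
# cdx_query_url URL_PATTERN [PARAM=VALUE ...]
# Hypothetical helper: joins a URL pattern and any extra CDX parameters into a
# single query URL. Values are passed through unencoded.
cdx_query_url() {
  local query="url=$1"
  shift
  local param
  for param in "$@"; do
    query="$query&$param"
  done
  printf 'https://web.archive.org/cdx/search/cdx?%s\n' "$query"
}

# Unique, successfully captured issue URLs from 2023 through 2024, capped at 100.
cdx_query_url "github.com/owner/repo/issues/*" \
  "from=2023" "to=2024" "filter=statuscode:200" "collapse=urlkey" "limit=100"
```

The printed URL can be passed straight to `curl -s`, quoted so that the shell does not interpret `*` or `&`.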

### Query Examples

**Find all archived pages under a repository**:
```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/facebook/react/*&output=json&collapse=urlkey"
```

**Find archived issues for a specific repository**:
```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/issues/*&output=json&collapse=urlkey&filter=statuscode:200"
```

**Find archived snapshots of a specific file**:
```bash
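# CDX wildcards work only at the end (prefix) or start (domain) of the URL, so search the blob/ prefix and filter on the original URL
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/blob/*&filter=original:.*/path/to/file&output=json"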
```

**Check for archived snapshots near a specific date**:
```bash
curl -s "https://archive.org/wayback/available?url=github.com/owner/repo&timestamp=20230615"
```

### CDX Response Format

```json
[
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,github)/owner/repo", "20230615142311", "https://github.com/owner/repo", "text/html", "200", "ABC123...", "12345"]
]
```
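
When `output=json` is omitted, the CDX server returns the same seven fields as space-separated columns, one capture per line and without a header row, which suits `awk`. The sketch below assumes a hypothetical helper, `cdx_rows_to_snapshots`, and uses made-up sample rows. It keeps only captures with status 200 and prints a readable capture time (Wayback timestamps are UTC) next to the snapshot URL.

```bash
#!/usr/bin/env bash
# cdx_rows_to_snapshots
# Hypothetical helper: reads plain-text CDX rows on stdin
# (urlkey timestamp original mimetype statuscode digest length) and prints a
# readable capture time plus the snapshot URL for every capture with status 200.
cdx_rows_to_snapshots() {
  awk '$5 == "200" {
    ts = $2
    day = substr(ts, 1, 4) "-" substr(ts, 5, 2) "-" substr(ts, 7, 2)
    clock = substr(ts, 9, 2) ":" substr(ts, 11, 2) ":" substr(ts, 13, 2)
    printf "%s %s UTC  https://web.archive.org/web/%s/%s\n", day, clock, ts, $3
  }'
}

# Sample rows in the documented column order (values are illustrative only).
printf '%s\n' \
  'com,github)/owner/repo 20230615142311 https://github.com/owner/repo text/html 200 ABC123 12345' \
  'com,github)/owner/repo 20230701000000 https://github.com/owner/repo text/html 301 DEF456 512' \
  | cdx_rows_to_snapshots
```

Only the first sample row is printed, because the second has status 301. Filtering client-side like this is an alternative to `filter=statuscode:200` when the redirect rows are also wanted for other purposes.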

## Investigation Patterns

### Recovering Deleted File Contents

**Scenario**: Repository or file has been deleted, need to recover file contents.

**Step 1: Search for blob URLs**
```bash
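# CDX wildcards work only at the end (prefix) or start (domain) of the URL, so search the blob/ prefix and filter on the original URL
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/blob/*&filter=original:.*/README.md&output=json"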
```

**Step 2: Construct archive URL from timestamp**
```
https://
```
