web-archiving
This Claude Code skill provides methods for retrieving archived versions of inaccessible webpages using a cascading hierarchy of services, prioritizing the Wayback Machine, Archive.today, and Memento Time Travel. Use it when accessing deleted or paywalled pages, preserving web content for journalism and research, documenting evidence for legal purposes, or building redundant archival workflows that require historical snapshots of online sources.
git clone --depth 1 https://github.com/jamditis/claude-skills-journalism /tmp/web-archiving && cp -r /tmp/web-archiving/research-toolkit/skills/web-archiving ~/.claude/skills/web-archiving
# Web archiving methodology
Patterns for accessing inaccessible web pages and preserving web content for journalism, research, and legal purposes.
## Archive service hierarchy
Try services in this order for maximum coverage:
```
┌─────────────────────────────────────────────────────────────────┐
│                    ARCHIVE RETRIEVAL CASCADE                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Wayback Machine (archive.org)                               │
│     └─ 900B+ pages, historical depth, API access                │
│              ↓ not found                                        │
│  2. Archive.today (archive.is/archive.ph)                       │
│     └─ On-demand snapshots, paywall bypass                      │
│     └─ Caveat (2026): FBI subpoenaed registrar in Oct 2025;     │
│        Wikipedia deprecated as citation source in Feb 2026.     │
│        Prefer Wayback / Perma.cc for legal or citation use      │
│              ↓ not found                                        │
│  3. Memento Time Travel (aggregator)                            │
│     └─ Searches multiple archives simultaneously                │
│                                                                 │
│  Retired (do not use): Google Cache (`cache:` operator) was     │
│  shut down in Sept 2024; Bing Cache dropdown was removed in     │
│  the same year. Both formerly fed this cascade.                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
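The fall-through order in the diagram can be sketched as a small driver that tries each service in turn and stops at the first hit. This is a minimal illustration, not part of the skill: `retrieve_from_cascade`, the `Lookup` type, and the two stand-in lookups are hypothetical names, and the `archive.invalid` URL is a placeholder rather than a real archive endpoint.

```python
from typing import Callable, Iterable, Optional

# A lookup takes the original URL and returns an archived URL, or None on a miss.
Lookup = Callable[[str], Optional[str]]


def retrieve_from_cascade(
    url: str, services: Iterable[tuple[str, Lookup]]
) -> Optional[tuple[str, str]]:
    """Return (service_name, archived_url) from the first service with a copy."""
    for name, lookup in services:
        try:
            archived = lookup(url)
        except Exception:
            # Treat a failing service as a miss and fall through to the next one.
            continue
        if archived:
            return name, archived
    return None


# Stand-in lookups for illustration only; real ones would query each service.
def fake_wayback(url: str) -> Optional[str]:
    return None  # simulate "not found"


def fake_archive_today(url: str) -> Optional[str]:
    return f"https://archive.invalid/snapshot/{url}"  # placeholder URL


result = retrieve_from_cascade(
    "https://example.com/story",
    [("wayback", fake_wayback), ("archive.today", fake_archive_today)],
)
print(result)
```

Passing the services as an ordered list keeps the priority explicit, so reordering or dropping a service (for example, skipping Archive.today for legal citations) is a one-line change at the call site.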
## Wayback Machine API
### Check if URL is archived
```python
import requests
from typing import Optional
from datetime import datetime
from urllib.parse import quote, unquote


def check_wayback_availability(url: str) -> Optional[dict]:
    """Check if URL exists in Wayback Machine."""
    api_url = "https://archive.org/wayback/available"
    try:
        response = requests.get(api_url, params={'url': url}, timeout=10)
        data = response.json()
        if data.get('archived_snapshots', {}).get('closest'):
            snapshot = data['archived_snapshots']['closest']
            return {
                'available': snapshot.get('available', False),
                'url': snapshot.get('url'),
                'timestamp': snapshot.get('timestamp'),
                'status': snapshot.get('status')
            }
        return None
    except Exception:
        # Network errors and malformed JSON are both treated as "not found".
        return None


def get_wayback_url(url: str, timestamp: Optional[str] = None) -> str:
    """Generate Wayback Machine URL for a page.

    Returns the canonical raw form (`.../web/<timestamp>/<url>`) per
    Wayback's replay-URL convention. If you intend to navigate to the
    returned link in a browser AND the target URL has `#` fragments,
    encode at the call site with urllib.parse.quote so the browser
    doesn't strip the fragment before request dispatch.

    Args:
        url: Original URL to retrieve
        timestamp: Optional YYYYMMDDHHMMSS format, or None for latest
    """
    if timestamp:
        return f"https://web.archive.org/web/{timestamp}/{url}"
    return f"https://web.archive.org/web/{url}"
```
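The `timestamp` values returned and accepted above use the 14-digit `YYYYMMDDHHMMSS` form. A small pair of helpers can convert between that form and `datetime` objects. This is a sketch under the assumption that Wayback timestamps are in UTC; `parse_wayback_timestamp` and `format_wayback_timestamp` are hypothetical names, not part of the skill.

```python
from datetime import datetime, timezone

WAYBACK_FORMAT = "%Y%m%d%H%M%S"  # 14-digit YYYYMMDDHHMMSS


def parse_wayback_timestamp(timestamp: str) -> datetime:
    """Convert a 14-digit Wayback timestamp to a timezone-aware UTC datetime."""
    return datetime.strptime(timestamp, WAYBACK_FORMAT).replace(tzinfo=timezone.utc)


def format_wayback_timestamp(moment: datetime) -> str:
    """Convert a timezone-aware datetime to a 14-digit Wayback timestamp."""
    return moment.astimezone(timezone.utc).strftime(WAYBACK_FORMAT)


captured = parse_wayback_timestamp("20240115123000")
print(captured.isoformat())
print(format_wayback_timestamp(captured))
```

Pass timezone-aware datetimes to `format_wayback_timestamp`; a naive datetime would be interpreted as local time by `astimezone`.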
### Save page to Wayback Machine
```python
def save_to_wayback(url: str, s3_keys: Optional[tuple[str, str]] = None) -> Optional[str]:
    """Request Wayback Machine to archive a URL via Save Page Now.

    Returns the archived URL if successful.

    Anonymous requests are rate-limited at roughly 15/minute. Pass
    `s3_keys=(access_key, secret)` from an Internet Archive account
    to raise the cap (to roughly 50/minute with auth) and avoid silent
    drops on paywalled / heavily JS-rendered pages.
    """
    # quote(unquote(url), ...) normalizes any existing %xx escapes
    # first so they don't get double-encoded into %25xx.
    save_url = f"https://web.archive.org/save/{quote(unquote(url), safe='')}"
    headers = {'User-Agent': 'Mozilla/5.0 (research-archiver)'}
    if s3_keys:
        headers['Authorization'] = f'LOW {s3_keys[0]}:{s3_keys[1]}'
    try:
        response = requests.get(save_url, headers=headers, timeout=60)
        if response.status_code == 200:
            # SPN delivers the canonical archive URL via the final URL
            # after redirect-following (or the `Link` header on async
            # captures). `response.url` is the reliable common case.
            return response.url
        return None
    except Exception:
        return None
```
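Because Save Page Now requests are rate-limited (roughly 15/minute anonymously, per the docstring above), batch jobs may want to pace their calls. The following is a minimal sketch of a fixed-interval throttle; `MinIntervalThrottle` is a hypothetical helper, and the injectable `clock` and `sleep` arguments exist only so the demo runs without real delays.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Space out calls so that at most `max_per_minute` happen per minute."""

    def __init__(
        self,
        max_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Demo with a fake clock so the example runs instantly.
fake_times = iter([0.0, 1.0])
delays: list[float] = []
demo = MinIntervalThrottle(15, clock=lambda: next(fake_times), sleep=delays.append)
demo.wait()  # first call goes through immediately
demo.wait()  # second call arrives 1s later and must wait 3s more
print(delays)  # [3.0]
```

In a real batch, one would create `MinIntervalThrottle(15)` once and call `wait()` before each `save_to_wayback(url)`.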
### CDX API for historical snapshots
```python
def get_all_snapshots(url: str, limit: int = 100) -> list[dict]:
    """Get all archived snapshots of a URL using CDX API.

    Returns list of snapshots with timestamps and status codes.
    """
    cdx_url = "https://web.archive.org/cdx/search/cdx"
    params = {
        'url': url,
        'output': 'json',
        'limit': limit,
        'fl': 'timestamp,original,statuscode,digest,length'
    }
    try:
        response = requests.get(cdx_url, params=params, timeout=30)
        data = response.json()
        if len(data) < 2:  # First row is headers
            return []
        headers = data[0]
        snapshots = []
        for row in data[1:]:
            snapshot = dict(zip(headers, row))
            snapshot['wayback_url'] = (
                f"https://web.archive.org/web/{snapshot['timestamp']}/{snapshot['original']}"
            )
            snapshots.append(snapshot)
        return snapshots
    except Exception:
        return []
```
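Many CDX rows are repeat captures of unchanged content. Since each row carries a `digest` field, a client-side pass can reduce the list to distinct versions of the page. This is a sketch that assumes rows shaped like the output of `get_all_snapshots`; `distinct_versions` is a hypothetical helper and the sample rows and digests are made up for illustration.

```python
def distinct_versions(snapshots: list[dict]) -> list[dict]:
    """Keep the first snapshot seen for each content digest, in input order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for snapshot in snapshots:
        digest = snapshot.get("digest")
        # Rows without a digest cannot be compared, so they are always kept.
        if digest is not None and digest in seen:
            continue
        if digest is not None:
            seen.add(digest)
        unique.append(snapshot)
    return unique


# Made-up sample rows; real rows come from the CDX API.
sample = [
    {"timestamp": "20230101000000", "digest": "AAA"},
    {"timestamp": "20230201000000", "digest": "AAA"},  # unchanged content
    {"timestamp": "20230301000000", "digest": "BBB"},  # page was edited
]
versions = distinct_versions(sample)
print([v["timestamp"] for v in versions])  # ['20230101000000', '20230301000000']
```

The CDX server also documents a `collapse` parameter for server-side filtering; the client-side version here is useful when the snapshot list has already been fetched.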
## Archive.today integration
### Save to Archive.today
```python
import re
import requests
from urllib.parse import quote, unquote, urljoin
def save_to_archive_today(url: str) -> Optional[str]:
    """Submit URL to Archive.today for archiving.

    Note: Archive.today has rate limiting and CAPTCHA requirements.
    This function works for basic archiving but may require
    manual intervention for high-volume