web-scraping
This Claude Code skill provides web scraping patterns with anti-bot evasion techniques and multiple extraction fallbacks. Use it when extracting article content from websites, bypassing paywalls, implementing scraping cascades, or processing social media data. It covers trafilatura for fast extraction, requests with rotating user agents, Playwright with stealth mode for JavaScript-heavy sites, and specialized tools like yt-dlp for video metadata and instaloader for Instagram content.
git clone --depth 1 https://github.com/jamditis/claude-skills-journalism /tmp/web-scraping && cp -r /tmp/web-scraping/dev-toolkit/skills/web-scraping ~/.claude/skills/web-scraping
# Web scraping methodology
Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.
## Scraping cascade architecture
Implement multiple extraction strategies with automatic fallback:
```python
from abc import ABC, abstractmethod
from typing import Optional
import requests
from bs4 import BeautifulSoup
import trafilatura
# Sync API, for .py files
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
# Async API, for .ipynb files
import asyncio
from playwright.async_api import async_playwright

class ScrapingResult:
    def __init__(self, content: str, title: str, method: str):
        self.content = content
        self.title = title
        self.method = method  # Track which method succeeded


class Scraper(ABC):
    @abstractmethod
    def fetch(self, url: str) -> Optional[ScrapingResult]: ...

class TrafilaturaScraper(Scraper):
    """Fast, lightweight extraction for standard articles."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            downloaded = trafilatura.fetch_url(url)
            if not downloaded:
                return None
            content = trafilatura.extract(
                downloaded,
                include_comments=False,
                include_tables=True,
                favor_recall=True
            )
            # Treat very short extractions as failures so the next method is tried
            if not content or len(content) < 100:
                return None
            # Extract title separately
            soup = BeautifulSoup(downloaded, 'html.parser')
            title = soup.find('title')
            title_text = title.get_text() if title else ''
            return ScrapingResult(content, title_text, 'trafilatura')
        except Exception:
            return None

class RequestsScraper(Scraper):
    """HTTP requests with rotating user agents."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        import random
        headers = {
            'User-Agent': random.choice(self.USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            # Remove non-content elements (scripts, styles, navigation, footers, sidebars)
            for element in soup(['script', 'style', 'nav', 'footer', 'aside']):
                element.decompose()
            # Find main content
            main = soup.find('main') or soup.find('article') or soup.find('body')
            content = main.get_text(separator='\n', strip=True) if main else ''
            title = soup.find('title')
            title_text = title.get_text() if title else ''
            if len(content) < 100:
                return None
            return ScrapingResult(content, title_text, 'requests')
        except Exception:
            return None

class PlaywrightScraper(Scraper):
    """Heavy JavaScript rendering with stealth mode for anti-bot bypass."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                context = browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = context.new_page()
                # Apply stealth to avoid detection
                stealth_sync(page)
                page.goto(url, wait_until='networkidle', timeout=60000)
                # Wait for content to load
                page.wait_for_timeout(2000)
                # Extract content
                content = page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')
                title = page.title()
                browser.close()
                if len(content) < 100:
                    return None
                return ScrapingResult(content, title, 'playwright')
        except Exception:
            return None

class PlaywrightScraperAsync:
    """Async Playwright scraper for Jupyter notebooks (.ipynb files).

    Jupyter notebooks run their own event loop, so sync Playwright won't work.
    Use this async version with `await` in notebook cells.
    """

    async def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            async with async_playwright() as p:
                browser = await p.chromium.launch(headless=True)
                context = await browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = await context.new_page()
                # Note: playwright-stealth async version
                # from playwright_stealth import stealth_async
                # await stealth_async(page)
                await page.goto(url, wait_until='networkidle', timeout=60000)
                # Wait for content to load
                await page.wait_for_timeout(2000)
                # Extract content
                content = await page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')
                title = await page.title()
                await browser.close()
                if len(content) < 100:
                    return None
                # Assumption: the remaining lines mirror the sync PlaywrightScraper.fetch
                return ScrapingResult(content, title, 'playwright')
        except Exception:
            return None
```
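The scraper classes above each return `None` on failure, which is what makes an automatic fallback possible. A minimal sketch of that orchestration follows. The `scrape_with_fallback` helper and the two stub fetchers are hypothetical names introduced for illustration, not part of the skill itself; the stubs stand in for the real `fetch` methods so the example runs without network access.

```python
from typing import Callable, List, Optional


class ScrapingResult:
    """Same shape as the ScrapingResult class defined above."""

    def __init__(self, content: str, title: str, method: str):
        self.content = content
        self.title = title
        self.method = method


# Hypothetical alias: any callable with the same signature as Scraper.fetch
FetchFn = Callable[[str], Optional[ScrapingResult]]


def scrape_with_fallback(url: str, fetchers: List[FetchFn]) -> Optional[ScrapingResult]:
    """Try each fetcher in order and return the first non-None result."""
    for fetch in fetchers:
        result = fetch(url)
        if result is not None:
            return result
    return None  # every method failed


# Stub fetchers (assumptions) standing in for the real scrapers' fetch methods
def failing_fetch(url: str) -> Optional[ScrapingResult]:
    return None  # simulates an extraction that found too little content


def working_fetch(url: str) -> Optional[ScrapingResult]:
    return ScrapingResult("x" * 200, "Example title", "stub")


result = scrape_with_fallback(
    "https://example.com/article", [failing_fetch, working_fetch]
)
print(result.method if result else "all methods failed")
```

With the real classes, the list would hold the `fetch` methods of the trafilatura, requests, and Playwright scrapers, in that order, so the cheapest method runs first and the browser is launched only when the lighter ones fail. The `method` field on the result then records which stage succeeded.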