Skill · 128 repo stars · updated 2mo ago

using-web-scraping

This Claude Code skill enables agents to search DuckDuckGo and extract structured content from public webpages using headless Chrome, with built-in protections for robots.txt compliance and rate-limiting. Use it when you need to collect public webpage data for summarization or metadata extraction, but avoid it for bypassing paywalls or accessing login-restricted content.

View source · Repository: open-skills

Install in Claude Code

git clone --depth 1 https://github.com/besoeasy/open-skills /tmp/using-web-scraping && cp -r /tmp/using-web-scraping/skills/using-web-scraping ~/.claude/skills/using-web-scraping

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Web Scraping Skill — Chrome (Playwright) + DuckDuckGo

A privacy-minded, agent-facing web-scraping skill that uses headless Chrome (Playwright/Puppeteer) and DuckDuckGo for search. It focuses on reliable navigation, structured text extraction, robots.txt compliance, and rate limiting.

## When to use
- Collect public webpage content for summarization, metadata extraction, or link discovery.
- Use DuckDuckGo for queries when you want a privacy-respecting search source.
- NOT for bypassing paywalls, scraping private/logged-in content, or violating Terms of Service.

## Safety & etiquette
- Always check and respect `/robots.txt` before scraping a site.
- Rate-limit requests (default: 1 request/sec) and use polite `User-Agent` strings.
- Avoid executing arbitrary user-provided JavaScript on scraped pages.
- Only scrape public content; if login is required, return `login_required` instead of attempting to bypass.
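
The robots.txt rule above can be approximated with a small parser. This is a minimal sketch, not the skill's own implementation: it assumes plain prefix matching of `Allow`/`Disallow` paths (no `*` or `$` wildcards) and longest-match precedence. The names `parseRobots` and `isAllowed` are hypothetical.

```javascript
// Hypothetical helpers: a simplified robots.txt check (prefix rules only).
function parseRobots(text) {
  const groups = []; // [{ agents: [...], rules: [{ allow, path }] }]
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // An empty Disallow means "no restriction", so it is skipped.
      if (value) current.rules.push({ allow: field === 'allow', path: value });
      lastWasAgent = false;
    } else {
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(robotsText, path, userAgent) {
  const groups = parseRobots(robotsText);
  const ua = userAgent.toLowerCase();
  // Prefer a group naming this agent; otherwise fall back to "*".
  const group =
    groups.find(g => g.agents.some(a => a !== '*' && a !== '' && ua.includes(a))) ||
    groups.find(g => g.agents.includes('*'));
  if (!group) return true;
  // Longest matching prefix wins; Allow wins a tie.
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    if (
      !best ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}
```

One way to use it: fetch `new URL('/robots.txt', pageUrl)` (for example with the global `fetch` in Node 18+), pass the response text, the target URL's `pathname`, and the bot's `User-Agent` to `isAllowed`, and skip the page when it returns `false`. A dedicated robots.txt library is likely a better fit where wildcard rules matter.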

## Capabilities
- Search DuckDuckGo and return top-N result links.
- Visit result pages in headless Chrome and extract `title`, `meta description`, `main` text (or best-effort article text), and `canonical` URL.
- Return results as structured JSON for downstream consumption.
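
The structured output described above could be assembled along these lines. This is a sketch under assumptions: `resolveCanonical` and `buildResult` are hypothetical helpers, and the field names follow the `[{url,title,description,text}]` shape used elsewhere in this document, plus `canonical`. Relative canonical links are resolved against the visited URL, which is used as the fallback when the page has no canonical tag.

```javascript
// Hypothetical helper: resolve a <link rel="canonical"> href against the page URL.
function resolveCanonical(pageUrl, canonicalHref) {
  if (!canonicalHref) return pageUrl; // no canonical tag: fall back to the visited URL
  try {
    return new URL(canonicalHref, pageUrl).href;
  } catch {
    return pageUrl; // malformed href: fall back as well
  }
}

// Hypothetical result builder covering the fields listed above.
function buildResult({ url, canonicalHref, title, description, text }) {
  return {
    url,
    canonical: resolveCanonical(url, canonicalHref),
    title: title ? title.trim() : null,
    description: description ? description.trim() : null,
    // Collapse runs of blank lines left behind by innerText extraction.
    text: text ? text.replace(/\n{3,}/g, '\n\n').trim() : null,
  };
}
```

Missing fields are kept as `null` rather than dropped so that downstream consumers always see the same keys.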

## Examples
### Node.js (Playwright)
```javascript
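const { chromium } = require('playwright');

// DuckDuckGo's HTML endpoint wraps result links in a redirect
// (//duckduckgo.com/l/?uddg=<encoded target>); unwrap it when present.
function unwrapDdgLink(href) {
  const url = new URL(href, 'https://duckduckgo.com');
  return url.searchParams.get('uddg') || url.href;
}

async function ddgSearchAndScrape(query) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage({ userAgent: 'open-skills-bot/1.0' });

    // DuckDuckGo search via the HTML-only endpoint, which serves the
    // `.result__title a` markup (selectors may change without notice)
    await page.goto('https://html.duckduckgo.com/html/?q=' + encodeURIComponent(query));

    // collect top result URL
    const rawHref = await page.locator('.result__title a').first()
      .getAttribute('href', { timeout: 10000 }).catch(() => null);
    if (!rawHref) return [];
    const href = unwrapDdgLink(rawHref);

    // visit result and extract (check robots.txt for `href` before this step);
    // short timeouts keep a missing element from stalling the run
    await page.goto(href, { waitUntil: 'domcontentloaded' });
    const title = await page.title();
    const description = await page.locator('meta[name="description"]').first()
      .getAttribute('content', { timeout: 2000 }).catch(() => null);
    const canonical = await page.locator('link[rel="canonical"]').first()
      .getAttribute('href', { timeout: 2000 }).catch(() => null);
    const article = await page.locator('article, main, #content').first()
      .innerText({ timeout: 2000 }).catch(() => null);

    return [{ url: href, canonical, title, description, text: article }];
  } finally {
    // always release the browser, even if navigation throws
    await browser.close();
  }
}

// usage
// ddgSearchAndScrape('open-source agent runtimes').then(console.log);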
```
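
The example above visits a single result. When an agent walks through several result pages, the 1 request/sec default from the safety section could be enforced with a small limiter. This is a minimal sketch; `createRateLimiter` is a hypothetical helper, and the injectable `now` parameter is an assumption that makes the timing logic testable without real delays.

```javascript
// Hypothetical helper: spaces calls at least `intervalMs` apart.
function createRateLimiter(intervalMs = 1000, now = Date.now) {
  let nextSlot = 0; // earliest timestamp at which the next request may start
  return {
    // Reserves the next slot and returns how long the caller must wait (ms).
    reserve() {
      const t = now();
      const start = Math.max(t, nextSlot);
      nextSlot = start + intervalMs;
      return start - t;
    },
    // Convenience wrapper: resolves once the reserved slot is reached.
    async wait() {
      const delay = this.reserve();
      if (delay > 0) await new Promise(resolve => setTimeout(resolve, delay));
    },
  };
}
```

A loop over result URLs would call `await limiter.wait()` before each `page.goto(url)`, so navigations start no more often than once per interval regardless of how quickly each page loads.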

## Agent prompt (copy/paste)
```text
You are an agent with a web-scraping skill. For any `search:` task, use DuckDuckGo to find relevant pages, then open each page in a headless Chrome instance (Playwright/Puppeteer) and extract `title`, `meta description`, `main text`, and `canonical` URL. Always:
- Check and respect robots.txt
- Rate-limit requests (<=1 req/sec)
- Use a clear `User-Agent` and do not execute arbitrary user-provided JavaScript on scraped pages
Return results as JSON: [{url,canonical,title,description,text}] or `login_required` if a page needs authentication.
```

## Quick setup
- Node: `npm i playwright` and run `npx playwright install` for browser binaries.
- Python: `pip install playwright` and `playwright install`.

## Tips
- Use `page.route` to block large assets (images, fonts) when you only need text.
- Respect site terms and introduce exponential backoff for retries.
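
Both tips can be sketched in a few lines. The helper names (`blockHeavyAssets`, `backoffDelay`, `withRetry`) and the default values (1 s base, 30 s cap, 3 retries) are assumptions for illustration, not settings taken from the skill. The blocked types are values returned by Playwright's `request.resourceType()`.

```javascript
// Resource types that carry no text; values come from Playwright's
// request.resourceType().
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

// Hypothetical helper: call once per page before navigating.
async function blockHeavyAssets(page) {
  await page.route('**/*', route =>
    BLOCKED_TYPES.has(route.request().resourceType())
      ? route.abort()
      : route.continue()
  );
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical helper: retries `fn` with exponential backoff and rethrows
// the last error once `retries` extra attempts are used up.
async function withRetry(fn, { retries = 3, baseMs = 1000, maxMs = 30000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      const delay = backoffDelay(attempt, baseMs, maxMs);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

A navigation could then be wrapped as `await withRetry(() => page.goto(url, { waitUntil: 'domcontentloaded' }))`. Adding random jitter to the delay is a common refinement when many agents retry against the same host.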

## See also
- [using-youtube-download.md](using-youtube-download.md) — media-specific scraping and download examples.

From the same repository

age-file-encryption (Skill)

Encrypt and decrypt files or streams using age — a simple, modern, and secure encryption tool with small explicit keys, passphrase support, SSH key support, post-quantum hybrid keys, and UNIX-style composability. No config options, no footguns.

anonymous-file-upload (Skill)

Upload and host files anonymously using decentralized storage with Originless and IPFS.

browser-automation-agent (Skill)

Automate web browsers for AI agents using agent-browser CLI with deterministic element selection.

bulk-github-star (Skill)

Star all repositories from a GitHub user automatically. Use when: (1) Supporting open source creators, (2) Bulk discovery of useful projects, or (3) Automating GitHub engagement.

changelog-generator (Skill)

Automatically creates user-facing changelogs from git commits by analyzing commit history, categorizing changes, and transforming technical commits into clear, customer-friendly release notes. Turns hours of manual changelog writing into minutes of automated generation.

chat-logger (Skill)

Log all chat messages to a SQLite database for searchable history and audit. Use when: (1) Building chat history, (2) Auditing conversations, (3) Searching past messages, or (4) User asks to log chats.

check-crypto-address-balance (Skill)

Check cryptocurrency wallet balances across multiple blockchains using free public APIs.

city-distance (Skill)

Calculate line-of-sight and road distances between two cities using free OpenStreetMap services.