Skill1.1k estrellas del repoactualizado 5d ago

conference-speaker-scraper

Conference Speaker Scraper extracts speaker names, titles, companies, and bios from conference website speaker pages using multiple HTML extraction strategies or Apify-based scraping for JavaScript-heavy sites. Use this skill when building conference attendee databases, speaker directories, or event analytics systems that require structured speaker data from conference websites.

Ver fuente Repositorio: goose-skills

Instalar en Claude Code

Copiar

git clone --depth 1 https://github.com/gooseworks-ai/goose-skills /tmp/conference-speaker-scraper && cp -r /tmp/conference-speaker-scraper/skills/lead-generation/capabilities/conference-speaker-scraper ~/.claude/skills/conference-speaker-scraper

Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

Definición

SKILL.md

# Conference Speaker Scraper

Extract speaker names, titles, companies, and bios from conference website /speakers pages. Supports direct HTML scraping with multiple extraction strategies, plus Apify fallback for JS-heavy sites.

## Quick Start

No API key needed for direct scraping mode.

```bash
# Scrape speakers from a conference page
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers"

# Use Apify for JS-heavy sites
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --mode apify

# Custom conference name (otherwise inferred from URL)
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --conference "Sage Future 2026"

# Output formats
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output json     # default
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output csv
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output summary
```

## How It Works

### Direct Mode (default)

Fetches the page HTML and tries multiple extraction strategies in order, using whichever returns the most results:

1. **Strategy A -- CSS class hints:** Looks for speaker cards with class names containing "speaker", "presenter", "faculty", "panelist", "team-member"
2. **Strategy B -- Heading + paragraph patterns:** Looks for repeated `<h2>`/`<h3>` + `<p>` structures
3. **Strategy C -- JSON-LD structured data:** Checks for `<script type="application/ld+json">` with speaker data
4. **Strategy D -- Platform embeds:** Detects Sched.com/Sessionize patterns used by many conferences

### Apify Mode

Uses `apify/cheerio-scraper` actor with a custom page function that targets common speaker card selectors. Standard POST/poll/GET dataset pattern.

## CLI Reference

| Flag | Default | Description |
|------|---------|-------------|
| `--url` | *required* | Conference speakers page URL |
| `--conference` | inferred | Conference name (otherwise inferred from URL domain) |
| `--mode` | direct | `direct` (HTML scraping) or `apify` (Apify cheerio scraper) |
| `--output` | json | Output format: `json`, `csv`, or `summary` |
| `--token` | env var | Apify token (only needed for apify mode) |
| `--timeout` | 300 | Max seconds for Apify run |

## Output Schema

```json
{
  "name": "Jane Smith",
  "title": "VP of Finance",
  "company": "Acme Corp",
  "bio": "Jane leads the finance transformation at...",
  "linkedin_url": "https://linkedin.com/in/janesmith",
  "image_url": "https://...",
  "conference": "Sage Future 2026",
  "source_url": "https://sagefuture2026.com/speakers"
}
```

## Cost

- **Direct mode:** Free (no API, no tokens)
- **Apify mode:** Uses `apify/cheerio-scraper` -- minimal Apify credits

## Testing Notes

HTML scraping is inherently fragile across conference sites. The multi-strategy approach maximizes coverage, but JS-heavy sites will require Apify mode. When direct scraping returns 0 results, try `--mode apify`.

Del mismo repositorio

aeoSkill

ai-video-calls-tavusSkill

AI video conversations - create real-time video calls with AI personas

ai-web-scraping-scrapegraphSkill

AI-powered web scraping - extract data using natural language prompts

amazon-searchSkill

Search Amazon products - find items, compare prices, read reviews

api-testerSkill

Test and document API endpoints - validate responses, check status, generate examples

apollo-lead-finderSkill

blog-feed-monitorSkill

brand-intel-branddevSkill

Brand intelligence - logos, colors, fonts, styleguides, and company data from any domain