conference-speaker-scraper
Conference Speaker Scraper extracts speaker names, titles, companies, and bios from conference website speaker pages using multiple HTML extraction strategies or Apify-based scraping for JavaScript-heavy sites. Use this skill when building conference attendee databases, speaker directories, or event analytics systems that require structured speaker data from conference websites.
git clone --depth 1 https://github.com/gooseworks-ai/goose-skills /tmp/conference-speaker-scraper && cp -r /tmp/conference-speaker-scraper/skills/capabilities/conference-speaker-scraper ~/.claude/skills/conference-speaker-scraperSKILL.md
# Conference Speaker Scraper
Extract speaker names, titles, companies, and bios from conference website /speakers pages. Supports direct HTML scraping with multiple extraction strategies, plus Apify fallback for JS-heavy sites.
## Quick Start
No API key needed for direct scraping mode.
```bash
# Scrape speakers from a conference page
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
--url "https://example.com/speakers"
# Use Apify for JS-heavy sites
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
--url "https://example.com/speakers" --mode apify
# Custom conference name (otherwise inferred from URL)
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
--url "https://example.com/speakers" --conference "Sage Future 2026"
# Output formats
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output json # default
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output csv
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output summary
```
## How It Works
### Direct Mode (default)
Fetches the page HTML and tries multiple extraction strategies in order, using whichever returns the most results:
1. **Strategy A -- CSS class hints:** Looks for speaker cards with class names containing "speaker", "presenter", "faculty", "panelist", "team-member"
2. **Strategy B -- Heading + paragraph patterns:** Looks for repeated `<h2>`/`<h3>` + `<p>` structures
3. **Strategy C -- JSON-LD structured data:** Checks for `<script type="application/ld+json">` with speaker data
4. **Strategy D -- Platform embeds:** Detects Sched.com/Sessionize patterns used by many conferences
### Apify Mode
Uses `apify/cheerio-scraper` actor with a custom page function that targets common speaker card selectors. Standard POST/poll/GET dataset pattern.
## CLI Reference
| Flag | Default | Description |
|------|---------|-------------|
| `--url` | *required* | Conference speakers page URL |
| `--conference` | inferred | Conference name (otherwise inferred from URL domain) |
| `--mode` | direct | `direct` (HTML scraping) or `apify` (Apify cheerio scraper) |
| `--output` | json | Output format: `json`, `csv`, or `summary` |
| `--token` | env var | Apify token (only needed for apify mode) |
| `--timeout` | 300 | Max seconds for Apify run |
## Output Schema
```json
{
"name": "Jane Smith",
"title": "VP of Finance",
"company": "Acme Corp",
"bio": "Jane leads the finance transformation at...",
"linkedin_url": "https://linkedin.com/in/janesmith",
"image_url": "https://...",
"conference": "Sage Future 2026",
"source_url": "https://sagefuture2026.com/speakers"
}
```
## Cost
- **Direct mode:** Free (no API, no tokens)
- **Apify mode:** Uses `apify/cheerio-scraper` -- minimal Apify credits
## Testing Notes
HTML scraping is inherently fragile across conference sites. The multi-strategy approach maximizes coverage, but JS-heavy sites will require Apify mode. When direct scraping returns 0 results, try `--mode apify`.>
AI video conversations - create real-time video calls with AI personas
AI-powered web scraping - extract data using natural language prompts
Search Amazon products - find items, compare prices, read reviews
Test and document API endpoints - validate responses, check status, generate examples
>
>
Brand intelligence - logos, colors, fonts, styleguides, and company data from any domain