Skill · 506 repo stars · updated 8d ago

seo-firecrawl

The seo-firecrawl skill integrates Firecrawl web scraping into Codex SEO workflows, enabling full-site crawls, single-page scrapes, site mapping, and content search. Use it for comprehensive SEO audits, section-specific analysis, broken link detection, and content inventory when the Firecrawl extension is installed and its MCP server is connected.

Repository: codex-seo

Install in Claude Code

git clone --depth 1 https://github.com/AgriciDaniel/codex-seo /tmp/seo-firecrawl && cp -r /tmp/seo-firecrawl/skills/seo-firecrawl ~/.claude/skills/seo-firecrawl

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Firecrawl Extension for Codex SEO
## Shared Data Cache

**Step 0 -- Check shared data cache:**

Before gathering, check `.seo-cache/` for reusable context from related SEO skills.
Reference: `../seo/references/shared-data-cache.md` for schemas and dependency map.

Check these cache files when present:
- `.seo-cache/site-meta.json` for domain, business type, industry, and crawl context
- `.seo-cache/audit-scores.json` for prior full-audit priorities
- `.seo-cache/pages/{url-slug}/page-analysis.json` for page-level context when a URL is provided

- If found: parse and use clearly valid fields (note "Using cached [X] from [date]")
- If missing, corrupt, or irrelevant: continue with fresh evidence
- If the user says "refresh" or "re-run": ignore cache reads and overwrite on write
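
The cache check above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names (`load_cache_file`, `load_seo_cache`) are hypothetical, and the only assumption about the cache files is that they are JSON, as their extensions suggest.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_cache_file(path: Path, refresh: bool = False) -> Optional[Any]:
    """Return parsed JSON from one cache file, or None if it should not be used."""
    if refresh:
        # User asked to "refresh" or "re-run": ignore cache reads.
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or corrupt file: continue with fresh evidence.
        return None


def load_seo_cache(
    root: Path, url_slug: Optional[str] = None, refresh: bool = False
) -> Dict[str, Any]:
    """Collect whichever of the documented cache files are present and parseable."""
    candidates = {
        "site_meta": root / "site-meta.json",
        "audit_scores": root / "audit-scores.json",
    }
    if url_slug:
        # Page-level context only applies when a URL was provided.
        candidates["page_analysis"] = root / "pages" / url_slug / "page-analysis.json"

    loaded: Dict[str, Any] = {}
    for name, path in candidates.items():
        data = load_cache_file(path, refresh)
        if data is not None:
            loaded[name] = data
    return loaded
```

Calling `load_seo_cache(Path(".seo-cache"))` returns only the entries that parsed cleanly, so callers can treat an absent key as "gather fresh evidence". Field-level validation against the schemas in `shared-data-cache.md` is left out of this sketch.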

This skill requires the Firecrawl extension to be installed:
```bash
./extensions/firecrawl/install.sh
```

**Check availability:** Before using any Firecrawl tool, verify that the MCP
server is connected by checking whether `firecrawl_scrape` or any other
Firecrawl tool is available. If no Firecrawl tool is available, tell the user
that the extension is not installed or its MCP server is not connected, and
provide the install instructions above.

## Quick Reference

| Command | Purpose |
|---------|---------|
| `/seo firecrawl crawl <url>` | Full-site crawl with content extraction |
| `/seo firecrawl map <url>` | Discover site structure (URLs only, fast) |
| `/seo firecrawl scrape <url>` | Single-page scrape with JS rendering |
| `/seo firecrawl search <query> <url>` | Search within a crawled site |

## Commands

### crawl -- Full-Site Crawl

Crawl an entire website starting from the given URL. Returns page content,
metadata, and links for all discovered pages.

**MCP Tool:** `firecrawl_crawl`

**Parameters:**
- `url` (required): Starting URL to crawl
- `limit`: Max pages to crawl (default: 100, max: 500)
- `maxDepth`: Max link depth from start URL (default: 3)
- `includePaths`: Array of glob patterns to include (e.g., `["/blog/*"]`)
- `excludePaths`: Array of glob patterns to exclude (e.g., `["/admin/*", "/api/*"]`)
- `scrapeOptions.formats`: Output formats -- `["markdown", "html", "links"]`
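
As an illustration of how the parameters above fit together, the following sketch assembles an argument dictionary for `firecrawl_crawl`. The helper `build_crawl_args` is hypothetical, and clamping `limit` to the documented maximum is an assumption about sensible client-side behavior, not something the tool is known to require.

```python
from typing import Any, Dict, Optional, Sequence

DEFAULT_LIMIT = 100      # documented default
MAX_LIMIT = 500          # documented maximum
DEFAULT_MAX_DEPTH = 3    # documented default


def build_crawl_args(
    url: str,
    limit: int = DEFAULT_LIMIT,
    max_depth: int = DEFAULT_MAX_DEPTH,
    include_paths: Optional[Sequence[str]] = None,
    exclude_paths: Optional[Sequence[str]] = None,
    formats: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Build the argument payload for a crawl using the documented parameter names."""
    if not url:
        raise ValueError("url is required")
    args: Dict[str, Any] = {
        "url": url,
        # Keep the page count inside the documented 1..500 range.
        "limit": max(1, min(limit, MAX_LIMIT)),
        "maxDepth": max_depth,
    }
    # Optional parameters are only included when the caller sets them.
    if include_paths:
        args["includePaths"] = list(include_paths)
    if exclude_paths:
        args["excludePaths"] = list(exclude_paths)
    if formats:
        args["scrapeOptions"] = {"formats": list(formats)}
    return args
```

For example, `build_crawl_args("https://example.com", include_paths=["/blog/*"], formats=["links"])` yields a payload for a section-focused crawl that returns only link data.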

**SEO Usage Patterns:**
1. **Comprehensive audit crawl**: Crawl full site, extract all pages for subagent analysis
2. **Section-focused crawl**: Use `includePaths` to audit only `/blog/*` or `/products/*`
3. **Broken link detection**: Crawl with `["links"]` format, check all hrefs for 404s
4. **Content inventory**: Extract all page titles, meta descriptions, H1s at scale
5. **SPA/JS-rendered sites**: Firecrawl renders JavaScript, which addresses the problem reported in Issue #11
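
The broken link detection pattern (3) could look something like the sketch below. It assumes the crawl output has already been reduced to a mapping from each page URL to the hrefs found on it; that shape, and the function names, are illustrative assumptions. The status check uses only the Python standard library.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a HEAD request, or 0 if the request failed."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        return err.code
    except (urllib.error.URLError, TimeoutError, ValueError):
        # Network failure, timeout, or malformed URL.
        return 0


def find_broken_links(
    pages: Dict[str, Iterable[str]],
    get_status: Callable[[str], int] = http_status,
) -> Dict[str, List[str]]:
    """Map each page URL to the links on it that return 404."""
    status_cache: Dict[str, int] = {}
    broken: Dict[str, List[str]] = {}
    for page_url, links in pages.items():
        for href in links:
            if href not in status_cache:
                # Check each distinct target once, however often it is linked.
                status_cache[href] = get_status(href)
            if status_cache[href] == 404:
                broken.setdefault(page_url, []).append(href)
    return broken
```

Passing `get_status` as a parameter keeps the logic testable without network access. Some servers reject HEAD requests, so a production check would likely fall back to GET.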

**Example orchestration for `/seo audit`:**
```
1. firecrawl_map(url) -> get all URLs (fast, no content)
2. Filter to top 50 most important pages (homepage, key sections)
3. firecrawl_crawl(url, limit=50) -> get full content
4. Feed content to seo-technical, seo-content, seo-schema agents
```
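
Step 2 of the orchestration does not say how "most important" is decided. One possible heuristic, shown here purely as an assumption, is to prefer shallow URLs (homepage first, then top-level sections) and keep discovery order among ties. The function names are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def path_depth(url: str) -> int:
    """Count non-empty path segments: '/' is 0, '/blog' is 1, '/blog/post' is 2."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])


def select_priority_urls(urls: Iterable[str], limit: int = 50) -> List[str]:
    """Pick up to `limit` URLs, shallowest paths first."""
    unique = list(dict.fromkeys(urls))  # de-duplicate, preserving order
    # sorted() is stable, so equally deep URLs keep their discovery order.
    return sorted(unique, key=path_depth)[:limit]
```

Other signals, such as internal link counts or prior scores from `.seo-cache/audit-scores.json`, could replace path depth as the sort key.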

**Cost awareness:**
- Free tier: 500 credits/month
- 1 credit = 1 page crawled or scraped
- Map operations are cheaper (0.5 credits per URL discovered)
- Always inform user of estimated credit usage before large crawls
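
A rough credit estimate, using only the rates listed above, might be computed like this before a large crawl. The function names are hypothetical, and the rates are taken from this document rather than from current Firecrawl pricing, so treat them as assumptions to re-check.

```python
FREE_TIER_CREDITS = 500.0      # credits per month on the free tier (per this document)
CREDITS_PER_PAGE = 1.0         # one page crawled or scraped
CREDITS_PER_MAPPED_URL = 0.5   # one URL discovered by a map operation


def estimate_credits(pages: int = 0, mapped_urls: int = 0) -> float:
    """Estimate credits for a planned crawl/scrape plus map discovery."""
    if pages < 0 or mapped_urls < 0:
        raise ValueError("counts must be non-negative")
    return pages * CREDITS_PER_PAGE + mapped_urls * CREDITS_PER_MAPPED_URL


def describe_usage(
    pages: int = 0, mapped_urls: int = 0, remaining: float = FREE_TIER_CREDITS
) -> str:
    """Produce a one-line message to show the user before starting."""
    cost = estimate_credits(pages, mapped_urls)
    verdict = "within" if cost <= remaining else "exceeds"
    return f"Estimated usage: {cost:g} credits ({verdict} the {remaining:g} credits remaining)"
```

For a map of 342 URLs followed by a 50-page crawl, `describe_usage(50, 342)` reports 221 credits, which fits inside a fresh free-tier allowance.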

### map -- Site Structure Discovery

Discover all URLs on a website without fetching content. Fast and credit-efficient.

**MCP Tool:** `firecrawl_map`

**Parameters:**
- `url` (required): Website URL to map
- `limit`: Max URLs to discover (default: 5000)
- `search`: Optional search term to filter URLs

**SEO Usage Patterns:**
1. **Sitemap comparison**: Map site, compare discovered URLs vs XML sitemap
2. **Orphan page detection**: URLs in sitemap but not linked from any page
3. **Crawl budget analysis**: Total indexable pages vs pages linked from homepage
4. **URL pattern analysis**: Identify URL structure patterns, duplicates, parameter bloat
5. **Pre-audit discovery**: Run map first, then targeted crawl on key sections
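
The sitemap comparison and orphan page patterns reduce to set differences once both URL lists are normalized. The sketch below assumes the XML sitemap has already been parsed into a list of URLs; the normalization rules (lowercase host, no trailing slash, query strings dropped) and the function names are illustrative choices, not part of the skill.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the trailing slash, query, and fragment."""
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def compare_with_sitemap(
    mapped_urls: Iterable[str], sitemap_urls: Iterable[str]
) -> Dict[str, List[str]]:
    """Report URLs that appear in only one of the two sources."""
    mapped = {normalize(u) for u in mapped_urls}
    sitemap = {normalize(u) for u in sitemap_urls}
    return {
        # Discovered on the site but absent from the XML sitemap.
        "missing_from_sitemap": sorted(mapped - sitemap),
        # Listed in the sitemap but not discovered: orphan page candidates.
        "only_in_sitemap": sorted(sitemap - mapped),
    }
```

Entries under `only_in_sitemap` are only candidates: if the map operation itself reads the sitemap, or stops at its `limit`, the difference will under- or over-report orphans.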

**Output:** Array of URLs. Present as:
```
Site: example.com
Pages discovered: 342

URL Pattern Breakdown:
  /blog/*          - 128 pages (37%)
  /products/*      - 89 pages (26%)
  /category/*      - 45 pages (13%)
  /pages/*         - 32 pages (9%)
  / (root pages)   - 48 pages (14%)
```
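
A breakdown like the one above can be derived from the URL array with a simple grouping pass. This sketch assumes that URLs with at most one path segment count as root pages and that everything else is grouped by its first segment; the document does not specify the grouping rule, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def pattern_for(url: str) -> str:
    """Group a URL under '/<first-segment>/*', or as a root page if it is shallow."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) <= 1:
        return "/ (root pages)"
    return f"/{segments[0]}/*"


def url_pattern_breakdown(urls: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (pattern, page count, rounded percentage) rows, largest group first."""
    counts = Counter(pattern_for(u) for u in urls)
    total = sum(counts.values())
    # An empty input yields an empty list, so no division by zero occurs.
    return [(pattern, n, round(100 * n / total)) for pattern, n in counts.most_common()]
```

Each row can then be printed in the layout shown above, for example with `f"  {pattern:<16} - {n} pages ({pct}%)"`.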

### scrape -- Single-Page Deep Scrape

Scrape a single page with full JavaScript rendering. More thorough than
`fetch_page.py` because it executes JS and waits for dynamic content.

**MCP Tool:** `firecrawl_scrape`

**Parameters:**
- `url` (required): Page URL to scrape
- `formats`: Output formats -- `["markdown", "html", "links", "screenshot"]`
- `onlyMainContent`: Strip nav/footer/sidebar (default: true)
- `waitFor`: CSS selector or milliseconds to wait for content
- `timeout`: Request timeout in ms (default: 30000)
- `actions`: Browser actions before scraping (click, scroll, wait)

**SEO Usage Patterns:**
1. **SPA content extraction**: Scrape JS-rendered React/Vue/Angular pages
2. **Dynamic content audit**: Pages with lazy-loaded content below the fold
3. **Paywall/login detection**: Identify content behind authentication walls
4. **Main content extraction**: Use `onlyMainContent` for clean E-E-A-T analysis
5. **Screenshot capture**: Use `screenshot` format for visual analysis

**When to use scrape vs `fetch_page.py`:**

| Scenario | Use |
|----------|-----|
| Static HTML page | `fetch_page.py` (no API cost) |
| JS-rendered SPA | `firecrawl_scrape` (renders JS) |
| Need response headers | `fetch_page.py` (returns headers) |
| Need clean markdown | `firecrawl_scrape` (better extraction) |
| Rate-limited/blocked | `firecrawl_scrape` (handles anti-bot) |
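
The table can be read as a small decision rule. The sketch below encodes it under one added assumption: when a page needs both response headers and JS rendering, a case the table does not cover, both tools are used. The function name `choose_fetchers` is hypothetical.

```python
from typing import List


def choose_fetchers(
    js_rendered: bool = False,
    need_headers: bool = False,
    need_markdown: bool = False,
    blocked: bool = False,
) -> List[str]:
    """Return the tool(s) suggested by the scenario table for one page."""
    tools: List[str] = []
    if need_headers:
        # Only fetch_page.py is documented as returning response headers.
        tools.append("fetch_page.py")
    if js_rendered or need_markdown or blocked:
        # JS rendering, clean markdown, and anti-bot handling all point to Firecrawl.
        tools.append("firecrawl_scrape")
    if not tools:
        # Plain static HTML: prefer the option with no API cost.
        tools.append("fetch_page.py")
    return tools
```

Defaulting to `fetch_page.py` keeps credit usage at zero unless a scenario from the table actually calls for Firecrawl.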

### search -- Site-Scoped Search

Search within a website for specific content. Useful for finding pages
related to a topic without crawling everything.

**MCP Tool:** `firecrawl_search`

**Parameters:**
- `query` (required): Search query
- `url` (required): Website to search within
- `limit`: Max results (default: 10)
- `scrapeOptions.formats`: Output format for matched pages

**SEO Usage Patterns:**
1. **Content gap validation**: Search for a keyword on the site to check if con