Skill · 111 repo stars · updated 2mo ago

media-processor

The media-processor skill provides tools for extracting precise visual details from images, transcribing audio and video, generating images, and converting documents to Markdown. Use it when a task needs accurate analysis of UI mockups, screenshots, charts, or design files, where specific values such as hex colors, spacing measurements, and component hierarchies must be identified. It also covers processing media files and complex document layouts through Python scripts that call the Gemini API.

Repository: claude-prime

Install in Claude Code

git clone --depth 1 https://github.com/avibebuilder/claude-prime /tmp/media-processor && cp -r /tmp/media-processor/.claude/skills/media-processor ~/.claude/skills/media-processor

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Media Processor

Specialized tools for extracting precise visual details (exact colors, spacing, hierarchy), processing audio/video, and generating images.

## Tools

All scripts live in `scripts/` relative to this skill's directory. They auto-select the best model per task and handle retries, large file uploads, and error reporting.

| Script | Purpose |
|--------|---------|
| `gemini_batch_process.py` | Analyze images, transcribe audio/video, extract data from PDFs |
| `image_gen.py` | Generate and edit images (paid plan required) |
| `document_converter.py` | Convert PDF, DOCX, XLSX, PPTX to Markdown; extract page ranges and images |

Requires `GEMINI_API_KEY` in environment or `.env` in this skill's directory. Run any script with `--help` for setup details and available parameters.
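
The key lookup described above can be sketched as a pre-flight check before calling any script. This is a minimal illustration, not the scripts' actual loader: the function name `resolve_api_key`, the precedence (environment first, then `.env`), and the simple `KEY=value` parsing are all assumptions.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_api_key(skill_dir: Path, name: str = "GEMINI_API_KEY") -> Optional[str]:
    """Return the key from the environment, else from <skill_dir>/.env, else None.

    Hypothetical helper: the precedence and the .env parsing are assumptions.
    """
    value = os.environ.get(name)
    if value:
        return value

    env_file = skill_dir / ".env"
    if not env_file.is_file():
        return None

    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not KEY=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes; treat an empty value as missing.
            return val.strip().strip("'\"") or None
    return None
```

If this returns `None`, running the script with `--help` (as noted above) is the documented way to find the setup details.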

**Quick start — image analysis:**
```bash
python <skill-dir>/scripts/gemini_batch_process.py \
  --files <image-path> \
  --task analyze \
  --prompt "<tailored prompt>" \
  --output <output-path>.md
```
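
When the quick-start command is driven from Python rather than a shell, it may help to assemble it as an argument list so that prompts containing quotes or newlines need no escaping. The sketch below uses only the flags shown in this document (`--files`, `--task`, `--prompt`, `--output`, and the optional `--model` from the Model Overrides section); the helper name `build_analyze_command` is hypothetical.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_analyze_command(
    skill_dir: str,
    image: str,
    prompt: str,
    output: str,
    model: Optional[str] = None,
) -> List[str]:
    """Build the argv list for an image-analysis run (hypothetical helper)."""
    script = Path(skill_dir) / "scripts" / "gemini_batch_process.py"
    cmd = [
        sys.executable, str(script),
        "--files", image,
        "--task", "analyze",
        "--prompt", prompt,
        "--output", output,
    ]
    if model:
        # Only pass --model when overriding the script's default choice.
        cmd += ["--model", model]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which raises if the script exits with a non-zero status.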

## Prompt Quality Matters

The prompt sent to the processing model is the single biggest factor in output quality. Tailor prompts to what the task actually needs — generic prompts produce generic results.

**What makes a good analysis prompt:**
- Ask for the specific details the task requires (hex colors, spacing in px, component hierarchy) rather than "describe this image"
- Structure the ask as a numbered list — the model mirrors the structure back, making output easy to parse
- Name the desired output format ("as a markdown table", "as JSON", "as a component tree")
- Include implementation context when relevant ("for React with Tailwind") so the model emphasizes useful details
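
The guidelines above can be sketched as a small prompt builder: specific details become a numbered list, followed by the output format and optional implementation context. The function name `build_analysis_prompt` and the exact wording of the generated lines are illustrative assumptions, not part of the skill.

```python
from typing import Optional, Sequence


def build_analysis_prompt(
    details: Sequence[str],
    output_format: str,
    context: Optional[str] = None,
) -> str:
    """Compose a numbered, format-specific analysis prompt (hypothetical helper)."""
    if not details:
        raise ValueError("list at least one specific detail to extract")

    lines = ["Analyze the attached image and report:"]
    # Numbered items encourage the model to mirror the structure in its answer.
    lines += [f"{i}. {detail}" for i, detail in enumerate(details, start=1)]
    lines.append(f"Output format: {output_format}.")
    if context:
        lines.append(f"Implementation context: {context}.")
    return "\n".join(lines)


prompt = build_analysis_prompt(
    ["exact hex colors", "spacing in px", "component hierarchy"],
    output_format="a markdown table",
    context="React with Tailwind",
)
```

The returned string is what would be passed to `--prompt`.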

**Example prompt patterns:**

UI implementation: *"Extract component hierarchy, layout type, exact hex colors, typography (sizes/weights), spacing in px, interactive states, icons and decorative elements"*

Chart data: *"Extract chart type, axes with units, every data point with exact values, legend entries with colors. Output as a markdown table"*

Design review: *"Compare this screenshot against the design. Flag differences in spacing, colors, alignment, missing elements, and visual inconsistencies. Note exact values for each discrepancy"*

## Pasted Images

When a user pastes images in chat, they are auto-saved to:
```
$CLAUDE_DIR/image-cache/<current_session_id>/<image_number>.png
```
Use `ls "$CLAUDE_DIR/image-cache/"` to discover the session ID, then list its contents to find available images.
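
The same discovery step can be sketched in Python. Note the assumption: the document does not say how to tell which session is current, so this sketch guesses that the most recently modified directory under `image-cache/` belongs to the current session. The helper name `list_pasted_images` is hypothetical.

```python
from pathlib import Path
from typing import List


def list_pasted_images(claude_dir: Path) -> List[Path]:
    """List PNGs from the newest session folder, ordered by image number.

    Assumption: the most recently modified session directory is the current one.
    """
    cache = claude_dir / "image-cache"
    if not cache.is_dir():
        return []

    sessions = [p for p in cache.iterdir() if p.is_dir()]
    if not sessions:
        return []

    latest = max(sessions, key=lambda p: p.stat().st_mtime)
    # Sort numerically so 10.png comes after 2.png; non-numeric names go last.
    return sorted(
        latest.glob("*.png"),
        key=lambda p: int(p.stem) if p.stem.isdigit() else float("inf"),
    )
```

Called as `list_pasted_images(Path(os.environ["CLAUDE_DIR"]))` (after `import os`), it returns paths that can be passed to `--files`.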

## Model Overrides

Scripts auto-select models per task (see [model-routing.md](./references/model-routing.md)). Override with `--model <model-id>` when the default isn't enough — for example, `--model gemini-3.1-pro-preview` for complex visual analysis where the pro model catches more detail than flash.

## References

| Reference | When to read |
|-----------|-------------|
| [api-gotchas.md](./references/api-gotchas.md) | Before using image generation, video processing, or raw API calls — prevents common failures |
| [model-routing.md](./references/model-routing.md) | When choosing or overriding the default model for a task |
| [media-optimization.md](./references/media-optimization.md) | When files are too large to upload — ffmpeg compression recipes |

## Gotchas

- **Rate limits** — scripts retry up to 3 times with backoff. If still rate-limited after retries, stop and ask the user to check their API key quota or provide a new key.
- **Model IDs change** — Google frequently rotates preview model IDs. If you get a 404, the model was likely superseded — check the [models page](https://ai.google.dev/gemini-api/docs/models) for current IDs.
- **Safety filters** — the API may refuse some content. Report clearly to the user rather than retrying.
- **Large files auto-upload** — files >20MB automatically use the File API (2GB max, 48h retention). No action needed.
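
For callers that wrap their own API calls, the retry behavior in the rate-limit note can be approximated as below. This is a sketch under stated assumptions, not the scripts' implementation: the source says only "up to 3 times with backoff", so the exponential schedule, the `base_delay` value, and the `RateLimited` exception are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for whatever error signals a rate limit."""


def call_with_retries(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying up to `retries` times on RateLimited.

    Assumed schedule: delays of base_delay * 1, 2, 4, ... between attempts.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    attempt = 0
    while True:
        try:
            return fn()
        except RateLimited:
            if attempt == retries:
                # Retries exhausted: surface the error so the user can be asked
                # to check their quota or supply a new key.
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

Only rate-limit errors are retried here; a safety-filter refusal should propagate immediately, in line with the note above about reporting it rather than retrying.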
