ClaudeWave
Skill · 2.1k repo stars · updated 2mo ago

ai-multimodal

The ai-multimodal skill provides a unified interface to Google Gemini's multimodal API for processing and generating multimedia content. It handles audio transcription and analysis up to 9.5 hours, image understanding including OCR and object detection, video analysis up to 6 hours with scene detection, PDF document extraction with structured output, and text-to-image generation with editing capabilities. Use this skill when analyzing audio or video files, extracting data from images or PDFs, performing visual question-answering, or generating images from text prompts across Gemini 2.5 and 2.0 models.

Install in Claude Code
git clone --depth 1 https://github.com/mrgoonie/claudekit-skills /tmp/ai-multimodal && cp -r /tmp/ai-multimodal/.claude/skills/ai-multimodal ~/.claude/skills/ai-multimodal
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# AI Multimodal Processing Skill

Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.

## Core Capabilities

### Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice

### Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction

### Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis

### Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)

### Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality

## Capability Matrix

| Task | Audio | Image | Video | Document | Generation |
|------|:-----:|:-----:|:-----:|:--------:|:----------:|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |

## Model Selection Guide

### Gemini 2.5 Series (Recommended)
- **gemini-2.5-pro**: Highest quality, all features, 1M-2M context
- **gemini-2.5-flash**: Best balance, all features, 1M-2M context
- **gemini-2.5-flash-lite**: Lightweight, segmentation support
- **gemini-2.5-flash-image**: Image generation only

### Gemini 2.0 Series
- **gemini-2.0-flash**: Fast processing, object detection
- **gemini-2.0-flash-lite**: Lightweight option

### Feature Requirements
- **Segmentation**: Requires 2.5+ models
- **Object Detection**: Requires 2.0+ models
- **Multi-video**: Requires 2.5+ models
- **Image Generation**: Requires flash-image model

### Context Windows
- **2M tokens**: ~6 hours video (low-res) or ~2 hours (default)
- **1M tokens**: ~3 hours video (low-res) or ~1 hour (default)
- **Audio**: 32 tokens/second (1 min = 1,920 tokens)
- **PDF**: 258 tokens/page (fixed)
- **Image**: 258-1,548 tokens based on size
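
These figures make it possible to estimate token usage before uploading anything. The sketch below only restates the arithmetic from the list above; the function names are hypothetical, and the video estimate assumes the approximate hours-per-context figures scale linearly.

```python
AUDIO_TOKENS_PER_SECOND = 32  # audio rate from the list above
PDF_TOKENS_PER_PAGE = 258     # fixed cost per PDF page

# Approximate hours of video per 1M tokens of context, from the list above.
VIDEO_HOURS_PER_MILLION_TOKENS = {"default": 1, "low": 3}


def audio_tokens(seconds: float) -> int:
    """Estimate tokens consumed by an audio clip of the given length."""
    return round(seconds * AUDIO_TOKENS_PER_SECOND)


def pdf_tokens(pages: int) -> int:
    """Estimate tokens consumed by a PDF with the given page count."""
    return pages * PDF_TOKENS_PER_PAGE


def max_video_hours(context_tokens: int, resolution: str = "default") -> float:
    """Estimate how many hours of video fit in a context window."""
    return context_tokens / 1_000_000 * VIDEO_HOURS_PER_MILLION_TOKENS[resolution]


print(audio_tokens(60))                     # 1920
print(pdf_tokens(1000))                     # 258000
print(max_video_hours(2_000_000, "low"))    # 6.0
```

By the same arithmetic, a 9.5-hour audio file comes to 1,094,400 tokens, which is slightly more than a 1M-token window.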

## Quick Start

### Prerequisites

**API Key Setup**: Supports both Google AI Studio and Vertex AI.

The skill checks for `GEMINI_API_KEY` in this order:
1. Process environment: `export GEMINI_API_KEY="your-key"`
2. Project root: `.env`
3. `.claude/.env`
4. `.claude/skills/.env`
5. `.claude/skills/ai-multimodal/.env`
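
The lookup order above can be sketched as follows. This is an illustration under assumptions, not the skill's actual code: `find_api_key` and `read_env_value` are hypothetical helpers, the `.env` parsing is deliberately simple (plain `KEY=VALUE` lines), and the skill itself may rely on `python-dotenv` instead.

```python
import os
from pathlib import Path
from typing import Optional

# Search order from the list above, relative to the project root.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_value(path: Path, name: str) -> Optional[str]:
    """Return the value of `name` from a simple KEY=VALUE file, if present."""
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if sep and key.strip() == name:
            return value.strip().strip("\"'")
    return None


def find_api_key(project_root: Path, environ=None) -> Optional[str]:
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    key = environ.get("GEMINI_API_KEY")
    if key:
        return key
    for relative in ENV_FILES:
        key = read_env_value(project_root / relative, "GEMINI_API_KEY")
        if key:
            return key
    return None
```

Because the first match wins, a key exported in the shell overrides any `.env` file, and the project-root `.env` overrides the ones under `.claude/`.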

**Get API key**: https://aistudio.google.com/apikey

**For Vertex AI**:
```bash
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
```
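
One way these variables could map onto client settings is sketched below. The helper `client_kwargs` is hypothetical, and the fallback to `us-central1` when `VERTEX_LOCATION` is unset is an assumption based on the "Optional" comment above; the skill's real behavior may differ.

```python
import os


def client_kwargs(environ=None) -> dict:
    """Build keyword arguments for a Gemini client from environment variables."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_USE_VERTEX", "").lower() == "true":
        return {
            "vertexai": True,
            "project": environ["VERTEX_PROJECT_ID"],
            # Assumption: us-central1 is used when VERTEX_LOCATION is unset.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    # Otherwise fall back to a Google AI Studio API key.
    return {"api_key": environ["GEMINI_API_KEY"]}
```

With the `google-genai` package installed, the result could be passed to the SDK as `genai.Client(**client_kwargs())`, since `api_key`, `vertexai`, `project`, and `location` are keyword arguments of that constructor.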

**Install SDK**:
```bash
pip install google-genai python-dotenv pillow
```

### Common Patterns

**Transcribe Audio**:
```bash
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
```

**Analyze Image**:
```bash
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```

**Process Video**:
```bash
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```

**Extract from PDF**:
```bash
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
```

**Generate Image**:
```bash
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
```
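
The `gemini_batch_process.py` invocations above share one set of flags, so they can be assembled programmatically. This is a minimal sketch using only the flags shown in those examples; `build_command` is a hypothetical helper, and passing several paths after a single `--files` is an assumption about the script's argument parsing.

```python
import sys
from typing import List, Optional, Sequence


def build_command(
    task: str,
    files: Sequence[str] = (),
    prompt: Optional[str] = None,
    output: Optional[str] = None,
    model: Optional[str] = None,
    fmt: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for scripts/gemini_batch_process.py."""
    cmd = [sys.executable, "scripts/gemini_batch_process.py"]
    if files:
        # Assumption: multiple paths may follow one --files flag.
        cmd += ["--files", *files]
    cmd += ["--task", task]
    optional_flags = [
        ("--prompt", prompt),
        ("--output", output),
        ("--model", model),
        ("--format", fmt),
        ("--aspect-ratio", aspect_ratio),
    ]
    for flag, value in optional_flags:
        if value is not None:
            cmd += [flag, value]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the skill directory; using a list rather than a shell string avoids quoting problems with prompts that contain spaces.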

**Optimize Media**:
```bash
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
```

**Convert Documents to Markdown**:
```bash
# Convert to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
```

## Supported Formats

### Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono

### Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Token cost: images with both dimensions ≤384px cost 258 tokens; larger images are tiled

### Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)

### Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)

### Size Limits
- **Inline**: <20MB total request
- **File API**: 2GB per file, 20GB project quota
- **Retention**: 48 hours auto-delete
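
These limits suggest a simple routing rule for each file. The sketch below is an assumption-laden illustration: `upload_strategy` is a hypothetical helper, "MB" and "GB" are treated as binary units, and the inline limit applies to the whole request, so the prompt and any other attachments also count toward it.

```python
MB = 1024 * 1024                  # assumption: binary megabytes
INLINE_LIMIT = 20 * MB            # total inline request must stay under this
FILE_API_LIMIT = 2 * 1024 * MB    # per-file ceiling for the File API


def upload_strategy(size_bytes: int) -> str:
    """Pick how a file of the given size could be sent."""
    if size_bytes < INLINE_LIMIT:
        return "inline"
    if size_bytes <= FILE_API_LIMIT:
        return "file_api"
    # Beyond the File API ceiling the file must be shrunk or split first.
    return "too_large"


print(upload_strategy(5 * MB))      # inline
print(upload_strategy(500 * MB))    # file_api
```

A file that comes back as `too_large` would first need to be reduced, for example with `scripts/media_optimizer.py`. Files sent through the File API are deleted automatically after 48 hours, so long-running workflows may need to upload again.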

## Reference Navigation

For detailed implementation guidance, see:

### Audio Processing
- `references/audio-processing.md` - Transcription, analysis, TTS
  - Timestamp handling and segment analysis