
video-generation

This skill generates videos from text prompts or images using the `generate_media` function with `mode="video"`. It supports three backends (Grok, Google Veo, and OpenAI Sora) with automatic selection based on available API keys, offering customizable parameters like duration (1-15 seconds depending on backend), aspect ratio, resolution, and reference images. Use it when you need to create videos with specific visual descriptions, style guidance, or starting frames across multiple AI video generation services.

Install in Claude Code
git clone --depth 1 https://github.com/massgen/MassGen /tmp/video-generation && cp -r /tmp/video-generation/massgen/skills/video-generation ~/.claude/skills/video-generation
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Video Generation

Generate videos using `generate_media` with `mode="video"`. The system auto-selects the highest-priority backend for which an API key is available.

## Quick Start

```python
# Simple text-to-video (auto-selects backend)
generate_media(prompt="A robot walking through a city", mode="video")

# Specify backend and duration
generate_media(prompt="Ocean waves crashing on rocks", mode="video",
               backend_type="google", duration=8)

# With aspect ratio
generate_media(prompt="A timelapse of clouds", mode="video",
               backend_type="grok", aspect_ratio="16:9", duration=10)
```

## Backend Comparison

| Backend | Default Model | Duration Range | Default Duration | Resolutions | API Key |
|---------|--------------|----------------|-----------------|-------------|---------|
| **Grok** (priority 1) | `grok-imagine-video` | 1-15s | 5s | 480p, 720p | `XAI_API_KEY` |
| **Google Veo** (priority 2) | `veo-3.1-generate-preview` | 4-8s | 8s | 720p, 1080p, 4k (set via `size`); default aspect ratio 16:9 | `GOOGLE_API_KEY` |
| **OpenAI Sora** (priority 3) | `sora-2` | 4, 8, or 12s (discrete) | 4s | Standard | `OPENAI_API_KEY` |
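
The priority column can be read as a simple first-match lookup over the API keys. The sketch below is only an illustration of that reading: `pick_video_backend` and `BACKEND_PRIORITY` are hypothetical names, and the real selection logic inside `generate_media` may differ.

```python
import os

# Priority order and key names are taken from the table above.
# The function and constant names are hypothetical, not MassGen's API.
BACKEND_PRIORITY = [
    ("grok", "XAI_API_KEY"),
    ("google", "GOOGLE_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def pick_video_backend(env=None):
    """Return the first backend whose API key is set, or None if none is."""
    env = os.environ if env is None else env
    for backend, key_name in BACKEND_PRIORITY:
        if env.get(key_name):
            return backend
    return None


# With only Google and OpenAI keys present, Google wins on priority.
print(pick_video_backend({"GOOGLE_API_KEY": "g", "OPENAI_API_KEY": "o"}))
```

Passing `backend_type` explicitly bypasses this kind of lookup entirely.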

## Key Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `prompt` | Text description of the video | `"A drone flying over mountains"` |
| `backend_type` | Force a specific backend | `"grok"`, `"google"`, `"openai"` |
| `model` | Override default model | `"veo-3.1-generate-preview"` |
| `duration` | Video length in seconds | `8` (clamped to backend limits) |
| `aspect_ratio` | Video aspect ratio | `"16:9"`, `"9:16"`, `"1:1"` |
| `size` | Resolution (Grok: 480p/720p; Veo: 720p/1080p/4k) | `"720p"`, `"1080p"`, `"4k"` |
| `input_images` | Source image for image-to-video | `["starting_frame.jpg"]` |
| `video_reference_images` | Style/content guide images (Veo, up to 3) | `["ref1.png", "ref2.png"]` |
| `negative_prompt` | What to exclude (Veo) | `"blurry, low quality"` |

## Duration Handling

Each backend has different duration constraints. `generate_media` automatically clamps the requested duration:

- **Grok**: Continuous range 1-15s (clamped to bounds)
- **Google Veo**: Continuous range 4-8s (clamped to bounds), defaults to 16:9 aspect ratio
- **OpenAI Sora**: Discrete values only (4, 8, or 12s) - snaps to nearest valid value

A warning is logged if duration is adjusted.
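
The rules above can be sketched as a small helper. This is a minimal illustration, not the code `generate_media` actually runs: `adjust_duration` is a hypothetical name, and the tie-breaking rule for Sora (the lower value wins when two valid durations are equally close) is an assumption the source does not specify.

```python
# Limits copied from the list above; names are hypothetical.
CONTINUOUS_LIMITS = {"grok": (1, 15), "google": (4, 8)}
SORA_DURATIONS = (4, 8, 12)


def adjust_duration(backend, requested):
    """Clamp or snap a requested duration (seconds) to what a backend accepts."""
    if backend in CONTINUOUS_LIMITS:
        low, high = CONTINUOUS_LIMITS[backend]
        return max(low, min(high, requested))
    if backend == "openai":
        # Assumption: on a tie, min() keeps the earlier (lower) value.
        return min(SORA_DURATIONS, key=lambda valid: abs(valid - requested))
    raise ValueError(f"unknown backend: {backend!r}")


print(adjust_duration("grok", 20))    # 15
print(adjust_duration("google", 2))   # 4
print(adjust_duration("openai", 11))  # 12
```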

## Image-to-Video

All three video backends support starting a video from an existing image via `input_images`:

```python
generate_media(
    prompt="Animate this scene with gentle movement",
    mode="video",
    input_images=["scene.jpg"],
    duration=5
)
```

The first image in `input_images` is used; additional images are ignored.

## Generation Time

Video generation is significantly slower than image generation. All backends use polling:
- **Grok**: SDK handles polling internally (up to 10 min timeout)
- **Google Veo**: Custom polling every 20s (up to 10 min)
- **OpenAI Sora**: Custom polling every 2s
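
For readers unfamiliar with the pattern, a generic poll-with-timeout loop looks roughly like the sketch below. Everything here is an assumption for illustration: `poll_until_done`, the `check_status` callable, and the status strings are hypothetical and do not correspond to any backend SDK.

```python
import time


def poll_until_done(check_status, interval_s, timeout_s,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check_status() every interval_s seconds until it reports a
    terminal state, raising TimeoutError once timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        status = check_status()
        # Hypothetical terminal states; real APIs use their own names.
        if status in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Simulated job that finishes on the third check (no real waiting).
fake_statuses = iter(["queued", "in_progress", "completed"])
result = poll_until_done(lambda: next(fake_statuses), interval_s=20,
                         timeout_s=600, sleep=lambda seconds: None)
print(result)  # completed
```

With a 20s interval and a 10 minute timeout, as described for Veo, such a loop would check the job at most about 30 times.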

## Veo 3.1: Native Audio

Veo 3.1 generates audio (dialogue, SFX, ambient) automatically from prompt content. No extra parameter needed — just describe the sounds:

- **Dialogue**: Use quotation marks in prompt (`"Hello," she said.`)
- **Sound effects**: Describe sounds (`tires screeching, engine roaring`)
- **Ambient**: Describe atmosphere (`eerie hum resonates through the hallway`)
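
One way to keep the three audio cues consistent across prompts is to assemble them with a small helper. This is a sketch only: `build_audio_prompt` is a hypothetical function, and the sentence templates are illustrative phrasing rather than a format Veo requires.

```python
def build_audio_prompt(visual, dialogue=None, sound_effects=(), ambient=None):
    """Combine a visual description with dialogue, SFX, and ambient cues.

    dialogue is an optional (line, speaker) pair; the line is wrapped in
    quotation marks, as suggested above.
    """
    parts = [visual.strip()]
    if dialogue:
        line, speaker = dialogue
        parts.append(f'"{line}" {speaker} said.')
    if sound_effects:
        parts.append("Sounds of " + ", ".join(sound_effects) + ".")
    if ambient:
        parts.append(ambient[0].upper() + ambient[1:] + ".")
    return " ".join(parts)


prompt = build_audio_prompt(
    "A woman steps into a dim hallway.",
    dialogue=("Hello,", "she"),
    sound_effects=["tires screeching", "engine roaring"],
    ambient="eerie hum resonates through the hallway",
)
print(prompt)
```

The resulting string is passed as the ordinary `prompt` argument; no audio-specific parameter is involved.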

## Veo 3.1: Extension Constraints

When extending videos via `continue_from` with a `veo_vid_*` ID:
- Resolution is forced to **720p** (API requirement for extensions)
- Only **16:9** and **9:16** aspect ratios are supported
- Each extension adds up to 7 seconds (API limit: 20 extensions, ~141s total)
- Generated videos are retained for 2 days before expiry
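
The first two constraints can be checked before a request is sent. The helper below is a hypothetical pre-flight check written for illustration; `extension_settings` is not part of MassGen, and how `generate_media` enforces these rules internally is not shown in this document.

```python
# Constraints copied from the list above; names are hypothetical.
EXTENSION_ASPECT_RATIOS = {"16:9", "9:16"}
EXTENSION_RESOLUTION = "720p"


def extension_settings(aspect_ratio, size=None):
    """Return the settings an extension request would end up with.

    The requested size is ignored because extensions are forced to 720p;
    an unsupported aspect ratio is rejected up front.
    """
    if aspect_ratio not in EXTENSION_ASPECT_RATIOS:
        raise ValueError(
            f"extensions support only 16:9 and 9:16, got {aspect_ratio!r}"
        )
    return {"aspect_ratio": aspect_ratio, "size": EXTENSION_RESOLUTION}


print(extension_settings("9:16", size="1080p"))
```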

## Producing Longer Videos

Current APIs cap a single clip at **15 seconds** (Grok); Veo tops out at 8s and Sora at 12s. No backend can generate a continuous 30+ second video in one call. The proven approach:

1. **Plan a shot list** — break your video into 6-8s segments with specific camera language per shot
2. **Generate clips in parallel** — launch all segments concurrently using `background=True`
3. **Composite in Remotion** (see below) — layer programmatic animation on top of generated footage
4. **Bridge with audio** — a unified narration or music track smooths over visual cuts between clips

For visual continuity, use the same **style anchor** in every prompt (e.g., "BBC Earth documentary cinematography") and maintain consistent lighting/color descriptions.

**Full production guide with examples, transition types, and duration strategy**: See [references/production.md](references/production.md)
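
The shot-list and style-anchor steps can be sketched as plain data preparation. The helpers `split_runtime` and `build_clip_requests` are hypothetical names, and the prompt template is an assumption; only `prompt`, `mode`, `duration`, and `background` come from this document. The sketch builds keyword arguments and does not call `generate_media` itself.

```python
import math


def split_runtime(total_seconds, max_segment_seconds=8):
    """Split a whole-second runtime into near-equal segments under the cap."""
    if total_seconds <= 0 or max_segment_seconds <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / max_segment_seconds)
    base, extra = divmod(total_seconds, count)
    return [base + 1 if i < extra else base for i in range(count)]


def build_clip_requests(shots, style_anchor):
    """Turn (description, seconds) pairs into generate_media keyword arguments,
    appending the same style anchor to every prompt."""
    return [
        {
            "prompt": f"{description} Style: {style_anchor}.",
            "mode": "video",
            "duration": seconds,
            "background": True,  # launch clips concurrently, per step 2
        }
        for description, seconds in shots
    ]


lengths = split_runtime(30)  # [8, 8, 7, 7]
descriptions = [
    "Wide aerial shot of a misty valley at dawn.",
    "Slow dolly toward a river bend.",
    "Close-up of a heron taking flight.",
    "Pull back to reveal the whole valley.",
]
requests = build_clip_requests(
    zip(descriptions, lengths),
    style_anchor="BBC Earth documentary cinematography",
)
print(requests[0])
```

Each dictionary could then be unpacked into a call such as `generate_media(**request)`.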

## Hybrid Workflow: AI Footage + Remotion Animation

**The best results come from combining AI-generated footage with Remotion's programmatic animation — not choosing one or the other.**

AI video generation produces photorealistic, cinematic footage that pure programmatic rendering cannot match. Remotion produces precise typography, motion graphics, overlays, and transitions that AI generation cannot reliably control. Use both together.

### The Rule: Generate First, Composite Second

1. **Generate AI clips** for cinematic/photorealistic shots (environments, product demos, atmospheric footage)
2. **Use those clips as visual foundations** in Remotion — import them as `<Video>` or `<OffthreadVideo>` background layers
3. **Composite programmatic elements on top** — typography, motion graphics, logos, data overlays, transitions, captions
4. **Fill gaps with pure Remotion animation** — title cards, intro sequences, motion-graphics-only segments where AI footage isn't needed

### Do NOT Discard Generated Clips

**Every AI-generated clip costs real money and time. Do not abandon generated footage and fall back to purely programmatic rendering.** This is a common failure mode — agents generate clips, notice minor artifacts (e.g., repeated patterns, slight distortion), then pivot entirely to OpenCV/PIL/moviepy rendering, wasting all the generation budget.