
video-generation

This skill generates videos from text prompts or images using the `generate_media` function with `mode="video"`. It supports three backends (Grok, Google Veo, and OpenAI Sora) with automatic selection based on available API keys, offering customizable parameters like duration (1-15 seconds depending on backend), aspect ratio, resolution, and reference images. Use it when you need to create videos with specific visual descriptions, style guidance, or starting frames across multiple AI video generation services.

Repository: MassGen

Install in Claude Code

git clone --depth 1 https://github.com/massgen/MassGen /tmp/video-generation && cp -r /tmp/video-generation/massgen/skills/video-generation ~/.claude/skills/video-generation

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Video Generation

Generate videos using `generate_media` with `mode="video"`. The system auto-selects the best backend based on available API keys.

## Quick Start

```python
# Simple text-to-video (auto-selects backend)
generate_media(prompt="A robot walking through a city", mode="video")

# Specify backend and duration
generate_media(prompt="Ocean waves crashing on rocks", mode="video",
               backend_type="google", duration=8)

# With aspect ratio
generate_media(prompt="A timelapse of clouds", mode="video",
               backend_type="grok", aspect_ratio="16:9", duration=10)
```

## Backend Comparison

| Backend | Default Model | Duration Range | Default Duration | Resolutions | API Key |
|---------|--------------|----------------|-----------------|-------------|---------|
| **Grok** (priority 1) | `grok-imagine-video` | 1-15s | 5s | 480p, 720p | `XAI_API_KEY` |
| **Google Veo** (priority 2) | `veo-3.1-generate-preview` | 4-8s | 8s | 720p, 1080p, 4K (use `size`); default 16:9 | `GOOGLE_API_KEY` |
| **OpenAI Sora** (priority 3) | `sora-2` | 4, 8, or 12s (discrete) | 4s | Standard | `OPENAI_API_KEY` |
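
The priority column can be read as a first-match lookup over the API keys that are set. The sketch below is an illustration of that rule, not MassGen's actual selection code; `pick_backend` and `BACKEND_PRIORITY` are hypothetical names introduced here.

```python
import os

# Priority order and key names come from the comparison table above.
BACKEND_PRIORITY = [
    ("grok", "XAI_API_KEY"),
    ("google", "GOOGLE_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def pick_backend(env=None):
    """Return the highest-priority backend whose API key is set, else None.

    Hypothetical helper: the real auto-selection logic may differ.
    """
    env = os.environ if env is None else env
    for backend, key_name in BACKEND_PRIORITY:
        if env.get(key_name):  # missing and empty keys both count as unset
            return backend
    return None
```

For example, with only `GOOGLE_API_KEY` and `OPENAI_API_KEY` set, this sketch would pick `"google"`.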

## Key Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `prompt` | Text description of the video | `"A drone flying over mountains"` |
| `backend_type` | Force a specific backend | `"grok"`, `"google"`, `"openai"` |
| `model` | Override default model | `"veo-3.1-generate-preview"` |
| `duration` | Video length in seconds | `8` (clamped to backend limits) |
| `aspect_ratio` | Video aspect ratio | `"16:9"`, `"9:16"`, `"1:1"` |
| `size` | Resolution (Grok: 480p/720p; Veo: 720p/1080p/4k) | `"720p"`, `"1080p"`, `"4k"` |
| `input_images` | Source image for image-to-video | `["starting_frame.jpg"]` |
| `video_reference_images` | Style/content guide images (Veo, up to 3) | `["ref1.png", "ref2.png"]` |
| `negative_prompt` | What to exclude (Veo) | `"blurry, low quality"` |
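
As a rough illustration of how the Veo-only parameters combine, the sketch below assembles keyword arguments for a single call. `veo_video_kwargs` is a hypothetical helper, and raising an error when more than three reference images are given is an assumption; the table only states the limit, not how extra images are handled.

```python
MAX_VEO_REFERENCE_IMAGES = 3  # "up to 3" per the parameter table


def veo_video_kwargs(prompt, reference_images=(), negative_prompt=None,
                     duration=8, size="1080p"):
    """Build keyword arguments for a Veo video request (hypothetical helper)."""
    reference_images = list(reference_images)
    if len(reference_images) > MAX_VEO_REFERENCE_IMAGES:
        # Assumption: fail early rather than guess how extras are treated.
        raise ValueError(
            f"Veo accepts at most {MAX_VEO_REFERENCE_IMAGES} reference images"
        )
    kwargs = {
        "prompt": prompt,
        "mode": "video",
        "backend_type": "google",
        "duration": duration,
        "size": size,
    }
    # Only include the optional Veo parameters when they are actually used.
    if reference_images:
        kwargs["video_reference_images"] = reference_images
    if negative_prompt:
        kwargs["negative_prompt"] = negative_prompt
    return kwargs


kwargs = veo_video_kwargs(
    "A drone flying over mountains",
    reference_images=["ref1.png", "ref2.png"],
    negative_prompt="blurry, low quality",
)
# The result could then be passed on as generate_media(**kwargs).
```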

## Duration Handling

Each backend has different duration constraints. `generate_media` automatically clamps the requested duration:

- **Grok**: Continuous range 1-15s (clamped to bounds)
- **Google Veo**: Continuous range 4-8s (clamped to bounds)
- **OpenAI Sora**: Discrete values only (4, 8, or 12s) - snaps to nearest valid value

A warning is logged if duration is adjusted.
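
The two adjustment rules (clamp to a range, snap to the nearest allowed value) can be sketched as follows. This is a minimal illustration rather than the `generate_media` implementation: `adjust_duration` is a hypothetical name, and resolving ties toward the shorter duration is an assumption the text above does not specify.

```python
import logging

logger = logging.getLogger(__name__)

# Limits taken from the list above.
CONTINUOUS_LIMITS = {"grok": (1, 15), "google": (4, 8)}
DISCRETE_VALUES = {"openai": (4, 8, 12)}


def adjust_duration(backend, requested):
    """Return a duration the backend accepts (hypothetical helper)."""
    if backend in DISCRETE_VALUES:
        # Snap to the nearest allowed value.
        # Assumption: on a tie, prefer the shorter duration.
        adjusted = min(
            DISCRETE_VALUES[backend],
            key=lambda value: (abs(value - requested), value),
        )
    else:
        low, high = CONTINUOUS_LIMITS[backend]
        adjusted = max(low, min(high, requested))
    if adjusted != requested:
        logger.warning(
            "Duration %ss adjusted to %ss for backend %s",
            requested, adjusted, backend,
        )
    return adjusted
```

Under these assumptions, a 20-second request to Grok becomes 15, a 2-second request to Veo becomes 4, and a 7-second request to Sora becomes 8.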

## Image-to-Video

All three video backends support starting a video from an existing image via `input_images`:

```python
generate_media(
    prompt="Animate this scene with gentle movement",
    mode="video",
    input_images=["scene.jpg"],
    duration=5
)
```

The first image in `input_images` is used; additional images are ignored.

## Generation Time

Video generation is significantly slower than image generation. All backends poll for completion:
- **Grok**: SDK handles polling internally (up to 10 min timeout)
- **Google Veo**: Custom polling every 20s (up to 10 min)
- **OpenAI Sora**: Custom polling every 2s
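
For readers unfamiliar with the pattern, a generic poll-with-timeout loop looks roughly like the sketch below. It is not the code any of these backends use: `poll_until_done` and its `check` callback are hypothetical, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time


def poll_until_done(check, interval_s, timeout_s,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a non-None result or the timeout passes.

    Hypothetical helper; check() stands in for a backend status request.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        # Stop if another full interval would overshoot the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"no result within {timeout_s} seconds")
        sleep(interval_s)
```

With the figures listed for Veo, this would be called as `poll_until_done(check, interval_s=20, timeout_s=600)`.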

## Veo 3.1: Native Audio

Veo 3.1 generates audio (dialogue, SFX, ambient) automatically from prompt content. No extra parameter needed — just describe the sounds:

- **Dialogue**: Use quotation marks in prompt (`"Hello," she said.`)
- **Sound effects**: Describe sounds (`tires screeching, engine roaring`)
- **Ambient**: Describe atmosphere (`eerie hum resonates through the hallway`)

## Veo 3.1: Extension Constraints

When extending videos via `continue_from` with a `veo_vid_*` ID:
- Resolution is forced to **720p** (API requirement for extensions)
- Only **16:9** and **9:16** aspect ratios are supported
- Each extension adds up to 7 seconds (API limit: 20 extensions, ~141s total)
- Generated videos are retained for 2 days before expiry
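
When planning an extended clip, it can help to estimate how many extension calls a target length requires. The sketch below uses only the per-extension length and the extension-count limit quoted above; `extensions_needed` is a hypothetical helper, and it does not enforce the total-length figure or the 2-day retention window.

```python
EXTENSION_SECONDS = 7  # "up to 7 seconds" per extension
MAX_EXTENSIONS = 20    # extension-count limit quoted above


def extensions_needed(base_seconds, target_seconds):
    """Estimate how many extensions reach target_seconds (hypothetical helper).

    Assumes every extension adds the full 7 seconds.
    """
    if target_seconds <= base_seconds:
        return 0
    remaining = target_seconds - base_seconds
    # Ceiling division without floating point.
    needed = -(-remaining // EXTENSION_SECONDS)
    if needed > MAX_EXTENSIONS:
        raise ValueError(
            f"{needed} extensions exceeds the limit of {MAX_EXTENSIONS}"
        )
    return needed
```

Starting from an 8-second clip, a 30-second target would need four extensions under these assumptions.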

## Producing Longer Videos

Current APIs cap clip length at **15 seconds** (Grok), **12 seconds** (Sora), or **8 seconds** (Veo). There is no way to generate a continuous 30+ second video in one call. The proven approach:

1. **Plan a shot list** — break your video into 6-8s segments with specific camera language per shot
2. **Generate clips in parallel** — launch all segments concurrently using `background=True`
3. **Composite in Remotion** (see below) — layer programmatic animation on top of generated footage
4. **Bridge with audio** — a unified narration or music track smooths over visual cuts between clips

For visual continuity, use the same **style anchor** in every prompt (e.g., "BBC Earth documentary cinematography") and maintain consistent lighting/color descriptions.
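
One way to keep the style anchor consistent is to build every clip request from the same template. The sketch below turns a shot list into one keyword-argument dict per clip; `build_clip_requests` and the example shots are hypothetical, and the prompt wording is only one possible way to append the anchor.

```python
STYLE_ANCHOR = "BBC Earth documentary cinematography"


def build_clip_requests(shots, style_anchor=STYLE_ANCHOR, duration=8):
    """Return one generate_media keyword-argument dict per shot.

    Hypothetical helper: it only prepares arguments and makes no API calls.
    """
    return [
        {
            # The same style anchor is appended to every prompt.
            "prompt": f"{shot}. Style: {style_anchor}.",
            "mode": "video",
            "duration": duration,
            "background": True,  # lets the clips be launched concurrently
        }
        for shot in shots
    ]


shots = [
    "Wide aerial shot drifting over a misty rainforest canopy at dawn",
    "Slow push-in on a river winding through the forest floor",
    "Low-angle tracking shot following a jaguar along the riverbank",
]
clip_requests = build_clip_requests(shots)
# Each entry could then be passed on as generate_media(**request).
```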

**Full production guide with examples, transition types, and duration strategy**: See [references/production.md](references/production.md)

## Hybrid Workflow: AI Footage + Remotion Animation

**The best results come from combining AI-generated footage with Remotion's programmatic animation — not choosing one or the other.**

AI video generation produces photorealistic, cinematic footage that pure programmatic rendering cannot match. Remotion produces precise typography, motion graphics, overlays, and transitions that AI generation cannot reliably control. Use both together.

### The Rule: Generate First, Composite Second

1. **Generate AI clips** for cinematic/photorealistic shots (environments, product demos, atmospheric footage)
2. **Use those clips as visual foundations** in Remotion — import them as `<Video>` or `<OffthreadVideo>` background layers
3. **Composite programmatic elements on top** — typography, motion graphics, logos, data overlays, transitions, captions
4. **Fill gaps with pure Remotion animation** — title cards, intro sequences, motion-graphics-only segments where AI footage isn't needed

### Do NOT Discard Generated Clips

**Every AI-generated clip costs real money and time. Do not abandon generated footage and fall back to purely programmatic rendering.** This is a common failure mode — agents generate clips, notice minor artifacts (e.g., repeated patterns, slight distortion), then pivot entirely to OpenCV/PIL/moviepy rendering, wasting all the generation budget.