
video-compose

The video-compose skill dispatches video generation requests by constructing prompts, managing references, and invoking the generate_video CLI with appropriate flags. Use it when the user requests video creation, animation, editing, or composition, including tasks like rendering scripts as video, animating storyboards, applying audio or image references, or generating ads and product promos. The skill enforces two production defaults: every output is staged, and audio stays on unless the user explicitly disables it.

Install in Claude Code
git clone --depth 1 https://github.com/Utopai-Research/pai-pro /tmp/video-compose && cp -r /tmp/video-compose/skills/video-compose ~/.claude/skills/video-compose
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

This file is the intent dispatcher. Each pattern below names triggers, the CLI invocation, edge / node rules, and which reference owns the prompt construction. References live in `references/`.

## Hard defaults

These are behaviors that production-judgment instinct will silently flip unless they are enshrined here. Don't override them unless the user explicitly asks.

- **STAGE BY DEFAULT** — every `generate_video.js` call goes through `--stage`; see the project `PROJECT_AGENT.md` § "Draft gate" for draft and result handling.
- **AUDIO ON BY DEFAULT** — every `generate_video.js` call generates an audio track (`generate_audio: true`). Pass `--no-audio` ONLY when the user has explicitly asked for a silent clip ("silent", "no audio", "I'll add sound in post"). Trailer / portrait / cinematic framing is NOT a trigger; audio is the baseline, not optional polish.

## First-use video mode

For the ask-once flow and per-mode prices, see the project `PROJECT_AGENT.md` § "First-use generation choices". Pass `--resolution` only for `480p Draft` or `1080p Final`.

## CLI shape

```
node "$PAI_REPO_ROOT/server/cli/generate_video.js" --prompt "..." [--duration <seconds>] [--aspect-ratio 16:9]
  [--resolution <480p|1080p>] [--no-audio]
  [--label "..."] [--ref-source-id <id> ...] [--ref-audio-source-id <audio_id> ...]
  [--source-node-id <id>] [--shot-id <N>]
```

`$PAI_REPO_ROOT` is exported by the viewer — see the project `PROJECT_AGENT.md` § "Media CLIs (server/cli/)".

Calls go via `--stage` — see the project `PROJECT_AGENT.md` § "Draft gate".

- `--label` defaults to the truncated prompt (≤30 chars) if omitted.
- Pass `--ref-source-id <id>` once per `image_result` / `video_result` source node you want as a byte ref. The CLI resolves each source's `local_path`, hands the tunnel URL to PAI's `video-generation-assets` endpoint, and emits one `derived` edge per ref.
- Pass `--ref-audio-source-id <audio_id>` once per canvas `audio_result` node you want as an audio ref (same wiring; a separate flag so the CLI can partition by type without reading the workflow).
- External URLs (a pasted CDN link, a still you want as a ref) must be mirrored onto the canvas first via `mirror_url.js --url <URL>`; the returned `node_id` plugs into `--ref-source-id` like any other canvas source.
- When a canvas note authored the clip (most commonly a shot note being rendered), pass `--source-node-id <note_id>`; see the project `PROJECT_AGENT.md` § "Asset, ref, and edge rules".
- Don't set `--shot-id` unless the user asked for a specific reel position; the Timeline UI owns `shot_id` assignment.
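
The flag rules above can be sketched as a small argument builder. This is an illustration only: `buildGenerateVideoArgs` and its option names are hypothetical, the node ids are made up, and the sketch assembles an argv array without running the CLI. It leaves `--stage` out because how staging is applied is documented in `PROJECT_AGENT.md`, not here.

```javascript
// Hypothetical helper (not part of the skill): assembles an argv array for
// generate_video.js from the flags listed in the CLI shape. It only builds
// arguments; it never invokes the CLI.
function buildGenerateVideoArgs(opts) {
  if (!opts || typeof opts.prompt !== "string" || opts.prompt.length === 0) {
    throw new Error("prompt is required");
  }
  const argv = ["--prompt", opts.prompt];

  // Optional scalar flags are emitted only when the caller set them.
  if (opts.duration !== undefined) argv.push("--duration", String(opts.duration));
  if (opts.aspectRatio !== undefined) argv.push("--aspect-ratio", opts.aspectRatio);
  if (opts.resolution !== undefined) argv.push("--resolution", opts.resolution);

  // Audio stays on unless the user explicitly asked for a silent clip.
  if (opts.silent === true) argv.push("--no-audio");

  if (opts.label !== undefined) argv.push("--label", opts.label);

  // One flag occurrence per ref, in the order given.
  for (const id of opts.refSourceIds ?? []) argv.push("--ref-source-id", id);
  for (const id of opts.refAudioSourceIds ?? []) argv.push("--ref-audio-source-id", id);

  if (opts.sourceNodeId !== undefined) argv.push("--source-node-id", opts.sourceNodeId);

  // --shot-id is emitted only when a reel position was explicitly requested.
  if (opts.shotId !== undefined) argv.push("--shot-id", String(opts.shotId));

  return argv;
}

const exampleArgv = buildGenerateVideoArgs({
  prompt: "opening frame @Image1, slow dolly in. No Music.",
  duration: 8,
  refSourceIds: ["img_node_1"], // hypothetical canvas node id
  sourceNodeId: "note_7",       // hypothetical shot-note id
});
console.log(exampleArgv);
```

Keeping the refs as arrays makes the "once per source node" rule explicit: each id produces its own flag occurrence, and the order of the array is the order the CLI sees.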

Match any stated single-clip duration with `--duration <seconds>`; omit the flag only for the 15s default. For totals above 15s, split the request into multiple clips or chain them.
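
One way to picture the split for totals above 15s is the sketch below. It rests on assumptions the text does not state: that 15s is the longest single clip, that durations are whole seconds, and that near-even clips are preferable to a run of 15s clips with a short remainder. `splitDuration` is a hypothetical helper, not part of the CLI.

```javascript
// Hypothetical helper: splits a requested total runtime into per-clip
// durations, each at most maxClipSeconds.
// Assumptions (not from the source): whole-second durations, near-even clips.
function splitDuration(totalSeconds, maxClipSeconds = 15) {
  if (!Number.isInteger(totalSeconds) || totalSeconds <= 0) {
    throw new Error("totalSeconds must be a positive integer");
  }
  const clipCount = Math.ceil(totalSeconds / maxClipSeconds);
  const base = Math.floor(totalSeconds / clipCount);
  const extra = totalSeconds % clipCount; // the first `extra` clips get one more second
  return Array.from({ length: clipCount }, (_, i) => base + (i < extra ? 1 : 0));
}

console.log(splitDuration(40)); // three clips: 14, 13, 13
console.log(splitDuration(12)); // one clip: 12
```

Each resulting value would be passed as its own `--duration` on a separate call, and every one of those calls is a separate clip.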

Each clip costs real money even after staging — only stage after the user has explicitly asked for a video.

## Reference caps (video-generation)

- Counts: ≤9 image refs, ≤3 audio refs, ≤3 video refs.
- Per-file duration: each audio / video ref must be **1.8s–15.2s per file**.
- Aggregate duration: **video refs additionally cap at 15s aggregate** (sum across the ≤3 video refs); audio has no aggregate cap.
- Audio refs need an image or video anchor; they can't be the only reference.
- Don't preflight: submit and read `limits` + `sent` on failure.
- Audio / video durations are already on the canvas: read `audio_result.data.metadata.duration_sec` and `video_result.data.duration` from `workflow.json` instead of probing the files.
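
To make the arithmetic of these caps concrete, here is a minimal sketch of how they combine. It is meant for reasoning about which refs to pick, not as a gate to run before submitting, since the guidance above is to submit and read `limits` + `sent` on failure. `capViolations` and the `{ type, duration }` ref shape are hypothetical; they are not the `workflow.json` schema.

```javascript
// Caps copied from the section above.
const CAPS = {
  maxImageRefs: 9,
  maxAudioRefs: 3,
  maxVideoRefs: 3,
  minRefSeconds: 1.8,
  maxRefSeconds: 15.2,
  maxVideoAggregateSeconds: 15,
};

// Hypothetical helper: refs is an array of
// { type: "image" | "audio" | "video", duration?: number }.
// That shape is an assumption for illustration only.
function capViolations(refs) {
  const byType = (t) => refs.filter((r) => r.type === t);
  const images = byType("image");
  const audio = byType("audio");
  const video = byType("video");
  const problems = [];

  if (images.length > CAPS.maxImageRefs) problems.push("too many image refs");
  if (audio.length > CAPS.maxAudioRefs) problems.push("too many audio refs");
  if (video.length > CAPS.maxVideoRefs) problems.push("too many video refs");

  // The per-file window applies to audio and video refs alike.
  for (const r of [...audio, ...video]) {
    if (r.duration < CAPS.minRefSeconds || r.duration > CAPS.maxRefSeconds) {
      problems.push(`${r.type} ref of ${r.duration}s is outside 1.8s-15.2s`);
    }
  }

  // Only video refs carry an aggregate cap.
  const videoTotal = video.reduce((sum, r) => sum + r.duration, 0);
  if (videoTotal > CAPS.maxVideoAggregateSeconds) {
    problems.push(`video refs total ${videoTotal}s, above the 15s aggregate cap`);
  }

  // Audio needs an image or video anchor alongside it.
  if (audio.length > 0 && images.length + video.length === 0) {
    problems.push("audio refs need an image or video anchor");
  }
  return problems;
}

// Each video ref is inside the per-file window, but together they exceed 15s.
console.log(capViolations([
  { type: "video", duration: 8 },
  { type: "video", duration: 8 },
]));
```

The example shows the case that is easiest to miss: two video refs that each pass the per-file check can still fail on the aggregate.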

## Reference roles — vocabulary

The same CLI flag can serve different semantic roles depending on how the prompt names the ref. Choose the role first; the prompt phrasing binds it.

| Role | Flag | Wording in prompt |
|---|---|---|
| Character identity | `--ref-source-id` (image) | "the character in @Image1" |
| Location / setting | `--ref-source-id` (image) | "the location shown in @Image1" |
| Opening frame | `--ref-source-id` (image) | "opening frame @Image1, …" |
| Closing frame | `--ref-source-id` (image) | "closing on the frame from @Image1" |
| Source clip — continue | `--ref-source-id` (video) | "Continue from @Video1 — start after its final frame, no frames from @Video1 in the new clip" |
| Source clip — transform | `--ref-source-id` (video) | "Re-render @Video1 in …" |
| Camera-move source | `--ref-source-id` (video) | "camera moves match @Video1" |
| Action source | `--ref-source-id` (video) | "action choreography matches @Video1" |
| VFX template | `--ref-source-id` (video) | "use the visual-effects template from @Video1" |
| Voice / timbre anchor | `--ref-audio-source-id` | "Use @Audio1 as the voice/timbre reference. Speak the quoted line exactly once, no echo." |
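
The wording column names each ref by a token such as `@Image1`, `@Video1`, or `@Audio1`. A minimal sketch of deriving those tokens from flag order follows. It assumes images and videos are numbered separately, each starting at 1, in `--ref-source-id` order, and that audio refs are numbered in `--ref-audio-source-id` order. `assignRefTokens` and the node ids are hypothetical.

```javascript
// Hypothetical helper: given refs in the order they are passed on the command
// line, returns the @-token each ref would be addressed by in the prompt.
// Assumption: per-type counters, each starting at 1.
function assignRefTokens(sourceRefs, audioRefIds = []) {
  const counters = { image: 0, video: 0 };
  const tokens = {};
  for (const { id, type } of sourceRefs) {
    if (type !== "image" && type !== "video") {
      throw new Error(`unsupported ref type: ${type}`);
    }
    counters[type] += 1;
    const prefix = type === "image" ? "@Image" : "@Video";
    tokens[id] = `${prefix}${counters[type]}`;
  }
  audioRefIds.forEach((id, i) => {
    tokens[id] = `@Audio${i + 1}`;
  });
  return tokens;
}

const refTokens = assignRefTokens(
  [
    { id: "hero_still", type: "image" }, // hypothetical node ids
    { id: "take_2", type: "video" },
    { id: "set_still", type: "image" },
  ],
  ["voice_sample"]
);

// Bind roles by wording, using phrases from the table above.
const rolePrompt =
  `the character in ${refTokens.hero_still}, ` +
  `the location shown in ${refTokens.set_still}, ` +
  `camera moves match ${refTokens.take_2}`;
console.log(rolePrompt);
```

Deriving the tokens from the same arrays that feed the flags keeps the prompt and the command line from drifting apart when a ref is added or reordered.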

## Prompt-language conventions

- Reference syntax: `@Image1` / `@Video1` / `@Audio1`, positional, in `--ref-source-id` / `--ref-audio-source-id` order (image and video refs share the `--ref-source-id` flag and take the `@Image…` or `@Video…` slot according to their source node type).
- Spoken text rule: when a script note, shot note, or user request contains dialogue/VO, include those words in the video prompt verbatim. Do not summarize, translate, shorten, polish, or invent dialogue/VO.
- Dialogue scenes: keep the shot/script dialogue in the prompt; use one approved voice sample per speaker as a timbre anchor. Do not generate per-line audio refs unless the user explicitly wants separate final audio.
- Final audio exception: if an audio node is the approved narration/line read, use `audio_result.data.text` verbatim. If it is just a character voice sample, do not replace the shot dialogue with the sample text.
- Add dialogue guards for model-spoken lines: *"each line spoken exactly once, no echo, no repeated reads."* Add phonetic spelling for names or words likely to slur.
- Direction beats adjectives: one camera move, one action speed, concrete sound/music (`No Music` if none). Use exact terms: `locked off`, `handheld, subtle`, `slow dolly in`, `slow orbit`, `whip pan`, `speed ramp`.
- Avoid conflicts ("static camera" + "orbit shot").
- For brand / MV / ad work, end the prompt with a negative line: *"no captions, watermarks, di