# ai-video-script

Generates a structured short-video shooting script by converting a topic, style, and duration into a machine-parseable shot list with strict formatting. Each shot includes an image prompt, video prompt, voiceover, and on-screen text designed for downstream parsing by video generation tools. Use this skill when a user requests a video script, 分镜 (storyboard), short-video copy, or AI video production assets in formats ready for image and video generation workflows.
`git clone --depth 1 https://github.com/opensquilla/opensquilla /tmp/ai-video-script && cp -r /tmp/ai-video-script/src/opensquilla/skills/bundled/ai-video-script ~/.claude/skills/ai-video-script`
# ai-video-script — structured short-video script generator
Turns a topic/keyword + style + duration into a strict-format shooting script
that the downstream `nano-banana-pro` and `seedance-2-prompt` skills can parse
without ambiguity. The number of shots is set by `N_SHOTS` (see Inputs); any
value from 1 to 10 is accepted.
## Inputs
Free-text via `with.task` / `with.request`:
- Topic / product / story
- Target audience (optional)
- Style (轻松 casual, 专业 professional, 故事 storytelling, 科普 science explainer, 带货 product promotion); this is narrative style, not render style
- Total duration (15s, 30s, 60s default)
- Aspect ratio (9:16 default, 16:9 optional)
- `N_SHOTS` override (5 default, **1-10 allowed**)
Caller-supplied anchors (used verbatim — this skill never invents them):
- `with.render_style` — one-line aesthetic the per-shot prompts must end
with. Examples: `2D anime illustration, flat colour, soft cel-shading`,
`watercolour storybook illustration`, `cinematic photoreal 35mm grain`.
If absent / empty, emit the literal sentinel `(render style missing)`
into the RENDER_STYLE field so downstream parsers can fail loudly.
- `with.identity_anchor` — one-line description of the main character(s)
that every shot must reproduce byte-for-byte. Example: `Lin, a
25-year-old East Asian woman with chin-length black bob, almond eyes,
wearing sage-green oversized knit sweater and gold round earrings`. If
absent / empty, emit the literal sentinel `(identity anchor missing)`
so callers can detect the gap before spending on image/video gen.
This skill does **not** choose render style or character identity; the
orchestrator (or its user_input clarify step) does. This separation lets
the same skill serve product ads (no human) and short dramas (with
locked characters) without baked-in defaults.
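Read literally, the anchor rules above reduce to "copy the value verbatim, or emit the sentinel". A minimal Python sketch, assuming the orchestrator hands over the `with.*` arguments as a plain dict; `resolve_anchor`, `with_args`, and the constant names are hypothetical, not part of the skill. It also assumes that a whitespace-only value counts as empty.

```python
from typing import Optional

RENDER_STYLE_MISSING = "(render style missing)"
IDENTITY_ANCHOR_MISSING = "(identity anchor missing)"


def resolve_anchor(value: Optional[str], sentinel: str) -> str:
    """Return the caller's anchor byte-for-byte, or the sentinel if absent/empty."""
    # Assumption: whitespace-only input is treated as "empty".
    if value is None or not value.strip():
        return sentinel
    # No strip(), no rewording: the value is reused verbatim in every shot.
    return value


# Hypothetical caller arguments: identity anchor supplied, render style omitted.
with_args = {"identity_anchor": "Lin, a 25-year-old East Asian woman with chin-length black bob"}
identity = resolve_anchor(with_args.get("identity_anchor"), IDENTITY_ANCHOR_MISSING)
render_style = resolve_anchor(with_args.get("render_style"), RENDER_STYLE_MISSING)
print(f"IDENTITY_ANCHOR: {identity}")
print(f"RENDER_STYLE: {render_style}")
```

A downstream check can then refuse to start image or video generation while either field still equals its sentinel.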
## Output format (STRICT — orchestrators parse this)
Always emit exactly these top-level blocks, in this order:
```
=== OVERVIEW ===
TITLE: <one line>
DURATION_S: <int>
ASPECT_RATIO: <9:16|16:9>
STYLE: <one line>
AUDIENCE: <one line>
N_SHOTS: <int 1-10>
IDENTITY_ANCHOR: <copied verbatim from with.identity_anchor, or "(identity anchor missing)">
RENDER_STYLE: <copied verbatim from with.render_style, or "(render style missing)">
=== SHOT_1 ===
DURATION_S: <int 3-6>
CAMERA: <wide|medium|close-up + push/pull/pan/tilt/static>
IMAGE_PROMPT: <IDENTITY_ANCHOR verbatim>, <scene/action>, <RENDER_STYLE verbatim>, --ar 9:16
VIDEO_PROMPT: <IDENTITY_ANCHOR verbatim>, <one major action + camera move + duration hint>, <dialogue/voiceover/sound tags derived from VOICEOVER — see rule 11>, <RENDER_STYLE verbatim>, aspect_ratio: 9:16, no watermark, no logo, no subtitles
VOICEOVER: <one line, max 20 Chinese chars or 30 English words — kept verbatim for SRT subtitle burn-in>
ON_SCREEN_TEXT: <one short line or empty>
=== SHOT_2 ===
... (same fields, IMAGE_PROMPT and VIDEO_PROMPT must begin with the
exact same IDENTITY_ANCHOR bytes as SHOT_1)
=== SHOT_3 ===
... (same fields)
```
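Because orchestrators parse this output, the format can be read back with a few lines of code. A minimal sketch, assuming every field is a single `KEY: value` line and that blank lines between blocks may be ignored; `parse_script` is a hypothetical helper, not an API shipped with the skill. A value that spills onto a second line (which rule 4 forbids) surfaces as an error instead of being silently merged.

```python
import re
from typing import Optional

HEADER = re.compile(r"^=== (OVERVIEW|SHOT_(\d+)) ===$")
FIELD = re.compile(r"^([A-Z_]+):[ ]?(.*)$")


def parse_script(text: str) -> tuple[dict[str, str], list[tuple[int, dict[str, str]]]]:
    """Split the script into OVERVIEW fields and an ordered list of (shot number, fields)."""
    overview: dict[str, str] = {}
    shots: list[tuple[int, dict[str, str]]] = []
    current: Optional[dict[str, str]] = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.rstrip()
        if not line:
            continue  # blank lines carry no data
        header = HEADER.match(line)
        if header:
            if header.group(1) == "OVERVIEW":
                current = overview
            else:
                current = {}
                shots.append((int(header.group(2)), current))
            continue
        field = FIELD.match(line)
        if current is None or field is None:
            # Content before the first header, or a wrapped (multi-line) value.
            raise ValueError(f"line {lineno}: not a KEY: value line: {line!r}")
        current[field.group(1)] = field.group(2)
    return overview, shots
```

Keeping shots as an ordered list of `(number, fields)` pairs, rather than a dict keyed by number, preserves duplicates and ordering so that a later check can detect them.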
For any `N_SHOTS` between 1 and 10, emit exactly that many
`=== SHOT_K ===` blocks numbered 1..N_SHOTS, each with the same fields.
Do not emit shot blocks beyond `N_SHOTS`. Never skip a field; use the
literal value `none` for empty `ON_SCREEN_TEXT`.
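The block-count and field rules above can be checked against the output of any parser that yields an ordered list of `(shot number, fields)` pairs. A minimal sketch; `check_shot_blocks` is a hypothetical helper, and the 1-10 bound is taken from the Inputs section:

```python
REQUIRED_SHOT_FIELDS = (
    "DURATION_S", "CAMERA", "IMAGE_PROMPT", "VIDEO_PROMPT", "VOICEOVER", "ON_SCREEN_TEXT",
)


def check_shot_blocks(n_shots: int, shots: list[tuple[int, dict[str, str]]]) -> list[str]:
    """Return human-readable problems; an empty list means the block layout is valid."""
    problems: list[str] = []
    if not 1 <= n_shots <= 10:
        problems.append(f"N_SHOTS={n_shots} is outside 1..10")
    numbers = [k for k, _ in shots]
    if numbers != list(range(1, n_shots + 1)):
        # Catches skipped, duplicated, out-of-order, or surplus shot blocks.
        problems.append(f"expected SHOT_1..SHOT_{n_shots}, got {numbers}")
    for k, fields in shots:
        for name in REQUIRED_SHOT_FIELDS:
            if not fields.get(name, "").strip():
                # An empty ON_SCREEN_TEXT must be written as the literal "none".
                problems.append(f"SHOT_{k}: {name} is missing or empty")
    return problems
```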
`N_SHOTS` semantics:
- 1: a single hero shot (5-10s typical) — product/landscape vignette.
- 2-3: classic short-form story arc.
- 4-6: extended narrative with multiple beats; good for 45-60s drama.
- 7-10: stretched-form drama; total duration and generation cost both grow linearly with the shot count.
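When planning per-shot `DURATION_S` values for a chosen `N_SHOTS`, an even integer split keeps the shot sum exactly equal to the overview total. A minimal sketch; `split_duration` is a hypothetical helper and only one possible budgeting strategy. It does not enforce the 3-6s per-shot range shown in the template, so callers should pick an `N_SHOTS` whose split lands inside that range.

```python
def split_duration(total_s: int, n_shots: int) -> list[int]:
    """Split total_s seconds into n_shots integer durations that sum exactly to total_s."""
    if not 1 <= n_shots <= 10:
        raise ValueError("N_SHOTS must be between 1 and 10")
    base, extra = divmod(total_s, n_shots)
    # The first `extra` shots get one extra second so nothing is lost to rounding.
    return [base + 1 if i < extra else base for i in range(n_shots)]


print(split_duration(30, 5))  # five 6-second shots
```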
## Rules
1. **Identity continuity** — `with.identity_anchor` is pasted byte-for-byte
at the start of every shot's IMAGE_PROMPT and VIDEO_PROMPT. Do not
paraphrase, summarize, or pronoun-substitute it. If shot 3's anchor
text differs by one comma from shot 1's, you wrote it wrong.
2. **Visual concreteness** — replace abstract verbs with observable action:
prefer "a young woman in a red trench coat walks through rain-soaked neon
streets" over "a woman walking".
3. **IP-safe** — do not use franchise names, character names, brand terms,
or "style of" references. Invent original names if needed.
4. **No multi-line values** — IMAGE_PROMPT, VIDEO_PROMPT, VOICEOVER,
ON_SCREEN_TEXT must each be a single line.
5. **Aspect ratio explicit** — every IMAGE_PROMPT ends with the literal
token `--ar 9:16` (or `--ar 16:9`); every VIDEO_PROMPT ends with the
literal token `aspect_ratio: 9:16` (or 16:9).
6. **Duration math** — `sum(SHOT_i.DURATION_S) == OVERVIEW.DURATION_S` ±2s.
7. **Voiceover length** — total voiceover should be speakable in
`DURATION_S` seconds (~3 Chinese chars/sec, ~2 English words/sec).
8. **Match the user's language** — write **all** fields (TITLE, STYLE,
AUDIENCE, IDENTITY_ANCHOR, RENDER_STYLE, IMAGE_PROMPT, VIDEO_PROMPT,
VOICEOVER, ON_SCREEN_TEXT) in the **same language the user wrote in**.
- The current downstream image/video models — `google/gemini-3.1-flash-image-preview`
and `bytedance/seedance-2.0` — both accept Chinese natively.
Seedance (ByteDance) is in fact a Chinese-first model and tends to
produce **more on-topic results** with Chinese prompts when the
story itself is Chinese (e.g. 咖啡店偶遇 "chance meeting in a café" / 国风武侠 "Chinese-style wuxia" / 校园回忆 "campus memories").
- Do **not** translate the user's Chinese topic into English just to
fill IMAGE_PROMPT — that loses cultural detail and often hallucinates
a Western-coded substitute.
- Mixed-language input (English topic + Chinese voiceover note, or
vice versa) → the *bulk* of prompts follow whichever language the
**topic/story** is in; localised fields like VOICEOVER may follow
the language explicitly named by the user.
- English remains valid: pick it when the user wrote in English, or
when the user explicitly asked for English prompts.
9. **Plain text only — no emoji, no decorative symbols.** The script
flows through Python subprocesses on Windows consoles whose default
code page (cp936/GBK) cannot encode `✅`, `❌`, `✨`, `🎬`, or any
non-BMP character. The orchestrator will crash with a
`UnicodeEncodeError` if any field contains one. Use p…
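Several of the rules above are mechanical enough to check before any image or video generation is paid for. A minimal sketch covering rules 1, 5, 6, 7, and 9, assuming the script has already been parsed into an overview dict and an ordered list of `(shot number, fields)` pairs; `check_rules` and `speak_seconds` are hypothetical helpers. Two assumptions are worth flagging: the VIDEO_PROMPT template appends `no watermark, no logo, no subtitles` after the aspect-ratio token, so the sketch only requires that token to be present rather than last; and the speaking-rate estimate counts only CJK Unified Ideographs (U+4E00 to U+9FFF) and ASCII words. Python's `gbk` codec, which also answers to the name `cp936`, stands in for the Windows console encoding.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
EN_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def speak_seconds(text: str) -> float:
    """Rule 7 estimate: ~3 Chinese characters/second, ~2 English words/second."""
    return len(CJK_CHAR.findall(text)) / 3 + len(EN_WORD.findall(text)) / 2


def check_rules(overview: dict[str, str], shots: list[tuple[int, dict[str, str]]]) -> list[str]:
    problems: list[str] = []
    anchor = overview["IDENTITY_ANCHOR"]
    ratio = overview["ASPECT_RATIO"]
    total = int(overview["DURATION_S"])

    for k, shot in shots:
        # Rule 1: byte-for-byte identity anchor at the start of both prompts.
        for key in ("IMAGE_PROMPT", "VIDEO_PROMPT"):
            if not shot[key].startswith(anchor):
                problems.append(f"SHOT_{k}: {key} does not start with IDENTITY_ANCHOR")
        # Rule 5: explicit aspect-ratio tokens.
        if not shot["IMAGE_PROMPT"].endswith(f"--ar {ratio}"):
            problems.append(f"SHOT_{k}: IMAGE_PROMPT does not end with --ar {ratio}")
        if f"aspect_ratio: {ratio}" not in shot["VIDEO_PROMPT"]:
            problems.append(f"SHOT_{k}: VIDEO_PROMPT lacks aspect_ratio: {ratio}")

    # Rule 6: shot durations add up to the overview total, within 2 seconds.
    shot_total = sum(int(shot["DURATION_S"]) for _, shot in shots)
    if abs(shot_total - total) > 2:
        problems.append(f"shot durations sum to {shot_total}s, overview says {total}s")

    # Rule 7: the whole voiceover must be speakable in DURATION_S seconds.
    speech = sum(speak_seconds(shot["VOICEOVER"]) for _, shot in shots)
    if speech > total:
        problems.append(f"voiceover needs ~{speech:.1f}s but only {total}s are available")

    # Rule 9: every field must survive a GBK (cp936) console.
    blocks = [("OVERVIEW", overview)] + [(f"SHOT_{k}", shot) for k, shot in shots]
    for block, fields in blocks:
        for key, value in fields.items():
            try:
                value.encode("gbk")
            except UnicodeEncodeError:
                problems.append(f"{block}: {key} contains characters GBK cannot encode")
    return problems
```

Returning a list of problems instead of raising on the first one lets an orchestrator report every defect in a single retry request.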