ai-video-script-sop-remotion-diffusion

This Claude Code skill generates structured video scripts optimized for AI video production using Remotion and diffusion models. It establishes narrative principles (single hero, action-driven storytelling, three-act structure) and technical constraints (1–3 minutes, 10-second diffusion clip limits, 720p/1080p resolution), then maps shot types to appropriate tools: diffusion for photoreal scenes, code-based Remotion for charts and typography, and SVG for vector animation. Use this when planning scripts that balance compelling visual storytelling with technical feasibility across multiple AI generation platforms.

Repository: AWorld

Install in Claude Code

git clone --depth 1 https://github.com/inclusionAI/AWorld /tmp/ai-video-script-sop-remotion-diffusion && cp -r /tmp/ai-video-script-sop-remotion-diffusion/aworld-skills/video_script_writting ~/.claude/skills/ai-video-script-sop-remotion-diffusion

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

## 1. Core narrative rules (narrative DNA)

To keep the video engaging and satisfying to watch, the script should follow these rules:

*   **Single hero**: One core character drives the story through **action** that solves the problem.
*   **Show, don’t tell**: No inner monologue; emphasize **what happens on screen**.
*   **Three-part arc**:
    1.  **Opening (hook)**: A clear, seemingly impossible big task.
    2.  **Middle (grind)**: Dense, fast execution (cathartic, orderly).
    3.  **Ending (payoff)**: A strong visual reward.
*   **Radical brevity**: Voice-over and subtitles match **1:1**; lines only **announce** or **briefly react**, so the pictures carry the meaning.

## 2. Technical specs and limits

*   **Total length**: $1\ \text{min}$–$3\ \text{min}$.
*   **Segment length**: Must be a **whole number** of seconds; round fractional durations up (e.g. $4.5\text{s} \rightarrow 5\text{s}$). Diffusion clips are capped at **$10\text{s}$** per segment.
*   **Resolution**: $1080\text{p}$ or $720\text{p}$.
*   **Frame rate**: $24\text{fps}$ or $30\text{fps}$.
*   **Mandarin VO baseline**: Plan copy at about **4–5 characters per second**.
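
The limits above lend themselves to an automated pre-flight check. The sketch below is one possible way to encode them; the function names and the `(technique, seconds)` tuple shape are illustrative assumptions, not part of the skill.

```python
import math

# Limits taken from the list above.
ALLOWED_HEIGHTS = {720, 1080}
ALLOWED_FPS = {24, 30}
MAX_DIFFUSION_SECONDS = 10
MIN_TOTAL_SECONDS, MAX_TOTAL_SECONDS = 60, 180


def to_whole_seconds(raw_seconds: float) -> int:
    """Round a fractional duration up to whole seconds (4.5 -> 5)."""
    return math.ceil(raw_seconds)


def check_plan(segments, height, fps):
    """Return a list of problems; an empty list means the plan passes.

    segments: list of (technique, seconds) tuples (hypothetical shape),
    where technique is "diffusion" or "code".
    """
    problems = []
    if height not in ALLOWED_HEIGHTS:
        problems.append(f"resolution {height}p is not 720p or 1080p")
    if fps not in ALLOWED_FPS:
        problems.append(f"frame rate {fps}fps is not 24fps or 30fps")
    for index, (technique, seconds) in enumerate(segments, start=1):
        if seconds != int(seconds):
            problems.append(f"segment {index}: {seconds}s is not a whole number")
        if technique == "diffusion" and seconds > MAX_DIFFUSION_SECONDS:
            problems.append(
                f"segment {index}: diffusion clip exceeds {MAX_DIFFUSION_SECONDS}s"
            )
    total = sum(seconds for _, seconds in segments)
    if not MIN_TOTAL_SECONDS <= total <= MAX_TOTAL_SECONDS:
        problems.append(f"total length {total}s is outside 60-180s")
    return problems
```

Running `check_plan` before any rendering starts is cheaper than discovering an over-long diffusion clip after generation.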

## 3. Shot tech-selection matrix

| Need | Recommended tech | Why | Avoid / caveats |
| :--- | :--- | :--- | :--- |
| **Photoreal / complex lighting** | **Diffusion (video)** | Texture, mood, physics, transitions. | On-screen **text or charts** in the same shot; don’t mix code and diffusion **in one shot**. |
| **Character close-up / background change** | **Diffusion (I2V)** | Image-to-video keeps continuity. | Uncontrolled camera moves; specify **physical camera motion** strictly. |
| **Cartoon / vector motion** | **Code (SVG/TSX)** | Clean edges, flat look, precise paths. | Rich texture, which is hard to express in code. |
| **Info / formulas / charts** | **Code (HTML/Remotion)** | Exact typography, math, data. | Photoreal landscapes. |
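
The matrix can be treated as a simple lookup. The sketch below is a hypothetical encoding: the need keys (`photoreal`, `character_closeup`, and so on) and the `has_onscreen_text` flag are names invented for illustration, and applying the no-text rule to both diffusion rows is an assumption drawn from the advice not to mix code and diffusion in one shot.

```python
from enum import Enum


class Tech(Enum):
    DIFFUSION_VIDEO = "Diffusion (video)"
    DIFFUSION_I2V = "Diffusion (I2V)"
    CODE_SVG = "Code (SVG/TSX)"
    CODE_REMOTION = "Code (HTML/Remotion)"


# One entry per matrix row; the keys are hypothetical labels.
MATRIX = {
    "photoreal": Tech.DIFFUSION_VIDEO,
    "character_closeup": Tech.DIFFUSION_I2V,
    "vector_motion": Tech.CODE_SVG,
    "info_chart": Tech.CODE_REMOTION,
}

DIFFUSION_TECHS = {Tech.DIFFUSION_VIDEO, Tech.DIFFUSION_I2V}


def select_tech(need: str, has_onscreen_text: bool = False) -> Tech:
    """Pick the recommended tech for a shot, rejecting text inside diffusion shots."""
    tech = MATRIX[need]
    # Assumption: text or charts are never rendered inside any diffusion shot.
    if has_onscreen_text and tech in DIFFUSION_TECHS:
        raise ValueError("split the shot: text and charts belong in a code shot")
    return tech
```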

---

## 4. Diffusion prompt protocol

**This is what keeps visuals high quality and coherent.** Every diffusion shot description should combine **five parts**:

$$ \text{Prompt} = \text{[Style anchor]} + \text{[Micro-timeline]} + \text{[Concrete entities]} + \text{[Camera physics]} + \text{[Physical bridge]} $$
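
One way to make sure no part of the formula is forgotten is to assemble prompts from named fields. The class below is a minimal sketch; its name and field names are assumptions. It treats the physical bridge as optional because a cold-open shot has no previous frame, as in shot 01 of the template in section 6.

```python
from dataclasses import dataclass


@dataclass
class DiffusionPrompt:
    style_anchor: str
    micro_timeline: str
    concrete_entities: str
    camera_physics: str
    physical_bridge: str = ""  # may stay empty for a cold-open shot

    def render(self) -> str:
        """Join the parts in formula order, failing loudly if a required part is blank."""
        required = {
            "style_anchor": self.style_anchor,
            "micro_timeline": self.micro_timeline,
            "concrete_entities": self.concrete_entities,
            "camera_physics": self.camera_physics,
        }
        missing = [name for name, text in required.items() if not text.strip()]
        if missing:
            raise ValueError(f"missing prompt parts: {', '.join(missing)}")
        parts = [*required.values(), self.physical_bridge]
        return " ".join(part.strip() for part in parts if part.strip())
```

For example, `DiffusionPrompt("【Cyberpunk photoreal】", "【0–2s】a steel door opens, 【2–5s】the hero steps out", "a steel door, a hero in a red coat", "slow push in").render()` yields a single string that starts with the style anchor.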

### A. Style anchors

*   **Force consistency**: Start every shot with the **same style phrase**, e.g. `【Impressionist oil painting】` or `【Cyberpunk photoreal】`.
*   **Push intensity**: Use extreme, specific wording; do not settle for “fine.”
    *   *Weak:* “sunflowers”
    *   *Strong:* “**Van Gogh sunflowers as extremely thick, rough impasto in blazing yellow**”

### B. Micro-timeline

*   **Avoid evenly paced mush**: State what happens in **each time span** of the clip.
    *   *Pattern:* `【0–2s】action A, 【2–10s】action B`.
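
A small helper can build the timeline string and catch gaps or overlaps between beats. This is a sketch under the assumption that beats are given as `(start, end, action)` tuples in whole seconds; the function name is hypothetical.

```python
def format_micro_timeline(beats, total_seconds):
    """Render beats as '【0–2s】action A, 【2–10s】action B'.

    beats: list of (start, end, action) tuples that must tile
    the clip from 0 to total_seconds with no gaps or overlaps.
    """
    cursor = 0
    chunks = []
    for start, end, action in beats:
        if start != cursor:
            raise ValueError(f"beat starts at {start}s but previous beat ended at {cursor}s")
        if end <= start:
            raise ValueError(f"beat {start}-{end}s has no duration")
        chunks.append(f"【{start}–{end}s】{action}")
        cursor = end
    if cursor != total_seconds:
        raise ValueError(f"beats end at {cursor}s, clip lasts {total_seconds}s")
    return ", ".join(chunks)
```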

### C. Concrete entities

*   **Make everything physical**: Turn abstractions into **objects**. Models don’t understand metaphor alone.
    *   *Weak:* “falling into despair”
    *   *Strong:* “**the floor collapses underfoot into a bottomless pit of black tar**”

### D. Camera physics

*   **Lock direction**: State the camera move explicitly, e.g. push in, pull back, or pan.
*   **Keep inertia**: If the last shot **pushed in**, this shot must **continue** pushing in—random moves cause visual whiplash.
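
The inertia rule can be checked mechanically once every shot carries a camera-move label. The sketch below assumes moves are plain strings such as `"push in"` and simply flags every shot whose move differs from the previous one; whether a flagged change is a deliberate cut is left to the writer.

```python
def find_camera_breaks(moves):
    """Return indices of shots whose camera move differs from the previous shot.

    moves: camera-move labels in shot order, e.g. ["push in", "push in", "pull back"].
    """
    normalized = [move.strip().lower() for move in moves]
    return [
        index
        for index in range(1, len(normalized))
        if normalized[index] != normalized[index - 1]
    ]
```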

### E. Physical bridge (transitions)

*   **Input dependency**: Say explicitly: “this shot is generated from the **last frame of the previous shot**.”
*   **No pop in/out**: Nothing appears or vanishes without a visible process.
    *   *Weak:* “the house disappears”
    *   *Strong:* “**the house crumbles from the roof into golden sand blown away by wind**”

---

## 5. Execution workflow

1.  **Storyboard**: Lock the story, split into $N$ shots.
2.  **Duration math**:
    *   Write lines $\rightarrow$ count characters $\rightarrow$ divide by speech rate ($4.5$ characters per second) $\rightarrow$ **round up** to a whole-second duration $T$.
    *   *Check:* $T \le 10\text{s}$ for diffusion segments.
3.  **Continuity**:
    *   For each shot, define **start frame** and **end frame** sources.
    *   *Strategy A (Diff $\rightarrow$ Diff)*: previous **end frame** = next **start frame** (I2V).
    *   *Strategy B (Code $\rightarrow$ Diff)*: the **exported last frame** of the code shot = the **first frame** of the diffusion shot.
4.  **Asset build**:
    *   Render all **silent** video segments.
    *   Generate matching **TTS** and **SRT**.
    *   **Verify:** $\sum(\text{segment durations}) = \text{total audio duration}$.
5.  **Final mux**: Remotion combines video, audio, and subtitle layers into MP4.
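
The duration math in step 2 and the audio check in step 4 can be sketched in a few lines. The function names are hypothetical, counting every non-whitespace character (punctuation included) is an assumption, and so is the small tolerance on the audio comparison, since measured audio length is rarely an exact integer.

```python
import math

SPEECH_RATE = 4.5  # characters per second, from step 2


def line_duration(line: str, rate: float = SPEECH_RATE) -> int:
    """Whole-second duration T for one voice-over line, rounded up."""
    # Assumption: every non-whitespace character counts, punctuation included.
    characters = sum(1 for ch in line if not ch.isspace())
    return math.ceil(characters / rate)


def audio_matches(segment_seconds, audio_seconds: float, tolerance: float = 0.05) -> bool:
    """Check that the segment durations add up to the total audio duration."""
    return abs(sum(segment_seconds) - audio_seconds) <= tolerance
```

A 10-character line gives `line_duration(...) == 3` (10 / 4.5 ≈ 2.2, rounded up). Any diffusion line whose result exceeds 10 needs to be shortened or split across shots.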

---

## 6. Standard script output template

When writing a script, use this structure.

### Video basics

*   **Theme**: [e.g. a developer sorting a mountain of messy code]
*   **Estimated total length**: $[xx]\ \text{s}$
*   **Resolution**: $1920 \times 1080$ ($1080\text{p}$)
*   **Style keywords**: [e.g. minimal, low-poly, cool palette]

### Shot execution table

| Shot ID | Duration (s) | Technique | Visual & diffusion prompt / code logic | Audio (VO + subtitles) | Transition strategy |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **01** | 5 | **Diffusion** (T2V) | **[Style]** … <br> **[Time]** 【0–2s】… <br> **[Entity]** … <br> **[Camera]** … | “This is everything that piled up this week.” | **Cold open**: text-only generation; no prior frame. |
| **02** | 8 | **Code** (React/SVG) | **UI**: giant red progress bar SVG.<br> **Motion**: numbers jump 0%→99%; warning icon blinks. | “The system is on the edge.” | **Hard cut**: clean code look vs previous chaos. |
| **03** | 6 | **Diffusion** (I2V) | **[Style]** …<br> **[Bridge]** Start from the red warning; red **liquifies** into flowing lava… | “We must cool it down now.” | **I2V**: **Shot 02 last frame** → **Shot 03 first frame**. |
| … | … | … | … | … | … |
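
If the shot list is kept as data, the table can be generated rather than typed by hand. The sketch below is one hypothetical layout; the `Shot` class and both function names are illustrative, and cell text is assumed not to contain a literal `|`.

```python
from dataclasses import dataclass

HEADER = (
    "| Shot ID | Duration (s) | Technique | Visual & diffusion prompt / code logic "
    "| Audio (VO + subtitles) | Transition strategy |"
)
DIVIDER = "| :--- | :--- | :--- | :--- | :--- | :--- |"


@dataclass
class Shot:
    shot_id: str
    seconds: int
    technique: str
    visual: str
    audio: str
    transition: str

    def to_row(self) -> str:
        """Render this shot as one markdown table row."""
        cells = [f"**{self.shot_id}**", str(self.seconds), self.technique,
                 self.visual, self.audio, self.transition]
        return "| " + " | ".join(cells) + " |"


def render_table(shots) -> str:
    """Render the full shot execution table."""
    return "\n".join([HEADER, DIVIDER, *(shot.to_row() for shot in shots)])


def missing_start_frames(shots):
    """Return IDs of I2V shots that have no previous shot to supply a start frame."""
    return [shot.shot_id for index, shot in enumerate(shots)
            if "I2V" in shot.technique and index == 0]
```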

---

## Document metadata

| Field | Value |
|-------|-------|
| Source | `script_skill.md` (Chinese) |
| Last updated | 2026-03-30 |

More from this repository

ad_image_create (Skill)

Create ad-ready product images (single or collage) by back-solving sub-image sizes from target output ratio, grounding scene design with media_comprehension, generating images via image_generator with strict request params and actor-count control, and pairing each deliverable with a short social tagline for 小红书/抖音.

ad_video_create (Skill)

Create ad-ready product video from product images, with or without character/subject images. The workflow leverages AI-powered image composition, scene understanding, and video generation. Video prompts should follow commercial shot language—visual hooks, product presence, hero shots, detail showcase, function expression, and dynamic visuals.

agent-browser (Skill)

Automates browser interactions for web testing, form filling, screenshots, and data extraction. Use when the user needs to navigate websites, interact with web pages, fill forms, take screenshots, test web applications, or extract information from web pages.

app_evaluator (Skill)

A professional skill for App Evaluation (evaluating app's performance with score) and App Improvement (giving professional suggestions for improving the app's performance).

embedded-video-pip-smooth-playback (Skill)

last_7_days_news (Skill)

Search and summarize the latest 7 days of AI news and X discussions using public sources plus browser-based X collection. Use for recent AI news, trends, X discussions, industry briefs, and summaries organized into hot topics, viewpoints, and opportunity areas.

media_comprehension (Skill)

An intelligent assistant specialized in handling media files (images/audio/video). **Only for media file analysis**; it does not handle document types.

✅ Media files that can be processed:
- Images: .jpg, .jpeg, .png, .gif, .bmp, .webp, .svg
- Audio: .mp3, .wav, .m4a, .flac, .aac, .ogg
- Video: .mp4, .avi, .mov, .mkv, .webm, .flv

❌ Files that cannot be processed (please do not trigger this skill):
- Documents: .pdf, .doc, .docx, .txt, .md, .rtf
- Spreadsheets: .xlsx, .xls, .csv, .tsv
- Presentations: .pptx, .ppt, .key
- Code: .py, .js, .ts, .java, .cpp, .go, .rs
- Archives: .zip, .tar, .gz, .rar, .7z
- Executables: .exe, .bin, .app, .dmg
- Databases: .db, .sqlite, .sql
- Configuration files: .json, .xml, .yaml, .yml, .toml, .ini
- Web pages: .html, .htm, .css

**Trigger conditions**: When the user explicitly requests to analyze image/audio/video content, or when the file extension belongs to the aforementioned media types.

optimizer (Skill)

Analyzes and automatically optimizes existing agents by improving system prompts and tool configuration.