Skill1.1k repo starsupdated 10d ago

video-polish

Video Polish takes a raw screen recording or demo video and automatically adds professional zoom and pan effects that synchronize with the narration. It analyzes the audio transcript to identify what the speaker discusses, then applies smooth camera movements that focus on relevant UI elements, text, or sections at the appropriate moments. Use this skill to transform flat screen recordings into polished, engaging videos with dynamic cinematography without manually editing each zoom or pan effect.

View source Repository: goose-skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/gooseworks-ai/goose-skills /tmp/video-polish && cp -r /tmp/video-polish/skills/design/packs/video-production/video-polish ~/.claude/skills/video-polish

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Video Polish Skill

You take an existing video (screen recording, demo, walkthrough, Loom) and add professional zoom/pan effects that follow the narration. The output looks like a professionally edited video where the camera zooms into whatever the speaker is discussing.

---

## What This Skill Does

**Input:** A raw video file (screen recording, Loom, product demo) + optionally a soundtrack
**Output:** The same video with smooth zoom/pan effects synchronized to the narration

**What it adds:**
- Zoom-in effects on UI elements, metrics, text, code when the narrator mentions them
- Smooth pan/slide effects across sections (e.g., sliding across table columns)
- Transitions with ease-in-out easing (no jarring jumps)
- Optional audio replacement (background music instead of or mixed with original narration)

**What it does NOT do:**
- Generate new video content
- Add avatars or talking heads
- Edit or cut the video (no trimming, no removing sections)
- Add text overlays or annotations

---

## Prerequisites

- **Node.js** (v18+) and **npm** — required for Remotion
- **Remotion** — Video rendering framework. If not already set up, create a project:
  ```bash
  npx create-video@latest --yes --blank --no-tailwind video-polish
  cd video-polish && npm i
  ```
- **whisper-cpp** — For audio transcription. Install via `brew install whisper-cpp` on macOS
- **Whisper model** — Download the base English model:
  ```bash
  curl -L -o /tmp/ggml-base.en.bin "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin"
  ```
- **Python 3 + Pillow** — For frame extraction and coordinate grid overlays (`pip install Pillow`)

**Before starting:** Verify that Node.js, whisper-cpp, and Python 3 with Pillow are installed. If any are missing, instruct the user to install them before proceeding.

---

## How This Skill Works

### Step 1: Analyze the Source Video

Get video metadata:
```bash
npx remotion ffprobe -v quiet -print_format json -show_format -show_streams <video_path>
```

Record: duration, resolution (width x height), fps, whether audio exists.

### Step 2: Transcribe the Audio

Extract audio and transcribe with word-level timestamps.

**Extract audio:**
```bash
npx remotion ffmpeg -y -i <video_path> -vn -acodec pcm_s16le -ar 16000 -ac 1 /tmp/audio.wav
```

**Transcribe:**
```bash
whisper-cli -m /tmp/ggml-base.en.bin -f /tmp/audio.wav --output-json --output-file /tmp/transcript
```

**Read the transcript** and identify moments where the narrator emphasizes or references specific on-screen elements:
- "look at this", "you can see", "notice", "here", "this is where"
- Naming specific UI elements: "the latency column", "pass rate", "this button"
- Describing what's on screen: "each row represents...", "on the top you can see..."

For each emphasis moment, record:
- Timestamp (when they start talking about it)
- What they're referring to (which UI element, metric, section)

### Step 3: Extract Source Frames at Key Timestamps

For each emphasis moment from the transcript, extract the source frame at that timestamp:

```bash
npx remotion ffmpeg -y -i <video_path> -vf "select=eq(n\,<frame_number>)" -vframes 1 -update 1 /tmp/frame-<timestamp>.png
```

Where `frame_number = timestamp_seconds * fps`.

**Important:** The video content may scroll or change over time. Always extract frames at the ACTUAL timestamp, not from a single reference frame. UI element positions change as the user scrolls.

### Step 4: Measure Zoom Target Coordinates

For each extracted frame, draw a coordinate grid overlay to precisely identify element positions:

```python
from PIL import Image, ImageDraw
img = Image.open('/tmp/frame-<timestamp>.png')
w, h = img.size
draw = ImageDraw.Draw(img)
for pct in range(0, 100, 5):
    y = int(h * pct / 100)
    x = int(w * pct / 100)
    color = 'red' if pct % 10 == 0 else 'yellow'
    draw.line([(0, y), (w, y)], fill=color, width=1)
    draw.text((5, y+2), f'y{pct}', fill=color)
    draw.line([(x, 0), (x, h)], fill=color, width=1)
    draw.text((x+2, 12), f'x{pct}', fill=color)
img.save('/tmp/frame-<timestamp>-grid.png')
```

Look at the grid overlay and measure the **center point** (focusX, focusY) of the element the narrator is referring to. Express as normalized 0-1 values (e.g., x=0.47 means 47% from the left edge).

### Step 5: Build the Keyframe Timeline

Create a list of keyframes. Each keyframe defines a target zoom state at a specific time. The system smoothly interpolates between consecutive keyframes using ease-in-out.

**Keyframe design rules:**

1. **Anticipate the narration by 0.5 seconds.** If the narrator says "pass rate" at 0:37, start zooming at 0:36.5 so the zoom arrives just as they say it.

2. **Between two zoom-in targets, don't zoom all the way out.** If going from metric A to metric B, reduce zoom to 1.5x briefly while shifting focus, then zoom back in. This is faster and smoother than full-out-then-full-in.

3. **For sliding across adjacent elements** (e.g., table columns), keep the same zoom level and just change focusX. This creates a smooth horizontal pan.

4. **Fast transitions between distant targets** (e.g., jumping from the query column to the latency column) should take 1-1.5 seconds max. Slow slides across distant areas feel boring.

5. **Slow slides across adjacent elements** (e.g., panning from latency → tokens → cost) should take 3-5 seconds. This lets the viewer read each element.

6. **Hold the zoom for at least 2-3 seconds** after arriving at a target. Quick zoom-in → immediate zoom-out is disorienting.

7. **Start and end the video at full view (zoom=1.0).** Don't start zoomed in — let the viewer orient first.

**Zoom level guide:**
| What you're showing | Zoom level |
|---|---|
| Full dashboard/page overview | 1.0 (no zoom) |
| A section (metrics row + charts) | 1.1-1.3 |
| A specific area (one chart, a few table columns) | 1.8-2.5 |
| A single metric, cell, or button | 2.8-3.5 |

**Example keyframe timeline:**
```typescript
const KEYFRA