Skill107 repo starsupdated 4d ago

wjs-reframing-video

This Claude Code skill converts videos between portrait and landscape orientations by intelligently cropping a narrow band from the source while tracking the active speaker's mouth movement via MediaPipe face landmarks. Use it when repurposing horizontal interviews or podcasts for vertical platforms like TikTok and YouTube Shorts, or vice versa, ensuring the talking person remains centered in frame rather than simply rotating the video.

View source Repository: claude-skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/jianshuo/claude-skills /tmp/wjs-reframing-video && cp -r /tmp/wjs-reframing-video/wjs-reframing-video ~/.claude/skills/wjs-reframing-video

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# wjs-reframing-video

Convert a video's orientation by **cropping** a narrow band from the source — not by physically rotating it. The crop window follows the **active speaker** (the face whose mouth is *moving*), not just the largest or most-confident face. A `.crop.json` sidecar records the crop plan, the per-segment speaker decisions, and the parameters used. The original input is never modified.

## When to use

- Repurposing a 16:9 podcast / interview / talk for vertical short-video platforms (WeChat Channels 视频号, Douyin 抖音, Xiaohongshu 小红书, YouTube Shorts, TikTok, Reels).
- Repurposing a 9:16 phone recording for horizontal players (YouTube long-form, blog embeds).
- Repurposing 4:3 archive footage for 3:4 mobile, or vice versa.

The output aspect is the source aspect with width and height swapped — 16:9 → 9:16, not "letterboxed 16:9 in a 9:16 frame".

## When NOT to use

- **Multi-person Q&A** where each face needs its own crop — this skill picks one crop track per video. For per-speaker split renders, use **wjs-editing-multicam** instead.
- **Animated content / B-roll with no faces** — falls back to center crop, usually wrong for the intent.
- **Heavy camera motion in the source** (handheld pan/zoom) — the face tracker amplifies camera shake. Stabilize first.
- **Source already at target aspect** — no work to do.

## What this skill IS — and IS NOT

| Is | Is not |
|---|---|
| **Visual active-speaker detection** via MAR (mouth-aspect-ratio) variance | Audio-visual fusion (audio energy + lip motion cross-correlated) |
| Stable face tracking across frames by center-distance matching | Re-identification across long gaps / occlusions |
| Speaker-aligned segments with hysteresis to prevent flicker | Frame-by-frame switching on every flicker |
| `--face-pick speaker` (default) — pick whoever's mouth is moving | `--face-pick largest` (opt-in legacy) — pick largest face |
| **Hard cuts between segments, fixed crop within each segment** (`--motion cut`, default) | Smooth panning that drifts during a speaker's turn (opt-in `--motion smooth`) |
| Audio stream-copy (bit-exact) | Audio reprocessing / re-encoding |
| MediaPipe Tasks `FaceLandmarker` (478-pt mesh) at 5 fps sampled via ffmpeg | Per-frame neural inpainting / out-painting |
| One `ffmpeg crop + scale` pass | Frame-by-frame Python compositor |

Falls back to "largest face" automatically when no one is talking (silence, music-only stretches).

## Dependencies

```bash
pip install mediapipe opencv-python numpy
```

(MediaPipe lives outside the standard Python distribution; ffmpeg and ffprobe must be on `PATH`.)

**First-run model download**: MediaPipe 0.10+ uses the Tasks API, which needs a `face_landmarker.task` model file (~4 MB). On the first call, `crop.py` downloads it to `~/.claude/skills/wjs-reframing-video/models/` and caches it for subsequent runs. The script fails offline on first run.

**Range limitation**: The bundled landmarker is tuned for faces within ~2 m of the camera (selfie / podcast / interview distance). Wide event shots with small faces may not detect — sample a frame first to confirm.

## Crop math

Source aspect = `W / H`. Target aspect = `H / W` (inverted). Compute crop window:

| Source orientation | Crop window |
|---|---|
| Horizontal (W > H) → Portrait | `W_crop = H × H / W`, `H_crop = H` (narrow vertical band) |
| Portrait (W < H) → Horizontal | `W_crop = W`, `H_crop = W × W / H` (narrow horizontal band) |

For 1920×1080 → portrait, `W_crop = 608`, `H_crop = 1080`. Final scale to 1080×1920 (upscale ~1.78×).
For 1080×1920 → landscape, `W_crop = 1080`, `H_crop = 608`. Final scale to 1920×1080.

Override the final size via `--output-size 1080x1920` if you want native crop dimensions instead of upscaling.

## Pipeline

1. **Probe** input dimensions, fps, duration via ffprobe.
2. **Decide orientation** — auto from aspect (`--target portrait|landscape` to override).
3. **Sample frames at `--sample-fps`** (default 5; high enough to catch mouth motion — Nyquist for speech is ~10 Hz, we need at least 4–5 fps).
4. **Detect face landmarks** per sampled frame with MediaPipe Tasks `FaceLandmarker` (478 landmarks). For each detected face record: center, size proxy, MAR (mouth-aspect-ratio = inner-lip vertical distance / horizontal mouth-corner distance).
5. **Track faces** across frames by center-distance matching → each face gets a stable `face_id`.
6. **Per-sample active speaker**: for each face track, variance of MAR over a sliding window (`--mar-var-window-sec`, default 1 s). The face with the highest variance is "speaking". Below `--mar-var-threshold`, no one is speaking → fall back to largest face.
7. **Hysteresis**: a candidate switch only commits if the new speaker is stable for `--min-segment-sec` (default 1.5 s). Shorter flickers are squashed — prevents the crop from ping-ponging on a one-frame mis-detection.
8. **Speaker-aligned segments** → for each segment, mean (cx, cy) of that speaker's face over the segment becomes the crop center, *fixed* for the full duration of the segment.
9. **Build a ffmpeg step-function expression** (`--motion cut`, default) that holds each segment's crop position constant and **jumps instantly at each segment boundary** — the visual feel of a real cut between camera angles. (`--motion smooth` switches to piecewise-linear pan between segment midpoints; rarely the right call for talking-head content because the camera appears to drift mid-sentence.)
10. **Render** one ffmpeg pass — `crop=W:H:x='expr':y='expr', scale=OUT_W:OUT_H`. The crop filter evaluates `x` and `y` per frame natively. Audio stream-copied.

`scripts/crop.py` is the implementation. Output side effects:
- `<input>.crop.json` — sidecar with the crop plan
- `<input>_cropped.mp4` — final cropped + scaled video

## Sidecar schema (`<input>.crop.json`)

```json
{
  "_about": "wjs-reframing-video crop plan for cam_a.MOV. Active-speaker detected via MAR variance.",
  "_help": {
    "source_size":     "[width, height] in

More from this repository

skill-quality-reviewerSubagent

Repo-wide drift detector for the wjs-* Claude Code skills in this marketplace. Sweeps every SKILL.md, scores it against the repo's own conventions (V-ing naming, trigger-phrase density, companion files, description shape), and returns a grouped punch list ordered by severity. Read-only — never edits files. Use before pushing a batch of skill changes, or whenever you wonder "are these skills still internally consistent?

wangjianshuo-perspectiveSkill

wjs-auditing-projectSkill

Use when the user asks to audit what's wrong with a project, "make it right", "看看项目出了什么问题", "为什么用户的需求还没上线", "为什么没提交App Store", "为什么没新build", or wants a holistic state-of-the-project check covering unmerged branches, stalled PRs, failed GitHub Actions, stale builds, plan drift (TODOS.md / ROADMAP), unreleased commits, and log errors. Runs read-only investigation, presents a grouped checklist, fixes only after explicit user confirmation. Aware of the Cathier iOS app workflow (Xcode + fastlane + auto-merge @claude PRs from in-app feedback).

wjs-burning-subtitlesSkill

Use when the user has a video + an SRT and wants the subtitles either burned into the pixels (libass, always-visible) or soft-muxed as a togglable track. Also handles the final composite step for the localization pipeline — burn subs, mix a dub track, and keep the original audio as a low-volume bed, all in ONE ffmpeg encode (no cascade). Verifies libass availability and auto-downloads a static evermeet ffmpeg build when Homebrew's stripped binary lacks it. Triggers — "烧字幕", "硬字幕", "burn subtitles", "burn-in subs", "embed subtitle", "soft mux SRT", "把字幕烧进视频", "做最终合成".

wjs-cleaning-spamSkill

Use when the user complains about spam on his X/Twitter posts — 同城面付 / 寻固炮 / 线下上门 / 免费破处这类引流号在他推文下刷的 emoji 垃圾回复 — and wants them removed. Covers the last 7 days (X recent-search window). Triggers — "把这些spam删掉", "清理X垃圾回复", "推文下面好多引流号", "clean spam replies", "/wjs-cleaning-spam".

wjs-converting-text-to-videoSkill

Use when the user wants a 王建硕-style WeChat article (article.md) turned into a narrated short MP4 video — TTS voiceover via 火山引擎 Volcano TTS, HyperFrames CSS/GSAP animation per scene, subtle SFX, abstract watercolor background, full pipeline rendering to 1080×1920 portrait MP4 (30-90s). Triggers — "把这篇文章做成视频", "做一个解说视频", "讲解视频", "/wjs-converting-text-to-video".

wjs-converting-wp-to-hugoSkill

Use when migrating a WordPress site to a Hugo static site on GitHub Pages from a WXR export (.xml) plus the wp-content/uploads folder — preserving /archives/<id>/ URLs, localizing images, and deploying via GitHub Actions. Triggers — "把 WordPress 迁成 Hugo", "wordpress 转静态站", "migrate WordPress to Hugo", "WXR to Hugo", "publish WordPress to GitHub Pages", "/wjs-converting-wp-to-hugo".

wjs-dubbing-videoSkill

Use when the user has a video + a target-language SRT and wants the video to actually speak that language — generates a time-aligned TTS voice dub. Routes by voice ID — Volcano (豆包) TTS for Chinese, edge-tts neural for any language. Defaults to one voice (single-speaker); opt-in multi-speaker via visual diarization. Outputs `*_<lang>_dub.mp4` with the dub audio in place of the original. Final mixing (audio bed + burn-in) is handed off to `/wjs-burning-subtitles`. Triggers — "配音", "中文配音", "Chinese dub", "voice over this", "dub the video", "TTS this SRT", "different voice for each speaker".