Skill107 repo starsupdated 4d ago

wjs-editing-multicam

wjs-editing-multicam combines multiple synced camera recordings into a single MP4 by automatically selecting which angle to display each second based on audio energy levels, using the polysync Python package. Use this when you have two or more video files of the same event (each with a .sync.json sidecar from wjs-syncing-multicam) and want them merged with hard cuts between cameras or with a picture-in-picture inset of an alternate angle.

View source Repository: claude-skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/jianshuo/claude-skills /tmp/wjs-editing-multicam && cp -r /tmp/wjs-editing-multicam/wjs-editing-multicam ~/.claude/skills/wjs-editing-multicam

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# wjs-editing-multicam

Combine N synced camera angles into a single rendered MP4. Decisions are audio-energy-driven only — the cam with the loudest mic each second wins. Output is hard cuts (or hard cuts plus a corner PiP).

## Setup & commands

The implementation lives in the open-source **`polysync`** pip package (<https://pypi.org/project/polysync/> · <https://github.com/jianshuo/polysync>) — this skill no longer ships its own scripts. Install it, then drive it via its CLI:

```bash
python3 -m pip install -U polysync      # needs ffmpeg/ffprobe on PATH

polysync edit        CAM_A CAM_B CAM_C --out edl.json   # build the decision list
polysync render-cuts edl.json --out out.mp4             # hard cuts
polysync render-pip  edl.json --out out.mp4 --pip bottom-right   # cuts + corner inset
```

`edit` and the renderers read each input's `.sync.json` automatically. Sync first with **wjs-syncing-multicam** (`polysync sync`) if the sidecars don't exist yet.

Render flags for raw camera footage (see **Preflight** below for when to use each):

```bash
# Sony S-Log3 footage, shot vertical, FX6 cams turned on their side, for 小红书:
polysync render-cuts edl.json --out out.mp4 \
    --log slog3 \           # S-Log3/S-Gamut3.Cine -> Rec.709 grade
    --rotate 1:90 --rotate 2:90 \   # rotate cam1,cam2 90° CW (FX6 with no flag)
    --width 1080 --height 1920 --fill \   # vertical, crop-to-fill (no black bars)
    --duck-audio --audio-cams 0,1        # clean speaker-gated audio (cams 0,1 = the two lavs)
```

`--duck-audio` replaces the single-cam soundtrack with a speaker-gated mix: each moment keeps the active speaker's close mic and ducks the rest (much cleaner than a constant 2-mic sum, which piles up bleed/room tone). `--audio-cams 0,1` restricts gating to the real speaker mics — **always exclude the wide/room cam**, whose mic sits at a similar level but is reverby and would otherwise get picked. (See the editing-quality note below for the why.)

## Preflight — ALWAYS check raw footage before rendering (hard-won)

Straight-off-the-card footage renders WRONG without these checks. Spend 2 minutes here or you'll render the whole thing broken (we did).

1. **Color profile (Log).** Sony FX3/FX6 default to **S-Log3 / S-Gamut3.Cine** — flat, grey, low-contrast. It MUST be converted to Rec.709 or it looks washed-out and broken. Check the `.XML` sidecar's `CaptureGammaEquation` (`s-log3-cine`) or `ffprobe -show_entries stream=color_transfer`. Fix: `--log slog3`. (Performance: the LUT is applied AFTER downscale automatically — a 3D LUT on 4K is ~4x slower than on 1080p for an identical result.)
2. **Orientation.** Phone / vertically-mounted shoots record rotated. Extract one frame per cam (`ffmpeg -ss 200 -i CAM -frames:v 1 f.jpg`) and LOOK. Some cameras (FX3) write a rotation flag → ffmpeg auto-rotates, fine. Others (FX6 physically turned on its side) write **no flag** → people come out lying down. Fix per-cam: `--rotate 1:90` (90 = clockwise; try 270 if upside-from-the-other-side).
3. **Delivery orientation.** 小红书 / Reels / Shorts are **vertical** — and this footage is usually shot vertical. Render `--width 1080 --height 1920 --fill`. Default (1920×1080, pad) is for landscape only; mixing a portrait source into a landscape frame pillarboxes it (ugly black side bars).
4. **Staggered camera starts.** Cameras rarely roll at the same instant; the sidecar `delta`/coverage shows it. The opening N seconds may be single-cam (only the first-rolling camera covers t=0). That's unavoidable — expect a single-cam intro of `max(delta)` seconds.

## Editing quality — when `polysync edit` isn't enough

`polysync edit` switches to the **loudest mic** per second. With **close, bleeding mics** (each cam's mic also picks up the other speaker loudly), the close/guest mic stays loudest even when the *other* person talks, so the editor over-selects that cam and the cuts don't track the real speaker. Two fixes, applied by hand on the EDL when "cut to the correct speaker" matters:

- **Per-mic baseline-normalized speaker attribution.** Per second, subtract each mic's own median energy, then pick whichever mic is *highest relative to its own baseline* → that's who's actually talking. (polysync's raw `cam[k] - mean(others)` doesn't remove the per-mic baseline offset, so it favors the loud mic.)
- **Cutaways for rhythm (剪辑感).** Audio attribution alone gives long static holds. Cap any single shot (~8–10 s) and insert ~3 s cutaways to the **listener** (reaction shot) and the **wide** cam, alternating. The wide cam's quiet mic means the editor never picks it on energy — inject it deliberately for establishing / 整体.

## What this skill IS — and IS NOT

| Is | Is not |
|---|---|
| Audio-energy-driven cam switching | Face / framing detection (no face_recognition, no MediaPipe) |
| Single-source audio (one cam's mic) | Multi-mic mix / per-speaker gating |
| Hard cuts, with optional PiP inset | Crossfades / opacity transitions / sliding animations |
| `ffmpeg` concat + `overlay` filter renders | HyperFrames composition / `<hf-clip>` |
| Coverage-aware (won't pick a cam outside its sidecar window) | Frame-accurate beat alignment / VAD-edge cuts |

If you need face tracking, fade transitions, captions, or HyperFrames composition, use the **hyperframes** skill on top of this skill's MP4 output.

## REQUIRED INPUT

**Original camera files (untouched) plus their `.sync.json` sidecars next to them.** If sources aren't synced yet, run **wjs-syncing-multicam** first to write the sidecars. Missing sidecar = cam assumed at delta=0, full coverage.

`polysync edit` reads each sidecar for `delta_seconds` + `overlap_in_reference`, lifts the cam's audio envelope into the reference timeline, and only schedules a cam during its coverage window. `polysync render-cuts` / `polysync render-pip` apply `ffmpeg -itsoffset` per input using the EDL's `deltas[]` array.

## When NOT to use

- One source — nothing to switch between; use **video-segmentation**.
- Polishe

More from this repository

skill-quality-reviewerSubagent

Repo-wide drift detector for the wjs-* Claude Code skills in this marketplace. Sweeps every SKILL.md, scores it against the repo's own conventions (V-ing naming, trigger-phrase density, companion files, description shape), and returns a grouped punch list ordered by severity. Read-only — never edits files. Use before pushing a batch of skill changes, or whenever you wonder "are these skills still internally consistent?

wangjianshuo-perspectiveSkill

wjs-auditing-projectSkill

Use when the user asks to audit what's wrong with a project, "make it right", "看看项目出了什么问题", "为什么用户的需求还没上线", "为什么没提交App Store", "为什么没新build", or wants a holistic state-of-the-project check covering unmerged branches, stalled PRs, failed GitHub Actions, stale builds, plan drift (TODOS.md / ROADMAP), unreleased commits, and log errors. Runs read-only investigation, presents a grouped checklist, fixes only after explicit user confirmation. Aware of the Cathier iOS app workflow (Xcode + fastlane + auto-merge @claude PRs from in-app feedback).

wjs-burning-subtitlesSkill

Use when the user has a video + an SRT and wants the subtitles either burned into the pixels (libass, always-visible) or soft-muxed as a togglable track. Also handles the final composite step for the localization pipeline — burn subs, mix a dub track, and keep the original audio as a low-volume bed, all in ONE ffmpeg encode (no cascade). Verifies libass availability and auto-downloads a static evermeet ffmpeg build when Homebrew's stripped binary lacks it. Triggers — "烧字幕", "硬字幕", "burn subtitles", "burn-in subs", "embed subtitle", "soft mux SRT", "把字幕烧进视频", "做最终合成".

wjs-cleaning-spamSkill

Use when the user complains about spam on his X/Twitter posts — 同城面付 / 寻固炮 / 线下上门 / 免费破处这类引流号在他推文下刷的 emoji 垃圾回复 — and wants them removed. Covers the last 7 days (X recent-search window). Triggers — "把这些spam删掉", "清理X垃圾回复", "推文下面好多引流号", "clean spam replies", "/wjs-cleaning-spam".

wjs-converting-text-to-videoSkill

Use when the user wants a 王建硕-style WeChat article (article.md) turned into a narrated short MP4 video — TTS voiceover via 火山引擎 Volcano TTS, HyperFrames CSS/GSAP animation per scene, subtle SFX, abstract watercolor background, full pipeline rendering to 1080×1920 portrait MP4 (30-90s). Triggers — "把这篇文章做成视频", "做一个解说视频", "讲解视频", "/wjs-converting-text-to-video".

wjs-converting-wp-to-hugoSkill

Use when migrating a WordPress site to a Hugo static site on GitHub Pages from a WXR export (.xml) plus the wp-content/uploads folder — preserving /archives/<id>/ URLs, localizing images, and deploying via GitHub Actions. Triggers — "把 WordPress 迁成 Hugo", "wordpress 转静态站", "migrate WordPress to Hugo", "WXR to Hugo", "publish WordPress to GitHub Pages", "/wjs-converting-wp-to-hugo".

wjs-dubbing-videoSkill

Use when the user has a video + a target-language SRT and wants the video to actually speak that language — generates a time-aligned TTS voice dub. Routes by voice ID — Volcano (豆包) TTS for Chinese, edge-tts neural for any language. Defaults to one voice (single-speaker); opt-in multi-speaker via visual diarization. Outputs `*_<lang>_dub.mp4` with the dub audio in place of the original. Final mixing (audio bed + burn-in) is handed off to `/wjs-burning-subtitles`. Triggers — "配音", "中文配音", "Chinese dub", "voice over this", "dub the video", "TTS this SRT", "different voice for each speaker".