Skill107 repo starsupdated 4d ago

wjs-segmenting-video

This Claude Code skill segments long-form video (interviews, lectures, podcasts) into multiple stand-alone short clips using an existing SRT transcript. The agent reads the transcript, identifies topic boundaries through semantic judgment, and uses `segment.py` to cut the video at precise timestamps, producing raw cropped clips and per-clip SRT files ready for downstream post-production in `/wjs-overlaying-video`. Use this when you have a video over 10 minutes with a transcript and need topical short-form clips for social platforms.

View source Repository: claude-skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/jianshuo/claude-skills /tmp/wjs-segmenting-video && cp -r /tmp/wjs-segmenting-video/wjs-segmenting-video ~/.claude/skills/wjs-segmenting-video

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# wjs-segmenting-video

Cut a long video + SRT into multiple stand-alone short clips, each
oriented for the target platform. **This skill stops after cutting +
cropping** — it hands off the raw clips to `/wjs-overlaying-video` for
covers, captions, illustrations, CTA, and final render.

## When to use

- Long-form video (≥10 min) with an existing SRT transcript.
- Goal is **stand-alone** short clips (each viewable without context).
- The user will (or you will) drive post-production separately in
  `/wjs-overlaying-video`.

## When NOT to use

- Single-topic trimming → just use `ffmpeg -ss A -to B`.
- No transcript yet → run **`/wjs-transcribing-audio`** first (then `/wjs-translating-subtitles` if the segments need a non-source language).
- Multicam editing → use **`/wjs-editing-multicam`**.
- Highlight reel with multiple cuts inside a single topic → that's
  editing, not segmentation.

## What this skill IS — and IS NOT

| Is | Is not |
|---|---|
| You (the agent) **read the full SRT and decide the topic boundaries** | A script that runs NLP topic modeling, silence detection, or "viral moment" scoring. Topic boundaries are semantic; competing tools (Descript, OpusClip, Riverside Magic Clips) all get this wrong by automating it. |
| `segment.py` cuts; `/wjs-reframing-video` reorients | An end-to-end "magic" pipeline |
| Accurate-seek cuts by default (re-encode) — clip starts EXACTLY at requested timestamp | Stream-copy cuts (those produce keyframe-snap drift up to GOP duration) |
| Hands off **raw cropped clips + per-clip SRTs** | Burned subtitles, covers, intros, CTAs (those live in `/wjs-overlaying-video`) |

## The pipeline

```
long video + SRT
   ↓     (agent reads SRT, decides topics — judgment, not parsing)
segments.json
   ↓     segment.py --reencode (accurate seek; clip starts exactly at requested t)
clip_NN.mp4 + frame_NN.jpg
   ↓     ASK: target platform orientation match source?
   ↓     /wjs-reframing-video on each clip (if 16:9 → 9:16, etc.)
   ↓     re-extract frames from cropped clips
clip_NN.mp4 (now in target orientation) + clip_NN.zh-CN.burn.srt
   ↓
HAND OFF → /wjs-overlaying-video
   (does covers + captions + illustrations + CTA + final render)
```

## Step 1 — Read SRT, write `segments.json`

**Don't outsource topic identification to a script.** For each candidate segment, judge:

- **Self-contained?** A cold viewer must understand it without prior context.
- **Single thread?** One central question / insight; if the speaker pivots mid-clip, that's two segments.
- **Length fits platform?** 60–180s for 视频号 / 30–60s for 抖音&Shorts. <30s feels truncated; >4min loses retention.
- **Hook + payoff?** Open on a claim / question / vivid image; close on a takeaway. Never end mid-sentence.
- **Snap to SRT cue boundaries** — never cut mid-word.

3–6 strong segments from a 10-minute source is normal. Drop boring middles. Quality > quantity.

Schema (full spec in `references/segments_schema.json`, example in `references/example_segments.json`):

```json
{
  "source_video": "input.mp4",
  "source_srt": "input.zh-CN.srt",
  "platform": "wechat_channels",
  "segments": [{
    "id": 1, "slug": "intent-not-code",
    "title": "AI 时代不是写代码\n而是写意图",
    "summary": "Two-sentence pitch — what's the insight, what's at stake.",
    "start": "00:00:43.460", "end": "00:02:35.220",
    "cover_prompt": "Visual concept for gpt-image-2 (style anchor, not literal scene)"
  }]
}
```

`slug` = kebab-case English (used in filenames). `title` uses `\n` for line break, 2 lines max, 8–12 Chinese chars per line. `cover_prompt` is consumed downstream by `/wjs-overlaying-video`'s cover-generation step — keep it written here so the overlay skill can pick it up without re-asking.

## Step 2 — Accurate-seek cut

```bash
python3 ~/.claude/skills/wjs-segmenting-video/scripts/segment.py \
    --segments segments.json --out output/ --reencode
```

`--reencode` is the **default recommended mode**. It cuts with
`ffmpeg -ss N -i src -c:v libx264 -c:a aac` so the output starts
EXACTLY at the requested timestamp. ~30s per clip on CPU. Also extracts
a midpoint frame per segment to `output/frame_NN_slug.jpg`.

**Why default to `--reencode` and not stream-copy:**

Stream-copy via `ffmpeg -ss N -c copy` seeks to the nearest keyframe
*before* N (it can't re-encode). The output's t=0 then maps to source
t=keyframe, so the clip plays a fraction of a second of "lead-in"
content before the requested speech. Captions sliced from the master
SRT at boundary N appear **AHEAD of the audio** by exactly that GOP
fraction — listeners feel "subtitles lead the voice."

In practice on H.264 source with GOP=2s: every clip is off by
0.6–1.5s. Looks like a synchronization bug downstream; it's actually
a cut-time bug upstream.

### Stream-copy variant (only if you control the source encode)

If the source has been re-encoded with `-force_key_frames` at every
requested cut boundary, stream-copy IS accurate. Workflow:

```bash
# Build the comma-separated keyframe list from segments.json
KF=$(python3 -c "import json; s=json.load(open('segments.json'))
ts=[]
for seg in s['segments']:
    ts += [seg['start'], seg['end']]
print(','.join(ts))")

# Re-encode master once, forcing keyframes at all segment boundaries
ffmpeg -i master.mp4 \
  -c:v libx264 -preset medium -crf 18 \
  -force_key_frames "$KF" \
  -c:a copy master_kf.mp4

# Now stream-copy cuts land exactly:
python3 segment.py --segments segments.json --source master_kf.mp4 --out output/
```

Use this only when iterating on segment boundaries (you'll re-cut the
same source many times). For one-shot work, `--reencode` is simpler
and just as correct.

### Diagnosing keyframe-snap on already-cut clips

```bash
ffprobe -v error -select_streams v:0 -read_intervals "$((N-2))%$((N+5))" \
  -show_entries packet=pts_time,flags -of csv=p=0 master.mp4 | grep "K_"
```

Output like `360.023,K__   362.023,K__` → GOP=2s. A `-c copy` cut at
361.000 actually starts at 360.023, captions are 0.977s ahead of aud

More from this repository

skill-quality-reviewerSubagent

Repo-wide drift detector for the wjs-* Claude Code skills in this marketplace. Sweeps every SKILL.md, scores it against the repo's own conventions (V-ing naming, trigger-phrase density, companion files, description shape), and returns a grouped punch list ordered by severity. Read-only — never edits files. Use before pushing a batch of skill changes, or whenever you wonder "are these skills still internally consistent?

wangjianshuo-perspectiveSkill

wjs-auditing-projectSkill

Use when the user asks to audit what's wrong with a project, "make it right", "看看项目出了什么问题", "为什么用户的需求还没上线", "为什么没提交App Store", "为什么没新build", or wants a holistic state-of-the-project check covering unmerged branches, stalled PRs, failed GitHub Actions, stale builds, plan drift (TODOS.md / ROADMAP), unreleased commits, and log errors. Runs read-only investigation, presents a grouped checklist, fixes only after explicit user confirmation. Aware of the Cathier iOS app workflow (Xcode + fastlane + auto-merge @claude PRs from in-app feedback).

wjs-burning-subtitlesSkill

Use when the user has a video + an SRT and wants the subtitles either burned into the pixels (libass, always-visible) or soft-muxed as a togglable track. Also handles the final composite step for the localization pipeline — burn subs, mix a dub track, and keep the original audio as a low-volume bed, all in ONE ffmpeg encode (no cascade). Verifies libass availability and auto-downloads a static evermeet ffmpeg build when Homebrew's stripped binary lacks it. Triggers — "烧字幕", "硬字幕", "burn subtitles", "burn-in subs", "embed subtitle", "soft mux SRT", "把字幕烧进视频", "做最终合成".

wjs-cleaning-spamSkill

Use when the user complains about spam on his X/Twitter posts — 同城面付 / 寻固炮 / 线下上门 / 免费破处这类引流号在他推文下刷的 emoji 垃圾回复 — and wants them removed. Covers the last 7 days (X recent-search window). Triggers — "把这些spam删掉", "清理X垃圾回复", "推文下面好多引流号", "clean spam replies", "/wjs-cleaning-spam".

wjs-converting-text-to-videoSkill

Use when the user wants a 王建硕-style WeChat article (article.md) turned into a narrated short MP4 video — TTS voiceover via 火山引擎 Volcano TTS, HyperFrames CSS/GSAP animation per scene, subtle SFX, abstract watercolor background, full pipeline rendering to 1080×1920 portrait MP4 (30-90s). Triggers — "把这篇文章做成视频", "做一个解说视频", "讲解视频", "/wjs-converting-text-to-video".

wjs-converting-wp-to-hugoSkill

Use when migrating a WordPress site to a Hugo static site on GitHub Pages from a WXR export (.xml) plus the wp-content/uploads folder — preserving /archives/<id>/ URLs, localizing images, and deploying via GitHub Actions. Triggers — "把 WordPress 迁成 Hugo", "wordpress 转静态站", "migrate WordPress to Hugo", "WXR to Hugo", "publish WordPress to GitHub Pages", "/wjs-converting-wp-to-hugo".

wjs-dubbing-videoSkill

Use when the user has a video + a target-language SRT and wants the video to actually speak that language — generates a time-aligned TTS voice dub. Routes by voice ID — Volcano (豆包) TTS for Chinese, edge-tts neural for any language. Defaults to one voice (single-speaker); opt-in multi-speaker via visual diarization. Outputs `*_<lang>_dub.mp4` with the dub audio in place of the original. Final mixing (audio bed + burn-in) is handed off to `/wjs-burning-subtitles`. Triggers — "配音", "中文配音", "Chinese dub", "voice over this", "dub the video", "TTS this SRT", "different voice for each speaker".