
video-understanding

This skill converts a source video into a structured understanding index through six analysis stages: scene detection with junk filtering, frame extraction, automatic speech recognition via MiMo, silence period identification, per-scene visual-language model analysis, and timeline fusion. Run it before downstream video processing tasks such as script generation; it gives agents timestamped transcripts, scene boundaries, visual descriptions, and silence windows in standardized JSON formats.

Install in Claude Code
git clone --depth 1 https://github.com/worldwonderer/video-recap-skills /tmp/video-understanding && cp -r /tmp/video-understanding/skills/video-understanding ~/.claude/skills/video-understanding
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

## What this does

Turns a source video into an **understanding index** an agent (or a downstream stage) can read:
1. **Scene detection** — `scenes.json` (cut points, durations) + junk-scene filtering.
2. **Frame extraction** — sampled frames for the visual analysis.
3. **ASR** — `asr_result.json` (timestamped dialogue) via MiMo `mimo-v2.5-asr`.
4. **Silence detection** — `silence_periods.json` (quiet windows, `has_speech` flag).
5. **VLM analysis** — `vlm_analysis.json` (per-scene description, depth analysis, `frame_facts`).
6. **Timeline fusion + brief** — `timeline_fusion.json`, `asr_writing_chunks.json`, `agent_narration_brief.md`.

Stateless: a stage is skipped only when its existing output and provenance sidecar match
the current source video and every output-affecting setting. `--force` recomputes.
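
The skip rule above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `.provenance.json` sidecar name, the `fingerprint` field, and the choice of size plus modification time as the video's identity are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(video: Path, settings: dict) -> str:
    """Hash the video's identity together with the output-affecting settings."""
    stat = video.stat()
    payload = {
        "path": str(video.resolve()),
        "size": stat.st_size,          # assumption: size + mtime identify the video
        "mtime_ns": stat.st_mtime_ns,
        "settings": settings,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def sidecar_path(output: Path) -> Path:
    """Hypothetical sidecar location: next to the output file."""
    return output.with_name(output.name + ".provenance.json")


def record_provenance(output: Path, video: Path, settings: dict) -> None:
    """Write the sidecar after a stage has produced its output."""
    data = {"fingerprint": fingerprint(video, settings)}
    sidecar_path(output).write_text(json.dumps(data), encoding="utf-8")


def can_skip(output: Path, video: Path, settings: dict, force: bool = False) -> bool:
    """Return True only when output and sidecar both exist and the sidecar matches."""
    sidecar = sidecar_path(output)
    if force or not output.exists() or not sidecar.exists():
        return False
    try:
        recorded = json.loads(sidecar.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # an unreadable sidecar is treated as a mismatch
    if not isinstance(recorded, dict):
        return False
    return recorded.get("fingerprint") == fingerprint(video, settings)
```

Because the check depends only on files in the work directory, no run state has to persist between invocations; changing a setting such as the scene threshold changes the fingerprint and forces that stage to run again.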

## Requirements

```bash
# ffmpeg: brew install ffmpeg | apt install ffmpeg | choco install ffmpeg
export MIMO_API_KEY=***          # one key drives ASR (mimo-v2.5-asr) + VLM (mimo-v2.5)
```

ASR uses MiMo `mimo-v2.5-asr`; it is skipped when `MIMO_API_KEY` is unset or when you pass `--skip-asr`.
Pass `--mimo-video-overview` to enable the optional MiMo scene-chunk video understanding.

If `work_dir/background_research.json` exists (story research the agent did first, see
`references/research-guide.md`), its synopsis and named characters are folded into the VLM
context, so scene descriptions can name people and read scenes with plot knowledge. Combine with
`--context` for a quick inline hint.

## Run

```bash
python3 scripts/understand.py <video> --work-dir <work_dir> \
  [--context "show title / character names"] [--scene-threshold 0.1] [--skip-asr] [--mimo-video-overview] [--force]
```
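
When an agent drives the script from Python rather than a shell, the same invocation can be assembled as an argument list. The sketch below is a hypothetical helper (`build_command` is not part of the skill); it uses only the flags shown in the usage line above and assumes the script is run from the skill's root directory.

```python
def build_command(video, work_dir, context=None, scene_threshold=None,
                  skip_asr=False, mimo_video_overview=False, force=False):
    """Assemble the understand.py argument list from the documented flags."""
    cmd = ["python3", "scripts/understand.py", str(video), "--work-dir", str(work_dir)]
    if context:
        cmd += ["--context", context]
    if scene_threshold is not None:
        cmd += ["--scene-threshold", str(scene_threshold)]
    if skip_asr:
        cmd.append("--skip-asr")
    if mimo_video_overview:
        cmd.append("--mimo-video-overview")
    if force:
        cmd.append("--force")
    return cmd
```

The list can be passed to `subprocess.run(cmd, check=True)`; passing a list instead of a shell string avoids quoting problems when the context hint contains spaces or non-ASCII text.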

## Output contract

| File | Content |
|------|---------|
| `scenes.json` | scene cut list (start/end/duration) |
| `asr_result.json` | `[{start, end, text}]` timestamped transcript |
| `vlm_analysis.json` | per-scene description / depth / `frame_facts` |
| `silence_periods.json` | `[{start, end, duration, has_speech}]` quiet windows |
| `timeline_fusion.json` | VLM + ASR + silence overlap, unified timeline |
| `asr_writing_chunks.json` | ASR split at sentence boundaries, scene-aligned |
| `agent_narration_brief.md` | the human/agent-facing writing brief (read this first) |

Downstream, **video-script** reads the brief + index to write `narration.json`.
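
As an illustration of how a downstream consumer might read the index, the sketch below works with the `asr_result.json` and `silence_periods.json` shapes listed in the table. It assumes timestamps are in seconds and that `has_speech` is false for windows without dialogue; the helper names and the sample records are made up for the example, and `references/data-schema.md` remains the authority on the real shapes.

```python
import json
from pathlib import Path


def load_json(work_dir, name):
    """Read one index file from the work directory."""
    return json.loads((Path(work_dir) / name).read_text(encoding="utf-8"))


def quiet_windows(silence_periods, min_duration=2.0):
    """Keep silence windows without speech that last at least min_duration."""
    return [
        p for p in silence_periods
        if not p["has_speech"] and p["duration"] >= min_duration
    ]


def transcript_between(asr_segments, start, end):
    """Join the text of every ASR segment that overlaps [start, end)."""
    return " ".join(
        seg["text"] for seg in asr_segments
        if seg["start"] < end and seg["end"] > start
    )


# Made-up sample records in the documented shapes (not real output).
sample_asr = [
    {"start": 0.0, "end": 2.5, "text": "Where were you last night?"},
    {"start": 9.0, "end": 11.0, "text": "I can explain."},
]
sample_silence = [
    {"start": 2.5, "end": 9.0, "duration": 6.5, "has_speech": False},
    {"start": 11.0, "end": 11.8, "duration": 0.8, "has_speech": False},
]

print(quiet_windows(sample_silence))
print(transcript_between(sample_asr, 0.0, 5.0))
```

With real data, `load_json(work_dir, "silence_periods.json")` would replace the sample lists. Windows returned by `quiet_windows` are candidate slots where narration would not talk over dialogue.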

## References
- Background research before writing: `references/research-guide.md` (writes `background_research.json`).
- Output JSON shapes: `references/data-schema.md`.

## What this skill does NOT do
- Does NOT write or score narration (voice-over commentary); that is video-script's job.
- Does NOT cut, edit, voice, or render video.
- Does NOT invent plot that the signal doesn't support: when ASR/VLM output is thin, it emits a substrate warning rather than fabricating.
- Does NOT publish or schedule anything; it writes artifacts to `work_dir` and stops.