
video-understanding

This skill converts a source video into a structured understanding index through six analysis stages: scene detection with junk filtering, frame extraction, automatic speech recognition via MiMo, silence period identification, visual-language model analysis per scene, and timeline fusion. Use it as a prerequisite for downstream video-processing tasks such as script generation: it gives agents timestamped transcripts, scene boundaries, visual descriptions, and silence windows in standardized JSON formats.

Repository: video-recap-skills

Install in Claude Code


git clone --depth 1 https://github.com/worldwonderer/video-recap-skills /tmp/video-understanding && cp -r /tmp/video-understanding/skills/video-understanding ~/.claude/skills/video-understanding

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

## What this does

Turns a source video into an **understanding index** an agent (or a downstream stage) can read:
1. **Scene detection** — `scenes.json` (cut points, durations) + junk-scene filtering.
2. **Frame extraction** — sampled frames for the visual analysis.
3. **ASR** — `asr_result.json` (timestamped dialogue) via MiMo `mimo-v2.5-asr`.
4. **Silence detection** — `silence_periods.json` (quiet windows, `has_speech` flag).
5. **VLM analysis** — `vlm_analysis.json` (per-scene description, depth analysis, `frame_facts`).
6. **Timeline fusion + brief** — `timeline_fusion.json`, `asr_writing_chunks.json`, `agent_narration_brief.md`.

Stateless: reusable stages are skipped only when their output and provenance sidecar match
the current source video plus output-affecting settings. `--force` recomputes.
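
The skip rule above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the sidecar file name (`<output>.provenance.json`), the `fingerprint` field, and the choice of file size plus modification time as the video identity are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(video: Path, settings: dict) -> str:
    """Hash the video's size and mtime together with output-affecting settings."""
    stat = video.stat()
    payload = json.dumps(
        {"size": stat.st_size, "mtime_ns": stat.st_mtime_ns, "settings": settings},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def sidecar_path(output: Path) -> Path:
    """Hypothetical naming scheme: scenes.json -> scenes.json.provenance.json."""
    return output.with_name(output.name + ".provenance.json")


def can_skip(output: Path, video: Path, settings: dict, force: bool = False) -> bool:
    """Return True only if the output exists and its sidecar matches the fingerprint."""
    sidecar = sidecar_path(output)
    if force or not output.exists() or not sidecar.exists():
        return False
    try:
        recorded = json.loads(sidecar.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # an unreadable sidecar never justifies a skip
    if not isinstance(recorded, dict):
        return False
    return recorded.get("fingerprint") == fingerprint(video, settings)


def record(output: Path, video: Path, settings: dict) -> None:
    """Write the sidecar after a stage finishes successfully."""
    data = {"fingerprint": fingerprint(video, settings)}
    sidecar_path(output).write_text(json.dumps(data), encoding="utf-8")
```

Under this scheme, changing an output-affecting setting such as the scene threshold changes the fingerprint, so the affected stage reruns without needing `--force`.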

## Requirements

```bash
# ffmpeg: brew install ffmpeg | apt install ffmpeg | choco install ffmpeg
export MIMO_API_KEY=***          # one key drives ASR (mimo-v2.5-asr) + VLM (mimo-v2.5)
```

ASR uses MiMo `mimo-v2.5-asr`; pass `--skip-asr` to skip dialogue transcription. The full understanding run still requires `MIMO_API_KEY` for VLM scene analysis.
Pass `--mimo-video-overview` to enable the optional MiMo scene-chunk video understanding.

If `work_dir/background_research.json` exists (story research the agent did first, see
`references/research-guide.md`), its synopsis and named characters are folded into the VLM
context, so scene descriptions can name people and read scenes with plot knowledge. Combine with
`--context` for a quick inline hint.
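
One way the research file and the inline hint might be merged into a single context string is sketched below. The key names `synopsis` and `characters`, and the function `build_vlm_context` itself, are assumptions for illustration; the real shape of `background_research.json` is defined by `references/research-guide.md`.

```python
import json
from pathlib import Path


def build_vlm_context(work_dir: Path, inline_hint: str = "") -> str:
    """Combine optional background research with an optional inline hint."""
    parts = []
    research_path = work_dir / "background_research.json"
    if research_path.exists():
        research = json.loads(research_path.read_text(encoding="utf-8"))
        synopsis = research.get("synopsis", "")      # assumed key
        characters = research.get("characters", [])  # assumed key: list of names
        if synopsis:
            parts.append(f"Synopsis: {synopsis}")
        if characters:
            parts.append("Characters: " + ", ".join(characters))
    if inline_hint:
        # Stands in for the value passed via --context.
        parts.append(f"Hint: {inline_hint}")
    return "\n".join(parts)
```

When neither source is available the function returns an empty string, so the caller can fall back to an uncontextualized prompt.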

## Run

```bash
python3 scripts/understand.py <video> --work-dir <work_dir> \
  [--context "show title/character names"] [--scene-threshold 0.1] [--skip-asr] [--mimo-video-overview] [--force]
```
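
For agents that launch the script from Python, a thin wrapper along these lines may help. It is a sketch that uses only the flags listed above; the function names `build_command` and `run_understanding` are hypothetical, and it assumes the working directory is the skill root so that `scripts/understand.py` resolves.

```python
import os
import subprocess
from typing import List, Optional


def build_command(
    video: str,
    work_dir: str,
    context: Optional[str] = None,
    scene_threshold: Optional[float] = None,
    skip_asr: bool = False,
    mimo_video_overview: bool = False,
    force: bool = False,
) -> List[str]:
    """Assemble the argument list for scripts/understand.py."""
    cmd = ["python3", "scripts/understand.py", video, "--work-dir", work_dir]
    if context:
        cmd += ["--context", context]
    if scene_threshold is not None:
        cmd += ["--scene-threshold", str(scene_threshold)]
    if skip_asr:
        cmd.append("--skip-asr")
    if mimo_video_overview:
        cmd.append("--mimo-video-overview")
    if force:
        cmd.append("--force")
    return cmd


def run_understanding(video: str, work_dir: str, **options) -> None:
    """Run the pipeline, failing early if the API key is missing."""
    if not os.environ.get("MIMO_API_KEY"):
        # The full run needs the key for VLM analysis even with --skip-asr.
        raise RuntimeError("MIMO_API_KEY is not set")
    subprocess.run(build_command(video, work_dir, **options), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when `--context` contains spaces or non-ASCII text.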

## Output contract

| File | Content |
|------|---------|
| `scenes.json` | scene cut list (start/end/duration) |
| `asr_result.json` | `[{start, end, text}]` timestamped transcript |
| `vlm_analysis.json` | per-scene description / depth / `frame_facts` |
| `silence_periods.json` | `[{start, end, duration, has_speech}]` quiet windows |
| `timeline_fusion.json` | VLM + ASR + silence overlap, unified timeline |
| `asr_writing_chunks.json` | ASR split at sentence boundaries, scene-aligned |
| `agent_narration_brief.md` | the human/agent-facing writing brief (read this first) |
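
As an illustration of consuming the index, the sketch below reads the two files whose shapes are given in the table and picks out quiet windows that could hold narration. It assumes `has_speech` is false when no dialogue falls inside the window and that all times are in seconds; the helper names and the 2-second minimum are hypothetical, and `references/data-schema.md` remains the authority on the exact schemas.

```python
import json
from pathlib import Path


def load_json(path):
    """Read one of the index files from work_dir."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def narration_gaps(silence_periods, min_duration=2.0):
    """Quiet windows without speech that last at least min_duration seconds."""
    return [
        p for p in silence_periods
        if not p["has_speech"] and p["duration"] >= min_duration
    ]


def dialogue_between(asr_segments, start, end):
    """ASR segments that overlap the half-open interval [start, end)."""
    return [s for s in asr_segments if s["start"] < end and s["end"] > start]
```

A typical call would be `narration_gaps(load_json(work_dir / "silence_periods.json"))`, followed by `dialogue_between(...)` to see what was said just before each gap.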

Downstream, **video-script** reads the brief + index to write `narration.json`.

## References
- Background research before writing: `references/research-guide.md` (writes `background_research.json`).
- Output JSON shapes: `references/data-schema.md`.

## What this skill does NOT do
- Does NOT write or score narration (the commentary script); that is the job of video-script.
- Does NOT cut, edit, voice, or render video.
- Does NOT invent plot the signal doesn't support — it emits a substrate warning when ASR/VLM are thin, rather than fabricating.
- Does NOT publish or schedule anything; it writes artifacts to work_dir and stops.
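
The substrate warning mentioned in the list above could be derived from simple coverage checks, as in this sketch. The thresholds, the message wording, and the function `substrate_warning` are all assumptions for illustration; the source does not specify how the skill decides that ASR or VLM output is thin.

```python
from typing import List, Optional


def substrate_warning(
    asr_segments: List[dict],
    scene_descriptions: List[str],
    video_duration: float,
    min_speech_ratio: float = 0.05,     # hypothetical threshold
    min_described_ratio: float = 0.5,   # hypothetical threshold
) -> Optional[str]:
    """Return a warning string when the evidence is thin, else None."""
    problems = []

    # Fraction of the video covered by transcribed speech.
    spoken = sum(max(0.0, s["end"] - s["start"]) for s in asr_segments)
    if video_duration <= 0 or spoken / video_duration < min_speech_ratio:
        problems.append("ASR covers little of the video")

    # Fraction of scenes that received a non-empty visual description.
    described = [d for d in scene_descriptions if d and d.strip()]
    if not scene_descriptions or len(described) / len(scene_descriptions) < min_described_ratio:
        problems.append("few scenes have a visual description")

    if not problems:
        return None
    return "Substrate warning: " + "; ".join(problems) + "."
```

Returning a warning instead of padding the brief keeps the downstream writer aware that the plot evidence is weak.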