Skill · 199 repo stars · updated today

video-voiceover

This skill converts timestamped narration scripts into synthesized speech audio using MiMo TTS, dynamically adjusting speech rate to fit each segment within its designated time slot. Use it as part of a video recap workflow to generate voice-over clips that align precisely with video timeline segments, producing individual WAV files and metadata for downstream video assembly.

Install in Claude Code
git clone --depth 1 https://github.com/worldwonderer/video-recap-skills /tmp/video-voiceover && cp -r /tmp/video-voiceover/skills/video-voiceover ~/.claude/skills/video-voiceover
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

## What this does

Reads a timestamped narration script and synthesizes one audio clip per segment, adjusting the
speech rate dynamically so each clip fits its segment's time slot, then records placement
metadata. The only engine is MiMo TTS (`mimo-v2.5-tts`).
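
To make the idea of dynamic rate concrete, here is a minimal sketch of how a rate multiplier could be derived from a clip's natural duration and its slot. This is an illustration, not the skill's actual algorithm: the function name `fit_rate` and the clamp bounds `min_rate` / `max_rate` are hypothetical, and the real script may estimate durations or apply the rate differently.

```python
def fit_rate(natural_s: float, slot_s: float,
             min_rate: float = 1.0, max_rate: float = 1.5) -> float:
    """Return a speech-rate multiplier so `natural_s` seconds of speech fit `slot_s`.

    Hypothetical helper: the bounds are assumptions, chosen so speech is never
    slowed below its natural pace and never sped up past 1.5x.
    """
    if natural_s <= 0 or slot_s <= 0:
        raise ValueError("durations must be positive")
    rate = natural_s / slot_s            # > 1.0 means the speech must be sped up
    return max(min_rate, min(max_rate, rate))


print(fit_rate(6.0, 5.0))   # 6 s of speech in a 5 s slot -> 1.2
print(fit_rate(3.0, 5.0))   # already fits -> 1.0
print(fit_rate(10.0, 5.0))  # would need 2.0x, clamped -> 1.5
```

When the required rate exceeds the upper bound, the clip still overruns its slot, so a real implementation would also need a policy for that case (shortening the text, for example, which this skill leaves to whoever wrote the narration).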

## Requirements

```bash
export MIMO_API_KEY="<your-api-key>"   # MiMo TTS (or a TTS-specific MIMO_TTS_API_KEY)
```

## Input contract

`work_dir/narration.json` (or, in cut mode, an explicitly passed `work_dir/narration_mapped.json`):
a set of segments, each with `start` / `end` / `narration` and optionally `pause_after_ms` and
`overlaps_speech`. Times are in seconds on the **output timeline**, that is, where the audio will be placed.
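
A quick pre-flight check can catch malformed segments before any TTS call is made. The sketch below is an assumption-laden illustration: the field names come from the contract above, but the top-level shape of the file (a plain JSON array of segments) and the helper `validate_segments` are hypothetical, so adapt it to the actual file layout.

```python
import json

REQUIRED_FIELDS = ("start", "end", "narration")


def validate_segments(segments):
    """Return a list of problems; an empty list means the segments look usable."""
    problems = []
    for i, seg in enumerate(segments):
        missing = [key for key in REQUIRED_FIELDS if key not in seg]
        if missing:
            problems.append(f"segment {i}: missing {', '.join(missing)}")
            continue
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: end must be greater than start")
        if not str(seg["narration"]).strip():
            problems.append(f"segment {i}: empty narration")
    return problems


# Assumed layout: a JSON array of segment objects (not confirmed by the skill).
example = json.loads("""
[
  {"start": 0.0, "end": 4.5, "narration": "The story opens at dawn.", "pause_after_ms": 300},
  {"start": 5.0, "end": 9.0, "narration": "A stranger arrives.", "overlaps_speech": false}
]
""")
print(validate_segments(example))  # []
```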

## Run

```bash
python3 scripts/voiceover.py --work-dir <work_dir> --narration <narration.json> [--mimo-voice 冰糖]
```

For direct one-off use, omitting `--narration` reads `work_dir/narration.json`.
Pass `--narration work_dir/narration_mapped.json` explicitly for cut-mode output;
the video-recap orchestrator always passes the intended file.
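
If you drive the script from your own Python code rather than the orchestrator, a thin wrapper keeps the two invocation modes explicit. This is only a sketch: the flags are the ones shown in the command above, while `build_command` and `run_voiceover` are hypothetical helper names, and the script path is assumed to be relative to the current directory.

```python
import subprocess


def build_command(work_dir, narration=None, voice=None):
    """Assemble the voiceover.py command line from the documented flags."""
    cmd = ["python3", "scripts/voiceover.py", "--work-dir", str(work_dir)]
    if narration is not None:
        # Omitted -> the script reads work_dir/narration.json.
        cmd += ["--narration", str(narration)]
    if voice is not None:
        cmd += ["--mimo-voice", voice]
    return cmd


def run_voiceover(work_dir, narration=None, voice=None):
    """Run the script and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(work_dir, narration, voice), check=True)
```

For cut-mode output you would call `run_voiceover("work", narration="work/narration_mapped.json")`; leaving `narration` unset falls back to the default file.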

## Output contract

- `tts_segments/*.wav` — one synthesized clip per narration segment.
- `tts_meta.json` — `{segments: [...], engine, narration}` where each segment carries its
  `audio_path`, timing, `pause_after_ms`, and placement fields consumed by **video-assemble**.
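
Downstream tooling can use the metadata to sanity-check the clips before assembly. The sketch below lists clips that run longer than their slot. It assumes each segment entry keeps `start` and `end` in seconds and that a relative `audio_path` resolves against the work directory; neither detail is confirmed by the contract above, and `overruns` is a hypothetical helper.

```python
import json
import wave
from pathlib import Path


def wav_duration(path):
    """Duration of a WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def overruns(work_dir):
    """Return (audio_path, seconds_over) for clips longer than their slot."""
    work_dir = Path(work_dir)
    meta = json.loads((work_dir / "tts_meta.json").read_text(encoding="utf-8"))
    late = []
    for seg in meta["segments"]:
        slot = seg["end"] - seg["start"]      # assumed timing field names
        audio = Path(seg["audio_path"])
        if not audio.is_absolute():
            audio = work_dir / audio          # assumed: relative to work_dir
        duration = wav_duration(audio)
        if duration > slot:
            late.append((seg["audio_path"], round(duration - slot, 3)))
    return late
```

An empty result from `overruns("<work_dir>")` means every clip fits; any entries point at segments whose narration may be too long for its slot.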

## Notes
- Re-runs are safe: per-segment audio is reused only when it still matches; editing the narration or TTS settings regenerates the affected WAVs.
- `TTS_WORKERS`, `TTS_TIMEOUT`, `TTS_RETRIES`, `ALLOW_PARTIAL_TTS` tune throughput/robustness.
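
One common way to implement this kind of selective reuse is to fingerprint each segment's text together with the TTS settings and regenerate only when the fingerprint changes. The sketch below shows that general technique; it is not taken from the skill's source, and `segment_fingerprint`, `needs_regeneration`, and the settings keys are hypothetical.

```python
import hashlib
import json


def segment_fingerprint(narration: str, settings: dict) -> str:
    """Stable hash of everything that affects the synthesized audio."""
    payload = json.dumps(
        {"narration": narration, "settings": settings},
        sort_keys=True,        # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_regeneration(stored_fingerprint, narration, settings):
    """True when the cached clip no longer matches the current inputs."""
    return stored_fingerprint != segment_fingerprint(narration, settings)


settings = {"engine": "mimo-v2.5-tts", "voice": "冰糖"}  # illustrative keys
cached = segment_fingerprint("A stranger arrives.", settings)
print(needs_regeneration(cached, "A stranger arrives.", settings))   # False
print(needs_regeneration(cached, "A stranger departs.", settings))   # True
```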

## What this skill does NOT do
- Does NOT write or edit narration text.
- Does NOT mux, duck, or render subtitles — that is video-assemble.
- Does NOT analyze the video or choose timestamps — it voices the segments it is given.