video-voiceover

This skill converts timestamped narration scripts into synthesized speech audio using MiMo TTS, dynamically adjusting speech rate to fit each segment within its designated time slot. Use it as part of a video recap workflow to generate voice-over clips that align precisely with video timeline segments, producing individual WAV files and metadata for downstream video assembly.

Repository: video-recap-skills

Install in Claude Code

git clone --depth 1 https://github.com/worldwonderer/video-recap-skills /tmp/video-voiceover && cp -r /tmp/video-voiceover/skills/video-voiceover ~/.claude/skills/video-voiceover

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

## What this does

Reads a timestamped narration script and synthesizes one audio clip per segment, fitting speech
to each segment's time slot (dynamic rate), then records placement metadata. The only engine is
MiMo TTS (`mimo-v2.5-tts`).
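
The rate-fitting idea can be sketched as follows. This is a minimal illustration, not the script's actual logic: the function name `fit_rate`, the clamp limit, and the choice to never slow speech below normal speed are all assumptions.

```python
def fit_rate(natural_seconds: float, slot_seconds: float, max_rate: float = 1.35) -> float:
    """Return a speed multiplier so a clip of natural_seconds fits in slot_seconds.

    Assumptions (not taken from the skill): speech is never slowed below 1.0,
    and the speed-up is capped at max_rate to keep the voice intelligible.
    """
    if natural_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    needed = natural_seconds / slot_seconds  # > 1.0 means the clip overruns its slot
    return min(max(needed, 1.0), max_rate)


def fitted_seconds(natural_seconds: float, rate: float) -> float:
    """Duration of the clip after applying the speed multiplier."""
    return natural_seconds / rate
```

With a cap in place, a very long sentence can still overrun its slot after speed-up; comparing `fitted_seconds(...)` against the slot length is one way to flag such segments.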

## Requirements

```bash
export MIMO_API_KEY=*** # MiMo TTS (or a TTS-specific MIMO_TTS_API_KEY)
```

## Input contract

`work_dir/narration.json` — segments with `start` / `end` / `narration` (+ optional `pause_after_ms`,
`overlaps_speech`). Times are in seconds on the **output timeline**, i.e. where the audio will be placed.
In the orchestrated cut-mode flow, the agent writes `narration.json` directly against the output
timeline, and the orchestrator passes it here. In the legacy direct-cut path,
`narration_mapped.json` may be passed explicitly instead.
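
To make the contract concrete, here is a hedged sketch that checks segments and writes them to `narration.json`. The field names come from the contract above; the top-level layout (a `segments` array inside an object), the sample text, and the helper names `validate_segments` and `write_narration` are assumptions for illustration.

```python
import json
from pathlib import Path


def validate_segments(segments):
    """Return a list of problems; an empty list means the segments look usable."""
    problems = []
    for i, seg in enumerate(segments):
        missing = [k for k in ("start", "end", "narration") if k not in seg]
        if missing:
            problems.append(f"segment {i}: missing {', '.join(missing)}")
            continue
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: end must be after start")
        if not str(seg["narration"]).strip():
            problems.append(f"segment {i}: empty narration")
    return problems


def write_narration(work_dir, segments):
    """Write segments to work_dir/narration.json (top-level shape is assumed)."""
    path = Path(work_dir) / "narration.json"
    path.write_text(
        json.dumps({"segments": segments}, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path


# Illustrative segments; times are seconds on the output timeline.
segments = [
    {"start": 0.0, "end": 4.5, "narration": "The story opens at dawn.", "pause_after_ms": 300},
    {"start": 5.0, "end": 9.0, "narration": "Two strangers meet.", "overlaps_speech": False},
]
```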

## Run

```bash
python3 scripts/voiceover.py --work-dir <work_dir> --narration <narration.json> [--mimo-voice 冰糖]
```

For direct one-off use, omitting `--narration` reads `work_dir/narration.json`.
Pass `--narration work_dir/narration_mapped.json` explicitly only for the legacy direct-cut path;
the video-recap orchestrator always passes `narration.json`.
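
If the script is driven from Python rather than a shell, the invocation above could be wrapped roughly like this. Only the flags shown in the command (`--work-dir`, `--narration`, `--mimo-voice`) are used; the helper names are hypothetical, and the relative script path assumes the current directory is the skill root.

```python
import subprocess


def build_voiceover_command(work_dir, narration=None, mimo_voice=None):
    """Assemble the argument list; optional flags are added only when given."""
    cmd = ["python3", "scripts/voiceover.py", "--work-dir", str(work_dir)]
    if narration is not None:
        cmd += ["--narration", str(narration)]
    if mimo_voice is not None:
        cmd += ["--mimo-voice", mimo_voice]
    return cmd


def run_voiceover(work_dir, narration=None, mimo_voice=None):
    """Run the script; check=True raises CalledProcessError on a non-zero exit."""
    cmd = build_voiceover_command(work_dir, narration, mimo_voice)
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with non-ASCII voice names such as `冰糖`.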

## Output contract

- `tts_segments/*.wav` — one synthesized clip per narration segment.
- `tts_meta.json` — `{segments: [...], engine, narration}` where each segment carries its
`audio_path`, timing, `pause_after_ms`, and placement fields consumed by **video-assemble**.
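
A downstream consumer might sanity-check the output along these lines. This sketch relies only on the `segments`, `engine`, and `audio_path` fields named above; resolving relative paths against `work_dir`, locating `tts_meta.json` directly under `work_dir`, and the helper names are assumptions.

```python
import json
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of a WAV file in seconds, computed from frame count and sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def summarize_meta(work_dir):
    """Return (engine, [(file name, duration or None if the file is missing), ...])."""
    work_dir = Path(work_dir)
    meta = json.loads((work_dir / "tts_meta.json").read_text(encoding="utf-8"))
    rows = []
    for seg in meta["segments"]:
        audio = Path(seg["audio_path"])
        if not audio.is_absolute():
            audio = work_dir / audio  # assumption: relative paths are relative to work_dir
        rows.append((audio.name, wav_seconds(audio) if audio.exists() else None))
    return meta.get("engine"), rows
```

A `None` duration marks a segment whose clip is missing, which is worth checking before assembly if partial output was allowed.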

## Notes
- Re-runs are safe: per-segment audio is reused only when it still matches; editing the narration or TTS settings regenerates the affected WAVs.
- `TTS_WORKERS`, `TTS_TIMEOUT`, `TTS_RETRIES`, `ALLOW_PARTIAL_TTS` tune throughput/robustness.
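
One common way to implement "reuse only matching audio" is a content fingerprint per segment. How the script actually decides a match is not documented here, so the sketch below is an assumption: the hashed fields, the `tts_settings` dictionary, and the function names are all hypothetical.

```python
import hashlib
import json


def segment_fingerprint(segment, tts_settings):
    """Hash everything that would change the synthesized audio for one segment."""
    payload = {
        "narration": segment["narration"],
        "start": segment["start"],
        "end": segment["end"],
        "settings": tts_settings,
    }
    # sort_keys makes the hash independent of dictionary ordering
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def can_reuse(segment, tts_settings, stored_fingerprint):
    """True when the stored clip was generated from identical text, timing, and settings."""
    return stored_fingerprint == segment_fingerprint(segment, tts_settings)
```

Including the timing in the hash matters because a changed slot length changes the fitted speech rate even when the text is untouched.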

## What this skill does NOT do
- Does NOT write or edit narration text.
- Does NOT mux, duck, or render subtitles — that is video-assemble.
- Does NOT analyze the video or choose timestamps — it voices the segments it is given.