
voiceover-studio

The voiceover-studio skill converts text into spoken audio using OpenSquilla's audio provider, supporting product demos, IVR prompts, podcast narration, and short-video voiceovers. Use it when the user requests text-to-speech, narration, or audio generation in any language. When accent or pacing is uncertain, preview a short sample before generating a full batch.

Repository: opensquilla

Install in Claude Code

git clone --depth 1 https://github.com/opensquilla/opensquilla /tmp/voiceover-studio && cp -r /tmp/voiceover-studio/src/opensquilla/skills/bundled/voiceover-studio ~/.claude/skills/voiceover-studio

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# voiceover-studio

Turns text into spoken audio with OpenSquilla's configured direct audio
provider. The orchestrator may use OpenRouter to draft or polish copy, but the
actual audio generation must go through the `tts` tool, within the limits that
the active audio provider's capability report describes.

## Use cases

- Single-line TTS, product demos, accessibility reads, IVR prompts.
- Batch narration from a script or `ai-video-script` `VOICEOVER` lines.
- Short-video voiceover where the result should be a playable audio artifact
  in the Web UI, not only a downloadable file path.

## Request triage

Before calling tools, extract these fields from the user request:

- task type: one-shot TTS, batch narration, IVR, podcast, accessibility read,
  or short-video voiceover
- source text and whether the user wants copy editing, translation, or exact
  preservation
- target language, target locale, desired accent, emotion, speaking pace, and
  output duration constraints
- voice source: configured default, searched shared voice, or user-provided
  voice ID
- output expectation: quick sample, final asset, or multiple takes

OpenRouter can refine script wording, but it is not an audio provider. Never
answer as if OpenRouter generated the voice.
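
The triage fields above can be captured in a small record before any tool call.
The sketch below is illustrative only: `VoiceoverRequest`, `missing_fields`, and
every field name and allowed value are hypothetical and are not part of
OpenSquilla.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceoverRequest:
    """Hypothetical container for the triage fields (not an OpenSquilla type)."""
    task_type: str                      # e.g. "one-shot", "batch", "ivr", "podcast"
    source_text: str
    text_handling: str = "preserve"     # "preserve", "copy-edit", or "translate"
    language: Optional[str] = None
    locale: Optional[str] = None
    accent: Optional[str] = None
    emotion: Optional[str] = None
    speed: Optional[float] = None
    max_duration_s: Optional[float] = None
    voice_source: str = "default"       # "default", "searched", or "user-provided"
    voice_id: Optional[str] = None
    output_expectation: str = "final"   # "sample", "final", or "takes"


def missing_fields(req):
    """List triage fields that still need to be asked for or inferred."""
    missing = []
    if not req.source_text.strip():
        missing.append("source_text")
    if req.language is None:
        missing.append("language")
    if req.voice_source == "user-provided" and not req.voice_id:
        missing.append("voice_id")
    return missing
```

A request that names a user-provided voice but gives no ID and no language
would report both gaps, which tells the agent what to ask before calling `tts`.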

## Preview-first

When voice quality, accent, or pacing is uncertain, generate a short sample
before a full batch. For Chinese, English locale changes, or mixed-language
copy, keep the preview to one or two natural sentences and pass
`language_code`, `speed`, and a searched `voice_id` to `tts`.

If the user asks for a long script and does not specify that they need the full
asset immediately, create the first paragraph as a preview and explain that the
remaining paragraphs can be generated after voice approval.
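
One way to pick the preview text is to take the first paragraph and, when a
shorter sample is wanted, cut it after one or two sentences. This is a minimal
sketch under assumptions: `preview_text` is a hypothetical helper, paragraphs
are assumed to be separated by blank lines, and the sentence detection is a
simple punctuation heuristic rather than real segmentation.

```python
import re

# CJK sentence enders always count; Latin ones only before whitespace or the
# end of the text, so decimals such as "3.5" are not treated as boundaries.
_SENTENCE_END = re.compile(r"[。！？]|[.!?](?=\s|$)")


def preview_text(script, max_sentences=2):
    """Return the first paragraph, trimmed to at most max_sentences sentences."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", script) if p.strip()]
    if not paragraphs:
        return ""
    first = paragraphs[0]
    ends = [m.end() for m in _SENTENCE_END.finditer(first)]
    if len(ends) <= max_sentences:
        return first
    # Slice the original text so spacing and punctuation stay untouched.
    return first[:ends[max_sentences - 1]]
```

Slicing the source text instead of re-joining sentences keeps the preview
byte-for-byte identical to the user's wording, which matters when exact
preservation was requested.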

## Tool-result handling

- If `tts` returns `status=ok`, return the playable audio artifact/path first,
  then mention voice, language code, speed, and any inferred locale.
- If `tts` returns `not_available`, quote the `note` and distinguish provider
  configuration, missing voice ID, language mismatch, and provider errors.
- If `voice_search` returns weak or empty matches, do not force the default
  English voice for non-English text. Ask for a preferred voice or retry with a
  broader locale/accent.
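
The first two rules can be sketched as a small formatter. The shape of the
`tts` result is an assumption here: the code treats it as a dict with `status`
and `note` as named above, plus hypothetical `path`, `voice`, `language_code`,
and `speed` keys. Check the real tool output before relying on any of them.

```python
def summarize_tts_result(result):
    """Format a tts result dict for the user (key names are assumed)."""
    status = result.get("status")
    if status == "ok":
        # Lead with the playable artifact, then list the settings used.
        lines = [f"Audio: {result.get('path', '(no path returned)')}"]
        details = [
            f"{key}={result[key]}"
            for key in ("voice", "language_code", "speed")
            if key in result
        ]
        if details:
            lines.append("Settings: " + ", ".join(details))
        return "\n".join(lines)
    if status == "not_available":
        # Quote the provider note verbatim so the cause can be diagnosed.
        return f'tts is not available. Provider note: "{result.get("note", "")}"'
    return f"Unexpected tts status: {status!r}"
```

The function only quotes the note; classifying it as a configuration problem,
a missing voice ID, a language mismatch, or a provider error still requires
reading what the note says.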

## Required workflow

1. Call `audio_provider_capabilities` when the provider, paid features, or
   available voices are uncertain.
2. When the requested language, locale, or accent may not match the configured
   default voice, call `voice_search` first with the target language/locale/accent
   (for example `language=zh` + `accent=beijing mandarin`, or `language=en` +
   `accent=british`). Use a matching `voice_id` in `tts`.
3. Preserve the user's source text. Do not rewrite factual claims unless the
   user asked for copy editing.
4. Choose a voice only from configured, searched, or user-provided voice IDs.
   Do not imitate a public figure, celebrity, private person, or copyrighted
   character voice unless the user provides explicit authorization and the
   provider permits it.
5. For long text, split into natural paragraphs under the provider limit and
   generate stable filenames.
6. Call `tts` with `text`, optional `voice`, optional `language_code`, optional
   voice settings, optional `speed`, and optional `output_path`.
7. Return the resulting path and artifact metadata. Prefer surfaces that render
   the result as a playable audio artifact.
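
Step 5 can be sketched as follows. This is an illustration under assumptions,
not the skill's own implementation: `chunk_script` is a hypothetical helper,
`max_chars` stands in for whatever limit the provider reports, paragraphs are
assumed to be separated by blank lines, and the `.mp3` extension is a guess.

```python
import hashlib
import re


def chunk_script(text, max_chars, prefix="voiceover"):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Returns (filename, chunk_text) pairs. Each filename combines the chunk
    index with a short content hash, so rerunning on the same text yields the
    same names.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        if len(para) > max_chars:
            raise ValueError("paragraph exceeds max_chars; split it by sentence first")
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)

    named = []
    for index, chunk in enumerate(chunks, start=1):
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:8]
        named.append((f"{prefix}-{index:03d}-{digest}.mp3", chunk))
    return named
```

Because paragraphs are never cut in the middle, each generated file starts and
ends on a natural pause, and the hash suffix makes it easy to see which chunks
changed after the script is edited.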

## Locale and accent constraints

First identify the target language, target locale, and desired accent. Optimize
for a locale-appropriate accent, not a one-size-fits-all "AI narrator" voice.

General rules:

- Keep source text in the target language unless the user asked for translation.
- Choose a voice that natively supports the target language when possible.
- If the user specifies a locale, preserve it: en-US, en-GB, en-AU, zh-CN,
  zh-TW, ja-JP, ko-KR, fr-FR, de-DE, es-ES, es-MX, etc.
- If the user does not specify locale, infer the most likely neutral standard
  for the language and mention the choice in the final notes.
- Avoid reading non-English languages with an English accent. Avoid reading
  English with a random accent when the user requested a specific locale.
- Keep punctuation and phrasing natural for the target language. Bad punctuation
  often causes bad accent and pacing.
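
The locale rules above amount to "keep what the user said, otherwise infer and
disclose". A minimal sketch, assuming a hypothetical `resolve_locale` helper
and a default table that is this example's own choice rather than anything
OpenSquilla defines:

```python
# Assumed "neutral standard" per language; adjust to the audience.
DEFAULT_LOCALES = {
    "en": "en-US",
    "zh": "zh-CN",
    "ja": "ja-JP",
    "ko": "ko-KR",
    "fr": "fr-FR",
    "de": "de-DE",
    "es": "es-ES",
}


def resolve_locale(language, requested_locale=None):
    """Return (locale, inferred): inferred is True when a default was chosen."""
    if requested_locale:
        # A locale named by the user is preserved as-is.
        return requested_locale, False
    locale = DEFAULT_LOCALES.get(language.lower())
    if locale is None:
        raise ValueError(f"no default locale for {language!r}; ask the user")
    return locale, True
```

The `inferred` flag is what drives the final note: when it is `True`, the
reply should mention which locale was picked.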

Chinese defaults:

- Keep Chinese text in Chinese. Do not translate it to English before TTS.
- Prefer a Mandarin-capable voice and, when the provider exposes such labels,
  choose 普通话 / Mainland Mandarin / Chinese-native voice settings.
- Avoid unnecessary English punctuation, pinyin, romanized names, and mixed
  Latin filler unless the user wrote them intentionally.
- Keep sentence boundaries short and natural. Replace overly long comma chains
  with Chinese punctuation so the TTS model pauses correctly.
- For names, acronyms, product names, and numerals, add Chinese-readable
  wording in the text itself when needed, e.g. rewrite `A I` as `人工智能`,
  and keep `A-I` only when the brand requires it.
- For Chinese output, start with `speed` 0.9-1.0. Very fast speech often makes
  the Chinese accent sound odd.
- Before batch generation, create one short sample and ask for a listen/retry
  if the user is tuning voice quality.
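
As one illustration of the punctuation advice, half-width punctuation that
directly follows a Chinese character can be swapped for its full-width form.
The helper name and the character range are assumptions of this sketch; it
covers only the basic CJK Unified Ideographs block and deliberately leaves
Latin text the user wrote untouched.

```python
import re

_CJK = r"\u4e00-\u9fff"  # basic CJK Unified Ideographs only (assumption)
_HALF_TO_FULL = {",": "，", ".": "。", "?": "？", "!": "！", ":": "：", ";": "；"}

# Match half-width punctuation right after a Chinese character, plus any
# spaces that trail it.
_PATTERN = re.compile(rf"(?<=[{_CJK}])([,.?!:;])\s*")


def normalize_chinese_punctuation(text):
    """Replace half-width punctuation after Chinese characters with full-width."""
    return _PATTERN.sub(lambda m: _HALF_TO_FULL[m.group(1)], text)
```

This changes only punctuation marks, never words, so it stays within the rule
to preserve the user's source text. Splitting long comma chains into shorter
sentences still needs a human or model judgment call.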

English defaults:

- For American English, prefer en-US / neutral American delivery.
- For British English, prefer en-GB and avoid Americanized pronunciation.
- For Australian, Indian, Singaporean, or other English locales, keep the locale
  explicit instead of silently falling back to en-US.

Other languages:

- Japanese: prefer ja-JP voice and Japanese punctuation/cadence.
- Korean: prefer ko-KR voice and Korean punctuation/cadence.
- French/German/Spanish/Portuguese: keep regional variants explicit when the
  user names them, e.g. fr-FR vs fr-CA, es-ES vs es-MX, pt-PT vs pt-BR.

## Rights and copyright guard

- Copyright: generate only text/audio the user owns, licensed material, or
  clearly original content created for this task.
- 授权: do n
