voiceover-studio
The voiceover-studio skill converts text into spoken audio using OpenSquilla's audio provider, supporting product demos, IVR prompts, podcast narration, and short-video voiceovers. Use it when the user requests text-to-speech, narration, or audio generation in any language. When accent or pacing is uncertain, preview voice quality before generating a full batch.
git clone --depth 1 https://github.com/opensquilla/opensquilla /tmp/voiceover-studio && cp -r /tmp/voiceover-studio/src/opensquilla/skills/bundled/voiceover-studio ~/.claude/skills/voiceover-studio
SKILL.md
# voiceover-studio

Turns text into spoken audio with OpenSquilla's configured direct audio provider. OpenRouter may be used by the orchestrator to draft or polish copy, but the actual audio generation must go through the `tts` tool and the active audio provider capability report.

## Use cases

- Single-line TTS, product demos, accessibility reads, IVR prompts.
- Batch narration from a script or `ai-video-script` `VOICEOVER` lines.
- Short-video voiceover where the result should be a playable audio artifact in the Web UI, not only a downloadable file path.

## Request triage

Before calling tools, extract these fields from the user request:

- task type: one-shot TTS, batch narration, IVR, podcast, accessibility read, or short-video voiceover
- source text and whether the user wants copy editing, translation, or exact preservation
- target language, target locale, desired accent, emotion, speaking pace, and output duration constraints
- voice source: configured default, searched shared voice, or user-provided voice ID
- output expectation: quick sample, final asset, or multiple takes

OpenRouter can refine script wording, but it is not an audio provider. Never answer as if OpenRouter generated the voice.

## Preview-first

When voice quality, accent, or pacing is uncertain, generate a short sample before a full batch. For Chinese, English locale changes, or mixed-language copy, keep the preview to one or two natural sentences and pass `language_code`, `speed`, and a searched `voice_id` to `tts`.

If the user asks for a long script and does not specify that they need the full asset immediately, create the first paragraph as a preview and explain that the remaining paragraphs can be generated after voice approval.

## Tool-result handling

- If `tts` returns `status=ok`, return the playable audio artifact/path first, then mention voice, language code, speed, and any inferred locale.
- If `tts` returns `not_available`, quote the `note` and distinguish provider configuration, missing voice ID, language mismatch, and provider errors.
- If `voice_search` returns weak or empty matches, do not force the default English voice for non-English text. Ask for a preferred voice or retry with a broader locale/accent.

## Required workflow

1. Call `audio_provider_capabilities` when the provider, paid features, or available voices are uncertain.
2. When the requested language, locale, or accent may not match the configured default voice, call `voice_search` first with the target language/locale/accent (for example `language=zh` + `accent=beijing mandarin`, or `language=en` + `accent=british`). Use a matching `voice_id` in `tts`.
3. Preserve the user's source text. Do not rewrite factual claims unless the user asked for copy editing.
4. Choose a voice only from configured, searched, or user-provided voice IDs. Do not imitate a public figure, celebrity, private person, or copyrighted character voice unless the user provides explicit authorization and the provider permits it.
5. For long text, split into natural paragraphs under the provider limit and generate stable filenames.
6. Call `tts` with `text`, optional `voice`, optional `language_code`, optional voice settings, optional `speed`, and optional `output_path`.
7. Return the resulting path and artifact metadata. Prefer surfaces that render the result as a playable audio artifact.

## Locale and accent constraints

First identify the target language, target locale, and desired accent. Optimize for a locale-appropriate accent, not a one-size-fits-all "AI narrator" voice.

General rules:

- Keep source text in the target language unless the user asked for translation.
- Choose a voice that natively supports the target language when possible.
- If the user specifies a locale, preserve it: en-US, en-GB, en-AU, zh-CN, zh-TW, ja-JP, ko-KR, fr-FR, de-DE, es-ES, es-MX, etc.
- If the user does not specify locale, infer the most likely neutral standard for the language and mention the choice in the final notes.
- Avoid reading non-English languages with an English accent. Avoid reading English with a random accent when the user requested a specific locale.
- Keep punctuation and phrasing natural for the target language. Bad punctuation often causes bad accent and pacing.

Chinese defaults:

- Keep Chinese text in Chinese. Do not translate it to English before TTS.
- Prefer a Mandarin-capable voice and, when the provider exposes such labels, choose 普通话 / Mainland Mandarin / Chinese-native voice settings.
- Avoid unnecessary English punctuation, pinyin, romanized names, and mixed Latin filler unless the user wrote them intentionally.
- Keep sentence boundaries short and natural. Replace overly long comma chains with Chinese punctuation so the TTS model pauses correctly.
- For names, acronyms, product names, and numerals, add Chinese-readable wording in the text itself when needed, e.g. `A I` -> `人工智能` or `A-I` only when the brand requires it.
- For Chinese output, start with `speed` 0.9-1.0. Very fast speed often makes the Chinese accent sound odd.
- Before batch generation, create one short sample and ask for a listen/retry if the user is tuning voice quality.

English defaults:

- For American English, prefer en-US / neutral American delivery.
- For British English, prefer en-GB and avoid Americanized pronunciation.
- For Australian, Indian, Singaporean, or other English locales, keep the locale explicit instead of silently falling back to en-US.

Other languages:

- Japanese: prefer ja-JP voice and Japanese punctuation/cadence.
- Korean: prefer ko-KR voice and Korean punctuation/cadence.
- French/German/Spanish/Portuguese: keep regional variants explicit when the user names them, e.g. fr-FR vs fr-CA, es-ES vs es-MX, pt-PT vs pt-BR.
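
As a rough illustration of workflow steps 5 and 6 together with the Chinese speed default, the Python sketch below splits a long script into chunks, derives stable filenames, and assembles argument dictionaries that use the `tts` parameter names listed in step 6. The 500-character limit, the `.mp3` extension, the `part-NNN-hash` naming scheme, the 0.95 and 1.0 speed values, and every helper name are assumptions made for illustration, not OpenSquilla's API. The real limit should come from `audio_provider_capabilities`.

```python
import hashlib
import re

# Hypothetical provider limit (assumption); query the real value from
# `audio_provider_capabilities` before relying on it.
MAX_CHARS = 500


def split_paragraphs(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split on blank lines, then pack sentences into chunks under max_chars.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence, so callers should still check chunk lengths.
    """
    chunks: list[str] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
            continue
        # Break an oversized paragraph after sentence-ending punctuation
        # (Latin and CJK), keeping the punctuation with its sentence.
        current = ""
        for sentence in re.findall(r".+?(?:[.!?。！？]+\s*|$)", para, flags=re.S):
            if current and len(current) + len(sentence) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += sentence
        if current.strip():
            chunks.append(current.strip())
    return chunks


def stable_filename(index: int, chunk: str, ext: str = "mp3") -> str:
    """Same index and text always give the same name (assumed naming scheme)."""
    digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()[:8]
    return f"part-{index:03d}-{digest}.{ext}"


def default_speed(language_code: str) -> float:
    # The skill text suggests 0.9-1.0 for Chinese; 0.95 is a midpoint pick
    # and 1.0 for other languages is an assumption.
    return 0.95 if language_code.lower().startswith("zh") else 1.0


def build_tts_requests(text: str, language_code: str, voice_id: str,
                       max_chars: int = MAX_CHARS) -> list[dict]:
    """Build one argument dict per chunk; the source text is not rewritten."""
    return [
        {
            "text": chunk,
            "voice": voice_id,
            "language_code": language_code,
            "speed": default_speed(language_code),
            "output_path": stable_filename(i, chunk),
        }
        for i, chunk in enumerate(split_paragraphs(text, max_chars), start=1)
    ]
```

Following the preview-first rule, a caller could send only the first dictionary to `tts` and hold the rest until the voice is approved. Hashing the chunk text into the filename means a retry of an unchanged paragraph maps to the same path, while an edited paragraph gets a new one.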

## Rights and copyright guard

- Copyright: generate only text/audio the user owns, licensed material, or clearly original content created for this task.
- Authorization: do n
Submit audio or video for multilingual dubbing, poll status, and download dubbed audio. Use when the user asks for dubbing, 多语言配音, 视频翻译配音, 译制片, or wants a source clip dubbed into another language.
Generate a structured short-video shooting script from a topic. Emits a strict, machine-parseable shot list (3 shots by default) with image prompt + video prompt + voiceover + on-screen text per shot. Trigger when the user asks for a video script, 分镜, 短视频文案, AI视频, 短剧脚本, or wants visual prompts ready for image/video generation.
Use when the user asks to schedule recurring tasks, one-off reminders, timers, or cron-style jobs through the OpenSquilla cron tool.
Multi-round research with explicit methodology, evidence tracking, and citation-tagged synthesis. Trigger on 'deep dive', 'research report', 'literature review', 'investigate X across sources', 'multi-round investigation'. Distinct from the `summarize` skill, which is a single-pass condensation; this skill maintains a state file across iterations, tracks coverage, and produces a long-form report with per-claim citations. Three execution stages: plan (scope into sub-questions), iterate (record evidence per round), compile (synthesize report). The skill itself does not fetch the web — it tells the host agent which fetches to perform via OpenSquilla's existing web tools, and records what comes back.
Read, edit, or create Microsoft Word `.docx` files. Trigger this skill whenever the user mentions a Word document, .docx file, contract, report, brief, memo, or asks to extract text, modify an existing doc, generate one from a brief, or audit tracked changes. Three execution paths: text-and-structure extraction, in-place edit-by-run (preserves styles), and create-from-scratch with python-docx. Falls back to OOXML unzip-and-patch for layout work python-docx cannot reach.
Capture the current git diff (staged, working-tree, or staged file list) as text. Direct shell call for workflows that need repository diffs without an LLM agent loop.
GitHub operations via `gh` CLI: issues, PRs, CI runs, code review, API queries. Use when: (1) checking PR status or CI, (2) creating/commenting on issues, (3) listing/filtering PRs or issues, (4) viewing run logs. NOT for: complex web UI interactions requiring manual browser flows (use browser tooling when available), bulk operations across many repos (script with gh api), or when gh auth is not configured.
Query the per-turn DecisionEntry log for skill co-occurrence patterns, meta-skill usage stats, and the router fixture corpus. Returns a JSON summary suitable for downstream LLM consumption. Used by meta-skill-creator's harvest step but also useful standalone for 'which skills did I use most this week?'