ClaudeWave
Skill · 1.1k repo stars · updated today

audio-generation

The audio-generation skill provides comprehensive guidance for generating and understanding audio content within MassGen, supporting text-to-speech, music generation, sound effects, and advanced audio processing through ElevenLabs and OpenAI backends. Use this skill when you need to convert text prompts into audio files, create background music or sound effects, clone voices, remove background noise, or perform multilingual dubbing. ElevenLabs is the priority backend with broader capabilities, while OpenAI serves as a fallback for speech-only tasks.

Install in Claude Code
git clone --depth 1 https://github.com/massgen/MassGen /tmp/audio-generation && cp -r /tmp/audio-generation/massgen/skills/audio-generation ~/.claude/skills/audio-generation
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Audio Generation

Generate audio using `generate_media` with `mode="audio"`. Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.

## Quick Start

```python
# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")

# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")

# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio",
               audio_type="music", duration=30)

# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio",
               audio_type="sound_effect", duration=5)
```

## Audio Types

| Type | Backends | Description |
|------|----------|-------------|
| `"speech"` (default) | ElevenLabs, OpenAI | Text-to-speech with voice selection |
| `"music"` | ElevenLabs only | Music generation from text prompt |
| `"sound_effect"` | ElevenLabs only | Sound effect generation |
| `"voice_conversion"` | ElevenLabs only | Change voice of existing audio (speech-to-speech) |
| `"audio_isolation"` | ElevenLabs only | Remove background noise, isolate vocals |
| `"voice_design"` | ElevenLabs only | Create a new synthetic voice from text description |
| `"voice_clone"` | ElevenLabs only | Clone a voice from audio samples |
| `"dubbing"` | ElevenLabs only | Translate and dub audio to another language |

## Backend Comparison

| Backend | Default Model | Supports | API Key |
|---------|--------------|----------|---------|
| **ElevenLabs** (priority 1) | `eleven_multilingual_v2` | All audio types (speech, music, SFX, voice conversion, isolation, design, cloning, dubbing) | `ELEVENLABS_API_KEY` |
| **OpenAI** (priority 2) | `gpt-4o-mini-tts` | Speech only | `OPENAI_API_KEY` |

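If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS. This fallback covers speech only: every other audio type listed under Audio Types is ElevenLabs-only, so OpenAI cannot stand in for it.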
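
The priority order can be sketched as a small selection function. This is an illustration of the rules in the two tables, not MassGen's actual selection code: `pick_backend`, `ELEVENLABS_ONLY`, and the returned strings are hypothetical names, and the only inputs taken from this document are the two environment variable names and the audio types.

```python
import os
from typing import Mapping, Optional

# Audio types that the Audio Types table marks as "ElevenLabs only".
ELEVENLABS_ONLY = frozenset({
    "music", "sound_effect", "voice_conversion", "audio_isolation",
    "voice_design", "voice_clone", "dubbing",
})


def pick_backend(audio_type: str = "speech",
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: choose a backend name from the available API keys."""
    env = os.environ if env is None else env
    if env.get("ELEVENLABS_API_KEY"):
        return "elevenlabs"  # priority 1, handles every audio type
    if audio_type == "speech" and env.get("OPENAI_API_KEY"):
        return "openai"  # priority 2, speech only
    if audio_type in ELEVENLABS_ONLY:
        raise RuntimeError(f"{audio_type!r} requires ELEVENLABS_API_KEY")
    raise RuntimeError("set ELEVENLABS_API_KEY or OPENAI_API_KEY")
```

Passing `env` explicitly makes the rule easy to check without touching real credentials, for example `pick_backend("speech", {"OPENAI_API_KEY": "..."})`.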

## Key Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `prompt` | Text to speak (speech) or description (music/SFX) | `"Hello world!"` |
| `voice` | Voice name or ID | `"Rachel"`, `"nova"`, `"alloy"` |
| `audio_type` | Type of audio | `"speech"`, `"music"`, `"sound_effect"` |
| `duration` | Length in seconds (music/SFX only) | `30` |
| `instructions` | Speaking style (OpenAI `gpt-4o-mini-tts` only) | `"warm, reflective tone"` |
| `audio_format` | Output format | `"mp3"`, `"wav"`, `"opus"` |
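
A pre-flight check can catch parameter combinations that the table rules out before a call is made. The sketch below is an assumption-laden illustration: `check_audio_args` is a hypothetical helper, the set of types is copied from the Audio Types table, and the "positive duration" rule is an assumption rather than documented behavior.

```python
from typing import List, Optional

# The eight audio types listed in the Audio Types table.
AUDIO_TYPES = frozenset({
    "speech", "music", "sound_effect", "voice_conversion",
    "audio_isolation", "voice_design", "voice_clone", "dubbing",
})
# Per the Key Parameters table, `duration` applies to music/SFX only.
TIMED_TYPES = frozenset({"music", "sound_effect"})


def check_audio_args(audio_type: str = "speech",
                     duration: Optional[float] = None) -> List[str]:
    """Hypothetical helper: return a list of problems (empty means OK)."""
    problems = []
    if audio_type not in AUDIO_TYPES:
        problems.append(f"unknown audio_type {audio_type!r}")
    if duration is not None:
        if audio_type not in TIMED_TYPES:
            problems.append("duration applies to music and sound_effect only")
        elif duration <= 0:
            # Assumption: a non-positive length is never meaningful.
            problems.append("duration must be a positive number of seconds")
    return problems
```

Returning a list instead of raising lets a caller report every problem at once before deciding whether to call `generate_media`.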

## Voice Quick Reference

**ElevenLabs** (top voices):
| Voice | Character |
|-------|-----------|
| Rachel | Warm, conversational female |
| Sarah | Clear, professional female |
| Josh | Friendly male |
| Adam | Deep, authoritative male |
| Emily | Bright, energetic female |

**OpenAI** voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`, `coral`, `sage`
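
Because the two backends use disjoint voice names, a voice name hints at which backend a call is aimed at. The lookup below is a minimal sketch built only from the voices listed above; `backend_for_voice` is a hypothetical helper, and the ElevenLabs table holds only the five voices shown here, not the full catalog.

```python
from typing import Optional

# Voices copied from the quick reference above (ElevenLabs list is partial).
ELEVENLABS_VOICES = {
    "Rachel": "Warm, conversational female",
    "Sarah": "Clear, professional female",
    "Josh": "Friendly male",
    "Adam": "Deep, authoritative male",
    "Emily": "Bright, energetic female",
}
OPENAI_VOICES = frozenset({
    "alloy", "echo", "fable", "onyx", "nova", "shimmer", "coral", "sage",
})


def backend_for_voice(voice: str) -> Optional[str]:
    """Hypothetical helper: guess the backend a voice name belongs to."""
    if voice in ELEVENLABS_VOICES:
        return "elevenlabs"
    if voice in OPENAI_VOICES:
        return "openai"
    return None  # unknown here; may still exist in the full ElevenLabs catalog
```

The match is exact and case-sensitive, mirroring how the names are written above (capitalized for ElevenLabs, lowercase for OpenAI).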

## Important: prompt vs instructions

For speech, `prompt` is the **literal text to speak**. Style guidance goes in `instructions`:

```python
# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
    prompt="Welcome to the annual report presentation.",
    mode="audio",
    voice="alloy",
    instructions="warm, reflective tone with measured pacing",
    backend_type="openai"
)

# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio")  # Bad!
```

`instructions` is honored only by OpenAI `gpt-4o-mini-tts`. With ElevenLabs, control tone through voice selection instead.
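
One way to keep the spoken text and the style guidance apart is to assemble the call arguments in a single place. The helper below is a sketch under stated assumptions: `speech_kwargs` is a hypothetical name, and it attaches `instructions` only when `backend_type="openai"` is requested explicitly, since auto-selection may pick ElevenLabs.

```python
from typing import Any, Dict, Optional


def speech_kwargs(text: str,
                  style: Optional[str] = None,
                  voice: Optional[str] = None,
                  backend_type: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for a speech request."""
    kwargs: Dict[str, Any] = {"prompt": text, "mode": "audio"}
    if voice:
        kwargs["voice"] = voice
    if backend_type:
        kwargs["backend_type"] = backend_type
    # Style never goes into `prompt`; it is passed only where it has an effect.
    if style and backend_type == "openai":
        kwargs["instructions"] = style
    return kwargs
```

The result would then be passed on as `generate_media(**speech_kwargs(...))`. When no OpenAI backend is requested, the style is dropped and never merged into the spoken text.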

## Audio Understanding

Use `read_media` (not `generate_media`) to analyze existing audio:

```python
read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")
```

## Need More Control?

- **Full ElevenLabs voice catalog (28+ voices)**: See [references/voices.md](references/voices.md)
- **Music and sound effects details**: See [references/music_and_sfx.md](references/music_and_sfx.md)
- **Advanced audio capabilities (voice conversion, cloning, isolation, dubbing)**: See [references/advanced.md](references/advanced.md)