fish-audio

The fish-audio skill generates expressive audio clips using Fish Audio's S2 Pro text-to-speech API with bracket notation for emotional tags. Use this to create narration, voice memos, announcements, or any spoken content that requires dynamic emotional expression, with support for combining multiple clips through ffmpeg.

Install in Claude Code
git clone --depth 1 https://github.com/vellum-ai/vellum-assistant /tmp/fish-audio && cp -r /tmp/fish-audio/skills/fish-audio ~/.claude/skills/fish-audio
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Fish Audio TTS

Generate expressive audio clips using the Fish Audio S2 TTS API with `[bracket]` emotion tags.

## Overview

This skill lets you create audio clips on demand: narration, announcements, podcast intros, dramatic readings, voice memos, or any other spoken content. It uses Fish Audio S2 Pro with the full bracket syntax for emotional expressiveness.

## Configuration

- **API Endpoint:** `https://api.fish.audio/v1/tts`
- **Model:** `s2-pro`
- **Voice Reference ID:** Configured via `assistant config get services.tts.providers.fish-audio.referenceId`
- **API Key:** Stored as credential `fish-audio/api_key`
- **Default Format:** `mp3` at 192 kbps
- **Default Output Directory:** `scratch/`

## API Key Setup

The Fish Audio API key must be stored securely via the credential store. Get an API key from the Fish Audio dashboard at https://fish.audio.

Check if the key is already configured:

```bash
assistant credentials inspect --service fish-audio --field api_key --json
```

If not set, collect it securely (never ask the user to paste it in chat):

```
credential_store action="prompt" service="fish-audio" field="api_key" label="Fish Audio API Key" description="Enter your Fish Audio API key" placeholder="sk-..."
```

## Generating a Single Clip

Use `bash` with `curl` to call the Fish Audio API:

```bash
curl -s -X POST "https://api.fish.audio/v1/tts" \
  -H "Authorization: Bearer $(assistant credentials reveal --service fish-audio --field api_key)" \
  -H "Content-Type: application/json" \
  -H "model: s2-pro" \
  -d '{
    "text": "YOUR TEXT WITH [bracket] TAGS HERE",
    "reference_id": "'"$(assistant config get services.tts.providers.fish-audio.referenceId)"'",
    "format": "mp3",
    "mp3_bitrate": 192,
    "temperature": 0.8
  }' --output scratch/OUTPUT_FILENAME.mp3
```
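
The inline quoting above breaks if the text itself contains a double quote or a backslash. Here is a minimal sketch of one way around that, assuming single-line text. The helper names `json_escape` and `build_tts_payload` are hypothetical (they are not part of the skill or the Fish Audio API), and the payload fields simply mirror the request above.

```bash
# Hypothetical helpers (not part of the skill): build the request body safely.

# Escape backslashes and double quotes so a string can sit inside a JSON
# string literal. Assumes single-line input; newlines and other control
# characters are not handled here.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Emit the same JSON fields as the curl example, with escaped values.
# $1 = text (with bracket tags), $2 = voice reference ID
build_tts_payload() {
  printf '{"text":"%s","reference_id":"%s","format":"mp3","mp3_bitrate":192,"temperature":0.8}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Example: print a payload for a line that contains quotes.
build_tts_payload '[excited] She said "hello"!' 'REFERENCE_ID'
```

The result could then be passed to the request as `-d "$(build_tts_payload "$text" "$reference_id")"` in place of the hand-quoted JSON.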

**Important:** This API call requires network access. Always use `network_mode: proxied` when running this command.
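
Because `curl -s --output` writes whatever the server returns, a failed request can leave an error message inside a file named `.mp3`. Below is a rough sanity check, assuming (the source does not confirm this) that errors come back as a JSON body starting with `{`. The function name `looks_like_audio` is hypothetical.

```bash
# Hypothetical check (not part of the skill): reject empty files and files
# that start with '{', which is assumed here to indicate a JSON error body.
looks_like_audio() {
  [ -s "$1" ] || return 1               # missing or empty file
  if head -c 1 "$1" | LC_ALL=C grep -q '{'; then
    return 1                            # first byte is '{': probably not audio
  fi
  return 0
}

# Example usage after a generation step (the path is a placeholder):
if looks_like_audio scratch/OUTPUT_FILENAME.mp3; then
  echo "clip looks OK"
else
  echo "clip is missing, empty, or looks like an error response" >&2
fi
```

This only catches the obvious failure modes; it does not verify that the file is valid MP3.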

## Generating Multiple Clips & Combining

For longer pieces (narrations, multi-part messages), generate each clip separately, then combine them with ffmpeg:

### 1. Generate silence for gaps between clips

```bash
ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 1.5 -q:a 9 -acodec libmp3lame scratch/silence.mp3 -y
```

### 2. Create a concat file

```bash
cat > scratch/concat.txt << 'EOF'
file 'clip1.mp3'
file 'silence.mp3'
file 'clip2.mp3'
file 'silence.mp3'
file 'clip3.mp3'
EOF
```
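
When the number of clips varies, the concat file can be generated rather than typed by hand. This is a small sketch: the function name `write_concat_list` is hypothetical, and it assumes the silence file is named `silence.mp3` and sits in the same directory as the clips, as in the example above.

```bash
# Hypothetical helper (not part of the skill): write an ffmpeg concat list
# with 'silence.mp3' between consecutive clips.
# $1 = output list file, remaining arguments = clip file names in order
write_concat_list() {
  out=$1
  shift
  : > "$out"                                 # start with an empty list
  first=1
  for clip in "$@"; do
    if [ "$first" -eq 0 ]; then
      printf "file 'silence.mp3'\n" >> "$out"  # gap before every clip but the first
    fi
    printf "file '%s'\n" "$clip" >> "$out"
    first=0
  done
}

# Example: write a list for three clips to a temporary file and show it.
list=$(mktemp)
write_concat_list "$list" clip1.mp3 clip2.mp3 clip3.mp3
cat "$list"
```

File names containing a single quote would need extra escaping, which this sketch does not attempt.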

### 3. Combine

```bash
ffmpeg -f concat -safe 0 -i scratch/concat.txt -c copy scratch/final_output.mp3 -y
```

## Bracket Syntax — Complete Guide

Fish Audio S2 uses `[bracket]` syntax for inline emotion and prosody control. This is the core of what makes the voice expressive. Tags are natural-language instructions placed directly in the text that control how words are spoken — the delivery, emotion, pacing, or vocal quality at that exact point.

**Key principle:** You are not choosing from a fixed menu. You write the description, and S2 interprets it. If you can describe it to a voice actor, S2 can attempt it. More than 15,000 unique tags are supported, and the system understands free-form descriptions.

### How Placement Works

Tags affect what comes **after** them. Place the tag at the **exact point** where the shift should happen. Placement IS meaning.

```
[whispering] I didn't want to go inside.     <- whispers the entire line
I didn't want to go [whispering] inside.     <- only whispers from "inside" onward
```

Tags can go **anywhere** — start, middle, or end of a sentence. They apply from the point they appear until the next tag or end of the sentence.
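
Because placement carries the meaning, it can help to see which tags a line contains, and how the plain text reads, before sending it. Here is a small sketch using standard `grep` and `sed`. The function names `list_tags` and `strip_tags` are hypothetical, and the pattern assumes a tag never contains a nested `]`.

```bash
# Hypothetical helpers (not part of the skill) for inspecting tagged text.

# Print each [bracket] tag on its own line, in order of appearance.
list_tags() {
  printf '%s\n' "$1" | grep -o '\[[^]]*]'
}

# Remove the tags (and any spaces right after them) to preview the bare script.
strip_tags() {
  printf '%s\n' "$1" | sed 's/\[[^]]*] *//g'
}

line="I didn't want to go [whispering] inside."
list_tags "$line"
strip_tags "$line"
```

Comparing the stripped text with the tag list is a quick way to confirm that each tag sits immediately before the words it is meant to affect.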

### Well-Tested Tags (Reliable Out of the Box)

These tags consistently produce strong results. Organized by category:

#### Emotions

| Tag             | Effect                  | Best For                    |
| --------------- | ----------------------- | --------------------------- |
| `[happy]`       | Cheerful, upbeat        | Good news, greetings        |
| `[sad]`         | Melancholic, downcast   | Sympathy, vulnerability     |
| `[angry]`       | Frustrated, aggressive  | Arguments, complaints       |
| `[excited]`     | Energetic, enthusiastic | Celebrations, announcements |
| `[surprised]`   | Shocked, amazed         | Reactions, discoveries      |
| `[embarrassed]` | Awkward, flustered      | Mistakes, confessions       |
| `[delight]`     | Very pleased, joyful    | Genuine happiness           |
| `[nervous]`     | Anxious, uncertain      | Vulnerability, apologies    |
| `[confident]`   | Assertive, self-assured | Bold statements             |
| `[nostalgic]`   | Longing for the past    | Memories, stories           |
| `[scared]`      | Frightened, fearful     | Warnings, tension           |
| `[jealous]`     | Envious, resentful      | Comparisons, possessiveness |
| `[shocked]`     | Sudden realization      | Dramatic reveals            |
| `[moved]`       | Emotionally touched     | Heartfelt moments           |

#### Voice Quality & Style

| Tag                    | Effect               | Best For                   |
| ---------------------- | -------------------- | -------------------------- |
| `[soft]`               | Gentle, tender       | Intimate moments, kindness |
| `[whisper]`            | Very quiet, close    | Secrets, tension, suspense |
| `[breathy]`            | Airy, expressive     | Vulnerability, emphasis    |
| `[low voice]`          | Deep, quiet register | Gravity, seriousness       |
| `[loud]`               | Raised volume        | Emphasis, excitement       |
| `[screaming]`          | Full volume yelling  | Anger, extreme excitement  |
| `[shouting]`           | Forceful projection  | Arguments, calling out     |
| `[emphasis]`           | Stressed delivery    | Key words, making a point  |
| `[singing]`            | Musical quality      | Playfulness, joy           |
| `[echo]`               | Reverberant effect   | Dramatic moments           |
| `[with strong accent]` | Pronounced accent    | Character work             |

#### Par