Skill122 repo starsupdated 8d ago

venice-audio-transcription

This Venice Transcription skill provides OpenAI-compatible speech-to-text conversion via POST to `/audio/transcriptions`, accepting audio files in wav, flac, m4a, aac, mp4, mp3, ogg, or webm format. Use it to transcribe voice notes, meetings, and podcasts with optional timestamp generation and language hints, choosing from models like Parakeet for fast English transcription or Whisper for large multilingual processing.

View source Repository: skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/veniceai/skills /tmp/venice-audio-transcription && cp -r /tmp/venice-audio-transcription/skills/venice-audio-transcription ~/.claude/skills/venice-audio-transcription

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Venice Transcription (`/audio/transcriptions`)

`POST /api/v1/audio/transcriptions` takes an audio file and returns text. It's OpenAI-compatible with `multipart/form-data` — the OpenAI SDK's `audio.transcriptions.create()` works unchanged.

## Use when

- You need STT (speech-to-text) for voice notes, meetings, podcasts, short audio.
- You need timestamps for subtitles / chapters.
- You want to pick between fast local-style models (Parakeet) and large multilingual ones (Whisper, Wizper, Scribe).

For long video / YouTube transcription, see [`venice-video`](../venice-video/SKILL.md)'s `/video/transcriptions` (takes a public video URL directly).

## Minimal request

```bash
curl https://api.venice.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -F "file=@./meeting.m4a" \
  -F "model=nvidia/parakeet-tdt-0.6b-v3" \
  -F "response_format=json" \
  -F "timestamps=false"
```

```json
{ "text": "Alright everyone, let's kick off the meeting..." }
```

With `timestamps=true`, `json` format also returns segment/word timings (schema is model-specific).

## Request (`multipart/form-data`)

| Field | Type | Default | Notes |
|---|---|---|---|
| `file` | binary | — | **Required.** Audio file. Supported: `wav`, `wave`, `flac`, `m4a`, `aac`, `mp4`, `mp3`, `ogg`, `webm`. Base64 is **not** accepted — upload as a real file. |
| `model` | enum | `nvidia/parakeet-tdt-0.6b-v3` | See models below. |
| `response_format` | `json` / `text` | `json` | `text` returns `text/plain` body. |
| `timestamps` | bool | `false` | Include segment/word timestamps (JSON only). |
| `language` | string | — | ISO 639-1 hint (e.g. `en`, `ja`). Only Whisper-family models honor it; others auto-detect. |

## Models

| Model ID | Notes |
|---|---|
| `nvidia/parakeet-tdt-0.6b-v3` | Default. Fast, English-first, great for real-time-ish flows. |
| `openai/whisper-large-v3` | Large multilingual, honors `language` hint. |
| `fal-ai/wizper` | Whisper variant, competitive on quality/latency tradeoff. |
| `elevenlabs/scribe-v2` | ElevenLabs Scribe, strong on noisy audio. |
| `stt-xai-v1` | xAI Speech-to-Text. |

`GET /models?type=asr` returns the current catalog. ASR pricing is `pricing.per_audio_second.usd` — cost scales with audio duration.

## OpenAI SDK

```ts
import OpenAI from 'openai'
import fs from 'node:fs'

const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const out = await client.audio.transcriptions.create({
  file: fs.createReadStream('meeting.m4a'),
  model: 'openai/whisper-large-v3',
  response_format: 'json',
  language: 'en',
  // @ts-expect-error — Venice-specific extra, passes through multipart
  timestamps: true,
})

console.log(out.text)
```

## Batch / long files

Venice doesn't expose native chunking. For files > ~30 min, split client-side on silence with `ffmpeg` or `pydub`, transcribe each chunk, then concatenate with offset timestamps.

```bash
ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
```

## Errors

| Code | Meaning |
|---|---|
| `400` | Bad params, unsupported audio format, empty file, or **file larger than 25 MB** (this endpoint returns `400` with `"Maximum size is 25MB"`, not `413`). |
| `401` | Auth / Pro-only. |
| `402` | Insufficient balance. |
| `415` | Wrong `Content-Type` — must be `multipart/form-data`. |
| `422` | Validation / upstream ASR error (e.g. zero-length audio, upstream provider 422). Not a "content policy" code on this path. |
| `429` | Rate limited. |
| `500` / `503` | Transient; retry with jitter. |

## Gotchas

- `file` must be uploaded as a real multipart file part. JSON + base64 is **not** supported here.
- Timestamps are only surfaced in the JSON response shapes (`json`, `verbose_json`, `srt`, `vtt`). With `response_format: text` the handler returns a plain `text/plain` body containing just the transcript — you'll lose any timestamp data, so pick `verbose_json` / `srt` / `vtt` when you need timings.
- `language` is Whisper-specific. Parakeet / Scribe ignore it and auto-detect.
- Peak concurrency limits apply — on `429`, back off; big batches should throttle to ~5 parallel requests.
- Content-policy rejection on the transcript is returned as `422` with an error string; it does not surface `suggested_prompt` on this path.