venice-audio-transcription
Transcribe audio files to text via POST /audio/transcriptions. Covers supported models (Parakeet, Whisper, Wizper, Scribe, xAI STT), supported formats (wav/flac/m4a/aac/mp4/mp3/ogg/webm), response formats (json/text), timestamps, and language hints. OpenAI-compatible multipart.
git clone --depth 1 https://github.com/veniceai/skills /tmp/venice-audio-transcription && cp -r /tmp/venice-audio-transcription/skills/venice-audio-transcription ~/.claude/skills/venice-audio-transcriptionSKILL.md
# Venice Transcription (`/audio/transcriptions`)
`POST /api/v1/audio/transcriptions` takes an audio file and returns text. It's OpenAI-compatible with `multipart/form-data` — the OpenAI SDK's `audio.transcriptions.create()` works unchanged.
## Use when
- You need STT (speech-to-text) for voice notes, meetings, podcasts, short audio.
- You need timestamps for subtitles / chapters.
- You want to pick between fast local-style models (Parakeet) and large multilingual ones (Whisper, Wizper, Scribe).
For long video / YouTube transcription, see [`venice-video`](../venice-video/SKILL.md)'s `/video/transcriptions` (takes a public video URL directly).
## Minimal request
```bash
curl https://api.venice.ai/api/v1/audio/transcriptions \
-H "Authorization: Bearer $VENICE_API_KEY" \
-F "file=@./meeting.m4a" \
-F "model=nvidia/parakeet-tdt-0.6b-v3" \
-F "response_format=json" \
-F "timestamps=false"
```
```json
{ "text": "Alright everyone, let's kick off the meeting..." }
```
With `timestamps=true`, `json` format also returns segment/word timings (schema is model-specific).
## Request (`multipart/form-data`)
| Field | Type | Default | Notes |
|---|---|---|---|
| `file` | binary | — | **Required.** Audio file. Supported: `wav`, `wave`, `flac`, `m4a`, `aac`, `mp4`, `mp3`, `ogg`, `webm`. Base64 is **not** accepted — upload as a real file. |
| `model` | enum | `nvidia/parakeet-tdt-0.6b-v3` | See models below. |
| `response_format` | `json` / `text` | `json` | `text` returns `text/plain` body. |
| `timestamps` | bool | `false` | Include segment/word timestamps (JSON only). |
| `language` | string | — | ISO 639-1 hint (e.g. `en`, `ja`). Only Whisper-family models honor it; others auto-detect. |
## Models
| Model ID | Notes |
|---|---|
| `nvidia/parakeet-tdt-0.6b-v3` | Default. Fast, English-first, great for real-time-ish flows. |
| `openai/whisper-large-v3` | Large multilingual, honors `language` hint. |
| `fal-ai/wizper` | Whisper variant, competitive on quality/latency tradeoff. |
| `elevenlabs/scribe-v2` | ElevenLabs Scribe, strong on noisy audio. |
| `stt-xai-v1` | xAI Speech-to-Text. |
`GET /models?type=asr` returns the current catalog. ASR pricing is `pricing.per_audio_second.usd` — cost scales with audio duration.
## OpenAI SDK
```ts
import OpenAI from 'openai'
import fs from 'node:fs'
const client = new OpenAI({
apiKey: process.env.VENICE_API_KEY,
baseURL: 'https://api.venice.ai/api/v1',
})
const out = await client.audio.transcriptions.create({
file: fs.createReadStream('meeting.m4a'),
model: 'openai/whisper-large-v3',
response_format: 'json',
language: 'en',
// @ts-expect-error — Venice-specific extra, passes through multipart
timestamps: true,
})
console.log(out.text)
```
## Batch / long files
Venice doesn't expose native chunking. For files > ~30 min, split client-side on silence with `ffmpeg` or `pydub`, transcribe each chunk, then concatenate with offset timestamps.
```bash
ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
```
## Errors
| Code | Meaning |
|---|---|
| `400` | Bad params, unsupported audio format, empty file, or **file larger than 25 MB** (this endpoint returns `400` with `"Maximum size is 25MB"`, not `413`). |
| `401` | Auth / Pro-only. |
| `402` | Insufficient balance. |
| `415` | Wrong `Content-Type` — must be `multipart/form-data`. |
| `422` | Validation / upstream ASR error (e.g. zero-length audio, upstream provider 422). Not a "content policy" code on this path. |
| `429` | Rate limited. |
| `500` / `503` | Transient; retry with jitter. |
## Gotchas
- `file` must be uploaded as a real multipart file part. JSON + base64 is **not** supported here.
- Timestamps are only surfaced in the JSON response shapes (`json`, `verbose_json`, `srt`, `vtt`). With `response_format: text` the handler returns a plain `text/plain` body containing just the transcript — you'll lose any timestamp data, so pick `verbose_json` / `srt` / `vtt` when you need timings.
- `language` is Whisper-specific. Parakeet / Scribe ignore it and auto-detect.
- Peak concurrency limits apply — on `429`, back off; big batches should throttle to ~5 parallel requests.
- Content-policy rejection on the transcript is returned as `422` with an error string; it does not surface `suggested_prompt` on this path.Manage Venice API keys. Covers GET/POST/PATCH/DELETE /api_keys, GET /api_keys/{id}, GET /api_keys/rate_limits, GET /api_keys/rate_limits/log, the two-step /api_keys/generate_web3_key wallet flow, INFERENCE vs ADMIN key types, and per-key consumption limits (USD / DIEM).
High-level map of the Venice.ai API - base URL, authentication modes, endpoint categories, response headers, pricing model, error shape, and versioning. Load this first when starting any Venice integration.
Async music / audio-track generation via Venice. Covers the /audio/quote + /audio/queue + /audio/retrieve + /audio/complete lifecycle, lyrics vs instrumental, voice selection, duration, language, speed, model capability probing, and webhook-free polling.
Generate speech from text via POST /audio/speech. Covers TTS models (Kokoro, Qwen 3, xAI, Inworld, Chatterbox, Orpheus, ElevenLabs Turbo, MiniMax, Gemini Flash), voices per family, output formats (mp3/opus/aac/flac/wav/pcm), streaming, prompt/emotion styling, temperature/top_p, and language hints.
Venice augmentation endpoints for agent pipelines. Covers POST /augment/text-parser (extract text from PDF/DOCX/XLSX/plain text, multipart, up to 25MB, JSON or plain text response), POST /augment/scrape (fetch a URL and return markdown; blocks X/Reddit), and POST /augment/search (Brave ZDR or anonymized Google; structured title/url/content/date results, up to 20 per query). Privacy (zero data retention), rate limits, and error shapes.
Authenticate to the Venice API with a Bearer API key or with an x402 / SIWE wallet. Covers header formats, the SIWE message fields, TTL and nonce rules, the venice-x402-client SDK, and how to choose between the two modes.
Venice billing and usage analytics - GET /billing/balance, GET /billing/usage (paginated per-request ledger, JSON or CSV), and GET /billing/usage-analytics (aggregated by date/model/key). Covers the DIEM/USD/BUNDLED_CREDITS consumption priority and building dashboards. (Beta)
Discover and use Venice public characters (persona-driven system prompts with a bound model). Covers GET /characters (search/filter/sort), /characters/{slug}, /characters/{slug}/reviews, the Character schema, and how to apply a character via venice_parameters.character_slug in chat completions.