Audio

Generate speech from text

post

Generate speech from text using the Sesame (CSM 1B) TTS model via direct processing.

Available Models: sesame/csm-1b

Required Parameters

Parameter | Type | Description
text | string | Text to convert to speech (max 2000 characters)

Optional Parameters - Voice Selection

Parameter | Type | Default | Description
preset_voice | string | - | Pre-configured voice: Alice, Avery, Brock, Chloe, Ella, Emma, Grace, Karen, Kevin, Lucas, Matt, William
custom_voice | object | - | Clone a voice: {"url": "https://...", "context_text": "..."}
model | string | sesame/csm-1b | TTS model to use

Voice Selection Modes

1. Preset Voice (Recommended) - Use a pre-configured voice profile
2. Custom Voice Cloning - Clone any voice from an audio URL
3. Random Voice - Omit both for parametric synthesis with random voice

Note: Choose only ONE mode per request. Do not combine preset_voice with custom_voice.
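The one-mode-per-request rule can be checked on the client before sending anything. The following is a minimal sketch, not an official client: the helper name `build_tts_payload` is hypothetical, and requiring both `url` and `context_text` in `custom_voice` is an assumption based on the example shape shown in the table above.

```python
import json
from typing import Optional


def build_tts_payload(text: str,
                      preset_voice: Optional[str] = None,
                      custom_voice: Optional[dict] = None,
                      model: str = "sesame/csm-1b") -> dict:
    """Build a TTS request body, enforcing one voice mode per request."""
    if not text:
        raise ValueError("text must not be empty")
    if preset_voice is not None and custom_voice is not None:
        raise ValueError("use preset_voice or custom_voice, not both")

    payload = {"text": text, "model": model}
    if preset_voice is not None:
        payload["preset_voice"] = preset_voice
    elif custom_voice is not None:
        # Assumption: both keys from the documented example shape are needed.
        missing = {"url", "context_text"} - custom_voice.keys()
        if missing:
            raise ValueError(f"custom_voice is missing: {sorted(missing)}")
        payload["custom_voice"] = custom_voice
    # Neither field set: the request falls through to the random-voice mode.
    return payload


print(json.dumps(build_tts_payload("Hello there.", preset_voice="Alice")))
```

The resulting dictionary would be sent as the JSON body of a POST to /v2/audio/tts/sesame.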

Response

Returns a binary WAV audio file (16-bit PCM, mono, 24 kHz) with Content-Type: audio/wav.
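A quick way to confirm that a downloaded response matches this format is to read its header with Python's standard `wave` module. This is an illustrative sketch; the helper names `describe_wav` and `matches_documented_format` are hypothetical.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read the WAV header from raw bytes and return its basic format fields."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bits": wav.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_documented_format(info: dict) -> bool:
    """True if the audio is 16-bit mono at 24 kHz, as described above."""
    fields = (info["channels"], info["sample_width_bits"], info["sample_rate_hz"])
    return fields == (1, 16, 24000)
```

Pass the raw response body to `describe_wav`, then check the result with `matches_documented_format` before writing the file to disk.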

Async Mode

Set async: true to receive a task_address immediately, then poll /v2/tasks/{task_address} for the result. When the task completes, result contains a media_key for fetching the audio via /v2/audio/media/{media_key}.
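The polling flow might look like the sketch below. The shape of the task response is not specified here, so the code assumes a JSON object whose `result` field is empty until the task completes and then holds `media_key`. The `fetch_task` callable is a hypothetical stand-in for whatever HTTP client performs the GET, and failed tasks are not handled.

```python
import time
from typing import Callable


def wait_for_media_key(task_address: str,
                       fetch_task: Callable[[str], dict],
                       interval_s: float = 2.0,
                       max_attempts: int = 30) -> str:
    """Poll a task until its result carries a media_key; return the media path."""
    for attempt in range(max_attempts):
        task = fetch_task(f"/v2/tasks/{task_address}")
        result = task.get("result") or {}  # assumed to be empty until complete
        media_key = result.get("media_key")
        if media_key:
            return f"/v2/audio/media/{media_key}"
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(
        f"task {task_address} did not finish after {max_attempts} polls")
```

Injecting `fetch_task` keeps the loop independent of any particular HTTP library and makes it easy to exercise with a stub.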

Header parameters
x-api-key: string or null · Optional
Body

Request schema for Text-to-Speech generation

text: string (min: 1, max: 10000) · Required

Text to convert to speech

Example: The quick brown fox jumps over the lazy dog.

model: string or null · Optional

TTS model. Available: 'sesame/csm-1b' (default)

Example: sesame/csm-1b

preset_voice: string or null · Optional

Preset voice name. Use GET /v2/audio/tts/presets for available voices.

Example: Alice

custom_voice: object or null · Optional

Custom voice cloning config (cannot use with preset_voice)

mode: string (enum) or null · Optional

Routing mode: 'auto' (intelligent), 'opengpu' (blockchain), 'direct' (low-latency)

Default: auto

async: boolean or null · Optional

Async mode: returns task_address immediately, poll /v2/tasks/{task_address} for result. Default: false (sync mode).

Default: false

Responses
200

Successful Response

application/json
Response: any

POST /v2/audio/tts/sesame


List available voice presets

get

List available TTS voice presets.

Returns a list of preset voice names that can be used with the preset_voice parameter. Available to both guests and authenticated users.

Header parameters
x-api-key: string or null · Optional
Responses
200

Successful Response

application/json
GET /v2/audio/tts/presets
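Fetching the preset list could be sketched with the standard library as follows. `BASE_URL` is a placeholder rather than the real host, the helper names are hypothetical, and the exact JSON shape of the response is not specified here, so the parsed body is returned untouched.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.example.com"  # placeholder, replace with the real host


def presets_request(api_key: Optional[str] = None) -> urllib.request.Request:
    """Build the GET request; the API key is optional since guests are allowed."""
    headers = {"Accept": "application/json"}
    if api_key:
        headers["x-api-key"] = api_key
    return urllib.request.Request(
        f"{BASE_URL}/v2/audio/tts/presets", headers=headers, method="GET")


def list_presets(api_key: Optional[str] = None):
    """Send the request and return the parsed JSON body as-is."""
    with urllib.request.urlopen(presets_request(api_key), timeout=30) as resp:
        return json.load(resp)
```

Any name returned by this endpoint can then be passed as preset_voice in a TTS request.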

Transcribe audio to text

post

Transcribe audio to text using the Whisper large-v3 model.

Available Models: openai/whisper-large-v3

Required Parameters

Parameter | Type | Description
audio_url | string | URL to audio file (WAV, MP3, M4A, WEBM, FLAC, OGG)

Optional Parameters

Parameter | Type | Default | Description
language | string | - | Language hint for better accuracy (ISO 639-1 code, e.g., "en", "es", "fr")
model | string | openai/whisper-large-v3 | ASR model to use
chunk_length_s | integer | 30 | Chunk length in seconds for processing long audio
batch_size | integer | 24 | Batch size for parallel chunk processing

Response

{
  "text": "Transcribed text content",
  "language": "en",
  "duration": 5.2
}
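
A response of this shape can be parsed into a small typed record. This is a sketch only: the `Transcript` class and `parse_transcript` helper are hypothetical names, and treating `language` and `duration` as possibly absent is a defensive assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    language: str
    duration: float


def parse_transcript(body: str) -> Transcript:
    """Parse the JSON response body into a Transcript record."""
    data = json.loads(body)
    return Transcript(
        text=data["text"],
        language=data.get("language", ""),
        duration=float(data.get("duration", 0.0)),
    )


sample = '{"text": "Transcribed text content", "language": "en", "duration": 5.2}'
print(parse_transcript(sample))
```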

Supported Languages: 90+ languages including English, Spanish, French, German, Chinese, Japanese, etc.

Async Mode

Set async: true to receive a task_address immediately, then poll /v2/tasks/{task_address} for the result.

Header parameters
x-api-key: string or null · Optional
Body

Request schema for Automatic Speech Recognition (Whisper).

Supports transcription (original language) and translation (to English).

audio_url: string · Required

URL to audio file (WAV, MP3, M4A, WEBM, FLAC supported)

Example: https://example.com/audio.wav

model: string or null · Optional

ASR model. Available: 'openai/whisper-large-v3' (default)

Example: openai/whisper-large-v3

task: string (enum) or null · Optional

Task: 'transcribe' (original language) or 'translate' (to English)

Default: transcribe

language: string or null · Optional

Language hint (ISO 639-1 code). Auto-detected if not specified.

Example: en

return_timestamps: boolean, 'word', or null · Optional

Return timestamps: False (none), True (segment-level), 'word' (word-level)

Default: false

temperature: number (max: 1) or null · Optional

Sampling temperature (0.0 = deterministic)

chunk_length_s: integer (min: 1, max: 60) or null · Optional

Process audio in chunks of this length in seconds (default: 30)

batch_size: integer (min: 1, max: 32) or null · Optional

Batch size for processing chunks (default: 16)

mode: string (enum) or null · Optional

Routing mode: 'auto' or 'direct' (ASR is direct-only)

Default: auto

async: boolean or null · Optional

Return task_address immediately, poll /v2/tasks/{task_address} for result. Default: false (synchronous).

Default: false

Responses
200

Successful Response

application/json
POST /v2/audio/asr/whisper
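
The ranges in the request schema can be validated on the client before the request goes out. The sketch below uses a hypothetical helper, `build_asr_payload`. Unset optional fields are left out of the body on the assumption that the service then applies its own defaults.

```python
from typing import Optional, Union


def build_asr_payload(audio_url: str,
                      task: str = "transcribe",
                      language: Optional[str] = None,
                      return_timestamps: Union[bool, str] = False,
                      temperature: Optional[float] = None,
                      chunk_length_s: Optional[int] = None,
                      batch_size: Optional[int] = None,
                      use_async: bool = False) -> dict:
    """Build an ASR request body, checking the documented ranges client-side."""
    if not audio_url:
        raise ValueError("audio_url is required")
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    if not (isinstance(return_timestamps, bool) or return_timestamps == "word"):
        raise ValueError("return_timestamps must be a boolean or 'word'")
    if temperature is not None and temperature > 1:
        raise ValueError("temperature must be at most 1")
    if chunk_length_s is not None and not 1 <= chunk_length_s <= 60:
        raise ValueError("chunk_length_s must be between 1 and 60")
    if batch_size is not None and not 1 <= batch_size <= 32:
        raise ValueError("batch_size must be between 1 and 32")

    payload = {"audio_url": audio_url, "task": task,
               "return_timestamps": return_timestamps, "async": use_async}
    optional = {"language": language, "temperature": temperature,
                "chunk_length_s": chunk_length_s, "batch_size": batch_size}
    # Leave unset fields out so the service can apply its own defaults.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

For example, `build_asr_payload("https://example.com/audio.wav", language="en", return_timestamps="word")` produces a body that requests word-level timestamps with an English language hint.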
