Audio

Generate speech from text

post

Generate speech from text using the Sesame (CSM 1B) TTS model via direct processing.

Available Models: sesame/csm-1b

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | Text to convert to speech (max 2000 characters) |

Optional Parameters - Voice Selection

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| preset_voice | string | | Pre-configured voice: Alice, Avery, Brock, Chloe, Ella, Emma, Grace, Karen, Kevin, Lucas, Matt, William |
| custom_voice | object | | Clone a voice: {"url": "https://...", "context_text": "..."} |
| model | string | sesame/csm-1b | TTS model to use |

Voice Selection Modes

1. Preset Voice (Recommended) - Use a pre-configured voice profile
2. Custom Voice Cloning - Clone any voice from an audio URL
3. Random Voice - Omit both for parametric synthesis with a random voice

Note: Choose only ONE mode per request. Do not combine preset_voice with custom_voice.
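The one-mode-per-request rule can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_tts_payload` is a hypothetical helper name, and only the field names `text`, `preset_voice`, and `custom_voice` come from this page.

```python
from typing import Optional


def build_tts_payload(
    text: str,
    preset_voice: Optional[str] = None,
    custom_voice: Optional[dict] = None,
) -> dict:
    """Build a TTS request body that uses exactly one voice selection mode.

    Hypothetical client-side helper; the service performs its own validation.
    """
    if not text:
        raise ValueError("text must not be empty")
    if preset_voice is not None and custom_voice is not None:
        raise ValueError("use preset_voice or custom_voice, not both")

    payload = {"text": text}
    if preset_voice is not None:
        payload["preset_voice"] = preset_voice  # mode 1: preset voice
    elif custom_voice is not None:
        payload["custom_voice"] = custom_voice  # mode 2: voice cloning
    # mode 3: neither key is set, so a random voice is used
    return payload
```

Calling `build_tts_payload("Hello", preset_voice="Alice")` yields a body with `text` and `preset_voice` only; passing both voice arguments raises `ValueError` before any network call is made.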

Response

Returns binary WAV audio file (16-bit PCM, mono, 24kHz) with Content-Type: audio/wav
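To confirm that a received payload matches the documented format, the WAV header can be inspected with Python's standard `wave` module. This is a sketch under the assumption that the response body is already held in memory as bytes; `describe_wav` and `matches_documented_format` are illustrative names, not part of the API.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read the header fields of a WAV payload held in memory."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def matches_documented_format(info: dict) -> bool:
    """True if the audio is 16-bit PCM, mono, 24 kHz as described above."""
    return (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2  # 2 bytes per sample = 16-bit
        and info["sample_rate_hz"] == 24000
    )
```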

Async Mode

Set async: true to get a task_address immediately and poll for results. When complete, result contains media_key to fetch audio via /v2/audio/media/{media_key}.
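The polling flow can be sketched as a small loop. The shape of the task status payload is not specified on this page, so the code below makes two assumptions: the payload has a boolean `done` field, and `media_key` sits under a `result` object. The helper names `wait_for_task` and `media_path` are hypothetical, and the status call is injected as a function so the loop does not depend on any particular HTTP client.

```python
import time
from typing import Callable


def wait_for_task(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 120.0,
) -> dict:
    """Call fetch_status until the task reports completion or the timeout passes.

    Assumption: the status payload carries a boolean "done" field.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("done"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        time.sleep(interval_s)


def media_path(status: dict) -> str:
    """Build the audio download path from a finished task's result.

    Assumption: media_key is nested under "result".
    """
    media_key = status["result"]["media_key"]
    return f"/v2/audio/media/{media_key}"
```

In practice `fetch_status` would wrap a GET request to /v2/tasks/{task_address} and return the decoded JSON body.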

Header parameters

x-api-key (string or null, optional)

Body

Request schema for Text-to-Speech generation

text (string, required, min: 1, max: 10000)
Text to convert to speech
Example: The quick brown fox jumps over the lazy dog.

model (string or null, optional)
TTS model. Available: 'sesame/csm-1b' (default)
Example: sesame/csm-1b

preset_voice (string or null, optional)
Preset voice name. Use GET /v2/audio/tts/presets for available voices.
Example: Alice

custom_voice (object or null, optional)
Custom voice cloning config (cannot use with preset_voice)

mode (string enum or null, optional)
Routing mode: 'auto' (intelligent), 'opengpu' (blockchain), 'direct' (low-latency)
Default: auto

async (boolean or null, optional)
Async mode: returns task_address immediately, poll /v2/tasks/{task_address} for result.
Default: false (sync mode)

Responses

200

Successful Response

application/json

Response: any

202

Task accepted (async mode). Poll the poll_url for status.

application/json

422

Validation Error

application/json

post

/v2/audio/tts/sesame

POST /v2/audio/tts/sesame HTTP/1.1
Content-Type: application/json
Accept: */*
Content-Length: 51

{
  "text": "Hello, this is a test of text to speech."
}

No content
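The same request can be issued from Python with the standard library. This is a sketch, not an official client: `BASE_URL` is a placeholder because this page does not state the API host, and `build_tts_request` and `synthesize_to_file` are hypothetical helper names. The endpoint path and the `x-api-key` header come from this page.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.example.com"  # placeholder: replace with the real API host


def build_tts_request(text: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Prepare a POST to the TTS endpoint without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["x-api-key"] = api_key  # optional, per the header parameters above
    return urllib.request.Request(
        f"{BASE_URL}/v2/audio/tts/sesame",
        data=body,
        headers=headers,
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str, api_key: Optional[str] = None) -> None:
    """Send the request and write the response body to out_path."""
    with urllib.request.urlopen(build_tts_request(text, api_key)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Separating request construction from sending keeps the payload easy to inspect and log before any network traffic occurs.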

List available voice presets

get

List available TTS voice presets.

Returns a list of preset voice names that can be used with the preset_voice parameter. Available to both guests and authenticated users.

Header parameters

x-api-key (string or null, optional)

Responses

200

Successful Response

application/json

422

Validation Error

application/json

get

/v2/audio/tts/presets

GET /v2/audio/tts/presets HTTP/1.1
Accept: */*

{
  "presets": [
    "text"
  ],
  "count": 1
}
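A client might use this response to validate a voice name before calling the TTS endpoint. The sketch below assumes the decoded JSON has the `presets` and `count` fields shown above; `parse_presets` and `resolve_voice` are hypothetical helpers, and the case-insensitive matching is a client-side convenience, not documented API behavior.

```python
from typing import List


def parse_presets(payload: dict) -> List[str]:
    """Return the preset names, checking them against the reported count."""
    presets = list(payload.get("presets", []))
    count = payload.get("count")
    if count is not None and count != len(presets):
        raise ValueError(f"count is {count} but {len(presets)} presets were listed")
    return presets


def resolve_voice(name: str, presets: List[str]) -> str:
    """Match a user-supplied name to a preset, ignoring case."""
    for preset in presets:
        if preset.lower() == name.lower():
            return preset
    raise ValueError(f"unknown preset voice: {name!r}")
```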

Transcribe audio to text

post

Transcribe audio to text using Whisper large-v3 model.

Available Models: openai/whisper-large-v3

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| audio_url | string | URL to audio file (WAV, MP3, M4A, WEBM, FLAC, OGG) |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| language | string | | Language hint for better accuracy (ISO 639-1 code, e.g., "en", "es", "fr") |
| model | string | openai/whisper-large-v3 | ASR model to use |
| chunk_length_s | integer | 30 | Chunk length in seconds for processing long audio |
| batch_size | integer | 24 | Batch size for parallel chunk processing |

Response

{
  "text": "Transcribed text content",
  "language": "en",
  "duration": 5.2
}

Supported Languages: 90+ languages including English, Spanish, French, German, Chinese, Japanese, etc.

Async Mode

Set async: true to get a task_address immediately and poll for results.

Header parameters

x-api-key (string or null, optional)

Body

Request schema for Automatic Speech Recognition (Whisper).

Supports transcription (original language) and translation (to English).

audio_url (string, required)
URL to audio file (WAV, MP3, M4A, WEBM, FLAC supported)
Example: https://example.com/audio.wav

model (string or null, optional)
ASR model. Available: 'openai/whisper-large-v3' (default)
Example: openai/whisper-large-v3

task (string enum or null, optional)
Task: 'transcribe' (original language) or 'translate' (to English)
Default: transcribe

language (string or null, optional)
Language hint (ISO 639-1 code). Auto-detected if not specified.
Example: en

return_timestamps (boolean, 'word', or null; optional)
Return timestamps: false (none), true (segment-level), 'word' (word-level)
Default: false

temperature (number or null, optional, max: 1)
Sampling temperature (0.0 = deterministic)

chunk_length_s (integer or null, optional, min: 1, max: 60)
Process audio in chunks of this length in seconds (default: 30)

batch_size (integer or null, optional, min: 1, max: 32)
Batch size for processing chunks (default: 16)

mode (string enum or null, optional)
Routing mode: 'auto' or 'direct' (ASR is direct-only)
Default: auto

async (boolean or null, optional)
Return task_address immediately, then poll /v2/tasks/{task_address} for the result.
Default: false (synchronous)

Responses

200

Successful Response

application/json

202

Task accepted (async mode). Poll the poll_url for status.

application/json

422

Validation Error

application/json

post

/v2/audio/asr/whisper

POST /v2/audio/asr/whisper HTTP/1.1
Content-Type: application/json
Accept: */*
Content-Length: 77

{
  "audio_url": "https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3"
}

{
  "text": "text",
  "language": "text",
  "duration": 1,
  "task_address": "text",
  "mode": "text",
  "done": true,
  "ANY_ADDITIONAL_PROPERTY": "anything"
}
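The ranges and enum values listed in the request schema can be checked client-side before a request is sent. The following is a minimal sketch: `build_asr_payload` is a hypothetical helper, the limits are copied from the schema on this page, and the service remains the authority on what it accepts.

```python
from typing import Optional, Union


def build_asr_payload(
    audio_url: str,
    task: str = "transcribe",
    language: Optional[str] = None,
    return_timestamps: Union[bool, str] = False,
    chunk_length_s: Optional[int] = None,
    batch_size: Optional[int] = None,
) -> dict:
    """Build an ASR request body, rejecting values outside the documented ranges."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    if return_timestamps not in (True, False, "word"):
        raise ValueError("return_timestamps must be a boolean or 'word'")
    if chunk_length_s is not None and not 1 <= chunk_length_s <= 60:
        raise ValueError("chunk_length_s must be between 1 and 60")
    if batch_size is not None and not 1 <= batch_size <= 32:
        raise ValueError("batch_size must be between 1 and 32")

    payload = {
        "audio_url": audio_url,
        "task": task,
        "return_timestamps": return_timestamps,
    }
    # Optional fields are sent only when set, so server defaults apply otherwise.
    if language is not None:
        payload["language"] = language
    if chunk_length_s is not None:
        payload["chunk_length_s"] = chunk_length_s
    if batch_size is not None:
        payload["batch_size"] = batch_size
    return payload
```

Leaving `chunk_length_s` and `batch_size` unset avoids hard-coding defaults on the client.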

