Text to Speech (T2A) HTTP - MiniMax API Docs

Authorizations

Authorization

string

header

required

HTTP: Bearer Auth

Security Scheme Type: http
HTTP Authorization Scheme: Bearer. Send the header as "Authorization: Bearer <API_key>". The API key can be found under Account Management > API Keys.

Headers

Content-Type

enum<string>

default:application/json

required

The media type of the request body. Must be set to application/json to ensure the data is sent in JSON format.

Available options:

application/json
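The two required headers can be assembled as in the minimal sketch below. The helper name build_headers is hypothetical and not part of the API; the sketch only builds a dictionary and sends nothing.

```python
def build_headers(api_key: str) -> dict:
    """Return the two headers this endpoint requires.

    build_headers is a hypothetical helper name used only for illustration.
    """
    if not api_key:
        raise ValueError("an API key is required")
    return {
        # Bearer scheme: the literal word "Bearer", a space, then the key.
        "Authorization": f"Bearer {api_key}",
        # The request body must be sent as JSON.
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the headers argument of whichever HTTP client you use.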

Body

application/json

model

enum<string>

required

The speech synthesis model version to use.

Available options:

speech-2.8-hd
speech-2.8-turbo
speech-2.6-hd
speech-2.6-turbo
speech-02-hd
speech-02-turbo
speech-01-hd
speech-01-turbo

text

string

required

The text to be converted into speech. Must be less than 10,000 characters.

For texts over 3,000 characters, streaming output is recommended.
Paragraph breaks should be marked with newline characters.
Pause control: You can customize speech pauses by adding markers in the form <#x#>, where x is the pause duration in seconds. Valid range: [0.01, 99.99], up to two decimal places. Pause markers must be placed between speakable text segments and cannot be used consecutively.
Inline pronunciation: To override the pronunciation of a word or polyphonic character, wrap Mandarin Pinyin (with tone numbers 1–5), IPA symbols, or Cantonese Jyutping (with tone numbers 1–6) in half-width parentheses. Examples:
- "The word live is pronounced (lɪv) as a verb and (laɪv) as an adjective."
- "This is (he2)平, not (huo4)面."
- "去街市買啲(sung3)。"
Interjection tags: Only supported when using speech-2.8-hd or speech-2.8-turbo models. Supported interjections: (laughs), (chuckle), (coughs), (clear-throat), (groans), (breath), (pant), (inhale), (exhale), (gasps), (sniffs), (sighs), (snorts), (burps), (lip-smacking), (humming), (hissing), (emm), (sneezes).
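The pause-marker and length rules above can be checked on the client before sending a request. The following is a minimal sketch, not an official validator: the helper names pause and check_text are hypothetical, and it assumes the 10,000-character limit is counted the way Python's len counts characters. It also treats a marker at the very start or end of the text as not being "between speakable text segments".

```python
import re

# Matches anything written in the <#x#> pause-marker form.
MARKER_RE = re.compile(r"<#(.*?)#>")
# A valid x: one or two integer digits, optionally up to two decimal places.
VALUE_RE = re.compile(r"\d{1,2}(?:\.\d{1,2})?")


def pause(seconds: float) -> str:
    """Return a pause marker such as <#0.50#> (hypothetical helper)."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause duration must be within [0.01, 99.99] seconds")
    return f"<#{seconds:.2f}#>"


def check_text(text: str) -> list:
    """Return a list of problems found in the text; empty means none found."""
    problems = []
    # Assumption: the limit is counted in characters as len() counts them.
    if len(text) >= 10_000:
        problems.append("text must be less than 10,000 characters")
    for match in MARKER_RE.finditer(text):
        raw = match.group(1)
        if not VALUE_RE.fullmatch(raw) or not 0.01 <= float(raw) <= 99.99:
            problems.append(f"invalid pause marker {match.group(0)}")
    # Two markers separated by nothing but whitespace count as consecutive.
    if re.search(r"#>\s*<#", text):
        problems.append("pause markers cannot be used consecutively")
    stripped = text.strip()
    if stripped.startswith("<#") or stripped.endswith("#>"):
        problems.append("pause markers must sit between speakable text")
    return problems
```

For example, "Hello" + pause(0.5) + "world" produces a text that check_text accepts, while two markers placed back to back are reported as consecutive.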

stream

boolean

Whether to enable streaming output. Defaults to false.

stream_options

object

voice_setting

object

audio_setting

object

pronunciation_dict

object

timbre_weights

object[]

Timbre weights (legacy field)

language_boost

enum<string>

Enhances recognition of the specified minority language or dialect. Defaults to null. If the language is unknown, set this to "auto" and the model will detect it automatically.

Note: The speech-01 and speech-02 series models do not currently support Persian, Filipino, or Tamil.

Available options:

Chinese
Chinese,Yue
English
Arabic
Russian
Spanish
French
Portuguese
German
Turkish
Dutch
Ukrainian
Vietnamese
Indonesian
Japanese
Italian
Korean
Thai
Polish
Romanian
Greek
Czech
Finnish
Hindi
Bulgarian
Danish
Hebrew
Malay
Persian
Slovak
Swedish
Croatian
Filipino
Hungarian
Norwegian
Slovenian
Catalan
Nynorsk
Tamil
Afrikaans
auto
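The note about the speech-01 and speech-02 series can be turned into a client-side check. This is a sketch under assumptions: the function name language_boost_supported is hypothetical, and it identifies a model's series by its name prefix, which the reference does not state explicitly.

```python
from typing import Optional

# Languages the note lists as unsupported on the speech-01 and speech-02 series.
UNSUPPORTED_ON_OLDER_SERIES = {"Persian", "Filipino", "Tamil"}


def language_boost_supported(model: str, language_boost: Optional[str]) -> bool:
    """Return False only for the combinations the note rules out.

    Assumption: a model's series is identified by its name prefix.
    """
    if language_boost is None:  # null is the default and always allowed
        return True
    older_series = model.startswith(("speech-01", "speech-02"))
    return not (older_series and language_boost in UNSUPPORTED_ON_OLDER_SERIES)
```

The check does not validate that language_boost is one of the available options; it only covers the model restriction.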

voice_modify

object

Voice effects configuration.

Supported audio formats:

Non-streaming: mp3, wav, flac
Streaming: mp3
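The format restriction can be expressed as a small lookup. This is only a sketch: the function name is hypothetical, and where the audio format is set in the request (presumably a child attribute of audio_setting) is an assumption, since those attributes are not listed here.

```python
def voice_modify_format_ok(audio_format: str, stream: bool) -> bool:
    """Check an audio format against the voice_modify restriction.

    Hypothetical helper; audio_format is assumed to be the format
    requested elsewhere in the body.
    """
    allowed = {"mp3"} if stream else {"mp3", "wav", "flac"}
    return audio_format in allowed
```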

subtitle_enable

boolean

default:false

Controls whether subtitles are enabled. Default is false. Available for models: speech-2.8-hd, speech-2.8-turbo, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, speech-01-turbo.

subtitle_type

enum<string>

default:sentence

Subtitle granularity. Default is sentence. Options:

sentence: sentence-level timestamps
word: word-level timestamps
word_streaming: word-level timestamps optimized for streaming, only valid when stream=true

Available options:

sentence
word
word_streaming
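Because word_streaming is only valid when stream=true, it can help to build the two subtitle fields together. A minimal sketch, using a hypothetical helper name subtitle_options:

```python
SUBTITLE_TYPES = {"sentence", "word", "word_streaming"}


def subtitle_options(subtitle_type: str = "sentence", stream: bool = False) -> dict:
    """Return the subtitle fields for a request body (hypothetical helper)."""
    if subtitle_type not in SUBTITLE_TYPES:
        raise ValueError(f"unknown subtitle_type: {subtitle_type!r}")
    # word_streaming is documented as valid only for streaming requests.
    if subtitle_type == "word_streaming" and not stream:
        raise ValueError("word_streaming requires stream=true")
    return {"subtitle_enable": True, "subtitle_type": subtitle_type}
```

The returned dictionary is meant to be merged into the rest of the request body.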

output_format

enum<string>

default:hex

Controls the format of the returned audio: url or hex. Defaults to hex. This setting only takes effect for non-streaming requests; streaming responses always use hex. A returned URL is valid for 24 hours.

Available options:

url
hex
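Putting the top-level fields together, a request body might be assembled as in the sketch below. The function name build_body is hypothetical. Falling back to hex for streaming requests is a design choice made for this sketch, based on the statement that only hex is supported in streaming; how the API itself treats url in a streaming request is not specified here.

```python
import json

MODELS = {
    "speech-2.8-hd", "speech-2.8-turbo",
    "speech-2.6-hd", "speech-2.6-turbo",
    "speech-02-hd", "speech-02-turbo",
    "speech-01-hd", "speech-01-turbo",
}


def build_body(model: str, text: str, stream: bool = False,
               output_format: str = "hex") -> dict:
    """Assemble the required and a few optional fields (hypothetical helper)."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if output_format not in {"url", "hex"}:
        raise ValueError("output_format must be 'url' or 'hex'")
    if stream:
        # Sketch-level choice: streaming only supports hex, so fall back to it.
        output_format = "hex"
    return {
        "model": model,
        "text": text,
        "stream": stream,
        "output_format": output_format,
    }


# Serialize for the request; the body is sent as application/json.
payload = json.dumps(build_body("speech-2.8-hd", "Hello, world."))
```

Optional objects such as voice_setting or audio_setting are omitted because their child attributes are not listed in this section.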

Response

data

object

The synthesized audio data object. The returned data object may be null, so a null check is required.

trace_id

string

The session ID, used for troubleshooting and support.

extra_info

object

Additional audio information.

base_resp

object

Status code and details of this request.
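The null check on data, combined with the default hex output format, might look like the sketch below. It rests on a loudly stated assumption: the child attributes of data are not listed in this section, so the field name "audio" is a guess used for illustration, as is the helper name extract_audio.

```python
from typing import Optional


def extract_audio(response: dict) -> Optional[bytes]:
    """Return decoded audio bytes, or None when no audio is present.

    Assumption: the hex string lives in data["audio"]; that field name is
    not confirmed by this section.
    """
    data = response.get("data")
    if data is None:  # data may be null, so check before reading from it
        return None
    audio_hex = data.get("audio")  # assumed field name
    if not audio_hex:
        return None
    # With output_format=hex the audio arrives as a hexadecimal string.
    return bytes.fromhex(audio_hex)
```

If the request used output_format=url, the value would be a URL rather than hex, and bytes.fromhex would raise ValueError; in that case the audio should be downloaded instead. When extract_audio returns None, base_resp and trace_id are the places to look for the cause.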
