Create Speech Generation Task

curl --request POST \
  --url https://api.minimax.io/v1/t2a_async_v2 \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "speech-2.8-hd",
  "text": "Omg(sighs), the real danger is not that computers start thinking like people, but that people start thinking like computers. Computers can only help us with simple tasks.",
  "language_boost": "auto",
  "voice_setting": {
    "voice_id": "English_expressive_narrator",
    "speed": 1,
    "vol": 1,
    "pitch": 1
  },
  "pronunciation_dict": {
    "tone": [
      "Omg/Oh my god"
    ]
  },
  "audio_setting": {
    "audio_sample_rate": 32000,
    "bitrate": 128000,
    "format": "mp3",
    "channel": 2
  },
  "voice_modify": {
    "pitch": 0,
    "intensity": 0,
    "timbre": 0,
    "sound_effects": "spacious_echo"
  },
  "continuous_sound": false
}
'

{
  "task_id": 95157322514444,
  "task_token": "eyJhbGciOiJSUz",
  "file_id": 95157322514444,
  "usage_characters": 101,
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}

POST /v1/t2a_async_v2

Returned File Information

The return result for a single file input is shown below.
If the input is a compressed package containing multiple files, a corresponding folder will be generated for each file. The contents inside each folder are the same as those for a single file input.

Input File Type: txt File

Output Files:
- Audio File: Format follows the request body settings.
- Subtitle File: Sentence-level subtitle information.
- Extra JSON File: Additional information related to the audio file.

Input File Type: json File

title Field Output Files (if this field is empty, no files will be generated):
- Audio File: Format follows the request body settings.
- Subtitle File: Sentence-level subtitle information.
- Extra JSON File: Additional information related to the audio file.

content Field Output Files (if this field is empty, no files will be generated):
- Audio File: Format follows the request body settings.
- Subtitle File: Sentence-level subtitle information.
- Extra JSON File: Additional information related to the audio file.

extra Field Output Files (if this field is empty, no files will be generated):
- Audio File: Format follows the request body settings.
- Subtitle File: Sentence-level subtitle information.
- Extra JSON File: Additional information related to the audio file.
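
As a rough illustration of the json rules above, the sketch below predicts which fields of one json input produce output files. It assumes that "empty" means a missing key or an empty string, and the labels in OUTPUT_KINDS are hypothetical names for the three file types, not actual file names used by the service.

```python
# Fields of a json input file that can each produce a set of output files.
JSON_FIELDS = ("title", "content", "extra")

# Hypothetical labels for the three file types listed above.
OUTPUT_KINDS = ("audio", "subtitle", "extra_json")


def expected_outputs(document):
    """Map each non-empty field to the kinds of files it should produce.

    Assumption: a field counts as empty when it is missing or an empty string.
    """
    return {
        field: list(OUTPUT_KINDS)
        for field in JSON_FIELDS
        if document.get(field)
    }
```

For example, a document with only title and content filled in would map to two entries, and the extra field would be absent from the result.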

Authorizations

Authorization

string

header

required

HTTP: Bearer Auth

Security Scheme Type: http
HTTP Authorization Scheme: Bearer. Pass your API key as the token; it can be found under Account Management > API Keys.

Headers

Content-Type

enum<string>

default:application/json

required

The media type of the request body. Must be set to application/json to ensure the data is sent in JSON format.
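
The two headers above can be assembled as in the following sketch, which uses only the Python standard library. The function name make_request is illustrative, and the function only builds the request object; it does not send anything.

```python
import json
import urllib.request

API_URL = "https://api.minimax.io/v1/t2a_async_v2"


def make_request(api_key, body):
    """Build (but do not send) a POST request carrying both required headers."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),  # JSON-encoded request body
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # Bearer auth with the API key
            "Content-Type": "application/json",    # the only accepted media type
        },
    )
```

The returned object can be passed to urllib.request.urlopen when you are ready to send it.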

Available options:

application/json

Body

application/json

model

enum<string>

required

Model version to call.

Available options:

speech-2.8-hd
speech-2.8-turbo
speech-2.6-hd
speech-2.6-turbo
speech-02-hd
speech-02-turbo
speech-01-hd
speech-01-turbo

text

string

required

Text content to convert to audio, max length 50,000 characters. Mutually exclusive with text_file_id (one is required).

Interjection tags: Only supported when using speech-2.8-hd or speech-2.8-turbo models. Supported interjections: (laughs), (chuckle), (coughs), (clear-throat), (groans), (breath), (pant), (inhale), (exhale), (gasps), (sniffs), (sighs), (snorts), (burps), (lip-smacking), (humming), (hissing), (emm), (whistles), (sneezes), (crying), (applause).
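
The constraints on text can be checked locally before sending a request. The sketch below is one possible pre-flight check, not part of the API: build_body and find_interjections are hypothetical helper names, the length check counts Python characters (which may differ from how the service counts), and rejecting interjection tags on older models is a stricter local choice, since the page does not say how the service treats them.

```python
import re

MAX_TEXT_CHARS = 50_000  # limit stated for the text field
INTERJECTION_MODELS = {"speech-2.8-hd", "speech-2.8-turbo"}
INTERJECTIONS = {
    "laughs", "chuckle", "coughs", "clear-throat", "groans", "breath",
    "pant", "inhale", "exhale", "gasps", "sniffs", "sighs", "snorts",
    "burps", "lip-smacking", "humming", "hissing", "emm", "whistles",
    "sneezes", "crying", "applause",
}


def find_interjections(text):
    """Return the supported interjection tags found in text, in order."""
    return [tag for tag in re.findall(r"\(([a-z-]+)\)", text)
            if tag in INTERJECTIONS]


def build_body(model, voice_id, text=None, text_file_id=None):
    """Build a request body with exactly one of text or text_file_id."""
    if (text is None) == (text_file_id is None):
        raise ValueError("provide exactly one of text or text_file_id")
    body = {"model": model, "voice_setting": {"voice_id": voice_id}}
    if text is None:
        body["text_file_id"] = text_file_id
        return body
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    # Local assumption: treat tags on unsupported models as an error.
    if find_interjections(text) and model not in INTERJECTION_MODELS:
        raise ValueError(f"interjection tags are not supported by {model}")
    body["text"] = text
    return body
```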

text_file_id

integer<int64>

required

ID of the text file to synthesize. Max 100,000 characters. Supported formats: txt, zip. Mutually exclusive with text (one is required).

txt file: Supports customizing speech pauses by adding markers in the form <#x#>, where x is the pause duration in seconds. Valid range: [0.01, 99.99], up to two decimal places. Pause markers must be placed between speakable text segments and cannot be used consecutively.
zip file: Must contain files of the same type (txt or json).
- json format supports the ["title", "content", "extra"] fields. Each non-empty field generates an audio file, subtitles, and metadata, which are stored in a folder.
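
The pause-marker rules for txt files can be checked before upload. The following sketch is one interpretation of those rules, with check_pause_markers as a hypothetical helper name. It assumes that "speakable text" means any non-whitespace text, that a marker at the very start or end of the text is not allowed, and that durations must be written with a leading digit (for example 0.5, not .5).

```python
import re

MARKER = re.compile(r"<#(.*?)#>")            # a pause marker such as <#1.5#>
DURATION = re.compile(r"\d{1,2}(\.\d{1,2})?")  # at most two decimal places


def check_pause_markers(text):
    """Return a list of problems found; an empty list means no issue found."""
    problems = []
    # Splitting on a pattern with one group yields text and values alternately.
    parts = MARKER.split(text)
    segments, values = parts[0::2], parts[1::2]
    for value in values:
        if not DURATION.fullmatch(value) or not (0.01 <= float(value) <= 99.99):
            problems.append(f"invalid pause duration {value!r}")
    if values:
        if not segments[0].strip():
            problems.append("marker at start of text")
        if not segments[-1].strip():
            problems.append("marker at end of text")
        if any(not segment.strip() for segment in segments[1:-1]):
            problems.append("consecutive markers")
    return problems
```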

voice_setting

object

required

audio_setting

object

pronunciation_dict

object

language_boost

enum<string>

Controls whether recognition for specific minority languages and dialects is enhanced. Default is null. If the language type is unknown, set to "auto" and the model will automatically detect it.

Available options:

Chinese
Chinese,Yue
English
Arabic
Russian
Spanish
French
Portuguese
German
Turkish
Dutch
Ukrainian
Vietnamese
Indonesian
Japanese
Italian
Korean
Thai
Polish
Romanian
Greek
Czech
Finnish
Hindi
Bulgarian
Danish
Hebrew
Malay
Persian
Slovak
Swedish
Croatian
Filipino
Hungarian
Norwegian
Slovenian
Catalan
Nynorsk
Tamil
Afrikaans
auto

voice_modify

object

Voice effect settings. Supported audio formats: mp3, flac.

continuous_sound

boolean

default:false

Enable this parameter to improve the naturalness of transitions between clauses. Only supported for speech-2.8-hd and speech-2.8-turbo models.

Response

200 - application/json

task_id

string

Task ID

file_id

integer<int64>

The corresponding audio file ID is returned once the task is successfully created.

When the task is complete, you can use the file_id to call the File (Retrieve) API to download the file.

If the request fails, this field will not be returned.

Note: The download URL is valid for 9 hours (32,400 seconds) from the time it is generated. After expiration, the file will no longer be available and the generated data will be lost, so please ensure you download it within the validity period.
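
Because the download URL expires, it can help to track the deadline on the client side. The sketch below assumes you record the time at which the URL was generated yourself (generated_at); treating the exact 9-hour mark as already expired is a conservative local choice, not something stated above.

```python
from datetime import datetime, timedelta, timezone

URL_TTL = timedelta(seconds=32_400)  # 9 hours, as stated in the note above


def url_expiry(generated_at):
    """Return the moment at which a URL generated at generated_at expires."""
    return generated_at + URL_TTL


def is_expired(generated_at, now=None):
    """Return True once the validity period has elapsed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= url_expiry(generated_at)
```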

task_token

string

Token for completing the task

usage_characters

integer

Number of billed characters

base_resp

object

Status code and details.
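
A client will usually check base_resp before using the other fields. The sketch below parses a response like the sample on this page. It assumes that any status_code other than 0 indicates failure (the page only shows 0 with "success"), and it normalizes task_id to a string because the sample shows a number while the field is documented as a string. TaskCreationError and parse_create_response are hypothetical names.

```python
import json


class TaskCreationError(Exception):
    """Raised when base_resp does not report success."""


def parse_create_response(raw):
    """Parse the JSON response body and return the fields needed later."""
    payload = json.loads(raw)
    base = payload.get("base_resp", {})
    # Assumption: 0 means success, anything else is treated as a failure.
    if base.get("status_code") != 0:
        raise TaskCreationError(
            f"{base.get('status_code')}: {base.get('status_msg')}"
        )
    return {
        "task_id": str(payload["task_id"]),
        "file_id": payload["file_id"],  # not returned when the request fails
        "usage_characters": payload.get("usage_characters"),
    }
```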
