Text-to-speech - MyTokenGate

1. Use Cases

Text-to-Speech (TTS) models convert input text into natural, fluent, and expressive speech. Typical application scenarios include:

Providing audio narration for blog articles
Generating multilingual speech content
Supporting real-time streaming audio output

2. API Usage Guide

Endpoint: /audio/speech. For details, refer to the API documentation.
Key request parameters:
- model: The model used for speech synthesis. See Section 3 for the supported model list.
- input: The text content to be converted into audio.
- voice: Reference voice, supporting system preset voices, user preset voices, and user dynamic voices.
- speed: Controls the audio speed. Type: float. Default: 1.0. Range: [0.25, 4.0].
- gain: Audio gain in dB, controlling volume. Type: float. Default: 0.0. Range: [-10, 10].
- response_format: Output format. Supported: mp3, opus, wav, pcm.
- sample_rate: Output sampling rate. Varies by output type:
  - Opus: 48000 Hz only
  - Wav, pcm: 8000, 16000, 24000, 32000, 44100 Hz. Default: 44100
  - Mp3: 32000, 44100 Hz. Default: 44100

Note: Do not add spaces to the input content, and the reference audio should be less than 30 seconds.
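
The parameter ranges above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the official API: the helper name build_speech_payload is hypothetical, and it assumes the request body uses the parameter names exactly as listed above. When sample_rate is omitted, the field is left out so that the service applies its own default.

```python
# Allowed sample rates per output format, copied from the parameter list above.
ALLOWED_SAMPLE_RATES = {
    "opus": (48000,),
    "wav": (8000, 16000, 24000, 32000, 44100),
    "pcm": (8000, 16000, 24000, 32000, 44100),
    "mp3": (32000, 44100),
}


def build_speech_payload(model, text, voice, response_format="mp3",
                         sample_rate=None, speed=1.0, gain=0.0):
    """Hypothetical helper: validate parameters and return a request body."""
    if response_format not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be within [0.25, 4.0]")
    if not -10 <= gain <= 10:
        raise ValueError("gain must be within [-10, 10] dB")

    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "gain": gain,
    }
    # Only send sample_rate when the caller sets it explicitly.
    if sample_rate is not None:
        if sample_rate not in ALLOWED_SAMPLE_RATES[response_format]:
            raise ValueError(
                f"{response_format} does not support {sample_rate} Hz"
            )
        payload["sample_rate"] = sample_rate
    return payload
```

The returned dictionary could be passed as the JSON body of a POST to the /audio/speech endpoint.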

2.1 System Preset Voices

The system currently provides the following 8 preset voices:

Male voices:
- Steady male voice: alex
- Deep male voice: benjamin
- Magnetic male voice: charles
- Cheerful male voice: david

Female voices:
- Steady female voice: anna
- Passionate female voice: bella
- Gentle female voice: claire
- Cheerful female voice: diana

To use a system preset voice in a request, prefix the voice name with the model name, separated by a colon. For example:

FunAudioLLM/CosyVoice2-0.5B:alex indicates the alex voice from the FunAudioLLM/CosyVoice2-0.5B model.

fishaudio/fish-speech-1.5:anna indicates the anna voice from the fishaudio/fish-speech-1.5 model.
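
The naming rule can be expressed as a small helper. This is an illustrative sketch only: the function name preset_voice is hypothetical, and the set of valid names is simply the eight preset voices listed above.

```python
# The eight system preset voice names listed in this section.
PRESET_VOICES = (
    "alex", "benjamin", "charles", "david",
    "anna", "bella", "claire", "diana",
)


def preset_voice(model, name):
    """Hypothetical helper: return the '<model>:<voice>' identifier."""
    if name not in PRESET_VOICES:
        raise ValueError(f"unknown preset voice: {name}")
    return f"{model}:{name}"


print(preset_voice("fishaudio/fish-speech-1.5", "anna"))
```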

2.2 User Preset Voices

To ensure the quality of the generated voice, it is recommended that users upload a voice sample that is 8 to 10 seconds long, with clear pronunciation and no background noise or interference.

2.2.1 Upload User Preset Voice via Base64 Encoding


import requests
import json

url = "https://gateway.mytokengate.com/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key",  # Obtain from https://mytokengate.com/app/dashboard
    "Content-Type": "application/json"
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",  # Model name
    "customName": "your-voice-name",  # Custom audio name
    "audio": "data:audio/mpeg;base64,...",  # Base64 encoded reference audio
    "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins."  # Text content of reference audio
}
response = requests.post(url, headers=headers, data=json.dumps(data))
# Print response status code and content
print(response.status_code)
print(response.json())  # If the response is in JSON format

The returned uri field in the response is the ID of the custom voice, which can be used as the voice parameter in subsequent requests.


{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

To use a user preset voice in a request, see Section 5.2, Using User Preset Voices.
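
The audio field in the upload request above holds a data URI. One way to build such a value from raw audio bytes is sketched below. The helper name audio_to_data_uri is hypothetical, and the audio/mpeg MIME type is taken from the sample request; whether other MIME types are accepted is not stated here.

```python
import base64


def audio_to_data_uri(audio_bytes, mime_type="audio/mpeg"):
    """Hypothetical helper: encode raw audio bytes as a base64 data URI."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Usage sketch: read a local file and encode it for the "audio" field.
# with open("/path/to/audio.mp3", "rb") as f:
#     audio_field = audio_to_data_uri(f.read())
```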

2.2.2 Upload User Preset Voice via File


import requests

url = "https://gateway.mytokengate.com/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key"  # Obtain from https://mytokengate.com/app/dashboard
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",  # Model name
    "customName": "your-voice-name",  # Custom audio name
    "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins."  # Text content of reference audio
}
# Open the reference audio in a context manager so the file handle is always closed
with open("/path/to/audio.mp3", "rb") as audio_file:
    files = {"file": audio_file}  # Reference audio file
    response = requests.post(url, headers=headers, files=files, data=data)
print(response.status_code)
print(response.json())  # Print response content (if in JSON format)

The returned uri field in the response is the ID of the custom voice, which can be used as the voice parameter in subsequent requests.


{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

To use a user preset voice in a request, see Section 5.2, Using User Preset Voices.
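
Before reusing the returned value as the voice parameter, it can help to check that the upload response actually contains a usable uri. The sketch below is an assumption-laden illustration: the helper name extract_voice_uri is hypothetical, and the "speech:" prefix check is inferred only from the sample response above, not from a documented format.

```python
def extract_voice_uri(body):
    """Hypothetical helper: pull the voice URI out of a parsed upload response."""
    uri = body.get("uri") if isinstance(body, dict) else None
    # Assumption: voice URIs start with "speech:", as in the sample response.
    if not isinstance(uri, str) or not uri.startswith("speech:"):
        raise ValueError(f"response does not contain a voice uri: {body!r}")
    return uri


sample = {"uri": "speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd"}
voice = extract_voice_uri(sample)  # value to pass as the voice parameter
```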

2.3 Retrieve User Dynamic Voice List


import requests

url = "https://gateway.mytokengate.com/v1/audio/voice/list"
headers = {
    "Authorization": "Bearer your-api-key"  # Obtain from https://mytokengate.com/app/dashboard
}
response = requests.get(url, headers=headers)
print(response.status_code)
print(response.json())

The returned uri field in the response is the ID of the custom voice, which can be used as the voice parameter in subsequent requests.


{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}


2.4 Use User Dynamic Voices

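To use a user dynamic voice in a request, leave the voice parameter empty and pass the reference audio together with its text content in the references field of the request body. See Section 5.3, Using User Dynamic Voices, for a complete example.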
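
As a rough sketch, the arguments for such a request can be assembled as follows. The helper name dynamic_voice_kwargs is hypothetical; the field names (voice, extra_body, references, audio, text) mirror the example in Section 5.3, and the reference audio URL in the usage line is a placeholder.

```python
def dynamic_voice_kwargs(model, text, reference_audio, reference_text,
                         response_format="mp3"):
    """Hypothetical helper: keyword arguments for a dynamic-voice request."""
    return {
        "model": model,
        "voice": "",  # left empty so the reference audio defines the voice
        "input": text,
        "response_format": response_format,
        "extra_body": {
            "references": [
                {
                    "audio": reference_audio,  # URL or base64 data URI
                    "text": reference_text,    # text content of the reference audio
                }
            ]
        },
    }


kwargs = dynamic_voice_kwargs(
    "FunAudioLLM/CosyVoice2-0.5B",
    "Hello there.",
    "https://example.com/reference.mp3",  # placeholder URL
    "Text spoken in the reference audio.",
)
```

With the OpenAI SDK client used in Section 5, these arguments could be unpacked into client.audio.speech.with_streaming_response.create(**kwargs).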

2.5 Delete User Dynamic Voice


import requests

url = "https://gateway.mytokengate.com/v1/audio/voice/deletions"
headers = {
    "Authorization": "Bearer your-api-key",
    "Content-Type": "application/json"
}
payload = {
    "uri": "speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd"
}
response = requests.post(url, json=payload, headers=headers)
print(response.status_code)
print(response.text)  # Print response content

The uri field in the request parameters is the ID of the custom voice.

3. Supported Model List

Models UTF-8 byte Online byte counter demo

3.1 fishaudio/fish-speech Series Models

fish-speech-1.5: Supported languages are Chinese, English, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, and Portuguese.

3.2 FunAudioLLM/CosyVoice2-0.5B Series Models

Cross-language speech synthesis: Enables speech synthesis across different languages, including Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Zhengzhou dialect, Changsha dialect, Tianjin dialect).
Emotion control: Supports generating speech with various emotional expressions, such as happiness, excitement, sadness, and anger.
Fine-grained control: Allows fine-grained control of emotions and prosody in generated speech through rich text or natural language input.
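
The examples in Section 5 express natural language control by placing an instruction before the text to be spoken, separated by the <|endofprompt|> marker. A minimal sketch of that pattern follows. The helper name with_instruction is hypothetical, and the exact spacing around the marker is copied from those examples rather than from a formal specification.

```python
END_OF_PROMPT = "<|endofprompt|>"


def with_instruction(instruction, text):
    """Hypothetical helper: prepend a natural language instruction to the text."""
    # Layout copied from the Section 5 examples: "<instruction> <|endofprompt|><text>"
    return f"{instruction.strip()} {END_OF_PROMPT}{text.strip()}"


prompt = with_instruction(
    "Could you mimic a Cantonese accent?",
    "Take care and rest early.",
)
print(prompt)
```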

4. Best Practices for Reference Audio

Providing high-quality reference audio samples can improve voice cloning results.

4.1 Audio Quality Guidelines

Single speaker only
Clear articulation, stable volume, pitch, and emotion
Short pauses (recommended: 0.5 seconds)
Ideal conditions: No background noise, professional recording quality, no room echo
Recommended duration: 8–10 seconds
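
The duration guidance can be checked locally before uploading. The sketch below uses Python's standard wave module, so it only handles WAV input; the function names and the three status labels are hypothetical. The 30-second upper limit comes from the note in Section 2, and the 8 to 10 second window from the guideline above.

```python
import io
import wave


def wav_duration_seconds(wav_bytes):
    """Return the duration of a WAV file given its raw bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_duration(seconds):
    """Hypothetical helper: classify a reference clip by its length."""
    if seconds >= 30:
        return "too_long"      # reference audio should be less than 30 seconds
    if 8 <= seconds <= 10:
        return "recommended"   # recommended duration: 8 to 10 seconds
    return "acceptable"
```

For compressed formats such as mp3 or opus, a third-party decoder would be needed to measure the duration; that is outside the scope of this sketch.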

4.2 File Format

Supported formats: mp3, wav, pcm, opus
Recommended: Use mp3 at 192 kbps or higher to avoid quality loss
Uncompressed formats (e.g., WAV) offer limited additional benefits

5. Examples

5.1 Using System Preset Voices


from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"
client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://mytokengate.com/app/dashboard
    base_url="https://gateway.mytokengate.com/v1"
)
with client.audio.speech.with_streaming_response.create(
    model="FunAudioLLM/CosyVoice2-0.5B",  # Supported models: fishaudio / CosyVoice2-0.5B
    voice="FunAudioLLM/CosyVoice2-0.5B:alex",  # System preset voice
    input="Can you say this with happiness? <|endofprompt|>Today is wonderful, the holidays are coming! I'm so happy, Spring Festival is coming!",
    response_format="mp3"  # Supported formats: mp3, wav, pcm, opus
) as response:
    response.stream_to_file(speech_file_path)

5.2 Using User Preset Voices


from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"
client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://mytokengate.com/app/dashboard
    base_url="https://gateway.mytokengate.com/v1"
)
with client.audio.speech.with_streaming_response.create(
    model="FunAudioLLM/CosyVoice2-0.5B",  # Supported models: fishaudio / CosyVoice2-0.5B
    voice="speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd",  # uri returned when the custom voice was uploaded
    input="Could you mimic a Cantonese accent? <|endofprompt|>Take care and rest early.",
    response_format="mp3"
) as response:
    response.stream_to_file(speech_file_path)

5.3 Using User Dynamic Voices


from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"
client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://mytokengate.com/app/dashboard
    base_url="https://gateway.mytokengate.com/v1"
)
with client.audio.speech.with_streaming_response.create(
    model="FunAudioLLM/CosyVoice2-0.5B",
    voice="",  # Leave empty to use dynamic voices
    input="[laughter] Sometimes, watching the innocent actions of children [laughter], we can't help but smile.",
    response_format="mp3",
    extra_body={"references": [
        {
            "audio": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/voice_template/fish_audio-Alex.mp3",  # Reference audio URL. Base64 format also supported
            "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins.",  # Text content of reference audio
        }
    ]}
) as response:
    response.stream_to_file(speech_file_path)