Text-to-speech - MyTokenGate
1. Use Cases
Text-to-Speech (TTS) models convert input text into natural, fluent, and expressive speech. Typical application scenarios include:
- Providing audio narration for blog articles
- Generating multilingual speech content
- Supporting real-time streaming audio output
2. API Usage Guide
- Endpoint: /audio/speech. For details, refer to the API documentation.
- Key request parameters:
  - model: The model used for speech synthesis. See the Supported Model List (Section 3).
  - input: The text content to be converted into audio.
  - voice: Reference voice, supporting system preset voices, user preset voices, and user dynamic voices.
  - speed: Controls the audio speed. Type: float. Default: 1.0. Range: [0.25, 4.0].
  - gain: Audio gain in dB, controlling volume. Type: float. Default: 0.0. Range: [-10, 10].
  - response_format: Output format. Supported: mp3, opus, wav, pcm.
  - sample_rate: Output sampling rate. Varies by output type:
    - opus: 48000 Hz only
    - wav, pcm: 8000, 16000, 24000, 32000, 44100 Hz. Default: 44100
    - mp3: 32000, 44100 Hz. Default: 44100
Note: Do not add spaces to the input content, and the reference audio should be less than 30 seconds.
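The documented ranges above lend themselves to a client-side check before a request is sent. The following is a minimal sketch under stated assumptions, not part of the MyTokenGate API: build_speech_payload and ALLOWED_SAMPLE_RATES are hypothetical names, and the lowercase format keys are assumed from the response_format values listed above.

```python
# Allowed sample rates per response_format, copied from the parameter list above.
ALLOWED_SAMPLE_RATES = {
    "opus": {48000},
    "wav": {8000, 16000, 24000, 32000, 44100},
    "pcm": {8000, 16000, 24000, 32000, 44100},
    "mp3": {32000, 44100},
}


def build_speech_payload(model, text, voice, response_format="mp3",
                         sample_rate=None, speed=1.0, gain=0.0):
    """Hypothetical helper: validate parameters and return a request body dict."""
    if response_format not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be within [0.25, 4.0]")
    if not -10 <= gain <= 10:
        raise ValueError("gain must be within [-10, 10] dB")

    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "gain": gain,
    }
    # Only send sample_rate when the caller sets it, so the server default applies otherwise.
    if sample_rate is not None:
        if sample_rate not in ALLOWED_SAMPLE_RATES[response_format]:
            raise ValueError(
                f"sample_rate {sample_rate} is not valid for {response_format}"
            )
        payload["sample_rate"] = sample_rate
    return payload
```

The returned dict can be passed as the JSON body of a POST to /audio/speech. Rejecting an invalid combination locally (for example mp3 at 8000 Hz) avoids a round trip that would presumably fail on the server.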
2.1 System Preset Voices
The system currently provides the following 8 preset voices:
- Male voices:
  - Steady male voice: alex
  - Deep male voice: benjamin
  - Magnetic male voice: charles
  - Cheerful male voice: david
- Female voices:
  - Steady female voice: anna
  - Passionate female voice: bella
  - Gentle female voice: claire
  - Cheerful female voice: diana
To use a system preset voice in a request, prefix the voice name with the model name, separated by a colon. For example:
FunAudioLLM/CosyVoice2-0.5B:alex indicates the alex voice from the FunAudioLLM/CosyVoice2-0.5B model.
fishaudio/fish-speech-1.5:anna indicates the anna voice from the fishaudio/fish-speech-1.5 model.
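The model:voice convention can be captured in a small helper. This is only an illustrative sketch: preset_voice_id and SYSTEM_PRESET_VOICES are hypothetical names, and the set simply mirrors the eight preset voices listed above.

```python
# The eight system preset voice names listed in this section.
SYSTEM_PRESET_VOICES = {
    "alex", "benjamin", "charles", "david",  # male voices
    "anna", "bella", "claire", "diana",      # female voices
}


def preset_voice_id(model, voice):
    """Hypothetical helper: return the 'model:voice' string for a preset voice."""
    if voice not in SYSTEM_PRESET_VOICES:
        raise ValueError(f"unknown system preset voice: {voice!r}")
    return f"{model}:{voice}"


print(preset_voice_id("FunAudioLLM/CosyVoice2-0.5B", "alex"))
```

The resulting string is what goes into the voice parameter of a speech request.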
2.2 User Preset Voices
For the best voice quality, upload a voice sample that is 8 to 10 seconds long, with clear pronunciation and no background noise or interference.
2.2.1 Upload User Preset Voice via Base64 Encoding
import requests

url = "https://gateway.mytokengate.com/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key",  # Obtain from https://mytokengate.com/app/dashboard
    "Content-Type": "application/json"
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",  # Model name
    "customName": "your-voice-name",  # Custom audio name
    "audio": "data:audio/mpeg;base64,...",  # Base64-encoded reference audio
    "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins."  # Text content of reference audio
}
# json= serializes the body, so an explicit json.dumps call is not needed
response = requests.post(url, headers=headers, json=data)
# Print response status code and content
print(response.status_code)
print(response.json())  # Assumes the response body is JSON

The returned uri field in the response is the ID of the custom voice, which can be used as the voice parameter in subsequent requests.
{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}
For a request that uses a user preset voice, see Section 5.2.
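The audio field in the base64 upload takes a data URI of the form data:audio/mpeg;base64,.... A minimal sketch of producing that string from a local file follows. The helper name audio_to_data_uri is hypothetical, and audio/mpeg is the only MIME type shown in this document, so any other value passed as mime_type is an assumption.

```python
import base64
from pathlib import Path


def audio_to_data_uri(path, mime_type="audio/mpeg"):
    """Hypothetical helper: encode a local audio file as a base64 data URI."""
    # Read the raw bytes and base64-encode them into an ASCII string.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The return value can be assigned to the "audio" key of the upload request body in place of the elided placeholder.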
2.2.2 Upload User Preset Voice via File
import requests

url = "https://gateway.mytokengate.com/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key"  # Obtain from https://mytokengate.com/app/dashboard
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",  # Model name
    "customName": "your-voice-name",  # Custom audio name
    "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins."  # Text content of reference audio
}
# Open the reference audio in a with block so the file handle is closed after the upload
with open("/path/to/audio.mp3", "rb") as audio_file:
    files = {"file": audio_file}  # Reference audio file
    response = requests.post(url, headers=headers, files=files, data=data)
print(response.status_code)
print(response.json())  # Assumes the response body is JSON

The returned uri field in the response is the ID of the custom voice, which can be used as the voice parameter in subsequent requests.
{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}
For a request that uses a user preset voice, see Section 5.2.
2.3 Retrieve User Dynamic Voice List
import requests

url = "https://gateway.mytokengate.com/v1/audio/voice/list"
headers = {
    "Authorization": "Bearer your-api-key"  # Obtain from https://mytokengate.com/app/dashboard
}
response = requests.get(url, headers=headers)
print(response.status_code)
print(response.json())

The returned uri field in the response is the ID of the custom voice, which can be used as the voice parameter in subsequent requests.
{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}
2.4 Use User Dynamic Voices
For a request that uses user dynamic voices, see Section 5.3.
2.5 Delete User Dynamic Voice
import requests

url = "https://gateway.mytokengate.com/v1/audio/voice/deletions"
headers = {
    "Authorization": "Bearer your-api-key",
    "Content-Type": "application/json"
}
payload = {
    "uri": "speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd"
}
response = requests.post(url, json=payload, headers=headers)
print(response.status_code)
print(response.text)  # Print response content

The uri field in the request parameters is the ID of the custom voice.
3. Supported Model List
Models UTF-8 byte Online byte counter demo
3.1 fishaudio/fish-speech Series Models
- fish-speech-1.5: Supports Chinese, English, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, and Portuguese.
3.2 FunAudioLLM/CosyVoice2-0.5B Series Models
- Cross-language speech synthesis: Enables speech synthesis across different languages, including Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Zhengzhou dialect, Changsha dialect, Tianjin dialect).
- Emotion control: Supports generating speech with various emotional expressions, such as happiness, excitement, sadness, and anger.
- Fine-grained control: Allows fine-grained control of emotions and prosody in generated speech through rich text or natural language input.
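The examples in Section 5 pass a natural-language instruction by placing it before a <|endofprompt|> marker in the input text. The sketch below reproduces that pattern. The helper name with_instruction is hypothetical, and the exact format (instruction, one space, marker, then the text) is inferred from those examples rather than from a formal specification.

```python
END_OF_PROMPT = "<|endofprompt|>"


def with_instruction(instruction, text):
    """Hypothetical helper: prepend a natural-language instruction to the input text."""
    instruction = instruction.strip()
    if not instruction:
        # No instruction given: send the text unchanged.
        return text
    return f"{instruction} {END_OF_PROMPT}{text}"


print(with_instruction("Can you say this with happiness?", "Today is wonderful!"))
```

The result is passed as the input parameter of a CosyVoice2 request. Inline markers such as [laughter], which also appear in the Section 5 examples, go directly into the text part.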
4. Best Practices for Reference Audio
Providing high-quality reference audio samples can improve voice cloning results.
4.1 Audio Quality Guidelines
- Single speaker only
- Clear articulation, stable volume, pitch, and emotion
- Short pauses (recommended: 0.5 seconds)
- Ideal conditions: No background noise, professional recording quality, no room echo
- Recommended duration: 8–10 seconds
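The duration guidance can be checked locally before uploading. The following is a minimal sketch that handles WAV files only, using Python's standard wave module; check_reference_wav is a hypothetical name, and mp3 or opus files would need a third-party decoder that is not shown here. The 30-second limit comes from the note in Section 2.

```python
import wave

MAX_SECONDS = 30.0              # limit from the note in Section 2
RECOMMENDED_SECONDS = (8.0, 10.0)  # recommended range from this section


def check_reference_wav(path):
    """Hypothetical helper: return (duration_seconds, warnings) for a WAV file."""
    with wave.open(str(path), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()

    warnings = []
    low, high = RECOMMENDED_SECONDS
    if duration >= MAX_SECONDS:
        warnings.append("reference audio must be shorter than 30 seconds")
    elif not low <= duration <= high:
        warnings.append("duration is outside the recommended 8 to 10 seconds")
    return duration, warnings
```

An empty warnings list means the clip meets the duration guidance. Speaker count, noise, and echo cannot be judged from the file header and still need a listening check.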
4.2 File Format
- Supported formats: mp3, wav, pcm, opus
- Recommended: Use mp3 at 192 kbps or higher to avoid quality loss
- Uncompressed formats (e.g., WAV) offer limited additional benefits
5. Examples
5.1 Using System Preset Voices
from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"
client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://mytokengate.com/app/dashboard
    base_url="https://gateway.mytokengate.com/v1"
)
with client.audio.speech.with_streaming_response.create(
    model="FunAudioLLM/CosyVoice2-0.5B",  # Supported models: fishaudio / CosyVoice2-0.5B
    voice="FunAudioLLM/CosyVoice2-0.5B:alex",  # System preset voice
    input="Can you say this with happiness? <|endofprompt|>Today is wonderful, the holidays are coming! I'm so happy, Spring Festival is coming!",
    response_format="mp3"  # Supported formats: mp3, wav, pcm, opus
) as response:
    response.stream_to_file(speech_file_path)

5.2 Using User Preset Voices
from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"
client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://mytokengate.com/app/dashboard
    base_url="https://gateway.mytokengate.com/v1"
)
with client.audio.speech.with_streaming_response.create(
    model="FunAudioLLM/CosyVoice2-0.5B",  # Supported models: fishaudio / CosyVoice2-0.5B
    voice="speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd",  # URI of the uploaded custom voice
    input="Could you mimic a Cantonese accent? <|endofprompt|>Take care and rest early.",
    response_format="mp3"
) as response:
    response.stream_to_file(speech_file_path)

5.3 Using User Dynamic Voices
from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"
client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://mytokengate.com/app/dashboard
    base_url="https://gateway.mytokengate.com/v1"
)
with client.audio.speech.with_streaming_response.create(
    model="FunAudioLLM/CosyVoice2-0.5B",
    voice="",  # Leave empty to use dynamic voices
    input="[laughter] Sometimes, watching the innocent actions of children [laughter], we can't help but smile.",
    response_format="mp3",
    extra_body={"references": [
        {
            "audio": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/voice_template/fish_audio-Alex.mp3",  # Reference audio URL. Base64 format also supported
            "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins.",  # Text content of reference audio
        }
    ]}
) as response:
    response.stream_to_file(speech_file_path)