Speech synthesis
Speech synthesis in Yandex SpeechKit allows you to convert any text to speech in multiple languages.
SpeechKit voice models use deep neural network technology. When synthesizing speech, the model pays attention to many details in the original voice. The model evaluates the entire text, not individual sentences, before starting the synthesis. This enables the synthesized voice to sound clear and natural, without electronic distortion, and reproduce appropriate intonations of a real person's speech.
The service is available at tts.api.cloud.yandex.net:443.
To try out the Text-to-Speech and Speech-to-Text product demos, visit the SpeechKit page on our website.
Synthesis options
You can work with SpeechKit via the API or Playground. For more information about working with the Yandex Cloud API, see API concepts.
SpeechKit synthesis has two APIs: API v1 (REST) and API v3 (gRPC). The SpeechKit Python SDK is also implemented based on the API v3.
| | API v1 | API v3 |
|---|---|---|
| Specification | REST | gRPC, REST |
| Selecting voice | voice parameter | hints: voice parameter |
| Selecting language | Depends on the voice, lang parameter | Depends on the voice, not specified explicitly in the request |
| Specifying role | Depends on the voice, emotion parameter | Depends on the voice, hints: role parameter |
| Voice tone management | No | hints: pitchShift parameter |
| Controlling pronunciation | SSML, TTS | TTS |
| Pronunciation speed | speed parameter | hints: speed parameter |
| Adjusting volume | No | loudness_normalization_type parameter |
| Output audio format | format parameter | output_audio_spec parameter |
| Specifying LPCM parameters | sampleRateHertz parameter | output_audio_spec: raw_audio parameter |
| Audio pattern-based synthesis | No | text_template parameter |
| Pricing method | Total number of characters in the requests | Per request |
| Automatic splitting of long phrases | Not required | unsafe_mode parameter |
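For example, here is a minimal sketch of an API v1 synthesis request in Python using the requests library. The endpoint and parameter names are the documented API v1 ones; the IAM token, folder ID, voice, and output file name are placeholders to replace with your own values.

```python
import requests

IAM_TOKEN = "<IAM token>"   # placeholder
FOLDER_ID = "<folder ID>"   # placeholder

response = requests.post(
    "https://tts.api.cloud.yandex.net/speech/v1/tts:synthesize",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    data={
        "text": "Hello! This phrase is synthesized by SpeechKit.",
        "lang": "en-US",      # language of the text
        "voice": "john",      # example voice, see the list of voices
        "folderId": FOLDER_ID,
        "format": "oggopus",  # output audio format
    },
)
response.raise_for_status()

# The response body is the synthesized audio.
with open("speech.ogg", "wb") as f:
    f.write(response.content)
```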
Note
Multiple responses with audio fragments can be returned to a single request to the SpeechKit API v3. A complete response is a result of merging all the fragments received.
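Below is a minimal sketch of such a request in Python with gRPC. It assumes stubs generated from the API v3 protobuf specification (the module paths yandex.cloud.ai.tts.v3.tts_pb2 and tts_service_pb2_grpc, as well as the message and field names, come from those generated files and may differ in your setup); the API key and voice are placeholders.

```python
import io

import grpc
from yandex.cloud.ai.tts.v3 import tts_pb2, tts_service_pb2_grpc

API_KEY = "<API key>"  # placeholder

request = tts_pb2.UtteranceSynthesisRequest(
    text="Hello! This phrase is synthesized by SpeechKit.",
    hints=[tts_pb2.Hints(voice="john")],  # example voice
    output_audio_spec=tts_pb2.AudioFormatOptions(
        container_audio=tts_pb2.ContainerAudio(
            container_audio_type=tts_pb2.ContainerAudio.WAV
        )
    ),
)

channel = grpc.secure_channel(
    "tts.api.cloud.yandex.net:443", grpc.ssl_channel_credentials()
)
stub = tts_service_pb2_grpc.SynthesizerStub(channel)

# The server streams several responses for one request; the complete
# audio is the concatenation of all received fragments.
audio = io.BytesIO()
responses = stub.UtteranceSynthesis(
    request, metadata=(("authorization", f"Api-Key {API_KEY}"),)
)
for response in responses:
    audio.write(response.audio_chunk.data)

with open("speech.wav", "wb") as f:
    f.write(audio.getvalue())
```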
Languages and voices
You can select a voice to convert your text to speech. Each voice corresponds to a model trained on the speaker's speech pattern. Voices differ by tone, gender, and language. For a list of voices and their characteristics, see List of voices.
If no voice suits your business, SpeechKit can create a unique one specifically for you. For more information, see Yandex SpeechKit Brand Voice.
SpeechKit can synthesize speech in different languages. Each voice is designed to synthesize speech in a specific language. A voice can also read text in another language, but the quality of the synthesized speech will be lower: the speaker will pronounce the text with an accent, and some words may be synthesized incorrectly.
Role
The synthesized speech will sound different depending on the selected role. A role is a manner of pronunciation for the same speaker. Different voices support different sets of roles. Attempting to use a role the selected voice does not support causes a service error.
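In API v1, a role is selected with the emotion parameter; in API v3, with hints: role. As a hedged illustration, the request parameters from the earlier API v1 sketch could be extended like this (the voice and role values are examples only):

```python
# Added to the "data" dictionary of the API v1 sketch above.
# The "emotion" value must be a role the selected voice supports,
# otherwise the service returns an error.
data["voice"] = "jane"     # example voice that has several roles
data["lang"] = "ru-RU"     # jane is a Russian-language voice
data["emotion"] = "good"   # example role
```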
Voice tone
Each SpeechKit voice has a certain pitch. In API v3, you can change the voice tone by specifying a shift from its base tone in the hints: pitchShift parameter. The shift is set in Hz and can take values in the [-1000; 1000] range; the default value is 0. Positive values make the voice higher, negative values make it lower.
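As a hedged sketch, a pitch shift in an API v3 request could look like this, reusing the generated stubs from the gRPC example above (pitch_shift is how the hints: pitchShift parameter is assumed to appear in the generated code; the value is only an example):

```python
# Raise the voice above its base tone by 150 Hz (allowed range: -1000 to 1000).
request = tts_pb2.UtteranceSynthesisRequest(
    text="Hello! This phrase is synthesized with a higher tone.",
    hints=[
        tts_pb2.Hints(voice="john"),
        tts_pb2.Hints(pitch_shift=150.0),
    ],
)
```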
Controlling pronunciation
To manage pronunciation in the synthesized speech, mark up the source text. SpeechKit can synthesize speech from text marked up using Speech Synthesis Markup Language (SSML) or SpeechKit's own TTS markup:
- SSML is supported only in API v1 requests. To transmit text in SSML format, include the ssml parameter in the request body and wrap the text in the <speak> tag. For more information about SSML tags, see SSML markup. A sketch of an SSML request is shown after the warning below.
- TTS markup is supported in API v1 and API v3. In API v1 requests, pass the text marked up according to TTS rules in the text parameter of the request body. API v3 and the Python SDK require no special parameters and treat any transmitted text as marked up according to TTS rules. For more information, see TTS markup.
Warning
When using pattern-based synthesis, the markup outside the variable part is ignored.
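For example, here is a hedged sketch of an API v1 request with SSML markup, reusing the placeholders from the earlier sketch: the text is wrapped in the <speak> tag and passed in the ssml parameter instead of text (the pause tag and its length are only an illustration):

```python
ssml_text = (
    "<speak>"
    "Hello! <break time='500ms'/> This phrase is synthesized from SSML markup."
    "</speak>"
)

response = requests.post(
    "https://tts.api.cloud.yandex.net/speech/v1/tts:synthesize",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    data={
        "ssml": ssml_text,   # used instead of the "text" parameter
        "lang": "en-US",
        "voice": "john",
        "folderId": FOLDER_ID,
    },
)
```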
Synthesis settings
You can configure both pronunciation and technical characteristics of the synthesized speech.
Synthesized speech speed
The speed of synthesized speech affects how the information is perceived. Speech that is too fast or too slow sounds unnatural; however, faster speech can be useful in commercials, where every second of airtime counts.
By default, the speed of generated speech corresponds to the average speed of human speech.
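In API v1, the speed is set with the speed parameter; in API v3, with hints: speed. As a hedged illustration against the API v1 sketch above (assuming 1.0 is the neutral value corresponding to average human speech):

```python
# Added to the "data" dictionary of the API v1 sketch above.
data["speed"] = "1.3"   # speak noticeably faster than the neutral 1.0
```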
Volume normalization
In API v3 and Python SDK requests, you can set the type and level of volume normalization. This is useful if you use SpeechKit synthesis alongside other sound sources, for example, to keep the voice assistant's volume consistent with that of phone notifications.
SpeechKit supports two normalization types:
- Peak normalization (MAX_PEAK), which raises the audio signal level to the maximum possible digital value without distortion.
- LUFS normalization, a weighted normalization based on the EBU R 128 standard, in which volume is normalized relative to the full digital scale.
You can set the normalization type in the loudness_normalization_type parameter. By default, SpeechKit uses LUFS.
The normalization level is set in the hints: volume parameter. Possible values depend on the normalization type:
- For MAX_PEAK, the parameter takes values in the (0; 1] range; the default value is 0.7.
- For LUFS, the parameter takes values in the [-149; 0) range; the default value is -19.
If the normalization level value falls outside the range supported by the selected normalization type, the SpeechKit server returns an InvalidArgument error.
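As a hedged sketch, LUFS normalization with an explicit level could be set in an API v3 request like this, again reusing the generated stubs from the gRPC example above (the enum and field names are assumed from the generated code; the level is only an example):

```python
# Normalize the output to -15 LUFS (allowed LUFS range: [-149; 0)).
request = tts_pb2.UtteranceSynthesisRequest(
    text="Hello! This phrase is synthesized with volume normalization.",
    hints=[
        tts_pb2.Hints(voice="john"),
        tts_pb2.Hints(volume=-15.0),
    ],
    loudness_normalization_type=tts_pb2.UtteranceSynthesisRequest.LUFS,
)
```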
Synthesized audio file format
You can select the audio file format that will be used by SpeechKit to synthesize speech.
For a full list of available formats and their characteristics, see Supported audio formats.
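For example, in API v1 the format is selected with the format parameter and, for LPCM, the sampling rate with sampleRateHertz; in API v3 the same is done through output_audio_spec. A hedged sketch for API v1 requesting raw LPCM at 48 kHz, added to the request parameters from the earlier sketch:

```python
# Added to the "data" dictionary of the API v1 sketch above.
data["format"] = "lpcm"            # raw LPCM instead of a container format
data["sampleRateHertz"] = "48000"  # sampling rate used for LPCM output
```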
See also
- Try speech synthesis using our demo on the service page.
- Review the parameters of the API v1 method and API v3 requests.
- See examples of using the SpeechKit API: