Yandex SpeechKit Brand Voice
The SpeechKit Brand Voice technology allows you to create unique voices for your speech synthesis model. It can synthesize both plain text and pattern-based text. Patterns contain phrases with variables that are replaced with prepared text. To cover different practical uses of trained models, Yandex Cloud offers three types of SpeechKit Brand Voice.
| | Brand Voice Premium | Brand Voice Self Service | Brand Voice Call Center |
|---|---|---|---|
| Voice | Voice based on artist recordings | Voice based on artist recordings | Voice copy from the pattern |
| Usage | Full-text synthesis. Pattern-based synthesis. | Full-text synthesis. Pattern-based synthesis. | Pattern-based synthesis. The variable part should not exceed 25% of the pattern. The same restriction applies to the duration of the variable part relative to the duration of the final audio. |
| Emotions and roles | Copying emotions in pattern-based synthesis. Developing additional roles. | Copying emotions in pattern-based synthesis. | Copying emotions in pattern-based synthesis. |
| Sampling frequency in source audio recordings | 48 kHz | 48 kHz | 8 kHz or higher |
| Sampling frequency in synthesized audio recordings | 22 kHz | 22 kHz | 8 kHz |
To create a unique voice for your business, fill out the form.
SpeechKit Brand Voice Premium
SpeechKit Brand Voice Premium is suitable for any business task:
- Voice assistants.
- Call center robot operators.
- Text-to-speech conversion of any text.
Creating a full-fledged model with a unique voice requires large amounts of audio recordings. Yandex Cloud experts will help you prepare the data for SpeechKit Brand Voice Premium model training, select a studio and an artist for you, and support you at each step of voice creation.
Once created, a SpeechKit Brand Voice Premium voice can be extended with additional roles.
SpeechKit Brand Voice Self Service
If you have pre-recorded audio for training the model, you can create a SpeechKit Brand Voice Self Service voice based on it. With such a voice, you can convert texts of any length into speech and synthesize speech using patterns. You can also use it to build voice assistants or robots for your call center.
You can add diverse emotions to your SpeechKit Brand Voice Self Service voice using pattern-based synthesis. In pattern-based synthesis, intonations are copied from your audio recordings.
Note
When using pattern-based synthesis with Yandex SpeechKit Brand Voice voices, make sure your patterns are recorded by the same artist who made recordings for your Yandex SpeechKit Brand Voice voice.
The quality of synthesized speech depends directly on the quality of audio recordings used to train the model. When creating a SpeechKit Brand Voice Self Service voice, you are in charge of the entire process of training data preparation.
For more information on how to train your own model, see Preparing and uploading data for Brand Voice Self Service.
SpeechKit Brand Voice Call Center
SpeechKit Brand Voice Call Center is purpose-designed for call center automation and other business scenarios involving phone calls:
- Telemarketing.
- Managing calls to level 1 technical support.
- Surveys.
- Call center automation.
With Brand Voice Call Center, you do not have to train a special model based on your artist's voice, as the voice is copied directly from the patterns you provide for phrase generation. Each phrase is synthesized as a whole rather than spliced together from a pre-recorded pattern and a generated variable part.
You can use SpeechKit Brand Voice Call Center to automate your standard dialogs.
For example, if you have an audio recording of the phrase "Hi Michael, I am calling from Thunderclouds. My name is Anastasia. Is it a good time to talk?", you can transform it into "Hi Ann, I am calling from New Doors. My name is Matt. Is it a good time to talk?" without having to record any additional phrases.
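The substitution in this example can be sketched in a few lines of Python. This is only an illustration of how a text pattern with variables produces different phrases. The `{name}` placeholder markup, the variable names, and the `fill_pattern` helper are assumptions made for this sketch and are not taken from the SpeechKit API. The actual markup rules are described in the requirements for synthesized texts.

```python
import re

# Hypothetical markup for this sketch: variables are written as {name}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_pattern(pattern: str, variables: dict[str, str]) -> str:
    """Replace every {name} placeholder with its value; fail on a missing name."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value for variable {name!r}")
        return variables[name]

    return PLACEHOLDER.sub(substitute, pattern)


PATTERN = (
    "Hi {client}, I am calling from {company}. "
    "My name is {agent}. Is it a good time to talk?"
)

# Text of the recorded pattern audio.
recorded = fill_pattern(
    PATTERN, {"client": "Michael", "company": "Thunderclouds", "agent": "Anastasia"}
)
# Text of the phrase to synthesize from the same pattern.
synthesized = fill_pattern(
    PATTERN, {"client": "Ann", "company": "New Doors", "agent": "Matt"}
)
```

Only the three variable values differ between `recorded` and `synthesized`. The fixed part of the phrase stays the same, which is why no additional recording is needed.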
SpeechKit Brand Voice Call Center requirements and restrictions
For speech synthesis, you need an audio file with your phrase pattern and a text with marked up variables. To learn more about text requirements, see Requirements for synthesized texts.
The sampling frequency of the synthesized audio is 8 kHz. This is enough for phone calls, but in other scenarios you may notice noise and synthesis artifacts.
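The comparison table above lists 8 kHz or higher as the sampling frequency for Brand Voice Call Center source recordings. A quick local check before uploading a pattern might look like the sketch below. It assumes the pattern is stored as a WAV file and uses only the Python standard library. The `source_rate_ok` helper is a hypothetical name, and the formats SpeechKit actually accepts are not covered here.

```python
import wave

# Minimum source sampling frequency for Call Center patterns, per the table.
MIN_SOURCE_RATE_HZ = 8000


def source_rate_ok(wav_file) -> bool:
    """Return True if the WAV file (path or binary file object) is sampled at 8 kHz or higher."""
    with wave.open(wav_file, "rb") as recording:
        return recording.getframerate() >= MIN_SOURCE_RATE_HZ
```

For example, `source_rate_ok("pattern.wav")` reads the header of a local file named `pattern.wav` (a placeholder name) and reports whether its sampling frequency meets the minimum.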
SpeechKit Brand Voice Call Center is designed for phone calls, so texts for synthesis must be short. A synthesized phrase cannot be longer than 24 seconds, and its text, including the variable part, cannot exceed 250 characters. The variable part of the normalized text cannot be longer than 25% of the phrase. The same restriction applies to the duration of the variable part relative to the duration of the final audio.
SpeechKit Brand Voice Call Center logs your transmitted patterns (both text and audio). However, the synthesized audio recordings and variable parts, including your sensitive data, are not logged. To improve the model's performance with your data, you can enable variable logging through the x-data-logging-enabled: true header.
Note
Data logging may prove useful if synthesis errors occur. If you do not want to log all data, include the logging header only in requests that have issues, and first remove as much personal data as possible from the variable part.
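One way to follow this advice is to add the header per request rather than globally. The sketch below assumes the client accepts request metadata as a list of key-value pairs, which is a common shape for gRPC clients. The `request_metadata` helper and the placeholder authorization value are hypothetical. Only the header name and value come from this page.

```python
# Header name and value as given in the documentation above.
LOGGING_HEADER = ("x-data-logging-enabled", "true")


def request_metadata(
    base: list[tuple[str, str]], log_variables: bool = False
) -> list[tuple[str, str]]:
    """Return a copy of the base metadata, adding the logging header only on request."""
    metadata = list(base)
    if log_variables:
        metadata.append(LOGGING_HEADER)
    return metadata


# Placeholder credentials; substitute whatever your client normally sends.
base = [("authorization", "<your credentials>")]

normal_call = request_metadata(base)                          # variables are not logged
problem_call = request_metadata(base, log_variables=True)     # opt in for one request
```

Keeping logging off by default and opting in for individual requests means variable parts are logged only for the calls you are investigating.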