Yandex SpeechKit Brand Voice
The SpeechKit Brand Voice technology allows you to create unique voices for your speech synthesis model. It can synthesize both plain text and pattern-based text. Patterns contain phrases with variables that are replaced with prepared text. To cover different practical uses of trained models, Yandex Cloud offers three branches of SpeechKit Brand Voice.
| | SpeechKit Brand Voice Premium | SpeechKit Brand Voice Lite | SpeechKit Brand Voice Call Center |
|---|---|---|---|
| Voice | Voice based on artist recordings | Voice based on artist recordings | Voice copy from the pattern |
| Usage | Full-text synthesis. Pattern-based synthesis. | Full-text synthesis. Pattern-based synthesis is not supported. | Pattern-based synthesis. The variable part must not exceed 25% of the pattern. The same restriction applies to the duration of the variable part relative to the duration of the final audio. |
| Emotions and roles | Copying emotions in pattern-based synthesis. Developing additional roles. | Copying emotions in pattern-based synthesis. Developing additional roles. | Copying emotions in pattern-based synthesis. |
| Sampling frequency in source audio recordings | 48 kHz | 48 kHz | 8 kHz or higher |
| Sampling frequency in synthesized audio recordings | 22 kHz | 22 kHz | 8 kHz |
To create a unique Brand Voice Premium voice for your business, fill out this form.
SpeechKit Brand Voice Premium
SpeechKit Brand Voice Premium is suitable for any business task:
- Voice assistants.
- Call center robot operators.
- Text-to-speech conversion of any text.
Creating a full-fledged model with a unique voice requires large amounts of audio recordings. Yandex Cloud experts will help you prepare the data for SpeechKit Brand Voice Premium model training, select a studio and an artist for you, and support you at each step of voice creation.
Once created, a SpeechKit Brand Voice Premium voice can be enhanced with additional roles.
SpeechKit Brand Voice Lite
Create your unique voice with SpeechKit Brand Voice Lite by uploading a small set of marked-up audio samples (at least 30 minutes). As a result, you get the URI of a fine-tuned model that you can access from your applications via the API.
The quality of synthesized speech depends directly on the quality of audio recordings used to train the model. When creating a SpeechKit Brand Voice Lite voice, you are in charge of the entire process of training data preparation.
For more information on how to train your own model, see Data for SpeechKit Brand Voice Lite.
SpeechKit Brand Voice Call Center
SpeechKit Brand Voice Call Center is purpose-designed for call center automation and other business scenarios involving phone calls:
- Telemarketing.
- Managing calls to level 1 technical support.
- Surveys.
- Call center automation.
With SpeechKit Brand Voice Call Center, you do not have to train a special model based on your artist's voice, as the voice is copied directly from the patterns you provide for phrase generation. The speech is synthesized as a whole rather than spliced together from a pre-recorded pattern and a generated variable part.
You can use SpeechKit Brand Voice Call Center to automate your standard dialogs.
For example, if you have an audio recording of the phrase "Hi Michael, I am calling from Thunderclouds. My name is Anastasia. Is it a good time to talk?", you can transform it into "Hi Ann, I am calling from New Doors. My name is Matt. Is it a good time to talk?" without having to record any additional phrases.
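On the text side, the substitution in this example can be pictured as a template with named variables. The sketch below is only an illustration: the `{placeholder}` syntax and the `pattern_variables` and `render_pattern` helpers are assumptions made for this example, not the SpeechKit markup format or API. The service itself resynthesizes the audio rather than replacing text.

```python
import string
from typing import Dict, List

# Hypothetical pattern text. The {placeholder} syntax is an assumption
# made for this sketch, not the SpeechKit variable markup.
PATTERN = (
    "Hi {client}, I am calling from {company}. "
    "My name is {operator}. Is it a good time to talk?"
)


def pattern_variables(pattern: str) -> List[str]:
    """Return the placeholder names found in the pattern, in order."""
    return [name for _, name, _, _ in string.Formatter().parse(pattern) if name]


def render_pattern(pattern: str, values: Dict[str, str]) -> str:
    """Fill every placeholder; raise KeyError if a value is missing."""
    missing = [name for name in pattern_variables(pattern) if name not in values]
    if missing:
        raise KeyError(f"missing values for: {missing}")
    return pattern.format(**values)


text = render_pattern(
    PATTERN,
    {"client": "Ann", "company": "New Doors", "operator": "Matt"},
)
print(text)
```

Keeping the fixed part and the variables separate like this also makes it easy to check how much of the phrase is variable before sending it for synthesis.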
Requirements and restrictions of SpeechKit Brand Voice Call Center
For speech synthesis, you need an audio file with your phrase pattern and a text with marked-up variables. To learn more about text requirements, see Requirements for synthesized texts.
The sampling frequency of the synthesized audio recording is 8 kHz. This is enough for phone calls. In other scenarios, however, you might hear noise and synthesis artifacts.
SpeechKit Brand Voice Call Center is designed for phone calls, so the texts for synthesis must be short. The duration of a synthesized phrase cannot exceed 24 seconds, and its length, including the variable part, cannot exceed 250 characters. The variable part of a normalized text in a phrase cannot be longer than 25% of the phrase. The same restriction applies to the duration of the variable part relative to the duration of the final audio.
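These limits can be checked on the client side before a request is sent. The following is a minimal sketch, assuming the limits quoted above; the `check_phrase` helper is hypothetical and counts characters of the text as given, whereas the service applies the 25% limit to the normalized text, so the result is only an approximation.

```python
from typing import List, Optional

MAX_PHRASE_CHARS = 250      # phrase length limit, including the variable part
MAX_PHRASE_SECONDS = 24.0   # duration limit of a synthesized phrase
MAX_VARIABLE_SHARE = 0.25   # variable part relative to the whole phrase


def check_phrase(
    phrase: str,
    variable_parts: List[str],
    duration_s: Optional[float] = None,
) -> List[str]:
    """Return descriptions of violated limits (empty list if none).

    Hypothetical helper: it measures raw characters, not normalized text.
    """
    problems = []
    if len(phrase) > MAX_PHRASE_CHARS:
        problems.append(
            f"phrase has {len(phrase)} characters, limit is {MAX_PHRASE_CHARS}"
        )
    variable_chars = sum(len(part) for part in variable_parts)
    if phrase and variable_chars / len(phrase) > MAX_VARIABLE_SHARE:
        problems.append(
            f"variable part is {variable_chars / len(phrase):.0%} of the phrase, "
            f"limit is {MAX_VARIABLE_SHARE:.0%}"
        )
    if duration_s is not None and duration_s > MAX_PHRASE_SECONDS:
        problems.append(
            f"duration is {duration_s} s, limit is {MAX_PHRASE_SECONDS} s"
        )
    return problems


phrase = (
    "Hi Ann, I am calling from New Doors. "
    "My name is Matt. Is it a good time to talk?"
)
print(check_phrase(phrase, ["Ann", "New Doors", "Matt"]))
```

The duration check is optional because the final audio length is only known after synthesis or from an estimate.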
SpeechKit Brand Voice Call Center logs your transmitted patterns (both text and audio). However, the synthesized audio recordings and variable parts, including your sensitive data, are not logged. To improve the model's performance with your data, you can enable variable logging through the x-data-logging-enabled: true header.
Note
Data logging may prove useful if synthesis errors occur. If you do not want to log all data, include the logging header only in the requests that had issues, after removing as much personal data from the variable part as possible.
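One way to apply this advice is to add the header only when a request is being retried for diagnostics, and to scrub the variable values first. The sketch below assumes the request metadata is a plain dictionary; the `request_headers` and `mask_digits` helpers and the `authorization` placeholder are hypothetical, and only the `x-data-logging-enabled` header name comes from the text above. Masking digits is a crude stand-in for real personal data removal.

```python
import re
from typing import Dict

LOGGING_HEADER = "x-data-logging-enabled"  # header name taken from the text


def request_headers(base: Dict[str, str], log_variables: bool = False) -> Dict[str, str]:
    """Return a copy of the base headers, opting in to variable logging on demand."""
    headers = dict(base)
    if log_variables:
        headers[LOGGING_HEADER] = "true"
    return headers


def mask_digits(value: str) -> str:
    """Replace each digit with '#', e.g. to hide phone or order numbers."""
    return re.sub(r"\d", "#", value)


# Placeholder credentials, not a real authorization scheme.
base = {"authorization": "<credentials>"}

normal = request_headers(base)                       # variables are not logged
debug = request_headers(base, log_variables=True)    # request that had issues
safe_variable = mask_digits("order 12345")

print(debug)
print(safe_variable)
```

Building the headers per request, rather than setting the logging header globally, keeps variable logging limited to the requests you choose to share.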