Pattern-based speech synthesis
With pattern-based synthesis, you can synthesize speech from prepared phrases, called patterns, in which key parts, called variables, can be changed. The new phrase is synthesized as a whole rather than assembled from separate pre-recorded and synthesized fragments. Intonation is copied from the pattern, so the result sounds natural, like real human speech.
Pattern-based synthesis is only available for SpeechKit Brand Voice voices.
Warning
Access to the SpeechKit Brand Voice service is available on request. To get access to the technology, fill out this form.
Currently, this feature works only for Russian.
Source data for pattern-based synthesis
Each request sent for pattern-based synthesis must include:
- An audio recording of the pattern phrase.
- A text template with the variable parts marked up.
- The values of the variables for synthesizing a new phrase.
- The timestamps of the beginning and end of each variable part.
Warning
The normalized text of the variable part of the phrase cannot be longer than 25% of the phrase length. The same restriction applies to the duration of the variable part relative to the duration of the final audio.
Let's assume you need to synthesize this phrase:
Let's check your booking. Your flight is at eleven twenty a.m., September eight. You depart from Heathrow and arrive at Barajas. Is everything correct?
Your pattern audio recording contains the following phrase:
Let's check your booking. Your flight is at nine p.m., March twenty-three. You depart from Heathrow and arrive at Barajas. Is everything correct?
The text pattern should look like this:
Let's check your booking. Your flight is at {time}, {date}. You depart from Heathrow and arrive at Barajas. Is everything correct?
List of variables:
variable_name = '{date}', variable_value = 'september eight'
variable_name = '{time}', variable_value = 'eleven twenty a.m.'
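As a rough illustration of how the template and variables from this example fit together, the following Python sketch fills the placeholders and estimates how much of the final phrase comes from variable values. It is not SpeechKit API code: the function names are made up for this example, and it assumes that the 25% limit is measured in characters and applies to all variable values combined, which the text above does not specify. Timestamps and the audio recording are left out.

```python
import re

# Matches placeholders such as {time} or {date}.
PLACEHOLDER = re.compile(r"\{\w+\}")

# Limit from the warning above; measuring it in characters is an assumption.
MAX_VARIABLE_SHARE = 0.25


def fill_template(template: str, variables: dict[str, str]) -> str:
    """Substitute every placeholder with its value in a single pass."""
    return PLACEHOLDER.sub(lambda m: variables[m.group(0)], template)


def variable_share(template: str, variables: dict[str, str]) -> float:
    """Fraction of the final phrase's characters that come from variables."""
    phrase = fill_template(template, variables)
    variable_chars = sum(
        len(variables[name]) for name in PLACEHOLDER.findall(template)
    )
    return variable_chars / len(phrase)


template = (
    "Let's check your booking. Your flight is at {time}, {date}. "
    "You depart from Heathrow and arrive at Barajas. Is everything correct?"
)
variables = {"{time}": "eleven twenty a.m.", "{date}": "september eight"}

phrase = fill_template(template, variables)
share = variable_share(template, variables)
print(phrase)
print(f"variable share: {share:.1%}, within limit: {share <= MAX_VARIABLE_SHARE}")
```

Running a check like this before sending a request can catch values that are too long for the chosen pattern.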
Learn more about examples of pattern-based synthesis.
Requirements for pattern recordings
Requirement | Value |
---|---|
Sampling frequency | 48 kHz for SpeechKit Brand Voice Self Service and SpeechKit Brand Voice Premium; 8 kHz or higher for SpeechKit Brand Voice Call Center |
Audio bit depth | 16 bit PCM |
Number of channels | 1 (mono) |
Format | WAV |
Minimum pattern audio recording duration | 1 second |
Pattern audio recordings should not include extraneous noise or echo. Minimal post-processing of audio recordings is acceptable. Silence at the beginning and at the end of a recording must not be longer than one second.
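The format requirements in the table can be checked locally before uploading a recording. The sketch below uses Python's standard `wave` module to read the file header. The function name and its parameters are hypothetical. It does not check noise, echo, or silence at the beginning and end of the recording.

```python
import wave


def check_pattern_wav(
    path: str,
    required_rate_hz: int = 48_000,
    allow_higher: bool = False,
) -> list[str]:
    """Return a list of violated requirements (empty if none were found).

    Defaults correspond to Self Service and Premium (48 kHz). For Call Center,
    pass required_rate_hz=8_000 and allow_higher=True.
    """
    problems = []
    # wave.open reads uncompressed PCM WAV files and raises wave.Error otherwise.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        duration_s = wav.getnframes() / rate

    if channels != 1:
        problems.append(f"expected 1 channel (mono), got {channels}")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, got {sample_width * 8}-bit")
    rate_ok = rate >= required_rate_hz if allow_higher else rate == required_rate_hz
    if not rate_ok:
        problems.append(f"unexpected sampling frequency: {rate} Hz")
    if duration_s < 1.0:
        problems.append(f"recording is shorter than 1 second: {duration_s:.2f} s")
    return problems


# Usage (the file name is a placeholder):
# for problem in check_pattern_wav("pattern.wav"):
#     print(problem)
```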
Recommendations for audio recording
Noise and echo during recording directly affect the quality of training and performance of your speech synthesis model. For this reason, when recording audio for training and fine-tuning your model or patterns for synthesis, try to reduce sound reflection in the room. A perfect place for recording is a room equipped with acoustic panels. If recording in a household environment, you can reduce the echo with soft surfaces and furniture, such as carpet, upholstered sofa, etc.
Use the same equipment, with the same settings and in the same position, for every audio recording.
Make sure the sound engineer, speaker, and voice coach are different people. A voice coach is a professional who makes sure the speaker strictly follows the text without changing the manner of narration.
When recording any audio intended for your speech synthesis model, follow the one-phrase-per-file rule.
Inaccuracies, micro-repetitions, slips of the tongue, and word substitutions are not allowed. Any speech defects in the recordings made for synthesis models substantially degrade the quality of the synthesized speech.
Each recording of a pattern or phrase for model training must have an absolutely accurate text transcript. Patterns, training phrases, and their transcripts must not contain grammatical errors.
Requirements for synthesized texts
The pattern must match the desired script.
The pattern must contain one or more variables to replace. The length of the variable portion must match the sample in the pattern. If the audio length of variable parts may vary substantially, we recommend recording several patterns with examples of different lengths. The normalized text of the variable part of the phrase cannot be longer than 25% of the pattern length. The same restriction applies to the duration of the variable part relative to the duration of the final audio.
For example, consider this pattern:
Hi, my name is Iris, I work at {company}.
If the length of the {company} variable may vary substantially (e.g., "Thunderclouds", "Reliable Bank", "a regional office of a well-known international company"), you will need to record two patterns: one for short values and one for long values. Use a suitable pattern depending on the placeholder value length. This way, you can keep your synthesized speech natural.
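One way to choose between a short and a long pattern is to compare the new value with the value spoken in each recording. The sketch below assumes character count is a good enough proxy for spoken length. The `Pattern` class, the file names, and the choice of recorded sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    audio_path: str    # hypothetical file name of the recording
    template: str      # text template with the {company} placeholder
    sample_value: str  # value actually spoken in the recording


def pick_pattern(patterns: list[Pattern], value: str) -> Pattern:
    """Pick the pattern whose recorded sample is closest in length to value."""
    return min(patterns, key=lambda p: abs(len(p.sample_value) - len(value)))


TEMPLATE = "Hi, my name is Iris, I work at {company}."

patterns = [
    Pattern("company_short.wav", TEMPLATE, "Reliable Bank"),
    Pattern(
        "company_long.wav",
        TEMPLATE,
        "a regional office of a well-known international company",
    ),
]

print(pick_pattern(patterns, "Thunderclouds").audio_path)
```

With more than two recordings, the same function picks the closest match without any changes.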
Divide long utterances into independent sentences and different patterns. The synthesized phrase cannot be longer than 24 seconds or 250 characters.
- When recording Russian speech, make sure to use the letter ё when needed in the pattern text. You cannot use е instead of ё in any synthesized text.
- In homographs where the stress may be ambiguous, explicitly indicate the stressed vowel with the + sign:
  - The +import of goods from foreign countries. Stress falls on the first syllable.
  - Goods they im+port from China. Stress falls on the second syllable.
- The lines of text for synthesis cannot include numbers or abbreviations, such as 15 Smith st., bl. 3 or USD 13.10. Make sure to spell out all numbers and abbreviations:
  - Fifteen Smith street, block three.
  - Thirteen dollars and ten cents.
- In questions, specify which word should be **logically stressed** to indicate the intonation of an interrogative sentence.
  The sentence "Did the cat go to the forest?" can be read in three different ways:
  - Did the **cat** go to the forest? Meaning: Who went to the forest? Was it really the cat?
  - Did the cat **go** to the forest? Meaning: Did the cat walk or run? Or: Was the action performed? Is the cat gone or not?
  - Did the cat go **to the forest**? Meaning: Where did the cat go and why? To the forest, outside, to look for a sausage?

  In all sentences, the logical stress emphasizes the main meaning of the sentence.
Warning
The logical stress should not fall on the variable part of the phrase.