Pattern-based speech synthesis using the API v3

Updated on November 26, 2024


Pattern-based synthesis in Brand Voice Premium and Brand Voice Self Service is only available for SpeechKit Brand Voice voices.

Warning

Access to Brand Voice Premium and Brand Voice Self Service services is available on request. To access the technology, fill out this form.

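To synthesize a phrase based on a template, complete the setup, prepare an audio template, and send the data to the API v3, as described in the sections below.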

Getting started

Clone the Yandex Cloud API repository:

```
git clone https://github.com/yandex-cloud/cloudapi
```

Create a service account to work with the SpeechKit API.
Assign the ai.speechkit-tts.user role or higher to the service account. This allows it to work with SpeechKit in the folder where it was created.
Get an IAM token for the service account.

Prepare an audio template

A template consists of a pattern phrase audio recording and its text with markup.

Template parameter restrictions:

- A phrase for synthesis, including the variable part, cannot be longer than 24 seconds (the API limit). For comfortable listening, keep the phrase under 16 seconds, as in regular conversation.
- The template cannot exceed 250 characters of normalized text.
- The normalized text of the variable part must not exceed 25% of the template length. The same limit applies to the duration of the variable part relative to the duration of the final audio.
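
The restrictions above can be checked before a template is sent. The following is a minimal sketch, not part of SpeechKit: the function name `check_template` and its parameters are hypothetical, and it assumes that the plain character count of a string is a close enough stand-in for the length of its normalized text.

```python
# Limits taken from the restrictions listed above.
MAX_PHRASE_SECONDS = 24.0   # API limit for the whole phrase
MAX_TEMPLATE_CHARS = 250    # limit on the template text
MAX_VARIABLE_SHARE = 0.25   # variable part relative to the whole


def check_template(template_text, variable_text, phrase_seconds, variable_seconds):
    """Return a list of violated restrictions (empty if none are violated).

    Assumption: len() of the raw strings approximates the length of the
    normalized text, which the service computes on its own side.
    """
    problems = []
    if phrase_seconds > MAX_PHRASE_SECONDS:
        problems.append("phrase is longer than 24 seconds")
    if len(template_text) > MAX_TEMPLATE_CHARS:
        problems.append("template is longer than 250 characters")
    if len(variable_text) > MAX_VARIABLE_SHARE * len(template_text):
        problems.append("variable text exceeds 25% of the template length")
    if variable_seconds > MAX_VARIABLE_SHARE * phrase_seconds:
        problems.append("variable audio exceeds 25% of the phrase duration")
    return problems
```

An empty list means that none of the four checks failed. It does not guarantee that the API will accept the template, because normalization may change the text length.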

Warning

Submit templates with variable parts to the API for synthesis. There is no need to submit fixed phrases.

Examples of fixed phrases:

Hello, I am calling from the MedCity clinic.

Hello, this is 'Junk It Out,' a construction waste management company.

To prepare a template:

Make an audio recording of your pattern phrase.
Mark up the text of the pattern phrase according to the requirements and enclose the variable parts in {} (curly brackets).

Marked-up text example: This is to remind you that a waste collection truck will arrive at your place on {date} at {time}.
List of variables: variable_name = '{date}' with variable_value = 'tomorrow'; variable_name = '{time}' with variable_value = 'fourteen hours'.
For each variable part, set its start time and duration in the audio recording (in milliseconds).
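
The markup steps above can be sanity-checked locally before any request is built. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that every variable is written as a name in curly brackets with no nesting. It lists the placeholders in a marked-up text, previews the phrase with values substituted, and reports variables that have no start time and duration yet.

```python
import re

# Matches one placeholder such as "{date}", curly brackets included.
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def find_placeholders(template):
    """Return the placeholders in the order they appear in the template."""
    return PLACEHOLDER.findall(template)


def render(template, values):
    """Preview the phrase with every placeholder replaced by its value.

    Raises KeyError if a placeholder has no entry in `values`.
    """
    return PLACEHOLDER.sub(lambda match: values[match.group(0)], template)


def missing_timings(template, timings):
    """Return placeholders that have no (start_ms, length_ms) entry yet."""
    return [name for name in find_placeholders(template) if name not in timings]
```

For the marked-up text from the example above, `find_placeholders` returns `['{date}', '{time}']`, so two start times and two durations are needed for the audio recording.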

Send data to the API v3

Create and run a client app to send your data to the API:

Python 3

Clone the Yandex Cloud API repository:

```
git clone https://github.com/yandex-cloud/cloudapi
```

Install dependencies using the pip package manager:
```
pip install grpcio-tools && \
pip install pydub
```
You need the grpcio-tools package to generate the interface code for the API v3 synthesis client.

You need the pydub package to process the resulting audio files.
Download the FFmpeg utility for correct operation of the pydub package.
Add the path to the folder with the FFmpeg executable to the PATH variable:
```
export PATH=$PATH:<path_to_folder_with_FFmpeg_executable>
```
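
To confirm that the PATH change took effect, you can look up the executable from Python. This is a small sketch that uses only the standard library; the function name `find_tool` is hypothetical, and it assumes the executable is named `ffmpeg`.

```python
import shutil


def find_tool(name="ffmpeg"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if find_tool() is None:
    print("ffmpeg was not found on PATH; check the export command above.")
```

Run the check in the same shell session where you exported PATH, since the change does not carry over to other sessions.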

Go to the folder hosting the cloned Yandex Cloud API repository, create a folder named output, and generate the client interface code in it:

```
cd <path_to_cloudapi_folder> && \
mkdir output && \
python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
  --python_out=output \
  --grpc_python_out=output \
  google/api/http.proto \
  google/api/annotations.proto \
  yandex/cloud/api/operation.proto \
  google/rpc/status.proto \
  yandex/cloud/ai/tts/v3/tts_service.proto \
  yandex/cloud/ai/tts/v3/tts.proto
```

The tts_pb2.py, tts_pb2_grpc.py, tts_service_pb2.py, and tts_service_pb2_grpc.py client interface files, as well as dependency files, will be created in the output folder.
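
Before writing the client, you may want to verify that code generation produced the files listed above. The following sketch is an assumption-based helper, not part of the SpeechKit tooling: the function name is hypothetical, and the subfolder layout is inferred from the `yandex.cloud.ai.tts.v3` import paths that the client code uses.

```python
from pathlib import Path

# Assumed locations, inferred from the yandex.cloud.ai.tts.v3 import paths.
EXPECTED_FILES = [
    "yandex/cloud/ai/tts/v3/tts_pb2.py",
    "yandex/cloud/ai/tts/v3/tts_pb2_grpc.py",
    "yandex/cloud/ai/tts/v3/tts_service_pb2.py",
    "yandex/cloud/ai/tts/v3/tts_service_pb2_grpc.py",
]


def missing_generated_files(output_dir):
    """Return the expected client files that are absent from output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_generated_files("output")` from the cloudapi folder should return an empty list if generation succeeded and the assumed layout is correct.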

Create a file named test.py in the output folder root and add the code for API v3 template-based synthesis to it.

test.py

```
import io
import argparse

import grpc
import pydub

import yandex.cloud.ai.tts.v3.tts_pb2 as tts_pb2
import yandex.cloud.ai.tts.v3.tts_service_pb2_grpc as tts_service_pb2_grpc


def synthesize(iam_token, bytes_array) -> pydub.AudioSegment:
    template = "<template_phrase_with_markup>"
    # Template example: 'This is to remind you that your child has an appointment for {procedure} on {date} at {time}.'
    request = tts_pb2.UtteranceSynthesisRequest(
        output_audio_spec=tts_pb2.AudioFormatOptions(
            container_audio=tts_pb2.ContainerAudio(
                container_audio_type=tts_pb2.ContainerAudio.WAV
            )
        ),
        loudness_normalization_type=tts_pb2.UtteranceSynthesisRequest.LUFS,
        text_template = tts_pb2.TextTemplate(
            text_template = template,
            variables = [
                # List template variables.
                # The number of tts_pb2.TextVariable() list items depends on the number of template variables.
                tts_pb2.TextVariable(
                    variable_name = "<variable_name_in_template>",
                    # Example: '{time}'
                    variable_value = "<text_for_synthesis>"
                    # Example: 'eight hours and thirty minutes'
                )
            ]
        ),
        hints = [
            tts_pb2.Hints(
                audio_template = tts_pb2.AudioTemplate(
                    audio = tts_pb2.AudioContent(
                        # Upload a source audio for your template.
                        content = bytes_array,
                        audio_spec = tts_pb2.AudioFormatOptions(
                            container_audio = tts_pb2.ContainerAudio(
                                container_audio_type = tts_pb2.ContainerAudio.WAV
                            )
                        )
                    ),
                    text_template = tts_pb2.TextTemplate(
                        text_template = template,
                        variables = [
                            # List template variables.
                            # The number of tts_pb2.TextVariable() list items depends on the number of template variables.
                            tts_pb2.TextVariable(
                                variable_name = "<variable_name_in_template>",
                                variable_value = "<phrase_variable_part_text_in_template_audio_file>"
                            )
                        ]
                    ),
                    variables = [
                        # List the audio parameters of template variables.
                        # The number of tts_pb2.AudioVariable() list items depends on the number of template variables.
                        tts_pb2.AudioVariable(
                            variable_name = "<variable_name_in_template>",
                            # Length of the variable part of the phrase in the template audio (ms).
                            variable_length_ms = <variable_length>,
                            # Start of the variable part of the phrase in the template audio (ms).
                            variable_start_ms = <variable_start_time>
                        )
                    ]
                )
            ),
            tts_pb2.Hints(
                voice = "<your_voice_ID>"
            )
        ],
    )

    # Establish a connection with the server.
    cred = grpc.ssl_channel_credentials()
    channel = grpc.secure_channel('tts.api.cloud.yandex.net:443', cred)
    stub = tts_service_pb2_grpc.SynthesizerStub(channel)

    # Send data for synthesis.
    it = stub.UtteranceSynthesis(request, metadata=(
        ("authorization", f"Bearer {iam_token}"),
    ))

    # Collect the streamed audio chunks and decode the result as WAV.
    try:
        audio = io.BytesIO()
        for response in it:
            audio.write(response.audio_chunk.data)
        audio.seek(0)
        return pydub.AudioSegment.from_wav(audio)
    except grpc.RpcError as err:
        # grpc.RpcError is the public base class for RPC failures.
        print(f'Error code {err.code()}, message: {err.details()}')
        raise


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--token', required=True, help='IAM token')
    parser.add_argument('--input', required=True, help='Template WAV file')
    parser.add_argument('--output', required=True, help='Output WAV file with synthesized speech')
    args = parser.parse_args()

    # Read the template audio recording.
    with open(args.input, 'rb') as file:
        speech = file.read()

    # Synthesize the phrase and write it to the output file.
    audio = synthesize(args.token, speech)
    with open(args.output, 'wb') as fp:
        audio.export(fp, format='wav')
```

Execute the test.py file:
```
export IAM_TOKEN=<service_account_IAM_token> && \
python3 output/test.py \
  --token ${IAM_TOKEN} \
  --input sample.wav \
  --output speech.wav
```
Where:
- IAM_TOKEN: Service account IAM token.
- --input: Name of the audio file with the template phrase.
- --output: Name of the audio file with the synthesized phrase.

This will create a file named speech.wav with your synthesized pattern-based phrase in the cloudapi folder.
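
As a quick check of the result, you can measure the duration of the output file and compare it with the 24-second limit mentioned earlier. This is a minimal sketch using the standard `wave` module; the function name is hypothetical, and it assumes the file is an uncompressed PCM WAV, which is the only kind `wave` reads.

```python
import wave


def wav_duration_ms(path):
    """Return the duration of a PCM WAV file in milliseconds."""
    with wave.open(str(path), "rb") as wav:
        return round(wav.getnframes() / wav.getframerate() * 1000)
```

The same helper can be applied to the template recording when you work out the start time and duration of each variable part, although locating the variable parts themselves still requires listening to the audio or using an audio editor.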
