Speech synthesis in the API v3
With the SpeechKit API v3, you can synthesize speech from text with TTS markup and save it to a WAV file.
The example uses the following synthesis parameters:
- Synthesized audio file format: LPCM with a sample rate of 22050 Hz, WAV container (default).
- Volume normalization: LUFS (default).
The result is converted and saved using the `grpcio-tools` and `pydub` packages and the FFmpeg utility.
Authentication is performed under a service account using an API key or IAM token. For more information about authentication in the SpeechKit API, see Authentication with the SpeechKit API.
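The two authorization options differ only in the gRPC metadata header they send. The sketch below, with a hypothetical helper named `build_auth_metadata` (not part of any SDK), shows the two header formats used by the SpeechKit API v3:

```python
# Illustrative helper: build gRPC call metadata for SpeechKit API v3 auth.
# build_auth_metadata is a hypothetical name; the header formats themselves
# (Bearer for IAM tokens, Api-Key for service account API keys) are the ones
# used in the client code later in this guide.
def build_auth_metadata(credential: str, use_api_key: bool = False):
    """Return gRPC call metadata for IAM-token or API-key authorization."""
    if use_api_key:
        # Service account API key: long-lived, no refresh needed
        return (('authorization', f'Api-Key {credential}'),)
    # IAM token: short-lived, must be refreshed periodically
    return (('authorization', f'Bearer {credential}'),)

print(build_auth_metadata('<iam_token>'))
print(build_auth_metadata('<api_key>', use_api_key=True))
```

This metadata tuple is passed to the stub call, as shown in the full client code below.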
To implement the example:

- Create a service account to work with the SpeechKit API.
- Assign the `ai.speechkit-tts.user` role or higher to the service account; this allows it to work with SpeechKit in the folder the account was created in.
- Create a client application:

Python 3
- Clone the Yandex Cloud API repository:

  ```bash
  git clone https://github.com/yandex-cloud/cloudapi
  ```

- Install the `grpcio-tools` and `pydub` packages using the pip package manager:

  ```bash
  pip install grpcio-tools && \
  pip install pydub
  ```
  You need the `grpcio-tools` package to generate the client interface code for the API v3 synthesis. The `pydub` package is needed to process the resulting audio files.

- Download the FFmpeg utility for correct operation of the `pydub` package. Add the path to the directory with the FFmpeg executable to the `PATH` variable. To do this, run the following command:

  ```bash
  export PATH=$PATH:<path_to_directory_with_executable_FFmpeg_file>
  ```
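If FFmpeg is not reachable via `PATH`, `pydub` fails later with a less obvious warning, so it can be worth checking up front. A minimal sketch, assuming `pydub`'s default behavior of looking up the converter on `PATH` under the name `ffmpeg`:

```python
# Sanity check: is the ffmpeg executable reachable via PATH?
# pydub looks up its converter on PATH (by default under the name "ffmpeg").
import shutil

def ffmpeg_available() -> bool:
    return shutil.which('ffmpeg') is not None

print(ffmpeg_available())
```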
- Go to the folder hosting the cloned Yandex Cloud API repository, create a folder named `output`, and generate the client interface code there:

  ```bash
  cd <path_to_cloudapi_folder>
  mkdir output
  python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
    --python_out=output \
    --grpc_python_out=output \
    google/api/http.proto \
    google/api/annotations.proto \
    yandex/cloud/api/operation.proto \
    google/rpc/status.proto \
    yandex/cloud/operation/operation.proto \
    yandex/cloud/validation.proto \
    yandex/cloud/ai/tts/v3/tts_service.proto \
    yandex/cloud/ai/tts/v3/tts.proto
  ```
  As a result, the `tts_pb2.py`, `tts_pb2_grpc.py`, `tts_service_pb2.py`, and `tts_service_pb2_grpc.py` client interface files, as well as dependency files, will be created in the `output` directory.
- In the root of the `output` directory, create a file, e.g., `test.py`, and add the following code to it:

  ```python
  import io
  import grpc
  import pydub
  import argparse

  import yandex.cloud.ai.tts.v3.tts_pb2 as tts_pb2
  import yandex.cloud.ai.tts.v3.tts_service_pb2_grpc as tts_service_pb2_grpc

  # Specify the synthesis settings.
  # When authorizing with an API key, provide api_key instead of iam_token:
  # def synthesize(api_key, text) -> pydub.AudioSegment:
  def synthesize(iam_token, text) -> pydub.AudioSegment:
      request = tts_pb2.UtteranceSynthesisRequest(
          text=text,
          output_audio_spec=tts_pb2.AudioFormatOptions(
              container_audio=tts_pb2.ContainerAudio(
                  container_audio_type=tts_pb2.ContainerAudio.WAV
              )
          ),
          # Synthesis parameters
          hints=[
              tts_pb2.Hints(voice='alexander'),  # (Optional) Specify the voice. The default value is marina.
              tts_pb2.Hints(role='good'),        # (Optional) Specify the role only if it is applicable for this voice.
              tts_pb2.Hints(speed=1.1),          # (Optional) Specify the synthesis speed.
          ],
          loudness_normalization_type=tts_pb2.UtteranceSynthesisRequest.LUFS
      )

      # Connect to the server.
      cred = grpc.ssl_channel_credentials()
      channel = grpc.secure_channel('tts.api.cloud.yandex.net:443', cred)
      stub = tts_service_pb2_grpc.SynthesizerStub(channel)

      # Send the data for synthesis.
      it = stub.UtteranceSynthesis(request, metadata=(
          # Parameters for authorizing with an IAM token
          ('authorization', f'Bearer {iam_token}'),
          # Parameters for authorizing as a service account with an API key
          # ('authorization', f'Api-Key {api_key}'),
      ))

      # Create an audio file from the received chunks.
      try:
          audio = io.BytesIO()
          for response in it:
              audio.write(response.audio_chunk.data)
          audio.seek(0)
          return pydub.AudioSegment.from_wav(audio)
      except grpc.RpcError as err:
          print(f'Error code {err.code()}, message: {err.details()}')
          raise

  if __name__ == '__main__':
      parser = argparse.ArgumentParser()
      parser.add_argument('--token', required=True, help='IAM token or API key')
      parser.add_argument('--text', required=True, help='Text for synthesis')
      parser.add_argument('--output', required=True, help='Output file')
      args = parser.parse_args()

      audio = synthesize(args.token, args.text)
      with open(args.output, 'wb') as fp:
          audio.export(fp, format='wav')
  ```
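The server streams the synthesized audio as a sequence of `audio_chunk.data` byte strings, and the script above concatenates them into a single in-memory WAV before handing it to `pydub`. That assembly pattern, shown in isolation with an illustrative helper named `assemble_chunks`:

```python
# The chunk-assembly pattern from the script above, in isolation:
# write each byte-string chunk into one in-memory buffer, then rewind
# and read it back as a single contiguous byte string.
import io

def assemble_chunks(chunks) -> bytes:
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the buffer can be read from the start
    return buf.read()

print(assemble_chunks([b'RIFF', b'WAVE']))  # → b'RIFFWAVE'
```

In the real script, the rewound `BytesIO` buffer is passed directly to `pydub.AudioSegment.from_wav` instead of being read back.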
- Execute the file from the previous step:

  ```bash
  export IAM_TOKEN=<service_account_IAM_token>
  export TEXT="I'm Yandex Speech+Kit. I can turn any text into speech. Now y+ou can, too!"
  python3 output/test.py \
    --token ${IAM_TOKEN} \
    --output speech.wav \
    --text "${TEXT}"
  ```
  Where:
  - `IAM_TOKEN`: IAM token of the service account. If you use an API key for authentication under a service account, change the Python script and the program call accordingly.
  - `TEXT`: Text in TTS markup for synthesis.
  - `--output`: Name of the file to record the audio to.
  As a result, the `speech.wav` file with the synthesized speech will be created in the `cloudapi` directory.
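You can verify that the output matches the synthesis parameters stated at the top of this guide (LPCM, 22050 Hz) with the standard library alone. In the sketch below, a tiny stand-in `speech.wav` is written first so the snippet is runnable anywhere; with a real synthesis result, skip the writing part and just open the file:

```python
# Check the WAV parameters with the standard library's wave module.
import wave

# Stand-in file so the snippet runs without a real synthesis result:
# mono, 16-bit LPCM, 22050 Hz, 0.1 seconds of silence.
with wave.open('speech.wav', 'wb') as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(22050)  # sample rate from the synthesis parameters
    w.writeframes(b'\x00\x00' * 2205)

# Inspect the file the same way you would inspect the real speech.wav.
with wave.open('speech.wav', 'rb') as w:
    print(w.getframerate(), w.getsampwidth() * 8)  # → 22050 16
```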
Java

- Install the dependencies:

  ```bash
  sudo apt update && sudo apt install --yes default-jdk maven
  ```
- Clone the repository with the Java application configuration:

  ```bash
  git clone https://github.com/yandex-cloud-examples/yc-speechkit-tts-java
  ```
- Go to the repository directory:

  ```bash
  cd yc-speechkit-tts-java
  ```

- Compile the project in this directory:

  ```bash
  mvn clean install
  ```
- Go to the `target` directory created by the build:

  ```bash
  cd target
  ```
- Specify the service account's API key and the text to synthesize:

  ```bash
  export API_KEY=<API_key> && \
  export TEXT="I'm Yandex Speech+Kit. I can turn any text into speech. Now y+ou can, too!"
  ```
- Run the Java application for speech synthesis:

  ```bash
  java -cp speechkit_examples-1.0-SNAPSHOT.jar yandex.cloud.speechkit.examples.TtsV3Client "${TEXT}"
  ```
  As a result, the `result.wav` audio file will appear in the `target` directory. It contains the speech synthesized from the text in the `TEXT` environment variable.