SpeechKit Synthesis Service API v3, gRPC: Synthesizer.UtteranceSynthesis
Synthesizing text into speech.
gRPC request
rpc UtteranceSynthesis (UtteranceSynthesisRequest) returns (stream UtteranceSynthesisResponse)
UtteranceSynthesisRequest
{
  "model": "string",
  // Includes only one of the fields `text`, `text_template`
  "text": "string",
  "text_template": {
    "text_template": "string",
    "variables": [
      {
        "variable_name": "string",
        "variable_value": "string"
      }
    ]
  },
  // end of the list of possible fields
  "hints": [
    {
      // Includes only one of the fields `voice`, `audio_template`, `speed`, `volume`, `role`, `pitch_shift`, `duration`
      "voice": "string",
      "audio_template": {
        "audio": {
          // Includes only one of the fields `content`
          "content": "bytes",
          // end of the list of possible fields
          "audio_spec": {
            // Includes only one of the fields `raw_audio`, `container_audio`
            "raw_audio": {
              "audio_encoding": "AudioEncoding",
              "sample_rate_hertz": "int64"
            },
            "container_audio": {
              "container_audio_type": "ContainerAudioType"
            }
            // end of the list of possible fields
          }
        },
        "text_template": {
          "text_template": "string",
          "variables": [
            {
              "variable_name": "string",
              "variable_value": "string"
            }
          ]
        },
        "variables": [
          {
            "variable_name": "string",
            "variable_start_ms": "int64",
            "variable_length_ms": "int64"
          }
        ]
      },
      "speed": "double",
      "volume": "double",
      "role": "string",
      "pitch_shift": "double",
      "duration": {
        "policy": "DurationHintPolicy",
        "duration_ms": "int64"
      }
      // end of the list of possible fields
    }
  ],
  "output_audio_spec": {
    // Includes only one of the fields `raw_audio`, `container_audio`
    "raw_audio": {
      "audio_encoding": "AudioEncoding",
      "sample_rate_hertz": "int64"
    },
    "container_audio": {
      "container_audio_type": "ContainerAudioType"
    }
    // end of the list of possible fields
  },
  "loudness_normalization_type": "LoudnessNormalizationType",
  "unsafe_mode": "bool"
}
Field | Description
model | string. The name of the model.
text | string. Raw text (e.g. "Hello, Alice"). Includes only one of the fields `text`, `text_template`. Text to synthesize, given in one of the text synthesis markups.
text_template | TextTemplate. Text template instance. Includes only one of the fields `text`, `text_template`. Text to synthesize, given in one of the text synthesis markups.
hints[] | Hints. Optional hints for synthesis.
output_audio_spec | AudioFormatOptions. Optional. Default: 22050 Hz, linear 16-bit signed little-endian PCM, with WAV header.
loudness_normalization_type | enum LoudnessNormalizationType. Specifies the type of loudness normalization.
unsafe_mode | bool. Optional. Automatically split long text into several utterances and bill accordingly. Some degradation in service quality is possible.
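The rule that a request carries only one of `text` and `text_template` can be checked on the client before the call is made. The following is a minimal sketch that models the request as a plain Python dictionary instead of the generated gRPC message classes. `build_request` and `ONE_OF_TEXT` are hypothetical names introduced here, `example_voice` is a placeholder rather than a real voice name, and the sketch assumes a request with no text source at all should also be rejected.

```python
from typing import Any, Dict, List, Optional

# Hypothetical constant: the mutually exclusive text fields from the schema above.
ONE_OF_TEXT = ("text", "text_template")


def build_request(
    text: Optional[str] = None,
    text_template: Optional[Dict[str, Any]] = None,
    hints: Optional[List[Dict[str, Any]]] = None,
    unsafe_mode: bool = False,
) -> Dict[str, Any]:
    """Build a request dict, enforcing that exactly one text source is set."""
    provided = {"text": text, "text_template": text_template}
    chosen = [name for name in ONE_OF_TEXT if provided[name] is not None]
    if len(chosen) != 1:
        raise ValueError(f"set exactly one of {ONE_OF_TEXT}, got {chosen}")
    request: Dict[str, Any] = {chosen[0]: provided[chosen[0]]}
    if hints:
        request["hints"] = list(hints)
    if unsafe_mode:
        request["unsafe_mode"] = True
    return request


# "example_voice" is a placeholder, not a documented voice name.
request = build_request(text="Hello, Alice", hints=[{"voice": "example_voice"}])
```

Optional fields that are not set are simply left out of the dictionary, which mirrors how unset fields behave in a protobuf message.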
TextTemplate
Field | Description
text_template | string. Template text.
variables[] | TextVariable. Defines the variables used in the template text.
TextVariable
Field | Description
variable_name | string. The name of the variable.
variable_value | string. The text of the variable.
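To preview which text a template would expand to, the variables can be substituted locally. The sketch below is illustrative only: the placeholder syntax is not shown in this section, so the `{variable_name}` curly-brace form is an assumption, and `render_template` is a hypothetical helper, not part of the API.

```python
import re
from typing import Any, Dict


def render_template(template: Dict[str, Any]) -> str:
    """Substitute TextVariable values into the template text.

    Assumption: placeholders are written as {variable_name}.
    """
    values = {
        item["variable_name"]: item["variable_value"]
        for item in template.get("variables", [])
    }

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for variable {name!r}")
        return values[name]

    return re.sub(r"\{(\w+)\}", substitute, template["text_template"])


template = {
    "text_template": "Hello, {username}",
    "variables": [{"variable_name": "username", "variable_value": "Alice"}],
}
rendered = render_template(template)
print(rendered)  # Hello, Alice
```

Raising on a missing variable catches a template and variable list that have drifted apart before any request is sent.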
Hints
Each Hints entry includes only one of the fields `voice`, `audio_template`, `speed`, `volume`, `role`, `pitch_shift`, `duration`: a hint for the TTS engine that specifies a characteristic of the synthesized audio.
Field | Description
voice | string. Name of the speaker to use.
audio_template | AudioTemplate. Template for synthesis.
speed | double. Hint to change the speech speed.
volume | double. Hint to regulate the normalization level.
role | string. Hint to specify the pronunciation character for the speaker.
pitch_shift | double. Hint to increase or decrease the speaker's pitch, measured in Hz. Valid values are in the range [-1000;1000]; the default value is 0.
duration | DurationHint. Hint to limit both the minimum and maximum audio duration.
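Because each Hints entry holds a single field, setting several characteristics means passing several entries in `hints[]`. The sketch below checks that shape and the documented `pitch_shift` range on plain dictionaries. `validate_hints` and `HINT_FIELDS` are hypothetical names, and the voice and speed values are illustrative placeholders, not documented values.

```python
from typing import Any, Dict, List

# Field names taken from the Hints description above.
HINT_FIELDS = (
    "voice", "audio_template", "speed", "volume",
    "role", "pitch_shift", "duration",
)


def validate_hints(hints: List[Dict[str, Any]]) -> None:
    """Raise ValueError unless every hint sets exactly one known field."""
    for index, hint in enumerate(hints):
        unknown = [name for name in hint if name not in HINT_FIELDS]
        if unknown:
            raise ValueError(f"hint {index}: unknown fields {unknown}")
        if len(hint) != 1:
            raise ValueError(f"hint {index}: expected one field, got {list(hint)}")
        # Documented range for pitch_shift is [-1000;1000] Hz.
        if "pitch_shift" in hint and not -1000 <= hint["pitch_shift"] <= 1000:
            raise ValueError(f"hint {index}: pitch_shift out of range")


# One entry per characteristic; the values are placeholders.
hints = [
    {"voice": "example_voice"},
    {"speed": 1.2},
    {"pitch_shift": -50.0},
]
validate_hints(hints)
```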
AudioTemplate
Field | Description
audio | AudioContent. Audio file.
text_template | TextTemplate. Template and description of its variables.
variables[] | AudioVariable. Description of the variables in the audio.
AudioContent
Field | Description
content | bytes. Bytes with audio data. Includes only one of the fields `content`. The audio source to read the data from.
audio_spec | AudioFormatOptions. Description of the audio format.
AudioFormatOptions
Field | Description
raw_audio | RawAudio. The audio format specified in request parameters. Includes only one of the fields `raw_audio`, `container_audio`.
container_audio | ContainerAudio. The audio format specified inside the container metadata. Includes only one of the fields `raw_audio`, `container_audio`.
RawAudio
Field | Description
audio_encoding | enum AudioEncoding. Encoding type.
sample_rate_hertz | int64. Sampling frequency of the signal.
ContainerAudio
Field | Description
container_audio_type | enum ContainerAudioType.
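The difference between `raw_audio` and `container_audio` is where the format lives: raw audio is described by the request parameters, while a container carries the description in its own metadata. As a rough illustration, the sketch below wraps headerless PCM samples in a WAV container with Python's standard `wave` module. It assumes mono, 16-bit signed little-endian linear PCM; `pcm_to_wav` is a hypothetical helper, not part of the API.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate_hertz: int,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV container.

    Assumption: samples are mono, 16-bit linear PCM unless overridden.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate_hertz)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at 22050 Hz: 22050 samples, 2 bytes each.
silence = b"\x00\x00" * 22050
wav_bytes = pcm_to_wav(silence, sample_rate_hertz=22050)

# Reading the container back recovers the format from its header.
with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    print(reader.getframerate(), reader.getnframes())  # 22050 22050
```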
AudioVariable
Field | Description
variable_name | string. The name of the variable.
variable_start_ms | int64. Start time of the variable in milliseconds.
variable_length_ms | int64. Length of the variable in milliseconds.
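The millisecond positions of an AudioVariable can be mapped to byte offsets inside raw audio when cutting or inspecting the template recording. The following sketch assumes mono 16-bit PCM (2 bytes per sample) at a known sample rate; `variable_byte_range` is a hypothetical helper introduced for illustration.

```python
from typing import Any, Dict, Tuple


def variable_byte_range(variable: Dict[str, Any], sample_rate_hertz: int,
                        bytes_per_sample: int = 2) -> Tuple[int, int]:
    """Return (start, end) byte offsets of a variable in mono raw PCM."""
    start_ms = int(variable["variable_start_ms"])
    end_ms = start_ms + int(variable["variable_length_ms"])
    # Convert milliseconds to whole samples, then samples to bytes.
    start_sample = start_ms * sample_rate_hertz // 1000
    end_sample = end_ms * sample_rate_hertz // 1000
    return start_sample * bytes_per_sample, end_sample * bytes_per_sample


variable = {
    "variable_name": "username",
    "variable_start_ms": 500,
    "variable_length_ms": 1000,
}
byte_range = variable_byte_range(variable, sample_rate_hertz=22050)
print(byte_range)  # (22050, 66150)
```

Working in whole samples before multiplying by the sample width keeps both offsets aligned to sample boundaries.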
DurationHint
Field | Description
policy | enum DurationHintPolicy. Type of duration constraint.
duration_ms | int64. Constraint on audio duration in milliseconds.
UtteranceSynthesisResponse
{
  "audio_chunk": {
    "data": "bytes"
  },
  "text_chunk": {
    "text": "string"
  },
  "start_ms": "int64",
  "length_ms": "int64"
}
Field | Description
audio_chunk | AudioChunk. Part of the synthesized audio.
text_chunk | TextChunk. Part of the synthesized text.
start_ms | int64. Start time of the audio chunk in milliseconds.
length_ms | int64. Length of the audio chunk in milliseconds.
AudioChunk
Field | Description
data | bytes. Sequence of bytes of the synthesized audio, in the format specified in output_audio_spec.
TextChunk
Field | Description
text | string. Synthesized text.
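Since the method returns a stream of UtteranceSynthesisResponse messages, the client typically accumulates the chunks as they arrive. The sketch below simulates the stream with a generator of plain dictionaries instead of a real gRPC call. `collect_audio` and `fake_stream` are hypothetical names, and the sketch assumes chunks arrive in playback order so that concatenating `audio_chunk.data` yields the complete audio.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def collect_audio(
    responses: Iterable[Dict[str, Any]],
) -> Tuple[bytes, List[str], int]:
    """Join streamed chunks; return (audio bytes, text chunks, end time in ms)."""
    audio = bytearray()
    texts: List[str] = []
    end_ms = 0
    for response in responses:
        audio.extend(response.get("audio_chunk", {}).get("data", b""))
        text = response.get("text_chunk", {}).get("text", "")
        if text:
            texts.append(text)
        # The end of a chunk is its start time plus its length.
        chunk_end = response.get("start_ms", 0) + response.get("length_ms", 0)
        end_ms = max(end_ms, chunk_end)
    return bytes(audio), texts, end_ms


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for the server stream; the payload bytes are dummy data."""
    yield {"audio_chunk": {"data": b"\x00\x01"}, "text_chunk": {"text": "Hello,"},
           "start_ms": 0, "length_ms": 400}
    yield {"audio_chunk": {"data": b"\x02\x03"}, "text_chunk": {"text": "Alice"},
           "start_ms": 400, "length_ms": 500}


audio, texts, end_ms = collect_audio(fake_stream())
print(len(audio), texts, end_ms)  # 4 ['Hello,', 'Alice'] 900
```

With a real client, the iterator returned by the streaming call would take the place of `fake_stream()`, with attribute access on the messages instead of dictionary lookups.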