SpeechKit Synthesis Service API v3, gRPC: Synthesizer.UtteranceSynthesis
Synthesizing text into speech.
gRPC request
rpc UtteranceSynthesis (UtteranceSynthesisRequest) returns (stream UtteranceSynthesisResponse)
UtteranceSynthesisRequest
{
  "model": "string",
  // Includes only one of the fields `text`, `textTemplate`
  "text": "string",
  "textTemplate": {
    "textTemplate": "string",
    "variables": [
      {
        "variableName": "string",
        "variableValue": "string"
      }
    ]
  },
  // end of the list of possible fields
  "hints": [
    {
      // Includes only one of the fields `voice`, `audioTemplate`, `speed`, `volume`, `role`, `pitchShift`, `duration`
      "voice": "string",
      "audioTemplate": {
        "audio": {
          // Includes only one of the fields `content`
          "content": "bytes",
          // end of the list of possible fields
          "audioSpec": {
            // Includes only one of the fields `rawAudio`, `containerAudio`
            "rawAudio": {
              "audioEncoding": "AudioEncoding",
              "sampleRateHertz": "int64"
            },
            "containerAudio": {
              "containerAudioType": "ContainerAudioType"
            }
            // end of the list of possible fields
          }
        },
        "textTemplate": {
          "textTemplate": "string",
          "variables": [
            {
              "variableName": "string",
              "variableValue": "string"
            }
          ]
        },
        "variables": [
          {
            "variableName": "string",
            "variableStartMs": "int64",
            "variableLengthMs": "int64"
          }
        ]
      },
      "speed": "double",
      "volume": "double",
      "role": "string",
      "pitchShift": "double",
      "duration": {
        "policy": "DurationHintPolicy",
        "durationMs": "int64"
      }
      // end of the list of possible fields
    }
  ],
  "outputAudioSpec": {
    // Includes only one of the fields `rawAudio`, `containerAudio`
    "rawAudio": {
      "audioEncoding": "AudioEncoding",
      "sampleRateHertz": "int64"
    },
    "containerAudio": {
      "containerAudioType": "ContainerAudioType"
    }
    // end of the list of possible fields
  },
  "loudnessNormalizationType": "LoudnessNormalizationType",
  "unsafeMode": "bool"
}
Field | Description
model | string. The name of the model.
text | string. Raw text (e.g. "Hello, Alice"). Includes only one of the fields `text`, `textTemplate`. Text to synthesize, in one of the text synthesis markups.
textTemplate | TextTemplate. Text template instance. Includes only one of the fields `text`, `textTemplate`. Text to synthesize, in one of the text synthesis markups.
hints[] | Hints. Optional hints for synthesis.
outputAudioSpec | AudioFormatOptions. Optional. Default: 22050 Hz, linear 16-bit signed little-endian PCM, with WAV header.
loudnessNormalizationType | enum LoudnessNormalizationType. Specifies the type of loudness normalization.
unsafeMode | bool. Optional. Automatically splits long text into several utterances and bills accordingly. Some degradation in service quality is possible.
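The mutual exclusion between `text` and `textTemplate` can be checked on the client before a request is sent. The following is a minimal sketch, assuming the request is first assembled as a plain dictionary shaped like the schema above; `build_request` and the voice name `example-voice` are hypothetical and not part of the API. Requiring that one of the two fields is present is an assumption of this sketch, since the schema only states that both cannot be set together.

```python
from typing import Any, Dict, List, Optional


def build_request(
    text: Optional[str] = None,
    text_template: Optional[Dict[str, Any]] = None,
    model: str = "",
    hints: Optional[List[Dict[str, Any]]] = None,
    unsafe_mode: bool = False,
) -> Dict[str, Any]:
    """Assemble a dict shaped like UtteranceSynthesisRequest (hypothetical helper)."""
    # `text` and `textTemplate` are alternatives: reject both-set and neither-set.
    if (text is None) == (text_template is None):
        raise ValueError("set exactly one of `text` or `text_template`")

    request: Dict[str, Any] = {
        "model": model,
        "hints": list(hints or []),
        "unsafeMode": unsafe_mode,
    }
    if text is not None:
        request["text"] = text
    else:
        request["textTemplate"] = text_template
    return request


# "example-voice" is a placeholder, not a real voice name.
request = build_request(text="Hello, Alice", hints=[{"voice": "example-voice"}])
```

A real client would copy these values into the generated protobuf message rather than sending the dictionary itself.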
TextTemplate
Field | Description
textTemplate | string. Template text.
variables[] | TextVariable. Defines the variables used in the template text.
TextVariable
Field | Description
variableName | string. The name of the variable.
variableValue | string. The text of the variable.
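To preview what a template would say before sending it, the variables can be substituted locally. This is a sketch only: the `{name}` placeholder syntax is an assumption made for illustration, because this page does not show the template syntax, and `render_preview` is a hypothetical helper. The service performs the actual substitution.

```python
import re
from typing import Any, Dict


def render_preview(text_template: Dict[str, Any]) -> str:
    """Substitute variableValue for each placeholder (local preview only)."""
    values = {
        variable["variableName"]: variable["variableValue"]
        for variable in text_template.get("variables", [])
    }

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"template variable {name!r} has no value")
        return values[name]

    # Assumed placeholder syntax: a variable name wrapped in curly braces.
    return re.sub(r"\{(\w+)\}", substitute, text_template["textTemplate"])


template = {
    "textTemplate": "Hello, {username}",
    "variables": [{"variableName": "username", "variableValue": "Alice"}],
}
preview = render_preview(template)  # "Hello, Alice"
```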
Hints
Each hint includes only one of the fields `voice`, `audioTemplate`, `speed`, `volume`, `role`, `pitchShift`, `duration`. A hint tells the TTS engine which characteristics the synthesized audio should have.
Field | Description
voice | string. Name of the speaker to use.
audioTemplate | AudioTemplate. Template for synthesizing.
speed | double. Hint to change speed.
volume | double. Hint to regulate normalization level.
role | string. Hint to specify pronunciation character for the speaker.
pitchShift | double. Hint to increase (or decrease) the speaker's pitch, measured in Hz. Valid values are in the range [-1000;1000]; the default value is 0.
duration | DurationHint. Hint to limit both minimum and maximum audio duration.
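Because each entry in `hints[]` carries exactly one field, several characteristics are expressed as several list entries. The sketch below checks that rule and the documented `pitchShift` range on plain dictionaries; `validate_hint` and the voice name are hypothetical, and the service may apply further validation that is not described here.

```python
from typing import Any, Dict

HINT_FIELDS = ("voice", "audioTemplate", "speed", "volume", "role", "pitchShift", "duration")


def validate_hint(hint: Dict[str, Any]) -> str:
    """Return the name of the single field set in a hint, or raise ValueError."""
    unknown = [name for name in hint if name not in HINT_FIELDS]
    if unknown:
        raise ValueError(f"unknown hint fields: {unknown}")
    present = [name for name in HINT_FIELDS if name in hint]
    if len(present) != 1:
        raise ValueError(f"a hint must set exactly one field, got {present}")
    field = present[0]
    # Range taken from the pitchShift description: [-1000;1000] Hz.
    if field == "pitchShift" and not -1000 <= hint[field] <= 1000:
        raise ValueError("pitchShift must be within [-1000;1000]")
    return field


# One entry per characteristic; "example-voice" is a placeholder name.
hints = [{"voice": "example-voice"}, {"speed": 1.1}, {"pitchShift": -50.0}]
kinds = [validate_hint(hint) for hint in hints]
```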
AudioTemplate
Field | Description
audio | AudioContent. Audio file.
textTemplate | TextTemplate. Template and description of its variables.
variables[] | AudioVariable. Describes the variables in the audio.
AudioContent
Field | Description
content | bytes. Bytes with audio data. Includes only one of the fields `content`. The audio source to read the data from.
audioSpec | AudioFormatOptions. Description of the audio format.
AudioFormatOptions
Field | Description
rawAudio | RawAudio. The audio format specified in request parameters. Includes only one of the fields `rawAudio`, `containerAudio`.
containerAudio | ContainerAudio. The audio format specified inside the container metadata. Includes only one of the fields `rawAudio`, `containerAudio`.
RawAudio
Field | Description
audioEncoding | enum AudioEncoding. Encoding type.
sampleRateHertz | int64. Sampling frequency of the signal.
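Raw audio has no header, so a player needs the sample rate and sample format supplied separately. A minimal sketch of wrapping raw PCM in a WAV container with the Python standard library follows. It assumes 16-bit signed little-endian samples (the format named in the request defaults) and a single channel; the channel count is an assumption, and `pcm_to_wav` and `pcm_duration_ms` are hypothetical helpers.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate_hertz: int = 22050) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # assumption: mono output
        wav_file.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate_hertz)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_ms(pcm: bytes, sample_rate_hertz: int = 22050) -> float:
    """Duration of raw 16-bit mono PCM in milliseconds."""
    return len(pcm) / 2 / sample_rate_hertz * 1000


# One second of silence at the default sample rate.
silence = b"\x00\x00" * 22050
wav_bytes = pcm_to_wav(silence)
```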
ContainerAudio
Field | Description
containerAudioType | enum ContainerAudioType
AudioVariable
Field | Description
variableName | string. The name of the variable.
variableStartMs | int64. Start time of the variable in milliseconds.
variableLengthMs | int64. Length of the variable in milliseconds.
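When preparing an audio template, it can help to check which part of the recording a variable covers. The sketch below converts `variableStartMs` and `variableLengthMs` into a byte range, assuming the template audio is raw 16-bit mono PCM; the sample format and the helper name `variable_byte_range` are assumptions made for illustration.

```python
from typing import Dict, Tuple


def variable_byte_range(
    variable: Dict[str, int],
    sample_rate_hertz: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit mono PCM
) -> Tuple[int, int]:
    """Return (start, end) byte offsets of the variable inside raw PCM."""
    start_sample = variable["variableStartMs"] * sample_rate_hertz // 1000
    length_samples = variable["variableLengthMs"] * sample_rate_hertz // 1000
    start = start_sample * bytes_per_sample
    return start, start + length_samples * bytes_per_sample


variable = {"variableName": "username", "variableStartMs": 500, "variableLengthMs": 250}
byte_range = variable_byte_range(variable, sample_rate_hertz=16000)  # (16000, 24000)
```

Slicing the PCM buffer with this range yields only the samples that the variable replaces.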
DurationHint
Field | Description
policy | enum DurationHintPolicy. Type of duration constraint.
durationMs | int64. Constraint on audio duration in milliseconds.
UtteranceSynthesisResponse
{
  "audioChunk": {
    "data": "bytes"
  },
  "textChunk": {
    "text": "string"
  },
  "startMs": "int64",
  "lengthMs": "int64"
}
Field | Description
audioChunk | AudioChunk. Part of synthesized audio.
textChunk | TextChunk. Part of synthesized text.
startMs | int64. Start time of the audio chunk in milliseconds.
lengthMs | int64. Length of the audio chunk in milliseconds.
AudioChunk
Field | Description
data | bytes. Sequence of bytes of the synthesized audio, in the format specified in `outputAudioSpec`.
TextChunk
Field | Description
text | string. Synthesized text.
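Because the method returns a stream, the client receives the audio as a sequence of `UtteranceSynthesisResponse` messages and has to join the chunks itself. The sketch below shows that loop on stand-in dictionaries shaped like the response schema above. It is not a real gRPC call: `collect_audio` is a hypothetical helper, and generated protobuf classes would expose these fields as message attributes rather than dictionary keys.

```python
from typing import Any, Dict, Iterable, Tuple


def collect_audio(responses: Iterable[Dict[str, Any]]) -> Tuple[bytes, str, int]:
    """Join streamed chunks into (audio bytes, text, end time in ms)."""
    audio = bytearray()
    text_parts = []
    end_ms = 0
    for response in responses:
        audio.extend(response["audioChunk"]["data"])
        text_parts.append(response["textChunk"]["text"])
        # startMs + lengthMs is where this chunk ends on the audio timeline.
        end_ms = max(end_ms, response["startMs"] + response["lengthMs"])
    return bytes(audio), "".join(text_parts), end_ms


# Stand-in for the response stream; the byte values are dummy data.
fake_stream = [
    {"audioChunk": {"data": b"\x01\x02"}, "textChunk": {"text": "Hello,"}, "startMs": 0, "lengthMs": 400},
    {"audioChunk": {"data": b"\x03\x04"}, "textChunk": {"text": " Alice"}, "startMs": 400, "lengthMs": 350},
]
audio, text, total_ms = collect_audio(fake_stream)
```

Accepting any iterable keeps the helper usable with a live response iterator as well as with recorded test data.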