Streaming speech recognition with auto language detection in API v3
This example shows how to recognize speech in LPCM format in real time using the SpeechKit API v3 with automatic language detection.
The example uses the following parameters:
- Recognition language: auto (automatic language detection).
- Audio stream format: LPCM with a sampling rate of 8000 Hz.
- Number of audio channels: 1 (default).
- Other parameters: default values.
Automatic language detection
SpeechKit automatically detects the language during speech recognition and returns language labels along with the recognition results; each label indicates the probability that the language was detected correctly.
To configure automatic language detection, set the language_code parameter of LanguageRestrictionOptions() to auto:
```python
language_restriction=stt_pb2.LanguageRestrictionOptions(
    restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
    language_code=['auto']
)
```
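If you know in advance which languages the audio may contain, you can list them explicitly instead of using auto, limiting detection to that set. A minimal sketch of the same message with an explicit list (the language codes shown are examples; check the SpeechKit documentation for the supported values):

```python
language_restriction=stt_pb2.LanguageRestrictionOptions(
    # Limit detection to Russian and English only.
    restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
    language_code=['ru-RU', 'en-US']
)
```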
Language labels contain the language code and specify the probability of its correct detection:
```text
language_code: "ru-RU" probability: 0.91582357883453369
```
Note
Language detection and language labels are only available in the gRPC API v3.
Prepare the required resources
- Create a service account and assign it the ai.speechkit-stt.user role.
- Get an IAM token for the service account and save it.
- Download a sample audio file for recognition or generate your own, e.g., by converting a WAV recording as sketched after this list.
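If you prefer to generate your own audio, one option is to record a short WAV file and strip its header to get raw LPCM. Below is a minimal sketch using only the Python standard library; the file names are placeholders, and it assumes the recording is 16-bit mono at 8000 Hz, matching the format used in this example.

```python
import wave

# Convert a 16-bit mono 8000 Hz WAV file into raw LPCM for this example.
with wave.open('speech.wav', 'rb') as wav:
    assert wav.getsampwidth() == 2, 'expected 16-bit samples'
    assert wav.getnchannels() == 1, 'expected mono audio'
    assert wav.getframerate() == 8000, 'expected an 8000 Hz sampling rate'
    frames = wav.readframes(wav.getnframes())

# Raw LPCM is the sample data without the WAV header.
with open('speech.pcm', 'wb') as pcm:
    pcm.write(frames)
```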
Create an application for streaming speech recognition
To implement the example from this section:
- Clone the Yandex Cloud API repository:

  ```bash
  git clone https://github.com/yandex-cloud/cloudapi
  ```
- Create a client application:

  Python 3
  - Install the grpcio-tools package using the pip package manager:

    ```bash
    pip install grpcio-tools
    ```
  - Go to the folder hosting the cloned Yandex Cloud API repository, create a folder named output, and generate the client interface code there:

    ```bash
    cd <path_to_cloudapi_folder>
    mkdir output
    python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
      --python_out=output \
      --grpc_python_out=output \
      google/api/http.proto \
      google/api/annotations.proto \
      yandex/cloud/api/operation.proto \
      google/rpc/status.proto \
      yandex/cloud/operation/operation.proto \
      yandex/cloud/validation.proto \
      yandex/cloud/ai/stt/v3/stt_service.proto \
      yandex/cloud/ai/stt/v3/stt.proto
    ```
    As a result, the stt_pb2.py, stt_pb2_grpc.py, stt_service_pb2.py, and stt_service_pb2_grpc.py client interface files, as well as dependency files, will be created in the output directory.
  - In the root of the output directory, create a file, e.g., test.py, and add the following code to it:

    ```python
    #coding=utf8
    import argparse

    import grpc
    import yandex.cloud.ai.stt.v3.stt_pb2 as stt_pb2
    import yandex.cloud.ai.stt.v3.stt_service_pb2_grpc as stt_service_pb2_grpc

    CHUNK_SIZE = 4000

    def gen(audio_file_name):
        # Specify the recognition settings.
        recognize_options = stt_pb2.StreamingOptions(
            recognition_model=stt_pb2.RecognitionModelOptions(
                audio_format=stt_pb2.AudioFormatOptions(
                    raw_audio=stt_pb2.RawAudio(
                        audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,
                        sample_rate_hertz=8000,
                        audio_channel_count=1
                    )
                ),
                # Set auto language detection.
                language_restriction=stt_pb2.LanguageRestrictionOptions(
                    restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
                    language_code=['auto']
                ),
                # Select the streaming recognition model.
                audio_processing_type=stt_pb2.RecognitionModelOptions.REAL_TIME
            )
        )

        # Send a message with the recognition settings.
        yield stt_pb2.StreamingRequest(session_options=recognize_options)

        # Read the audio file and send its contents in chunks.
        with open(audio_file_name, 'rb') as f:
            data = f.read(CHUNK_SIZE)
            while data != b'':
                yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=data))
                data = f.read(CHUNK_SIZE)

    # When authorizing with an API key as a service account,
    # provide api_key instead of iam_token.
    # def run(api_key, audio_file_name):
    def run(iam_token, audio_file_name):
        # Establish a server connection.
        cred = grpc.ssl_channel_credentials()
        channel = grpc.secure_channel('stt.api.cloud.yandex.net:443', cred)
        stub = stt_service_pb2_grpc.RecognizerStub(channel)

        # Send the data for recognition.
        it = stub.RecognizeStreaming(gen(audio_file_name), metadata=(
            # Parameters for authenticating with an IAM token.
            ('authorization', f'Bearer {iam_token}'),
            # Parameters for authenticating as a service account with an API key.
            # ('authorization', f'Api-Key {api_key}'),
        ))

        # Process the server responses and output the result to the console.
        try:
            for r in it:
                event_type, alternatives = r.WhichOneof('Event'), None
                if event_type == 'partial' and len(r.partial.alternatives) > 0:
                    alternatives = [a.text for a in r.partial.alternatives]
                if event_type == 'final':
                    alternatives = [a.text for a in r.final.alternatives]
                    # Get the language labels.
                    langs = [a.languages for a in r.final.alternatives]
                if event_type == 'final_refinement':
                    alternatives = [a.text for a in r.final_refinement.normalized_text.alternatives]
                print(f'type={event_type}, alternatives={alternatives}')
                # Output the language labels to the console for final versions.
                if event_type == 'final':
                    print('Language labels:')
                    for lang in langs:
                        for line in lang:
                            words = f'{line}'.splitlines()
                            for word in words:
                                print(f' {word}', end='')
                            print()
        except grpc.RpcError as err:
            print(f'Error code {err.code()}, message: {err.details()}')
            raise err

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--token', required=True, help='IAM token or API key')
        parser.add_argument('--path', required=True, help='audio file path')
        args = parser.parse_args()
        run(args.token, args.path)
    ```
    Where:

    - audio_encoding: Format of the audio stream.
    - sample_rate_hertz: Sampling rate of the audio stream.
    - audio_channel_count: Number of audio channels.
    - language_code: Recognition language.
- Use the IAM token of the service account:

  ```bash
  export IAM_TOKEN=<service_account_IAM_token>
  ```
- Run the created file:

  ```bash
  python3 output/test.py --token ${IAM_TOKEN} --path <path_to_speech.pcm>
  ```
  Where --path is the path to the audio file to recognize.

  Result:
  ```text
  type=status_code, alternatives=None
  type=partial, alternatives=None
  type=partial, alternatives=['hello']
  type=final, alternatives=['hello world']
  Language labels:
   language_code: "en-US" probability: 1
  type=final_refinement, alternatives=['hello world']
  type=eou_update, alternatives=None
  type=partial, alternatives=None
  type=status_code, alternatives=None
  ```
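If you want to act on the detected language in code rather than just print it, you can pick the highest-probability label from the final event. A minimal sketch, reusing the response structure from test.py; the helper name and the confidence threshold are illustrative, not part of the API:

```python
def most_likely_language(final_event, min_probability=0.5):
    # Collect all language labels attached to the final alternatives.
    labels = [
        lang
        for alternative in final_event.alternatives
        for lang in alternative.languages
    ]
    if not labels:
        return None
    best = max(labels, key=lambda label: label.probability)
    # Ignore low-confidence guesses; the threshold is arbitrary.
    return best.language_code if best.probability >= min_probability else None
```

In the response loop from test.py, you would call it for final events, e.g., most_likely_language(r.final) when event_type == 'final'.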