Asynchronous WAV audio file recognition using the API v3
The example below illustrates how to use the SpeechKit API v3 for asynchronous speech recognition from a WAV audio file. This example uses the following parameters:
- Audio stream format: WAV.
- Speech recognition model: general.
- Other parameters are left at their defaults.
Authentication is performed under a service account using an API key or IAM token. For more information about authentication in the SpeechKit API, see the API reference.
Getting started
Create a service account.
Note that asynchronous recognition of audio files is only available under a service account. Do not use other Yandex Cloud accounts for this.
Assign the service account the required roles for the folder where you created the bucket.
Get an IAM token or API key for the created service account.
If you do not have a WAV audio file, you can use this sample file.
Perform speech recognition via the API v3
Get a link to an audio file in Object Storage.
Create a file, e.g., request.json, and add the following code to it:
{
  "uri": "<bucket_name>/<path_to_WAV_file_in_bucket>",
  "recognition_model": {
    "model": "general",
    "audio_format": {
      "container_audio": {
        "container_audio_type": "WAV"
      }
    }
  }
}
Where:
- uri: Link to the audio file in Object Storage. For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to provide these parameters in SpeechKit as they are ignored.
- model: Speech recognition model.
- container_audio_type: Audio container format.
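The request body can also be generated programmatically. The following is a minimal sketch rather than part of the SpeechKit tooling: build_request is a hypothetical helper name, and the field names simply mirror the JSON request shown above.

```python
import json


def build_request(uri: str, model: str = "general", container: str = "WAV") -> dict:
    """Return a request body shaped like the JSON example (hypothetical helper)."""
    return {
        "uri": uri,
        "recognition_model": {
            "model": model,
            "audio_format": {
                "container_audio": {"container_audio_type": container}
            },
        },
    }


if __name__ == "__main__":
    body = build_request("<bucket_name>/<path_to_WAV_file_in_bucket>")
    # Write the body to request.json so that curl can send it with --data @request.json
    with open("request.json", "w", encoding="utf-8") as f:
        json.dump(body, f, indent=2)
```

Keeping the model and container as parameters makes it easy to reuse the same helper if you later change either value.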
Run the request using one of the service account authentication methods:
With an IAM token:
export IAM_TOKEN=<service_account_IAM_token> && \
curl \
  --insecure \
  --header "Authorization: Bearer ${IAM_TOKEN}" \
  --data @request.json
Where IAM_TOKEN is the IAM token of the service account.
With an API key:
Use API keys if requesting an IAM token automatically is not an option.
export API_KEY=<service_account_API_key> && \
curl \
  --insecure \
  --header "Authorization: Api-Key ${API_KEY}" \
  --data @request.json
Result example:
{
  "id": "f8ddr61b30fk********",
  "description": "STT v3 async recognition",
  "createdAt": "2024-07-15T07:39:36Z",
  "createdBy": "ajehumcuv38h********",
  "modifiedAt": "2024-07-15T07:39:36Z",
  "done": false,
  "metadata": null
}
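The two authentication methods differ only in the value of the Authorization header. A small Python sketch of that choice follows. Here auth_header is a hypothetical helper, not a SpeechKit function, and the header formats are taken from the curl commands above.

```python
from typing import Optional


def auth_header(iam_token: Optional[str] = None, api_key: Optional[str] = None) -> dict:
    """Build the Authorization header for exactly one credential type."""
    # Reject the call if both or neither credential is supplied
    if (iam_token is None) == (api_key is None):
        raise ValueError("Provide exactly one of iam_token or api_key")
    if iam_token is not None:
        return {"Authorization": f"Bearer {iam_token}"}
    return {"Authorization": f"Api-Key {api_key}"}
```

The returned dictionary can be passed as request headers to whichever HTTP client you use.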
Save the recognition operation id you get in the response.
Wait until the recognition is completed. It takes about 10 seconds to recognize one minute of audio.
Request information about the operation:
Authentication with an IAM token:
curl \
  --insecure \
  --request GET \
  --header "Authorization: Bearer ${IAM_TOKEN}" \
  <recognition_operation_ID>
Authentication with an API key:
curl \
  --insecure \
  --request GET \
  --header "Authorization: Api-Key ${API_KEY}" \
  <recognition_operation_ID>
Result example:
{
  "done": true,
  "id": "f8ddr61b30fk********",
  "description": "STT v3 async recognition",
  "createdAt": "2024-07-15T07:39:36Z",
  "createdBy": "ajehumcuv38h********",
  "modifiedAt": "2024-07-15T07:39:37Z"
}
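Because recognition is asynchronous, the operation has to be checked repeatedly until its done field becomes true. The sketch below shows one way to poll. The names wait_until_done and fetch_operation are hypothetical, and the callback is assumed to perform the operation request shown above and return the parsed JSON as a dictionary.

```python
import time
from typing import Callable


def wait_until_done(fetch_operation: Callable[[], dict],
                    interval_s: float = 5.0,
                    timeout_s: float = 600.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the operation reports done=true or the timeout expires."""
    waited = 0.0
    while True:
        operation = fetch_operation()
        if operation.get("done"):
            return operation
        if waited >= timeout_s:
            raise TimeoutError("recognition operation did not finish in time")
        # Pause before the next status check
        sleep(interval_s)
        waited += interval_s
```

Given the estimate of about 10 seconds of processing per minute of audio, an interval of a few seconds is a reasonable starting point. The sleep parameter is injectable only to keep the function easy to test.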
Request the operation result:
Authentication with an IAM token:
curl \
  --insecure \
  --request GET \
  --header "Authorization: Bearer ${IAM_TOKEN}" \
  <recognition_operation_ID>
Authentication with an API key:
curl \
  --insecure \
  --request GET \
  --header "Authorization: Api-Key ${API_KEY}" \
  <recognition_operation_ID>
Result example:
{
  "result": {
    "sessionUuid": {
      "uuid": "24935f24-2c1f62dc-8dd49006-********",
      "userRequestId": "f8d2h7m07t4i********"
    },
    "audioCursors": {
      "receivedDataMs": "7400",
      "resetTimeMs": "0",
      "partialTimeMs": "7400",
      "finalTimeMs": "7400",
      "finalIndex": "0",
      "eouTimeMs": "0"
    },
    "responseWallTimeMs": "189",
    "final": {
      "alternatives": [
        {
          "words": [
            { "text": "I", "startTimeMs": "459", "endTimeMs": "520" },
            { "text": "Yandex", "startTimeMs": "640", "endTimeMs": "1060" },
            { "text": "SpeechKit", "startTimeMs": "1120", "endTimeMs": "1959" },
            { "text": "I", "startTimeMs": "2480", "endTimeMs": "2520" },
            { "text": "can", "startTimeMs": "2580", "endTimeMs": "2800" },
            { "text": "turn", "startTimeMs": "2860", "endTimeMs": "3360" },
            { "text": "any", "startTimeMs": "3439", "endTimeMs": "3709" },
            { "text": "text", "startTimeMs": "3800", "endTimeMs": "4140" },
            { "text": "into", "startTimeMs": "4200", "endTimeMs": "4220" },
            { "text": "speech", "startTimeMs": "4279", "endTimeMs": "4740" },
            { "text": "now", "startTimeMs": "5140", "endTimeMs": "5759" },
            { "text": "you", "startTimeMs": "5859", "endTimeMs": "5900" },
            { "text": "can", "startTimeMs": "5980", "endTimeMs": "6399" },
            { "text": "too", "startTimeMs": "6660", "endTimeMs": "7180" }
          ],
          "text": "I'm Yandex SpeechKit I can turn any text into speech now you can too",
          "startTimeMs": "0",
          "endTimeMs": "7400",
          "confidence": 0,
          "languages": []
        }
      ],
      "channelTag": "0"
    },
    "channelTag": "0"
  }
}
{
  "result": {
    "sessionUuid": {
      "uuid": "24935f24-2c1f62dc-8dd49006-********",
      "userRequestId": "f8d2h7m07t4i********"
    },
    "audioCursors": {
      "receivedDataMs": "7400",
      "resetTimeMs": "0",
      "partialTimeMs": "7400",
      "finalTimeMs": "7400",
      "finalIndex": "0",
      "eouTimeMs": "0"
    },
    "responseWallTimeMs": "189",
    "finalRefinement": {
      "finalIndex": "0",
      "normalizedText": {
        "alternatives": [
          {
            "words": [
              { "text": "I", "startTimeMs": "459", "endTimeMs": "520" },
              { "text": "Yandex", "startTimeMs": "640", "endTimeMs": "1060" },
              { "text": "SpeechKit", "startTimeMs": "1120", "endTimeMs": "1959" },
              { "text": "I", "startTimeMs": "2480", "endTimeMs": "2520" },
              { "text": "can", "startTimeMs": "2580", "endTimeMs": "2800" },
              { "text": "turn", "startTimeMs": "2860", "endTimeMs": "3360" },
              { "text": "any", "startTimeMs": "3439", "endTimeMs": "3709" },
              { "text": "text", "startTimeMs": "3800", "endTimeMs": "4140" },
              { "text": "into", "startTimeMs": "4200", "endTimeMs": "4220" },
              { "text": "speech", "startTimeMs": "4279", "endTimeMs": "4740" },
              { "text": "now", "startTimeMs": "5140", "endTimeMs": "5759" },
              { "text": "you", "startTimeMs": "5859", "endTimeMs": "5900" },
              { "text": "can", "startTimeMs": "5980", "endTimeMs": "6399" },
              { "text": "too", "startTimeMs": "6660", "endTimeMs": "7180" }
            ],
            "text": "I'm Yandex SpeechKit I can turn any text into speech now you can too",
            "startTimeMs": "0",
            "endTimeMs": "7400",
            "confidence": 0,
            "languages": []
          }
        ],
        "channelTag": "0"
      }
    },
    "channelTag": "0"
  }
}
{
  "result": {
    "sessionUuid": {
      "uuid": "24935f24-2c1f62dc-8dd49006-********",
      "userRequestId": "f8d2h7m07t4i********"
    },
    "audioCursors": {
      "receivedDataMs": "7400",
      "resetTimeMs": "0",
      "partialTimeMs": "7400",
      "finalTimeMs": "7400",
      "finalIndex": "0",
      "eouTimeMs": "7400"
    },
    "responseWallTimeMs": "190",
    "eouUpdate": {
      "timeMs": "7400"
    },
    "channelTag": "0"
  }
}
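As the result example shows, the response body consists of several JSON objects written one after another, so a single json.loads call cannot parse it. A minimal sketch of splitting such a body and extracting the recognized text is given below. The helpers iter_json_objects and final_texts are hypothetical, and the field names are taken from the result example above.

```python
import json
from typing import Iterator, List


def iter_json_objects(body: str) -> Iterator[dict]:
    """Yield each JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(body):
        # Skip whitespace between consecutive objects
        while pos < len(body) and body[pos].isspace():
            pos += 1
        if pos >= len(body):
            break
        # raw_decode returns the parsed object and the index where it ended
        obj, pos = decoder.raw_decode(body, pos)
        yield obj


def final_texts(body: str) -> List[str]:
    """Collect the text of every alternative found in "final" results."""
    texts = []
    for message in iter_json_objects(body):
        final = message.get("result", {}).get("final")
        if final:
            texts.extend(alt["text"] for alt in final.get("alternatives", []))
    return texts
```

The same loop could read finalRefinement.normalizedText instead if you prefer the normalized transcript.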
Clone the Yandex Cloud API repository:
git clone
Use the pip package manager to install the grpcio-tools package:
pip install grpcio-tools
Go to the folder hosting the cloned Yandex Cloud API repository, create a folder named output, and generate the client interface code there:
cd <path_to_cloudapi_folder>
mkdir output
python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
  --python_out=output \
  --grpc_python_out=output \
  google/api/http.proto \
  google/api/annotations.proto \
  yandex/cloud/api/operation.proto \
  google/rpc/status.proto \
  yandex/cloud/operation/operation.proto \
  yandex/cloud/validation.proto \
  yandex/cloud/ai/stt/v3/stt_service.proto \
  yandex/cloud/ai/stt/v3/stt.proto
The client interface files, as well as dependency files, will be created in the output folder.
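By convention, protoc names its Python output after each .proto file and places it under the same directory path, so the modules generated here are importable as yandex.cloud.ai.stt.v3.<name>_pb2 when a script runs from the output folder. The sketch below illustrates that naming rule. The helper generated_modules is hypothetical and written for illustration only.

```python
from pathlib import PurePosixPath


def generated_modules(proto_path: str) -> dict:
    """Map a .proto path to the Python module names protoc conventionally produces."""
    p = PurePosixPath(proto_path)
    # Directory components become the dotted package name
    package = ".".join(p.parent.parts)
    return {
        "messages": f"{package}.{p.stem}_pb2",   # from --python_out
        "grpc": f"{package}.{p.stem}_pb2_grpc",  # from --grpc_python_out
    }
```

For example, yandex/cloud/ai/stt/v3/stt.proto maps to the yandex.cloud.ai.stt.v3.stt_pb2 module.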
Create a file in the output folder root and add the following API request code to it:
import grpc
from yandex.cloud.ai.stt.v3 import stt_pb2, stt_service_pb2_grpc

# Recognition settings: file location, model, and audio container format
request = stt_pb2.RecognizeFileRequest(
    uri='<bucket_name>/<path_to_WAV_file_in_bucket>',
    recognition_model=stt_pb2.RecognitionModelOptions(
        model='general',
        audio_format=stt_pb2.AudioFormatOptions(
            container_audio=stt_pb2.ContainerAudio(
                container_audio_type=stt_pb2.ContainerAudio.WAV
            )
        )
    )
)

# Open a TLS-secured channel to the server
cred = grpc.ssl_channel_credentials()
chan = grpc.secure_channel('', cred)
stub = stt_service_pb2_grpc.AsyncRecognizerStub(chan)

# Choose one of the authentication methods:

# Authentication with an IAM token
response = stub.RecognizeFile(request, metadata=[('authorization', 'Bearer <IAM_token>')])

# Authentication with an API key
# response = stub.RecognizeFile(request, metadata=[('authorization', 'Api-Key <API_key>')])

print(response)
Run the request. Result example:
id: "f8dem628l2mq********"
description: "STT v3 async recognition"
created_at {
  seconds: 1721032219
}
created_by: "ajehumcuv38h********"
modified_at {
  seconds: 1721032219
}
Save the recognition operation id you get in the response.
Create a file in the output folder root and add the following code to it to get the operation result:
import grpc
from yandex.cloud.ai.stt.v3 import stt_pb2, stt_service_pb2_grpc, stt_service_pb2

# Request the results of a previously started recognition operation
request = stt_service_pb2.GetRecognitionRequest(
    operation_id="<operation_ID>"
)

# Open a TLS-secured channel to the server
cred = grpc.ssl_channel_credentials()
chan = grpc.secure_channel('', cred)
stub = stt_service_pb2_grpc.AsyncRecognizerStub(chan)

# Authentication with an IAM token
response = stub.GetRecognition(request, metadata=[('authorization', 'Bearer <IAM_token>')])

# Authentication with an API key
# response = stub.GetRecognition(request, metadata=[('authorization', 'Api-Key <API_key>')])

# The response is a stream of messages, so collect it into a list before printing
print(list(response))
Run the request. Result example:
[session_uuid {
  uuid: "df49eaa2-25a55218-ae967fa1-********"
  user_request_id: "f8dkup42nmhk********"
}
audio_cursors {
  received_data_ms: 6600
  partial_time_ms: 6600
  final_time_ms: 6600
}
response_wall_time_ms: 204
final {
  alternatives {
    words { text: "I" start_time_ms: 380 end_time_ms: 420 }
    words { text: "Yandex" start_time_ms: 539 end_time_ms: 919 }
    words { text: "SpeechKit" start_time_ms: 960 end_time_ms: 1719 }
    words { text: "I" start_time_ms: 2159 end_time_ms: 2200 }
    words { text: "can" start_time_ms: 2260 end_time_ms: 2440 }
    words { text: "turn" start_time_ms: 2520 end_time_ms: 3000 }
    words { text: "any" start_time_ms: 3060 end_time_ms: 3320 }
    words { text: "text" start_time_ms: 3419 end_time_ms: 3740 }
    words { text: "into" start_time_ms: 3780 end_time_ms: 3800 }
    words { text: "speech" start_time_ms: 3860 end_time_ms: 4279 }
    words { text: "now" start_time_ms: 4680 end_time_ms: 5240 }
    words { text: "you" start_time_ms: 5339 end_time_ms: 5380 }
    words { text: "can" start_time_ms: 5460 end_time_ms: 5766 }
    words { text: "too" start_time_ms: 5920 end_time_ms: 6393 }
    text: "I'm Yandex SpeechKit I can turn any text into speech now you can too"
    end_time_ms: 6600
  }
  channel_tag: "0"
}
channel_tag: "0"
, session_uuid {
  uuid: "df49eaa2-25a55218-ae967fa1-********"
  user_request_id: "f8dkup42nmhk********"
}
audio_cursors {
  received_data_ms: 6600
  partial_time_ms: 6600
  final_time_ms: 6600
}
response_wall_time_ms: 204
final_refinement {
  normalized_text {
    alternatives {
      words { text: "I" start_time_ms: 380 end_time_ms: 420 }
      words { text: "Yandex" start_time_ms: 539 end_time_ms: 919 }
      words { text: "SpeechKit" start_time_ms: 960 end_time_ms: 1719 }
      words { text: "I" start_time_ms: 2159 end_time_ms: 2200 }
      words { text: "can" start_time_ms: 2260 end_time_ms: 2440 }
      words { text: "turn" start_time_ms: 2520 end_time_ms: 3000 }
      words { text: "any" start_time_ms: 3060 end_time_ms: 3320 }
      words { text: "text" start_time_ms: 3419 end_time_ms: 3740 }
      words { text: "into" start_time_ms: 3780 end_time_ms: 3800 }
      words { text: "speech" start_time_ms: 3860 end_time_ms: 4279 }
      words { text: "now" start_time_ms: 4680 end_time_ms: 5240 }
      words { text: "you" start_time_ms: 5339 end_time_ms: 5380 }
      words { text: "can" start_time_ms: 5460 end_time_ms: 5766 }
      words { text: "too" start_time_ms: 5920 end_time_ms: 6393 }
      text: "I'm Yandex SpeechKit I can turn any text into speech now you can too"
      end_time_ms: 6600
    }
    channel_tag: "0"
  }
}
channel_tag: "0"
, session_uuid {
  uuid: "df49eaa2-25a55218-ae967fa1-********"
  user_request_id: "f8dkup42nmhk********"
}
audio_cursors {
  received_data_ms: 6600
  partial_time_ms: 6600
  final_time_ms: 6600
  eou_time_ms: 6600
}
response_wall_time_ms: 204
eou_update {
  time_ms: 6600
}
channel_tag: "0"
]
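The printed list contains one message per recognition event, so the transcript still has to be picked out of it. The following is a hedged sketch: collect_final_texts is a hypothetical helper that relies on the standard protobuf HasField method and on the final and alternatives fields visible in the output above. It is demonstrated with stand-in objects rather than a live response.

```python
from types import SimpleNamespace
from typing import Iterable, List


def collect_final_texts(responses: Iterable) -> List[str]:
    """Return the text of every alternative in messages that carry a final result."""
    texts = []
    for response in responses:
        # Only some messages in the stream carry a "final" result
        if response.HasField("final"):
            texts.extend(alt.text for alt in response.final.alternatives)
    return texts


class FakeResponse(SimpleNamespace):
    """Stand-in that mimics HasField of a protobuf message (illustration only)."""

    def HasField(self, name: str) -> bool:
        return getattr(self, name, None) is not None


if __name__ == "__main__":
    demo = [
        FakeResponse(final=SimpleNamespace(alternatives=[SimpleNamespace(text="hello")])),
        FakeResponse(eou_update=SimpleNamespace(time_ms=6600)),
    ]
    print(collect_final_texts(demo))
```

With a real response, you would pass list(response) from the script above instead of the stand-in objects.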