How to recognize long audio files in SpeechKit
SpeechKit can recognize speech in several ways. This example demonstrates asynchronous recognition of an audio file. Asynchronous recognition is available via API v3 and API v2 and is subject to these restrictions:
- Maximum audio duration: 4 hours
- Maximum file size: 1 GB
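As a rough pre-flight check, a file can be tested against these two limits locally before it is uploaded. The sketch below assumes a WAV file that Python's standard wave module can read and treats 1 GB as 1024^3 bytes. The function names and the binary reading of "GB" are assumptions made for illustration; they are not part of the SpeechKit API.

```python
import os
import wave

MAX_DURATION_S = 4 * 60 * 60   # 4 hours, from the limits above
MAX_SIZE_BYTES = 1024 ** 3     # 1 GB; the binary interpretation is an assumption


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def within_async_limits(duration_s, size_bytes):
    """Check a duration and a size against the asynchronous recognition limits."""
    return duration_s <= MAX_DURATION_S and size_bytes <= MAX_SIZE_BYTES


def check_wav_file(path):
    """Hypothetical convenience wrapper: True if the WAV file fits both limits."""
    return within_async_limits(wav_duration_seconds(path), os.path.getsize(path))
```

Calling check_wav_file("speech.wav") on a local copy of the sample file would return True for any short recording.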
The examples use the API via cURL.
Getting started
- Create a bucket and upload the audio file you want to recognize to it.
- Create a service account.
  Warning
  You can recognize audio files asynchronously only as a service account. Do not use any other Yandex Cloud accounts for this purpose.
- Assign the storage.uploader and ai.speechkit-stt.user roles to the service account for the folder where you created the bucket.
- Get an API key or IAM token for your service account.
- Download a sample audio file:
  - For API v3: a WAV file.
  - For API v2: an LPCM file.
Speech recognition
API v3
- Get a link to an audio file in Object Storage.
- Create a file, e.g., request.json, and add the following code to it:

      {
        "uri": "https://storage.yandexcloud.net/<bucket_name>/<path_to_WAV_file_in_bucket>",
        "recognition_model": {
          "model": "general",
          "audio_format": {
            "container_audio": {
              "container_audio_type": "WAV"
            }
          }
        }
      }

  Where:
  - uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.wav. For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to pass these parameters to SpeechKit: they are ignored.
  - model: Speech recognition model.
  - container_audio_type: Audio container format.
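The request body can also be generated rather than typed by hand. The following is a minimal sketch, assuming the field layout shown in request.json; the function names are hypothetical, and stripping the query string is optional, since the text above says SpeechKit ignores those parameters anyway.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def strip_query(uri):
    """Drop the query string and fragment from an Object Storage link."""
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def build_v3_request(uri, model="general", container="WAV"):
    """Build a dict with the same structure as the request.json example."""
    return {
        "uri": strip_query(uri),
        "recognition_model": {
            "model": model,
            "audio_format": {
                "container_audio": {"container_audio_type": container}
            },
        },
    }


def write_request(path, body):
    """Save the request body as indented JSON, ready for curl --data @<path>."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(body, f, indent=2)
```

For example, write_request("request.json", build_v3_request("https://storage.yandexcloud.net/speechkit/speech.wav")) produces a file equivalent to the one above.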
- Run the request using one of the service account authentication methods:
  - With an IAM token:

        export IAM_TOKEN=<service_account_IAM_token> && \
        curl \
          --insecure \
          --header "Authorization: Bearer ${IAM_TOKEN}" \
          --data @request.json \
          https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync

    Where IAM_TOKEN is the IAM token of the service account.
  - With an API key. Use an API key if requesting an IAM token automatically is not an option:

        export API_KEY=<service_account_API_key> && \
        curl \
          --insecure \
          --header "Authorization: Api-Key ${API_KEY}" \
          --data @request.json \
          https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync

  Result example:

      {
        "id": "f8ddr61b30fk********",
        "description": "STT v3 async recognition",
        "createdAt": "2024-07-15T07:39:36Z",
        "createdBy": "ajehumcuv38h********",
        "modifiedAt": "2024-07-15T07:39:36Z",
        "done": false,
        "metadata": null
      }

  Save the recognition operation id you get in the response.
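When scripting this step, the two authentication variants differ only in the Authorization header, and the only value needed from the response is the operation id. A small sketch of both pieces follows, assuming the header formats and the response shape shown above; the helper names are hypothetical.

```python
import json


def auth_header(iam_token=None, api_key=None):
    """Return the Authorization header for exactly one of the two credentials."""
    if (iam_token is None) == (api_key is None):
        raise ValueError("pass exactly one of iam_token or api_key")
    if iam_token is not None:
        return {"Authorization": f"Bearer {iam_token}"}
    return {"Authorization": f"Api-Key {api_key}"}


def operation_id(response_text):
    """Extract the operation id from the JSON text returned by the request."""
    return json.loads(response_text)["id"]
```

Requiring exactly one credential keeps a script from silently preferring one method when both environment variables happen to be set.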
- Wait for the recognition to complete. It takes about 10 seconds to recognize one minute of audio.
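The quoted rate gives a quick way to estimate how long to wait before polling. This is only a back-of-the-envelope sketch: the 10 seconds per minute figure is the approximate value from the text, and the function name is hypothetical.

```python
SECONDS_PER_AUDIO_MINUTE = 10  # approximate processing rate quoted in the text


def estimated_wait_seconds(audio_duration_s):
    """Estimate the recognition time for audio of the given duration in seconds."""
    return audio_duration_s / 60 * SECONDS_PER_AUDIO_MINUTE
```

At this rate, a file at the 4-hour limit (14,400 seconds) would take roughly 2,400 seconds, or about 40 minutes.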
- Request information about the operation:
  - Authorization using an IAM token:

        curl \
          --insecure \
          --request GET \
          --header "Authorization: Bearer ${IAM_TOKEN}" \
          https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>

  - Authorization using an API key:

        curl \
          --insecure \
          --request GET \
          --header "Authorization: Api-Key ${API_KEY}" \
          https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>

  Result example:

      {
        "done": true,
        "id": "f8ddr61b30fk********",
        "description": "STT v3 async recognition",
        "createdAt": "2024-07-15T07:39:36Z",
        "createdBy": "ajehumcuv38h********",
        "modifiedAt": "2024-07-15T07:39:37Z"
      }
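Checking the operation by hand can be replaced with a polling loop that repeats the request until done is true. The sketch below is one possible approach, not an official client: fetch_operation performs the same GET request as the cURL command using the standard library, and wait_for_operation, its interval, and its timeout are hypothetical choices.

```python
import json
import time
import urllib.request


def fetch_operation(operation_id, headers):
    """GET the operation status from the Operation API and parse the JSON reply."""
    url = f"https://operation.api.cloud.yandex.net/operations/{operation_id}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_operation(fetch, operation_id, interval_s=10.0, timeout_s=3600.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call fetch(operation_id) until the returned dict has a true "done" field.

    Raises TimeoutError if the operation is still running after timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while True:
        operation = fetch(operation_id)
        if operation.get("done"):
            return operation
        if clock() >= deadline:
            raise TimeoutError(f"operation {operation_id} not done after {timeout_s} s")
        sleep(interval_s)
```

A typical call would be wait_for_operation(lambda op_id: fetch_operation(op_id, headers), "<recognition_operation_ID>"), where headers holds the Authorization header. Passing fetch as a parameter keeps the loop independent of the HTTP client and easy to exercise without network access.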
- Request the operation result:
  - Authorization using an IAM token:

        curl \
          --insecure \
          --request GET \
          --header "Authorization: Bearer ${IAM_TOKEN}" \
          "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"

  - Authorization using an API key:

        curl \
          --insecure \
          --request GET \
          --header "Authorization: Api-Key ${API_KEY}" \
          "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"

  Result example:

      {
        "result": {
          "sessionUuid": {
            "uuid": "24935f24-2c1f62dc-8dd49006-********",
            "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
            "receivedDataMs": "7400",
            "resetTimeMs": "0",
            "partialTimeMs": "7400",
            "finalTimeMs": "7400",
            "finalIndex": "0",
            "eouTimeMs": "0"
          },
          "responseWallTimeMs": "189",
          "final": {
            "alternatives": [
              {
                "words": [
                  { "text": "I", "startTimeMs": "459", "endTimeMs": "520" },
                  { "text": "Yandex", "startTimeMs": "640", "endTimeMs": "1060" },
                  { "text": "SpeechKit", "startTimeMs": "1120", "endTimeMs": "1959" },
                  { "text": "I", "startTimeMs": "2480", "endTimeMs": "2520" },
                  { "text": "can", "startTimeMs": "2580", "endTimeMs": "2800" },
                  { "text": "turn", "startTimeMs": "2860", "endTimeMs": "3360" },
                  { "text": "any", "startTimeMs": "3439", "endTimeMs": "3709" },
                  { "text": "text", "startTimeMs": "3800", "endTimeMs": "4140" },
                  { "text": "into", "startTimeMs": "4200", "endTimeMs": "4220" },
                  { "text": "speech", "startTimeMs": "4279", "endTimeMs": "4740" },
                  { "text": "now", "startTimeMs": "5140", "endTimeMs": "5759" },
                  { "text": "you", "startTimeMs": "5859", "endTimeMs": "5900" },
                  { "text": "can", "startTimeMs": "5980", "endTimeMs": "6399" },
                  { "text": "too", "startTimeMs": "6660", "endTimeMs": "7180" }
                ],
                "text": "I'm Yandex SpeechKit I can turn any text into speech now you can too",
                "startTimeMs": "0",
                "endTimeMs": "7400",
                "confidence": 0,
                "languages": []
              }
            ],
            "channelTag": "0"
          },
          "channelTag": "0"
        }
      }
      {
        "result": {
          "sessionUuid": {
            "uuid": "24935f24-2c1f62dc-8dd49006-********",
            "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
            "receivedDataMs": "7400",
            "resetTimeMs": "0",
            "partialTimeMs": "7400",
            "finalTimeMs": "7400",
            "finalIndex": "0",
            "eouTimeMs": "0"
          },
          "responseWallTimeMs": "189",
          "finalRefinement": {
            "finalIndex": "0",
            "normalizedText": {
              "alternatives": [
                {
                  "words": [
                    { "text": "I", "startTimeMs": "459", "endTimeMs": "520" },
                    { "text": "Yandex", "startTimeMs": "640", "endTimeMs": "1060" },
                    { "text": "SpeechKit", "startTimeMs": "1120", "endTimeMs": "1959" },
                    { "text": "I", "startTimeMs": "2480", "endTimeMs": "2520" },
                    { "text": "can", "startTimeMs": "2580", "endTimeMs": "2800" },
                    { "text": "turn", "startTimeMs": "2860", "endTimeMs": "3360" },
                    { "text": "any", "startTimeMs": "3439", "endTimeMs": "3709" },
                    { "text": "text", "startTimeMs": "3800", "endTimeMs": "4140" },
                    { "text": "into", "startTimeMs": "4200", "endTimeMs": "4220" },
                    { "text": "speech", "startTimeMs": "4279", "endTimeMs": "4740" },
                    { "text": "now", "startTimeMs": "5140", "endTimeMs": "5759" },
                    { "text": "you", "startTimeMs": "5859", "endTimeMs": "5900" },
                    { "text": "can", "startTimeMs": "5980", "endTimeMs": "6399" },
                    { "text": "too", "startTimeMs": "6660", "endTimeMs": "7180" }
                  ],
                  "text": "I'm Yandex SpeechKit I can turn any text into speech now you can too",
                  "startTimeMs": "0",
                  "endTimeMs": "7400",
                  "confidence": 0,
                  "languages": []
                }
              ],
              "channelTag": "0"
            }
          },
          "channelTag": "0"
        }
      }
      {
        "result": {
          "sessionUuid": {
            "uuid": "24935f24-2c1f62dc-8dd49006-********",
            "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
            "receivedDataMs": "7400",
            "resetTimeMs": "0",
            "partialTimeMs": "7400",
            "finalTimeMs": "7400",
            "finalIndex": "0",
            "eouTimeMs": "7400"
          },
          "responseWallTimeMs": "190",
          "eouUpdate": {
            "timeMs": "7400"
          },
          "channelTag": "0"
        }
      }
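The result in the example is not a single JSON document but several objects one after another, so json.loads on the whole body would fail. The sketch below shows one way to read such a body, assuming the objects are separated only by whitespace and that recognized text sits in the final and finalRefinement.normalizedText fields, as in the example. The function names are hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of whitespace-separated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so advance past it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def final_texts(text):
    """Collect the text of every alternative found in "final" messages."""
    texts = []
    for message in iter_json_objects(text):
        final = message.get("result", {}).get("final", {})
        texts.extend(alt["text"] for alt in final.get("alternatives", []))
    return texts


def normalized_texts(text):
    """Collect the text of every alternative found in "finalRefinement" messages."""
    texts = []
    for message in iter_json_objects(text):
        refinement = message.get("result", {}).get("finalRefinement", {})
        normalized = refinement.get("normalizedText", {})
        texts.extend(alt["text"] for alt in normalized.get("alternatives", []))
    return texts
```

Messages that carry neither field, such as the eouUpdate message in the example, are skipped.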
API v2
- Get a link to an audio file in Object Storage.
- Create a file named body.json and add the following code to it:

      {
        "config": {
          "specification": {
            "languageCode": "ru-RU",
            "model": "general",
            "audioEncoding": "LINEAR16_PCM",
            "sampleRateHertz": 8000,
            "audioChannelCount": 1
          }
        },
        "audio": {
          "uri": "<link_to_audio_file>"
        }
      }

  Where:
  - languageCode: Recognition language.
  - model: Speech recognition model.
  - audioEncoding: Format of the submitted audio file.
  - sampleRateHertz: Audio file sampling rate in Hz.
  - audioChannelCount: Number of audio channels.
  - uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.pcm. For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to pass these parameters to SpeechKit: they are ignored.
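Unlike WAV, a raw LPCM file has no header, which is why the sampling rate and channel count have to be stated in the request. The same two values also determine the duration of the file from its size, since LINEAR16_PCM uses 2 bytes per sample. The sketch below illustrates both points; the function names are hypothetical, and the default values simply mirror the body.json example.

```python
BYTES_PER_SAMPLE = 2  # LINEAR16_PCM stores each sample as a 16-bit integer


def lpcm_duration_seconds(size_bytes, sample_rate_hertz, channel_count):
    """Duration of headerless 16-bit PCM audio, derived from its size in bytes."""
    return size_bytes / (sample_rate_hertz * channel_count * BYTES_PER_SAMPLE)


def build_v2_body(uri, sample_rate_hertz=8000, channel_count=1,
                  language_code="ru-RU", model="general"):
    """Build a dict with the same structure as the body.json example."""
    return {
        "config": {
            "specification": {
                "languageCode": language_code,
                "model": model,
                "audioEncoding": "LINEAR16_PCM",
                "sampleRateHertz": sample_rate_hertz,
                "audioChannelCount": channel_count,
            }
        },
        "audio": {"uri": uri},
    }
```

For instance, 16,000 bytes of mono audio sampled at 8000 Hz correspond to exactly one second.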
- Send the recognition request with the file you created:

      export API_KEY=<service_account_API_key> && \
      curl \
        --insecure \
        --header "Authorization: Api-Key ${API_KEY}" \
        --data "@body.json" \
        https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize

  Result example:

      {
        "done": false,
        "id": "e03sup6d5h1q********",
        "createdAt": "2019-04-21T22:49:29Z",
        "createdBy": "ajes08feato8********",
        "modifiedAt": "2019-04-21T22:49:29Z"
      }

  Save the recognition operation id you get in the response.
- Wait for the recognition to complete. It takes about 10 seconds to recognize one minute of single-channel audio.
- Send a request to get information about the operation:

      curl \
        --insecure \
        --header "Authorization: Api-Key ${API_KEY}" \
        https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>

  Result example:

      {
        "done": true,
        "response": {
          "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
          "chunks": [
            {
              "alternatives": [
                {
                  "words": [
                    { "startTime": "0.160s", "endTime": "0.500s", "word": "hello", "confidence": 1 },
                    { "startTime": "0.580s", "endTime": "0.800s", "word": "world", "confidence": 1 }
                  ],
                  "text": "Hello world",
                  "confidence": 1
                }
              ],
              "channelTag": "1"
            }
          ]
        },
        "id": "e03jjenu23uc********",
        "createdAt": "2024-08-22T11:39:22Z",
        "createdBy": "aje3bg430agh********",
        "modifiedAt": "2024-08-22T11:39:23Z"
      }
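Once the operation is done, the transcript has to be assembled from the chunks in the response. The following is a minimal sketch, assuming the response shape shown in the example and taking the first alternative of each chunk as the preferred one (an assumption, since the example has only one alternative per chunk). The function names are hypothetical.

```python
def parse_seconds(value):
    """Convert a time string such as "0.160s" to a float number of seconds."""
    return float(value.rstrip("s"))


def texts_by_channel(operation):
    """Join the first alternative's text of every chunk, grouped by channelTag."""
    channels = {}
    for chunk in operation.get("response", {}).get("chunks", []):
        alternatives = chunk.get("alternatives", [])
        if alternatives:
            channels.setdefault(chunk.get("channelTag"), []).append(alternatives[0]["text"])
    return {tag: " ".join(parts) for tag, parts in channels.items()}
```

Grouping by channelTag keeps the speakers of a multi-channel recording apart instead of interleaving their chunks into one string.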