Data search via the gRPC API
Getting started
To use the Yandex Cloud API, you will need Git, Python 3.6 or higher, and the grpcio-tools package. Learn how to install Python.
To search data using the Yandex Cloud gRPC API:
- In the management console, create a service account.
- Add the service account to the namespace with the Data editor role. This will allow the service account to upload data to SpeechSense.
- To authenticate to the Yandex Cloud API, create an API key or IAM token for the service account.
- Clone the Yandex Cloud API repository:

    git clone https://github.com/yandex-cloud/cloudapi

- Install the grpcio-tools package using the pip package manager:

    pip install grpcio-tools

- Upload voice call recordings or chat transcripts to SpeechSense.
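The API key or IAM token created in the steps above is sent with every gRPC call as an authorization metadata entry. The search_data.py script in this guide uses the values Api-Key <key> and Bearer <token> for this. The following is a minimal sketch of a helper that builds that entry; the function name auth_metadata and its scheme argument are hypothetical and not part of the Yandex Cloud API.

```python
from typing import Tuple


def auth_metadata(secret: str, scheme: str = "api-key") -> Tuple[Tuple[str, str], ...]:
    """Build gRPC call metadata for an API key or an IAM token.

    `scheme` is a hypothetical switch: "api-key" or "iam-token".
    """
    if not secret:
        raise ValueError("secret must not be empty")
    if scheme == "api-key":
        value = f"Api-Key {secret}"
    elif scheme == "iam-token":
        value = f"Bearer {secret}"
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    # gRPC expects metadata as a sequence of (key, value) pairs.
    return (("authorization", value),)
```

The returned tuple can be passed as the metadata argument of a stub call, which keeps the header format in one place if you switch between the two credential types.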
Finding data
- Go to the directory hosting the Yandex Cloud API repository, create a directory named search_data, and generate the client interface code in it. Then, go to the search_data directory:

    cd <path_to_cloudapi_directory> && \
    mkdir search_data && \
    python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
      --python_out=./search_data/ \
      --grpc_python_out=./search_data/ \
      google/api/http.proto \
      google/api/annotations.proto \
      yandex/cloud/api/operation.proto \
      google/rpc/status.proto \
      yandex/cloud/operation/operation.proto \
      yandex/cloud/validation.proto \
      yandex/cloud/speechsense/v1/*.proto \
      yandex/cloud/speechsense/v1/*/*.proto
    cd search_data
- In the search_data directory, create the search_data.py Python script that will search for conversations in SpeechSense.

  Example of the search_data.py script with filtering and full-text search:

    import argparse
    import re
    from datetime import datetime
    from typing import NamedTuple

    import grpc
    from google.protobuf.field_mask_pb2 import FieldMask
    from google.protobuf.json_format import MessageToJson

    from yandex.cloud.speechsense.v1 import search_pb2
    from yandex.cloud.speechsense.v1 import talk_service_pb2
    from yandex.cloud.speechsense.v1 import talk_service_pb2_grpc


    class IntRangeFilter(NamedTuple):
        key: str
        lower_bound: int
        lb_inclusive: bool
        upper_bound: int
        ub_inclusive: bool


    def parse_int_range(s: str) -> IntRangeFilter:
        pattern = r'(-?\d+)(<=|<)(\w+)(<=|<)(-?\d+)'
        match = re.match(pattern, s)

        if not match:
            raise ValueError(f"Could not parse int range from: '{s}'")

        lower_bound = int(match.group(1))
        lower_bound_inclusive = match.group(2) == "<="
        key = match.group(3)
        upper_bound_inclusive = match.group(4) == "<="
        upper_bound = int(match.group(5))

        return IntRangeFilter(
            key=key,
            lower_bound=lower_bound,
            lb_inclusive=lower_bound_inclusive,
            upper_bound=upper_bound,
            ub_inclusive=upper_bound_inclusive
        )


    def build_search_request(
            organization_id: str,
            space_id: str,
            connection_id: str,
            project_id: str,
            query=None,
            from_date=None,
            to_date=None,
            match_filter=None,
            classifier_filter=None,
            page_size=100,
            page_token='') -> talk_service_pb2.SearchTalkRequest:
        request = talk_service_pb2.SearchTalkRequest(
            organization_id=organization_id,
            space_id=space_id,
            connection_id=connection_id,
            project_id=project_id,
            page_size=page_size,
            page_token=page_token
        )

        # Adding a full-text search request
        if query:
            request.query.text = query

        # Adding a filter by date
        if from_date:
            date_filter = search_pb2.DateRangeFilter()
            date_filter.from_value.FromDatetime(datetime.fromisoformat(from_date))
            request.filters.append(search_pb2.Filter(key="userMeta.date", date_range=date_filter))
        if to_date:
            date_filter = search_pb2.DateRangeFilter()
            date_filter.to_value.FromDatetime(datetime.fromisoformat(to_date))
            request.filters.append(search_pb2.Filter(key="userMeta.date", date_range=date_filter))

        # Adding a match filter
        if match_filter:
            # key:value; split on the first colon only so that the value may contain colons
            key, value = match_filter.split(':', 1)
            any_match_filter = search_pb2.AnyMatchFilter()
            any_match_filter.values.append(value)
            request.filters.append(search_pb2.Filter(key=key, any_match=any_match_filter))

        # Adding a classifier filter
        if classifier_filter:
            filter_values = parse_int_range(classifier_filter)
            int_range_filter = search_pb2.IntRangeFilter()
            int_range_filter.from_value.value = filter_values.lower_bound
            int_range_filter.to_value.value = filter_values.upper_bound
            int_range_filter.bounds_inclusive.from_inclusive = filter_values.lb_inclusive
            int_range_filter.bounds_inclusive.to_inclusive = filter_values.ub_inclusive
            request.filters.append(
                search_pb2.Filter(key='talk.classifiers.' + filter_values.key + '.count',
                                  int_range=int_range_filter))

        return request


    # For IAM token authentication, replace the `api_key` parameter with `iam_token`
    def print_talks(
            api_key: str,
            organization_id: str,
            space_id: str,
            connection_id: str,
            project_id: str,
            query=None,
            from_date=None,
            to_date=None,
            match_filter=None,
            classifier_filter=None
    ):
        credentials = grpc.ssl_channel_credentials()
        channel = grpc.secure_channel('api.speechsense.yandexcloud.net:443', credentials)
        talk_service_stub = talk_service_pb2_grpc.TalkServiceStub(channel)

        page_token = ''
        while True:
            search_request = build_search_request(
                organization_id=organization_id,
                space_id=space_id,
                connection_id=connection_id,
                project_id=project_id,
                query=query,
                from_date=from_date,
                to_date=to_date,
                match_filter=match_filter,
                classifier_filter=classifier_filter,
                page_token=page_token)

            # Searching for IDs of suitable conversations
            search_response = talk_service_stub.Search(search_request, metadata=(
                ('authorization', f'Api-Key {api_key}'),
                # For IAM token authentication, provide the header
                # ('authorization', f'Bearer {iam_token}'),
            ))
            page_token = search_response.next_page_token
            # print(f'found talks {search_response.talks_count}')

            # By default, only the basic fields of the conversation will be returned.
            # To include the analysis results, add them to the request
            fields_to_include = FieldMask(
                paths=['transcription', 'speech_statistics', 'silence_statistics',
                       'interrupts_statistics', 'conversation_statistics', 'points',
                       'text_classifiers'])

            # Requesting full data on conversations by ID
            get_request = talk_service_pb2.GetTalkRequest(
                organization_id=organization_id,
                space_id=space_id,
                connection_id=connection_id,
                project_id=project_id,
                talk_ids=search_response.talk_ids,
                results_mask=fields_to_include
            )
            get_response = talk_service_stub.Get(get_request, metadata=(
                ('authorization', f'Api-Key {api_key}'),
                # For IAM token authentication, provide the header
                # ('authorization', f'Bearer {iam_token}'),
            ))

            # Printing conversations
            for talk in get_response.talk:
                print(MessageToJson(talk, ensure_ascii=False))

            # If the token is empty, we have reached the last result page
            if not page_token:
                break


    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--key', required=True, help='API key or IAM token', type=str)
        parser.add_argument('--organization-id', required=True, help='Organization ID', type=str)
        parser.add_argument('--space-id', required=True, help='Space ID', type=str)
        parser.add_argument('--connection-id', required=True, help='Connection ID', type=str)
        parser.add_argument('--project-id', required=True, help='Project ID', type=str)
        parser.add_argument('--query', required=False, help='Full-text search query', type=str)
        parser.add_argument('--match-filter', required=False,
                            help='Simple match filter in format key:value', type=str)
        parser.add_argument('--classifier-filter', required=False,
                            help='range for classifier X in format fromValue</<=X</<=toValue', type=str)
        parser.add_argument('--before', required=False, help='Search for talks before timestamp', type=str)
        parser.add_argument('--after', required=False, help='Search for talks after timestamp', type=str)

        args = parser.parse_args()

        print_talks(args.key, args.organization_id, args.space_id, args.connection_id, args.project_id,
                    query=args.query, from_date=args.after, to_date=args.before,
                    match_filter=args.match_filter, classifier_filter=args.classifier_filter)
- Specify the service account's API key:

    export API_KEY=<service_account_API_key>

  If using an IAM token, provide it instead of the API key:

    export IAM_TOKEN=<service_account_IAM_token>
- Run the search_data.py script with the parameters you need:

    python3 search_data.py \
      --organization-id <organization_ID> \
      --space-id <space_ID> \
      --connection-id <connection_ID> \
      --project-id <project_ID> \
      --key ${API_KEY}

  Where:
  - --organization-id: ID of the organization the request takes place in. To get the ID, go to Cloud Center and click under the name of the organization in the section.
  - --space-id: ID of the space the request takes place in. To get the ID, go to SpeechSense, open the page of the space you need, and click ID.
  - --connection-id: ID of the connection the request takes place in. To get the ID, go to SpeechSense and open the page of the space you need. On the Connection tab, open the page of the connection and click ID.
  - --project-id: ID of the project the request takes place in. To get the ID, go to SpeechSense and open the page of the space you need. On the Projects tab, open the page of the project and click ID.
  - --key: API key for authentication. If using an IAM token, specify the IAM_TOKEN environment variable instead of API_KEY.
When the script runs, it prints each conversation it finds in JSON format.
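The script pages through the results by sending the next_page_token from each search response back as page_token, and it stops when the token comes back empty. The sketch below isolates that loop so it can be read and tested without a network connection. The names iter_talk_ids, fetch_page, and the fake page data are hypothetical; fetch_page stands in for the Search call and is not part of the SpeechSense API.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# A page is (talk_ids, next_page_token); an empty token marks the last page.
Page = Tuple[List[str], str]


def iter_talk_ids(fetch_page: Callable[[str], Page]) -> Iterator[str]:
    """Yield talk IDs from every result page, following next_page_token."""
    page_token = ""  # an empty token requests the first page
    while True:
        talk_ids, page_token = fetch_page(page_token)
        yield from talk_ids
        if not page_token:
            break


# Fake backend for illustration only: three pages keyed by the incoming token.
_FAKE_PAGES: Dict[str, Page] = {
    "": (["talk-1", "talk-2"], "p2"),
    "p2": (["talk-3"], "p3"),
    "p3": (["talk-4"], ""),
}


def fake_fetch_page(page_token: str) -> Page:
    """Stand-in for a Search call; returns a canned page for the given token."""
    return _FAKE_PAGES[page_token]
```

Calling list(iter_talk_ids(fake_fetch_page)) walks all three fake pages in order. In a real client, fetch_page would wrap the stub call and return the response's talk IDs and next page token.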
Query parameters
Search query
- Query: Full-text search that allows you to search through an audio text transcript or chat text messages.
- Filter: Allows you to search by user metadata, classifiers, conversation summary, or statistics.

  Under Filter, provide the dialog feature you are searching by in the key field.

  - userMeta.<field_name>: Searching by user metadata. Where <field_name> is the user metadata field that was specified when uploading the conversation, e.g., userMeta.date. The filter type must match the metadata field type, which you selected when creating the connection.
  - talk.classifiers.<classifier_name>.count: Searching by classifiers.
  - talk.summarization.points.<question_ID>: Searching by conversation summary. You can view IDs of the questions from the conversation summary in the response of the GET request.
  - Searching by statistics (only for audio):
    - talk.statistics.duration_seconds: Duration of the conversation.
    - talk.statistics.simultaneous_silence.duration_seconds, talk.statistics.simultaneous_silence.ratio: Simultaneous silence in seconds or as a percentage.
    - talk.statistics.simultaneous_speech.duration_seconds, talk.statistics.simultaneous_speech.ratio: Simultaneous speech in seconds or as a percentage.
    - talk.statistics.interrupts.count: Number of conversation partner interruptions.
    - talk.statistics.phrases.count, talk.statistics.words.count, talk.statistics.letters.count: Number of phrases, words, or characters in the conversation.
    - talk.statistics.words.count_per_second, talk.statistics.letters.count_per_second: Number of words or characters per second in the specified channel (the channel should be specified in the filter).
    - talk.statistics.interrupts.duration_seconds: Duration of interruption by the specified conversation channel (the channel should be specified in the filter), in seconds.
- Under Filter and Query, you can provide a channel. For a full-text search, this means that the search will only be conducted in the text transcript for the specified channel. For filters, this means that filtering will be done only by metadata, classifier positives, or statistics related to this channel.

  In connections for chats, the channels are numbered as follows:
  - 0: Agent channel.
  - 1: Customer channel.
  - 2: Bot channel.

  In connections for audio recordings, the channels have a preset numbering.
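The filter keys above follow a few fixed string patterns, and chat channels use fixed numbers. The following sketch collects both in small helpers to avoid typos when building requests. All names here (ChatChannel, user_meta_key, classifier_key, summary_point_key) are hypothetical conveniences and not part of the generated API code. How a channel is attached to a filter or query is not shown, because the field name should be taken from the API reference.

```python
from enum import IntEnum


class ChatChannel(IntEnum):
    """Channel numbers for chat connections, as listed in this section."""
    AGENT = 0
    CUSTOMER = 1
    BOT = 2


def user_meta_key(field_name: str) -> str:
    """Filter key for a user metadata field, e.g. userMeta.date."""
    return f"userMeta.{field_name}"


def classifier_key(classifier_name: str) -> str:
    """Filter key for the number of times a classifier was triggered."""
    return f"talk.classifiers.{classifier_name}.count"


def summary_point_key(question_id: str) -> str:
    """Filter key for a conversation summary question."""
    return f"talk.summarization.points.{question_id}"
```

For example, user_meta_key("date") produces the same userMeta.date key that the search_data.py script uses for its date filters.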
Other types of filters in the search request:
- AnyMatchFilter: Specifies whether the metadata, classifier, statistics, or conversation summary fields contain the value from the filter. For example, a filter with the key = userMeta.ticket_id and values = [123, 345] parameters will find conversations with 123 or 345 in the ticket_id metadata field.
- IntRangeFilter: Checks if the keyed integer value belongs to the specified range. Suitable for searching by classifiers, integer metadata fields, and some types of statistics.
- DoubleRangeFilter: Same as IntRangeFilter but used for floating-point numbers. Suitable for searching by some types of statistics and metadata fields of the required type.
- BooleanFilter: Checks if the keyed boolean has the suitable value (True or False). Suitable for searching by conversation summary and boolean metadata fields.
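To make the match and range semantics concrete, here is a local model in plain Python. It is not the service's implementation and only mirrors the descriptions above. The inclusive and exclusive bounds correspond to the from_inclusive and to_inclusive flags that the search_data.py script sets on IntRangeFilter. The function names are hypothetical, and treating AnyMatchFilter as an exact comparison is an assumption.

```python
from typing import Iterable, Union

Number = Union[int, float]


def in_range(value: Number, lower: Number, upper: Number,
             lower_inclusive: bool = True, upper_inclusive: bool = True) -> bool:
    """Check whether value lies between lower and upper with the given bound types."""
    above = value >= lower if lower_inclusive else value > lower
    below = value <= upper if upper_inclusive else value < upper
    return above and below


def any_match(field_value: str, values: Iterable[str]) -> bool:
    """Return True if the field equals any of the filter values (assumed exact match)."""
    return field_value in set(values)
```

With this model, a classifier filter written as 0<X<=5 in the script's command-line format corresponds to in_range(x, 0, 5, lower_inclusive=False, upper_inclusive=True).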
For more information about search query parameters, see the API reference.
GET request
In the GET request, one important parameter is the result mask. In the search_data.py script, the mask is built in the fields_to_include variable and passed in the results_mask field of the request. If no mask is provided, you will get only the basic information about the conversation, such as the IDs of the project, connection, and space, information about when and by whom the conversation was created or modified, and the metadata added when uploading the conversation.
To get additional information, provide the keys you need in the mask:
- transcription: Text transcript of the audio recording or text messages from the chat.
- speech_statistics: Speech statistics.
- silence_statistics: Statistics of pauses in the conversation.
- interrupts_statistics: Statistics of conversation partner interruptions.
- conversation_statistics: Conversation statistics.
- points: Conversation summary.
- text_classifiers: Statistics for classifiers (tags).
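A small helper can catch typos in mask keys before a request is built. This is only a sketch: build_mask_paths and KNOWN_MASK_KEYS are hypothetical names, and the allowed set is just the keys listed above. The resulting list would be passed to FieldMask(paths=...) in the same way the search_data.py script does.

```python
from typing import Iterable, List

# Mask keys listed in this section; the service may accept others.
KNOWN_MASK_KEYS = (
    "transcription",
    "speech_statistics",
    "silence_statistics",
    "interrupts_statistics",
    "conversation_statistics",
    "points",
    "text_classifiers",
)


def build_mask_paths(requested: Iterable[str]) -> List[str]:
    """Return the requested keys in a stable order without duplicates.

    Raises ValueError if a key is not in KNOWN_MASK_KEYS.
    """
    requested_set = set(requested)
    unknown = requested_set - set(KNOWN_MASK_KEYS)
    if unknown:
        raise ValueError(f"unknown mask keys: {sorted(unknown)}")
    return [key for key in KNOWN_MASK_KEYS if key in requested_set]
```

Requesting only the keys you need keeps responses smaller, since each key adds another block of analysis results to every returned conversation.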
For more information about GET request parameters, see the API reference.