Using embeddings to search through the knowledge base
Embeddings make it possible to search a knowledge base for the most relevant answer to your question.
Getting started
To use the examples:

For the ML SDK example:
- Create a service account and assign the ai.languageModels.user role to it.
- Get the service account API key and save it.
  The following examples use API key authentication. Yandex Cloud ML SDK also supports IAM token and OAuth token authentication. For more information, see Authentication in Yandex Cloud ML SDK.
- Use the pip package manager to install the ML SDK library:

    pip install yandex-cloud-ml-sdk

For the API example:
- Create a service account and assign the ai.languageModels.user role to it.
- Get an IAM token for your service account.
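Instead of pasting the folder ID and the API key (or IAM token) directly into a script, you may prefer to read them from environment variables. The sketch below shows one way to do that. The variable names YC_FOLDER_ID and YC_API_KEY and the load_credentials helper are assumptions made for this illustration; they are not defined by Yandex Cloud or ML SDK.

```python
import os


def load_credentials(env=None):
    """Return (folder_id, secret) read from environment variables.

    YC_FOLDER_ID and YC_API_KEY are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    required = ("YC_FOLDER_ID", "YC_API_KEY")
    # Collect every missing or empty variable so the error names all of them.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["YC_FOLDER_ID"], env["YC_API_KEY"]
```

The returned values can then be passed as the folder_id and auth arguments of YCloudML, which keeps secrets out of the source file.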
Run the search
In the example, the doc_texts array contains the source data for vectorization, i.e., the knowledge base, while the query_text variable contains the search query text. After you get the embeddings, calculate the distance between the query vector and each vector in the knowledge base. The text with the smallest cosine distance, i.e., the highest cosine similarity, is the one most closely related to the query.
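To make the distance step concrete, here is a minimal sketch that computes cosine similarity with NumPy alone. It uses toy three-dimensional vectors in place of real embeddings, and the cosine_similarity helper is a name introduced for this illustration. Cosine distance, as computed by cdist with metric="cosine", is one minus this value.

```python
import numpy as np


def cosine_similarity(query_vec, doc_vecs):
    """Return the cosine similarity between query_vec and each row of doc_vecs."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Dot product of the query with every document vector.
    dots = doc_vecs @ query_vec
    # Product of vector lengths; assumes no vector is all zeros.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    return dots / norms


# Toy vectors standing in for real embeddings (illustrative values only).
toy_query = [1.0, 0.0, 1.0]
toy_docs = [[2.0, 0.0, 2.0], [0.0, 1.0, 0.0]]

similarities = cosine_similarity(toy_query, toy_docs)
best_index = int(np.argmax(similarities))
print(best_index)  # 0: the first toy document points in the same direction as the query
```

Because cosine similarity depends only on direction, the first toy document scores 1.0 even though it is twice as long as the query vector.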
Note
As input data for a request, Yandex Cloud ML SDK can accept a string, a dictionary, an object of the TextMessage class, or an array containing any combination of these data types. For more information, see Yandex Cloud ML SDK usage.
To run the search with ML SDK:
- Create a file named search-knowledge-base.py and paste the following code into it:

    #!/usr/bin/env python3
    # pylint: disable=import-outside-toplevel

    from __future__ import annotations
    from yandex_cloud_ml_sdk import YCloudML

    # Knowledge base: the texts to vectorize
    doc_texts = [
        """Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.""",
        """Matricaria is a genus of annual flowering plants of the aster (composite) family. According to today's classification, it includes around 70 species of low-rise fragrant herbs that blossom from the first year of life.""",
    ]
    # Search query
    query_text = "when is Pushkin's birthday?"


    def main():
        import numpy as np
        from scipy.spatial.distance import cdist

        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )

        # Embed the query with the query model
        query_model = sdk.models.text_embeddings("query")
        query_embedding = query_model.run(query_text)

        # Embed each knowledge base text with the document model
        doc_model = sdk.models.text_embeddings("doc")
        doc_embeddings = [doc_model.run(text) for text in doc_texts]

        query_embedding = np.array(query_embedding)

        # Cosine distance between the query and every document
        dist = cdist([query_embedding], doc_embeddings, metric="cosine")
        # Cosine similarity is one minus cosine distance
        sim = 1 - dist
        # Print the text most similar to the query
        result = doc_texts[np.argmax(sim)]
        print(result)


    if __name__ == "__main__":
        main()
  Where:
  - <folder_ID>: ID of the folder in which the service account was created.
  - <API_key>: Service account API key you got earlier, required for authentication in the API.

- Run the created file:

    python3 search-knowledge-base.py

  Result:

    Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.
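The example prints only the single best match. With a larger knowledge base, you may want several candidates and a way to reuse document vectors across queries. The following sketch shows one possible approach under those assumptions: document vectors are normalized once, so a plain dot product with a normalized query gives cosine similarity. The normalize_rows and top_k_matches helpers are hypothetical names introduced here, and the vectors are toy values rather than real embeddings.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale every row to unit length; assumes no row is all zeros."""
    matrix = np.asarray(matrix, dtype=float)
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def top_k_matches(query_vec, normalized_docs, k=3):
    """Return (index, similarity) pairs for the k most similar documents."""
    query_vec = np.asarray(query_vec, dtype=float)
    # For unit vectors, the dot product equals cosine similarity.
    sims = normalized_docs @ (query_vec / np.linalg.norm(query_vec))
    # Sort ascending, reverse to descending, keep the first k indexes.
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy 2-dimensional vectors standing in for document embeddings.
toy_docs = normalize_rows([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
matches = top_k_matches([1.0, 0.0], toy_docs, k=2)
print(matches)
```

Each returned index can be used to look up the matching text in doc_texts.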
To run the search by calling the API directly with an IAM token:
- Create a file named search-knowledge-base.py and paste the following code into it:

    import requests
    import numpy as np
    from scipy.spatial.distance import cdist

    FOLDER_ID = "<folder_ID>"
    IAM_TOKEN = "<IAM_token>"

    # Model URIs for document and query embeddings
    doc_uri = f"emb://{FOLDER_ID}/text-search-doc/latest"
    query_uri = f"emb://{FOLDER_ID}/text-search-query/latest"

    embed_url = "https://llm.api.cloud.yandex.net:443/foundationModels/v1/textEmbedding"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {IAM_TOKEN}",
        "x-folder-id": f"{FOLDER_ID}",
    }

    # Knowledge base: the texts to vectorize
    doc_texts = [
        """Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.""",
        """Matricaria is a genus of annual flowering plants of the aster (composite) family. According to today's classification, it includes around 70 species of low-rise fragrant herbs that blossom from the first year of life.""",
    ]
    # Search query
    query_text = "when is Pushkin's birthday?"


    def get_embedding(text: str, text_type: str = "doc") -> np.ndarray:
        query_data = {
            "modelUri": doc_uri if text_type == "doc" else query_uri,
            "text": text,
        }
        response = requests.post(embed_url, json=query_data, headers=headers)
        # Stop on HTTP errors instead of failing later on a missing key
        response.raise_for_status()
        return np.array(response.json()["embedding"])


    query_embedding = get_embedding(query_text, text_type="query")
    docs_embedding = [get_embedding(doc_text) for doc_text in doc_texts]

    # Cosine distance between the query and every document
    dist = cdist(query_embedding[None, :], docs_embedding, metric="cosine")

    # Cosine similarity is one minus cosine distance
    sim = 1 - dist

    # Print the text most similar to the query
    print(doc_texts[np.argmax(sim)])

  Where:
  - <folder_ID>: Yandex Cloud folder ID.
  - <IAM_token>: Service account IAM token you got earlier, required for authentication in the API.

- Run the created file:

    python3 search-knowledge-base.py

  Result:

    Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.
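To see what the API example sends without making a network call, the request can be assembled separately from the code that posts it. The sketch below does this. The endpoint, model URI patterns, and headers are taken from the example above; the build_embedding_request helper and its argument names are assumptions introduced for illustration.

```python
EMBED_URL = "https://llm.api.cloud.yandex.net:443/foundationModels/v1/textEmbedding"


def build_embedding_request(folder_id, iam_token, text, text_type="doc"):
    """Return (url, headers, payload) for a single text embedding request."""
    if text_type not in ("doc", "query"):
        raise ValueError("text_type must be 'doc' or 'query'")
    # Documents and queries use different models, selected through the URI.
    model_uri = f"emb://{folder_id}/text-search-{text_type}/latest"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {iam_token}",
        "x-folder-id": folder_id,
    }
    payload = {"modelUri": model_uri, "text": text}
    return EMBED_URL, headers, payload


# Placeholder values only; no request is sent here.
url, headers, payload = build_embedding_request(
    "<folder_ID>", "<IAM_token>", "when is Pushkin's birthday?", text_type="query"
)
print(payload["modelUri"])
```

The resulting values can be passed to requests.post(url, json=payload, headers=headers), and inspecting the payload first makes it easier to confirm that queries and documents go to the intended model.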
See also
- Text vectorization
- Examples of working with ML SDK on GitHub