Using embeddings to search through the knowledge base

Written by Yandex Cloud

Updated at December 27, 2024

Embeddings make it possible to search a knowledge base for the most relevant answer to your question.

Getting started

To use the examples:

SDK

Create a service account and assign the ai.languageModels.user role to it.
Get the service account API key and save it.

The following examples use API key authentication. Yandex Cloud ML SDK also supports IAM token and OAuth token authentication. For more information, see Authentication in Yandex Cloud ML SDK.
Use the pip package manager to install the ML SDK library:
```
pip install yandex-cloud-ml-sdk
```

Python 3

Create a service account and assign the ai.languageModels.user role to it.
Get an IAM token for your service account.

Run the search

In the example, the doc_texts array holds the source data for vectorization (the knowledge base), and the query_text variable holds the search query. After you get the embeddings, calculate the distance between each knowledge base vector and the query vector: the text whose vector is closest to the query vector is the most relevant one.
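The distance calculation described above can be tried without calling the service. The following is a minimal sketch that uses made-up three-dimensional vectors in place of real embeddings; the `cosine_similarity` helper and the `toy_*` names are illustrative assumptions, not part of the ML SDK.

```python
import numpy as np


def cosine_similarity(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Return the cosine similarity between one query vector and each row of docs."""
    query_unit = query / np.linalg.norm(query)
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs_unit @ query_unit


# Made-up vectors standing in for real embeddings (assumption for illustration).
toy_query = np.array([1.0, 0.0, 1.0])
toy_docs = np.array([
    [2.0, 0.0, 2.0],  # points in the same direction as the query
    [0.0, 3.0, 0.0],  # orthogonal to the query
])

toy_sim = cosine_similarity(toy_query, toy_docs)
best_index = int(np.argmax(toy_sim))  # index of the closest document
print(toy_sim, best_index)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why the first toy document matches the query even though its values are twice as large.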

Note

As input data for a request, Yandex Cloud ML SDK can accept a string, a dictionary, an object of the TextMessage class, or an array containing any combination of these data types. For more information, see Yandex Cloud ML SDK usage.

SDK

Create a file named search-knowledge-base.py and paste the following code into it:

```python
#!/usr/bin/env python3
# pylint: disable=import-outside-toplevel

from __future__ import annotations
from yandex_cloud_ml_sdk import YCloudML

# Knowledge base: the source texts to vectorize.
doc_texts = [
    """Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg)
    was a Russian poet, playwright, and novelist, the progenitor of Russian realism,
    a literary critic and theorist, historian, essay writer, and journalist.""",
    """Matricaria is a genus of annual flowering plants of the aster (composite) family. According to the current classification, it includes around 70 species of low-growing fragrant herbs that blossom from the first year of life.""",
]
# Search query text.
query_text = "when is Pushkin's birthday?"


def main():
    import numpy as np
    from scipy.spatial.distance import cdist

    sdk = YCloudML(
        folder_id="<folder_ID>",
        auth="<API_key>",
    )

    # Embed the query with the query model.
    query_model = sdk.models.text_embeddings("query")
    query_embedding = query_model.run(query_text)

    # Embed each knowledge base text with the document model.
    doc_model = sdk.models.text_embeddings("doc")
    doc_embeddings = [doc_model.run(text) for text in doc_texts]

    query_embedding = np.array(query_embedding)

    # Cosine distance between the query and each document;
    # cosine similarity is 1 minus the distance.
    dist = cdist([query_embedding], doc_embeddings, metric="cosine")
    sim = 1 - dist
    # Print the text with the highest similarity.
    result = doc_texts[np.argmax(sim)]
    print(result)


if __name__ == "__main__":
    main()
```

Where:

<folder_ID>: ID of the folder in which the service account was created.
<API_key>: Service account API key you got earlier. It is required for authentication in the API.
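Instead of pasting the folder ID and API key into the script, you may prefer to read them from the environment. The following is a minimal sketch; the `load_credentials` helper and the `YC_FOLDER_ID` and `YC_API_KEY` variable names are assumptions chosen for this illustration, not names defined by Yandex Cloud.

```python
import os
from typing import Mapping


def load_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Read the folder ID and API key from environment variables.

    YC_FOLDER_ID and YC_API_KEY are hypothetical names used for this sketch.
    """
    missing = [name for name in ("YC_FOLDER_ID", "YC_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # The keys match the folder_id and auth arguments used in the example above.
    return {"folder_id": env["YC_FOLDER_ID"], "auth": env["YC_API_KEY"]}
```

With both variables set, the returned dictionary can be unpacked into the constructor from the example, for instance `YCloudML(**load_credentials())`.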

Run the created file:

```
python3 search-knowledge-base.py
```

Result:

```
Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg)
    was a Russian poet, playwright, and novelist, the progenitor of Russian realism,
    a literary critic and theorist, historian, essay writer, and journalist.
```

Python 3

Create a file named search-knowledge-base.py and paste the following code into it:

```python
import requests
import numpy as np
from scipy.spatial.distance import cdist

FOLDER_ID = "<folder_ID>"
IAM_TOKEN = "<IAM_token>"
doc_uri = f"emb://{FOLDER_ID}/text-search-doc/latest"
query_uri = f"emb://{FOLDER_ID}/text-search-query/latest"
embed_url = "https://llm.api.cloud.yandex.net:443/foundationModels/v1/textEmbedding"
headers = {"Content-Type": "application/json", "Authorization": f"Bearer {IAM_TOKEN}", "x-folder-id": FOLDER_ID}
doc_texts = [
  """Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.""",
  """Matricaria is a genus of annual flowering plants of the aster (composite) family. According to the current classification, it includes around 70 species of low-growing fragrant herbs that blossom from the first year of life."""
]
query_text = "when is Pushkin's birthday?"


def get_embedding(text: str, text_type: str = "doc") -> np.ndarray:
    """Request an embedding from the document model or the query model."""
    query_data = {
        "modelUri": doc_uri if text_type == "doc" else query_uri,
        "text": text,
    }
    response = requests.post(embed_url, json=query_data, headers=headers)
    response.raise_for_status()  # Stop early on an HTTP error
    return np.array(response.json()["embedding"])


query_embedding = get_embedding(query_text, text_type="query")
docs_embedding = [get_embedding(doc_text) for doc_text in doc_texts]

# Calculate the cosine distance between the query and each document
dist = cdist(query_embedding[None, :], docs_embedding, metric="cosine")

# Convert the cosine distance to cosine similarity
sim = 1 - dist

# Print the most similar text
print(doc_texts[np.argmax(sim)])
```

Where:

<folder_ID>: Yandex Cloud folder ID.
<IAM_token>: Service account IAM token you got earlier. It is required for authentication in the API.
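To inspect what the script sends before making a network call, the request can be assembled separately. The following sketch reuses the model URIs and header names from the code above; the `build_embedding_request` helper is a hypothetical name, and unlike the script it rejects an unknown text type instead of silently falling back to the query model.

```python
def build_embedding_request(folder_id: str, iam_token: str, text: str, text_type: str = "doc"):
    """Return (headers, payload) for a text embedding request without sending it."""
    model_names = {"doc": "text-search-doc", "query": "text-search-query"}
    if text_type not in model_names:
        raise ValueError(f"text_type must be 'doc' or 'query', got {text_type!r}")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {iam_token}",
        "x-folder-id": folder_id,
    }
    payload = {
        "modelUri": f"emb://{folder_id}/{model_names[text_type]}/latest",
        "text": text,
    }
    return headers, payload


# Placeholder values, as in the script above.
example_headers, example_payload = build_embedding_request(
    "<folder_ID>", "<IAM_token>", "when is Pushkin's birthday?", text_type="query"
)
print(example_payload["modelUri"])
```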

Run the created file:

```
python3 search-knowledge-base.py
```

Result:

```
Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.
```
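Both scripts print only the single best match. With a larger knowledge base, it can help to see several candidates along with their scores. The following sketch ranks texts by similarity; the `rank_documents` helper, the `top_k` parameter, and the sample scores are illustrative assumptions, not output from the service.

```python
import numpy as np


def rank_documents(similarities, texts, top_k=3):
    """Return up to top_k (text, score) pairs, highest similarity first."""
    # cdist returns a 2D array of shape (1, n_docs), so flatten it first.
    scores = np.asarray(similarities, dtype=float).ravel()
    order = np.argsort(scores)[::-1][:top_k]
    return [(texts[i], float(scores[i])) for i in order]


# Made-up similarity values in the shape that `1 - cdist(...)` produces.
sample_ranking = rank_documents([[0.12, 0.87, 0.45]], ["doc A", "doc B", "doc C"], top_k=2)
print(sample_ranking)
```

In the scripts above, the same helper could be called as `rank_documents(sim, doc_texts)` in place of the `np.argmax` lookup.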
