Text vectorization
Computers process all information as numbers. To effectively process texts written in natural languages, AI models convert words, phrases, and sentences into fixed-size numeric vectors that preserve the characteristics of words and the relations between them.
The process of converting text into such vectors is called vectorization, and the resulting vector is called an embedding. Embeddings store information about texts and allow using mathematical methods to process them. You can use embeddings to classify information, compare texts, or search through a knowledge base of your own.
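As a rough illustration of how embeddings let you compare texts mathematically, the sketch below computes cosine similarity between short hand-written vectors. The three-dimensional vectors and the `cosine_similarity` helper are made up for this illustration; real embeddings are produced by a model and are much longer.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; they are not produced by any real model.
poet = [0.9, 0.1, 0.0]
writer = [0.8, 0.2, 0.1]
flower = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score closer to 1.
print(cosine_similarity(poet, writer) > cosine_similarity(poet, flower))
```

With these toy values, the "poet" vector is closer to the "writer" vector than to the "flower" vector, which is the kind of relation a vectorization model is expected to capture for real texts.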
Models for text vectorization
Yandex Foundation Models provides two text vectorization models. You can access a model's API via its URI, which contains the folder ID. The /latest segment specifies the model version and is optional.
Purpose | Output vector size | Operation mode | URI |
---|---|---|---|
Vectorization of large source texts, e.g., documentation articles | 256 | Synchronous | emb://<folder_ID>/text-search-doc/latest |
Vectorization of short texts, such as search queries, requests, etc. | 256 | Synchronous | emb://<folder_ID>/text-search-query/latest |
To use the Yandex Foundation Models text vectorization models, you need the ai.languageModels.user role or higher for the folder.
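The URI pattern from the table can be assembled with a small helper. This is only a sketch: `embedding_model_uri` is a hypothetical function written for this illustration, not part of any Yandex library, and it assumes the version segment is simply left out when not needed.

```python
from typing import Optional

# Model names as they appear in the URIs listed in the table.
DOC_MODEL = "text-search-doc"
QUERY_MODEL = "text-search-query"


def embedding_model_uri(
    folder_id: str, model: str, version: Optional[str] = "latest"
) -> str:
    """Build a URI of the form emb://<folder_ID>/<model>[/<version>]."""
    if not folder_id or not model:
        raise ValueError("folder_id and model must be non-empty")
    uri = f"emb://{folder_id}/{model}"
    # The version segment is optional, so append it only when given.
    if version:
        uri += f"/{version}"
    return uri


print(embedding_model_uri("<folder_ID>", DOC_MODEL))
print(embedding_model_uri("<folder_ID>", QUERY_MODEL, version=None))
```

Replace `<folder_ID>` with your actual folder ID before using the resulting string as a model URI.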
Embedding use case
This simple example demonstrates how you can use embeddings to find the closest answer to a question in a knowledge base. The doc_texts array contains the source data for vectorization, i.e., the knowledge base, while the query_text variable contains the search query. After you get the embeddings, you can calculate the distance between each vector in the knowledge base and the query vector to find the most closely related text.
#!/usr/bin/env python3
# pylint: disable=import-outside-toplevel

from __future__ import annotations

from yandex_cloud_ml_sdk import YCloudML

# Knowledge base: the texts to search through
doc_texts = [
    """Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg)
was a Russian poet, playwright, and novelist, the progenitor of Russian realism,
a literary critic and theorist, historian, essay writer, and journalist.""",
    """Matricaria is a genus of annual flowering plants of the aster (composite) family. According to the current classification, it includes around 70 species of low-rise fragrant herbs that blossom from the first year of life.""",
]
query_text = "when is Pushkin's birthday?"


def main():
    import numpy as np
    from scipy.spatial.distance import cdist

    sdk = YCloudML(
        folder_id="<folder_ID>",
        auth="<API_key>",
    )

    # Vectorize the search query with the model for short texts
    query_model = sdk.models.text_embeddings("query")
    query_embedding = query_model.run(query_text)

    # Vectorize the knowledge base with the model for large texts
    doc_model = sdk.models.text_embeddings("doc")
    doc_embeddings = [doc_model.run(text) for text in doc_texts]

    query_embedding = np.array(query_embedding)

    # Cosine distance between the query and each document
    dist = cdist([query_embedding], doc_embeddings, metric="cosine")
    # Cosine similarity
    sim = 1 - dist
    # Text most similar to the query
    result = doc_texts[np.argmax(sim)]
    print(result)


if __name__ == "__main__":
    main()
Where:
- <folder_ID>: ID of the folder in which the service account was created.
- <API_key>: Service account API key required for authentication in the API.
Result:
Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg)
was a Russian poet, playwright, and novelist, the progenitor of Russian realism,
a literary critic and theorist, historian, essay writer, and journalist.
The same search can also be run by calling the API directly, for example with the requests library:

import numpy as np
import requests
from scipy.spatial.distance import cdist

FOLDER_ID = "<folder_ID>"
IAM_TOKEN = "<IAM_token>"

doc_uri = f"emb://{FOLDER_ID}/text-search-doc/latest"
query_uri = f"emb://{FOLDER_ID}/text-search-query/latest"

embed_url = "https://llm.api.cloud.yandex.net:443/foundationModels/v1/textEmbedding"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {IAM_TOKEN}",
    "x-folder-id": FOLDER_ID,
}

# Knowledge base: the texts to search through
doc_texts = [
    """Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.""",
    """Matricaria is a genus of annual flowering plants of the aster (composite) family. According to the current classification, it includes around 70 species of low-rise fragrant herbs that blossom from the first year of life.""",
]
query_text = "when is Pushkin's birthday?"


def get_embedding(text: str, text_type: str = "doc") -> np.ndarray:
    """Request an embedding of the text from the doc or the query model."""
    query_data = {
        "modelUri": doc_uri if text_type == "doc" else query_uri,
        "text": text,
    }
    response = requests.post(embed_url, json=query_data, headers=headers)
    response.raise_for_status()  # Stop early on an HTTP error
    return np.array(response.json()["embedding"])


query_embedding = get_embedding(query_text, text_type="query")
docs_embedding = [get_embedding(doc_text) for doc_text in doc_texts]

# Calculating cosine distance
dist = cdist(query_embedding[None, :], docs_embedding, metric="cosine")

# Calculating cosine similarity
sim = 1 - dist

# Most similar doc text
print(doc_texts[np.argmax(sim)])
Where:
- <folder_ID>: Yandex Cloud folder ID.
- <IAM_token>: IAM token of the account for authentication with the API.
Result:
Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.
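Both examples return only the single closest text. If you need several candidates, the same similarity scores can be sorted into a ranked list. The sketch below assumes the embeddings have already been obtained and substitutes short made-up vectors for them; `rank_by_similarity` and the placeholder texts are hypothetical and are not part of the ML SDK or the API.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    docs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return up to top_k (document index, similarity) pairs, best first."""
    scores = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(docs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Placeholder texts and toy vectors standing in for real embeddings.
doc_texts = ["text about a poet", "text about a flower", "text about a novelist"]
doc_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.3, 0.1]]
query_embedding = [0.8, 0.2, 0.0]

for index, score in rank_by_similarity(query_embedding, doc_embeddings, top_k=2):
    print(f"{score:.3f}  {doc_texts[index]}")
```

Sorting all scores is adequate for a small knowledge base; for a large one, a dedicated vector index is the more usual choice.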
See also
- Using embeddings to search through the knowledge base
- Examples of working with ML SDK on GitHub