Text vectorization
Computers process all information as numbers. To effectively process texts written in natural languages, AI models convert words, phrases, and sentences into fixed-size numeric vectors that preserve the characteristics of words and the relations between them.
The process of converting text into such vectors is called vectorization, and the resulting vector is called an embedding. Embeddings store information about texts and allow using mathematical methods to process them. You can use embeddings to classify information, compare texts, or search through a knowledge base of your own.
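As a rough illustration of what comparing texts mathematically can look like, the sketch below computes cosine similarity between toy three-dimensional vectors. The vectors and the `cosine_similarity` helper are made up for this example and are not produced by any model; real embeddings from the models described below have 256 dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values).
text_a = [1.0, 2.0, 0.0]
text_b = [2.0, 4.0, 0.0]  # points in the same direction as text_a
text_c = [0.0, 0.0, 3.0]  # orthogonal to text_a

sim_ab = cosine_similarity(text_a, text_b)
sim_ac = cosine_similarity(text_a, text_c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

Vectors pointing in the same direction score close to 1, while unrelated (orthogonal) vectors score close to 0, which is the property the search example later on this page relies on.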
Models for text vectorization
Yandex Foundation Models provides two text vectorization models.
Target | URI | Output vector size | Operating modes |
---|---|---|---|
Vectorization of large source texts, e.g., documentation articles | emb://<folder_ID>/text-search-doc/latest | 256 | Synchronous |
Vectorization of short texts, such as search queries, requests, etc. | emb://<folder_ID>/text-search-query/latest | 256 | Synchronous |
To use the Yandex Foundation Models text vectorization models, you need the ai.languageModels.user role or higher for the folder.
Accessing models
You can access text vectorization models in a number of ways.
When working with text vectorization models via Yandex Cloud ML SDK, use one of the following formats:
- Model name, provided as a string.
    model = sdk.models.text_embeddings("text-search-doc")
- Model name and version, provided as strings in the model_name and model_version fields, respectively.
    model = sdk.models.text_embeddings(model_name="text-search-query", model_version="latest")
- Model URI, provided as a string containing the full URI of the model.
    model = sdk.models.text_embeddings("emb://b1gt6g8ht345********/text-search-query/latest")
To access a model via the REST API or gRPC API, specify the model's URI containing the folder ID in the modelUri field of the request body. The /latest segment indicates the model version and is optional.
Example:
{
  "modelUri": "emb://b1gt6g8ht345********/text-search-doc/latest",
  ...
}
To access the latest version, you do not need to specify the model version explicitly because latest is used by default.
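The URI format described above can also be assembled programmatically. The helper below is a hypothetical convenience function, not part of the SDK or the API: it only concatenates strings and leaves out the version segment when none is given, relying on the default version behavior described above.

```python
from typing import Optional


def build_embedding_uri(
    folder_id: str, model_name: str, version: Optional[str] = None
) -> str:
    """Build a model URI of the form emb://<folder_ID>/<model>[/<version>]."""
    uri = f"emb://{folder_id}/{model_name}"
    if version:
        # The version segment is optional; append it only when given.
        uri += f"/{version}"
    return uri


# "<folder_ID>" is a placeholder; substitute your own folder ID.
doc_uri = build_embedding_uri("<folder_ID>", "text-search-doc", "latest")
query_uri = build_embedding_uri("<folder_ID>", "text-search-query")
print(doc_uri)
print(query_uri)
```

The resulting string can be passed as the modelUri field of a request body or directly to the SDK in place of a model name.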
Embedding use case
This simple example demonstrates how you can use embeddings to find the closest answer to a question in a knowledge base. The doc_texts array contains the source data for vectorization, i.e., the knowledge base, while the query_text variable contains the search query. After you get the embeddings, you can calculate the distance between each vector in the knowledge base and the query vector to find the most closely related text.
#!/usr/bin/env python3
# pylint: disable=import-outside-toplevel
from __future__ import annotations
from yandex_cloud_ml_sdk import YCloudML
doc_texts = [
"""Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg)
was a Russian poet, playwright, and novelist, the progenitor of Russian realism,
a literary critic and theorist, historian, essay writer, and journalist.""",
"""Matricaria is a genus of annual flowering plants of the aster (composite) family. According to the today's classification, it includes around 70 species of low-rise fragrant herbs that blossom from the first year of life.""",
]
query_text = "when is Pushkin's birthday?"
def main():
    import numpy as np
    from scipy.spatial.distance import cdist

    sdk = YCloudML(
        folder_id="<folder_ID>",
        auth="<API_key>",
    )

    # Embed the search query with the query model.
    query_model = sdk.models.text_embeddings("query")
    query_embedding = query_model.run(query_text)

    # Embed each knowledge base text with the document model.
    doc_model = sdk.models.text_embeddings("doc")
    doc_embeddings = [doc_model.run(text) for text in doc_texts]

    query_embedding = np.array(query_embedding)

    # Cosine distance between the query and every document; similarity = 1 - distance.
    dist = cdist([query_embedding], doc_embeddings, metric="cosine")
    sim = 1 - dist

    # Print the text with the highest similarity to the query.
    result = doc_texts[np.argmax(sim)]
    print(result)


if __name__ == "__main__":
    main()
Where:
- <folder_ID>: ID of the folder in which the service account was created.
- <API_key>: Service account API key required for authentication in the API.
Result:
Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow – January 29 [February 10], 1837, St. Petersburg)
was a Russian poet, playwright, and novelist, the progenitor of Russian realism,
a literary critic and theorist, historian, essay writer, and journalist.
The same search implemented with direct requests to the REST API:
import requests
import numpy as np
from scipy.spatial.distance import cdist
FOLDER_ID = "<folder_ID>"
IAM_TOKEN = "<IAM_token>"
doc_uri = f"emb://{FOLDER_ID}/text-search-doc/latest"
query_uri = f"emb://{FOLDER_ID}/text-search-query/latest"
embed_url = "https://llm.api.cloud.yandex.net:443/foundationModels/v1/textEmbedding"
headers = {"Content-Type": "application/json", "Authorization": f"Bearer {IAM_TOKEN}", "x-folder-id": f"{FOLDER_ID}"}
doc_texts = [
"""Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow — January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.""",
"""Matricaria is a genus of annual flowering plants of the aster (composite) family. According to the today's classification, it includes around 70 species of low-rise fragrant herbs that blossom from the first year of life."""
]
query_text = "when is Pushkin's birthday?"
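def get_embedding(text: str, text_type: str = "doc") -> np.ndarray:
    # Use the document model for "doc" texts and the query model otherwise.
    query_data = {
        "modelUri": doc_uri if text_type == "doc" else query_uri,
        "text": text,
    }
    response = requests.post(embed_url, json=query_data, headers=headers)
    # Fail on HTTP errors instead of hitting a KeyError on a missing "embedding" field.
    response.raise_for_status()
    return np.array(response.json()["embedding"])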
query_embedding = get_embedding(query_text, text_type="query")
docs_embedding = [get_embedding(doc_text) for doc_text in doc_texts]
# Calculating cosine distance
dist = cdist(query_embedding[None, :], docs_embedding, metric="cosine")
# Calculating cosine similarity
sim = 1 - dist
# most similar doc text
print(doc_texts[np.argmax(sim)])
Where:
- <folder_ID>: Yandex Cloud folder ID.
- <IAM_token>: IAM token of the account for authentication with the API.
Result:
Alexander Sergeyevich Pushkin (May 26 [June 6], 1799, Moscow — January 29 [February 10], 1837, St. Petersburg) was a Russian poet, playwright, and novelist, the progenitor of Russian realism, a literary critic and theorist, historian, essay writer, and journalist.
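Both examples return only the single closest text. For a larger knowledge base, a ranked list of candidates may be more useful. The sketch below sorts documents by cosine similarity to the query and keeps the top `k`. It uses made-up two-dimensional vectors in place of real embeddings so that it runs without API access, and `rank_documents` is a hypothetical helper name rather than an SDK function.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return up to k (document index, similarity) pairs, most similar first."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for a query embedding and three document embeddings.
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]

top = rank_documents(query_vec, doc_vecs, k=2)
for index, score in top:
    print(index, round(score, 3))
```

With real data, `doc_vecs` would hold the embeddings returned for each knowledge base text, and the indices in the result would map back to positions in doc_texts.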
See also
- Using embeddings to search through the knowledge base
- Examples of working with ML SDK on GitHub