Tuning classifiers based on YandexGPT in Yandex DataSphere
Note
Foundation model tuning is at the Preview stage.
To tune a YandexGPT classifier:
- Prepare your infrastructure.
- Prepare data for model training.
- Tune the classifier.
- Send a request to the classifier.
If you no longer need the resources you created, delete them.
Getting started
Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.
- On the DataSphere home page, click Try for free and select an account to log in with: Yandex ID or your working account in the identity federation (SSO).
- Select the Yandex Cloud Organization organization you are going to use in Yandex Cloud.
- Create a community.
- Link your billing account to the DataSphere community you are going to work in. Make sure you have a billing account linked and its status is `ACTIVE` or `TRIAL_ACTIVE`. If you do not have a billing account yet, create one in the DataSphere interface.
Prepare the infrastructure
Log in to the Yandex Cloud management console. If you have an active billing account, you can create or select a folder to deploy your infrastructure in on the cloud page.
Note
If you use an identity federation to access Yandex Cloud, billing details might be unavailable to you. In this case, contact your Yandex Cloud organization administrator.
Create a folder
- In the management console, select a cloud and click Create folder.
- Name your folder, e.g., `data-folder`.
- Click Create.
Create a service account for the DataSphere project
You can send requests to a tuned model through the DataSphere interface (Playground) or through the Foundation Models API. If you are going to make API requests, you need a service account with the `ai.languageModels.user` role. The service account must be a member of the DataSphere project in which the classifier will be tuned.
- Go to `data-folder`.
- In the Service accounts tab, click Create service account.
- Enter a name for the service account, e.g., `ai-user`.
- Click Add role and assign the `ai.languageModels.user` role to the service account.
- Click Create.
Add the service account to a project
To enable the service account to access the tuned classifier, add it to the list of project members.
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- In the Members tab, click Add member.
- Select the `ai-user` account and click Add.
Prepare data for model training
Note
To improve the quality of the responses you get, YandexGPT API logs user prompts. Do not use sensitive information and personal data in your prompts.
The following limits apply when tuning a classifier model based on YandexGPT:
| Type of limit | Minimum value | Maximum value |
|---|---|---|
| Number of examples per dataset | 100 | 50,000 |
| Number of classes | 2 | 100 |
| Number of examples of each class per dataset | 1 | — |
| Class name length, in characters | — | 100 |
| Number of characters in the text of the request to classify | — | 10,000 |
We recommend tuning models on datasets containing at least 1,000 examples and at least 100 examples for each class.
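The limits above can be checked before a dataset is uploaded. Below is a minimal sketch, assuming the dataset is a file of JSON lines like the examples that follow; the `validate_dataset` helper and its messages are this example's own, not part of the service:

```python
import json
from collections import Counter

def validate_dataset(lines):
    """Return a list of limit violations found in an iterable of JSON lines."""
    examples = [json.loads(line) for line in lines if line.strip()]
    problems = []
    if not 100 <= len(examples) <= 50_000:
        problems.append(f"dataset has {len(examples)} examples, expected 100 to 50,000")
    # Every key other than "text" is treated as a class column.
    classes = sorted({key for ex in examples for key in ex if key != "text"})
    if not 2 <= len(classes) <= 100:
        problems.append(f"dataset has {len(classes)} classes, expected 2 to 100")
    problems += [f"class name {name!r} is longer than 100 characters"
                 for name in classes if len(name) > 100]
    per_class = Counter()
    for ex in examples:
        if len(ex.get("text", "")) > 10_000:
            problems.append("request text is longer than 10,000 characters")
        for name in classes:
            per_class[name] += ex.get(name, 0)
    problems += [f"class {name!r} has no examples"
                 for name in classes if per_class[name] < 1]
    return problems
```

An empty result means the dataset fits within the documented limits; each string in the result describes one violation.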
Example of file contents for binary classification training:
{"text":"I am fine","neutral":1,"emotional":0}
{"text":"I am doing great ","neutral":0,"emotional":1}
{"text":"You could not possibly understand how tough it is to get up for work at six in the morning every day and spend two hours commuting on public transport","neutral":0,"emotional":1}
{"text":"it is the same as always: work, home, and family.","neutral":1,"emotional":0}
Where:
- `text`: Message text.
- `neutral` and `emotional`: The two classes of the binary classification.
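A file in this format can be generated from labeled pairs. A small sketch; the `to_jsonl` helper and its `(text, label)` input format are this example's assumptions:

```python
import json

def to_jsonl(samples, classes=("neutral", "emotional")):
    """Turn (text, label) pairs into one-hot JSON lines for binary tuning."""
    lines = []
    for text, label in samples:
        row = {"text": text}
        for name in classes:
            row[name] = 1 if name == label else 0  # exactly one class is 1
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)

print(to_jsonl([("I am fine", "neutral"), ("I am doing great", "emotional")]))
```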
Example of file contents for multi-class classification training:
{"text":"wow, how did that happen","anger":0,"fear":0,"joy":0,"sadness":0,"surprise":1}
{"text":"what should I do, what if they find out ?","anger":0,"fear":1,"joy":0,"sadness":0,"surprise":0}
{"text":"today is Friday, and tonight we are going to the club with friends","anger":0,"fear":0,"joy":1,"sadness":0,"surprise":0}
{"text":"do not lie to me, you just overslept again and were late for school because of that","anger":1,"fear":0,"joy":0,"sadness":0,"surprise":0}
Where:
- `text`: Message text.
- `anger`, `fear`, `joy`, `sadness`, and `surprise`: Classes.
Example of file contents for multi-label classification training:
{"computer_science":0,"physics":0,"mathematics":1,"statistics":1,"quantitative_biology":0,"quantitative_finance":0,"text":"Title: Bias Reduction in Instrumental Variable Estimation through First-Stage Shrinkage\nAbstract: The two-stage least-squares (2SLS) estimator is known to be biased when its\nfirst-stage fit is poor. I show that better first-stage prediction can\nalleviate this bias. In a two-stage linear regression model with Normal noise,\nI consider shrinkage in the estimation of the first-stage instrumental variable\ncoefficients. For at least four instrumental variables and a single endogenous\nregressor, I establish that the standard 2SLS estimator is dominated with\nrespect to bias. The dominating IV estimator applies James-Stein type shrinkage\nin a first-stage high-dimensional Normal-means problem followed by a\ncontrol-function approach in the second stage. It preserves invariances of the\nstructural instrumental variable equations.\n"}
{"computer_science":0,"physics":0,"mathematics":1,"statistics":0,"quantitative_biology":0,"quantitative_finance":0,"text":"Title: Essentially Finite Vector Bundles on Normal Pseudo-proper Algebraic Stacks\nAbstract: Let $X$ be a normal, connected and projective variety over an algebraically\nclosed field $k$. It is known that a vector bundle $V$ on $X$ is essentially\nfinite if and only if it is trivialized by a proper surjective morphism $f:Y\\to\nX$. In this paper we introduce a different approach to this problem which\nallows to extend the results to normal, connected and strongly pseudo-proper\nalgebraic stack of finite type over an arbitrary field $k$.\n"}
{"computer_science":1,"physics":0,"mathematics":0,"statistics":1,"quantitative_biology":0,"quantitative_finance":0,"text":"Title: MOLIERE: Automatic Biomedical Hypothesis Generation System\nAbstract: Hypothesis generation is becoming a crucial time-saving technique which\nallows biomedical researchers to quickly discover implicit connections between\nimportant concepts. Typically, these systems operate on domain-specific\nfractions of public medical data. MOLIERE, in contrast, utilizes information\nfrom over 24.5 million documents. At the heart of our approach lies a\nmulti-modal and multi-relational network of biomedical objects extracted from\nseveral heterogeneous datasets from the National Center for Biotechnology\nInformation (NCBI). These objects include but are not limited to scientific\npapers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses\nusing Latent Dirichlet Allocation applied on abstracts found near shortest\npaths discovered within this network, and demonstrate the effectiveness of\nMOLIERE by performing hypothesis generation on historical data. Our network,\nimplementation, and resulting data are all publicly available for the broad\nscientific community.\n"}
Where:
- `computer_science`, `physics`, `mathematics`, `statistics`, `quantitative_biology`, and `quantitative_finance`: Classes.
- `text`: Message text, consisting of:
  - `Title`: Message title.
  - `Abstract`: Main text of the message.
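In the multi-label case several class columns can be 1 in the same row, and the `text` field combines the title and abstract. A sketch of assembling one such line; the `multilabel_row` helper is this example's assumption:

```python
import json

CLASSES = ["computer_science", "physics", "mathematics",
           "statistics", "quantitative_biology", "quantitative_finance"]

def multilabel_row(title, abstract, labels):
    """Build one JSON line; several classes may be set to 1 at once."""
    row = {name: int(name in labels) for name in CLASSES}
    row["text"] = f"Title: {title}\nAbstract: {abstract}\n"
    return json.dumps(row, ensure_ascii=False)
```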
After completing the training, you will get the ID of the model tuned for classification. Provide this ID in the `modelUri` field of the request body in the Text Classification API classify method.
Tune the model
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- In the left-hand panel, click Foundation models.
- Select the YandexGPT classifier model and click Tune model.
- In the window that opens, specify your project and click Add.
- In the Name field, enter a name for the model.
- Select the classification type:
  - Binary: If each request belongs to one of two groups.
  - Multi-class: If each request belongs to exactly one of several groups.
  - Multi-label: If each request can belong to more than one group.
- In the File with samples field, attach a JSON file with request and class pairs.
- Click Start tuning and wait for the model to be tuned. This may take several hours.
- To check the status of your fine-tuned model:
  - Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
  - In the list of available project resources, select Models.
  - In the Project tab, select Tuned foundation models. You can also get the model ID here; you will need it to make API requests.
Send a request to the tuned classifier
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- In the list of available project resources, select Models.
- In the Project tab, select Tuned foundation models.
- Select your fine-tuned model and click Test in Playground.
- Under Request, enter the text you want to classify.
- Click Send request.
To use the examples, install cURL.
- Create a file with the request body, e.g., `body.json`:

  ```json
  {
    "model_uri": "cls://<folder_ID>/<classifier_ID>",
    "text": "<prompt_text>"
  }
  ```

  Where:
  - `model_uri`: ID of the tuned classifier model.
  - `text`: Text of the request to classify.
Send a request to the classifier by running the following command:
export IAM_TOKEN=<IAM_token> curl --request POST \ --header "Authorization: Bearer ${IAM_TOKEN}" \ --data "@<path_to_request_body_file>" \ "https://llm.api.cloud.yandex.net:443/foundationModels/v1/textClassification"
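The same call can be assembled with Python's standard library; the endpoint and headers mirror the cURL command above, while the `build_request` helper is this example's own:

```python
import json
import urllib.request

URL = "https://llm.api.cloud.yandex.net/foundationModels/v1/textClassification"

def build_request(iam_token, model_uri, text):
    """Assemble the POST request; sending it requires a valid IAM token."""
    body = json.dumps({"model_uri": model_uri, "text": text}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        headers={"Authorization": f"Bearer {iam_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# req = build_request("<IAM_token>", "cls://<folder_ID>/<classifier_ID>", "<prompt_text>")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```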
In the response, the service will return the classification results with the `confidence` values for the probability of classifying the request text into each class:

```json
{
  "predictions": [
    {
      "label": "<class_1_name>",
      "confidence": 0.00010150671005249023
    },
    {
      "label": "<class_2_name>",
      "confidence": 0.000008225440979003906
    },
    ...
    {
      "label": "<class_n_name>",
      "confidence": 0.93212890625
    }
  ],
  "modelVersion": "<model_version>"
}
```
In multi-class classification, the `confidence` values for all classes always sum to 1. In multi-label classification, the `confidence` value for each class is calculated independently, so the values do not sum to 1.
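Interpreting the predictions differs accordingly: take the single highest-confidence label for binary or multi-class output, or keep every label above a chosen threshold for multi-label output. A sketch; the 0.5 threshold is this example's choice, not a service default:

```python
def top_label(predictions):
    """Multi-class: the single most likely class."""
    return max(predictions, key=lambda p: p["confidence"])["label"]

def labels_above(predictions, threshold=0.5):
    """Multi-label: every class whose independent confidence clears the bar."""
    return [p["label"] for p in predictions if p["confidence"] >= threshold]

preds = [{"label": "neutral", "confidence": 0.93},
         {"label": "emotional", "confidence": 0.07}]
print(top_label(preds))     # -> neutral
print(labels_above(preds))  # -> ['neutral']
```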