Managing machine learning models in Managed Service for ClickHouse®
Managed Service for ClickHouse® enables data analysis using CatBoost machine learning models.
To apply a model, add it to your cluster and call it in an SQL query using the built-in catboostEvaluate() function. This query will return model predictions for each row of the input data.
Read more about the catboostEvaluate() function in this ClickHouse® article.
Before adding a model
Managed Service for ClickHouse® only works with models uploaded to Yandex Object Storage and accessible for reading:
- To attach a service account to a cluster, assign the iam.serviceAccounts.user role or higher to your Yandex Cloud account.
- Upload the trained model file to Yandex Object Storage.
- Attach the service account to the cluster. You will use this service account to configure access to the model file.
- Assign the storage.viewer role to the service account.
- In the bucket ACL, add the READ permission to the service account.
- Get a link to the model file.
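As a rough illustration of the last step, the sketch below builds a link to an object in a bucket. It assumes the path-style address `https://storage.yandexcloud.net/<bucket>/<object_key>`, which matches the example link used at the end of this page; the helper name `model_file_link` is hypothetical, and the link you actually get from Object Storage (for example, a signed one) may look different.

```python
from urllib.parse import quote

# Assumption: path-style Object Storage address, as in the example link on this page.
STORAGE_HOST = "https://storage.yandexcloud.net"


def model_file_link(bucket: str, object_key: str) -> str:
    """Return a path-style link to an object (hypothetical helper)."""
    # Percent-encode the key but keep "/" so folder-like prefixes survive.
    return f"{STORAGE_HOST}/{quote(bucket)}/{quote(object_key, safe='/')}"


print(model_file_link("managed-clickhouse", "catboost_model.bin"))
```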
Getting a list of models in a cluster
- In the management console, select the folder the cluster is in.
- Go to Managed Service for ClickHouse.
- Click the name of your cluster and select the Machine learning tab in the left-hand panel.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To get a list of models in a cluster, run this command:
yc managed-clickhouse ml-model list --cluster-name=<cluster_name>
You can get the cluster name with the list of clusters in the folder.
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Call the MlModel.List method, e.g., via the following cURL request:
  curl \
    --request GET \
    --header "Authorization: Bearer $IAM_TOKEN" \
    --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels'
  You can get the cluster ID with the list of clusters in the folder.
- View the server response to make sure your request was successful.
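The same REST call can also be made from a script. The following is a minimal sketch using only the Python standard library; the endpoint and header are taken from the cURL request above, while the function names `build_list_request` and `list_models` are hypothetical. The response is returned as parsed JSON without assuming any particular field names.

```python
import json
import urllib.request

API_HOST = "https://mdb.api.cloud.yandex.net"


def build_list_request(cluster_id: str, iam_token: str) -> urllib.request.Request:
    """Prepare a GET request for the list of models in a cluster."""
    url = f"{API_HOST}/managed-clickhouse/v1/clusters/{cluster_id}/mlModels"
    return urllib.request.Request(
        url,
        method="GET",
        headers={"Authorization": f"Bearer {iam_token}"},
    )


def list_models(cluster_id: str, iam_token: str) -> dict:
    """Send the request and return the parsed JSON response as is."""
    with urllib.request.urlopen(build_list_request(cluster_id, iam_token)) as resp:
        return json.load(resp)
```

To try it, call `list_models("<cluster_ID>", os.environ["IAM_TOKEN"])` after importing `os`; the call needs network access and a valid token.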
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Call the MlModelService.List method, e.g., via the following gRPCurl request:
  grpcurl \
    -format json \
    -import-path ~/cloudapi/ \
    -import-path ~/cloudapi/third_party/googleapis/ \
    -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
    -rpc-header "Authorization: Bearer $IAM_TOKEN" \
    -d '{
          "cluster_id": "<cluster_ID>"
        }' \
    mdb.api.cloud.yandex.net:443 \
    yandex.cloud.mdb.clickhouse.v1.MlModelService.List
  You can get the cluster ID with the list of clusters in the folder.
- View the server response to make sure your request was successful.
Getting detailed information about a model
- In the management console, select the folder the cluster is in.
- Go to Managed Service for ClickHouse.
- Click the name of your cluster and select the Machine learning tab in the left-hand panel.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To get model details, run this command:
yc managed-clickhouse ml-model get <model_name> \
--cluster-name=<cluster_name>
You can get the model name with a list of cluster models, and the cluster name, with a list of clusters in the folder.
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Call the MlModel.Get method, e.g., via the following cURL request:
  curl \
    --request GET \
    --header "Authorization: Bearer $IAM_TOKEN" \
    --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels/<model_name>'
  You can get the cluster ID with a list of clusters in the folder, and the model name, with a list of models in the cluster.
- View the server response to make sure your request was successful.
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Call the MlModelService.Get method, e.g., via the following gRPCurl request:
  grpcurl \
    -format json \
    -import-path ~/cloudapi/ \
    -import-path ~/cloudapi/third_party/googleapis/ \
    -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
    -rpc-header "Authorization: Bearer $IAM_TOKEN" \
    -d '{
          "cluster_id": "<cluster_ID>",
          "ml_model_name": "<model_name>"
        }' \
    mdb.api.cloud.yandex.net:443 \
    yandex.cloud.mdb.clickhouse.v1.MlModelService.Get
  You can get the cluster ID with a list of clusters in the folder, and the model name, with a list of models in the cluster.
- View the server response to make sure your request was successful.
Creating a model
Note
The only supported model type is CatBoost: ML_MODEL_TYPE_CATBOOST.
- Select the cluster:
  - In the management console, select the folder the cluster is in.
  - Go to Managed Service for ClickHouse.
  - Click the name of your cluster and select the Machine learning tab in the left-hand panel.
  - Click Create model.
- Configure the model:
  - Type: ML_MODEL_TYPE_CATBOOST.
  - Name: Model name. It is part of the model file path you provide to the catboostEvaluate() function, which is used to call the model in ClickHouse®.
  - URL: Model address in Yandex Object Storage.
- Click Create and wait until the model is created.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To create a model, run this command:
yc managed-clickhouse ml-model create <model_name> \
--cluster-name=<cluster_name> \
--type=ML_MODEL_TYPE_CATBOOST \
--uri=<link_to_model_file_in_Object_Storage>
You can get the cluster name with the list of clusters in the folder.
- Open the current Terraform configuration file describing your infrastructure.
  For information on how to create such a file, see Creating a cluster.
- Add the ml_model section describing the ML model being added to the Managed Service for ClickHouse® cluster description:
  resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
    ...
    ml_model {
      name = "<model_name>"
      type = "ML_MODEL_TYPE_CATBOOST"
      uri  = "<link_to_model_file_in_Object_Storage>"
    }
  }
- Make sure the settings are correct.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:
    terraform validate
    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:
    terraform plan
    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:
      terraform apply
    - Confirm updating the resources.
    - Wait for the operation to complete.
For more information, see this Terraform provider guide.
Timeouts
The Terraform provider sets the following timeouts for Managed Service for ClickHouse® cluster operations:
- Creating a cluster, including by restoring from a backup: 60 minutes.
- Updating a cluster: 90 minutes.
- Deleting a cluster: 30 minutes.
Operations exceeding the timeout are aborted.
How do I change these limits?
Add a timeouts section to the cluster description, e.g.:
resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
  ...
  timeouts {
    create = "1h30m" # 1 hour 30 minutes
    update = "2h"    # 2 hours
    delete = "30m"   # 30 minutes
  }
}
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Call the MlModel.Create method, e.g., via the following cURL request:
  curl \
    --request POST \
    --header "Authorization: Bearer $IAM_TOKEN" \
    --header "Content-Type: application/json" \
    --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels' \
    --data '{
              "mlModelName": "<model_name>",
              "type": "ML_MODEL_TYPE_CATBOOST",
              "uri": "<file_link>"
            }'
  Where:
  - mlModelName: Model name.
  - type: Model type, which is always ML_MODEL_TYPE_CATBOOST.
  - uri: Link to the model file in Object Storage.
  You can get the cluster ID with the list of clusters in the folder.
- View the server response to make sure your request was successful.
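For scripted use, the request body above can be assembled programmatically. This is a minimal sketch with the Python standard library; the field names (`mlModelName`, `type`, `uri`) and the endpoint come from the cURL request above, and the function name `build_create_request` is hypothetical. It only prepares the request; pass the result to `urllib.request.urlopen()` to send it.

```python
import json
import urllib.request

API_HOST = "https://mdb.api.cloud.yandex.net"


def build_create_request(
    cluster_id: str, iam_token: str, model_name: str, uri: str
) -> urllib.request.Request:
    """Prepare a POST request that adds a CatBoost model to a cluster."""
    body = {
        "mlModelName": model_name,
        "type": "ML_MODEL_TYPE_CATBOOST",  # the only supported model type
        "uri": uri,
    }
    return urllib.request.Request(
        f"{API_HOST}/managed-clickhouse/v1/clusters/{cluster_id}/mlModels",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {iam_token}",
            "Content-Type": "application/json",
        },
    )
```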
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Call the MlModelService.Create method, e.g., via the following gRPCurl request:
  grpcurl \
    -format json \
    -import-path ~/cloudapi/ \
    -import-path ~/cloudapi/third_party/googleapis/ \
    -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
    -rpc-header "Authorization: Bearer $IAM_TOKEN" \
    -d '{
          "cluster_id": "<cluster_ID>",
          "ml_model_name": "<model_name>",
          "type": "ML_MODEL_TYPE_CATBOOST",
          "uri": "<file_link>"
        }' \
    mdb.api.cloud.yandex.net:443 \
    yandex.cloud.mdb.clickhouse.v1.MlModelService.Create
  Where:
  - ml_model_name: Model name.
  - type: Model type, which is always ML_MODEL_TYPE_CATBOOST.
  - uri: Link to the model file in Object Storage.
  You can get the cluster ID with the list of clusters in the folder.
- View the server response to make sure your request was successful.
Applying a model
To apply a model to data stored in a ClickHouse® cluster:
- Run the following SQL query:
  SELECT
      catboostEvaluate('<path_to_model_file>',
                       <column_1_name>,
                       <column_2_name>,
                       ...
                       <column_N_name>)
  FROM <table_name>
Specify the following as the catboostEvaluate() function arguments:
- Path to the model file in /var/lib/clickhouse/models/<model_name>.bin format.
- Names of columns containing the input data.
The query will return a column with model predictions for each row of the source table.
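When the query is generated from code, the model path and the column list can be assembled as in the sketch below. The path format comes from the description above; the helper name `catboost_query` and the rule that names consist only of letters, digits, and underscores are assumptions made here to keep the generated SQL safe, not requirements stated by the service.

```python
import re

# Assumption: restrict names to simple identifiers so nothing needs quoting.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def catboost_query(model_name: str, columns: list[str], table: str) -> str:
    """Build a SELECT that applies a cluster model to the given columns."""
    if not columns:
        raise ValueError("at least one input column is required")
    for name in (model_name, table, *columns):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsupported name: {name!r}")
    model_path = f"/var/lib/clickhouse/models/{model_name}.bin"
    column_list = ", ".join(columns)
    return f"SELECT catboostEvaluate('{model_path}', {column_list}) FROM {table}"


print(catboost_query("ml_test", ["RESOURCE", "MGR_ID"], "ml_test_table"))
```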
Updating a model
Managed Service for ClickHouse® does not track changes in a model file located in a Yandex Object Storage bucket.
To update the contents of a model that is already added to the cluster:
- Upload the updated model file to Yandex Object Storage.
- Get a link to this file.
- Update the settings of the model added to Managed Service for ClickHouse® by providing a new link to the model file.
- In the management console, select the folder the cluster is in.
- Go to Managed Service for ClickHouse.
- Click the name of your cluster and select the Machine learning tab in the left-hand panel.
- Select the model, open its menu, and select Edit.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To update the link to a model file in a Yandex Object Storage bucket, run this command:
yc managed-clickhouse ml-model update <model_name> \
--cluster-name=<cluster_name> \
--uri=<new_link_to_file_in_Object_Storage>
You can get the model name with a list of cluster models, and the cluster name, with a list of clusters in the folder.
- Open the current Terraform configuration file describing your infrastructure.
  For information on how to create such a file, see Creating a cluster.
- In the Managed Service for ClickHouse® cluster description, change the uri parameter value under ml_model:
  resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
    ...
    ml_model {
      name = "<model_name>"
      type = "ML_MODEL_TYPE_CATBOOST"
      uri  = "<new_link_to_model_file_in_Object_Storage>"
    }
  }
- Make sure the settings are correct.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:
    terraform validate
    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:
    terraform plan
    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:
      terraform apply
    - Confirm updating the resources.
    - Wait for the operation to complete.
For more information, see this Terraform provider guide.
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Call the MlModel.Update method, e.g., via the following cURL request:
  Warning
  The API method will assign default values to all the parameters of the object you are modifying unless you explicitly provide them in your request. To avoid this, list the settings you want to change in the updateMask parameter as a single comma-separated string.
  curl \
    --request PATCH \
    --header "Authorization: Bearer $IAM_TOKEN" \
    --header "Content-Type: application/json" \
    --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels/<model_name>' \
    --data '{
              "updateMask": "uri",
              "uri": "<file_link>"
            }'
  Where:
  - updateMask: Comma-separated list of settings you want to update.
    Here, we only specified a single parameter, uri.
  - uri: Link to the new model file in Object Storage.
  You can get the cluster ID with the list of clusters in the folder.
- View the server response to make sure your request was successful.
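One way to keep updateMask in sync with the fields actually sent is to derive it from the same dictionary. The sketch below does that with the Python standard library; the endpoint and field names follow the cURL request above, while `build_update_request` and its `changes` argument are hypothetical names.

```python
import json
import urllib.request

API_HOST = "https://mdb.api.cloud.yandex.net"


def build_update_request(
    cluster_id: str, iam_token: str, model_name: str, changes: dict
) -> urllib.request.Request:
    """Prepare a PATCH request whose updateMask lists exactly the changed fields."""
    if not changes:
        raise ValueError("nothing to update")
    # updateMask is a single comma-separated string of the setting names.
    body = {"updateMask": ",".join(changes), **changes}
    return urllib.request.Request(
        f"{API_HOST}/managed-clickhouse/v1/clusters/{cluster_id}/mlModels/{model_name}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {iam_token}",
            "Content-Type": "application/json",
        },
    )
```

For example, `build_update_request("<cluster_ID>", token, "<model_name>", {"uri": "<file_link>"})` produces the same body as the cURL request above.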
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Call the MlModelService.Update method, e.g., via the following gRPCurl request:
  Warning
  The API method will assign default values to all the parameters of the object you are modifying unless you explicitly provide them in your request. To avoid this, list the settings you want to change in the update_mask parameter as an array of paths[] strings.
  Format for listing settings
  "update_mask": {
    "paths": [
      "<setting_1>",
      "<setting_2>",
      ...
      "<setting_N>"
    ]
  }
  grpcurl \
    -format json \
    -import-path ~/cloudapi/ \
    -import-path ~/cloudapi/third_party/googleapis/ \
    -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
    -rpc-header "Authorization: Bearer $IAM_TOKEN" \
    -d '{
          "cluster_id": "<cluster_ID>",
          "ml_model_name": "<model_name>",
          "update_mask": {
            "paths": ["uri"]
          },
          "uri": "<file_link>"
        }' \
    mdb.api.cloud.yandex.net:443 \
    yandex.cloud.mdb.clickhouse.v1.MlModelService.Update
  Where:
  - ml_model_name: Model name.
  - update_mask: List of settings you want to update as an array of strings (paths[]).
    Here, we only specified a single setting, uri.
  - uri: Link to the new model file in Object Storage.
  You can get the cluster ID with the list of clusters in the folder.
- View the server response to make sure your request was successful.
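The gRPC request expects update_mask as an object with a paths array rather than a comma-separated string. As a small sketch, the JSON document passed to grpcurl with -d could be generated like this; the field names follow the request above, and `grpc_update_payload` is a hypothetical helper name.

```python
import json


def grpc_update_payload(cluster_id: str, model_name: str, changes: dict) -> str:
    """Return the JSON request message for MlModelService.Update."""
    if not changes:
        raise ValueError("nothing to update")
    payload = {
        "cluster_id": cluster_id,
        "ml_model_name": model_name,
        # List only the settings that are actually being changed.
        "update_mask": {"paths": list(changes)},
    }
    payload.update(changes)
    return json.dumps(payload, indent=2)


print(grpc_update_payload("<cluster_ID>", "<model_name>", {"uri": "<file_link>"}))
```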
Disabling a model
Note
After you disable a model, the relevant object remains in the Yandex Object Storage bucket. If you no longer need this model object, you can delete it.
- In the management console, select the folder the cluster is in.
- Go to Managed Service for ClickHouse.
- Click the name of your cluster and select the Machine learning tab in the left-hand panel.
- Select the model, open its menu, and select Delete.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To remove a model, run this command:
yc managed-clickhouse ml-model delete <model_name> \
--cluster-name=<cluster_name>
You can get the model name with a list of cluster models, and the cluster name, with a list of clusters in the folder.
- Open the current Terraform configuration file describing your infrastructure.
  For information on how to create such a file, see Creating a cluster.
- Delete the relevant ml_model section from the Managed Service for ClickHouse® cluster description.
- Make sure the settings are correct.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:
    terraform validate
    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:
    terraform plan
    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:
      terraform apply
    - Confirm updating the resources.
    - Wait for the operation to complete.
For more information, see this Terraform provider guide.
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Call the MlModel.Delete method, e.g., via the following cURL request:
  curl \
    --request DELETE \
    --header "Authorization: Bearer $IAM_TOKEN" \
    --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels/<model_name>'
  You can get the cluster ID with a list of clusters in the folder, and the model name, with a list of models in the cluster.
- View the server response to make sure your request was successful.
- Get an IAM token for API authentication and put it into an environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Call the MlModelService.Delete method, e.g., via the following gRPCurl request:
  grpcurl \
    -format json \
    -import-path ~/cloudapi/ \
    -import-path ~/cloudapi/third_party/googleapis/ \
    -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
    -rpc-header "Authorization: Bearer $IAM_TOKEN" \
    -d '{
          "cluster_id": "<cluster_ID>",
          "ml_model_name": "<model_name>"
        }' \
    mdb.api.cloud.yandex.net:443 \
    yandex.cloud.mdb.clickhouse.v1.MlModelService.Delete
  You can get the cluster ID with a list of clusters in the folder, and the model name, with a list of models in the cluster.
- View the server response to make sure your request was successful.
Example
If you do not have an appropriate dataset and model to process it, you can test machine learning in Managed Service for ClickHouse® with this example. We prepared a data file and trained a model to analyze it. You can upload data to ClickHouse® and see model predictions for different table rows.
Note
In this example, we will be using public data from the Amazon Employee Access Challenge. The model is trained to predict values in the ACTION column. The same data and model are used on GitHub.
To upload data to ClickHouse® and test the model:
- In the management console, add the test model:
  - Type: ML_MODEL_TYPE_CATBOOST.
  - Name: ml_test.
  - URL: https://storage.yandexcloud.net/managed-clickhouse/catboost_model.bin
- Download the data file to analyze.
- Create a test table:
  CREATE TABLE ml_test_table (
      date Date MATERIALIZED today(),
      ACTION UInt8,
      RESOURCE UInt32,
      MGR_ID UInt32,
      ROLE_ROLLUP_1 UInt32,
      ROLE_ROLLUP_2 UInt32,
      ROLE_DEPTNAME UInt32,
      ROLE_TITLE UInt32,
      ROLE_FAMILY_DESC UInt32,
      ROLE_FAMILY UInt32,
      ROLE_CODE UInt32
  )
  ENGINE = MergeTree()
  PARTITION BY date
  ORDER BY date;
- Upload the data to the table:
  INSERT INTO ml_test_table FROM INFILE '<file_path>/train.csv' FORMAT CSVWithNames;
- Test the model:
  - Get predicted values of the ACTION column for the first ten rows in the table:
    SELECT
        catboostEvaluate('/var/lib/clickhouse/models/ml_test.bin',
                         RESOURCE,
                         MGR_ID,
                         ROLE_ROLLUP_1,
                         ROLE_ROLLUP_2,
                         ROLE_DEPTNAME,
                         ROLE_TITLE,
                         ROLE_FAMILY_DESC,
                         ROLE_FAMILY,
                         ROLE_CODE) > 0 AS prediction,
        ACTION AS target
    FROM ml_test_table
    LIMIT 10;
  - Get the predicted probability for the first ten rows in the table:
    SELECT
        catboostEvaluate('/var/lib/clickhouse/models/ml_test.bin',
                         RESOURCE,
                         MGR_ID,
                         ROLE_ROLLUP_1,
                         ROLE_ROLLUP_2,
                         ROLE_DEPTNAME,
                         ROLE_TITLE,
                         ROLE_FAMILY_DESC,
                         ROLE_FAMILY,
                         ROLE_CODE) AS prediction,
        1. / (1 + exp(-prediction)) AS probability,
        ACTION AS target
    FROM ml_test_table
    LIMIT 10;
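The two test queries differ only in how they post-process the raw model output: the first compares it with zero, the second passes it through the logistic function 1 / (1 + exp(-x)). The sketch below reproduces both transformations outside the database so you can check individual values by hand; the function names are hypothetical, and the branch on the sign of the input is an addition that avoids floating-point overflow for large negative values.

```python
import math


def probability(raw_prediction: float) -> float:
    """Logistic function, equivalent to 1 / (1 + exp(-x)) in the query."""
    if raw_prediction >= 0:
        return 1.0 / (1.0 + math.exp(-raw_prediction))
    # For negative inputs use the algebraically equal form exp(x) / (1 + exp(x)).
    e = math.exp(raw_prediction)
    return e / (1.0 + e)


def predicted_class(raw_prediction: float) -> bool:
    """Same rule as `catboostEvaluate(...) > 0` in the first query."""
    return raw_prediction > 0


for raw in (-2.0, 0.0, 2.0):
    print(raw, predicted_class(raw), round(probability(raw), 4))
```

A raw value above zero corresponds to a probability above 0.5, so the two queries agree on which rows are predicted as positive.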
ClickHouse® is a registered trademark of ClickHouse, Inc.