Managing machine learning models in Managed Service for ClickHouse®
Managed Service for ClickHouse® allows you to analyze data by applying CatBoost machine learning models.
To apply a model, add it to your cluster and call it in an SQL query using the built-in catboostEvaluate() function. After running this query, you get model predictions for each row of input data.
Read more about the catboostEvaluate() function in the ClickHouse® documentation.
Before adding a model
Managed Service for ClickHouse® only works with readable models uploaded to Yandex Object Storage:
- To bind your service account to the cluster, make sure your account in Yandex Cloud is assigned the iam.serviceAccounts.user role or higher.
- Upload the trained model file to Yandex Object Storage.
- Connect a service account to a cluster. Use your service account to configure access permissions for the model file.
- Assign the storage.viewer role to the service account.
- In the bucket's ACL, add the READ permission to the service account.
- Get a link to the model file.
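The last step can be sketched in code. The snippet below is a minimal illustration that assumes the link follows the `<host>/<bucket>/<object_key>` layout seen in the example URL later on this page; the helper name `model_file_link` is hypothetical, and the link shown for your object in Object Storage remains the authoritative value.

```python
from urllib.parse import quote

# Assumption: host and path layout are inferred from the example URL on this page
# (https://storage.yandexcloud.net/managed-clickhouse/catboost_model.bin).
STORAGE_HOST = "https://storage.yandexcloud.net"


def model_file_link(bucket: str, object_key: str) -> str:
    """Return a link of the form <host>/<bucket>/<object_key> (hypothetical helper)."""
    if not bucket or not object_key:
        raise ValueError("bucket and object_key must be non-empty")
    # Percent-encode the key but keep "/" so nested keys stay readable.
    key = quote(object_key.lstrip("/"), safe="/")
    return f"{STORAGE_HOST}/{quote(bucket, safe='')}/{key}"


link = model_file_link("managed-clickhouse", "catboost_model.bin")
print(link)
```

The resulting string is what you would provide as the model URL when adding the model to the cluster.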
Getting a list of models in a cluster
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and select the Machine learning tab in the left-hand panel.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To get a list of models in a cluster, run the command:
yc managed-clickhouse ml-model list --cluster-name=<cluster_name>
You can request the cluster name with the list of clusters in the folder.
To get a list of models in a cluster, use the list REST API method for the MlModel resource or the MlModelService/List gRPC API call and provide the cluster ID in the clusterId request parameter.
You can get the cluster ID with a list of clusters in the folder.
Getting detailed information about a model
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and select the Machine learning tab in the left-hand panel.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To get model details, run this command:
yc managed-clickhouse ml-model get <model_name> \
--cluster-name=<cluster_name>
You can request the model name with a list of cluster models and the cluster name with a list of clusters in the folder.
To get model details, use the get REST API method for the MlModel resource or the MlModelService/Get gRPC API call and provide the following in the request:
- Cluster ID in the clusterId parameter.
- Model name in the mlModelName parameter.
You can request the model name with a list of cluster models and the cluster ID with a list of clusters in the folder.
Creating a model
Note
The only supported model type is CatBoost: ML_MODEL_TYPE_CATBOOST.
- Select the cluster:
  - In the management console, go to the folder page and select Managed Service for ClickHouse.
  - Click the cluster name and select the Machine learning tab in the left-hand panel.
  - Click Create model.
- Configure the model parameters:
  - Type: ML_MODEL_TYPE_CATBOOST.
  - Name: Model name. The name forms part of the model file path that you pass to the catboostEvaluate() function to call the model in ClickHouse®.
  - URL: Model address in Yandex Object Storage.
- Click Create and wait for the model to be created.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To create a model, run this command:
yc managed-clickhouse ml-model create <model_name> \
--cluster-name=<cluster_name> \
--type=ML_MODEL_TYPE_CATBOOST \
--uri=<link_to_model_file_in_Object_Storage>
You can request the cluster name with the list of clusters in the folder.
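If you script model creation, it can help to assemble the command as an argument list rather than a shell string. The sketch below only builds and prints the command shown above; the function name `ml_model_create_argv` and the cluster name `my-cluster` are hypothetical, and nothing is executed.

```python
import shlex


def ml_model_create_argv(model_name, cluster_name, uri,
                         model_type="ML_MODEL_TYPE_CATBOOST"):
    """Build the argument list for the create command shown above."""
    for label, value in (("model_name", model_name),
                         ("cluster_name", cluster_name),
                         ("uri", uri)):
        if not value:
            raise ValueError(f"{label} must be non-empty")
    return [
        "yc", "managed-clickhouse", "ml-model", "create", model_name,
        f"--cluster-name={cluster_name}",
        f"--type={model_type}",
        f"--uri={uri}",
    ]


argv = ml_model_create_argv(
    "ml_test",
    "my-cluster",  # hypothetical cluster name
    "https://storage.yandexcloud.net/managed-clickhouse/catboost_model.bin",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

To actually run the command, you could pass `argv` to `subprocess.run(argv, check=True)` on a machine where the CLI is installed and initialized.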
- Open the current Terraform configuration file with an infrastructure plan.
  For more information about creating this file, see Creating clusters.
- Add the ml_model block with a description of the added machine learning model to the Managed Service for ClickHouse® cluster description:
  resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
    ...
    ml_model {
      name = "<model_name>"
      type = "ML_MODEL_TYPE_CATBOOST"
      uri  = "<link_to_model_file_in_Object_Storage>"
    }
  }
- Make sure the settings are correct.
  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:
    terraform validate
    If there are errors in the configuration files, Terraform will point to them.
- Confirm updating the resources.
  - Run the command to view planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
For more information, see the Terraform provider documentation.
Time limits
The Terraform provider sets the following timeouts for Managed Service for ClickHouse® cluster operations:
- Creating a cluster, including by restoring one from a backup: 60 minutes.
- Editing a cluster: 90 minutes.
- Deleting a cluster: 30 minutes.
Operations exceeding the set timeout are interrupted.
How do I change these limits?
Add the timeouts block to the cluster description, for example:
resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
  ...
  timeouts {
    create = "1h30m" # 1 hour 30 minutes
    update = "2h"    # 2 hours
    delete = "30m"   # 30 minutes
  }
}
To create a model, use the create REST API method for the MlModel resource or the MlModelService/Create gRPC API call and provide the following in the request:
- Cluster ID in the clusterId parameter.
- Model name in the mlModelName parameter.
- ML_MODEL_TYPE_CATBOOST model type in the type parameter.
- Link to the model file in Yandex Object Storage in the uri parameter.
You can get the cluster ID with a list of clusters in the folder.
Applying a model
To apply the model to data stored in a ClickHouse® cluster:
- Execute an SQL query in the format:
  SELECT catboostEvaluate('<model_file_path>', <column_1_name>, <column_2_name>, ..., <column_N_name>) FROM <table_name>
  As the catboostEvaluate() function arguments, specify the following:
  - Path to the model file in /var/lib/clickhouse/models/<model_name>.bin format.
  - Names of columns containing the input data.
The query returns a column with model predictions for each row of the source table.
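The query format above can also be assembled programmatically. The following is a minimal sketch; the helper names `model_path` and `catboost_query` are hypothetical, and the identifier check is this sketch's own safeguard (stricter than the service may require) so that names can be placed into the SQL text without quoting.

```python
import re

# Assumption: names are restricted to simple identifiers for this sketch only.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name: str) -> str:
    if not _IDENT.match(name):
        raise ValueError(f"unsupported name: {name!r}")
    return name


def model_path(model_name: str) -> str:
    """Path to the model file in the format described above."""
    return f"/var/lib/clickhouse/models/{_check(model_name)}.bin"


def catboost_query(model_name, columns, table, limit=None) -> str:
    """Build a SELECT that applies the model to the given columns."""
    if not columns:
        raise ValueError("at least one input column is required")
    args = ", ".join(_check(c) for c in columns)
    sql = (f"SELECT catboostEvaluate('{model_path(model_name)}', {args}) "
           f"FROM {_check(table)}")
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


print(catboost_query("ml_test", ["RESOURCE", "MGR_ID"], "ml_test_table", limit=10))
```

The generated string can then be sent to the cluster with whichever ClickHouse® client you already use.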
Updating a model
Managed Service for ClickHouse® does not track changes in the model file located in the Yandex Object Storage bucket.
To update the contents of a model that is already connected to the cluster:
- Upload the file with the updated model to Yandex Object Storage.
- Get a link to this file.
- Change the parameters of the model connected to Managed Service for ClickHouse® by providing a new link to the model file.
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and select the Machine learning tab in the left-hand panel.
- Select the appropriate model, click the options icon next to it, and select Edit.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To change the link to the model file in the Yandex Object Storage bucket, run the command:
yc managed-clickhouse ml-model update <model_name> \
--cluster-name=<cluster_name> \
--uri=<new_link_to_file_in_Object_Storage>
You can request the model name with a list of cluster models and the cluster name with a list of clusters in the folder.
- Open the current Terraform configuration file with an infrastructure plan.
  For more information about creating this file, see Creating clusters.
- In the Managed Service for ClickHouse® cluster description, change the uri parameter value under ml_model:
  resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
    ...
    ml_model {
      name = "<model_name>"
      type = "ML_MODEL_TYPE_CATBOOST"
      uri  = "<new_link_to_model_file_in_Object_Storage>"
    }
  }
- Make sure the settings are correct.
  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:
    terraform validate
    If there are errors in the configuration files, Terraform will point to them.
- Confirm updating the resources.
  - Run the command to view planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
For more information, see the Terraform provider documentation.
To update a model, use the update REST API method for the MlModel resource or the MlModelService/Update gRPC API call and provide the following in the request:
- Cluster ID in the clusterId parameter. To find out the cluster ID, get a list of clusters in the folder.
- Model name in the mlModelName parameter.
- New link to the model file in Yandex Object Storage in the uri parameter.
- List of cluster configuration fields to update in the updateMask parameter.
Warning
The API method will assign default values to all the parameters of the object you are modifying unless you explicitly provide them in your request. To avoid this, list the settings you want to change in the updateMask parameter as a single comma-separated string.
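To make the warning concrete, the sketch below derives updateMask from the settings actually being changed, so the two cannot drift apart. It only assembles the parameters named above into a dictionary; the function name `build_update_params` and the sample cluster ID are hypothetical, and how the parameters are split between the URL and the request body is left to the API reference.

```python
# Assumption: uri is the only changeable model setting, as listed above.
ALLOWED_FIELDS = ("uri",)


def build_update_params(cluster_id: str, model_name: str, changes: dict) -> dict:
    """Collect update parameters and a matching comma-separated updateMask."""
    if not changes:
        raise ValueError("nothing to update")
    unknown = sorted(set(changes) - set(ALLOWED_FIELDS))
    if unknown:
        raise ValueError(f"unsupported fields: {', '.join(unknown)}")
    params = {"clusterId": cluster_id, "mlModelName": model_name}
    params.update(changes)
    # The mask names exactly the fields being changed, as one string.
    params["updateMask"] = ",".join(sorted(changes))
    return params


params = build_update_params(
    "<cluster_ID>",  # placeholder, not a real identifier
    "ml_test",
    {"uri": "https://storage.yandexcloud.net/managed-clickhouse/catboost_model.bin"},
)
print(params["updateMask"])
```

Deriving the mask from the `changes` dictionary means a field is never sent without being listed, and never listed without being sent.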
Disabling a model
Note
After disabling a model, the corresponding object is kept in the Yandex Object Storage bucket. If you no longer need this model object, you can delete it.
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and select the Machine learning tab in the left-hand panel.
- Select the appropriate model, click the options icon next to it, and select Delete.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To disable a model, run the command:
yc managed-clickhouse ml-model delete <model_name> \
--cluster-name=<cluster_name>
You can request the model name with a list of cluster models and the cluster name with a list of clusters in the folder.
- Open the current Terraform configuration file with an infrastructure plan.
  For more information about creating this file, see Creating clusters.
- Delete the ml_model block that describes the model from the Managed Service for ClickHouse® cluster description.
- Make sure the settings are correct.
  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:
    terraform validate
    If there are errors in the configuration files, Terraform will point to them.
- Confirm updating the resources.
  - Run the command to view planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
For more information, see the Terraform provider documentation.
To delete a model, use the delete REST API method for the MlModel resource or the MlModelService/Delete gRPC API call and provide the following in the request:
- Cluster ID in the clusterId parameter.
- Model name in the mlModelName parameter.
You can request the cluster ID with a list of clusters in the folder and model name with a list of models in the cluster.
Example
If you do not have a suitable data set or model to process it, you can test machine learning in Managed Service for ClickHouse® using this example. We prepared a data file for it and trained a model to analyze it. You can upload data to ClickHouse® and see model predictions for different rows of the table.
Note
In this example, we are going to use public data from the Amazon Employee Access Challenge. The model is trained to predict values in the ACTION column. The same data and model are used on GitHub.
To upload data to ClickHouse® and test the model:
- In the management console, add the test model:
  - Type: ML_MODEL_TYPE_CATBOOST.
  - Name: ml_test.
  - URL: https://storage.yandexcloud.net/managed-clickhouse/catboost_model.bin.
- Download the file with data to analyze.
- Create a test table:
  CREATE TABLE ml_test_table (
      date Date MATERIALIZED today(),
      ACTION UInt8,
      RESOURCE UInt32,
      MGR_ID UInt32,
      ROLE_ROLLUP_1 UInt32,
      ROLE_ROLLUP_2 UInt32,
      ROLE_DEPTNAME UInt32,
      ROLE_TITLE UInt32,
      ROLE_FAMILY_DESC UInt32,
      ROLE_FAMILY UInt32,
      ROLE_CODE UInt32
  )
  ENGINE = MergeTree()
  PARTITION BY date
  ORDER BY date;
- Upload the data to the table:
  INSERT INTO ml_test_table FROM INFILE '<path_to_file>/train.csv' FORMAT CSVWithNames;
- Test the model:
  - Get predicted values in the ACTION column for the first 10 rows in the table:
    SELECT
        catboostEvaluate('/var/lib/clickhouse/models/ml_test.bin',
            RESOURCE, MGR_ID, ROLE_ROLLUP_1, ROLE_ROLLUP_2, ROLE_DEPTNAME,
            ROLE_TITLE, ROLE_FAMILY_DESC, ROLE_FAMILY, ROLE_CODE) > 0 AS prediction,
        ACTION AS target
    FROM ml_test_table
    LIMIT 10;
  - Get the probability prediction for the first 10 rows in the table:
    SELECT
        catboostEvaluate('/var/lib/clickhouse/models/ml_test.bin',
            RESOURCE, MGR_ID, ROLE_ROLLUP_1, ROLE_ROLLUP_2, ROLE_DEPTNAME,
            ROLE_TITLE, ROLE_FAMILY_DESC, ROLE_FAMILY, ROLE_CODE) AS prediction,
        1. / (1 + exp(-prediction)) AS probability,
        ACTION AS target
    FROM ml_test_table
    LIMIT 10;
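The two test queries are related: the second one passes the raw model output through the logistic function, 1 / (1 + exp(-prediction)), and the `> 0` check in the first query is equivalent to asking whether that probability exceeds 0.5. The sketch below reproduces this arithmetic outside the database; the sample raw values are made up for illustration and are not model output.

```python
import math


def probability(raw: float) -> float:
    """Logistic transform of a raw score, matching 1 / (1 + exp(-prediction))."""
    if raw >= 0:
        return 1.0 / (1.0 + math.exp(-raw))
    # For negative scores use the equivalent form to avoid overflow in exp().
    e = math.exp(raw)
    return e / (1.0 + e)


def predicted_class(raw: float) -> int:
    """Class label as computed by the `> 0` comparison in the first query."""
    return int(raw > 0)


# Illustrative raw scores only.
for raw in (-2.0, -0.1, 0.1, 1.5):
    p = probability(raw)
    # The two decision rules agree: raw > 0 exactly when p > 0.5.
    print(raw, round(p, 4), predicted_class(raw), int(p > 0.5))
```

Thresholding the raw score at 0 is therefore a shortcut for thresholding the probability at 0.5; a different probability cutoff would need a correspondingly different raw threshold.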
ClickHouse® is a registered trademark of ClickHouse, Inc.