Managing machine learning models in Managed Service for ClickHouse®
Managed Service for ClickHouse® allows you to analyze data by applying CatBoost machine learning models.
To apply a model, add it to your cluster and call it in an SQL query using the built-in catboostEvaluate() function. After running this query, you get model predictions for each row of input data.
Read more about the catboostEvaluate() function in the ClickHouse® documentation.
Before adding a model
Managed Service for ClickHouse® only works with readable models uploaded to Yandex Object Storage:
- To link your service account to the cluster, make sure your Yandex Cloud account has the iam.serviceAccounts.user role or higher.
- Upload the trained model file to Yandex Object Storage.
- Connect the service account to the cluster. You will use your service account to configure permissions to access the model file.
- Assign the storage.viewer role to the service account.
- In the bucket's ACL, add the READ permission to the service account.
- Get a link to the model file.
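The link format can be illustrated with a short sketch. It assumes a path-style link of the form `https://storage.yandexcloud.net/<bucket>/<object_key>`, which matches the test model URL used in the example at the end of this article; the helper name `model_file_link` is hypothetical, and your bucket may require a different kind of link.

```python
from urllib.parse import quote

# Assumption: path-style endpoint, as in the example URL in this article.
STORAGE_ENDPOINT = "https://storage.yandexcloud.net"


def model_file_link(bucket: str, object_key: str) -> str:
    """Return a path-style link to an object in a bucket (hypothetical helper)."""
    # Percent-encode unsafe characters but keep "/" so that keys with
    # prefixes such as "models/v1/model.bin" stay readable.
    return f"{STORAGE_ENDPOINT}/{quote(bucket)}/{quote(object_key, safe='/')}"


print(model_file_link("managed-clickhouse", "catboost_model.bin"))
```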
Getting a list of models in a cluster
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and select the Machine learning tab in the left-hand panel.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To get a list of models in a cluster, run the command:
yc managed-clickhouse ml-model list --cluster-name=<cluster_name>
You can request the cluster name with a list of clusters in the folder.
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Use the MlModel.List method and send the following request, e.g., via cURL:
  curl \
      --request GET \
      --header "Authorization: Bearer $IAM_TOKEN" \
      --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels'
  You can get the cluster ID with a list of clusters in the folder.
- View the server response to make sure the request was successful.
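The same REST request can be prepared from a script. The following is a minimal sketch using only the Python standard library; the function name `list_models_request` is hypothetical, and the URL is the one from the cURL example above. The request object is built but not sent.

```python
import urllib.request

API_HOST = "https://mdb.api.cloud.yandex.net"


def list_models_request(cluster_id: str, iam_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request listing the models of a cluster."""
    url = f"{API_HOST}/managed-clickhouse/v1/clusters/{cluster_id}/mlModels"
    return urllib.request.Request(
        url,
        method="GET",
        headers={"Authorization": f"Bearer {iam_token}"},
    )


req = list_models_request("<cluster_ID>", "<IAM_token>")
print(req.get_method(), req.full_url)
```

To actually send the request, pass the object to `urllib.request.urlopen()` with a real cluster ID and IAM token.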
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Use the MlModelService.List call and send the following request, e.g., via gRPCurl:
  grpcurl \
      -format json \
      -import-path ~/cloudapi/ \
      -import-path ~/cloudapi/third_party/googleapis/ \
      -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
      -rpc-header "Authorization: Bearer $IAM_TOKEN" \
      -d '{
            "cluster_id": "<cluster_ID>"
          }' \
      mdb.api.cloud.yandex.net:443 \
      yandex.cloud.mdb.clickhouse.v1.MlModelService.List
  You can get the cluster ID with a list of clusters in the folder.
- View the server response to make sure the request was successful.
Getting detailed information about a model
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and select the Machine learning tab in the left-hand panel.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To get model details, run this command:
yc managed-clickhouse ml-model get <model_name> \
--cluster-name=<cluster_name>
You can request the model name with a list of cluster models and the cluster name with a list of clusters in the folder.
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Use the MlModel.Get method and send the following request, e.g., via cURL:
  curl \
      --request GET \
      --header "Authorization: Bearer $IAM_TOKEN" \
      --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels/<model_name>'
  You can request the cluster ID with a list of clusters in the folder and model name with a list of models in the cluster.
- View the server response to make sure the request was successful.
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Use the MlModelService.Get call and send the following request, e.g., via gRPCurl:
  grpcurl \
      -format json \
      -import-path ~/cloudapi/ \
      -import-path ~/cloudapi/third_party/googleapis/ \
      -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
      -rpc-header "Authorization: Bearer $IAM_TOKEN" \
      -d '{
            "cluster_id": "<cluster_ID>",
            "ml_model_name": "<model_name>"
          }' \
      mdb.api.cloud.yandex.net:443 \
      yandex.cloud.mdb.clickhouse.v1.MlModelService.Get
  You can request the cluster ID with a list of clusters in the folder and model name with a list of models in the cluster.
- View the server response to make sure the request was successful.
Creating a model
Note
The only supported model type is CatBoost: ML_MODEL_TYPE_CATBOOST.
- Select the cluster:
  - In the management console, go to the folder page and select Managed Service for ClickHouse.
  - Click the cluster name and select the Machine learning tab in the left-hand panel.
  - Click Create model.
- Configure the model parameters:
  - Type: ML_MODEL_TYPE_CATBOOST.
  - Name: Model name. The model name is one of the arguments of the catboostEvaluate() function, which is used to call the model in ClickHouse®.
  - URL: Model address in Yandex Object Storage.
- Click Create and wait for the model to be created.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To create a model, run this command:
yc managed-clickhouse ml-model create <model_name> \
--cluster-name=<cluster_name> \
--type=ML_MODEL_TYPE_CATBOOST \
--uri=<link_to_model_file_in_Object_Storage>
You can request the cluster name with a list of clusters in the folder.
- Open the current Terraform configuration file with an infrastructure plan.
  For more information about creating this file, see Creating clusters.
- To the Managed Service for ClickHouse® cluster description, add the ml_model block with a description of the added machine learning model:
  resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
    ...
    ml_model {
      name = "<model_name>"
      type = "ML_MODEL_TYPE_CATBOOST"
      uri  = "<link_to_model_file_in_Object_Storage>"
    }
  }
- Make sure the settings are correct.
  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:
    terraform validate
    If there are errors in the configuration files, Terraform will point to them.
- Confirm updating the resources.
  - Run the command to view planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
For more information, see the Terraform provider documentation.
Time limits
A Terraform provider sets the timeout for Managed Service for ClickHouse® cluster operations:
- Creating a cluster, including by restoring one from a backup: 60 minutes.
- Editing a cluster: 90 minutes.
- Deleting a cluster: 30 minutes.
Operations exceeding the set timeout are interrupted.
How do I change these limits?
Add the timeouts block to the cluster description, for example:
resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
  ...
  timeouts {
    create = "1h30m" # 1 hour 30 minutes
    update = "2h"    # 2 hours
    delete = "30m"   # 30 minutes
  }
}
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Use the MlModel.Create method and send the following request, e.g., via cURL:
  curl \
      --request POST \
      --header "Authorization: Bearer $IAM_TOKEN" \
      --header "Content-Type: application/json" \
      --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels' \
      --data '{
                "mlModelName": "<model_name>",
                "type": "ML_MODEL_TYPE_CATBOOST",
                "uri": "<file_link>"
              }'
  Where:
  - mlModelName: Model name.
  - type: Model type, always takes the ML_MODEL_TYPE_CATBOOST value.
  - uri: Link to the model file in Object Storage.
  You can get the cluster ID with a list of clusters in the folder.
- View the server response to make sure the request was successful.
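As an illustration of the request body described above, here is a minimal standard-library sketch that assembles the same POST request. The function name `create_model_request` is hypothetical; the field names and URL come from the cURL example. The request is built but not sent.

```python
import json
import urllib.request

API_HOST = "https://mdb.api.cloud.yandex.net"


def create_model_request(cluster_id: str, iam_token: str,
                         model_name: str, uri: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that adds a CatBoost model."""
    body = {
        "mlModelName": model_name,
        "type": "ML_MODEL_TYPE_CATBOOST",  # the only supported model type
        "uri": uri,                        # link to the file in Object Storage
    }
    return urllib.request.Request(
        f"{API_HOST}/managed-clickhouse/v1/clusters/{cluster_id}/mlModels",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {iam_token}",
            "Content-Type": "application/json",
        },
    )


req = create_model_request("<cluster_ID>", "<IAM_token>", "<model_name>", "<file_link>")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```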
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Use the MlModelService.Create call and send the following request, e.g., via gRPCurl:
  grpcurl \
      -format json \
      -import-path ~/cloudapi/ \
      -import-path ~/cloudapi/third_party/googleapis/ \
      -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
      -rpc-header "Authorization: Bearer $IAM_TOKEN" \
      -d '{
            "cluster_id": "<cluster_ID>",
            "ml_model_name": "<model_name>",
            "type": "ML_MODEL_TYPE_CATBOOST",
            "uri": "<file_link>"
          }' \
      mdb.api.cloud.yandex.net:443 \
      yandex.cloud.mdb.clickhouse.v1.MlModelService.Create
  Where:
  - ml_model_name: Model name.
  - type: Model type, always takes the ML_MODEL_TYPE_CATBOOST value.
  - uri: Link to the model file in Object Storage.
  You can get the cluster ID with a list of clusters in the folder.
- View the server response to make sure the request was successful.
Applying a model
To apply the model to data stored in a ClickHouse® cluster:
- Execute an SQL query in the format:
  SELECT catboostEvaluate('<path_to_model_file>', <column_1_name>, <column_2_name>, ... <column_N_name>) FROM <table_name>
  As the catboostEvaluate() function arguments, specify the following:
  - Path to the model file in /var/lib/clickhouse/models/<model_name>.bin format.
  - Names of columns containing the input data.
The result of the query execution will be a column with model predictions for each row of the source table.
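If you generate such queries from code, the query format above can be sketched as a small helper. This is an illustration only: the function name `catboost_query` is hypothetical, and the restriction of names to letters, digits, and underscores is an assumption made here to keep the generated SQL safe, not a rule of the service.

```python
import re

MODELS_DIR = "/var/lib/clickhouse/models"
# Assumption: only simple identifiers are accepted, to avoid SQL injection.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def catboost_query(model_name: str, columns: list[str], table: str) -> str:
    """Build a SELECT that applies a cluster model to the given columns."""
    for name in [model_name, table, *columns]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    if not columns:
        raise ValueError("at least one input column is required")
    model_path = f"{MODELS_DIR}/{model_name}.bin"
    return (
        f"SELECT catboostEvaluate('{model_path}', {', '.join(columns)}) "
        f"FROM {table}"
    )


print(catboost_query("ml_test", ["RESOURCE", "MGR_ID"], "ml_test_table"))
```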
Updating a model
Managed Service for ClickHouse® does not track changes in the model file located in the Yandex Object Storage bucket.
To update the contents of a model that is already connected to the cluster:
- Upload the file with the current model to Yandex Object Storage.
- Get a link to this file.
- Change the parameters of the model connected to Managed Service for ClickHouse® by providing a new link to the model file.
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and select the Machine learning tab in the left-hand panel.
- Select the appropriate model, click the menu icon next to it, and select Edit.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To change the link to the model file in the Yandex Object Storage bucket, run the command:
yc managed-clickhouse ml-model update <model_name> \
--cluster-name=<cluster_name> \
--uri=<new_link_to_file_in_Object_Storage>
You can request the model name with a list of cluster models and the cluster name with a list of clusters in the folder.
- Open the current Terraform configuration file with an infrastructure plan.
  For more information about creating this file, see Creating clusters.
- In the Managed Service for ClickHouse® cluster description, change the uri parameter value under ml_model:
  resource "yandex_mdb_clickhouse_cluster" "<cluster_name>" {
    ...
    ml_model {
      name = "<model_name>"
      type = "ML_MODEL_TYPE_CATBOOST"
      uri  = "<new_link_to_model_file_in_Object_Storage>"
    }
  }
- Make sure the settings are correct.
  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:
    terraform validate
    If there are errors in the configuration files, Terraform will point to them.
- Confirm updating the resources.
  - Run the command to view planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
For more information, see the Terraform provider documentation.
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Use the MlModel.Update method and send the following request, e.g., via cURL:
  Warning
  The API method will assign default values to all the parameters of the object you are modifying unless you explicitly provide them in your request. To avoid this, list the settings you want to change in the updateMask parameter as a single comma-separated string.
  curl \
      --request PATCH \
      --header "Authorization: Bearer $IAM_TOKEN" \
      --header "Content-Type: application/json" \
      --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels/<model_name>' \
      --data '{
                "updateMask": "uri",
                "uri": "<file_link>"
              }'
  Where:
  - updateMask: List of parameters to update as a single string, separated by commas. Here only one parameter is specified: uri.
  - uri: Link to the new model file in Object Storage.
  You can get the cluster ID with a list of clusters in the folder.
- View the server response to make sure the request was successful.
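The relationship between updateMask and the changed fields can be shown with a minimal sketch: the mask is derived from the same dictionary that holds the new values, so the two cannot drift apart. The function name `rest_update_body` is hypothetical.

```python
import json


def rest_update_body(changes: dict) -> str:
    """Return a JSON body for the REST update call.

    `changes` maps setting names to their new values; the updateMask is a
    single comma-separated string listing exactly those settings.
    """
    if not changes:
        raise ValueError("nothing to update")
    body = {"updateMask": ",".join(changes)}
    body.update(changes)
    return json.dumps(body, indent=2)


print(rest_update_body({"uri": "<file_link>"}))
```

The resulting string is what the cURL example passes in `--data`.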
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Use the MlModelService.Update call and send the following request, e.g., via gRPCurl:
  Warning
  The API method will assign default values to all the parameters of the object you are modifying unless you explicitly provide them in your request. To avoid this, list the settings you want to change in the update_mask parameter as an array of paths[] strings.
  Format for listing settings:
  "update_mask": {
    "paths": [
      "<setting_1>",
      "<setting_2>",
      ...
      "<setting_N>"
    ]
  }
  grpcurl \
      -format json \
      -import-path ~/cloudapi/ \
      -import-path ~/cloudapi/third_party/googleapis/ \
      -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
      -rpc-header "Authorization: Bearer $IAM_TOKEN" \
      -d '{
            "cluster_id": "<cluster_ID>",
            "ml_model_name": "<model_name>",
            "update_mask": {
              "paths": ["uri"]
            },
            "uri": "<file_link>"
          }' \
      mdb.api.cloud.yandex.net:443 \
      yandex.cloud.mdb.clickhouse.v1.MlModelService.Update
  Where:
  - ml_model_name: Model name.
  - update_mask: List of parameters to update as an array of paths[] strings. Here only one parameter is specified: uri.
  - uri: Link to the new model file in Object Storage.
  You can get the cluster ID with a list of clusters in the folder.
- View the server response to make sure the request was successful.
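For the gRPC call, the mask is an array of paths rather than a comma-separated string. The sketch below builds the JSON document passed to gRPCurl in the `-d` argument; the function name `grpc_update_payload` is hypothetical, and the field names follow the request shown above.

```python
import json


def grpc_update_payload(cluster_id: str, model_name: str, changes: dict) -> str:
    """Return the JSON payload for the gRPC update call.

    `changes` maps setting names to new values; update_mask.paths lists
    exactly those settings so that other settings are not reset.
    """
    if not changes:
        raise ValueError("nothing to update")
    payload = {
        "cluster_id": cluster_id,
        "ml_model_name": model_name,
        "update_mask": {"paths": list(changes)},
    }
    payload.update(changes)
    return json.dumps(payload, indent=2)


print(grpc_update_payload("<cluster_ID>", "<model_name>", {"uri": "<file_link>"}))
```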
Disabling a model
Note
After disabling a model, the corresponding object is kept in the Yandex Object Storage bucket. If you no longer need this model object, you can delete it.
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and select the Machine learning tab in the left-hand panel.
- Select the appropriate model, click the menu icon next to it, and select Delete.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To disable a model, run the command:
yc managed-clickhouse ml-model delete <model_name> \
--cluster-name=<cluster_name>
You can request the model name with a list of cluster models and the cluster name with a list of clusters in the folder.
- Open the current Terraform configuration file with an infrastructure plan.
  For more information about creating this file, see Creating clusters.
- Delete the ml_model block describing the appropriate model from the Managed Service for ClickHouse® cluster description.
- Make sure the settings are correct.
  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:
    terraform validate
    If there are errors in the configuration files, Terraform will point to them.
- Confirm updating the resources.
  - Run the command to view planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
For more information, see the Terraform provider documentation.
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Use the MlModel.Delete method and send the following request, e.g., via cURL:
  curl \
      --request DELETE \
      --header "Authorization: Bearer $IAM_TOKEN" \
      --url 'https://mdb.api.cloud.yandex.net/managed-clickhouse/v1/clusters/<cluster_ID>/mlModels/<model_name>'
  You can request the cluster ID with a list of clusters in the folder and model name with a list of models in the cluster.
- View the server response to make sure the request was successful.
- Get an IAM token for API authentication and put it into the environment variable:
  export IAM_TOKEN="<IAM_token>"
- Clone the cloudapi repository:
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Use the MlModelService.Delete call and send the following request, e.g., via gRPCurl:
  grpcurl \
      -format json \
      -import-path ~/cloudapi/ \
      -import-path ~/cloudapi/third_party/googleapis/ \
      -proto ~/cloudapi/yandex/cloud/mdb/clickhouse/v1/ml_model_service.proto \
      -rpc-header "Authorization: Bearer $IAM_TOKEN" \
      -d '{
            "cluster_id": "<cluster_ID>",
            "ml_model_name": "<model_name>"
          }' \
      mdb.api.cloud.yandex.net:443 \
      yandex.cloud.mdb.clickhouse.v1.MlModelService.Delete
  You can get the cluster ID with a list of clusters in the folder.
- View the server response to make sure the request was successful.
Example
If you do not have a suitable data set or model to process it, you can test machine learning in Managed Service for ClickHouse® using this example. We prepared a data file for it and trained a model to analyze it. You can upload data to ClickHouse® and see model predictions for different rows of the table.
Note
In this example, we are going to use public data from the Amazon Employee Access Challenge. The model is trained to predict values in the ACTION column. The same data and model are used on GitHub.
To upload data to ClickHouse® and test the model:
- In the management console, add the test model:
  - Type: ML_MODEL_TYPE_CATBOOST.
  - Name: ml_test.
  - URL: https://storage.yandexcloud.net/managed-clickhouse/catboost_model.bin.
- Download the file with data to analyze.
- Create a test table:
  CREATE TABLE ml_test_table (
      date Date MATERIALIZED today(),
      ACTION UInt8,
      RESOURCE UInt32,
      MGR_ID UInt32,
      ROLE_ROLLUP_1 UInt32,
      ROLE_ROLLUP_2 UInt32,
      ROLE_DEPTNAME UInt32,
      ROLE_TITLE UInt32,
      ROLE_FAMILY_DESC UInt32,
      ROLE_FAMILY UInt32,
      ROLE_CODE UInt32
  )
  ENGINE = MergeTree()
  PARTITION BY date
  ORDER BY date;
- Upload the data to the table:
  INSERT INTO ml_test_table FROM INFILE '<path_to_file>/train.csv' FORMAT CSVWithNames;
- Test the model:
  - Get predicted values in the ACTION column for the first 10 rows in the table:
    SELECT
        catboostEvaluate('/var/lib/clickhouse/models/ml_test.bin', RESOURCE, MGR_ID, ROLE_ROLLUP_1, ROLE_ROLLUP_2, ROLE_DEPTNAME, ROLE_TITLE, ROLE_FAMILY_DESC, ROLE_FAMILY, ROLE_CODE) > 0 AS prediction,
        ACTION AS target
    FROM ml_test_table
    LIMIT 10;
  - Get the probability prediction for the first 10 rows in the table:
    SELECT
        catboostEvaluate('/var/lib/clickhouse/models/ml_test.bin', RESOURCE, MGR_ID, ROLE_ROLLUP_1, ROLE_ROLLUP_2, ROLE_DEPTNAME, ROLE_TITLE, ROLE_FAMILY_DESC, ROLE_FAMILY, ROLE_CODE) AS prediction,
        1. / (1 + exp(-prediction)) AS probability,
        ACTION AS target
    FROM ml_test_table
    LIMIT 10;
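The two queries above are related: the second one passes the raw model output through the logistic (sigmoid) function, 1 / (1 + exp(-prediction)), and a raw output greater than 0 corresponds to a probability greater than 0.5. The following sketch reproduces that arithmetic outside ClickHouse®; the function names are illustrative, and the branch for negative inputs is an addition made here only to avoid floating-point overflow.

```python
import math


def probability(raw_prediction: float) -> float:
    """Map a raw model output to a probability with the logistic function."""
    if raw_prediction >= 0:
        return 1.0 / (1.0 + math.exp(-raw_prediction))
    # Equivalent form for negative inputs; avoids overflow in exp().
    e = math.exp(raw_prediction)
    return e / (1.0 + e)


def predicted_class(raw_prediction: float) -> int:
    """Same rule as `catboostEvaluate(...) > 0` in the first query."""
    return 1 if raw_prediction > 0 else 0


for raw in (-2.0, 0.0, 2.0):
    print(raw, predicted_class(raw), probability(raw))
```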
ClickHouse® is a registered trademark of ClickHouse, Inc.