Using data from Object Storage to train a model in Yandex DataSphere
You will create a bucket in Object Storage, mount it to the Windows file system using rclone, and use the data stored in it to train a model in DataSphere.
To use data from Object Storage to train a model in DataSphere:
- Prepare your infrastructure.
- Create a static access key.
- Create a bucket.
- Set up a connection to Object Storage.
- Mount the bucket.
- Prepare data for model training.
- Create an S3 connector.
- Attach your bucket to a project.
- Train the model.
If you no longer need the resources you created, delete them.
Getting started
Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.
- On the DataSphere home page, click Try for free and select an account to log in with: Yandex ID or your working account in the identity federation (SSO).
- Select the Yandex Cloud Organization organization you are going to use in Yandex Cloud.
- Create a community.
- Link your billing account to the DataSphere community you are going to work in. Make sure that you have a billing account linked and its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account yet, create one in the DataSphere interface.
Required paid resources
The cost of training a model based on data from Object Storage includes:
- Fee for DataSphere computing resource usage.
- Fee for storing data in a bucket (see Object Storage pricing).
- Fee for data operations (see Object Storage pricing).
Prepare the infrastructure
Log in to the Yandex Cloud management console.
If you have an active billing account, you can create or select a folder for your infrastructure on the cloud page.
Note
If you use an identity federation to access Yandex Cloud, billing details might be unavailable to you. In this case, contact your Yandex Cloud organization administrator.
Set up a working environment
Download and install the winfsp distribution.
Download the utilities used in the steps below, psexec and rclone, from their developer websites and unpack them to the working directory on your local computer.
Create a folder
- In the management console, select a cloud and click Create folder.
- Give your folder a name, e.g., data-folder.
- Click Create.
Create a service account for Object Storage
To access a bucket in Object Storage, you need a service account with the storage.editor role.
- In the management console, go to data-folder.
- In the list of services, select Identity and Access Management.
- Click Create service account.
- Enter a name for the service account, e.g., datasphere-sa.
- Click Add role and assign the storage.editor role to the service account.
- Click Create.
Create a static access key
To access Object Storage from DataSphere, you need a static key.
- In the management console, navigate to the folder the service account belongs to.
- In the list of services, select Identity and Access Management.
- In the left-hand panel, select Service accounts.
- In the list that opens, select the datasphere-sa service account.
- In the top panel, click Create new key.
- Select Create static access key.
- Specify the static key description and click Create.
- Save the ID and secret key. After you close the dialog, the key value will become unavailable.
- Create a static access key for the datasphere-sa service account:
yc iam access-key create --service-account-name datasphere-sa
Result:
access_key:
  id: aje6t3vsbj8l********
  service_account_id: ajepg0mjt06s********
  created_at: "2022-07-18T14:37:51Z"
  key_id: 0n8X6WY6S24N7Oj*****
secret: JyTRFdqw8t1kh2-OJNz4JX5ZTz9Dj1rI9hx*****
For more information about the yc iam access-key create command, see the CLI reference.
- Save the ID (key_id) and secret key (secret). You will not be able to get the key value again.
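If you create the key from a script, you may want to pick the two values out of the command output right away. The following is a minimal sketch, assuming the output has the plain-text layout shown in the result above; the parse_access_key helper is hypothetical and is not part of the yc CLI.

```python
# Sample text copied from the result shown above (values are masked).
SAMPLE_OUTPUT = """\
access_key:
  id: aje6t3vsbj8l********
  service_account_id: ajepg0mjt06s********
  created_at: "2022-07-18T14:37:51Z"
  key_id: 0n8X6WY6S24N7Oj*****
secret: JyTRFdqw8t1kh2-OJNz4JX5ZTz9Dj1rI9hx*****
"""


def parse_access_key(output):
    """Collect 'name: value' pairs from the command output into a dict.

    Hypothetical helper: lines without a value (such as 'access_key:')
    are skipped, and surrounding double quotes are removed.
    """
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        name, separator, value = line.strip().partition(":")
        value = value.strip().strip('"')
        if separator and value:
            fields[name.strip()] = value
    return fields


credentials = parse_access_key(SAMPLE_OUTPUT)
# Print only the key ID; avoid echoing the secret to the console.
print(credentials["key_id"])
```

Store the secret in a protected location rather than in the script itself, since you cannot retrieve it again later.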
Create a bucket
- In the management console, select the folder you want to create a bucket in.
- In the list of services, select Object Storage.
- At the top right, click Create bucket.
- In the Name field, enter a name for the bucket consistent with the naming conventions.
- In the Object read access, Object listing access, and Read access to settings fields, select Restricted.
- Click Create bucket.
- If you do not have the AWS CLI yet, install and configure it.
- Create a bucket, specifying a name that follows the naming conventions:
aws --endpoint-url https://storage.yandexcloud.net \
  s3 mb s3://<bucket_name>
Result:
make_bucket: <bucket_name>
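Before running the command, you can check a candidate bucket name locally. This is a rough sketch under stated assumptions: it applies the common S3-style rules (3 to 63 characters; lowercase Latin letters, digits, hyphens, and periods; the first and last characters are a letter or digit; no IP-address-like names). The naming conventions page remains the authoritative source, and is_valid_bucket_name is a hypothetical helper.

```python
import re

# Assumed S3-style rules; check them against the naming conventions page.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_LIKE_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_valid_bucket_name(name):
    """Return True if the name passes the assumed naming rules."""
    if _IP_LIKE_PATTERN.fullmatch(name):
        return False
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["my-training-data", "My_Bucket", "ab", "192.168.0.1"]:
    print(candidate, is_valid_bucket_name(candidate))
```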
Note
Terraform uses a service account to interact with Object Storage. Assign the required role, e.g., storage.admin, to the service account for the folder where you are going to create resources.
If you don't have Terraform, install it and configure the Yandex Cloud provider.
- Describe the parameters for creating a service account and access key in the configuration file:
...
// Creating a service account
resource "yandex_iam_service_account" "sa" {
  name = "<service_account_name>"
}

// Assigning a role to a service account
resource "yandex_resourcemanager_folder_iam_member" "sa-admin" {
  folder_id = "<folder_ID>"
  role      = "storage.admin"
  member    = "serviceAccount:${yandex_iam_service_account.sa.id}"
}

// Creating a static access key
resource "yandex_iam_service_account_static_access_key" "sa-static-key" {
  service_account_id = yandex_iam_service_account.sa.id
  description        = "static access key for object storage"
}
- Add a section with bucket parameters to the configuration file and enter the bucket name following the naming conventions:
resource "yandex_storage_bucket" "<bucket_name>" {
  access_key = yandex_iam_service_account_static_access_key.sa-static-key.access_key
  secret_key = yandex_iam_service_account_static_access_key.sa-static-key.secret_key
  bucket     = "<bucket_name>"
}
For more information about the yandex_storage_bucket resource, see the Terraform provider documentation.
- Create resources:
  - In the terminal, change to the folder where you edited the configuration file.
  - Make sure the configuration file is correct using the command:
terraform validate
If the configuration is correct, the following message is returned:
Success! The configuration is valid.
  - Run the command:
terraform plan
The terminal will display a list of resources with parameters. No changes are made at this step. If the configuration contains errors, Terraform will point them out.
  - Apply the configuration changes:
terraform apply
  - Confirm the changes: type yes in the terminal and press Enter.
Terraform will create all the required resources. You can check the new resources and their configuration using the management console.
To create a bucket via the API, use the create REST API method for the Bucket resource, the BucketService/Create gRPC API call, or the create S3 API method.
Set up a connection to Object Storage
To move data from your local disk to Object Storage, configure the rclone utility.
- Open the command line in your working directory as an admin user and run the following command:
.\psexec -i -s cmd.exe
- In the console that opens, run whoami and make sure the session has been started by the system user.
- Go to the working directory and run a configuration session for rclone:
rclone.exe config
- Follow the prompts to create a new connection profile:
  - Start creating a new profile by entering n in the terminal.
  - Enter the connection name: s3-connect.
  - Select the storage type by entering 4 in the terminal.
  - Select a provider by entering 1 in the terminal.
  - Select manual entry of credentials by entering 1 in the terminal.
  - Enter the static access key ID in the terminal.
  - Enter the secret key value in the terminal.
  - Specify the region by entering ru-central1 in the terminal.
  - Specify the endpoint by entering storage.yandexcloud.net in the terminal.
  - You can leave the other settings at their defaults; press Enter to skip them.
Note
You can perform advanced connection setup if needed. To do this, type y at the Edit advanced config? step. For more information about advanced settings, see the rclone documentation.
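The interactive session stores the profile as an INI-style section in the rclone configuration file. The sketch below builds an equivalent section in Python so you can see which values the prompts collect. It assumes the storage type chosen in the prompts is S3-compatible storage and uses the rclone S3 option names type, env_auth, access_key_id, secret_access_key, region, and endpoint. The provider option is left out on purpose because the menu numbers differ between rclone versions, and build_rclone_profile is a hypothetical helper.

```python
import configparser
import io


def build_rclone_profile(name, key_id, secret,
                         region="ru-central1",
                         endpoint="storage.yandexcloud.net"):
    """Render one rclone profile as INI text (hypothetical helper)."""
    # Interpolation is disabled so a '%' in the secret is kept literally.
    config = configparser.ConfigParser(interpolation=None)
    config[name] = {
        "type": "s3",
        # 'false' corresponds to entering the credentials manually.
        "env_auth": "false",
        "access_key_id": key_id,
        "secret_access_key": secret,
        "region": region,
        "endpoint": endpoint,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholders stand in for the key ID and secret you saved earlier.
profile_text = build_rclone_profile("s3-connect", "<key_id>", "<secret>")
print(profile_text)
```

You can compare the printed section with the profile that rclone.exe config created to confirm that the region and endpoint were entered correctly.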
Mount the bucket to the Windows file system
- Check your connection to the bucket. In the same command line you used to configure the connection, run the following command with the bucket name specified:
rclone.exe ls s3-connect:<bucket_name>
If the configuration is set up correctly, the objects in the bucket will be listed in the console.
- Mount the bucket to the file system specifying the bucket name and an available drive letter in the file system:
rclone.exe mount s3-connect:<bucket_name> <disk_letter>: --vfs-cache-mode full
You will see a new disk with the objects from the bucket in Windows Explorer.
Prepare data for model training
- Download a CSV file with diabetes prediction data. The dataset is based on Kaggle data for 2015.
- Copy the file to the mounted bucket.
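If you prefer to script the copy step, a sketch like the one below may help. It copies a file into a directory and checks that the sizes match. Here temporary directories stand in for your download folder and the mounted drive, and the sample rows are made up; in practice you would pass the real file path and the drive letter you chose when mounting. The copy_to_mount helper is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_mount(source, mount_root):
    """Copy a file into the mounted bucket and verify its size."""
    target = mount_root / source.name
    # copyfile transfers only the contents, without file metadata.
    shutil.copyfile(source, target)
    if target.stat().st_size != source.stat().st_size:
        raise OSError(f"Size mismatch after copying {source.name}")
    return target


# Demo: temporary directories replace the download folder and the drive.
with tempfile.TemporaryDirectory() as downloads, \
        tempfile.TemporaryDirectory() as drive:
    dataset = Path(downloads) / "diabetes_data.csv"
    dataset.write_text("Diabetes_binary,BMI\n0,26\n1,31\n")  # made-up rows
    copied = copy_to_mount(dataset, Path(drive))
    copied_name = copied.name
    copied_ok = copied.read_text() == dataset.read_text()
    print(copied_name, copied_ok)
```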
Attach your bucket to a project
To connect to the bucket from DataSphere, you need an S3 connector:
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- In the top-right corner, click Create resource. In the pop-up window, select S3 Connector.
- Fill in the fields as follows:
  - Name: Name of the new connector, e.g., s3-datasphere-connect.
  - Endpoint: Object Storage host, which is https://storage.yandexcloud.net/.
  - Bucket: Name of your bucket.
  - Mount name: Name of the volume for mounting the bucket into the project file system.
  - Static access key ID: ID of the static access key used to connect to storage.
  - In the Static access key field, click Create. In the window that opens, enter the secret name and the secret key being used to connect to storage.
- Click Create.
- Go to the S3 connector page and click Activate. Once activated, the bucket will be listed on the S3 Mounts tab in the JupyterLab interface, and you will be able to view it as a file system.
Train the model
Clone the Git repository containing the diabetes_catboost.ipynb notebook with a CatBoost example:
- Open the DataSphere project:
  - Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
  - Click Open project in JupyterLab and wait for the loading to complete.
  - Open the notebook tab.
- In the top menu, click Git and select Clone.
- In the window that opens, enter https://github.com/yandex-cloud-examples/yc-datasphere-s3-mount.git and click Clone.
In the diabetes_catboost.ipynb notebook, you will connect to the diabetes_data.csv dataset in the bucket you mounted and train the CatBoost model:
- Install the required modules:
%pip install catboost
%pip install scikit-learn
%pip install ipywidgets
- Import the required libraries:
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
- Load the data for model training:
data = pd.read_csv('<path_to_dataset>')
data.head()
To get the diabetes_data.csv file path, open the S3 Mounts tab, right-click the file, and select Copy path.
- Separate training factors from the target variable:
X = data.loc[:, data.columns != 'Diabetes_binary']
y = data['Diabetes_binary']
- Split the data into training and validation sets:
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)
- Create a CatBoost classifier and set hyperparameters:
model = CatBoostClassifier(
    iterations=100,
    random_seed=42,
    logging_level='Verbose'
)
- Start model training:
model.fit(
    X_train,
    y_train,
    eval_set=(X_validation, y_validation),
    plot=True
)
- Output the trained model quality metrics:
print(classification_report(y_validation, model.predict(X_validation)))
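The report printed in the last step lists precision, recall, F1 score, and support for each class. To make those numbers easier to read, here is a small plain-Python sketch of how they are derived for the positive class of a binary target. The labels below are made up for illustration and are not taken from the diabetes dataset; binary_metrics is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "f1": f1, "support": tp + fn}


# Made-up labels: 1 means diabetes, 0 means no diabetes.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy example, three of the four positive cases are found and one negative case is flagged by mistake, so precision and recall are both 0.75.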
How to delete the resources you created
To stop paying for the resources you created:
- Delete the objects from the bucket.
- Delete the bucket.
- Delete the project.