Getting started with DataSphere
Yandex DataSphere is an end-to-end ML development environment where you can work in well-known IDEs, use serverless computing technology, and seamlessly combine a broad range of Yandex Cloud computing resource configurations. Yandex DataSphere is part of the data platform and provides powerful features to work with Yandex Cloud services. As an IDE, DataSphere provides Jupyter® Notebook.
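In this section, you will learn how to:
- Create a project
- Run the project
- Set up your environment
- Upload data to the project
- Start training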
Getting started
- Go to the management console and log in to Yandex Cloud or sign up if not signed up yet.
- Go to Yandex Cloud Billing and make sure you have a billing account linked and its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account yet, create one.
- Open the DataSphere home page.
- Accept the user agreement.
- Select the organization to work with DataSphere in or create a new one.
Create a project
- Open the DataSphere home page.
- In the left-hand panel, select Communities.
- Select the community where you want to create a project.
- On the community page, click Create project.
- In the window that opens, enter a name for the project. You can also add a description as needed.
- Click Create.
Run the project
To run a project, click Open project in JupyterLab.
Set up your environment
Popular packages for data analysis and machine learning are pre-installed and ready for use. You can find a list of these packages here.
To install missing packages, use pip.
To install a package:
- Enter the following command in a notebook cell:
  %pip install <package_name>
  For example, install the seaborn package to visualize statistics:
  %pip install seaborn
  You can use various options the pip install command supports. For examples of using this command, see the pip documentation.
- Run the cell. To do this, click the run icon in the notebook toolbar.
The package installation result will show up under the cell.
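Before installing anything, it can help to check which packages are already available in the notebook environment. The following is a minimal sketch, not part of DataSphere itself: the helper name `missing_packages` and the `required` list are hypothetical, and the check relies only on the Python standard library. It works for top-level import names, which may differ from the name used with pip.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Hypothetical list of modules the notebook needs.
required = ["json", "seaborn"]

to_install = missing_packages(required)
print("Packages to install with %pip install:", to_install)
```

Any name the helper reports can then be installed in a separate cell with `%pip install <package_name>`.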
You can also configure the environment to run your code using Docker images.
Upload data to the project
You can upload small data amounts (up to 100 MB) to your DataSphere project through the JupyterLab interface. For larger amounts of data, we recommend loading from network storage or databases. To handle large data amounts, you can also use datasets.
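To decide which local files fit the JupyterLab upload path, you could sort them by size first. This is a rough sketch under stated assumptions: the helper `split_by_upload_limit` is hypothetical, the `*.csv` pattern is only an illustration, and "100 MB" is read here as 100 * 1024 * 1024 bytes, which the text above does not define precisely.

```python
from pathlib import Path

# Assumption: 100 MB is interpreted as 100 * 1024 * 1024 bytes.
UPLOAD_LIMIT_BYTES = 100 * 1024 * 1024


def split_by_upload_limit(paths, limit=UPLOAD_LIMIT_BYTES):
    """Split files into those within the size limit and those above it."""
    small, large = [], []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # skip directories and missing paths
        if path.stat().st_size <= limit:
            small.append(path)
        else:
            large.append(path)
    return small, large


# Illustrative usage: check CSV files in the current directory.
small, large = split_by_upload_limit(Path(".").glob("*.csv"))
print("Fits the JupyterLab upload:", [p.name for p in small])
print("Consider network storage, a database, or a dataset:", [p.name for p in large])
```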
To upload data to your project via the JupyterLab interface:
- Under File Browser, select the folder for uploading data.
- At the top left, click the upload icon.
- Select the files to upload.
Learn more about project storage here.
DataSphere allows you to upload data from different sources:
- Connecting to S3 using boto3
- Connecting to Google Drive
- Connecting to a ClickHouse® database
- Connecting to a PostgreSQL database
- Connecting to Yandex Disk
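As an illustration of the first option, here is a minimal sketch of downloading one object from S3 with boto3. It is an assumption-laden example rather than the documented DataSphere procedure: the environment variable names (`S3_ENDPOINT_URL`, `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`) and both helper functions are hypothetical, and boto3 must be available in the environment.

```python
import os


def s3_client_kwargs(env=os.environ):
    """Collect boto3 client settings from environment variables (hypothetical names)."""
    return {
        "service_name": "s3",
        # Optional: set this for an S3-compatible storage; None keeps the boto3 default.
        "endpoint_url": env.get("S3_ENDPOINT_URL"),
        "aws_access_key_id": env["S3_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["S3_SECRET_ACCESS_KEY"],
    }


def download_object(bucket, key, destination, env=os.environ):
    """Download one object from a bucket to a local file and return the local path."""
    import boto3  # imported here so the settings helper works without boto3

    client = boto3.client(**s3_client_kwargs(env))
    client.download_file(bucket, key, destination)
    return destination
```

In a notebook cell, a call might look like `download_object("<bucket_name>", "<object_key>", "data.csv")`, with the credentials set in the environment beforehand rather than written into the notebook.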
Start training
To start computations:
- Under File Browser, select the notebook with the Python or Bash code.
- Select and run one or more code cells by selecting Run → Run Selected Cells, or pressing Shift + Enter.
- Wait for the operation to complete.
The result will show up under the cell.