DataSphere Jobs
In DataSphere, you can run jobs: computations executed remotely on DataSphere VMs, outside JupyterLab. These can be Python scripts and projects, Bash scripts, and executable binary files compiled for Linux x86_64.
Jobs are created and run in projects; however, they do not depend on project notebooks or running VMs.
To run a job, set up a Python virtual environment, install the DataSphere CLI with the `pip install datasphere` command, and prepare a configuration file describing all the parameters for running the job. You can also install and configure the Yandex Cloud CLI to use it for authentication in Yandex Cloud.
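For example, a minimal local setup might look like the sketch below. The virtual environment name is arbitrary, and the Yandex Cloud CLI step is optional (the install command shown is the standard one from its installation guide):

```bash
# Create and activate a Python virtual environment, then install the DataSphere CLI
python3 -m venv .venv
source .venv/bin/activate
pip install datasphere

# Optional: install and initialize the Yandex Cloud CLI for authentication in Yandex Cloud
curl -sSL https://storage.yandexcloud.net/yandexcloud-yc/install.sh | bash
yc init
```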
DataSphere Jobs stores cached input data, environments, logs, and job execution results. You can reuse this data and share it across jobs within a single project. The size of stored data is limited. For more information about DataSphere limits, see Quotas and limits in DataSphere.
You can find jobs in the DataSphere Jobs tab of a project. Their progress and results will be available under Run history.
Job configuration file
When creating a job, specify its parameters in the `config.yaml` file: the computing resource configuration to be used for job execution and the required input data files. Based on the settings in the configuration file, DataSphere analyzes the job, identifies its dependencies, deploys the environment on the VM, and runs the job code. Job execution results are saved to the DataSphere project files listed in the job configuration.
# Job name
name: simple-python-script
# Job description
desc: Program description
# Entry point parameters to run computations
cmd: >  # multi-line YAML string
  python src/main.py
    --params ${PARAMS}
    --features ${<connector_ID>}/features.tsv
    --validate ${CIFAR}/val.json
    --normalizer ${DS_PROJECT_HOME}/misc/norm.bin
    --model ${MODEL}
    --epochs 5
# Files with input data
inputs:
  - misc/logging.yaml  # File path relative to the job run directory on the local computer
  - /usr/share/params.json:  # Absolute file path on the local computer, saved to the PARAMS variable
      var: PARAMS
# Files with results
outputs:
  - data/model.bin: MODEL  # Relative file path, saved to the MODEL variable
  - other/metrics.png  # Relative file path
# Resources required for running the job must be available in the project
s3-mounts:  # S3 connectors
  - <connector_ID>  # S3 connector ID
  # Since no variable is set, the connector can be accessed by its ID
datasets:
  - <dataset_ID>:  # ID of a dataset available in the project
      var: CIFAR  # CIFAR is the variable used to access the dataset
# Environment parameters
env:
  vars:  # Environment variables
    - DEVICE_COUNT: 8  # An environment variable can be specified explicitly
    - PYTHONBUFFERED  # If no value is specified, it will be taken from the current environment
  docker: <Docker_image_ID>  # Docker image available in the DataSphere project
  # You can also specify a Docker image from an image registry:
  # docker:
  #   image: <image_path_in_registry>:<tag>  # E.g., cr.yandex/crtabcdef12345678900/myenv:0.1
  #   # For Docker Hub, it is enough to specify `<name>:<tag>`, e.g., `ubuntu:focal`
  #   username: <username>
  #   password: <password>  # In plain text or as a DataSphere secret name
  #   # password:
  #   #   secret-id: PASSWORD
  # Environment dependency build method
  python: auto  # Fully automated environment build
  # python:  # Environment parameters are set manually; unspecified values are determined from the current environment automatically
  #   type: manual
  #   version: 3.10.5  # Python version
  #   requirements-file: requirements.txt  # File with the list of dependencies
# Job run flags
flags:
  - attach-project-disk  # Mount the project storage
# Computing resource configuration for running the job
cloud-instance-type: g2.1
# Working directory configuration
working-storage:
  type: SSD  # Type of the disk being used. Optional, SSD by default. Possible values: SSD
  size: 150Gb  # Size of the working directory, from 100 GB to 10 TB
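Once the configuration file is ready, you start the job with the DataSphere CLI from the directory containing `config.yaml`. A minimal sketch, assuming the `datasphere project job execute` command with flags for the project ID and config path (the project ID is a placeholder):

```bash
# Run the job described in config.yaml in the specified DataSphere project
datasphere project job execute -p <project_ID> -c config.yaml
```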
The job `config.yaml` file contains multiple sections.
- The `name` and `desc` sections specify the job name and description. The job name must be unique within the project. A description is optional and helps find the required job in the project.
- The `cmd` section specifies the job entry point. Specify how to run your code and set all the run parameters to be provided to the executable file or script. To set paths to the required files, you can use the variables and IDs defined in the sections below.
- The `inputs` section specifies files with input data and other information to be provided from your local computer for running the job. You can set a file path relative to the job directory or save the relative or absolute file path to a variable to use it in other sections of the configuration file.
- The `outputs` section specifies the files to save the computation results to. Once the job is executed, these files will appear on your local computer. The rules for setting file paths are the same as in the `inputs` section.
- DataSphere resources used in jobs are specified in the `s3-mounts` (for S3 connectors) and `datasets` (for datasets) sections. To use an S3 connector or a dataset in a job, specify the ID of an available project resource and, optionally, define a variable for it. If no variable is set, the resource can be accessed in the `cmd` section by its ID.

  You can also use the project storage in your job. To do this, set the `attach-project-disk` flag in the `flags` section. The project storage will be mounted to the VM the job is running on as an external disk for data reads. The storage path will be available in the `DS_PROJECT_HOME` environment variable.
- The `env` section defines environment parameters for running the job: the VM environment build method, environment variables, and, optionally, a Docker image that is built in DataSphere or stored in a different image registry. You can specify the registry authorization credentials if needed.

  There are two ways to set an environment for Python projects:
  - Allow DataSphere to automatically identify the required dependencies, analyze the current environment on your local computer, and build and transfer the environment on its own. To enable this, set `python: auto` in the `env` section.
  - Specify the Python interpreter version and the libraries you use yourself, either right in the configuration file or in a separate `requirements.txt` file (see the sketch after this list). If you explicitly set at least one parameter, the missing parameter values will be fetched from the current environment automatically.

  Warning

  If you use auto mode for setting the environment or only specify some dependencies, run your job from the Python virtual environment where all the relevant packages and local modules are installed. This will enable the `datasphere` library to automatically identify the job runtime environment parameters.

  The environment variables are specified in the `vars` section under `env`. Project secrets will also be added to the environment variables when running the job.

  Note

  By default, DataSphere uses the conda package manager with pre-installed Python 3.10 to run jobs. To reduce environment migration time, use the same Python version for your jobs.
- The `cloud-instance-type` section defines the type of the computing resource configuration to run the job on.
- The `working-storage` section defines the working directory parameters. By default, the working directory is created on the system disk and is about 20 GB in size. If you need more space to complete the job, you can create an additional working directory from 100 GB to 10 TB in size. DataSphere will attach it to the VM as a secondary disk and remove it once the job is completed.

  Extra storage is charged in accordance with the data storage pricing policy.
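For the manual environment option mentioned above, a minimal sketch of preparing a `requirements.txt` file while keeping the Python version aligned with the pre-installed Python 3.10 might look like this (the package names are placeholders for your job's actual dependencies):

```bash
# Create a Python 3.10 virtual environment to match the interpreter pre-installed on the job VM
python3.10 -m venv .venv
source .venv/bin/activate

# Install the packages the job needs, then record them in requirements.txt,
# which the config references via requirements-file: requirements.txt
pip install numpy pandas
pip freeze > requirements.txt
```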
See also
- Running jobs in DataSphere Jobs
- GitHub repository with job run examples