Rerunning jobs in DataSphere Jobs
In Yandex DataSphere, you can rerun a job with the required parameters redefined. A rerun creates a job fork, and the original job becomes its parent. A fork can itself be rerun, making it both a fork of one job and the parent of another.
To regularly run the same job with some of its parameters redefined, you can use DataSphere Jobs along with Yandex Managed Service for Apache Airflow™.
To rerun a job with new parameters, use the fork command available in the DataSphere CLI and in the DataSphere Jobs SDK. It lets you redefine the following parameters (see the SDK sketch after this list):
name
: Job name.

desc
: Job description.

args
: Job arguments.

vars
: Input and output data files.

env_vars
: Environment variables.

working_storage
: Extended working directory configuration.

cloud_instance_types
: Computing resource configuration.
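For instance, here is a minimal sketch of redefining several of these parameters in one SDK call. It assumes fork_job accepts keyword arguments matching the parameter names above; the job ID, fork name, and environment variable values are placeholders:

from datasphere import SDK

sdk = SDK()

# Fork the parent job, redefining its name, description, and environment
# variables; the assumption is that parameters you do not redefine are
# inherited from the parent job.
sdk.fork_job(
    '<job_ID>',
    name='simple-bash-script-fork',
    desc='Fork with an extra environment variable',
    env_vars={'LOG_LEVEL': 'DEBUG'},
)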
Example
Let's take a look at the config.yaml job configuration file for code that runs a substring search (grep) in the input file:
name: simple-bash-script
desc: Find text pattern in input file with grep
cmd: grep -C ${RANGE} ${OPTIONS} -f ${PATTERN} ${INPUT} > ${OUTPUT}
args:
  RANGE: 0
  OPTIONS: "-h -r"
inputs:
  - pattern.txt: PATTERN
  - input.txt: INPUT
outputs:
  - output.txt: OUTPUT
Where:

RANGE
: Search output interval.

OPTIONS
: Additional flags of the grep command.

PATTERN
: Substring pattern file.

INPUT
: Input data file.

OUTPUT
: Output data file.
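You can then launch this job with the execute command in the DataSphere CLI; a typical invocation (the project ID is a placeholder) looks like this:

datasphere project job execute -p <project_ID> -c config.yaml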
After you run a job, you can get its ID from the CLI logs of the execute command or from the DataSphere Jobs tab on the project page in your browser. To rerun this job with the SDK fork command, specify its ID and redefine the parameters you need. For example, set a new search output interval and a new input data file:
from datasphere import SDK
sdk = SDK()
sdk.fork_job(
    '<job_ID>',
    args={'RANGE': '1'},
    vars={'INPUT': 'new_input.txt'},
)
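As noted above, a fork can itself be rerun. Here is a hedged sketch of chaining forks; it assumes fork_job returns a job handle with a wait() method that blocks until the run completes:

from datasphere import SDK

sdk = SDK()

# First fork: rerun the parent job with a new input file.
job = sdk.fork_job('<job_ID>', vars={'INPUT': 'new_input.txt'})
job.wait()  # assumed to block until the fork finishes

# The completed fork is now a parent in its own right: take its ID from the
# CLI logs or the DataSphere Jobs tab and fork it again.
sdk.fork_job('<fork_job_ID>', args={'RANGE': '2'})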
Job data lifetime
By default, job data is retained for 14 days. Once the data is deleted, you will no longer be able to rerun the job. You can change the job data lifetime by running the command below:
datasphere project job set-data-ttl --id <job_ID> --days <number_of_days>
Where:
--id
: Job ID.
--days
: Number of days after which the job data will be deleted.
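For example, to keep the job data for 30 days:

datasphere project job set-data-ttl --id <job_ID> --days 30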