Rerunning jobs in DataSphere Jobs
In Yandex DataSphere, you can rerun a job with the required parameters redefined. A rerun creates a job fork, and the original job becomes its parent. A fork can itself be rerun, making it both a fork of one job and the parent of another.
To regularly run the same job with some of its parameters redefined, you can use DataSphere Jobs along with Yandex Managed Service for Apache Airflow™.
To rerun a job with new parameters, use the fork command available in the DataSphere CLI and in the DataSphere Jobs SDK. It lets you redefine the following parameters (see the SDK sketch after this list):
name
: Job name.

desc
: Job description.

args
: Job arguments.

vars
: Input and output data files.

env_vars
: Environment variables.

working_storage
: Extended working directory configuration.

cloud_instance_types
: Computing resource configuration.
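For instance, here is a minimal sketch of redefining several of these parameters in one SDK call. It assumes fork_job accepts keyword arguments matching the parameter names above; the job ID, fork name, and environment variable values are placeholders:

from datasphere import SDK

sdk = SDK()

# Fork the parent job, redefining its name, description, and environment
# variables; the assumption is that parameters you do not redefine are
# inherited from the parent job.
sdk.fork_job(
    '<job_ID>',
    name='simple-bash-script-fork',
    desc='Fork with an extra environment variable',
    env_vars={'LOG_LEVEL': 'DEBUG'},
)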
Example
Let's take a look at the config.yaml job configuration file for code that runs a substring search (grep) in the input file:
name: simple-bash-script
desc: Find text pattern in input file with grep
cmd: grep -C ${RANGE} ${OPTIONS} -f ${PATTERN} ${INPUT} > ${OUTPUT}
args:
  RANGE: 0
  OPTIONS: "-h -r"
inputs:
  - pattern.txt: PATTERN
  - input.txt: INPUT
outputs:
  - output.txt: OUTPUT
Where:

RANGE
: Search output interval.

OPTIONS
: Additional flags of the grep command.

PATTERN
: Substring pattern file.

INPUT
: Input data file.

OUTPUT
: Output data file.
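You can then launch this job with the execute command in the DataSphere CLI; a typical invocation (the project ID is a placeholder) looks like this:

datasphere project job execute -p <project_ID> -c config.yaml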
After you run a job, you can get its ID from the CLI logs of the execute command or from the DataSphere Jobs tab on the project page in your browser. To rerun this job with the SDK fork command, specify its ID and redefine the parameters you need. For example, set a new search output interval and a new input data file:
from datasphere import SDK
sdk = SDK()
sdk.fork_job(
    '<job_ID>',
    args={'RANGE': '1'},
    vars={'INPUT': 'new_input.txt'},
)
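As noted above, a fork can itself be rerun. Here is a hedged sketch of chaining forks; it assumes fork_job returns a job handle with a wait() method that blocks until the run completes:

from datasphere import SDK

sdk = SDK()

# First fork: rerun the parent job with a new input file.
job = sdk.fork_job('<job_ID>', vars={'INPUT': 'new_input.txt'})
job.wait()  # assumed to block until the fork finishes

# The completed fork is now a parent in its own right: take its ID from the
# CLI logs or the DataSphere Jobs tab and fork it again.
sdk.fork_job('<fork_job_ID>', args={'RANGE': '2'})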
Job data lifetime
By default, job data is retained for 14 days. Once the data is deleted, you will no longer be able to rerun the job. You can change the job data lifetime by running the command below:
datasphere project job set-data-ttl --id <job_ID> --days <number_of_days>
Where:
--id
: Job ID.
--days
: Number of days after which the job data will be deleted.
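For example, to keep the job data for 30 days:

datasphere project job set-data-ttl --id <job_ID> --days 30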