Rerunning jobs in DataSphere Jobs
In Yandex DataSphere, you can rerun a job with the required parameters redefined. A rerun creates a fork of the job, and the original job becomes its parent. A fork can itself be rerun, in which case the new job is both a fork of one job and the parent of another.
To regularly run the same job with some of its parameters redefined, you can use DataSphere Jobs together with Yandex Managed Service for Apache Airflow™.
To rerun a job with new parameters, use the fork
command, available in the DataSphere CLI and the DataSphere Jobs SDK. It lets you redefine the following parameters:
name
: Job name

desc
: Job description

args
: Job arguments

vars
: Input and output data files

env_vars
: Environment variables

working_storage
: Extended working directory configuration

cloud_instance_types
: Computing resource configuration
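A fork only needs the parameters you want to change; everything else is inherited from the parent job. The sketch below illustrates that override semantics in plain Python (it is not the SDK itself, and the dictionaries and values are made up for illustration):

```python
# Illustrative sketch of fork semantics (not the SDK itself):
# a fork starts from the parent's parameters and applies only your overrides.
parent_job = {
    "args": {"RANGE": "0", "OPTIONS": "-h -r"},             # hypothetical parent arguments
    "vars": {"INPUT": "input.txt", "OUTPUT": "output.txt"}, # hypothetical parent files
}
overrides = {"args": {"RANGE": "1"}}  # the only parameter being redefined

forked_job = {
    section: {**values, **overrides.get(section, {})}
    for section, values in parent_job.items()
}
print(forked_job["args"])  # OPTIONS is inherited, RANGE is redefined
print(forked_job["vars"])  # untouched sections are inherited as-is
```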
Example
Let's take a look at the config.yaml job configuration file for code that runs a substring search (grep) in the input file:
name: simple-bash-script
desc: Find text pattern in input file with grep
cmd: grep -C ${RANGE} ${OPTIONS} -f ${PATTERN} ${INPUT} > ${OUTPUT}
args:
  RANGE: 0
  OPTIONS: "-h -r"
inputs:
  - pattern.txt: PATTERN
outputs:
  - output.txt: OUTPUT
inputs:
  - input.txt: INPUT
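The ${...} placeholders in cmd are filled in from args, inputs, and outputs before the command runs. As a rough local illustration of that substitution (string.Template is just a stand-in here, not how DataSphere actually resolves the command):

```python
from string import Template

cmd = "grep -C ${RANGE} ${OPTIONS} -f ${PATTERN} ${INPUT} > ${OUTPUT}"

# Values collected from args, inputs, and outputs in config.yaml
values = {
    "RANGE": "0",
    "OPTIONS": "-h -r",
    "PATTERN": "pattern.txt",
    "INPUT": "input.txt",
    "OUTPUT": "output.txt",
}

resolved = Template(cmd).substitute(values)
print(resolved)
# grep -C 0 -h -r -f pattern.txt input.txt > output.txt
```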
Where:
RANGE
: Search output interval

OPTIONS
: Additional flags of the grep command

PATTERN
: Substring pattern file

INPUT
: Input data file

OUTPUT
: Output data file
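Before submitting the job, you can sanity-check pattern.txt against input.txt locally. The snippet below is a rough pure-Python stand-in for grep -f with fixed-string matching only (it ignores the context interval and the grep flags), and the sample file contents are invented for illustration:

```python
from pathlib import Path

# Create small sample files (in a real run these already exist)
Path("pattern.txt").write_text("error\nwarning\n")
Path("input.txt").write_text(
    "ok line\nerror: disk full\nall good\nwarning: low memory\n"
)

# Keep every input line that contains any of the patterns (like grep -f)
patterns = Path("pattern.txt").read_text().splitlines()
matches = [
    line for line in Path("input.txt").read_text().splitlines()
    if any(p in line for p in patterns)
]
Path("output.txt").write_text("\n".join(matches) + "\n")
print(matches)
# ['error: disk full', 'warning: low memory']
```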
After you run a job with the execute
command, you can get its ID from the CLI logs or on the DataSphere Jobs tab of the project page in your browser. To rerun the job using the SDK fork
command, specify its ID and redefine the parameters you need. For example, set a new search output interval and a new input data file:
from datasphere import SDK

sdk = SDK()

sdk.fork_job(
    '<job_ID>',  # ID of the parent job
    args={'RANGE': '1'},
    vars={'INPUT': 'new_input.txt'},
)