Initialization scripts
When creating a cluster, you can specify host initialization scripts. This is useful for automatically installing or updating the software you need to run jobs. Each script runs as the root superuser only once, when the host starts for the first time.
In the first line of the script file, specify the full path to the interpreter, e.g., #!/bin/sh or #!/usr/bin/python.
You can specify the script URI as https://, http://, hdfs://, or s3a://. For s3a://, at least one of the following conditions must be met:
- The bucket ACL must allow a cluster service account to perform read operations.
- The cluster's service account must have the storage.viewer role.
- The access to the bucket must be public.
Warning
Initialization scripts run every time a node is created. If they depend on external resources, such as public repositories, this may have the following consequences:
- Slower cluster creation or scaling.
- Failure to create a cluster or add a node if the external service is temporarily unavailable.
- Disruption of automatic node recovery.
Yandex Data Processing images contain links to public repositories that are not part of Yandex Data Processing. Their unavailability may disrupt cluster operation. Therefore, we recommend minimizing the number of operations in your initialization scripts.
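If a script cannot avoid an external resource, one way to soften the effect of a temporary outage is to retry the call a limited number of times. The sketch below is not part of Yandex Data Processing: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary values you would tune yourself. Retrying does not remove the dependency, and it can make node creation slower.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not a service feature): retry a command
# a limited number of times before giving up.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  # Run the command until it succeeds or the attempt limit is reached.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example call inside an initialization script:
#   retry 5 10 apt-get update
```

Keeping the attempt limit low means that a long outage still fails quickly instead of stalling cluster creation.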
Environment variables
You can use these environment variables in your initialization scripts:
- CLUSTER_ID: Cluster ID.
- S3_BUCKET: Name of the linked Yandex Object Storage bucket.
- ROLE: Host role: masternode, computenode, or datanode.
- CLUSTER_SERVICES: List of components.
- MAX_WORKER_COUNT: Maximum number of hosts in data storage and processing subclusters.
- MIN_WORKER_COUNT: Minimum number of hosts in data storage and processing subclusters.
For example, to run a part of a script only on the master host (masternode), check the value of the ROLE environment variable:
if [[ "${ROLE}" == "masternode" ]]; then
...
fi
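When a script needs different steps for each of the three ROLE values, a case statement may be easier to read than several if blocks. This is a minimal sketch: `describe_role_setup` is a hypothetical function name, and the echo lines are placeholders for your own commands.

```shell
#!/bin/bash
# Sketch: choose per-role steps from the ROLE value.
# describe_role_setup is a hypothetical helper; replace each echo with real commands.
describe_role_setup() {
  case "${1:-}" in
    masternode)  echo "master: install client tools" ;;
    datanode)    echo "data: prepare storage" ;;
    computenode) echo "compute: install job dependencies" ;;
    *)           echo "unknown role: '${1:-}'" >&2; return 1 ;;
  esac
}

# ROLE is set on cluster hosts; default to an empty value so the script
# also runs (and skips the role-specific part) when tested elsewhere.
describe_role_setup "${ROLE:-}" || echo "Skipping role-specific setup" >&2
```

Handling the unknown case explicitly keeps the script from silently doing nothing if the variable is missing.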
Script initialization errors
If the script fails and the cluster switches to DEAD:
- View logs in Yandex Cloud Logging or on cluster hosts in the /var/log/yandex/dataproc-init-actions.log file.
- Correct the error.
- Delete this cluster and create a new one.
If the initialization script returns an error on an existing cluster (such as when adding a subcluster) and recreating the cluster disrupts your workflow, you can fix the error manually:
- Connect to the problematic host and perform the steps required to resolve the issue.
- Run the script that marks the initialization script execution results as successful:
  sudo /opt/yandex/complete_init_action.py
- Check the initialization script results in the /home/dataproc-agent/dataproc-init-acts/states.json file on the master host.
Syntax errors
To check a script for syntax errors, download the script file manually and run it:
- Connect to the cluster host.
- Download the script file from the storage via the link used when creating the cluster. Here is an example:
  wget <HTTP_link_to_script_file>
- Run the script.
If any error occurs during the script run, you will see an error message in the console.
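For scripts written in bash, you can also look for syntax errors without running any commands by using the shell's `-n` option, which reads the script but does not execute it. The sketch below assumes a bash script; `check_syntax` is a hypothetical wrapper name, and the check does not apply to scripts that use another interpreter, such as Python.

```shell
#!/bin/bash
# Sketch: parse a bash script without executing it.
# check_syntax is a hypothetical helper; bash -n only reads the commands.
check_syntax() {
  local script="$1"
  if bash -n "$script"; then
    echo "OK: no syntax errors found in ${script}"
  else
    echo "Syntax errors found in ${script}" >&2
    return 1
  fi
}

# Example call:
#   check_syntax <script_file_name>
```

This only catches parse errors; commands that fail at run time still need the manual run described above.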
For instance, an error may occur because of incompatible file formats. Scripts run in a Linux (Ubuntu) environment, so scripts created in Windows may terminate with one of the following errors:
- ^M: bad interpreter
- FileNotFoundError: [Errno 2] No such file or directory: '<executable_file_name>'
These errors occur because Windows uses CR/LF line breaks, while Linux uses LF. To fix the error, convert the line breaks with one of these commands:
- Linux:
  sed -i -e 's/\r$//' <script_file_name>
- Windows (PowerShell):
  $file = "<script_file_name>"; $text = [IO.File]::ReadAllText($file) -replace "`r`n", "`n"; [IO.File]::WriteAllText($file, $text)
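Before uploading a script, it may help to check whether it contains CR/LF line breaks at all. The following is a minimal sketch for a Linux shell: `has_crlf` is a hypothetical helper name, and it relies on bash `$'...'` quoting to pass a carriage return character to grep.

```shell
#!/bin/bash
# Sketch: report whether a file has lines ending in a carriage return (CR/LF).
# has_crlf is a hypothetical helper; it returns 0 if such a line is found.
has_crlf() {
  grep -q $'\r$' "$1"
}

# Example call:
#   if has_crlf <script_file_name>; then echo "Convert line breaks first"; fi
```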