Initialization scripts
When creating a cluster, you can specify host initialization scripts. These are useful for automatically installing or updating the software you need to run jobs. Each script runs as the root superuser and only once, when the host starts for the first time.
In the first line of the script file, specify the full path to the interpreter, e.g., #!/bin/sh or #!/usr/bin/python.
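Before uploading a script, you may want to confirm that its first line really names an interpreter by absolute path. The sketch below is one way to do that; has_absolute_shebang is a hypothetical helper name, not something the service provides.

```shell
# Hypothetical helper (not part of the service): succeeds if the first line
# of the file given as $1 is an interpreter line with an absolute path.
has_absolute_shebang() {
    # Read only the first line of the file.
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        '#!/'*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

For example, has_absolute_shebang init.sh || echo "init.sh: no interpreter line" >&2 prints a warning for a file that starts with anything else. The file name init.sh is only an illustration.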
You can specify the script URI as https://, http://, hdfs://, or s3a://. For s3a://, at least one of the following conditions must be met:
- The bucket ACL must allow the cluster service account to perform read operations.
- The cluster's service account must have the storage.viewer role.
- The access to the bucket must be public.
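A quick local check of the URI scheme can catch typos such as s3:// before the cluster is created. This is a minimal sketch; is_supported_script_uri is a hypothetical name, and the check covers only the scheme prefix, not bucket permissions or whether the file exists.

```shell
# Hypothetical helper: succeeds if $1 starts with one of the four schemes
# listed above and has at least one character after the scheme.
is_supported_script_uri() {
    case "$1" in
        https://?*|http://?*|hdfs://?*|s3a://?*) return 0 ;;
        *) return 1 ;;
    esac
}
```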
Environment variables
You can use these environment variables in your initialization scripts:
- CLUSTER_ID: Cluster ID.
- S3_BUCKET: Name of the linked Yandex Object Storage bucket.
- ROLE: Host role (masternode, computenode, or datanode).
- CLUSTER_SERVICES: List of components.
- MAX_WORKER_COUNT: Maximum number of hosts in data storage and processing subclusters.
- MIN_WORKER_COUNT: Minimum number of hosts in data storage and processing subclusters.
For example, to run a part of a script only on the master host (masternode), check the value of the ROLE environment variable:
if [[ "${ROLE}" == "masternode" ]]; then
...
fi
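When a script needs different steps for each role, a case statement may read more clearly than a chain of if blocks. The sketch below assumes only the three ROLE values listed above; the function name role_action and the echoed messages are placeholders for your own steps.

```shell
# Sketch: choose steps by host role. ROLE is expected to be set on cluster
# hosts; the "unknown" fallback is an assumption so the function also
# behaves predictably when run somewhere ROLE is not set.
role_action() {
    case "${ROLE:-unknown}" in
        masternode)
            echo "master-only steps" ;;
        datanode|computenode)
            echo "worker steps" ;;
        *)
            echo "unrecognized role: ${ROLE:-unset}" >&2
            return 1 ;;
    esac
}
```

The case form also works in scripts that start with #!/bin/sh, whereas the [[ ... ]] test shown above is a Bash construct and needs a Bash interpreter line.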
Script initialization errors
If the script fails and the cluster switches to DEAD:
- View logs in Yandex Cloud Logging or on cluster hosts in the /var/log/yandex/dataproc-init-actions.log file.
- Correct the error.
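On a host, a short filter can narrow the log down to the lines most likely to matter. This is only a sketch: the log format is not described here, so the search pattern (error or fail, case-insensitive) is a guess, and show_init_errors is a hypothetical helper name.

```shell
# Hypothetical helper: print the last matching lines from the init log.
# $1 is the log path (defaults to the path named above),
# $2 is the number of lines to show (defaults to 20).
show_init_errors() {
    log=${1:-/var/log/yandex/dataproc-init-actions.log}
    count=${2:-20}
    # -n prefixes each match with its line number in the log.
    # The pattern is an assumption, not a documented log format.
    grep -i -n -E 'error|fail' "$log" | tail -n "$count"
}
```

If the filter prints nothing, read the whole file: the failure may be reported in wording the pattern does not match.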
If the initialization script returns an error on an existing cluster (such as when adding a subcluster) and recreating the cluster disrupts your workflow, you can fix the error manually:
- Connect to the problematic host and perform the steps required to resolve the issue.
- Run the script that marks the initialization script execution results as successful:
  sudo /opt/yandex/complete_init_action.py
- Check the initialization script results in the /home/dataproc-agent/dataproc-init-acts/states.json file on the master host.
Syntax errors
To check a script for syntax errors, download the script file manually and run it:
- Connect to the cluster host.
- Download the script file from the storage via the link used when creating the cluster. For example:
  wget <HTTP_link_to_script_file>
- Run the script.
If any error occurs during the script run, you will see an error message in the console.
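For shell scripts, you can also parse the file without executing it, which avoids repeating side effects such as package installation. The sketch below relies on the -n option of sh and bash (read commands but do not execute them). The name check_shell_syntax is hypothetical, and this step is an addition to the procedure above rather than a part of it.

```shell
# Parse a shell script without running it.
# $1 is the script path; $2 is the interpreter to parse with (default: sh).
# Pass the interpreter named in the script's first line, e.g. bash.
check_shell_syntax() {
    script=$1
    interpreter=${2:-sh}
    # -n: read and parse commands, but do not execute them.
    "$interpreter" -n "$script"
}
```

A zero exit status means the file parsed cleanly. This catches parse errors only; problems such as missing commands or packages appear only when the script actually runs.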
For instance, an error may occur because of incompatible formats. Because the script runtime environment is Linux (Ubuntu), scripts created in Windows may terminate with one of the following errors:
- ^M: bad interpreter
- FileNotFoundError: [Errno 2] No such file or directory: '<executable_file_name>'
These errors are caused by line endings: Windows uses CR/LF as the line break, whereas Linux uses LF. To fix the error, run one of these commands.
In a Linux shell:
sed -i -e 's/\r$//' <script_file_name>
In Windows PowerShell:
$file = "<script_file_name>"; $text = [IO.File]::ReadAllText($file) -replace "`r`n", "`n"; [IO.File]::WriteAllText($file, $text)
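To find out whether a file is affected before changing it, you can test for carriage returns first. This is a minimal sketch assuming GNU grep and GNU sed, as found on Ubuntu; has_crlf and fix_line_endings are hypothetical helper names.

```shell
# Succeeds if any line of the file given as $1 contains a carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Strip a trailing CR from every line, in place, only when the file needs it.
# Uses the same sed expression as the command above.
fix_line_endings() {
    if has_crlf "$1"; then
        sed -i -e 's/\r$//' "$1"
    fi
}
```

Calling fix_line_endings on a file that already uses LF leaves it untouched, so it is safe to run on every script before upload.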