Running jobs from remote hosts that are not part of the Yandex Data Processing cluster

Updated on April 23, 2025

Getting started
Running a job

This guide describes how to use the spark-submit utility to run Spark jobs in a Yandex Data Processing cluster from hosts that are not part of that cluster.

Note

You can also run jobs in the Yandex Data Processing cluster from Yandex DataSphere. For more information, see this concept.

Getting started

Create and configure a host to run jobs remotely on the Yandex Data Processing cluster:

Image version 1.4

Image version 2.0

For image version 1.4:

Create a VM with Ubuntu 16.04 LTS.
To give the VM you created network access to the Yandex Data Processing cluster, set up security groups for the cluster.
Connect to the VM over SSH:
```
ssh -A <username>@<VM_FQDN>
```
Copy the repository settings from any host of the Yandex Data Processing cluster. To do this, run the following commands on the VM you created.
1. Copy the repository address:
```
ssh root@<cluster_host_FQDN> \
"cat /etc/apt/sources.list.d/yandex-dataproc.list" | \
sudo tee /etc/apt/sources.list.d/yandex-dataproc.list
```
2. Copy the GPG key to verify Debian package signatures:
```
ssh root@<cluster_host_FQDN> \
"cat /srv/dataproc.gpg" | sudo apt-key add -
```
3. Update the repository cache:
```
sudo apt update
```

Install the required packages:

sudo apt install openjdk-8-jre-headless hadoop-client hadoop-hdfs spark-core spark-python

Note

You only need the spark-python package to run PySpark jobs.

Copy the Hadoop and Spark configuration files:

sudo -E scp -r \
    root@<cluster_host_FQDN>:/etc/hadoop/conf/* \
    /etc/hadoop/conf/ && \
sudo -E scp -r \
    root@<cluster_host_FQDN>:/etc/spark/conf/* \
    /etc/spark/conf/

Create a user named sparkuser to run jobs:

sudo useradd sparkuser && \
ssh root@<cluster_host_FQDN> "
  hadoop fs -mkdir /user/sparkuser
  sudo -u hdfs hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
  sudo -u hdfs hdfs dfs -ls /user/sparkuser
"

For image version 2.0:

Create a VM with Ubuntu 20.04 LTS.
To give the VM you created network access to the Yandex Data Processing cluster, set up security groups for the cluster.
Connect to the VM over SSH:
```
ssh -A <username>@<VM_FQDN>
```
Copy the repository settings from any host of the Yandex Data Processing cluster. To do this, run the following commands on the VM you created.
1. Copy the repository address:
```
ssh ubuntu@<cluster_host_FQDN> \
"cat /etc/apt/sources.list.d/yandex-dataproc.list" | \
sudo tee /etc/apt/sources.list.d/yandex-dataproc.list
```
2. Copy the GPG key to verify Debian package signatures:
```
ssh ubuntu@<cluster_host_FQDN> \
"cat /srv/dataproc.gpg" | sudo apt-key add -
```
3. Update the repository cache:
```
sudo apt update
```

Install the required packages:

sudo apt install openjdk-8-jre-headless hadoop-client hadoop-hdfs spark-core spark-python

Note

You only need the spark-python package to run PySpark jobs.

Copy the Hadoop and Spark configuration files:

sudo -E scp -r \
    ubuntu@<cluster_host_FQDN>:/etc/hadoop/conf/* \
    /etc/hadoop/conf/ && \
sudo -E scp -r \
    ubuntu@<cluster_host_FQDN>:/etc/spark/conf/* \
    /etc/spark/conf/

Create a user named sparkuser to run jobs:

sudo useradd sparkuser && \
ssh ubuntu@<cluster_host_FQDN> "
  hadoop fs -mkdir /user/sparkuser
  sudo -u hdfs hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
  sudo -u hdfs hdfs dfs -ls /user/sparkuser
"

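Before submitting a job, it can help to confirm that the VM has the configuration files and tools the steps above are meant to provide. The following is a minimal sketch, not part of the official procedure. The file names it checks (core-site.xml, hdfs-site.xml, yarn-site.xml, spark-defaults.conf) are the standard Hadoop and Spark client configuration names and are an assumption about what your cluster ships. The function name missing_prerequisites is hypothetical.

```python
import os
import posixpath
import shutil

# Assumption: standard Hadoop/Spark client config file names; the exact set
# copied from your cluster may differ.
EXPECTED_FILES = {
    "/etc/hadoop/conf": ["core-site.xml", "hdfs-site.xml", "yarn-site.xml"],
    "/etc/spark/conf": ["spark-defaults.conf"],
}
# Utilities this guide runs directly on the VM.
EXPECTED_TOOLS = ["spark-submit", "yarn"]


def missing_prerequisites(expected_files=EXPECTED_FILES,
                          expected_tools=EXPECTED_TOOLS,
                          is_file=os.path.isfile,
                          which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for directory, names in expected_files.items():
        for name in names:
            path = posixpath.join(directory, name)
            if not is_file(path):
                problems.append("missing config file: " + path)
    for tool in expected_tools:
        if which(tool) is None:
            problems.append("tool not on PATH: " + tool)
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

Run on the VM, the script prints nothing if every expected file and tool is found. The is_file and which parameters exist only so the check can be exercised without touching the real file system.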
Running a job

Spark job

PySpark job

To run a Spark job, use the command:

sudo -u sparkuser spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --class org.apache.spark.examples.SparkPi \
         /usr/lib/spark/examples/jars/spark-examples.jar 1000

Result:

20/04/19 16:43:58 INFO client.RMProxy: Connecting to ResourceManager at rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/10.13.13.18:8032
20/04/19 16:43:58 INFO client.AHSProxy: Connecting to Application History server at rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/10.13.13.18:10200
20/04/19 16:43:58 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
...
20/04/19 16:43:58 INFO yarn.Client: Preparing resources for our AM container
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/usr/lib/spark/examples/jars/spark-examples.jar -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/spark-examples.jar
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/hive-site.xml
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/tmp/spark-6dff3163-089b-4634-8f74-c8301d424567/__spark_conf__8717606866210190000.zip -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/__spark_conf__.zip
20/04/19 16:44:00 INFO yarn.Client: Submitting application application_1586176069782_0003 to ResourceManager
20/04/19 16:44:00 INFO impl.YarnClientImpl: Submitted application application_1586176069782_0003
20/04/19 16:44:01 INFO yarn.Client: Application report for application_1586176069782_0003 (state: ACCEPTED)
20/04/19 16:44:01 INFO yarn.Client:
   client token: N/A
   diagnostics: AM container is launched, waiting for AM container to Register with RM
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: default
   start time: 1587314639386
   final status: UNDEFINED
   tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
   user: sparkuser
20/04/19 16:44:05 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:05 INFO yarn.Client:
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: rc1b-dataproc-d-9cd9yoenm4npsznt.mdb.yandexcloud.net
   ApplicationMaster RPC port: 41648
   queue: default
   start time: 1587314639386
   final status: UNDEFINED
   tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
   user: sparkuser
20/04/19 16:44:06 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:07 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:08 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:09 INFO yarn.Client: Application report for application_1586176069782_0003 (state: FINISHED)
20/04/19 16:44:09 INFO yarn.Client:
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: rc1b-dataproc-d-9cd9yoenm4npsznt.mdb.yandexcloud.net
   ApplicationMaster RPC port: 41648
   queue: default
   start time: 1587314639386
   final status: SUCCEEDED
   tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
   user: sparkuser
20/04/19 16:44:09 INFO util.ShutdownHookManager: Shutdown hook called
20/04/19 16:44:09 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-6dff3163-089b-4634-8f74-c8301d424567
20/04/19 16:44:09 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-826498b1-8dec-4229-905e-921203b7b1d0
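
The later steps need the application ID from this output, and the last "final status" line tells you how the job ended. As an illustration only, the sketch below extracts both values from captured spark-submit output. The function name summarize_submit_log is hypothetical, and the sample text is a shortened excerpt of the log shown above.

```python
import re

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APP_ID_RE = re.compile(r"\bapplication_\d+_\d+\b")
FINAL_STATUS_RE = re.compile(r"^\s*final status:\s*(\S+)\s*$", re.MULTILINE)


def summarize_submit_log(log_text):
    """Return (application_id, final_status) found in spark-submit output.

    Either element is None when the log does not contain it.
    """
    app_match = APP_ID_RE.search(log_text)
    statuses = FINAL_STATUS_RE.findall(log_text)
    app_id = app_match.group(0) if app_match else None
    # The report is printed repeatedly; the last one reflects the end state.
    final_status = statuses[-1] if statuses else None
    return app_id, final_status


SAMPLE = """\
20/04/19 16:44:00 INFO impl.YarnClientImpl: Submitted application application_1586176069782_0003
20/04/19 16:44:01 INFO yarn.Client:
   final status: UNDEFINED
20/04/19 16:44:09 INFO yarn.Client:
   final status: SUCCEEDED
"""

if __name__ == "__main__":
    print(summarize_submit_log(SAMPLE))
```

With the default logging configuration, spark-submit typically writes these messages to the standard error stream, so capture that stream if you feed real output to the function.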

Check the job execution status using the yarn application utility:

yarn application -status application_1586176069782_0003

Result:

20/04/19 16:47:03 INFO client.RMProxy: Connecting to ResourceManager at rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net/10.13.13.18:8032
20/04/19 16:47:03 INFO client.AHSProxy: Connecting to Application History server at rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net/10.13.13.18:10200
Application Report :
    Application-Id : application_1586176069782_0003
    Application-Name : org.apache.spark.examples.SparkPi
    Application-Type : SPARK
    User : sparkuser
    Queue : default
    Application Priority : 0
    Start-Time : 1587314639386
    Finish-Time : 1587314647621
    Progress : 100%
    State : FINISHED
    Final-State : SUCCEEDED
    Tracking-URL : rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net:18080/history/application_1586176069782_0003/1
    RPC Port : 41648
    AM Host : rc1b-dataproc-d-9cd9yoen********.mdb.yandexcloud.net
    Aggregate Resource Allocation : 141510 MB-seconds, 11 vcore-seconds
    Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
    Log Aggregation Status : SUCCEEDED
    Diagnostics :
    Unmanaged Application : false
    Application Node Label Expression : <Not set>
    AM container Node Label Expression : <DEFAULT_PARTITION>
    TimeoutType : LIFETIME    ExpiryTime : UNLIMITED    RemainingTime : -1seconds
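
The report is a list of "Key : Value" lines, so it is easy to process in a script. The sketch below is one possible way to turn it into a dictionary and compute the run time from Start-Time and Finish-Time, which YARN reports as milliseconds since the epoch. The function names are hypothetical, and the sample is a shortened excerpt of the report above.

```python
def parse_application_report(report_text):
    """Parse 'Key : Value' lines of a YARN application report into a dict.

    Lines without a ' : ' separator (log lines, the report header, empty
    fields) are skipped. A line holding several pairs, such as the
    TimeoutType line, is stored whole under its first key.
    """
    fields = {}
    for line in report_text.splitlines():
        key, separator, value = line.partition(" : ")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def duration_ms(fields):
    """Return the run time in milliseconds from Start-Time and Finish-Time."""
    return int(fields["Finish-Time"]) - int(fields["Start-Time"])


SAMPLE_REPORT = """\
Application Report :
    Application-Id : application_1586176069782_0003
    Application-Type : SPARK
    Start-Time : 1587314639386
    Finish-Time : 1587314647621
    State : FINISHED
    Final-State : SUCCEEDED
"""

if __name__ == "__main__":
    report = parse_application_report(SAMPLE_REPORT)
    print(report["Final-State"], duration_ms(report))
```

For the report shown above, the two timestamps differ by 8235 ms, that is, the job ran for about 8 seconds.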

View the logs from all containers of the job using the yarn logs utility:

sudo -u sparkuser yarn logs \
     -applicationId application_1586176069782_0003 | grep "Pi is"

Result:

Pi is roughly 3.14164599141646

To run a PySpark job, first create a file named month_stat.py on the VM with the following code:

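from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

def main():
    conf = SparkConf().setAppName("Month Stat - Python")
    # Access the sample bucket with anonymous credentials. Hadoop options set
    # through SparkConf need the "spark.hadoop." prefix to reach the Hadoop configuration.
    conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
    sc = SparkContext(conf=conf)

    sql = SQLContext(sc)
    df = sql.read.parquet("s3a://yc-mdb-examples/dataproc/example01/set01")
    # Default file system of the cluster, taken from the Hadoop configuration
    defaultFS = sc._jsc.hadoopConfiguration().get("fs.defaultFS")
    # Count the records per month and write the result as a single CSV part file
    month_stat = df.groupBy("Month").count()
    month_stat.repartition(1).write.format("csv").save(defaultFS+"/tmp/month_stat")

if __name__ == "__main__":
    main()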
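
To make clear what the script computes, here is a plain-Python equivalent of the groupBy("Month").count() step. It runs without Spark and is only an illustration: the rows and the Origin column are hypothetical, and the real dataset's schema is not shown in this guide.

```python
from collections import Counter

# Hypothetical rows: only the "Month" column matters for this aggregation.
rows = [
    {"Month": 1, "Origin": "A"},
    {"Month": 1, "Origin": "B"},
    {"Month": 2, "Origin": "A"},
]


def month_counts(records):
    """Count records per distinct "Month" value, like groupBy("Month").count()."""
    return Counter(record["Month"] for record in records)


if __name__ == "__main__":
    for month, count in sorted(month_counts(rows).items()):
        print(month, count)
```

The Spark job produces the same kind of month-and-count pairs, but it computes them across the cluster and saves them as CSV.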

Copy the month_stat.py file to the cluster's master host:
```
sudo -E scp month_stat.py <username>@<cluster_host_FQDN>:~/month_stat.py
```
For image version 2.0, specify the ubuntu user; for image version 1.4, specify root.

Run the application:

sudo -u sparkuser spark-submit \
     --master yarn \
     --deploy-mode cluster \
     month_stat.py

The application writes its result to HDFS on the cluster. To list the resulting files, run:
```
ssh <username>@<cluster_host_FQDN> "hdfs dfs -ls /tmp/month_stat"
```
For image version 2.0, specify the ubuntu user; for image version 1.4, specify root.
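
If you submit jobs from a script rather than by hand, the commands in this section can be assembled programmatically. The following sketch only builds the argument lists and does not run anything. The helper names and the SSH_USER_BY_IMAGE mapping are hypothetical, although the mapping itself follows the rule stated above: root for image version 1.4 and ubuntu for image version 2.0.

```python
import shlex

# From this guide: root on image version 1.4, ubuntu on image version 2.0.
SSH_USER_BY_IMAGE = {"1.4": "root", "2.0": "ubuntu"}


def spark_submit_argv(app, app_args=(), main_class=None, run_as="sparkuser"):
    """Build the spark-submit command used in this guide as an argument list."""
    argv = ["sudo", "-u", run_as, "spark-submit",
            "--master", "yarn", "--deploy-mode", "cluster"]
    if main_class is not None:
        # Only JVM applications need a main class; PySpark scripts do not.
        argv += ["--class", main_class]
    argv.append(app)
    argv.extend(str(arg) for arg in app_args)
    return argv


def list_output_argv(image_version, cluster_host, hdfs_path="/tmp/month_stat"):
    """Build the ssh command that lists the job output in HDFS."""
    user = SSH_USER_BY_IMAGE[image_version]
    remote_command = "hdfs dfs -ls " + shlex.quote(hdfs_path)
    return ["ssh", user + "@" + cluster_host, remote_command]


if __name__ == "__main__":
    print(spark_submit_argv("month_stat.py"))
    print(list_output_argv("2.0", "<cluster_host_FQDN>"))
```

Either list can be passed to subprocess.run(argv, check=True). Building a list instead of a single shell string avoids quoting problems with application arguments.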

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
