Running jobs from remote hosts that are not part of the Yandex Data Processing cluster

Updated at July 16, 2026

Getting started
- Required paid resources
Set up your infrastructure
Running a job
Delete the resources you created

This tutorial describes how to use the spark-submit utility to run Spark jobs in the Yandex Data Processing cluster from hosts that are not part of the cluster.

Note

You can also run jobs in the Yandex Data Processing cluster from Yandex DataSphere. For more information, see this concept.

To run jobs from remote hosts that are not part of the cluster:

1. Set up your infrastructure.
2. Run a job.

If you no longer need the resources you created, delete them.

Getting started

Navigate to the management console and log in to Yandex Cloud or create a new account.
On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one and link a cloud to it.

If you have an active billing account, you can create or select a folder for your infrastructure on the cloud page.

Learn more about clouds and folders here.

Required paid resources

Yandex Data Processing cluster: use of computing resources with a Yandex Data Processing markup, use of network drives, retrieval and storage of logs, amount of outgoing traffic (see Yandex Data Processing pricing).
Public IP addresses if public access is enabled for cluster hosts (see Yandex Virtual Private Cloud pricing).
VM instance: use of computing resources, storage, public IP address, and OS (see Yandex Compute Cloud pricing).

Set up your infrastructure

Create and configure a host to run jobs remotely on the Yandex Data Processing cluster:

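The steps differ by cluster image version: those for image version 1.4 come first, followed by those for image version 2.0.

Image version 1.4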

Create a Yandex Data Processing cluster.
Create a VM running Ubuntu 16.04 LTS.
To give this VM network access to the Yandex Data Processing cluster, set up security groups for the cluster.
Connect to the VM over SSH:
```
ssh -A <username>@<VM_FQDN>
```
Copy the repository settings from any of the Yandex Data Processing cluster hosts. To do this, run a sequence of commands on the VM you created.
1. Copy the repository address:
```
ssh root@<cluster_host_FQDN> \
"cat /etc/apt/sources.list.d/yandex-dataproc.list" | \
sudo tee /etc/apt/sources.list.d/yandex-dataproc.list
```
2. Copy the GPG key to verify Debian package signatures:
```
ssh root@<cluster_host_FQDN> \
"cat /srv/dataproc.gpg" | sudo apt-key add -
```
3. Update the repository cache:
```
sudo apt update
```

Install the required packages:

```
sudo apt install openjdk-8-jre-headless hadoop-client hadoop-hdfs spark-core spark-python
```

Note

You need the spark-python package only to run PySpark jobs.

Copy the Hadoop and Spark configuration files:

```
sudo -E scp -r \
    root@<cluster_host_FQDN>:/etc/hadoop/conf/* \
    /etc/hadoop/conf/ && \
sudo -E scp -r \
    root@<cluster_host_FQDN>:/etc/spark/conf/* \
    /etc/spark/conf/
```

Create a user named sparkuser to run jobs:

```
sudo useradd sparkuser && \
ssh root@<cluster_host_FQDN> "
  hadoop fs -mkdir /user/sparkuser
  sudo -u hdfs hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
  sudo -u hdfs hdfs dfs -ls /user/sparkuser
"
```

Image version 2.0

Create a Yandex Data Processing cluster.
Create a VM running Ubuntu 20.04 LTS.
To give this VM network access to the Yandex Data Processing cluster, set up security groups for the cluster.
Connect to the VM over SSH:
```
ssh -A <username>@<VM_FQDN>
```
Copy the repository settings from any of the Yandex Data Processing cluster hosts. To do this, run a sequence of commands on the VM you created.
1. Copy the repository address:
```
ssh ubuntu@<cluster_host_FQDN> \
"cat /etc/apt/sources.list.d/yandex-dataproc.list" | \
sudo tee /etc/apt/sources.list.d/yandex-dataproc.list
```
2. Copy the GPG key to verify Debian package signatures:
```
ssh ubuntu@<cluster_host_FQDN> \
"cat /srv/dataproc.gpg" | sudo apt-key add -
```
3. Update the repository cache:
```
sudo apt update
```

Install the required packages:

```
sudo apt install openjdk-8-jre-headless hadoop-client hadoop-hdfs spark-core spark-python
```

Note

You need the spark-python package only to run PySpark jobs.

Copy the Hadoop and Spark configuration files:

```
sudo -E scp -r \
    ubuntu@<cluster_host_FQDN>:/etc/hadoop/conf/* \
    /etc/hadoop/conf/ && \
sudo -E scp -r \
    ubuntu@<cluster_host_FQDN>:/etc/spark/conf/* \
    /etc/spark/conf/
```

Create a user named sparkuser to run jobs:

```
sudo useradd sparkuser && \
ssh ubuntu@<cluster_host_FQDN> "
  hadoop fs -mkdir /user/sparkuser
  sudo -u hdfs hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
  sudo -u hdfs hdfs dfs -ls /user/sparkuser
"
```

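The two sets of steps above differ only in the VM's OS version and in the SSH user on the cluster hosts (root for image version 1.4, ubuntu for image version 2.0). As a minimal sketch of that difference, the hypothetical helpers below build the `user@host` target and the `scp` argument list for copying a configuration directory. The function and constant names are illustrative assumptions, not part of any Yandex Cloud tooling, and the host name is a placeholder.

```python
# Hypothetical helpers; names are illustrative, not part of any Yandex Cloud tool.
SSH_USER_BY_IMAGE = {"1.4": "root", "2.0": "ubuntu"}


def ssh_target(image_version, cluster_host_fqdn):
    """Return the user@host string used by the ssh and scp commands."""
    try:
        user = SSH_USER_BY_IMAGE[image_version]
    except KeyError:
        raise ValueError(f"unsupported image version: {image_version!r}") from None
    return f"{user}@{cluster_host_fqdn}"


def scp_config_command(image_version, cluster_host_fqdn, service):
    """Build the argument list that copies /etc/<service>/conf from a cluster host."""
    if service not in ("hadoop", "spark"):
        raise ValueError(f"unexpected service: {service!r}")
    target = ssh_target(image_version, cluster_host_fqdn)
    return [
        "sudo", "-E", "scp", "-r",
        f"{target}:/etc/{service}/conf/*",
        f"/etc/{service}/conf/",
    ]


# "cluster-host.example" is a placeholder for <cluster_host_FQDN>.
for service in ("hadoop", "spark"):
    print(" ".join(scp_config_command("2.0", "cluster-host.example", service)))
```

The sketch only prints the commands; running them still requires the SSH access and security group settings described above.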
Running a job

The first set of steps below runs a Spark job; the second runs a PySpark job.

Spark job

Run a job using this command:

```
sudo -u sparkuser spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --class org.apache.spark.examples.SparkPi \
     /usr/lib/spark/examples/jars/spark-examples.jar 1000
```

Result:

```
20/04/19 16:43:58 INFO client.RMProxy: Connecting to ResourceManager at rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/10.13.13.18:8032
20/04/19 16:43:58 INFO client.AHSProxy: Connecting to Application History server at rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/10.13.13.18:10200
20/04/19 16:43:58 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
...
20/04/19 16:43:58 INFO yarn.Client: Preparing resources for our AM container
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/usr/lib/spark/examples/jars/spark-examples.jar -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/spark-examples.jar
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/hive-site.xml
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/tmp/spark-6dff3163-089b-4634-8f74-c8301d424567/__spark_conf__8717606866210190000.zip -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/__spark_conf__.zip
20/04/19 16:44:00 INFO yarn.Client: Submitting application application_1586176069782_0003 to ResourceManager
20/04/19 16:44:00 INFO impl.YarnClientImpl: Submitted application application_1586176069782_0003
20/04/19 16:44:01 INFO yarn.Client: Application report for application_1586176069782_0003 (state: ACCEPTED)
20/04/19 16:44:01 INFO yarn.Client:
   client token: N/A
   diagnostics: AM container is launched, waiting for AM container to Register with RM
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: default
   start time: 1587314639386
   final status: UNDEFINED
   tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
   user: sparkuser
20/04/19 16:44:05 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:05 INFO yarn.Client:
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: rc1b-dataproc-d-9cd9yoenm4npsznt.mdb.yandexcloud.net
   ApplicationMaster RPC port: 41648
   queue: default
   start time: 1587314639386
   final status: UNDEFINED
   tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
   user: sparkuser
20/04/19 16:44:06 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:07 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:08 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:09 INFO yarn.Client: Application report for application_1586176069782_0003 (state: FINISHED)
20/04/19 16:44:09 INFO yarn.Client:
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: rc1b-dataproc-d-9cd9yoenm4npsznt.mdb.yandexcloud.net
   ApplicationMaster RPC port: 41648
   queue: default
   start time: 1587314639386
   final status: SUCCEEDED
   tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
   user: sparkuser
20/04/19 16:44:09 INFO util.ShutdownHookManager: Shutdown hook called
20/04/19 16:44:09 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-6dff3163-089b-4634-8f74-c8301d424567
20/04/19 16:44:09 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-826498b1-8dec-4229-905e-921203b7b1d0
```
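The application ID reported in this output is what the `yarn application` and `yarn logs` commands take as input. A minimal sketch of pulling it out of saved client output, together with the state transitions and the final status, might look like the following. The function name and regular expressions are illustrative assumptions based only on the line shapes shown above; other Spark or Hadoop versions may format these lines differently.

```python
import re

# Patterns derived from the sample output above; treat them as assumptions.
SUBMITTED_RE = re.compile(r"Submitted application (application_\d+_\d+)")
STATE_RE = re.compile(r"Application report for application_\d+_\d+ \(state: (\w+)\)")
FINAL_STATUS_RE = re.compile(r"^\s*final status: (\w+)\s*$")


def summarize_submit_log(log_text):
    """Collect the application ID, distinct consecutive states, and last final status."""
    app_id = None
    states = []
    final_status = None
    for line in log_text.splitlines():
        match = SUBMITTED_RE.search(line)
        if match:
            app_id = match.group(1)
        match = STATE_RE.search(line)
        if match and (not states or states[-1] != match.group(1)):
            states.append(match.group(1))
        match = FINAL_STATUS_RE.match(line)
        if match:
            final_status = match.group(1)
    return {"application_id": app_id, "states": states, "final_status": final_status}


# Excerpt of the output shown above.
SAMPLE_LOG = """\
20/04/19 16:44:00 INFO impl.YarnClientImpl: Submitted application application_1586176069782_0003
20/04/19 16:44:01 INFO yarn.Client: Application report for application_1586176069782_0003 (state: ACCEPTED)
   final status: UNDEFINED
20/04/19 16:44:05 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:06 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:09 INFO yarn.Client: Application report for application_1586176069782_0003 (state: FINISHED)
   final status: SUCCEEDED
"""

summary = summarize_submit_log(SAMPLE_LOG)
print(summary)
```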

Check the job execution status using the yarn application utility:

```
yarn application -status application_1586176069782_0003
```

Result:

```
20/04/19 16:47:03 INFO client.RMProxy: Connecting to ResourceManager at rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net/10.13.13.18:8032
20/04/19 16:47:03 INFO client.AHSProxy: Connecting to Application History server at rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net/10.13.13.18:10200
Application Report :
    Application-Id : application_1586176069782_0003
    Application-Name : org.apache.spark.examples.SparkPi
    Application-Type : SPARK
    User : sparkuser
    Queue : default
    Application Priority : 0
    Start-Time : 1587314639386
    Finish-Time : 1587314647621
    Progress : 100%
    State : FINISHED
    Final-State : SUCCEEDED
    Tracking-URL : rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net:18080/history/application_1586176069782_0003/1
    RPC Port : 41648
    AM Host : rc1b-dataproc-d-9cd9yoen********.mdb.yandexcloud.net
    Aggregate Resource Allocation : 141510 MB-seconds, 11 vcore-seconds
    Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
    Log Aggregation Status : SUCCEEDED
    Diagnostics :
    Unmanaged Application : false
    Application Node Label Expression : <Not set>
    AM container Node Label Expression : <DEFAULT_PARTITION>
    TimeoutType : LIFETIME    ExpiryTime : UNLIMITED    RemainingTime : -1seconds
```
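The report is a list of `Key : value` lines, and `Start-Time` and `Finish-Time` appear to be Unix timestamps in milliseconds. As a rough sketch under that assumption, the report can be turned into a dictionary and the run time derived from it. The helper names are hypothetical, and the sample is an excerpt of the report above.

```python
def parse_application_report(report_text):
    """Turn 'Key : value' lines into a dict; lines without ' :' are skipped."""
    fields = {}
    for line in report_text.splitlines():
        key, separator, value = line.strip().partition(" :")
        if separator and key:
            fields[key.strip()] = value.strip()
    return fields


def run_duration_ms(fields):
    """Difference between Finish-Time and Start-Time, assumed to be in milliseconds."""
    return int(fields["Finish-Time"]) - int(fields["Start-Time"])


# Excerpt of the report shown above.
SAMPLE_REPORT = """\
Application Report :
    Application-Id : application_1586176069782_0003
    Application-Type : SPARK
    Start-Time : 1587314639386
    Finish-Time : 1587314647621
    State : FINISHED
    Final-State : SUCCEEDED
    Diagnostics :
"""

report = parse_application_report(SAMPLE_REPORT)
print(report["State"], report["Final-State"])
print(run_duration_ms(report), "ms")
```

For the sample report, the difference between the two timestamps is 8235 ms, which matches the few seconds the application spent in the RUNNING state.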

View logs from all running containers using the yarn logs utility:

```
sudo -u sparkuser yarn logs \
     -applicationId application_1586176069782_0003 | grep "Pi is"
```

Result:

```
Pi is roughly 3.14164599141646
```
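The SparkPi example that ships with Apache Spark estimates π by Monte Carlo sampling: it draws random points in a square and counts how many fall inside the inscribed circle, with the work split across the number of partitions given as the argument (1000 in the command above). The single-process sketch below illustrates that idea only. It is not the example's source code, and the function name, sample count, and seed are arbitrary assumptions.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the share of random points in [-1, 1]^2 inside the unit circle."""
    rng = random.Random(seed)  # fixed seed so that repeated runs give the same estimate
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of circle / area of square = pi / 4
    return 4.0 * inside / samples


print("Pi is roughly", estimate_pi(100_000))
```

Because the estimate is random, the job output varies slightly between runs, which is why the result is reported as "roughly" π.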

PySpark job

On the VM, create a file named month_stat.py with the following code:

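```
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext


def main():
    conf = SparkConf().setAppName("Month Stat - Python")
    # Hadoop options set through SparkConf need the "spark.hadoop." prefix
    # to reach the Hadoop configuration; the bucket is read anonymously.
    conf.set("spark.hadoop.fs.s3a.aws.credentials.provider",
             "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
    sc = SparkContext(conf=conf)

    sql = SQLContext(sc)
    # Read the sample dataset from the s3a:// bucket.
    df = sql.read.parquet("s3a://yc-mdb-examples/dataproc/example01/set01")
    # Resolve the cluster's default file system for the output path.
    defaultFS = sc._jsc.hadoopConfiguration().get("fs.defaultFS")
    # Count rows per month and write the result as a single CSV part file.
    month_stat = df.groupBy("Month").count()
    month_stat.repartition(1).write.format("csv").save(defaultFS + "/tmp/month_stat")


if __name__ == "__main__":
    main()
```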
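The core of this job is `groupBy("Month").count()`, which counts the rows for each distinct value of the Month column. As a plain-Python illustration of that aggregation, not a substitute for running it on the cluster, the sketch below uses `collections.Counter` on a handful of made-up rows. The sample values are hypothetical and do not come from the dataset.

```python
from collections import Counter


def month_counts(rows):
    """Count rows per value of the 'Month' field, like groupBy('Month').count()."""
    return Counter(row["Month"] for row in rows)


# Hypothetical sample rows; the real dataset is read from the s3a:// bucket.
sample_rows = [
    {"Month": 1}, {"Month": 1}, {"Month": 2},
    {"Month": 12}, {"Month": 2}, {"Month": 1},
]

# Print one "month,count" line per group, similar to rows in a headerless CSV.
for month, count in sorted(month_counts(sample_rows).items()):
    print(f"{month},{count}")
```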

Copy the month_stat.py file to the cluster's master host:
```
sudo -E scp month_stat.py <username>@<cluster_host_FQDN>:~/month_stat.py
```
For image version 2.0, specify the ubuntu user; for image version 1.4, specify root.

Run the application:

```
sudo -u sparkuser spark-submit \
     --master yarn \
     --deploy-mode cluster \
     month_stat.py
```

The result is written to HDFS on the cluster. To list the resulting files, run:
```
ssh <username>@<cluster_host_FQDN> "hdfs dfs -ls /tmp/month_stat"
```
For image version 2.0, specify the ubuntu user; for image version 1.4, specify root.
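The two `spark-submit` invocations in this tutorial share the same options and differ only in the application file, its arguments, and the `--class` option. If you submit jobs from a script, one possible approach is to assemble the argument list in code. The sketch below is an assumption-laden illustration: `build_spark_submit` is a hypothetical helper, and `shlex.join` requires Python 3.8 or later.

```python
import shlex


def build_spark_submit(app, app_args=(), main_class=None, run_as="sparkuser"):
    """Build a spark-submit argument list for YARN cluster mode."""
    cmd = [
        "sudo", "-u", run_as, "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
    ]
    if main_class:
        # Only JVM applications need the main class.
        cmd += ["--class", main_class]
    cmd.append(app)
    cmd += [str(arg) for arg in app_args]
    return cmd


spark_pi = build_spark_submit(
    "/usr/lib/spark/examples/jars/spark-examples.jar",
    app_args=[1000],
    main_class="org.apache.spark.examples.SparkPi",
)
month_stat = build_spark_submit("month_stat.py")

print(shlex.join(spark_pi))
print(shlex.join(month_stat))
```

To actually submit a job, the list can be passed to `subprocess.run(cmd, check=True)` on the VM configured earlier.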

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.

Delete the resources you created

Some resources are not free of charge. To avoid paying for resources you no longer need, delete them:

1. Delete the VM.
2. Delete the Yandex Data Processing cluster.
