Running jobs from remote hosts that are not part of the Yandex Data Processing cluster
This guide describes how to use the spark-submit utility to run Spark jobs on a Yandex Data Processing cluster from a host that is not part of the cluster.
Note
You can also run jobs in the Yandex Data Processing cluster from Yandex DataSphere. For more information, see this concept.
Getting started
Create and configure a host to run jobs remotely on the Yandex Data Processing cluster:
Image version 1.4:

- Create a VM with Ubuntu 16.04 LTS.

- To provide the created VM with network access to the Yandex Data Processing cluster, set up security groups for the cluster.

- Connect to the VM over SSH:

  ssh -A <username>@<VM_FQDN>
- Copy the repository settings from any of the Yandex Data Processing cluster hosts. To do this, run the following commands on the VM you created:

  - Copy the repository address:

    ssh root@<cluster_host_FQDN> \
    "cat /etc/apt/sources.list.d/yandex-dataproc.list" | \
    sudo tee /etc/apt/sources.list.d/yandex-dataproc.list

  - Copy the GPG key used to verify Debian package signatures:

    ssh root@<cluster_host_FQDN> \
    "cat /srv/dataproc.gpg" | sudo apt-key add -

  - Update the repository cache:

    sudo apt update
- Install the required packages:

  sudo apt install openjdk-8-jre-headless hadoop-client hadoop-hdfs spark-core spark-python

  Note: you only need the spark-python package to run PySpark jobs.

- Copy the Hadoop and Spark configuration files:

  sudo -E scp -r \
  root@<cluster_host_FQDN>:/etc/hadoop/conf/* \
  /etc/hadoop/conf/ && \
  sudo -E scp -r \
  root@<cluster_host_FQDN>:/etc/spark/conf/* \
  /etc/spark/conf/

- Create a user named sparkuser to run jobs:

  sudo useradd sparkuser && \
  ssh root@<cluster_host_FQDN> "
    sudo -u hdfs hadoop fs -mkdir /user/sparkuser
    sudo -u hdfs hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
    sudo -u hdfs hdfs dfs -ls /user/sparkuser
  "
Image version 2.0:

- Create a VM with Ubuntu 20.04 LTS.

- To provide the created VM with network access to the Yandex Data Processing cluster, set up security groups for the cluster.

- Connect to the VM over SSH:

  ssh -A <username>@<VM_FQDN>
- Copy the repository settings from any of the Yandex Data Processing cluster hosts. To do this, run the following commands on the VM you created:

  - Copy the repository address:

    ssh ubuntu@<cluster_host_FQDN> \
    "cat /etc/apt/sources.list.d/yandex-dataproc.list" | \
    sudo tee /etc/apt/sources.list.d/yandex-dataproc.list

  - Copy the GPG key used to verify Debian package signatures:

    ssh ubuntu@<cluster_host_FQDN> \
    "cat /srv/dataproc.gpg" | sudo apt-key add -

  - Update the repository cache:

    sudo apt update
- Install the required packages:

  sudo apt install openjdk-8-jre-headless hadoop-client hadoop-hdfs spark-core spark-python

  Note: you only need the spark-python package to run PySpark jobs.

- Copy the Hadoop and Spark configuration files:

  sudo -E scp -r \
  ubuntu@<cluster_host_FQDN>:/etc/hadoop/conf/* \
  /etc/hadoop/conf/ && \
  sudo -E scp -r \
  ubuntu@<cluster_host_FQDN>:/etc/spark/conf/* \
  /etc/spark/conf/

- Create a user named sparkuser to run jobs:

  sudo useradd sparkuser && \
  ssh ubuntu@<cluster_host_FQDN> "
    sudo -u hdfs hadoop fs -mkdir /user/sparkuser
    sudo -u hdfs hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
    sudo -u hdfs hdfs dfs -ls /user/sparkuser
  "
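The configuration copy above is what lets spark-submit and the HDFS tools on the new host locate the cluster: properties such as fs.defaultFS live in the copied files. As an illustration only (the XML fragment and host name below are made up, not taken from a real cluster), a Hadoop-style property can be read like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the shape of /etc/hadoop/conf/core-site.xml;
# the real file is copied from the cluster in the steps above.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://rc1b-dataproc-m-example.mdb.yandexcloud.net</value>
  </property>
</configuration>"""

def get_hadoop_property(xml_text, name):
    """Return the value of a named property from a Hadoop-style XML config."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

print(get_hadoop_property(SAMPLE_CORE_SITE, "fs.defaultFS"))
# → hdfs://rc1b-dataproc-m-example.mdb.yandexcloud.net
```

If a later command cannot reach HDFS or YARN, checking these properties in the copied files is a reasonable first step.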
Running a job
- Run a job using the command:

  sudo -u sparkuser spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    /usr/lib/spark/examples/jars/spark-examples.jar 1000
Result:
20/04/19 16:43:58 INFO client.RMProxy: Connecting to ResourceManager at rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/10.13.13.18:8032
20/04/19 16:43:58 INFO client.AHSProxy: Connecting to Application History server at rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/10.13.13.18:10200
20/04/19 16:43:58 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
...
20/04/19 16:43:58 INFO yarn.Client: Preparing resources for our AM container
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/usr/lib/spark/examples/jars/spark-examples.jar -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/spark-examples.jar
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/hive-site.xml
20/04/19 16:43:58 INFO yarn.Client: Uploading resource file:/tmp/spark-6dff3163-089b-4634-8f74-c8301d424567/__spark_conf__8717606866210190000.zip -> hdfs://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net/user/sparkuser/.sparkStaging/application_1586176069782_0003/__spark_conf__.zip
20/04/19 16:44:00 INFO yarn.Client: Submitting application application_1586176069782_0003 to ResourceManager
20/04/19 16:44:00 INFO impl.YarnClientImpl: Submitted application application_1586176069782_0003
20/04/19 16:44:01 INFO yarn.Client: Application report for application_1586176069782_0003 (state: ACCEPTED)
20/04/19 16:44:01 INFO yarn.Client:
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1587314639386
     final status: UNDEFINED
     tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
     user: sparkuser
20/04/19 16:44:05 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:05 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: rc1b-dataproc-d-9cd9yoenm4npsznt.mdb.yandexcloud.net
     ApplicationMaster RPC port: 41648
     queue: default
     start time: 1587314639386
     final status: UNDEFINED
     tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
     user: sparkuser
20/04/19 16:44:06 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:07 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:08 INFO yarn.Client: Application report for application_1586176069782_0003 (state: RUNNING)
20/04/19 16:44:09 INFO yarn.Client: Application report for application_1586176069782_0003 (state: FINISHED)
20/04/19 16:44:09 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: rc1b-dataproc-d-9cd9yoenm4npsznt.mdb.yandexcloud.net
     ApplicationMaster RPC port: 41648
     queue: default
     start time: 1587314639386
     final status: SUCCEEDED
     tracking URL: http://rc1b-dataproc-m-ds7lj5gnnnqggbqd.mdb.yandexcloud.net:8088/proxy/application_1586176069782_0003/
     user: sparkuser
20/04/19 16:44:09 INFO util.ShutdownHookManager: Shutdown hook called
20/04/19 16:44:09 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-6dff3163-089b-4634-8f74-c8301d424567
20/04/19 16:44:09 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-826498b1-8dec-4229-905e-921203b7b1d0
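If you script job submission, the application ID needed by the yarn commands can be pulled from output like the above. A minimal sketch, using the log line from the sample output:

```python
import re

# Sample line from the spark-submit output shown above.
log_line = ("20/04/19 16:44:00 INFO impl.YarnClientImpl: "
            "Submitted application application_1586176069782_0003")

# YARN application IDs have the form application_<clusterTimestamp>_<sequence>.
match = re.search(r"Submitted application (application_\d+_\d+)", log_line)
app_id = match.group(1) if match else None
print(app_id)  # → application_1586176069782_0003
```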
- Check the job execution status using the yarn application utility:

  yarn application -status application_1586176069782_0003
Result:
20/04/19 16:47:03 INFO client.RMProxy: Connecting to ResourceManager at rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net/10.13.13.18:8032
20/04/19 16:47:03 INFO client.AHSProxy: Connecting to Application History server at rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net/10.13.13.18:10200
Application Report :
    Application-Id : application_1586176069782_0003
    Application-Name : org.apache.spark.examples.SparkPi
    Application-Type : SPARK
    User : sparkuser
    Queue : default
    Application Priority : 0
    Start-Time : 1587314639386
    Finish-Time : 1587314647621
    Progress : 100%
    State : FINISHED
    Final-State : SUCCEEDED
    Tracking-URL : rc1b-dataproc-m-ds7lj5gn********.mdb.yandexcloud.net:18080/history/application_1586176069782_0003/1
    RPC Port : 41648
    AM Host : rc1b-dataproc-d-9cd9yoen********.mdb.yandexcloud.net
    Aggregate Resource Allocation : 141510 MB-seconds, 11 vcore-seconds
    Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
    Log Aggregation Status : SUCCEEDED
    Diagnostics :
    Unmanaged Application : false
    Application Node Label Expression : <Not set>
    AM container Node Label Expression : <DEFAULT_PARTITION>
    TimeoutType : LIFETIME
    ExpiryTime : UNLIMITED
    RemainingTime : -1seconds
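If you need these fields programmatically, the report's `Key : Value` lines are straightforward to parse; a sketch, tested on an abridged copy of the sample report above:

```python
def parse_yarn_report(text):
    """Collect the 'Key : Value' lines of a yarn application -status report."""
    fields = {}
    for line in text.splitlines():
        if " : " in line:
            key, _, value = line.partition(" : ")
            fields[key.strip()] = value.strip()
    return fields

# Abridged sample of the report shown above.
sample = """Application Report :
  Application-Id : application_1586176069782_0003
  State : FINISHED
  Final-State : SUCCEEDED
  Progress : 100%"""

report = parse_yarn_report(sample)
print(report["Final-State"])  # → SUCCEEDED
```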
- View logs from all running containers using the yarn logs utility:

  sudo -u sparkuser yarn logs \
    -applicationId application_1586176069782_0003 | grep "Pi is"
Result:
Pi is roughly 3.14164599141646
- On the VM, create a file named month_stat.py with the following code:

  import sys

  from pyspark import SparkContext, SparkConf
  from pyspark.sql import SQLContext

  def main():
      conf = SparkConf().setAppName("Month Stat - Python")
      conf.set("fs.s3a.aws.credentials.provider",
               "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
      sc = SparkContext(conf=conf)
      sql = SQLContext(sc)
      df = sql.read.parquet("s3a://yc-mdb-examples/dataproc/example01/set01")
      defaultFS = sc._jsc.hadoopConfiguration().get("fs.defaultFS")
      month_stat = df.groupBy("Month").count()
      month_stat.repartition(1).write.format("csv").save(defaultFS + "/tmp/month_stat")

  if __name__ == "__main__":
      main()
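The aggregation this job performs, a row count per Month value, can be sanity-checked in plain Python before running it at scale; the toy rows below are illustrative, not real records from the example dataset:

```python
from collections import Counter

# Toy rows standing in for the Parquet dataset read by month_stat.py.
rows = [
    {"Month": 1},
    {"Month": 1},
    {"Month": 2},
]

# Plain-Python equivalent of df.groupBy("Month").count() on these rows.
month_stat = Counter(row["Month"] for row in rows)
print(sorted(month_stat.items()))  # → [(1, 2), (2, 1)]
```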
- Copy the month_stat.py file to the cluster's master host:

  sudo -E scp month_stat.py <username>@<cluster_host_FQDN>:~/month_stat.py

  For image version 2.0, specify the ubuntu user; for image version 1.4, specify root.

- Run the application:

  sudo -u sparkuser spark-submit \
    --master yarn \
    --deploy-mode cluster \
    month_stat.py
- The result of running the application is exported to HDFS on the cluster. You can list the resulting files with:

  ssh <username>@<cluster_host_FQDN> "hdfs dfs -ls /tmp/month_stat"

  For image version 2.0, specify the ubuntu user; for image version 1.4, specify root.
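The job writes headerless CSV part files containing month,count pairs. Once a part file has been fetched from HDFS (for example, with hdfs dfs -get), it can be read as follows; the sample contents are illustrative, not real output:

```python
import csv
import io

# Illustrative contents of a part-*.csv file written by the job.
sample = "1,621559\n2,565604\n"

# Each row is "<Month>,<count>" with no header line.
stats = {int(month): int(count)
         for month, count in csv.reader(io.StringIO(sample))}
print(stats)  # → {1: 621559, 2: 565604}
```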
Note
You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.