Managing PySpark jobs
Create a job
Note
If the Yandex Data Processing cluster is connected to a Metastore cluster, add the spark:spark.sql.hive.metastore.sharedPrefixes property with the com.amazonaws,ru.yandex.cloud value to the Yandex Data Processing cluster so that your PySpark jobs complete successfully. You can set this property when creating or updating the cluster.
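As a key-value pair, the property from this note looks as follows (the key=value notation here assumes the convention used for component properties elsewhere in this guide):

# Property from the note above, written as a key=value pair
spark:spark.sql.hive.metastore.sharedPrefixes=com.amazonaws,ru.yandex.cloud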
To create a PySpark job:
- Go to the folder page and select Yandex Data Processing.
- Click the cluster name and open the Jobs tab.
- Click Submit job.
- (Optional) Enter a name for the job.
- In the Job type field, select PySpark.
- In the Main python file field, specify the path to the main PY application file in one of the following formats:
  - Instance file system: file:///<path_to_file>
  - Distributed cluster file system: hdfs:///<path_to_file>
  - Object Storage bucket: s3a://<bucket_name>/<path_to_file>
  - Internet: http://<path_to_file> or https://<path_to_file>
  Archives in standard Linux formats, such as zip, gz, xz, bz2, and others, are supported. Example paths in each format are shown after these steps.
  The cluster service account needs read access to all the files in the bucket. For step-by-step guides on setting up access to Object Storage, see Editing a bucket's ACL.
- (Optional) Specify the paths to the PY files, if any.
- Specify job arguments.
  If an argument, variable, or property consists of several space-separated parts, specify each part separately while preserving the order in which the parts are declared. For example, the -mapper mapper.py argument must be passed as two arguments, -mapper and mapper.py, in that order.
- (Optional) Specify the paths to the JAR files, if any.
- (Optional) Configure advanced settings:
  - Specify paths to the required files and archives.
  - In the Properties field, specify component properties as key-value pairs.
- Click Submit job.
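For illustration, here are hypothetical paths in each of the supported formats; the directory, bucket, and host names below are placeholders, not real locations:

# Hypothetical example paths - all names are placeholders
file:///opt/jobs/main.py
hdfs:///user/jobs/main.py
s3a://my-bucket/jobs/main.py
https://example.com/jobs/main.py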
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To create a job:
- View the description of the CLI create command for PySpark jobs:
  yc dataproc job create-pyspark --help
- Create a job (the example does not show all the available parameters):
  yc dataproc job create-pyspark \
    --cluster-name=<cluster_name> \
    --name=<job_name> \
    --main-python-file-uri=<path_to_main_py_file_of_application> \
    --python-file-uris=<paths_to_additional_py_files> \
    --jar-file-uris=<paths_to_jar_files> \
    --archive-uris=<paths_to_archives> \
    --properties=<component_properties> \
    --args=<arguments> \
    --packages=<Maven_coordinates_of_jar_files> \
    --repositories=<additional_repositories> \
    --exclude-packages=<packages_to_exclude>
Where:
- --properties: Component properties as key-value pairs.
- --args: Arguments passed to the job.
- --packages: Maven coordinates of JAR files in groupId:artifactId:version format.
- --repositories: Additional repositories to search for packages.
- --exclude-packages: Packages to exclude, in groupId:artifactId format.
Provide the paths to the files required for the job in one of the following formats:
- Instance file system: file:///<path_to_file>
- Distributed cluster file system: hdfs:///<path_to_file>
- Object Storage bucket: s3a://<bucket_name>/<path_to_file>
- Internet: http://<path_to_file> or https://<path_to_file>
Archives in standard Linux formats, such as zip, gz, xz, bz2, and others, are supported.
The cluster service account needs read access to all the files in the bucket. For step-by-step guides on setting up access to Object Storage, see Editing a bucket's ACL.
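As an illustration, here is a filled-in version of the command above with hypothetical values. The cluster, job, bucket, and file names are placeholders, and each space-separated part of a multipart argument is passed as a separate --args flag to preserve ordering:

# A sketch with placeholder names; not all parameters are shown
yc dataproc job create-pyspark \
  --cluster-name=my-cluster \
  --name=my-pyspark-job \
  --main-python-file-uri="s3a://my-bucket/jobs/main.py" \
  --python-file-uris="s3a://my-bucket/jobs/mapper.py" \
  --properties="spark.executor.memory=2g" \
  --args="-mapper" \
  --args="mapper.py"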
You can get the cluster ID and name with a list of clusters in the folder.
Use the create API method and include the following information in the request:
- Cluster ID in the clusterId parameter.
- Job name in the name parameter.
- Job properties in the pysparkJob parameter.
You can get the cluster ID with a list of clusters in the folder.
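A minimal sketch of this request with curl, assuming the public REST endpoint of the Yandex Data Processing API and the camelCase field names from its reference (mainPythonFileUri and others); the cluster ID, job name, and paths are placeholders:

# Assumes the dataproc.api.cloud.yandex.net REST endpoint; all values are placeholders
curl -X POST \
  -H "Authorization: Bearer $(yc iam create-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "my-pyspark-job",
        "pysparkJob": {
          "mainPythonFileUri": "s3a://my-bucket/jobs/main.py",
          "args": ["-mapper", "mapper.py"]
        }
      }' \
  "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs"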
Cancel a job
Note
You cannot cancel jobs with the ERROR, DONE, or CANCELLED status. To find out a job's status, retrieve a list of jobs in the cluster.
- Go to the folder page and select Yandex Data Processing.
- Click the cluster name and open the Jobs tab.
- Click the job name.
- Click Cancel in the top-right corner of the page.
- In the window that opens, select Cancel.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To cancel a job, run the command below:
yc dataproc job cancel <job_name_or_ID> \
--cluster-name=<cluster_name>
You can get the job name or ID from the list of cluster jobs, and the cluster name from the list of clusters in the folder.
Use the cancel API method and include the following in the request:
- Cluster ID in the clusterId parameter.
- Job ID in the jobId parameter.
You can get the cluster ID from the list of clusters in the folder, and the job ID from the list of cluster jobs.
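A sketch of the same call with curl, assuming the :cancel suffix convention of the Yandex Cloud REST APIs; the IDs are placeholders:

# Assumed endpoint; cluster and job IDs are placeholders
curl -X POST \
  -H "Authorization: Bearer $(yc iam create-token)" \
  "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs/<job_ID>:cancel"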
Get a list of jobs
- Go to the folder page and select Yandex Data Processing.
- Click the cluster name and open the Jobs tab.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To get a list of jobs, run the following command:
yc dataproc job list --cluster-name=<cluster_name>
You can get the cluster ID and name with a list of clusters in the folder.
Use the list API method and provide the cluster ID in the clusterId request parameter.
You can get the cluster ID with a list of clusters in the folder.
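For example, with curl (same endpoint assumptions as above; the cluster ID is a placeholder):

# Assumed endpoint; cluster ID is a placeholder
curl -H "Authorization: Bearer $(yc iam create-token)" \
  "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs"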
Get general information about the job
- Go to the folder page and select Yandex Data Processing.
- Click the cluster name and open the Jobs tab.
- Click the job name.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To get general information about the job, run the command:
yc dataproc job get \
--cluster-name=<cluster_name> \
--name=<job_name>
You can get the cluster ID and name with a list of clusters in the folder.
Use the get API method and include the following in the request:
- Cluster ID in the clusterId parameter. You can get it with a list of clusters in the folder.
- Job ID in the jobId parameter. You can get it with the list of cluster jobs.
Get job execution logs
Note
You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
- Go to the folder page and select Yandex Data Processing.
- Click the cluster name and open the Jobs tab.
- Click the job name.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To get job execution logs, run the following command:
yc dataproc job log \
--cluster-name=<cluster_name> \
--name=<job_name>
You can get the cluster ID and name with a list of clusters in the folder.
Use the listLog API method and include the following in the request:
- Cluster ID in the clusterId parameter. You can get it with a list of clusters in the folder.
- Job ID in the jobId parameter. You can get it with the list of cluster jobs.
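A sketch with curl, assuming the :logs suffix used by other Yandex Cloud log-listing methods; the exact path is an assumption, and the IDs are placeholders:

# Assumed endpoint and :logs suffix; cluster and job IDs are placeholders
curl -H "Authorization: Bearer $(yc iam create-token)" \
  "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs/<job_ID>:logs"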