Managing Hive jobs
In a Yandex Data Processing cluster, you can manage jobs and receive execution logs for them. For examples of jobs, see Working with jobs.
Creating a job
- Open the folder dashboard.
- Navigate to Yandex Data Processing.
- Click the name of your cluster and select the Jobs tab.
- Click Submit job.
- Optionally, enter a name for the job.
- In the Job type field, select Hive.
- Optionally, in the Properties field, specify component properties as key-value pairs.
  If an argument, variable, or property consists of several space-separated parts, specify each part separately, preserving the order in which the arguments, variables, and properties are declared.
  For instance, the -mapper mapper.py argument must be converted into two arguments, -mapper and mapper.py, in that order.
- Optionally, enable Continue on failure.
- Specify Script variables as key-value pairs.
- Optionally, specify the paths to JAR files, if any.
  | File location | Path format |
  | --- | --- |
  | Instance file system | file:///<path_to_file> |
  | Distributed cluster file system | hdfs:///<path_to_file> |
  | Object Storage bucket | s3a://<bucket_name>/<path_to_file> |
  | Internet | http://<path_to_file> or https://<path_to_file> |
  Archives in standard Linux formats, such as zip, gz, xz, bz2, etc., are supported.
  The cluster service account needs read access to all the files in the bucket. Step-by-step guides on how to set up access to Object Storage are provided in Editing a bucket ACL.
- Select one of the driver types and specify what to use to run the job:
  - List of queries to run.
  - Path to the file with the queries to run.
- Click Submit job.
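The rule about space-separated parts can be illustrated with a small shell sketch. The helper name split_parts is hypothetical and not part of any Yandex Cloud tooling; it only shows how one string such as -mapper mapper.py becomes two ordered parts.

```shell
# Hypothetical helper: print each space-separated part of a string on its own
# line, preserving the original order.
split_parts() {
  set -f                      # disable globbing so characters like * stay literal
  for part in $1; do          # unquoted on purpose: the shell splits on whitespace
    printf '%s\n' "$part"
  done
  set +f
}

# Usage: split_parts "-mapper mapper.py"
# Each printed line is one value to enter as a separate part.
```

This simple split does not handle parts that contain quoted spaces; such values would need to be entered by hand.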
If you do not have the Yandex Cloud CLI yet, install and initialize it.
The folder used by default is the one specified when creating the CLI profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also specify a different folder for any command using --folder-name or --folder-id. If you access a resource by its name, the search will be limited to the default folder. If you access a resource by its ID, the search will be global, i.e., through all folders based on access permissions.
To create a job:
- See the description of the CLI command for creating Hive jobs:
  yc dataproc job create-hive --help
- Create a job (the example does not illustrate all available parameters):
  yc dataproc job create-hive \
    --cluster-name=<cluster_name> \
    --name=<job_name> \
    --query-file-uri=<query_file_URI> \
    --script-variables=<list_of_values>
  Where --script-variables is the comma-separated list of variable values.
  Provide the paths to the files required for the job in the following format:
  | File location | Path format |
  | --- | --- |
  | Instance file system | file:///<path_to_file> |
  | Distributed cluster file system | hdfs:///<path_to_file> |
  | Object Storage bucket | s3a://<bucket_name>/<path_to_file> |
  | Internet | http://<path_to_file> or https://<path_to_file> |
  Archives in standard Linux formats, such as zip, gz, xz, bz2, etc., are supported.
  The cluster service account needs read access to all the files in the bucket. Step-by-step guides on how to set up access to Object Storage are provided in Editing a bucket ACL.
You can get the cluster ID and name with the list of clusters in the folder.
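As a rough sketch of how the path formats above combine with the create-hive command, the following shell functions build a file URI and pass it to --query-file-uri. The function names (job_file_uri, submit_hive_query_file) and the location keywords (local, hdfs, s3) are hypothetical; only the yc flags shown earlier in this section are used.

```shell
# Hypothetical helper: build a job file URI in one of the documented formats.
#   job_file_uri local <path>            -> file:///<path>
#   job_file_uri hdfs  <path>            -> hdfs:///<path>
#   job_file_uri s3    <bucket> <path>   -> s3a://<bucket>/<path>
job_file_uri() {
  case "$1" in
    local) printf 'file:///%s\n' "${2#/}" ;;       # drop a leading slash, if any
    hdfs)  printf 'hdfs:///%s\n' "${2#/}" ;;
    s3)    printf 's3a://%s/%s\n' "$2" "${3#/}" ;;
    *)     echo "unknown location: $1" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: submit a Hive job whose queries are stored in a bucket.
# Arguments: cluster name, job name, bucket name, path to the query file.
submit_hive_query_file() {
  yc dataproc job create-hive \
    --cluster-name="$1" \
    --name="$2" \
    --query-file-uri="$(job_file_uri s3 "$3" "$4")"
}

# Usage: submit_hive_query_file <cluster_name> <job_name> <bucket_name> <path_to_file>
```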
Call the create API method and provide the following in the request:
- Cluster ID in the clusterId parameter. You can get it with the list of clusters in the folder.
- Job name in the name parameter.
- Job properties in the hiveJob parameter.
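The shape of such a request can be sketched as a JSON body built in the shell. The name and hiveJob parameters come from the list above; the nested queryFileUri field and the helper name hive_job_body are assumptions for illustration, so check the API reference for the exact schema and for how clusterId is passed.

```shell
# Hypothetical helper: print a JSON request body for creating a Hive job.
# Arguments: job name, query file URI.
# Assumption: hiveJob accepts a queryFileUri field; verify against the API reference.
# Note: values are inserted as-is, so they must not contain double quotes.
hive_job_body() {
  cat <<EOF
{
  "name": "$1",
  "hiveJob": {
    "queryFileUri": "$2"
  }
}
EOF
}

# Usage: hive_job_body <job_name> s3a://<bucket_name>/<path_to_file>
```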
Canceling a job
Note
You cannot cancel jobs with the ERROR, DONE, or CANCELLED status. To find out the job status, get the list of jobs in the cluster.
- Open the folder dashboard.
- Navigate to Yandex Data Processing.
- Click the name of your cluster and select the Jobs tab.
- Click the job name.
- Click Cancel in the top-right corner of the page.
- In the window that opens, select Cancel.
If you do not have the Yandex Cloud CLI yet, install and initialize it.
The folder used by default is the one specified when creating the CLI profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also specify a different folder for any command using --folder-name or --folder-id. If you access a resource by its name, the search will be limited to the default folder. If you access a resource by its ID, the search will be global, i.e., through all folders based on access permissions.
To cancel a job, run this command:
yc dataproc job cancel <job_name_or_ID> \
--cluster-name=<cluster_name>
You can get the job ID and name from the list of jobs in the cluster, and the cluster name from the list of clusters in the folder.
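Because jobs with the ERROR, DONE, or CANCELLED status cannot be canceled, a script may want to check the status before calling the cancel command. The sketch below assumes the caller already knows the current status (for example, from the list of jobs); the function names are hypothetical.

```shell
# Hypothetical helper: succeed only if a job in the given status can be canceled.
is_cancellable() {
  case "$1" in
    ERROR|DONE|CANCELLED) return 1 ;;   # finished jobs cannot be canceled
    *) return 0 ;;
  esac
}

# Hypothetical wrapper: cancel a job unless it has already finished.
# Arguments: cluster name, job name or ID, current job status.
cancel_if_possible() {
  if is_cancellable "$3"; then
    yc dataproc job cancel "$2" --cluster-name="$1"
  else
    echo "Job $2 has status $3 and cannot be canceled" >&2
    return 1
  fi
}

# Usage: cancel_if_possible <cluster_name> <job_name_or_ID> <current_status>
```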
Call the cancel API method and provide the following in the request:
- Cluster ID in the clusterId parameter.
- Job ID in the jobId parameter.
You can get the cluster ID with the list of clusters in the folder, and the job ID, with the list of cluster jobs.
Getting a list of jobs
- Open the folder dashboard.
- Navigate to Yandex Data Processing.
- Click the name of your cluster and select the Jobs tab.
If you do not have the Yandex Cloud CLI yet, install and initialize it.
The folder used by default is the one specified when creating the CLI profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also specify a different folder for any command using --folder-name or --folder-id. If you access a resource by its name, the search will be limited to the default folder. If you access a resource by its ID, the search will be global, i.e., through all folders based on access permissions.
To get a list of jobs, run the following command:
yc dataproc job list --cluster-name=<cluster_name>
You can get the cluster ID and name with the list of clusters in the folder.
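Since a cluster name is only looked up in the default folder, a script that works across folders may need to pass --folder-id explicitly. A minimal sketch, with a hypothetical function name list_jobs:

```shell
# Hypothetical wrapper: list jobs in a cluster, optionally in a non-default folder.
# Arguments: cluster name, optional folder ID.
list_jobs() {
  if [ -n "${2:-}" ]; then
    yc dataproc job list --cluster-name="$1" --folder-id="$2"
  else
    yc dataproc job list --cluster-name="$1"
  fi
}

# Usage:
#   list_jobs <cluster_name>
#   list_jobs <cluster_name> <folder_ID>
```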
Call the list API method, providing the cluster ID in the clusterId request parameter.
You can get the cluster ID with the list of clusters in the folder.
Getting general info about a job
- Open the folder dashboard.
- Navigate to Yandex Data Processing.
- Click the name of your cluster and select the Jobs tab.
- Click the job name.
If you do not have the Yandex Cloud CLI yet, install and initialize it.
The folder used by default is the one specified when creating the CLI profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also specify a different folder for any command using --folder-name or --folder-id. If you access a resource by its name, the search will be limited to the default folder. If you access a resource by its ID, the search will be global, i.e., through all folders based on access permissions.
To get general info about a job, run this command:
yc dataproc job get \
--cluster-name=<cluster_name> \
--name=<job_name>
You can get the cluster ID and name with the list of clusters in the folder.
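The get command can also be used to wait for a job to finish. The sketch below is not an official recipe: it assumes that the command output contains a line of the form status: <STATUS>, and the function names and the polling interval are hypothetical.

```shell
# Hypothetical helper: print the job status.
# Assumption: the output of "yc dataproc job get" includes a "status: <STATUS>" line.
job_status() {
  yc dataproc job get --cluster-name="$1" --name="$2" |
    awk '$1 == "status:" { print $2; exit }'
}

# Hypothetical helper: poll until the job reaches ERROR, DONE, or CANCELLED.
# Arguments: cluster name, job name, optional polling interval in seconds.
wait_for_job() {
  while :; do
    status=$(job_status "$1" "$2")
    case "$status" in
      DONE|ERROR|CANCELLED) printf '%s\n' "$status"; return 0 ;;
      "") echo "could not read job status" >&2; return 1 ;;
    esac
    sleep "${3:-10}"
  done
}

# Usage: wait_for_job <cluster_name> <job_name> 30
```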
Call the get API method and provide the following in the request:
- Cluster ID in the clusterId parameter. You can get it with the list of clusters in the folder.
- Job ID in the jobId parameter. You can get it with the list of cluster jobs.
Getting job execution logs
Note
You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
- Open the folder dashboard.
- Navigate to Yandex Data Processing.
- Click the name of your cluster and select the Jobs tab.
- Click the job name.
If you do not have the Yandex Cloud CLI yet, install and initialize it.
The folder used by default is the one specified when creating the CLI profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also specify a different folder for any command using --folder-name or --folder-id. If you access a resource by its name, the search will be limited to the default folder. If you access a resource by its ID, the search will be global, i.e., through all folders based on access permissions.
To get the job execution logs, run the following command:
yc dataproc job log \
--cluster-name=<cluster_name> \
--name=<job_name>
You can get the cluster ID and name with the list of clusters in the folder.
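For longer logs, it may be convenient to save the output to a file and search it locally. A minimal sketch using only the log command shown above; the function name save_job_log and the file name are hypothetical.

```shell
# Hypothetical wrapper: save the execution log of a job to a local file.
# Arguments: cluster name, job name, output file path.
save_job_log() {
  yc dataproc job log \
    --cluster-name="$1" \
    --name="$2" > "$3"
}

# Usage:
#   save_job_log <cluster_name> <job_name> job.log
#   grep -n -i 'error' job.log    # find lines that mention errors
```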
Call the listLog API method and provide the following in the request:
- Cluster ID in the clusterId parameter. You can get it with the list of clusters in the folder.
- Job ID in the jobId parameter. You can get it with the list of cluster jobs.