Running Apache Hive jobs
Hive
You can run Hive jobs from the Yandex Cloud CLI and directly on the server using the Hive CLI.
Getting started
-
Create a service account with the
dataproc.agent
anddataproc.provisioner
roles. -
In Object Storage, create buckets and configure access to them:
- Create a bucket for the input data and grant the
READ
permission for this bucket to the cluster service account. - Create a bucket for the processing output and grant the cluster service account
READ and WRITE
permissions for this bucket.
- Create a bucket for the input data and grant the
-
Create a Yandex Data Processing cluster with the following settings:
- Services:
HDFS
SPARK
HIVE
- Service account: Select the service account you previously created.
- Bucket name: Select a bucket to hold the processing results.
- Public access: Select this option to grant access to hosts in all subclusters.
- Services:
Running jobs using the Yandex Cloud CLI
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
Jobs are run using the Yandex Cloud CLI through the Yandex Data Processing agent installed on the cluster master host. The agent gets job parameters through the Yandex Data Processing API.
The executable file and its dependencies must be located in a storage accessible to the Yandex Data Processing cluster service account. The executed application itself must have access to the storages in which the source data set and execution results are saved.
The calculation result is saved in the Yandex Object Storage bucket along with the service output.
There are two ways to send an SQL query to Hive:
-
In the run job command:
yc dataproc job create-hive \ --cluster-id=<cluster_ID> \ --name=<job_name> \ --query-list="SELECT Month, COUNT(*) FROM flights GROUP BY Month;"
-
In the Object Storage object that the Yandex Data Processing cluster service account has read access to:
yc dataproc job create-hive \ --cluster-id=<cluster_ID> \ --name=<job_name> \ --query-file-uri="s3a://<bucket>/hive-query.sql"
You can find the query execution results and additional diagnostic information in the Object Storage bucket that you specified when creating the cluster: s3://<bucket>/dataproc/clusters/<cluster_ID>/jobs/<job_ID>/
.
The job ID is contained in the YC CLI job execution command output and in the API response to the job execution.
Note
You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
Running jobs using the Hive CLI
-
Connect to the master host over SSH and run the
hive
command. -
Test Hive by running the
select 1;
command. The correct result looks like this:OK 1 Time taken: 0.077 seconds, Fetched: 1 row(s)
-
Create an external table for the data from the example in parquet format. The table will contain a list of flights between US cities in 2018. Run the following query in the Hive CLI:
CREATE EXTERNAL TABLE flights (Year bigint, Month bigint, FlightDate string, Flight_Number_Reporting_Airline bigint, OriginAirportID bigint, DestAirportID bigint) STORED AS PARQUET LOCATION 's3a://yc-mdb-examples/dataproc/example01/set01';
-
Check the list of tables:
show tables;
-
The list of tables should look like this:
OK flights Time taken: 0.043 seconds, Fetched: 1 row(s)
-
Request the number of flights by month:
SELECT Month, COUNT(*) FROM flights GROUP BY Month;
-
Example of the query results:
Query ID = root_20200119195338_28049b67-4de9-4568-a4c4-3bbe******** Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_157925157****_****) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 6 6 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 11.52 s ---------------------------------------------------------------------------------------------- OK 1 570118 2 520731 3 611987 4 596046 5 616529 6 626193 7 645299 8 644673 9 585749 10 616101 11 586178 12 593842 Time taken: 12.137 seconds, Fetched: 12 row(s)
Note
You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.