Working with Hive jobs
Apache Hive
In this article, a simple example demonstrates how Hive is used in Yandex Data Processing for data analysis. In the example below, we use Hive to analyze the population of the largest cities in Russia.
Getting started
-
Create a service account with the
dataproc.agent
anddataproc.provisioner
roles. -
In Object Storage, create buckets and configure access to them:
- Create a bucket for the input data and grant the
READ
permission for this bucket to the cluster service account. - Create a bucket for the processing output and grant the cluster service account
READ and WRITE
permissions for this bucket.
- Create a bucket for the input data and grant the
-
Create a Yandex Data Processing cluster with the following settings:
- Services:
HDFS
SPARK
HIVE
- Service account: Select the service account you previously created.
- Bucket name: Select a bucket to hold the processing results.
- Services:
Create a Hive job
-
In the input data bucket, create a folder named
cities
and upload thecities.csv
file to this folder for processing:cities.csv
Moscow,12655000 Saint Petersburg,5384000 Novosibirsk,1620000 Yekaterinburg,1495000 Kazan,1257000 Nizhny Novgorod,1244000 Chelyabinsk,1188000 Samara,1145000 Omsk,1140000 Rostov-on-Don,1138000 Ufa,1126000 Krasnoyarsk,1093000 Voronezh,1051000 Perm,1049000 Volgograd,1005000
The file shows the population of Russia's largest cities with over a million inhabitants based on 2021 estimates (rounded).
-
Create a file with SQL queries named
cities.sql
and upload it to the input data bucket:cities.sql
/* Create an external table with the data from the CSV files: */ CREATE EXTERNAL TABLE IF NOT EXISTS cities (city_name string, population decimal) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 's3a://<input_data_bucket>/cities'; /* Show the number of cities and their total population: */ SELECT COUNT(*) num_cities, SUM(population) sum_populataion FROM cities; /* Show the minimum and maximum number of inhabitants: */ SELECT MIN(population) min_population, MAX(population) max_population FROM cities;
-
Create a Hive job with the following parameters:
- Driver:
File
- Query file uri:
s3a://<input_data_bucket_name>/cities.sql
- Driver:
-
Wait for the job status to change to
Done
. -
Open the job logs and view the processing results:
Logs
... OK 15 33590000 Time taken: 21.104 seconds, Fetched: 1 row(s) ... OK 1005000 12655000 Time taken: 3.393 seconds, Fetched: 1 row(s)
Note
You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
Delete the resources you created
Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need: