Working with Hive jobs

Written by

Updated at April 23, 2025

Getting started
Create a Hive job
Delete the resources you created

Apache Hive is a system for managing large distributed datasets that is built on Hadoop and uses the SQL query language.

In this article, a simple example demonstrates how Hive is used in Yandex Data Processing for data analysis. In the example below, we use Hive to analyze the population of the largest cities in Russia.

Getting started

Create a service account with the dataproc.agent and dataproc.provisioner roles.
In Object Storage, create buckets and configure access to them:
1. Create a bucket for the input data and grant the READ permission for this bucket to the cluster service account.
2. Create a bucket for the processing output and grant the cluster service account READ and WRITE permissions for this bucket.
Create a Yandex Data Processing cluster with the following settings:
- Environment: PRODUCTION
- Services:
  - HDFS
  - SPARK
  - HIVE
- Service account: Select the service account you previously created.
- Bucket name: Select a bucket to hold the processing results.

Create a Hive job

In the input data bucket, create a folder named cities and upload the cities.csv file to this folder for processing:

cities.csv

Moscow,12655000
Saint Petersburg,5384000
Novosibirsk,1620000
Yekaterinburg,1495000
Kazan,1257000
Nizhny Novgorod,1244000
Chelyabinsk,1188000
Samara,1145000
Omsk,1140000
Rostov-on-Don,1138000
Ufa,1126000
Krasnoyarsk,1093000
Voronezh,1051000
Perm,1049000
Volgograd,1005000

The file shows the population of Russia's largest cities with over a million inhabitants based on 2021 estimates (rounded).

Create a file with SQL queries named cities.sql and upload it to the input data bucket:

cities.sql

/* Create an external table with the data from the CSV files: */
CREATE EXTERNAL TABLE IF NOT EXISTS cities
(city_name string, population decimal)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://<input_data_bucket>/cities';

/* Show the number of cities and their total population: */
SELECT COUNT(*) num_cities, SUM(population) sum_populataion FROM cities;

/* Show the minimum and maximum number of inhabitants: */
SELECT MIN(population) min_population, MAX(population) max_population FROM cities;

Create a Hive job with the following parameters:
- Driver: File
- Query file uri: s3a://<input_data_bucket_name>/cities.sql
Wait for the job status to change to Done.

Open the job logs and view the processing results:

Logs

...
OK
15  33590000
Time taken: 21.104 seconds, Fetched: 1 row(s)
...
OK
1005000 12655000
Time taken: 3.393 seconds, Fetched: 1 row(s)

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.

Delete the resources you created

Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:

Working with Hive jobs

Getting startedGetting started

Create a Hive jobCreate a Hive job

Delete the resources you createdDelete the resources you created

Was the article helpful?

Getting started

Create a Hive job

Delete the resources you created