Yandex project
© 2025 Yandex.Cloud LLC

Working with Hive jobs

Written by
Yandex Cloud
Updated on April 23, 2025

Apache Hive is a system built on Hadoop for managing large distributed datasets, queried in an SQL-like language (HiveQL).

This article uses a simple example to show how to use Hive in Yandex Data Processing for data analysis: computing summary statistics on the population of the largest cities in Russia.

Getting started

  1. Create a service account with the dataproc.agent and dataproc.provisioner roles.

  2. In Object Storage, create buckets and configure access to them:

    1. Create a bucket for the input data and grant the READ permission for this bucket to the cluster service account.
    2. Create a bucket for the processing output and grant the cluster service account READ and WRITE permissions for this bucket.
  3. Create a Yandex Data Processing cluster with the following settings:

    • Environment: PRODUCTION
    • Services:
      • HDFS
      • SPARK
      • HIVE
    • Service account: Select the service account you previously created.
    • Bucket name: Select a bucket to hold the processing results.

Create a Hive job

  1. In the input data bucket, create a folder named cities and upload the cities.csv file to this folder for processing:

    cities.csv
    Moscow,12655000
    Saint Petersburg,5384000
    Novosibirsk,1620000
    Yekaterinburg,1495000
    Kazan,1257000
    Nizhny Novgorod,1244000
    Chelyabinsk,1188000
    Samara,1145000
    Omsk,1140000
    Rostov-on-Don,1138000
    Ufa,1126000
    Krasnoyarsk,1093000
    Voronezh,1051000
    Perm,1049000
    Volgograd,1005000
    

    The file lists Russia's cities with over a million inhabitants and their populations, based on rounded 2021 estimates.

  2. Create a file with SQL queries named cities.sql and upload it to the input data bucket:

    cities.sql
    /* Create an external table with the data from the CSV files: */
    CREATE EXTERNAL TABLE IF NOT EXISTS cities
    (city_name string, population decimal)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3a://<input_data_bucket_name>/cities';
    
    /* Show the number of cities and their total population: */
    SELECT COUNT(*) num_cities, SUM(population) sum_population FROM cities;
    
    /* Show the minimum and maximum number of inhabitants: */
    SELECT MIN(population) min_population, MAX(population) max_population FROM cities;
    
  3. Create a Hive job with the following parameters:

    • Driver: File
    • Query file uri: s3a://<input_data_bucket_name>/cities.sql
  4. Wait for the job status to change to Done.

  5. Open the job logs and view the processing results:

    Logs
    ...
    OK
    15  33590000
    Time taken: 21.104 seconds, Fetched: 1 row(s)
    ...
    OK
    1005000 12655000
    Time taken: 3.393 seconds, Fetched: 1 row(s)
    

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
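
The aggregates reported in the job logs can be cross-checked locally before or after running the job. The sketch below is an illustration only, not part of the Yandex Data Processing workflow: it loads the same rows as cities.csv into an in-memory SQLite table and runs queries equivalent to the two SELECT statements in cities.sql. SQLite stands in for Hive here, so column types differ (INTEGER instead of decimal), and the names CITIES_CSV and summarize are hypothetical.

```python
import csv
import io
import sqlite3

# Same rows as cities.csv in this tutorial (no header line).
CITIES_CSV = """\
Moscow,12655000
Saint Petersburg,5384000
Novosibirsk,1620000
Yekaterinburg,1495000
Kazan,1257000
Nizhny Novgorod,1244000
Chelyabinsk,1188000
Samara,1145000
Omsk,1140000
Rostov-on-Don,1138000
Ufa,1126000
Krasnoyarsk,1093000
Voronezh,1051000
Perm,1049000
Volgograd,1005000
"""


def summarize(csv_text: str) -> dict:
    """Run the tutorial's two aggregate queries against a local SQLite copy.

    Assumption: SQLite is only a local stand-in for Hive; the queries use
    plain SQL that behaves the same way for this data.
    """
    # Parse "city,population" rows, skipping any empty lines.
    rows = [
        (row[0], int(row[1]))
        for row in csv.reader(io.StringIO(csv_text))
        if row
    ]

    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE cities (city_name TEXT, population INTEGER)")
        conn.executemany("INSERT INTO cities VALUES (?, ?)", rows)

        # Number of cities and their total population.
        num_cities, sum_population = conn.execute(
            "SELECT COUNT(*), SUM(population) FROM cities"
        ).fetchone()

        # Minimum and maximum number of inhabitants.
        min_population, max_population = conn.execute(
            "SELECT MIN(population), MAX(population) FROM cities"
        ).fetchone()
    finally:
        conn.close()

    return {
        "num_cities": num_cities,
        "sum_population": sum_population,
        "min_population": min_population,
        "max_population": max_population,
    }


if __name__ == "__main__":
    for key, value in summarize(CITIES_CSV).items():
        print(f"{key}: {value}")
```

For the data above, the four values should match the two result rows in the job logs: 15 and 33590000 for the first query, 1005000 and 12655000 for the second.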

Delete the resources you created

Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:

  1. Delete the cluster.
  2. Delete the buckets.
  3. Delete the service account.
