
Working with MapReduce jobs

Written by Yandex Cloud
Updated on April 23, 2025
  • Getting started
  • Create a MapReduce job
  • Delete the resources you created

MapReduce is a parallel processing tool for large datasets (tens of terabytes or more) on clusters in the Hadoop ecosystem. It allows you to work with data in different formats. Job input and output are stored in Yandex Object Storage. MapReduce relies on a number of libraries, and Apache Bigtop determines the paths to them.

In this article, we use a simple example to show how MapReduce works in Yandex Data Processing: we will use MapReduce to compute the total population of the world's 500 largest cities from a dataset of city records.

To run MapReduce on Hadoop, we use the Streaming interface: the data preprocessing (map) and final computation (reduce) stages are implemented as programs that read data from standard input (stdin) and write their results to standard output (stdout).
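
For illustration only (this skeleton is not one of the tutorial files): in the general Streaming model, the mapper writes tab-separated key/value lines to stdout, Hadoop sorts them by key, and the reducer aggregates the grouped lines it receives on stdin. A minimal word-count-style sketch of that contract:

streaming_skeleton.py

    #!/usr/bin/python
    # Illustrative sketch of a Streaming mapper/reducer pair. Run the mapper as
    # `streaming_skeleton.py map` and the reducer as `streaming_skeleton.py reduce`.
    import sys
    
    def mapper():
        # Emit a <word, 1> pair for every word arriving on stdin.
        for line in sys.stdin:
            for word in line.split():
                print(word + '\t1')
    
    def reducer():
        # Aggregate the key-sorted pairs arriving on stdin into per-word totals.
        totals = {}
        for line in sys.stdin:
            word, count = line.rstrip('\n').split('\t')
            totals[word] = totals.get(word, 0) + int(count)
        for word, total in totals.items():
            print(word + '\t' + str(total))
    
    if __name__ == '__main__':
        mapper() if sys.argv[1:] == ['map'] else reducer()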

Getting started

  1. Create a service account with the dataproc.agent and dataproc.provisioner roles.

  2. In Object Storage, create buckets and configure access to them (for a scripted alternative, see the sketch after this list):

    1. Create a bucket for the input data and grant the cluster service account the READ permission for it.
    2. Create a bucket for the processing output and grant the cluster service account the READ and WRITE permissions for it.
  3. Create a Yandex Data Processing cluster with the following settings:

    • Environment: PRODUCTION
    • Services:
      • HDFS
      • MAPREDUCE
      • YARN
    • Service account: Select the service account you previously created.
    • Bucket name: Select a bucket to hold the processing results.
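
If you prefer to script the bucket steps rather than use the management console, the sketch below creates both buckets through the Object Storage S3-compatible API using boto3. The bucket names and the static access key are placeholders; granting the cluster service account the permissions listed above is still configured separately, for example via bucket ACLs.

create_buckets.py

    #!/usr/bin/python
    # A minimal sketch, assuming you have issued a static access key for an
    # account that is allowed to create buckets. Bucket names are placeholders.
    import boto3
    
    s3 = boto3.client(
        's3',
        endpoint_url='https://storage.yandexcloud.net',  # Object Storage S3 API
        aws_access_key_id='<static_key_id>',
        aws_secret_access_key='<static_key_secret>',
    )
    
    s3.create_bucket(Bucket='<input_data_bucket_name>')  # bucket for the input data
    s3.create_bucket(Bucket='<output_bucket_name>')      # bucket for the processing output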

Create a MapReduce job

  1. Download an archived CSV file with the city dataset and upload it to the input data bucket.

  2. Upload two Python files to the input data bucket: mapper.py with the code for data preprocessing (the map stage) and reducer.py with the code for the final computation (the reduce stage). You can smoke-test both scripts locally before creating the job; see the sketch after this list:

    mapper.py

    #!/usr/bin/python
    import sys
    
    # Sum the population column (the 15th tab-separated field) over all input lines.
    population = sum(int(line.split('\t')[14]) for line in sys.stdin)
    print(population)
    

    reducer.py

    #!/usr/bin/python
    import sys
    
    # Sum the per-mapper totals received on stdin to get the grand total.
    population = sum(int(value) for value in sys.stdin)
    print(population)
    
  3. Create a MapReduce job with the following parameters:

    • Main class: org.apache.hadoop.streaming.HadoopStreaming
    • Arguments:
      • -mapper
      • mapper.py
      • -reducer
      • reducer.py
      • -numReduceTasks
      • 1
      • -input
      • s3a://<input_data_bucket_name>/cities500.txt
      • -output
      • s3a://<output_bucket_name>/<output_directory>
    • Files:
      • s3a://<input_data_bucket_name>/mapper.py
      • s3a://<input_data_bucket_name>/reducer.py
    • Properties:
      • mapreduce.job.maps: 6
      • yarn.app.mapreduce.am.resource.mb: 2048
      • yarn.app.mapreduce.am.command-opts: -Xmx2048m
  4. Wait for the job status to change to Done.

  5. Download the file with the result from the bucket and review it:

    part-00000

    3157107417
    

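As mentioned in step 2, you can smoke-test mapper.py and reducer.py locally before creating the job by piping the dataset through them, which imitates what the Streaming interface does on the cluster. A minimal sketch, assuming cities500.txt, mapper.py, and reducer.py are in the current directory:

local_smoke_test.py

    #!/usr/bin/python
    # A minimal local check of the pipeline: cities500.txt -> mapper.py -> reducer.py.
    import subprocess
    
    with open('cities500.txt', 'rb') as data:
        # Map stage: read the dataset from stdin, write this split's sum to stdout.
        map_out = subprocess.run(
            ['python', 'mapper.py'], stdin=data, capture_output=True, check=True
        ).stdout
    
    # Reduce stage: sum the mapper outputs arriving on stdin.
    reduce_out = subprocess.run(
        ['python', 'reducer.py'], input=map_out, capture_output=True, check=True
    ).stdout
    
    print(reduce_out.decode().strip())  # should match the value in part-00000
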
Note

You can view the job logs and search them using Yandex Cloud Logging. For more information, see Working with logs.

Delete the resources you created

Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:

  1. Delete the cluster.
  2. Delete the buckets (for a scripted option, see the sketch after this list).
  3. Delete the service account.
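
As with bucket creation, bucket cleanup can be scripted against the S3-compatible API. A minimal sketch with boto3 (bucket names and the static access key are placeholders); note that a bucket must be emptied before it can be deleted:

delete_buckets.py

    #!/usr/bin/python
    # A minimal sketch: empty and delete both buckets via the S3-compatible API.
    # list_objects_v2 returns at most 1,000 keys per call; use a paginator for more.
    import boto3
    
    s3 = boto3.client(
        's3',
        endpoint_url='https://storage.yandexcloud.net',
        aws_access_key_id='<static_key_id>',
        aws_secret_access_key='<static_key_secret>',
    )
    
    for bucket in ('<input_data_bucket_name>', '<output_bucket_name>'):
        # Delete every object in the bucket, then the bucket itself.
        for obj in s3.list_objects_v2(Bucket=bucket).get('Contents', []):
            s3.delete_object(Bucket=bucket, Key=obj['Key'])
        s3.delete_bucket(Bucket=bucket)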
