Working with MapReduce jobs

Updated on July 16, 2026

MapReduce is a tool for parallel processing of large datasets (a few dozen TB or more) on clusters in the Hadoop ecosystem. It can work with data in different formats. Job input and output are stored in Yandex Object Storage. MapReduce relies on a number of libraries, and Apache Bigtop determines the paths to them.

In this article, we use a simple example to show how MapReduce works in Yandex Data Processing. We will use MapReduce to compute the total population of the world's 500 largest cities based on the cities dataset.

To run MapReduce on Hadoop, we use the Streaming interface. With this interface, the data preprocessing (map) and final output computation (reduce) stages are ordinary programs that read their data from standard input (stdin) and write their results to standard output (stdout).
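The stdin/stdout contract can be tried out locally before any cluster exists. The following is a minimal sketch, not the Hadoop implementation: it assumes the input has already been cut into splits by hand, and the helper names `map_stage`, `reduce_stage`, `run_stage`, and `make_row` are hypothetical. Each stage receives text lines as if from stdin and returns the lines it would write to stdout.

```python
import io


def map_stage(stdin):
    """Map: turn one input split into a single partial sum (field 14 holds the population)."""
    return [str(sum(int(line.split('\t')[14]) for line in stdin))]


def reduce_stage(stdin):
    """Reduce: add up the partial sums produced by all mappers."""
    return [str(sum(int(value) for value in stdin))]


def run_stage(stage, lines):
    """Feed lines to a stage through a file-like object, the way Streaming feeds stdin."""
    return stage(io.StringIO("".join(line + "\n" for line in lines)))


def make_row(population):
    """Build a made-up 19-field tab-separated record with the population in field 14."""
    fields = ["x"] * 19
    fields[14] = str(population)
    return "\t".join(fields)


# Two hand-made input splits, each handled by its own mapper run.
splits = [[make_row(100), make_row(250)], [make_row(50)]]
partial_sums = [out for split in splits for out in run_stage(map_stage, split)]
total = run_stage(reduce_stage, partial_sums)
print(partial_sums, total)  # ['350', '50'] ['400']
```

Each mapper run sees only its own split, so the single reducer is what combines the partial sums into one number.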

To execute a MapReduce job:

1. Set up your infrastructure.
2. Create a MapReduce job.

If you no longer need the resources you created, delete them.

Getting started

Navigate to the management console and log in to Yandex Cloud or create a new account.
On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one and link a cloud to it.

If you have an active billing account, you can create or select a folder for your infrastructure on the cloud page.

Learn more about clouds and folders here.

Required paid resources

Yandex Data Processing cluster: use of computing resources with a Yandex Data Processing markup, use of network drives, retrieval and storage of logs, amount of outgoing traffic (see Yandex Data Processing pricing).
Public IP addresses if public access is enabled for cluster hosts (see Yandex Virtual Private Cloud pricing).
Yandex Object Storage buckets: use of storage, data operations (see Object Storage pricing).

Set up your infrastructure

Create a service account with the dataproc.agent and dataproc.provisioner roles.
In Object Storage, create buckets and configure access to them:
1. Create a bucket for the input data and grant the READ permission for this bucket to the cluster service account.
2. Create a bucket for the processing output and grant the cluster service account READ and WRITE permissions for this bucket.
Create a Yandex Data Processing cluster with the following settings:
- Environment: PRODUCTION.
- Services:
  - HDFS
  - MAPREDUCE
  - YARN
- Service account: Select the service account you created earlier.
- Bucket name: Select a bucket for the processing results.

Create a MapReduce job

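Download the archive with the cities dataset, extract the data file from it, and upload the file to the input data bucket as cities500.txt, which is the object name the job input refers to. The file is tab-separated: mapper.py splits each line on the tab character.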

Upload Python files to the input data bucket: mapper.py, which contains the code for data preprocessing (map stage), and reducer.py, which contains the code for the final computations (reduce stage):

mapper.py

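#!/usr/bin/python
import sys

# Each input line is a tab-separated city record; field 14 (zero-based) is read as the population.
# The mapper prints one partial sum for the part of the input it receives on stdin.
population = sum(int(line.split('\t')[14]) for line in sys.stdin)
print(population)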

reducer.py

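#!/usr/bin/python
import sys

# Each input line is one partial sum printed by a mapper.
# The reducer adds them up and prints the total population.
population = sum(int(value) for value in sys.stdin)
print(population)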

Create a MapReduce job with the following parameters:
- Main class: org.apache.hadoop.streaming.HadoopStreaming
- Arguments:
  - -mapper
  - mapper.py
  - -reducer
  - reducer.py
  - -numReduceTasks
  - 1
  - -input
  - s3a://<input_data_bucket_name>/cities500.txt
  - -output
  - s3a://<output_bucket_name>/<output_directory>
- Files:
  - s3a://<input_data_bucket_name>/mapper.py
  - s3a://<input_data_bucket_name>/reducer.py
- Properties:
  - mapreduce.job.maps: 6
  - yarn.app.mapreduce.am.resource.mb: 2048
  - yarn.app.mapreduce.am.command-opts: -Xmx2048m
Wait for the job status to change to Done.
Download the file with the result from the bucket and review it:

part-00000
```
3157107417
```

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
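If you are used to launching Hadoop Streaming from a shell, the job parameters in this section can be read roughly as a single command line. The sketch below only assembles and prints such a command; it does not submit anything to a cluster. The jar path `hadoop-streaming.jar` and the function name `streaming_command` are placeholders (the real jar location depends on the cluster image), and the bucket names are left as in the parameters. Generic Hadoop options (`-D`, `-files`) are placed before the Streaming options.

```python
import shlex


def streaming_command(jar, files, properties, args):
    """Assemble a `hadoop jar` command line for a Streaming job (nothing is executed)."""
    command = ["hadoop", "jar", jar]
    # Generic options go first: -D key=value pairs, then the files shipped to the tasks.
    for key, value in properties.items():
        command += ["-D", "{}={}".format(key, value)]
    command += ["-files", ",".join(files)]
    # Streaming options follow, in the same order as the job arguments.
    command += args
    return command


command = streaming_command(
    jar="hadoop-streaming.jar",  # placeholder path, an assumption
    files=[
        "s3a://<input_data_bucket_name>/mapper.py",
        "s3a://<input_data_bucket_name>/reducer.py",
    ],
    properties={
        "mapreduce.job.maps": "6",
        "yarn.app.mapreduce.am.resource.mb": "2048",
        "yarn.app.mapreduce.am.command-opts": "-Xmx2048m",
    },
    args=[
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-numReduceTasks", "1",
        "-input", "s3a://<input_data_bucket_name>/cities500.txt",
        "-output", "s3a://<output_bucket_name>/<output_directory>",
    ],
)

# Quote each part so that placeholders containing < and > survive a copy into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

Keeping the command as a list of separate arguments mirrors how the Arguments parameter is filled in: one list item per token, with each option name and its value on separate lines.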

Delete the resources you created

Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:

- The Yandex Data Processing cluster.
- The Object Storage buckets.
- The service account.
