Working with MapReduce jobs
MapReduce
In this article, we use a simple example to show how MapReduce works in Yandex Data Processing. We will use MapReduce to compute the population of the world's 500 largest cities based on a set of data on the cities.
To run MapReduce on Hadoop, we use the Streaming interface. At the same time, the data preprocessing (map) and the final output computation (reduce) stages use programs that read data from a standard program input (stdin
) and write their output to a standard output (stdout
).
Getting started
-
Create a service account with the
dataproc.agent
anddataproc.provisioner
roles. -
In Object Storage, create buckets and configure access to them:
- Create a bucket for the input data and grant the
READ
permission for this bucket to the cluster service account. - Create a bucket for the processing output and grant the cluster service account
READ and WRITE
permissions for this bucket.
- Create a bucket for the input data and grant the
-
Create a Yandex Data Processing cluster with the following settings:
- Services:
HDFS
MAPREDUCE
YARN
- Service account: Select the service account you previously created.
- Bucket name: Select a bucket to hold the processing results.
- Services:
Create a MapReduce job
-
Download
an archived CSV file with a dataset on the cities and upload it to the input data bucket. -
Upload Python files to the input data bucket:
mapper.py
, which contains the code for data preprocessing (map stage), andreducer.py
, which contains the code for the final computations (reduce stage):mapper.py
#!/usr/bin/python import sys population = sum(int(line.split('\t')[14]) for line in sys.stdin) print(population)
reducer.py
#!/usr/bin/python import sys population = sum(int(value) for value in sys.stdin) print(population)
-
Create a MapReduce job with the following parameters:
- Main class:
org.apache.hadoop.streaming.HadoopStreaming
- Arguments:
-mapper
mapper.py
-reducer
reducer.py
-numReduceTasks
1
-input
s3a://<input_data_bucket_name>/cities500.txt
-output
s3a://<output_bucket_name>/<output_directory>
- Files:
s3a://<input_data_bucket_name>/mapper.py
s3a://<input_data_bucket_name>/reducer.py
- Properties:
mapreduce.job.maps: 6
yarn.app.mapreduce.am.resource.mb: 2048
yarn.app.mapreduce.am.command-opts: -Xmx2048m
- Main class:
-
Wait for the job status to change to
Done
. -
Download from the bucket and review the file with the result from the bucket:
part-00000
3157107417
Note
You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
Delete the resources you created
Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need: