
Managing MapReduce jobs

Written by
Yandex Cloud
Updated at May 5, 2025
  • Create a job
  • Cancel a job
  • Get a list of jobs
  • Get general information about the job
  • Get job execution logs

MapReduce is a tool for parallel processing of large datasets (tens of terabytes or more) on clusters in the Hadoop ecosystem. It can work with data in a variety of formats. Job input and output are stored in Yandex Object Storage. MapReduce uses a set of libraries, with Apache Bigtop determining the paths to them.

Create a job

Management console
  1. Go to the folder page and select Yandex Data Processing.

  2. Click the cluster name and open the Jobs tab.

  3. Click Submit job.

  4. (Optional) Enter a name for the job.

  5. In the Job type field, select Mapreduce.

  6. Select one of the driver types and specify what will be used to start the job:

    • Main class name.

    • Path to the main JAR file in the following format:

      File location                       Path format
      Instance file system                file:///<path_to_file>
      Distributed cluster file system     hdfs:///<path_to_file>
      Object Storage bucket               s3a://<bucket_name>/<path_to_file>
      Internet                            http://<path_to_file> or https://<path_to_file>

      Archives in standard Linux formats (zip, gz, xz, bz2, and others) are supported.

      The cluster service account needs read access to all the files in the bucket. Step-by-step guides on how to set up access to Object Storage are provided in Editing a bucket ACL.

  7. Specify job arguments.

    If an argument, variable, or property consists of several space-separated parts, specify each part separately while preserving the order in which the arguments, variables, and properties are declared.

    The -mapper mapper.py argument, for instance, must be converted into two arguments, -mapper and mapper.py, in that order.

  8. (Optional) Specify the paths to the additional JAR files, if any.

  9. (Optional) Configure advanced settings:

    • Specify paths to the required files and archives.
    • In the Properties field, specify component properties as key-value pairs.
  10. Click Submit job.

CLI

If you do not have the Yandex Cloud CLI yet, install and initialize it.

The folder specified when creating the CLI profile is used by default. To change the default folder, use the yc config set folder-id <folder_ID> command. You can specify a different folder using the --folder-name or --folder-id parameter.

To create a job:

  1. View the description of the CLI create command for Mapreduce jobs:

    yc dataproc job create-mapreduce --help
    
  2. Create a job (the example does not show all the available parameters):

    yc dataproc job create-mapreduce \
       --cluster-name=<cluster_name> \
       --name=<job_name> \
       --main-class=<main_class_name> \
       --file-uris=<path_to_file> \
       --archive-uris=<paths_to_archives> \
       --properties=<component_properties> \
       --args=<argument>
    

    Where --properties specifies component properties as key-value pairs.

    Provide the paths to the files required for the job in the following format:

    File location                       Path format
    Instance file system                file:///<path_to_file>
    Distributed cluster file system     hdfs:///<path_to_file>
    Object Storage bucket               s3a://<bucket_name>/<path_to_file>
    Internet                            http://<path_to_file> or https://<path_to_file>

    Archives in standard Linux formats (zip, gz, xz, bz2, and others) are supported.

    The cluster service account needs read access to all the files in the bucket. Step-by-step guides on how to set up access to Object Storage are provided in Editing a bucket ACL.

You can get the cluster ID and name with the list of clusters in the folder.
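
As a worked sketch of how split arguments are passed (see the rule about space-separated parts above): the cluster name, bucket, and file names below are placeholders, and using org.apache.hadoop.streaming.HadoopStreaming as the main class is an illustrative assumption, not part of this guide. Repeated flags are passed in declaration order:

    # Illustrative only: submit a streaming MapReduce job. Each part of
    # "-mapper mapper.py" becomes its own --args flag, in the same order.
    yc dataproc job create-mapreduce \
       --cluster-name=my-dataproc-cluster \
       --name=word-count \
       --main-class=org.apache.hadoop.streaming.HadoopStreaming \
       --file-uris=s3a://my-bucket/mapper.py \
       --file-uris=s3a://my-bucket/reducer.py \
       --args=-mapper --args=mapper.py \
       --args=-reducer --args=reducer.py \
       --args=-input --args=s3a://my-bucket/input \
       --args=-output --args=s3a://my-bucket/output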

API

Use the create API method and include the following information in the request:

  • Cluster ID in the clusterId parameter.
  • Job name in the name parameter.
  • Job properties in the mapreduceJob parameter.

You can get the cluster ID with the list of clusters in the folder.
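
For reference, a minimal sketch of the underlying REST call. The endpoint path and body fields follow the public API reference for the Job service; the IAM token, cluster ID, and job settings below are placeholders, so verify them against the current reference:

    # Sketch: create a MapReduce job via the REST API.
    # ${IAM_TOKEN} and <cluster_ID> are placeholders.
    curl --request POST \
      --header "Authorization: Bearer ${IAM_TOKEN}" \
      --header "Content-Type: application/json" \
      --data '{
        "name": "word-count",
        "mapreduceJob": {
          "mainClass": "org.apache.hadoop.streaming.HadoopStreaming",
          "args": ["-mapper", "mapper.py", "-reducer", "reducer.py"]
        }
      }' \
      "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs"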

Cancel a job

Note

You cannot cancel jobs with the ERROR, DONE, or CANCELLED status. To find out a job's status, retrieve a list of jobs in the cluster.

Management console
  1. Go to the folder page and select Yandex Data Processing.
  2. Click the cluster name and open the Jobs tab.
  3. Click the job name.
  4. Click Cancel in the top-right corner of the page.
  5. In the window that opens, select Cancel.

CLI

If you do not have the Yandex Cloud CLI yet, install and initialize it.

The folder specified when creating the CLI profile is used by default. To change the default folder, use the yc config set folder-id <folder_ID> command. You can specify a different folder using the --folder-name or --folder-id parameter.

To cancel a job, run the command below:

yc dataproc job cancel <job_name_or_ID> \
  --cluster-name=<cluster_name>

You can get the job name or ID with the list of cluster jobs, and the cluster name, with the list of folder clusters.

API

Use the API cancel method and include the following in the request:

  • Cluster ID in the clusterId parameter.
  • Job ID in the jobId parameter.

You can get the cluster ID with the list of folder clusters, and the job ID, with the list of cluster jobs.
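
As a sketch, the same cancellation as a REST call; the :cancel suffix is assumed from the pattern in the public API reference, so verify before use:

    # Sketch: cancel a job via the REST API; placeholders as above.
    curl --request POST \
      --header "Authorization: Bearer ${IAM_TOKEN}" \
      "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs/<job_ID>:cancel"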

Get a list of jobs

Management console
  1. Go to the folder page and select Yandex Data Processing.
  2. Click the cluster name and open the Jobs tab.

CLI

If you do not have the Yandex Cloud CLI yet, install and initialize it.

The folder specified when creating the CLI profile is used by default. To change the default folder, use the yc config set folder-id <folder_ID> command. You can specify a different folder using the --folder-name or --folder-id parameter.

To get a list of jobs, run the following command:

yc dataproc job list --cluster-name=<cluster_name>

You can get the cluster ID and name with a list of clusters in the folder.

API

Use the list API method and provide the cluster ID in the clusterId request parameter.

You can get the cluster ID with a list of clusters in the folder.
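
A sketch of the equivalent REST call (the endpoint follows the public API reference pattern; verify against the current reference):

    # Sketch: list all jobs in a cluster via the REST API.
    curl --header "Authorization: Bearer ${IAM_TOKEN}" \
      "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs"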

Get general information about the job

Management console
  1. Go to the folder page and select Yandex Data Processing.
  2. Click the cluster name and open the Jobs tab.
  3. Click the job name.

CLI

If you do not have the Yandex Cloud CLI yet, install and initialize it.

The folder specified when creating the CLI profile is used by default. To change the default folder, use the yc config set folder-id <folder_ID> command. You can specify a different folder using the --folder-name or --folder-id parameter.

To get general information about a job, run the following command:

yc dataproc job get \
   --cluster-name=<cluster_name> \
   --name=<job_name>

You can get the cluster ID and name with a list of clusters in the folder.

API

Use the get API method and include the following in the request:

  • Cluster ID in the clusterId parameter. You can get it together with a list of clusters in the folder.
  • Job ID in the jobId parameter. You can get it with the list of cluster jobs.
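
A sketch of the equivalent REST call (endpoint assumed from the public API reference pattern):

    # Sketch: get a single job by ID via the REST API.
    curl --header "Authorization: Bearer ${IAM_TOKEN}" \
      "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs/<job_ID>"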

Get job execution logs

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.

Management console
  1. Go to the folder page and select Yandex Data Processing.
  2. Click the cluster name and open the Jobs tab.
  3. Click the job name.

CLI

If you do not have the Yandex Cloud CLI yet, install and initialize it.

The folder specified when creating the CLI profile is used by default. To change the default folder, use the yc config set folder-id <folder_ID> command. You can specify a different folder using the --folder-name or --folder-id parameter.

To get job execution logs, run the following command:

yc dataproc job log \
   --cluster-name=<cluster_name> \
   --name=<job_name>

You can get the cluster ID and name with the list of clusters in the folder.

API

Use the API listLog method and include the following in the request:

  • Cluster ID in the clusterId parameter. You can get it with the list of clusters in the folder.
  • Job ID in the jobId parameter. You can get it with the list of cluster jobs.
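
A sketch of the equivalent REST call; the :logs suffix is an assumption based on the API reference pattern, so check it before relying on it:

    # Sketch: fetch job execution logs via the REST API.
    curl --header "Authorization: Bearer ${IAM_TOKEN}" \
      "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster_ID>/jobs/<job_ID>:logs"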

Note

To work with MapReduce, you can use the libraries from the /usr/lib/hadoop-mapreduce/ directory.
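
For example, the bundled Hadoop Streaming library can itself serve as the job's main JAR file. This is only a sketch: the --main-jar-file-uri flag is assumed to mirror the console's Path to the main JAR file field, and the exact JAR name depends on the cluster image's Hadoop version:

    # Sketch: run Hadoop Streaming from the bundled MapReduce libraries.
    # The JAR path and name may differ between image versions.
    yc dataproc job create-mapreduce \
       --cluster-name=my-dataproc-cluster \
       --name=streaming-from-bundled-jar \
       --main-jar-file-uri=file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
       --args=-mapper --args=mapper.py \
       --args=-reducer --args=reducer.py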
