Working with PySpark jobs

Written by Yandex Cloud. Updated on April 28, 2025.
  • Getting started
  • Create a PySpark job
  • Delete the resources you created

Apache Spark is a distributed processing framework for unstructured and semi-structured data and a part of the Hadoop project ecosystem.

In this section, we provide a simple example that demonstrates how to use PySpark, the Spark interface for Python, in Yandex Data Processing. In the example, we use PySpark to count the number of times each word is seen in a short text.
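
Conceptually, the job performs the same computation as the plain-Python sketch below. It is shown only to illustrate the counting logic; the sample text is taken from the first two lines of the input file used later in this tutorial.

  from collections import Counter

  text = (
      "she sells sea shells on the sea shore\n"
      "the shells that she sells are sea shells I am sure"
  )

  # Split each line into words and count how many times each word occurs.
  counts = Counter(word for line in text.splitlines() for word in line.split(" "))
  print(counts.most_common(3))  # [('sea', 3), ('shells', 3), ('she', 2)]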

Getting started

  1. Create a service account with the dataproc.agent and dataproc.provisioner roles.

  2. In Object Storage, create buckets and configure access to them (a scripted bucket-creation sketch follows this list):

    1. Create a bucket for the input data and grant the cluster service account the READ permission for this bucket.
    2. Create a bucket for the processing output and grant the cluster service account the READ and WRITE permissions for this bucket.
  3. Create a Yandex Data Processing cluster with the following settings:

    • Environment: PRODUCTION
    • Services:
      • HDFS
      • SPARK
      • YARN
    • Service account: Select the service account you previously created.
    • Bucket name: Select a bucket to hold the processing results.
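
If you prefer to script the bucket setup, Object Storage provides an S3-compatible API, so the buckets can be created with boto3. This is a minimal sketch, not the only way to do it: it assumes you have already issued a static access key, the bucket names are placeholders, and the access permissions from step 2 still need to be granted as described above.

  import boto3

  # Object Storage is S3-compatible; storage.yandexcloud.net is its endpoint.
  # The key pair is a static access key created beforehand (placeholder values).
  s3 = boto3.client(
      "s3",
      endpoint_url="https://storage.yandexcloud.net",
      aws_access_key_id="<static_key_ID>",
      aws_secret_access_key="<secret_key>",
  )

  s3.create_bucket(Bucket="<input_data_bucket_name>")  # bucket for the input data
  s3.create_bucket(Bucket="<output_bucket_name>")      # bucket for the processing output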

Create a PySpark job

  1. Upload a file for processing:

    1. Copy and save the following to a file named text.txt:

      text.txt
      she sells sea shells on the sea shore
      the shells that she sells are sea shells I am sure
      so if she sells sea shells on the sea shore
      I am sure that the shells are sea shore shells
      
    2. Upload the text.txt file to the input data bucket (for a scripted upload over the S3-compatible API, see the sketch at the end of this section).

  2. Upload a file with the Python code of the analysis program:

    1. Copy and save the following code to a file named word_count.py:

      word_count.py
      import sys
      from pyspark import SparkConf, SparkContext


      def main():

          # The job expects exactly two arguments: the input and output locations.
          if len(sys.argv) != 3:
              print('Usage: job.py <input_directory> <output_directory>')
              sys.exit(1)

          in_dir = sys.argv[1]
          out_dir = sys.argv[2]

          conf = SparkConf().setAppName("Word count - PySpark")
          sc = SparkContext(conf=conf)

          # Split every line into words, map each word to a (word, 1) pair,
          # and add up the counts per word.
          text_file = sc.textFile(in_dir)
          counts = text_file.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

          # Write the result to Object Storage if an s3a:// path is given,
          # otherwise to the cluster's default file system (HDFS).
          if out_dir.startswith('s3a://'):
              counts.saveAsTextFile(out_dir)
          else:
              default_fs = sc._jsc.hadoopConfiguration().get('fs.defaultFS')
              counts.saveAsTextFile(default_fs + out_dir)


      if __name__ == "__main__":
          main()
      
    2. Upload the word_count.py file to the input data bucket.

  3. Create a PySpark job with the following parameters:

    • Main python file: s3a://<input_data_bucket_name>/word_count.py

    • Arguments:

      • s3a://<input_data_bucket_name>/text.txt
      • s3a://<output_bucket_name>/<output_directory>
  4. Wait for the job status to change to Done.

  5. Download the result files from the output bucket and review them (a scripted download is included in the sketch at the end of this section):

    part-00000
    ('sea', 6)
    ('are', 2)
    ('am', 2)
    ('sure', 2)
    
    part-00001
    ('she', 3)
    ('sells', 3)
    ('shells', 6)
    ('on', 2)
    ('the', 4)
    ('shore', 3)
    ('that', 2)
    ('I', 2)
    ('so', 1)
    ('if', 1)
    

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.
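
The uploads in steps 1 and 2 and the results download in step 5 can also be scripted against the same S3-compatible API. The sketch below assumes the same static access key and placeholder names as the bucket-creation example above; run the upload part before creating the job and the download part after the job status changes to Done.

  import boto3

  s3 = boto3.client(
      "s3",
      endpoint_url="https://storage.yandexcloud.net",
      aws_access_key_id="<static_key_ID>",
      aws_secret_access_key="<secret_key>",
  )

  # Before creating the job: upload the input text and the application code.
  s3.upload_file("text.txt", "<input_data_bucket_name>", "text.txt")
  s3.upload_file("word_count.py", "<input_data_bucket_name>", "word_count.py")

  # After the job status changes to Done: download the part-* result files.
  listing = s3.list_objects_v2(
      Bucket="<output_bucket_name>", Prefix="<output_directory>/"
  )
  for obj in listing.get("Contents", []):
      name = obj["Key"].rsplit("/", 1)[-1]
      if name.startswith("part-"):
          s3.download_file("<output_bucket_name>", obj["Key"], name)
          print("Downloaded", name)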

Delete the resources you created

Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:

  1. Delete the cluster.
  2. Delete the buckets (the sketch below shows one way to empty and delete them over the S3-compatible API).
  3. Delete the service account.
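
If you created the buckets over the S3-compatible API, you can empty and delete them the same way. This is a minimal sketch with the same placeholder credentials and bucket names as above; the cluster and the service account are deleted separately, as listed above.

  import boto3

  s3 = boto3.client(
      "s3",
      endpoint_url="https://storage.yandexcloud.net",
      aws_access_key_id="<static_key_ID>",
      aws_secret_access_key="<secret_key>",
  )

  for bucket in ("<input_data_bucket_name>", "<output_bucket_name>"):
      # A bucket must be empty before it can be deleted.
      for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
          s3.delete_object(Bucket=bucket, Key=obj["Key"])
      s3.delete_bucket(Bucket=bucket)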
