© 2025 Direct Cursus Technology L.L.C.
Text recognition from PDF files

Written by
Yandex Cloud
Updated at March 28, 2025
  • Getting started
  • Recognizing text from a PDF file through the OCR API

You can recognize text from a PDF file using the OCR API. The OCR API is an updated and revised interface with enhanced features, including multi-column text recognition.

Getting started

To use the examples, install cURL.

Get your account data for authentication:

Yandex or federated account:

  1. Get an IAM token for your Yandex account or federated account.

  2. Get the ID of the folder for which your account has the ai.vision.user role or higher.

  3. When accessing Vision OCR via the API, provide the received parameters in each request:

    • For the Vision API and Classifier API:

      Specify the IAM token in the Authorization header as follows:

      Authorization: Bearer <IAM_token>

      Specify the folder ID in the request body in the folderId parameter.

    • For the OCR API:

      • Specify the IAM token in the Authorization header.
      • Specify the folder ID in the x-folder-id header.

      Authorization: Bearer <IAM_token>
      x-folder-id: <folder_ID>

Service account:

Vision OCR supports two authentication methods based on service accounts:

  • With an IAM token:

    1. Get an IAM token.

    2. Provide the IAM token in the Authorization header in the following format:

      Authorization: Bearer <IAM_token>
      
  • With API keys.

    Use API keys if requesting an IAM token automatically is not an option.

    1. Get an API key.

    2. Provide the API key in the Authorization header in the following format:

      Authorization: Api-Key <API_key>
      

Do not specify the folder ID in your requests, as the service uses the folder the service account was created in.
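
The header rules above can be condensed into a small helper. This is a minimal sketch rather than part of any Yandex Cloud tooling: the function name build_ocr_headers and its parameters are hypothetical, and only the header names and value formats are taken from the text above.

```python
from typing import Dict, Optional


def build_ocr_headers(iam_token: Optional[str] = None,
                      api_key: Optional[str] = None,
                      folder_id: Optional[str] = None) -> Dict[str, str]:
    """Build OCR API request headers (hypothetical helper).

    Pass iam_token plus folder_id for a Yandex or federated account.
    Pass iam_token or api_key alone for a service account.
    """
    # Exactly one credential type must be supplied.
    if (iam_token is None) == (api_key is None):
        raise ValueError("Provide exactly one of iam_token or api_key")

    headers = {"Content-Type": "application/json"}
    if iam_token is not None:
        headers["Authorization"] = f"Bearer {iam_token}"
    else:
        headers["Authorization"] = f"Api-Key {api_key}"

    # The folder ID is sent only for user accounts; for a service account
    # the service uses the folder the account was created in.
    if folder_id is not None:
        headers["x-folder-id"] = folder_id
    return headers
```

For a user account, call build_ocr_headers(iam_token="<IAM_token>", folder_id="<folder_ID>"); for a service account with an API key, call build_ocr_headers(api_key="<API_key>").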

Recognizing text from a PDF file through the OCR API

Text recognition from a PDF file is implemented through OCR API methods, such as TextRecognition.recognize for single-page PDF files and TextRecognitionAsync.recognize for multi-page ones.

  1. Prepare a PDF file for recognition. Make sure its size does not exceed 10 MB and a single file contains no more than 200 pages.
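
The size limit can be checked before encoding. The sketch below is illustrative only: check_pdf_size and MAX_PDF_BYTES are hypothetical names, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption. It does not check the 200-page limit, which would require a PDF parsing library.

```python
import os

# Assumption: the 10 MB limit is interpreted as 10 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 10 * 1024 * 1024


def check_pdf_size(path: str, limit: int = MAX_PDF_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds the limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; the limit is {limit} bytes")
    return size
```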

  2. Encode the PDF file as Base64:

    UNIX:

      base64 -i input.pdf > output.txt

    Windows:

      C:> Base64.exe -e input.pdf > output.txt

    PowerShell:

      [Convert]::ToBase64String([IO.File]::ReadAllBytes("./input.pdf")) > output.txt

    Python:

      # Import a library for encoding files in Base64.
      import base64

      # Create a function that encodes a file opened in binary mode
      # and returns the result as a string that can be placed in JSON.
      def encode_file(file):
        file_content = file.read()
        return base64.b64encode(file_content).decode("utf-8")

    Node.js:

      // Read the file contents to memory.
      var fs = require('fs');
      var file = fs.readFileSync('/path/to/file');

      // Get the file contents in Base64 format.
      var encoded = Buffer.from(file).toString('base64');

    Java:

      // Import a library for encoding files in Base64.
      import org.apache.commons.codec.binary.Base64;

      // Get the file contents in Base64 format.
      byte[] fileData = Base64.encodeBase64(yourFile.getBytes());

    Go:

      import (
          "bufio"
          "encoding/base64"
          "io/ioutil"
          "os"
      )

      // Open the file.
      f, _ := os.Open("/path/to/file")

      // Read the file contents.
      reader := bufio.NewReader(f)
      content, _ := ioutil.ReadAll(reader)

      // Get the file contents in Base64 format.
      base64.StdEncoding.EncodeToString(content)
    
  3. Create a file with the request body, e.g., body.json.

    body.json:

    {
      "mimeType": "application/pdf",
      "languageCodes": ["*"],
      "model": "page",
      "content": "<base64-encoded_PDF_file>"
    }
    

    In the content property, specify the PDF file contents encoded as Base64.

    To automatically detect the text language, specify the "languageCodes": ["*"] property in the configuration.
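
Steps 2 and 3 can also be combined in a short script. This is a sketch under stated assumptions: build_request_body and write_body_file are hypothetical helper names, and the field values simply mirror the body.json example above.

```python
import base64
import json


def build_request_body(pdf_bytes: bytes) -> dict:
    """Return a request body for a PDF file, mirroring body.json above."""
    return {
        "mimeType": "application/pdf",
        "languageCodes": ["*"],  # automatic language detection
        "model": "page",
        # Base64 output is ASCII, so it can be stored as a JSON string.
        "content": base64.b64encode(pdf_bytes).decode("ascii"),
    }


def write_body_file(pdf_path: str, body_path: str = "body.json") -> None:
    """Read a PDF file, encode it, and save the request body as JSON."""
    with open(pdf_path, "rb") as pdf_file:
        body = build_request_body(pdf_file.read())
    with open(body_path, "w", encoding="utf-8") as body_file:
        json.dump(body, body_file)
```

Calling write_body_file("input.pdf") would produce a body.json file that can be sent with --data "@body.json".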

  4. Send your request:

    Single-page PDF file:

    Send a request using the recognize method and save the response to a file, e.g., output.json:

    UNIX:

      export IAM_TOKEN=<IAM_token>
      curl \
        --request POST \
        --header "Content-Type: application/json" \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: <folder_ID>" \
        --header "x-data-logging-enabled: true" \
        --data "@body.json" \
        https://ocr.api.cloud.yandex.net/ocr/v1/recognizeText \
        --output output.json

    Python:

      import json
      import requests

      # Read the Base64 string saved at step 2, dropping any line breaks.
      with open("output.txt") as f:
          content = "".join(f.read().split())

      data = {"mimeType": "application/pdf",
              "languageCodes": ["ru", "en"],
              "content": content}

      url = "https://ocr.api.cloud.yandex.net/ocr/v1/recognizeText"

      headers = {"Content-Type": "application/json",
                 "Authorization": "Bearer <IAM_token>",
                 "x-folder-id": "<folder_ID>",
                 "x-data-logging-enabled": "true"}

      response = requests.post(url=url, headers=headers, data=json.dumps(data))

      # Save the response to a file.
      with open("output.json", "w", encoding="utf-8") as f:
          f.write(response.text)

    Where:

    • <IAM_token>: Previously obtained IAM token.
    • <folder_ID>: Previously obtained folder ID.
    

    The result will consist of recognized blocks of text, lines, and words with their positions on the PDF file's page.

    Result
    {
      "result": {
        "text_annotation": {
          "width": "3312",
          "height": "4683",
          "blocks": [
            {
              "bounding_box": {
                "vertices": [
                  {
                    "x": "373",
                    "y": "371"
                  },
                  {
                    "x": "373",
                    "y": "580"
                  },
                  {
                    "x": "1836",
                    "y": "580"
                  },
                  {
                    "x": "1836",
                    "y": "371"
                  }
                ]
              },
              "lines": [
                {
                  "bounding_box": {
                    "vertices": [
                      {
                        "x": "373",
                        "y": "371"
                      },
                      {
                        "x": "373",
                        "y": "430"
                      },
                      {
                        "x": "1836",
                        "y": "430"
                      },
                      {
                        "x": "1836",
                        "y": "371"
                      }
                    ]
                  },
                  "alternatives": [
                    {
                      "text": "Page №1, line 1",
                      "words": [
                        {
                          "bounding_box": {
                            "vertices": [
                              {
                                "x": "373",
                                "y": "358"
                              },
                              {
                                "x": "373",
                                "y": "444"
                              },
                              {
                                "x": "967",
                                "y": "444"
                              },
                              {
                                "x": "967",
                                "y": "358"
                              }
                            ]
                          },
                          "text": "Page",
                          "entity_index": "-1"
                        },
                        {
                          "bounding_box": {
                            "vertices": [
                              {
                                "x": "1014",
                                "y": "358"
                              },
                              {
                                "x": "1014",
                                "y": "444"
                              },
                              {
                                "x": "1278",
                                "y": "444"
                              },
                              {
                                "x": "1278",
                                "y": "358"
                              }
                            ]
                          },
                          "text": "№1,",
                          "entity_index": "-1"
                        },
                        {
                          "bounding_box": {
                            "vertices": [
                              {
                                "x": "1303",
                                "y": "358"
                              },
                              {
                                "x": "1303",
                                "y": "444"
                              },
                              {
                                "x": "1718",
                                "y": "444"
                              },
                              {
                                "x": "1718",
                                "y": "358"
                              }
                            ]
                          },
                          "text": "line",
                          "entity_index": "-1"
                        },
                        {
                          "bounding_box": {
                            "vertices": [
                              {
                                "x": "1765",
                                "y": "358"
                              },
                              {
                                "x": "1765",
                                "y": "444"
                              },
                              {
                                "x": "1836",
                                "y": "444"
                              },
                              {
                                "x": "1836",
                                "y": "358"
                              }
                            ]
                          },
                          "text": "1",
                          "entity_index": "-1"
                        }
                      ]
                    }
                  ]
                },
                {
                  "bounding_box": {
                    "vertices": [
                      {
                        "x": "373",
                        "y": "520"
                      },
                      {
                        "x": "373",
                        "y": "580"
                      },
                      {
                        "x": "1836",
                        "y": "580"
                      },
                      {
                        "x": "1836",
                        "y": "520"
                      }
                    ]
                  },
                  "alternatives": [
                    {
                      "text": "Page №1, line 2",
                      "words": [
                        {
                          "bounding_box": {
                            "vertices": [
                              {
                                "x": "373",
                                "y": "508"
                              },
                              {
                                "x": "373",
                                "y": "594"
                              },
                              {
                                "x": "967",
                                "y": "594"
                              },
                              {
                                "x": "967",
                                "y": "508"
                              }
                            ]
                          },
                          "text": "Page",
                          "entity_index": "-1"
                        },
                        {
                          "bounding_box": {
                            "vertices": [
                              {
                                "x": "1014",
                                "y": "507"
                              },
                              {
                                "x": "1014",
                                "y": "593"
                              },
                              {
                                "x": "1277",
                                "y": "593"
                              },
                              {
                                "x": "1277",
                                "y": "507"
                              }
                            ]
                          },
                          "text": "№1,",
                          "entity_index": "-1"
                        },
                        {
                          "bounding_box": {
                            "vertices": [
                              {
                                "x": "1302",
                                "y": "507"
                              },
                              {
                                "x": "1302",
                                "y": "593"
                              },
                              {
                                "x": "1718",
                                "y": "593"
                              },
                              {
                                "x": "1718",
                                "y": "507"
                              }
                            ]
                          },
                          "text": "line",
                          "entity_index": "-1"
                        },
                        {
                          "bounding_box": {
                            "vertices": [
                              {
                                "x": "1765",
                                "y": "507"
                              },
                              {
                                "x": "1765",
                                "y": "593"
                              },
                              {
                                "x": "1836",
                                "y": "593"
                              },
                              {
                                "x": "1836",
                                "y": "507"
                              }
                            ]
                          },
                          "text": "2",
                          "entity_index": "-1"
                        }
                      ]
                    }
                  ]
                }
              ],
              "languages": [
                {
                  "language_code": "ru"
                }
              ]
            }
          ],
          "entities": []
        },
        "page": "0"
      }
    }
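
To get plain text out of a response like the one above, you can walk the blocks and lines and collect the text of each line. The sketch below is an illustration, not an official client: extract_lines is a hypothetical name, and the field names (result, text_annotation, blocks, lines, alternatives, text) are assumed to match the sample response.

```python
from typing import List


def extract_lines(response: dict) -> List[str]:
    """Collect the text of every recognized line, in document order."""
    annotation = response.get("result", {}).get("text_annotation", {})
    lines = []
    for block in annotation.get("blocks", []):
        for line in block.get("lines", []):
            alternatives = line.get("alternatives", [])
            if alternatives:
                # Use the first alternative offered for the line.
                lines.append(alternatives[0]["text"])
    return lines
```

Applied to the sample response above, after loading it with json.load, the function would return the two line texts, "Page №1, line 1" and "Page №1, line 2".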
    
    Multi-page PDF file:

    • Send a request using the TextRecognitionAsync.recognize method:

      export IAM_TOKEN=<IAM_token>
      curl \
        --request POST \
        --header "Content-Type: application/json" \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: <folder_ID>" \
        --header "x-data-logging-enabled: true" \
        --data "@body.json" \
        https://ocr.api.cloud.yandex.net/ocr/v1/recognizeTextAsync
      

      Where:

      • <IAM_token>: Previously obtained IAM token.
      • <folder_ID>: Previously obtained folder ID.

      Result:

      {
        "id": "cfrtr5q0hdhl********",
        "description": "OCR async recognition",
        "created_at": "2023-10-24T09:12:48Z",
        "created_by": "ajeol2afu1js********",
        "modified_at": "2023-10-24T09:12:48Z",
        "done": false,
        "metadata": null
      }
      

      Save the recognition operation ID returned in the id field of the response.

    • Get the recognition result using the getRecognition method:

      export IAM_TOKEN=<IAM_token>
      curl \
        --request GET \
        --header "Content-Type: application/json" \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: <folder_ID>" \
        --header "x-data-logging-enabled: true" \
        "https://ocr.api.cloud.yandex.net/ocr/v1/getRecognition?operationId=<operation_ID>" \
        --output output.json
      

      Where:

      • <IAM_token>: IAM token you got earlier.
      • <folder_ID>: Folder ID you got earlier.
      • <operation_ID>: Recognition operation ID you got earlier.

      The result will consist of recognized blocks of text, lines, and words with their positions on the PDF file page. The recognition result for each page is presented in a separate result section.
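
Because each page arrives as a separate result section, the saved file is not a single JSON document, so json.load would fail on it. The sketch below assumes that output.json holds several JSON objects written one after another, as in the sample result; iter_page_results is a hypothetical name.

```python
import json
from typing import Iterator


def iter_page_results(text: str) -> Iterator[dict]:
    """Yield each JSON object from a string of back-to-back JSON objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between objects; raw_decode does not skip it.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed object and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

For example, after reading output.json into a string named text, list(iter_page_results(text)) would give one dictionary per page, each with its own result section.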

      Result
      {
        "result": {
          "text_annotation": {
            "width": "3312",
            "height": "4683",
            "blocks": [
              {
                "bounding_box": {
                  "vertices": [
                    {
                      "x": "373",
                      "y": "371"
                    },
                    {
                      "x": "373",
                      "y": "580"
                    },
                    {
                      "x": "1836",
                      "y": "580"
                    },
                    {
                      "x": "1836",
                      "y": "371"
                    }
                  ]
                },
                "lines": [
                  {
                    "bounding_box": {
                      "vertices": [
                        {
                          "x": "373",
                          "y": "371"
                        },
                        {
                          "x": "373",
                          "y": "430"
                        },
                        {
                          "x": "1836",
                          "y": "430"
                        },
                        {
                          "x": "1836",
                          "y": "371"
                        }
                      ]
                    },
                    "alternatives": [
                      {
                        "text": "Page №1, line 1",
                        "words": [
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "373",
                                  "y": "358"
                                },
                                {
                                  "x": "373",
                                  "y": "444"
                                },
                                {
                                  "x": "967",
                                  "y": "444"
                                },
                                {
                                  "x": "967",
                                  "y": "358"
                                }
                              ]
                            },
                            "text": "Page",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1014",
                                  "y": "358"
                                },
                                {
                                  "x": "1014",
                                  "y": "444"
                                },
                                {
                                  "x": "1278",
                                  "y": "444"
                                },
                                {
                                  "x": "1278",
                                  "y": "358"
                                }
                              ]
                            },
                            "text": "№1,",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1303",
                                  "y": "358"
                                },
                                {
                                  "x": "1303",
                                  "y": "444"
                                },
                                {
                                  "x": "1718",
                                  "y": "444"
                                },
                                {
                                  "x": "1718",
                                  "y": "358"
                                }
                              ]
                            },
                            "text": "line",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1765",
                                  "y": "358"
                                },
                                {
                                  "x": "1765",
                                  "y": "444"
                                },
                                {
                                  "x": "1836",
                                  "y": "444"
                                },
                                {
                                  "x": "1836",
                                  "y": "358"
                                }
                              ]
                            },
                            "text": "1",
                            "entity_index": "-1"
                          }
                        ]
                      }
                    ]
                  },
                  {
                    "bounding_box": {
                      "vertices": [
                        {
                          "x": "373",
                          "y": "520"
                        },
                        {
                          "x": "373",
                          "y": "580"
                        },
                        {
                          "x": "1836",
                          "y": "580"
                        },
                        {
                          "x": "1836",
                          "y": "520"
                        }
                      ]
                    },
                    "alternatives": [
                      {
                        "text": "Page №1, line 2",
                        "words": [
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "373",
                                  "y": "508"
                                },
                                {
                                  "x": "373",
                                  "y": "594"
                                },
                                {
                                  "x": "967",
                                  "y": "594"
                                },
                                {
                                  "x": "967",
                                  "y": "508"
                                }
                              ]
                            },
                            "text": "Page",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1014",
                                  "y": "507"
                                },
                                {
                                  "x": "1014",
                                  "y": "593"
                                },
                                {
                                  "x": "1277",
                                  "y": "593"
                                },
                                {
                                  "x": "1277",
                                  "y": "507"
                                }
                              ]
                            },
                            "text": "№1,",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1302",
                                  "y": "507"
                                },
                                {
                                  "x": "1302",
                                  "y": "593"
                                },
                                {
                                  "x": "1718",
                                  "y": "593"
                                },
                                {
                                  "x": "1718",
                                  "y": "507"
                                }
                              ]
                            },
                            "text": "line",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1765",
                                  "y": "507"
                                },
                                {
                                  "x": "1765",
                                  "y": "593"
                                },
                                {
                                  "x": "1836",
                                  "y": "593"
                                },
                                {
                                  "x": "1836",
                                  "y": "507"
                                }
                              ]
                            },
                            "text": "2",
                            "entity_index": "-1"
                          }
                        ]
                      }
                    ]
                  }
                ],
                "languages": [
                  {
                    "language_code": "ru"
                  }
                ]
              }
            ],
            "entities": []
          },
          "page": "0"
        }
      }
      {
        "result": {
          "text_annotation": {
            "width": "3312",
            "height": "4683",
            "blocks": [
              {
                "bounding_box": {
                  "vertices": [
                    {
                      "x": "371",
                      "y": "371"
                    },
                    {
                      "x": "371",
                      "y": "580"
                    },
                    {
                      "x": "1836",
                      "y": "580"
                    },
                    {
                      "x": "1836",
                      "y": "371"
                    }
                  ]
                },
                "lines": [
                  {
                    "bounding_box": {
                      "vertices": [
                        {
                          "x": "371",
                          "y": "371"
                        },
                        {
                          "x": "371",
                          "y": "430"
                        },
                        {
                          "x": "1820",
                          "y": "430"
                        },
                        {
                          "x": "1820",
                          "y": "371"
                        }
                      ]
                    },
                    "alternatives": [
                      {
                        "text": "Page №2, line 1",
                        "words": [
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "371",
                                  "y": "357"
                                },
                                {
                                  "x": "371",
                                  "y": "444"
                                },
                                {
                                  "x": "964",
                                  "y": "444"
                                },
                                {
                                  "x": "964",
                                  "y": "357"
                                }
                              ]
                            },
                            "text": "Page",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "993",
                                  "y": "357"
                                },
                                {
                                  "x": "993",
                                  "y": "444"
                                },
                                {
                                  "x": "1292",
                                  "y": "444"
                                },
                                {
                                  "x": "1292",
                                  "y": "357"
                                }
                              ]
                            },
                            "text": "№2,",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1317",
                                  "y": "357"
                                },
                                {
                                  "x": "1317",
                                  "y": "444"
                                },
                                {
                                  "x": "1701",
                                  "y": "444"
                                },
                                {
                                  "x": "1701",
                                  "y": "357"
                                }
                              ]
                            },
                            "text": "line",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1748",
                                  "y": "357"
                                },
                                {
                                  "x": "1748",
                                  "y": "444"
                                },
                                {
                                  "x": "1820",
                                  "y": "444"
                                },
                                {
                                  "x": "1820",
                                  "y": "357"
                                }
                              ]
                            },
                            "text": "1",
                            "entity_index": "-1"
                          }
                        ]
                      }
                    ]
                  },
                  {
                    "bounding_box": {
                      "vertices": [
                        {
                          "x": "373",
                          "y": "520"
                        },
                        {
                          "x": "373",
                          "y": "580"
                        },
                        {
                          "x": "1836",
                          "y": "580"
                        },
                        {
                          "x": "1836",
                          "y": "520"
                        }
                      ]
                    },
                    "alternatives": [
                      {
                        "text": "Page №2, line 2",
                        "words": [
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "373",
                                  "y": "507"
                                },
                                {
                                  "x": "373",
                                  "y": "594"
                                },
                                {
                                  "x": "967",
                                  "y": "594"
                                },
                                {
                                  "x": "967",
                                  "y": "507"
                                }
                              ]
                            },
                            "text": "Page",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1014",
                                  "y": "507"
                                },
                                {
                                  "x": "1014",
                                  "y": "594"
                                },
                                {
                                  "x": "1277",
                                  "y": "594"
                                },
                                {
                                  "x": "1277",
                                  "y": "507"
                                }
                              ]
                            },
                            "text": "№2,",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1302",
                                  "y": "507"
                                },
                                {
                                  "x": "1302",
                                  "y": "594"
                                },
                                {
                                  "x": "1718",
                                  "y": "594"
                                },
                                {
                                  "x": "1718",
                                  "y": "507"
                                }
                              ]
                            },
                            "text": "line",
                            "entity_index": "-1"
                          },
                          {
                            "bounding_box": {
                              "vertices": [
                                {
                                  "x": "1765",
                                  "y": "506"
                                },
                                {
                                  "x": "1765",
                                  "y": "593"
                                },
                                {
                                  "x": "1836",
                                  "y": "593"
                                },
                                {
                                  "x": "1836",
                                  "y": "506"
                                }
                              ]
                            },
                            "text": "2",
                            "entity_index": "-1"
                          }
                        ]
                      }
                    ]
                  }
                ],
                "languages": [
                  {
                    "language_code": "ru"
                  }
                ]
              }
            ],
            "entities": []
          },
          "page": "1"
        }
      }
      
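  5. To get the recognized words from the PDF file, collect the text value of every entry in the words arrays of the response. The text value of each alternatives entry holds the full recognized line.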
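The lookup described in the last step can be sketched in Python as follows. This is a minimal illustration, not part of the service's tooling: the `collect_words` helper is a hypothetical name, and the `sample` object is a hand-trimmed fragment shaped like the response above (bounding boxes omitted), not real service output. The sketch walks the whole JSON tree, so it does not depend on the names of the top-level fields.

```python
import json


def collect_words(node):
    """Recursively gather the "text" of every entry found in a "words" array."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "words" and isinstance(value, list):
                # Each word object carries its recognized text in "text".
                found.extend(
                    w["text"] for w in value if isinstance(w, dict) and "text" in w
                )
            else:
                found.extend(collect_words(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_words(item))
    return found


# Assumption: a trimmed fragment that mirrors the structure shown above.
sample = json.loads("""
{
  "alternatives": [
    {
      "text": "Page №2, line 1",
      "words": [
        {"text": "Page", "entity_index": "-1"},
        {"text": "№2,", "entity_index": "-1"},
        {"text": "line", "entity_index": "-1"},
        {"text": "1", "entity_index": "-1"}
      ]
    }
  ]
}
""")

words = collect_words(sample)
print(" ".join(words))
```

To process a saved response, load it with `json.load` and pass the resulting object to `collect_words` in place of `sample`.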

Note

If the coordinates in the response do not match the position of the elements as displayed, enable EXIF metadata support in your image viewer or remove the Orientation attribute from the image's EXIF section before sending the file to the service.
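When checking coordinates against a rendered page, it can help to reduce a bounding_box to numeric edges. The sketch below is an illustration under stated assumptions: `box_edges` is a hypothetical helper, the vertex values are strings as in the response above, and the origin is assumed to be the top-left corner with y growing downward.

```python
def box_edges(bounding_box):
    """Return (left, top, right, bottom) as integers for a bounding_box object."""
    # Coordinates arrive as strings in the response, so convert them first.
    xs = [int(v["x"]) for v in bounding_box["vertices"]]
    ys = [int(v["y"]) for v in bounding_box["vertices"]]
    # Assumption: y grows downward, so the smallest y is the top edge.
    return min(xs), min(ys), max(xs), max(ys)


# Bounding box of the first word ("Page") from the response above.
page_word_box = {
    "vertices": [
        {"x": "371", "y": "357"},
        {"x": "371", "y": "444"},
        {"x": "964", "y": "444"},
        {"x": "964", "y": "357"},
    ]
}

left, top, right, bottom = box_edges(page_word_box)
print(left, top, right, bottom)    # 371 357 964 444
print(right - left, bottom - top)  # 593 87 (width and height)
```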
