Recognizing text in image archives in Yandex Vision OCR
With Vision OCR and Yandex Object Storage, you can recognize text in images and store both the source image archive and the recognition results.
To set up an infrastructure for text recognition using Vision OCR and export the results automatically to Object Storage:
- Prepare your cloud.
- Create a bucket.
- Create a VM.
- Set up the VM.
- Create an archive with images.
- Prepare a script for digitizing and uploading images.
- Double-check the recognition results.
If you no longer need the resources you created, delete them.
Getting started
Sign up for Yandex Cloud and create a billing account:
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page.
Learn more about clouds and folders.
Required paid resources
The infrastructure costs for image recognition and data storage include:
- Fee for VM computing resources and disks (see Yandex Compute Cloud pricing).
- Fee for data storage in a bucket and operations with data (see Object Storage pricing).
- Fee for using a dynamic or static public IP address (see Yandex Virtual Private Cloud pricing).
- Fee for using Vision OCR (see Vision OCR pricing).
Create a bucket
To create an Object Storage bucket to store the source images and recognition results:
- Go to the management console and select the folder to perform the operations in.
- Select Object Storage.
- Click Create bucket.
- Enter a name for the bucket according to the naming requirements.
- In the Object read access field, select Restricted.
- In the Storage class field, select Cold.
- Click Create bucket.
Create a VM
- On the folder page in the management console, click Create resource and select Virtual machine instance.
- Under Boot disk image, in the Product search field, enter CentOS 7 and select a public CentOS 7 image.
- Under Location, select an availability zone to create your VM in. If you do not know which availability zone you need, leave the default one.
- Under Disks and file storages, select the SSD disk type and specify the size: 19 GB.
- Under Computing resources, navigate to the Custom tab and specify the required platform, number of vCPUs, and amount of RAM:
  - Platform: Intel Cascade Lake.
  - vCPU: 2.
  - Guaranteed vCPU performance: 20%.
  - RAM: 2 GB.
- Under Network settings:
  - In the Subnet field, select the network and subnet to connect your VM to. If the required network or subnet is not listed, create it.
  - Under Public IP, keep Auto to assign your VM a random external IP address from the Yandex Cloud pool or select a static address from the list if you reserved one in advance.
- Under Access, select SSH key and specify the access credentials for the VM:
  - Under Login, enter the username. Do not use root or other names reserved by the OS. To perform operations requiring superuser permissions, use the sudo command.
  - In the SSH key field, select the SSH key saved in your organization user profile.
    If there are no saved SSH keys in your profile, or you want to add a new key:
    - Click Add key.
    - Enter a name for the SSH key.
    - Upload or paste the contents of the public key file. You need to create a key pair for the SSH connection to a VM yourself.
    - Click Add.
    The SSH key will be added to your organization user profile.
    If users cannot add SSH keys to their profiles in the organization, the added public SSH key will only be saved to the user profile of the VM being created.
- Under General information, specify the VM name. The naming requirements are as follows:
  - The name must be from 2 to 63 characters long.
  - It may contain lowercase Latin letters, numbers, and hyphens.
  - The first character must be a letter and the last character cannot be a hyphen.
- Click Create VM.
- Wait for the VM status to change to Running and save its public IP address: you will need it for SSH connection.
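The VM naming requirements listed above can be checked before you fill in the form. The following is a minimal sketch, not part of any Yandex Cloud tool: is_valid_vm_name is a hypothetical helper name, and the regular expression is only one way to encode the three rules.

```shell
# Hypothetical helper (not part of the Yandex Cloud CLI): checks a VM name
# against the naming requirements from this tutorial.
is_valid_vm_name() {
  # First character: a lowercase Latin letter.
  # Last character: a letter or a digit (so it cannot be a hyphen).
  # In between: up to 61 letters, digits, or hyphens, giving 2 to 63 characters in total.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]{0,61}[a-z0-9]$'
}

# Example:
#   is_valid_vm_name "vision-vm" && echo "The name meets the requirements"
```

The function returns a zero exit code for a valid name and a non-zero one otherwise, so it can be used directly in an if statement.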
Set up the VM
Set up the Yandex Cloud CLI
- Connect to the VM via SSH.
- Make sure that the Yandex Cloud CLI runs correctly. Run the following command on the VM:
  yc config list
  Result:
  token: AQ...gs
  cloud-id: b1gdtdqb1900********
  folder-id: b1gveg9vude9********
  Save the folder ID (the folder-id parameter): you will need it to configure a service account.
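Instead of copying the folder ID by hand, you can extract it from the command output. This is a minimal sketch that assumes the "name: value" layout shown in the sample result; folder_id_from_config is a hypothetical helper name.

```shell
# Hypothetical helper: reads the output of `yc config list` on stdin
# and prints the value from the folder-id line.
folder_id_from_config() {
  awk '$1 == "folder-id:" { print $2 }'
}

# Example (on the VM):
#   FOLDERID=$(yc config list | folder_id_from_config)
```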
Set up a service account
- Create a service account:
  yc iam service-account create \
    --name <service_account_name> \
    --description "<service_account_description>"
  Where:
  - --name: Service account name, e.g., vision-sa.
  - --description: Service account description, e.g., this is vision service account.
  Result:
  id: aje6aoc8hccu********
  folder_id: b1gv87ssvu49********
  created_at: "2022-10-12T14:04:43.198559512Z"
  name: vision-sa
  description: this is vision service account
  Save the service account ID (the id parameter): you will need it for further configuration.
- Assign the editor role to the service account:
  yc resource-manager folder add-access-binding <folder_ID> \
    --role editor \
    --subject serviceAccount:<service_account_ID>
  Where:
  - --role: Role you want to assign.
  - --subject serviceAccount: Service account ID.
- Create a static access key for your service account:
  yc iam access-key create \
    --service-account-id <service_account_ID> \
    --description "<key_description>"
  Where:
  - --service-account-id: Service account ID.
  - --description: Key description, e.g., this key is for vision.
  Result:
  access_key:
    id: ajen8d7fur27********
    service_account_id: aje6aoc8hccu********
    created_at: "2022-10-12T15:08:08.045280520Z"
    description: this key is for vision
    key_id: YC...li
  secret: YC...J5
  Save the following parameters (you will need them to configure the AWS CLI utility):
  - key_id: Static access key ID.
  - secret: Secret key.
- Create an authorized key for the service account:
  yc iam key create \
    --service-account-id <service_account_ID> \
    --output key.json
  Where:
  - --service-account-id: Service account ID.
  - --output: Name of the JSON file with the authorized key.
  Result:
  id: aje3qc9pagb9********
  service_account_id: aje6aoc8hccu********
  created_at: "2022-10-13T12:53:04.810240976Z"
  key_algorithm: RSA_2048
- Create a Yandex Cloud CLI profile to run on behalf of the service account, such as vision-profile:
  yc config profile create vision-profile
  Result:
  Profile 'vision-profile' created and activated
-
Specify the authorized key of the service account in the profile configuration:
yc config set service-account-key key.json
-
Get a Yandex Identity and Access Management token for the service account:
yc iam create-token
Save the Identity and Access Management token: you will need it to send images to Vision OCR.
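The static access key values from the steps above can be pulled out of the saved command output rather than copied manually. The sketch below assumes the "name: value" layout of the sample result for yc iam access-key create; access_key_id and access_key_secret are hypothetical helper names, and access-key.txt in the example is a hypothetical file name.

```shell
# Hypothetical helpers: read the output of `yc iam access-key create` on stdin.
# They rely on the "name: value" layout shown in the sample result.

# Prints the static access key ID (the key_id line, not the id line).
access_key_id() {
  awk '$1 == "key_id:" { print $2 }'
}

# Prints the secret key (the secret line).
access_key_secret() {
  awk '$1 == "secret:" { print $2 }'
}

# Example, assuming the command output was saved to access-key.txt:
#   access_key_id < access-key.txt
#   access_key_secret < access-key.txt
```

If you save the output to a file, restrict access to it, because it contains the secret key.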
Set up the AWS CLI
-
Update the packages installed in the VM operating system. To do this, run this command:
sudo yum update -y
-
Install the AWS CLI:
sudo yum install awscli -y
-
Set up the AWS CLI:
aws configure
Specify the parameter values:
- AWS Access Key ID: Static access key ID (key_id) you got when configuring the service account.
- AWS Secret Access Key: Secret key (secret) you got when configuring the service account.
- Default region name: ru-central1.
- Default output format: json.
- Check that the ~/.aws/credentials file contains the correct key_id and secret values:
  cat ~/.aws/credentials
- Check that the ~/.aws/config file contains the correct Default region name and Default output format values:
  cat ~/.aws/config
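If you prefer not to answer the aws configure prompts interactively, the same two files can be written directly. This is a minimal sketch under the assumption that the default profile is used; write_aws_config is a hypothetical helper name, and the target directory is passed as an argument so that you can try it on a scratch directory first.

```shell
# Hypothetical helper: writes the two files that `aws configure` normally creates.
# Arguments: target directory, static access key ID, secret key.
write_aws_config() {
  dir=$1 key_id=$2 secret=$3
  mkdir -p "$dir"
  # Create the credentials file in a subshell with a strict umask
  # so that only the current user can read the secret key.
  (
    umask 077
    printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
      "$key_id" "$secret" > "$dir/credentials"
  )
  # Region and output format match the values used in this tutorial.
  printf '[default]\nregion = ru-central1\noutput = json\n' > "$dir/config"
}

# Example:
#   write_aws_config "$HOME/.aws" "<key_id>" "<secret>"
```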
Create an archive with images
- Upload images containing recognizable text to the bucket.
  Tip
  Use the sample image of the penguin crossing road sign.
- To make sure that the images were uploaded, list the bucket contents:
  aws --endpoint-url=https://storage.yandexcloud.net s3 ls s3://<bucket_name>/
- Save the images from the bucket to the VM, e.g., to the my_pictures directory:
  aws --endpoint-url=https://storage.yandexcloud.net s3 cp s3://<bucket_name>/ my_pictures --recursive
- Pack the images into an archive, e.g., my_pictures.tar:
  tar -cf my_pictures.tar my_pictures/*
- Delete the image directory:
  rm -rfd my_pictures
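Before deleting the image directory, it may be worth confirming that the archive really contains all the images. The following sketch counts the file entries in a tar archive; count_archive_files is a hypothetical helper name.

```shell
# Hypothetical helper: prints the number of file entries in a tar archive.
# Directory entries, which tar lists with a trailing "/", are not counted.
count_archive_files() {
  # grep -c exits with a non-zero code when the count is 0, hence "|| true".
  tar -tf "$1" | grep -vc '/$' || true
}

# Example:
#   count_archive_files my_pictures.tar
```

You can compare the printed number with the number of objects listed in the bucket.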
Prepare a script for digitizing and uploading images
Configure the environment
- Install the epel repository for additional packages:
  sudo yum install epel-release -y
- Install the jq package to process the results from Vision OCR:
  sudo yum install jq -y
- Install the nano text editor:
  sudo yum install nano -y
- Set the environment variables required for the script to run:
  export BUCKETNAME="<bucket_name>"
  export FOLDERID="<folder_ID>"
  export IAMTOKEN="<IAM_token>"
  Where:
  - BUCKETNAME: Bucket name.
  - FOLDERID: Folder ID.
  - IAMTOKEN: Identity and Access Management token you got when configuring the service account.
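Since the script depends on all three variables, a quick pre-flight check can catch a missing value before any image is sent. This is a minimal sketch; check_required_vars is a hypothetical helper name and is not part of the tutorial script.

```shell
# Hypothetical pre-flight check: returns a non-zero code if BUCKETNAME,
# FOLDERID, or IAMTOKEN is unset or empty, and names each missing variable.
check_required_vars() {
  missing=0
  for name in BUCKETNAME FOLDERID IAMTOKEN; do
    # Read the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Variable $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_required_vars && ./vision.sh
```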
Create a script
The script includes the following steps:
- Create the relevant directories.
- Unpack the archive with images.
- Process all images one by one:
  - Encode the image as Base64.
  - Create a request body for the specific image.
  - Send the image in a POST request to Vision OCR for recognition.
  - Save the result to the output.json file.
  - Extract the recognized text from output.json and save it to a text file.
- Add the resulting text files to an archive.
- Upload the archive with the text files to Object Storage.
- Delete the auxiliary files.
For your convenience, the script includes comments for each step.
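The request body step uses a fixed mimeType value of JPEG. If your archive mixes formats, the value could be chosen per file. The sketch below is an illustration only: mime_type_for is a hypothetical helper name, JPEG is the value used in this tutorial, and the PNG and PDF values are assumptions that you should verify against the Vision OCR API reference.

```shell
# Hypothetical helper: maps a file extension to a mimeType value for the request body.
# "JPEG" is the value used in this tutorial; "PNG" and "PDF" are assumptions.
mime_type_for() {
  # Take the text after the last dot and lowercase it,
  # so that "IMG.JPG" and "img.jpg" are treated alike.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|jpeg) echo "JPEG" ;;
    png)      echo "PNG" ;;
    pdf)      echo "PDF" ;;
    *)        return 1 ;;  # Unknown extension: let the caller skip the file.
  esac
}

# Example:
#   mime_type_for my_pictures/sign.jpg
```

Returning a non-zero code for unknown extensions lets a loop skip files that are not images instead of sending them for recognition.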
To implement the script:
- Create a file, e.g., vision.sh, and open it in the nano text editor:
  sudo nano vision.sh
- Copy the Bash script text to vision.sh:

#!/bin/bash

# Create the relevant directories.
echo "Creating directories..."
# Create a directory for recognized text.
mkdir -p my_pictures_text

# Unpack the archive with images to the my_pictures directory.
echo "Extract pictures in my_pictures directory..."
tar -xf my_pictures.tar

# Recognize the images from the archive.
# For each file in the directory, perform the following actions in a loop:
for f in my_pictures/*
do
  # Encode the image as Base64 for sending to Vision OCR.
  # The -w 0 option disables line wrapping so that the value fits into a single JSON string.
  CODEIMG=$(base64 -w 0 "$f")

  # Create a `body.json` file to send to Vision OCR in a POST request.
  cat <<EOF > body.json
{
"mimeType": "JPEG",
"languageCodes": ["*"],
"model": "page",
"content": "$CODEIMG"
}
EOF

  # Send the image to Vision OCR for recognition and write the result to the `output.json` file.
  echo "Processing file $f in Vision..."
  curl --request POST \
    --header "Content-Type: application/json" \
    --header "Authorization: Bearer ${IAMTOKEN}" \
    --header "x-data-logging-enabled: true" \
    --header "x-folder-id: ${FOLDERID}" \
    --data '@body.json' \
    https://ocr.api.cloud.yandex.net/ocr/v1/recognizeText \
    --output output.json

  # Get the image file name to use it later.
  IMAGE_BASE_NAME=$(basename -- "$f")
  IMAGE_NAME="${IMAGE_BASE_NAME%.*}"

  # Get the text data from the `output.json` file and write it to a .txt file with the same name as the image file.
  jq -r '.result[].blocks[].lines[].text' output.json | awk -v ORS=" " '{print}' > "my_pictures_text/${IMAGE_NAME}.txt"
done

# Archive the contents of the text file directory.
echo "Packing text files to archive..."
tar -cf my_pictures_text.tar my_pictures_text

# Upload the text file archive to your bucket.
echo "Sending archive to Object Storage Bucket..."
aws --endpoint-url=https://storage.yandexcloud.net s3 cp my_pictures_text.tar "s3://${BUCKETNAME}/" > /dev/null

# Delete the auxiliary files.
echo "Cleaning up..."
rm -f body.json
rm -f output.json
rm -rfd my_pictures
rm -rfd my_pictures_text
rm -f my_pictures_text.tar
-
Set the permissions to run the script:
sudo chmod 755 vision.sh
-
Run the script:
./vision.sh
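The script names each text file after its image by stripping the directory and the last extension. That logic can be tried in isolation with the sketch below; text_file_for is a hypothetical helper name that mirrors the basename and parameter expansion steps from the script.

```shell
# Hypothetical helper mirroring the name handling in the script:
# strips the directory and the last extension, then appends ".txt".
text_file_for() {
  base=$(basename -- "$1")
  # ${base%.*} removes the shortest suffix that starts with a dot.
  echo "my_pictures_text/${base%.*}.txt"
}

# Example:
#   text_file_for my_pictures/penguins.jpg
```

Only the last extension is removed, so two images that differ only in extension, such as a JPEG and a PNG with the same base name, would map to the same text file.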
Double-check the recognition results
- In the Yandex Cloud management console, select the folder containing the bucket with the recognition results.
- Select Object Storage.
- Open the bucket with the recognition results.
- Make sure the bucket contains the my_pictures_text.tar archive.
- Download and unpack the archive.
- Make sure the text in the <image_name>.txt files matches the text in the respective images.
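As a first pass before comparing texts by eye, you can look for text files with no content, which likely means that nothing was extracted for the corresponding image (for example, because the request failed). This is a minimal sketch; list_empty_text_files is a hypothetical helper name.

```shell
# Hypothetical helper: prints the paths of empty .txt files in a directory tree.
# An empty file likely means that no text was extracted for that image.
list_empty_text_files() {
  find "$1" -type f -name '*.txt' -size 0
}

# Example, after unpacking the downloaded archive:
#   tar -xf my_pictures_text.tar
#   list_empty_text_files my_pictures_text
```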
How to delete the resources you created
To stop paying for the resources you created: