Recognizing text in image archives in Yandex Vision OCR
Use the Yandex Vision OCR service to recognize text in images. You can also store both the source images and recognition results in Yandex Object Storage.
To set up an infrastructure for text recognition using Vision OCR and export the results automatically to Object Storage:
- Prepare your cloud.
- Create a bucket.
- Create a VM.
- Configure the VM.
- Create an archive with images.
- Prepare a script for recognition and uploading of images.
- Double-check the recognition results.
If you no longer need these resources, delete them.
Before you begin
Sign up for Yandex Cloud and create a billing account:
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page
Learn more about clouds and folders.
Required paid resources
The infrastructure costs for image recognition and data storage include:
- A fee for VM computing resources (see Yandex Compute Cloud pricing).
- A fee for data storage in a bucket and operations with data (see Yandex Object Storage pricing).
- A fee for using a dynamic or a static public IP (see Yandex Virtual Private Cloud pricing).
- A fee for using Yandex Vision OCR (see pricing for Yandex Vision OCR).
Create a bucket
To create an Object Storage bucket to store the source images and recognition results:
- Go to the Yandex Cloud management console and select the folder where you will perform the operations.
- On the folder page, click Create resource and select Bucket.
- In the Name field, enter the bucket name following the naming conventions, such as vision-bucket.
- In the Bucket access field, select Restricted.
- In the Storage class field, select Cold.
- Click Create bucket.
Create a VM
- In the management console, click Create resource and select Virtual machine.
- In the Name field, enter a name for the VM, such as vision-vm. For naming requirements, see below:
  - It must be 2 to 63 characters long.
  - It may contain lowercase Latin letters, numbers, and hyphens.
  - It must start with a letter and cannot end with a hyphen.
- Select an availability zone to place the VM in.
- Under Image/boot disk selection, go to the Cloud Marketplace tab and select a public CentOS 7 image.
- Under Disks and file storages, select the parameters:
  - Type: SSD.
  - Size: 19 GB.
- Under Computing resources, select:
  - Platform: Intel Cascade Lake.
  - Guaranteed vCPU share: 20%.
  - vCPU: 2.
  - RAM: 2 GB.
- Under Network settings, select the network and subnet to connect the VM to. If there aren't any networks, create one:
  - Select Create network.
  - In the window that opens, enter the network name and the folder to host the network.
  - (optional) To automatically create subnets, select the Create subnets option.
  - Click Create.
  Each network must have at least one subnet. If there is no subnet, create one by selecting Add subnet.
- In the Public address field, keep Auto to assign your VM a random external IP address from the Yandex Cloud pool, or select a static address from the list if you reserved one in advance.
- Enter the VM access information:
  - Enter the username in the Login field.
  - In the SSH key field, paste the contents of the public key file.
  You will need to create a key pair for the SSH connection yourself, see Creating an SSH key pair.
- Click Create VM.
- Wait for the VM status to change to Running and save its public IP address: you'll need it for SSH connection.
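The VM naming requirements listed above can be checked in the shell before you fill in the form. The following is only a minimal sketch: the function name is_valid_vm_name is hypothetical and is not part of any Yandex Cloud tool, and the management console remains the authority on what it accepts.

```bash
#!/bin/bash
# Hypothetical helper: check a candidate VM name against the rules above
# (2 to 63 characters; lowercase Latin letters, numbers, and hyphens;
# starts with a letter; does not end with a hyphen).
is_valid_vm_name() {
  # Use the C locale so that [a-z] matches lowercase ASCII letters only.
  local LC_ALL=C
  local re='^[a-z][a-z0-9-]{0,61}[a-z0-9]$'
  [[ $1 =~ $re ]]
}

# Usage:
#   is_valid_vm_name vision-vm && echo "The name fits the requirements"
```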
Configure the VM
Set up the Yandex Cloud CLI
- Connect to the VM via SSH.
- Make sure that the Yandex Cloud CLI runs correctly. Run the following command on the VM:
  yc config list
  Result:
  token: AQ...gs
  cloud-id: b1gdtdqb1900f5rqqvli
  folder-id: b1gveg9vude9g3uioa50
  Save the folder-id parameter: you'll need it to set up a service account.
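Instead of copying the folder ID by hand, you could pull it out of the command output. This is a minimal sketch that assumes the output has the key-and-value layout shown in the result above; the function name extract_folder_id is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: read `yc config list` output on stdin and print
# the value of the folder-id key.
extract_folder_id() {
  awk '$1 == "folder-id:" { print $2 }'
}

# Usage on the VM (requires a configured yc CLI):
#   FOLDERID=$(yc config list | extract_folder_id)
```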
Set up a service account
- Create a service account:
  yc iam service-account create \
    --name <service_account_name> \
    --description "<service_account_description>"
  Where:
  - --name is the service account name, such as vision-sa.
  - --description is a description of the service account, for example, this is the vision service account.
  Result:
  id: aje6aoc8hccuh5tp55bg
  folder_id: b1gv87ssvu497lpgjh5o
  created_at: "2022-10-12T14:04:43.198559512Z"
  name: vision-sa
  description: this is vision service account
  Save the id parameter: this is the service account ID you'll need in the setup process.
- Assign the editor role to the service account:
  yc resource-manager folder add-access-binding <folder_id> \
    --role editor \
    --subject serviceAccount:<service_account_ID>
  Where:
  - --role: The role assigned.
  - --subject serviceAccount: Service account ID.
- Create a static access key for the service account:
  yc iam access-key create \
    --service-account-id <service_account_ID> \
    --description "<key_description>"
  Where:
  - --service-account-id: Service account ID.
  - --description: A description for the key, for example, this key is for vision.
  Result:
  access_key:
    id: ajen8d7fur27bt8losom
    service_account_id: aje6aoc8hccuh5tp55bg
    created_at: "2022-10-12T15:08:08.045280520Z"
    description: this key is for vision
    key_id: YC...li
  secret: YC...J5
  Save the following parameters (you'll need them to set up the AWS CLI utility):
  - key_id: The ID of the static access key.
  - secret: The secret key.
- Create an authorized key for the service account:
  yc iam key create \
    --service-account-id <service_account_ID> \
    --output key.json
  Where:
  - --service-account-id: Service account ID.
  - --output: The name of the JSON file with the authorized key.
  Result:
  id: aje3qc9pagb9kedkhdn5
  service_account_id: aje6aoc8hccuh5tp55bg
  created_at: "2022-10-13T12:53:04.810240976Z"
  key_algorithm: RSA_2048
- Create a Yandex Cloud CLI profile to run on behalf of the service account, such as vision-profile:
  yc config profile create vision-profile
  Result:
  Profile 'vision-profile' created and activated
- Specify the authorized key of the service account in the profile configuration:
  yc config set service-account-key key.json
- Get an IAM token for the service account:
  yc iam create-token
  Save the IAM token: you'll need it to send images to Vision OCR.
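Because IAM tokens expire, you may prefer to request one right before use rather than paste a saved value. The sketch below wraps the yc iam create-token command from the last step; the function name refresh_iam_token is hypothetical, and IAMTOKEN is the variable name this guide uses later for the script.

```bash
#!/bin/bash
# Hypothetical helper: request a fresh IAM token through the active yc
# profile and export it as IAMTOKEN. Fails if yc fails or prints nothing.
refresh_iam_token() {
  local token
  token=$(yc iam create-token) || return 1
  [ -n "$token" ] || return 1
  export IAMTOKEN="$token"
}

# Usage on the VM (requires the profile configured above):
#   refresh_iam_token || echo "Could not get an IAM token" >&2
```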
Set up the AWS CLI
- Update the packages installed in the VM operating system. To do this, run the command:
  sudo yum update -y
- Install the AWS CLI:
  sudo yum install awscli -y
- Set up the AWS CLI:
  aws configure
  Specify the parameter values:
  - AWS Access Key ID: The key_id of the static access key that you generated when setting up the service account.
  - AWS Secret Access Key: The secret key that you generated when setting up the service account.
  - Default region name: ru-central1.
  - Default output format: json.
- Make sure that the ~/.aws/credentials file contains the correct values for key_id and secret:
  cat ~/.aws/credentials
- Make sure that the ~/.aws/config file contains the correct values for Default region name and Default output format:
  cat ~/.aws/config
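The credentials check in the steps above can also be scripted. This is a minimal sketch, assuming the standard AWS CLI credentials file layout with aws_access_key_id and aws_secret_access_key entries; the function name has_aws_credentials is hypothetical. It only confirms that both entries are present and non-empty, not that the values are valid.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if the credentials file has non-empty
# aws_access_key_id and aws_secret_access_key entries.
has_aws_credentials() {
  local file="${1:-$HOME/.aws/credentials}"
  [ -r "$file" ] || return 1
  grep -Eq '^[[:space:]]*aws_access_key_id[[:space:]]*=[[:space:]]*[^[:space:]]+' "$file" &&
    grep -Eq '^[[:space:]]*aws_secret_access_key[[:space:]]*=[[:space:]]*[^[:space:]]+' "$file"
}

# Usage:
#   has_aws_credentials || echo "Run 'aws configure' first" >&2
```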
Create an archive with images
- Upload your images that include recognizable text to the bucket.
  Tip
  Use the sample image of the penguin crossing road sign.
- To make sure that the images were uploaded, list the bucket contents:
  aws --endpoint-url=https://storage.yandexcloud.net s3 ls s3://<bucket_name>/
- Save the images from the bucket to the VM, for example, to the my_pictures folder:
  aws --endpoint-url=https://storage.yandexcloud.net s3 cp s3://<bucket_name>/ my_pictures --recursive
- Pack the images into an archive, for example, my_pictures.tar:
  tar -cf my_pictures.tar my_pictures/*
- Delete the image directory:
  rm -rfd my_pictures
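The last two steps above delete the image directory right after packing it. If you want a safety check in between, the following sketch removes the directory only when the archive lists as many files as the directory holds. The function name archive_images is hypothetical, and the count assumes file names without newline characters.

```bash
#!/bin/bash
# Hypothetical helper: pack a directory into <dir>.tar and delete the
# directory only if the archive lists as many files as the directory holds.
archive_images() {
  local dir="${1%/}"
  local archive="$dir.tar"
  local expected actual
  [ -d "$dir" ] || return 1
  expected=$(find "$dir" -type f | wc -l)
  tar -cf "$archive" "$dir" || return 1
  # Directory entries in the listing end with "/", so skip them.
  actual=$(tar -tf "$archive" | grep -vc '/$')
  [ "$expected" -eq "$actual" ] || return 1
  rm -rf "$dir"
}

# Usage:
#   archive_images my_pictures
```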
Prepare a script for digitizing and uploading images
Configure the environment
- Install the jq package. The script will use it to process the results from Vision OCR:
  sudo yum install jq -y
- Install the nano text editor:
  sudo yum install nano -y
- Set the environment variables necessary for the script to run:
  export BUCKETNAME="<bucket_name>"
  export FOLDERID="<folder_id>"
  export IAMTOKEN="<IAM_token>"
  Where:
  - BUCKETNAME: The bucket name.
  - FOLDERID: The folder ID.
  - IAMTOKEN: The IAM token that you issued when setting up the service account.
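A script that runs with one of these variables missing would send malformed requests or upload to the wrong place, so it can help to check them first. This is a minimal sketch; the function name require_env is hypothetical. It relies on printenv, which only sees exported variables, which is also what a script started as a separate process would see.

```bash
#!/bin/bash
# Hypothetical helper: report every listed variable that is not exported
# or is empty, and fail if at least one is missing.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "Missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_env BUCKETNAME FOLDERID IAMTOKEN || exit 1
```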
Create a script
The script includes the following steps:
- Create the relevant directories.
- Unpack the archive with images.
- Process all the images one by one:
  - Base64-encode the image.
  - Create a request body for the given image.
  - Send the image in a POST request to Vision OCR for recognition.
  - Save the result to the output.json file.
  - Extract the recognized text from output.json and save it to a text file.
- Add the resulting text files to an archive.
- Upload the archive with the text files to Object Storage.
- Delete the auxiliary files.
For your convenience, the script text includes a comment for each step.
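The step that saves recognized text to a text file names that file after the image. Taken in isolation, the naming rule can be sketched as follows; the function name text_file_for is hypothetical, and the my_pictures_text directory name is the one the script uses.

```bash
#!/bin/bash
# Hypothetical helper: map an image path to the text file that will hold
# its recognized text by dropping the directory and the last extension.
text_file_for() {
  local base
  base=$(basename -- "$1")
  printf '%s\n' "my_pictures_text/${base%.*}.txt"
}

# Usage:
#   text_file_for my_pictures/sign.jpg
```

Note that two images that differ only in their extension, such as sign.jpg and sign.png, map to the same text file under this rule, so the later one overwrites the earlier one.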
To implement the script:
- Create a file, for example, vision.sh, and open it in the nano text editor:
  sudo nano vision.sh
- Copy the script text to vision.sh:

#!/bin/bash

# Create the relevant directories
echo "Creating directories..."
# Create a directory for the recognized text
mkdir my_pictures_text

# Unpack the archive with images (this restores the my_pictures directory)
echo "Extracting pictures to the my_pictures directory..."
tar -xf my_pictures.tar

# Recognize the images from the archive:
# loop through the files in the directory
for f in my_pictures/*
do
  # Base64-encode the image to send it to Vision OCR.
  # GNU base64 wraps its output at 76 characters by default; -w 0 keeps the
  # value on one line so that it fits into a JSON string.
  CODEIMG=$(base64 -w 0 "$f")

  # Create the body.json file to be sent in a POST request to Vision OCR
  cat <<EOF > body.json
{
  "folderId": "$FOLDERID",
  "analyze_specs": [{
    "content": "$CODEIMG",
    "features": [{
      "type": "TEXT_DETECTION",
      "text_detection_config": {
        "language_codes": ["en","ru"]
      }
    }]
  }]
}
EOF

  # Send the image to Vision OCR for recognition and write the result to output.json
  echo "Processing file $f in Vision OCR..."
  curl -X POST --silent \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${IAMTOKEN}" \
    -d '@body.json' \
    https://vision.api.cloud.yandex.net/vision/v1/batchAnalyze > output.json

  # Get the image file name to be used below
  IMAGE_BASE_NAME=$(basename -- "$f")
  IMAGE_NAME="${IMAGE_BASE_NAME%.*}"

  # Get text data from output.json and write it to a TXT file
  # named identically with the image file
  jq -r '.results[].results[].textDetection.pages[].blocks[].lines[].words[].text' output.json \
    | awk -v ORS=" " '{print}' > "my_pictures_text/$IMAGE_NAME.txt"
done

# Add the directory with the text files to an archive
echo "Packing text files to archive..."
tar -cf my_pictures_text.tar my_pictures_text

# Send the text file archive to the bucket
echo "Sending archive to Object Storage Bucket..."
aws --endpoint-url=https://storage.yandexcloud.net s3 cp my_pictures_text.tar "s3://$BUCKETNAME/" > /dev/null

# Delete the auxiliary files
echo "Cleaning up..."
rm -f body.json
rm -f output.json
rm -rf my_pictures
rm -rf my_pictures_text
rm -f my_pictures_text.tar

- Set the permissions to run the script:
  sudo chmod 755 vision.sh
- Run the script:
  ./vision.sh
Double-check the recognition results
- In the Yandex Cloud management console, select the folder where the bucket with the recognition results is located.
- Select Object Storage.
- Open the bucket with the recognition results.
- Make sure that the bucket contains the my_pictures_text.tar archive.
- Download and unpack the archive.
- Make sure that the text in the <image name>.txt file matches the text in the image.
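If you would rather review the results in a terminal, the unpack-and-read part of the check above can be sketched as a small function. The name show_recognized_texts is hypothetical; it expects a local copy of the archive, which you could download with the same aws s3 cp command form used earlier in this guide.

```bash
#!/bin/bash
# Hypothetical helper: unpack a results archive into a temporary directory
# and print each text file name followed by its contents.
show_recognized_texts() {
  local archive="$1" tmp
  tmp=$(mktemp -d) || return 1
  tar -xf "$archive" -C "$tmp" || { rm -rf "$tmp"; return 1; }
  find "$tmp" -type f -name '*.txt' | sort | while IFS= read -r f; do
    printf '== %s ==\n' "$(basename -- "$f")"
    cat -- "$f"
    printf '\n'
  done
  rm -rf "$tmp"
}

# Usage:
#   show_recognized_texts my_pictures_text.tar
```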
How to delete created resources
To stop paying for the resources created: