Recognizing text in image archives in Yandex Vision OCR
Using Vision OCR and Yandex Object Storage, you can manage image text recognition and store both the source image archive and recognition results.
To configure a text recognition infrastructure using Vision OCR and automatically export the results to Object Storage:
- Get your cloud ready.
- Create a bucket.
- Create a VM.
- Configure the VM.
- Create an image archive.
- Prepare a script for recognizing and uploading images.
- Double-check the recognition results.
If you no longer need the resources you created, delete them.
Getting started
Sign up for Yandex Cloud and create a billing account:
- Navigate to the management console and log in to Yandex Cloud or create a new account.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and that its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account, create one and link a cloud to it.
If you have an active billing account, you can navigate to the cloud page to create or select a folder for your infrastructure.
Learn more about clouds and folders in the Yandex Cloud documentation.
Required paid resources
The infrastructure costs for image recognition and data storage include:
- Fee for VM computing resources and disks (see Yandex Compute Cloud pricing).
- Fee for data storage in a bucket and data operations (see Object Storage pricing).
- Fee for using a dynamic or static public IP address (see Yandex Virtual Private Cloud pricing).
- Fee for using Vision OCR (see Vision OCR pricing).
Create a bucket
To create an Object Storage bucket to store the source images and recognition results:
- Go to the management console and select the folder where you will perform the operations.
- Select Object Storage.
- Click Create bucket.
- Enter a name for the bucket as per the naming conventions.
- In the Read objects field, select With authorization.
- In the Storage class field, select Cold.
- Click Create bucket.
Create a VM
- On the folder dashboard in the management console, click Create resource and select Virtual machine instance.
- Select Advanced setup.
- Under Boot disk image, in the Product search field, enter CentOS 7 and select a public CentOS 7 image.
- Under Location, select an availability zone where your VM will reside. If you are not sure which one to choose, leave the default.
- Under Disks and file storages, select the SSD disk type and specify its size: 19 GB.
- Under Computing resources, navigate to the Custom tab and specify the platform, number of vCPUs, and amount of RAM:
  - Platform: Intel Cascade Lake
  - vCPU: 2
  - Guaranteed vCPU performance: 20%
  - RAM: 2 GB
- Under Network settings:
  - In the Subnet field, select the network and subnet to connect your VM to. If the required network or subnet is not there, create it.
  - In the Public IP address field, keep Auto to assign the VM a random external IP address from the Yandex Cloud pool, or select a static address from the list if you reserved one.
- Under Access, select SSH key and specify the VM access credentials:
  - In the Login field, enter the username. Do not use root or other reserved usernames. To perform operations requiring root privileges, use the sudo command.
  - In the SSH key field, select the SSH key saved in your organization user profile.
    If there are no SSH keys in your profile or you want to add a new key:
    - Click Add key.
    - Enter a name for the SSH key.
    - Select one of the following:
      - Enter manually: Paste the contents of the public SSH key. You need to create an SSH key pair on your own.
      - Load from file: Upload the public part of the SSH key. You need to create an SSH key pair on your own.
      - Generate key: Automatically create an SSH key pair.
        When adding a new SSH key, an archive containing the key pair will be created and downloaded. On Linux or macOS, unpack the archive to the /home/<user_name>/.ssh directory. On Windows, unpack the archive to the C:\Users\<user_name>\.ssh directory. You do not need to additionally enter the public key in the management console.
    - Click Add.
    The system will add the SSH key to your organization user profile. If the organization has disabled the ability for users to add SSH keys to their profiles, the added public SSH key will only be saved in the user profile inside the newly created resource.
- Under General information, specify the VM name. The naming requirements are as follows:
  - It must be from 2 to 63 characters long.
  - It can only contain lowercase Latin letters, numbers, and hyphens.
  - It must start with a letter and cannot end with a hyphen.
- Click Create VM.
- Wait until the VM status switches to Running and save the VM’s public IP address required for SSH connection.
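The naming requirements above can be checked before you open the console. The following is a minimal sketch only: `is_valid_vm_name` is a hypothetical helper name, and the pattern simply encodes the three rules as they are listed in this section.

```shell
# Hypothetical helper: returns 0 if the name satisfies the three rules listed above.
is_valid_vm_name() {
  # One lowercase letter, then up to 61 letters, digits, or hyphens, then a
  # letter or digit: 2 to 63 characters in total, never ending with a hyphen.
  # LC_ALL=C keeps the [a-z] range limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]{0,61}[a-z0-9]$'
}

# Demo with a few made-up candidate names.
for candidate in vision-vm 1vision vision- Vision; do
  if is_valid_vm_name "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: rejected"
  fi
done
```

The check is purely local and does not replace the validation performed by the management console.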
Configure the VM
Configure the Yandex Cloud CLI
- Connect to the VM over SSH.
- Make sure the Yandex Cloud CLI runs correctly. Run this command on the VM:
    yc config list
  Result:
    token: AQ...gs
    cloud-id: b1gdtdqb1900********
    folder-id: b1gveg9vude9********
  Save the folder ID from the folder-id property. This ID is required for configuring a service account.
Configure a service account
- Create a service account:
    yc iam service-account create \
      --name <service_account_name> \
      --description "<service_account_description>"
  Where:
  - --name: Service account name, e.g., vision-sa.
  - --description: Service account description, e.g., this is vision service account.
  Result:
    id: aje6aoc8hccu********
    folder_id: b1gv87ssvu49********
    created_at: "2022-10-12T14:04:43.198559512Z"
    name: vision-sa
    description: this is vision service account
  Save the service account ID from the id property. This ID is required for further configuration.
- Assign the editor role to the service account:
    yc resource-manager folder add-access-binding <folder_ID> \
      --role editor \
      --subject serviceAccount:<service_account_ID>
  Where:
  - --role: Role you want to assign.
  - --subject serviceAccount: Service account ID.
- Create a static access key for your service account:
    yc iam access-key create \
      --service-account-id <service_account_ID> \
      --description "<key_description>"
  Where:
  - --service-account-id: Service account ID.
  - --description: Key description, e.g., this key is for vision.
  Result:
    access_key:
      id: ajen8d7fur27********
      service_account_id: aje6aoc8hccu********
      created_at: "2022-10-12T15:08:08.045280520Z"
      description: this key is for vision
      key_id: YC...li
    secret: YC...J5
  Save these properties, as you will need them to configure the AWS CLI:
  - key_id: Static access key ID
  - secret: Secret key
- Create an authorized key for the service account:
    yc iam key create \
      --service-account-id <service_account_ID> \
      --output key.json
  Where:
  - --service-account-id: Service account ID.
  - --output: Name of the authorized key JSON file.
  Result:
    id: aje3qc9pagb9********
    service_account_id: aje6aoc8hccu********
    created_at: "2022-10-13T12:53:04.810240976Z"
    key_algorithm: RSA_2048
- Create a Yandex Cloud CLI profile to run under the service account, such as vision-profile:
    yc config profile create vision-profile
  Result:
    Profile 'vision-profile' created and activated
- Specify the service account’s authorized key in the profile configuration:
    yc config set service-account-key key.json
- Get an IAM token for the service account:
    yc iam create-token
  Save the IAM token. You will need it to send images to Vision OCR.
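Several of the steps above ask you to save a value from the command output (id, key_id, secret). If you redirect that output to a file, a small filter can pull out a single property. This is a sketch under assumptions: `yc_field` is a hypothetical helper name, the sample values are made up, and the filter relies only on the "key: value" layout shown in the results above.

```shell
# Hypothetical helper: reads "key: value" output on stdin and prints the value
# of the first line whose key equals $1, ignoring leading indentation.
yc_field() {
  awk -v key="$1" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)
      prefix = key ": "
      if (index(line, prefix) == 1) {
        print substr(line, length(prefix) + 1)
        exit
      }
    }'
}

# Demo on a made-up response with the same layout as the static key result above.
sample='access_key:
  id: aje0000000000000example
  key_id: YCexampleKeyId
secret: YCexampleSecret'

printf '%s\n' "$sample" | yc_field key_id
printf '%s\n' "$sample" | yc_field secret
```

The same filter can read the folder-id line from the yc config list output. Treat any file that holds the secret key as sensitive and remove it once the AWS CLI is configured.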
Configure the AWS CLI
- Update the packages on the VM operating system. To do this, run this command:
    sudo yum update -y
- Install the AWS CLI:
    sudo yum install awscli -y
- Configure the AWS CLI:
    aws configure
  Specify these settings:
  - AWS Access Key ID: Static access key ID (key_id) you got when configuring the service account.
  - AWS Secret Access Key: Secret key (secret) you got when configuring the service account.
  - Default region name: ru-central1.
  - Default output format: json.
- Make sure the ~/.aws/credentials file contains the correct key_id and secret values:
    cat ~/.aws/credentials
- Make sure the ~/.aws/config file contains the correct Default region name and Default output format values:
    cat ~/.aws/config
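To make the two checks above easier to read, here is a sketch of what the files are expected to contain after aws configure. It writes the same two INI-style files by hand into a directory you pass in. `write_aws_profile` is a hypothetical helper name and the key values in the demo are made up; the interactive command above remains the approach this tutorial uses.

```shell
# Hypothetical helper: writes the two files that `aws configure` would create.
# $1: target directory (normally ~/.aws), $2: static key ID, $3: secret key.
write_aws_profile() {
  aws_dir=$1
  mkdir -p "$aws_dir" || return 1
  (
    # Create the credentials file readable by its owner only.
    umask 077
    printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
      "$2" "$3" > "$aws_dir/credentials"
  ) || return 1
  printf '[default]\nregion = ru-central1\noutput = json\n' > "$aws_dir/config"
}

# Demo: write to a scratch directory rather than the real ~/.aws.
demo_dir=$(mktemp -d)
write_aws_profile "$demo_dir" "YCexampleKeyId" "YCexampleSecret"
cat "$demo_dir/credentials" "$demo_dir/config"
```

If the output of cat ~/.aws/credentials and cat ~/.aws/config on your VM has the same shape with your own values, the AWS CLI is configured as expected.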
Create an image archive
- Upload your images that include recognizable text to the bucket.
  Tip
  Need an example? Download an image of the penguin crossing road sign.
- To make sure you have uploaded the images, use a request with the bucket name:
    aws --endpoint-url=https://storage.yandexcloud.net s3 ls s3://<bucket_name>/
- Save the images from the bucket to the VM, e.g., to the my_pictures directory:
    aws --endpoint-url=https://storage.yandexcloud.net s3 cp s3://<bucket_name>/ my_pictures --recursive
- Pack the images into an archive, e.g., my_pictures.tar:
    tar -cf my_pictures.tar my_pictures/*
- Delete the image directory:
    rm -rfd my_pictures
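Because the last step deletes the only local copy of the images, it can be worth confirming that the archive really lists every file first. The following is a sketch, not part of the original procedure: `archive_contains_dir` is a hypothetical helper name, and it assumes you pass the same relative directory path that was used when creating the archive.

```shell
# Hypothetical helper: succeeds only if every regular file in directory $2
# is listed in tar archive $1. Pass the directory exactly as it was archived,
# e.g., my_pictures (relative, without a trailing slash).
archive_contains_dir() {
  archive=$1
  dir=$2
  # Read the member list once; fail if the archive cannot be read.
  listing=$(tar -tf "$archive") || return 1
  for file in "$dir"/*; do
    [ -f "$file" ] || continue
    # -F: fixed string, -x: whole line must match, -q: no output.
    printf '%s\n' "$listing" | grep -Fxq -- "$file" || return 1
  done
  return 0
}

# Usage (run in the directory that holds both the archive and the images):
#   archive_contains_dir my_pictures.tar my_pictures && rm -rf my_pictures
```

With this guard, the directory is removed only when the check passes.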
Prepare a script for digitizing and uploading images
Set up your environment
- Install the epel repository for additional packages:
    sudo yum install epel-release -y
- Install the jq package to process the results from Vision OCR:
    sudo yum install jq -y
- Install the nano text editor:
    sudo yum install nano -y
- Set the environment variables required for the script to run:
    export BUCKETNAME="<bucket_name>"
    export FOLDERID="<folder_ID>"
    export IAMTOKEN="<IAM_token>"
  Where:
  - BUCKETNAME: Bucket name.
  - FOLDERID: Folder ID.
  - IAMTOKEN: IAM token you got when configuring the service account.
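The script depends on all three variables being set in the shell that launches it. A quick check such as the sketch below can catch an empty or forgotten variable before any request is sent. `missing_vars` is a hypothetical helper name; only the variable names come from this section.

```shell
# Hypothetical helper: prints the names of required variables that are unset
# or empty, separated by spaces. Prints an empty line if all are set.
missing_vars() {
  missing=""
  [ -n "${BUCKETNAME:-}" ] || missing="$missing BUCKETNAME"
  [ -n "${FOLDERID:-}" ] || missing="$missing FOLDERID"
  [ -n "${IAMTOKEN:-}" ] || missing="$missing IAMTOKEN"
  # Drop the leading space left by the concatenation above.
  echo "${missing# }"
}

# Usage:
#   unset_names=$(missing_vars)
#   [ -z "$unset_names" ] || echo "Set these variables first: $unset_names"
```

The check only verifies that the values are non-empty. It does not verify that the bucket exists or that the IAM token is still valid.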
Create a script
The script includes these steps:
- Create the appropriate directories.
- Unpack the image archive.
- Process all images one by one:
  - Encode the image as Base64.
  - Create a request body for the specific image.
  - Send the image in a POST request to Vision OCR for recognition.
  - Save the result to the output.json file.
  - Extract the recognized text from output.json and save it to a text file.
- Pack the text files you got into an archive.
- Upload the text files archive to Object Storage.
- Delete the auxiliary files.
For your convenience, the script text includes comments for each step.
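The encoding and request body steps can be sketched in isolation as two small functions. This is an illustration under assumptions: `mime_type_for` and `make_request_body` are hypothetical helper names, and the PNG and PDF values are assumed to be accepted by the API in the same way as the JPEG value used in this tutorial. The field names mirror the request body used by the script.

```shell
# Hypothetical helper: maps a file extension to a mimeType value.
# Only JPEG appears in this tutorial; PNG and PDF are assumptions.
mime_type_for() {
  case $(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]') in
    jpg|jpeg) echo "JPEG" ;;
    png)      echo "PNG" ;;
    pdf)      echo "PDF" ;;
    *)        return 1 ;;
  esac
}

# Hypothetical helper: prints the JSON request body for one image file.
make_request_body() {
  mime=$(mime_type_for "$1") || return 1
  # Strip newlines so the Base64 payload is a single-line JSON string.
  content=$(base64 < "$1" | tr -d '\n')
  printf '{\n  "mimeType": "%s",\n  "languageCodes": ["*"],\n  "model": "page",\n  "content": "%s"\n}\n' \
    "$mime" "$content"
}

# Usage:
#   make_request_body my_pictures/sign.jpg > body.json
```

Deriving the type from the extension is a convenience only. A file whose extension does not match its actual format will still be sent with the wrong type.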
To implement the script:
- Create a file, e.g., vision.sh, and open it in the nano text editor:
    sudo nano vision.sh
- Copy the Bash script text to vision.sh:
    #!/bin/bash

    # Create the appropriate directories.
    echo "Creating directories..."

    # Create a directory for recognized text.
    mkdir -p my_pictures_text

    # Unpack the image archive to the directory you created.
    echo "Extract pictures in my_pictures directory..."
    tar -xf my_pictures.tar

    # Recognize the images from the archive.
    # For each file in the directory, perform these actions in a loop:
    for f in my_pictures/*
    do
      # Encode the image as Base64 for sending it to Vision OCR.
      # Newlines are stripped so that the encoded string stays valid inside JSON.
      CODEIMG=$(base64 < "$f" | tr -d '\n')

      # Create a `body.json` file to send to Vision OCR in a POST request.
      printf '{"mimeType": "JPEG", "languageCodes": ["*"], "model": "page", "content": "%s"}\n' "$CODEIMG" > body.json

      # Send the image to Vision OCR for recognition and write the result to the `output.json` file.
      echo "Processing file $f in Vision..."
      curl --request POST \
        --header "Content-Type: application/json" \
        --header "Authorization: Bearer ${IAMTOKEN}" \
        --header "x-data-logging-enabled: true" \
        --header "x-folder-id: ${FOLDERID}" \
        --data '@body.json' \
        https://ocr.api.cloud.yandex.net/ocr/v1/recognizeText \
        --output output.json

      # Get the image file name to use it later.
      IMAGE_BASE_NAME=$(basename -- "$f")
      IMAGE_NAME="${IMAGE_BASE_NAME%.*}"

      # Get the text data from the `output.json` file and write it to a .txt file with the same name as the image file.
      jq -r '.result[].blocks[].lines[].text' output.json | awk -v ORS=" " '{print}' > "my_pictures_text/${IMAGE_NAME}.txt"
    done

    # Archive the contents of the text file directory.
    echo "Packing text files to archive..."
    tar -cf my_pictures_text.tar my_pictures_text

    # Move the text file archive to your bucket.
    echo "Sending archive to Object Storage Bucket..."
    aws --endpoint-url=https://storage.yandexcloud.net s3 cp my_pictures_text.tar "s3://${BUCKETNAME}/" > /dev/null

    # Delete the auxiliary files.
    echo "Cleaning up..."
    rm -f body.json
    rm -f output.json
    rm -rf my_pictures
    rm -rf my_pictures_text
    rm -f my_pictures_text.tar
- Set the permissions to run the script:
    sudo chmod 755 vision.sh
- Run the script:
    ./vision.sh
Double-check the recognition results
- In the Yandex Cloud management console, select the folder containing the bucket with the recognition results.
- Select Object Storage.
- Open the bucket with the recognition results.
- Make sure the bucket contains the my_pictures_text.tar archive.
- Download and unpack the archive.
- Make sure the text in the <image_name>.txt files matches that in the respective images.
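Before comparing texts by eye, you may want to find the images for which nothing was recognized at all. The sketch below unpacks the downloaded archive into a temporary directory and lists the text files that contain only whitespace. `list_empty_results` is a hypothetical helper name, and the my_pictures_text directory name is taken from the script in this tutorial.

```shell
# Hypothetical helper: unpacks results archive $1 into a temporary directory
# and prints the names of text files with no non-whitespace characters.
list_empty_results() {
  workdir=$(mktemp -d) || return 1
  tar -xf "$1" -C "$workdir" || return 1
  for txt in "$workdir"/my_pictures_text/*.txt; do
    [ -f "$txt" ] || continue
    # grep succeeds if the file has at least one non-whitespace character.
    grep -q '[^[:space:]]' "$txt" || basename "$txt"
  done
  rm -rf "$workdir"
}

# Usage:
#   list_empty_results my_pictures_text.tar
```

An empty text file does not necessarily mean an error: the image may simply contain no text, so the listed files are candidates for a manual check only.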
How to delete the resources you created
To stop paying for the resources you created: