Creating an AI assistant with search through PDF files with complex formatting
The AI Assistant API feature is at the Preview stage.
This tutorial provides an example of creating an AI assistant that searches through information in a source containing complex formatting and tabular data and analyzes the retrieved data using the YandexGPT Pro RC text generation model.
A PDF file serves as the information source.
You will access the AI Assistant API and Text Generation API through Yandex Cloud ML SDK for Python.
Getting started
- Create a service account and assign the ai.assistants.editor and ai.languageModels.user roles to it.
- Get the service account API key and save it.
  The following examples use API key authentication. Yandex Cloud ML SDK also supports IAM token and OAuth token authentication. For more information, see Authentication in Yandex Cloud ML SDK.
- Install the Python venv utility to create isolated virtual environments in Python.
- Create and activate a new Python virtual environment:
    python3 -m venv new-env
    source new-env/bin/activate
- Using the pip package manager, install the docling library to convert documents from PDF to Markdown format:
    pip install docling
- Install the Yandex Cloud ML SDK library to work with the Yandex Foundation Models tools:
    pip install yandex-cloud-ml-sdk
- Download the assistant-example.pdf file with the document to use as the information source for the AI assistant and convert it to Markdown format using the docling library you installed earlier:
  - Create a file named convert.py containing the following code:
      from docling.document_converter import DocumentConverter

      # Convert the PDF and export the parsed document as Markdown.
      converter = DocumentConverter()
      result = converter.convert("assistant-example.pdf")
      content = result.document.export_to_markdown()

      # Write UTF-8 explicitly so that non-ASCII characters survive on any platform.
      with open("assistant-example.md", "wt", encoding="utf-8") as f:
          f.write(content)
  - Execute the convert.py file:
      python3 convert.py
    As a result, the assistant-example.md file with the content of the original PDF file in Markdown format will be saved in the current directory.
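Because the source document contains tabular data, it can be worth checking that tables survived the conversion before building an index. The following is a minimal sketch, not part of the original tutorial: it assumes the exported Markdown uses pipe-delimited tables, and `count_table_rows` is a hypothetical helper name introduced for this illustration.

```python
from pathlib import Path


def count_table_rows(markdown_text: str) -> int:
    """Count lines that look like rows of a pipe-delimited Markdown table.

    Separator rows such as |---|:---:| are not counted.
    """
    rows = 0
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped.strip("|").split("|")]
            # A separator row consists only of dashes and colons in every cell.
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            rows += 1
    return rows


if __name__ == "__main__":
    path = Path("assistant-example.md")
    if path.exists():
        text = path.read_text(encoding="utf-8")
        print(f"{count_table_rows(text)} table rows found in {path}")
    else:
        print(f"{path} not found; run convert.py first")
```

A count of zero for a document known to contain tables suggests that the conversion step needs a closer look before the file is uploaded.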
Required paid resources
The cost of using an AI assistant includes a text generation fee (see Yandex Foundation Models pricing).
Create an AI assistant and test it
You will create a search index from the Markdown file and attach it to an AI assistant. You will then ask the assistant a few questions about the information source.
Create an AI assistant with a text search index
- Create a file named assistant-text.py containing the following code:
    from yandex_cloud_ml_sdk import YCloudML
    from yandex_cloud_ml_sdk.search_indexes import (
        HybridSearchIndexType,
        TextSearchIndexType,
        StaticIndexChunkingStrategy,
        ReciprocalRankFusionIndexCombinationStrategy,
    )

    queries = [
        "What services are included in Ridetech?",
        "What is the adjusted EBITDA margin of ad services?",
        "For how many billion rubles did Yandex issue loans in 9 months to September 30, 2024?",
    ]


    def main():
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )

        # Upload the Markdown file that serves as the information source.
        file = sdk.files.upload("<file_path>")

        # Create a text search index and wait for the operation to complete.
        operation = sdk.search_indexes.create_deferred(
            [file], index_type=TextSearchIndexType()
        )
        text_index = operation.wait()

        # Wrap the index in a search tool and attach it to the assistant.
        text_tool = sdk.tools.search_index(text_index)
        model = sdk.models.completions("yandexgpt", model_version="rc")
        assistant = sdk.assistants.create(model, tools=[text_tool])
        text_index_thread = sdk.threads.create()

        # Ask each question in the same thread and print the answer.
        for query in queries:
            text_index_thread.write(query)
            print(query)
            run = assistant.run(text_index_thread)
            result = run.wait().message
            for part in result.parts:
                print(part)
            print("\n")


    if __name__ == "__main__":
        main()
  Where:
  - folder_id: ID of the folder where you created the service account.
  - auth: Service account API key you created earlier. You use an API key for authentication in the Yandex Cloud API. To learn more about authentication methods in Yandex Cloud ML SDK, see Authentication.
  - <file_path>: Path to the assistant-example.md file with source data for the search index.
- Run the assistant-text.py file to create an assistant and get answers to questions from it:
    python3 assistant-text.py
  Result:
    What services are included in Ridetech?
    Ridetech includes:
    * Online taxi ordering service
    * Yandex Drive carsharing service
    * Scooter rental service
    * Other promising services

    What is the adjusted EBITDA margin of ad services?
    Ad services have an adjusted EBITDA margin of 51.3%.

    For how many billion rubles did Yandex issue loans in 9 months to September 30, 2024?
    In the provided context, there is no information about how many billion rubles Yandex issued in loans over the nine months prior to September 30, 2024.
As the result shows, the text search handled only the first, simplest question correctly. The answer to the second question is incorrect, and the third question got no answer at all.
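The script takes the folder ID and the API key as string literals passed to `folder_id` and `auth`. To keep the key out of the source file, the values can be read from the environment instead. The following is a minimal sketch under stated assumptions: `load_credentials` is a hypothetical helper, and the variable names `YC_FOLDER_ID` and `YC_API_KEY` are names chosen for this illustration, not names taken from the tutorial.

```python
import os


def load_credentials(env=None):
    """Return (folder_id, api_key) read from environment variables.

    The variable names are assumptions made for this sketch.
    Pass a dict as `env` to test the function without touching os.environ.
    """
    env = os.environ if env is None else env
    names = ("YC_FOLDER_ID", "YC_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["YC_FOLDER_ID"], env["YC_API_KEY"]


if __name__ == "__main__":
    # Demo with an in-memory mapping instead of real credentials.
    demo_env = {"YC_FOLDER_ID": "<folder_ID>", "YC_API_KEY": "<API_key>"}
    folder_id, api_key = load_credentials(demo_env)
    print("Folder ID loaded:", folder_id)

    # With real values exported in the shell, the SDK call from the script
    # would become (requires yandex-cloud-ml-sdk):
    #   folder_id, api_key = load_credentials()
    #   sdk = YCloudML(folder_id=folder_id, auth=api_key)
```

The helper fails early with a readable message when a variable is unset, which is easier to diagnose than an authentication error raised later by a request.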
Create an AI assistant with a hybrid search index
- Create a file named assistant-hybrid.py containing the following code:
    from yandex_cloud_ml_sdk import YCloudML
    from yandex_cloud_ml_sdk.search_indexes import (
        HybridSearchIndexType,
        TextSearchIndexType,
        StaticIndexChunkingStrategy,
        ReciprocalRankFusionIndexCombinationStrategy,
    )

    queries = [
        "What services are included in Ridetech?",
        "What is the adjusted EBITDA margin of ad services?",
        "For how many billion rubles did Yandex issue loans in 9 months to September 30, 2024?",
    ]


    def main():
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )

        # Upload the Markdown file that serves as the information source.
        file = sdk.files.upload("<file_path>")

        # Create a hybrid search index with explicit chunking and
        # rank combination settings, then wait for the operation to complete.
        operation = sdk.search_indexes.create_deferred(
            [file],
            index_type=HybridSearchIndexType(
                chunking_strategy=StaticIndexChunkingStrategy(
                    max_chunk_size_tokens=512,
                    chunk_overlap_tokens=256,
                ),
                combination_strategy=ReciprocalRankFusionIndexCombinationStrategy(),
            ),
        )
        hybrid_index = operation.wait()

        # Wrap the index in a search tool and attach it to the assistant.
        hybrid_tool = sdk.tools.search_index(hybrid_index)
        model = sdk.models.completions("yandexgpt", model_version="rc")
        assistant = sdk.assistants.create(model, tools=[hybrid_tool])
        hybrid_index_thread = sdk.threads.create()

        # Ask each question in the same thread and print the answer.
        for query in queries:
            hybrid_index_thread.write(query)
            print(query)
            run = assistant.run(hybrid_index_thread)
            result = run.wait().message
            for part in result.parts:
                print(part)
            print("\n")


    if __name__ == "__main__":
        main()
  Where:
  - folder_id: ID of the folder where you created the service account.
  - auth: Service account API key you created earlier. You use an API key for authentication in the Yandex Cloud API. To learn more about authentication methods in Yandex Cloud ML SDK, see Authentication.
  - <file_path>: Path to the assistant-example.md file with source data for the search index.
- Run the assistant-hybrid.py file to create an assistant and get answers to questions from it:
    python3 assistant-hybrid.py
  Result:
    What services are included in Ridetech?
    Ridetech includes:
    * Online taxi ordering service and distribution of such technologies abroad.
    * Yandex Drive carsharing service.
    * Scooter rental service.
    * Other promising services.

    What is the adjusted EBITDA margin of ad services?
    The adjusted EBITDA margin of ad services over the third quarter of 2024 was 6.5%.

    For how many billion rubles did Yandex issue loans in 9 months to September 30, 2024?
    Over the nine months prior to September 30, 2024, Yandex issued loans worth 6.5 billion rubles.
As you can see, the hybrid search provided a better result: the answers to all three questions are correct.
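The hybrid index is configured with `ReciprocalRankFusionIndexCombinationStrategy`. The sketch below shows the commonly published form of reciprocal rank fusion, in which each result list contributes `1 / (k + rank)` to a document's score. It illustrates the general technique only: the constant `k = 60`, the chunk identifiers, and the two input rankings are assumptions for this example, and the service's actual implementation and parameters may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings of the same chunks from two search methods.
text_hits = ["chunk_a", "chunk_b", "chunk_c"]
vector_hits = ["chunk_c", "chunk_a", "chunk_d"]

fused = reciprocal_rank_fusion([text_hits, vector_hits])
print(fused)  # ['chunk_a', 'chunk_c', 'chunk_b', 'chunk_d']
```

Chunks that appear near the top of both lists (`chunk_a`, `chunk_c`) outrank chunks found by only one method, which is the property that makes rank fusion useful when individual rankings disagree.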
Refine your assistant
Expand the capabilities of the assistant you created, turning it from a simple lookup tool into an assistant that analyzes the retrieved data and gives an evaluation based on it.
- Create a file named assistant-analyst.py containing the following code:
    from yandex_cloud_ml_sdk import YCloudML
    from yandex_cloud_ml_sdk.search_indexes import (
        HybridSearchIndexType,
        TextSearchIndexType,
        StaticIndexChunkingStrategy,
        ReciprocalRankFusionIndexCombinationStrategy,
    )

    query = "What is the adjusted EBITDA margin of ad services?"

    instruction = """
    You are a highly qualified investment assistant. Your task is to help the user analyze companies based on financial reports for investors.
    At the input, you will be given a user query and pieces of reports, based on which you need to give an answer to the user's question and add an evaluation as an investment assistant.
    If you need to output the value of a metric in response to a user query, you should interpret the value. Ideally, briefly explain the meaning of the metric and give an evaluation as to whether the metric value is large/small.
    """


    def main():
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )

        # Upload the Markdown file that serves as the information source.
        file = sdk.files.upload("<file_path>")

        # Create a hybrid search index with larger chunks than in the previous example.
        operation = sdk.search_indexes.create_deferred(
            [file],
            index_type=HybridSearchIndexType(
                chunking_strategy=StaticIndexChunkingStrategy(
                    max_chunk_size_tokens=1024,
                    chunk_overlap_tokens=512,
                ),
                combination_strategy=ReciprocalRankFusionIndexCombinationStrategy(),
            ),
        )
        hybrid_index = operation.wait()

        # Attach the search tool and pass the instruction to the assistant.
        hybrid_tool = sdk.tools.search_index(hybrid_index)
        model = sdk.models.completions("yandexgpt", model_version="rc")
        assistant = sdk.assistants.create(
            model, tools=[hybrid_tool], instruction=instruction
        )
        hybrid_index_thread = sdk.threads.create()

        hybrid_index_thread.write(query)
        print(query)
        run = assistant.run(hybrid_index_thread)
        result = run.wait().message
        for part in result.parts:
            print(part)
        print("\n")


    if __name__ == "__main__":
        main()
  Where:
  - folder_id: ID of the folder where you created the service account.
  - auth: Service account API key you created earlier. You use an API key for authentication in the Yandex Cloud API. To learn more about authentication methods in Yandex Cloud ML SDK, see Authentication.
  - <file_path>: Path to the assistant-example.md file with source data for the search index.
- Run the assistant-analyst.py file to create an assistant and get an answer to the question from it:
    python3 assistant-analyst.py
  Result:
    What is the adjusted EBITDA margin of ad services?
    The adjusted EBITDA margin of ad services over the third quarter of 2024 was 6.5%. This is a significant improvement compared to the same period last year, when the profit margin was 3.3%. This growth indicates the efficiency of the segment and the improvement of its financial performance.
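The analyst script sets `max_chunk_size_tokens=1024` and `chunk_overlap_tokens=512`, twice the values used in the hybrid search example. The sketch below illustrates how fixed-size chunking with overlap splits a token sequence and how the two settings change the number of chunks. It is an illustration under stated assumptions: `chunk_tokens` is a hypothetical helper, integers stand in for real tokens, the 3000-token length is arbitrary, and the service's actual tokenizer and chunking logic may differ.

```python
def chunk_tokens(tokens, max_chunk_size, overlap):
    """Split a token list into fixed-size chunks sharing `overlap` tokens."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be in [0, max_chunk_size)")
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        # Stop once the current chunk reaches the end of the sequence.
        if start + max_chunk_size >= len(tokens):
            break
    return chunks


# Integers stand in for tokens of a hypothetical 3000-token document.
tokens = list(range(3000))

small = chunk_tokens(tokens, max_chunk_size=512, overlap=256)
large = chunk_tokens(tokens, max_chunk_size=1024, overlap=512)

print(len(small), len(large))  # 11 5
```

Larger chunks keep more surrounding context together, such as a whole table with its caption, at the cost of fewer and coarser retrieval units. Which setting works better depends on the document and is worth checking empirically.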