© 2025 Direct Cursus Technology L.L.C.
Creating an AI assistant with search through PDF files with complex formatting

Written by
Yandex Cloud
Updated on April 11, 2025
  • Getting started
    • Required paid resources
  • Create an AI assistant and test it
    • Create an AI assistant with a text search index
    • Create an AI assistant with a hybrid search index
    • Refine your assistant

The AI Assistant API feature is at the Preview stage.

This tutorial provides an example of creating an AI assistant that searches through information in a source containing complex formatting and tabular data and analyzes the retrieved data using the YandexGPT Pro RC text generation model.

A PDF file serves as the example of a source with complex formatting. You will convert it to Markdown first: many models are trained on Markdown data and tend to handle it well, which makes it a convenient source format for a search index. You will then create two search indexes from the converted file, one using text search and one using hybrid search.

You will access the AI Assistant API and Text Generation API through Yandex Cloud ML SDK for Python.

Getting started

  1. Create a service account and assign the ai.assistants.editor and ai.languageModels.user roles to it.

  2. Get the service account API key and save it.

    The following examples use API key authentication. Yandex Cloud ML SDK also supports IAM token and OAuth token authentication. For more information, see Authentication in Yandex Cloud ML SDK.

  3. Install Python 3.10 or 3.11.

  4. Install the Python venv utility to create isolated virtual environments in Python.

  5. Create and activate a new Python virtual environment:

    python3 -m venv new-env
    source new-env/bin/activate
    
  6. Using the pip package manager, install the docling library to convert documents from PDF to Markdown format:

    pip install docling
    
  7. Install the Yandex Cloud ML SDK library to work with the Yandex Foundation Models tools:

    pip install yandex-cloud-ml-sdk
    
  8. Download the assistant-example.pdf file with the document to use as the information source for the AI assistant and convert it to Markdown format using the docling library you installed earlier:

    1. Create a file named convert.py containing the following code:

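      from docling.document_converter import DocumentConverter
      
      # Parse the source PDF file.
      converter = DocumentConverter()
      result = converter.convert("assistant-example.pdf")
      # Export the parsed document as Markdown text.
      content = result.document.export_to_markdown()
      # Write UTF-8 explicitly so the output does not depend on the platform default.
      with open("assistant-example.md", "w", encoding="utf-8") as f:
          f.write(content)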
      
    2. Execute the convert.py file:

      python3 convert.py
      

      The script saves the assistant-example.md file, which contains the content of the original PDF file in Markdown format, to the current directory.
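
Before indexing the converted file, it can be worth checking that headings and tables survived the conversion. The following is a minimal sketch, not part of the original tutorial: the `summarize_markdown` helper is a hypothetical name, and it assumes the converter emits pipe-delimited Markdown tables. It runs on a small inline sample so that it is self-contained.

```python
import re


def summarize_markdown(text: str) -> dict:
    """Count headings and table rows in a Markdown string (hypothetical helper)."""
    headings = 0
    table_rows = 0
    for line in text.splitlines():
        stripped = line.strip()
        if re.match(r"#{1,6}\s", stripped):
            headings += 1
        # A line that starts and ends with a pipe is treated as a table row.
        elif stripped.startswith("|") and stripped.endswith("|"):
            # Skip the separator row (|---|---|) so only header and data rows count.
            if not re.fullmatch(r"\|[\s:\-|]+\|", stripped):
                table_rows += 1
    return {"headings": headings, "table_rows": table_rows}


# Inline sample standing in for the converted file.
sample = """## Revenue

| Segment | Q3 2024 |
|---------|---------|
| A       | 10      |
| B       | 20      |
"""
summary = summarize_markdown(sample)
print(summary)
```

To check the real output, read assistant-example.md with `encoding="utf-8"` and pass its text to the same function. A table row count of zero for a document that is known to contain tables suggests that the conversion needs another look.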

Required paid resources

The cost of using an AI assistant includes a text generation fee (see Yandex Foundation Models pricing).

Create an AI assistant and test it

You will create a search index from the Markdown file, attach it to an AI assistant, and then ask the assistant a few questions about the source document. The first assistant uses a text search index, and the second one uses a hybrid search index.

Create an AI assistant with a text search index

  1. Create a file named assistant-text.py containing the following code:

    from yandex_cloud_ml_sdk import YCloudML
    from yandex_cloud_ml_sdk.search_indexes import TextSearchIndexType
    
    queries = [
        "What services are included in Ridetech?",
        "What is the adjusted EBITDA margin of ad services?",
        "For how many billion rubles did Yandex issue loans in 9 months to September 30, 2024?",
    ]
    
    
    def main():
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )
    
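        # Upload the Markdown file and start building a text search index from it.
        file = sdk.files.upload("<file_path>")
        operation = sdk.search_indexes.create_deferred(
            [file], index_type=TextSearchIndexType()
        )
        # Index creation is a deferred operation: wait for it to finish.
        text_index = operation.wait()
        # Make the index available to the assistant as a search tool.
        text_tool = sdk.tools.search_index(text_index)
        model = sdk.models.completions("yandexgpt", model_version="rc")
        assistant = sdk.assistants.create(model, tools=[text_tool])
        # The thread holds the messages exchanged with the assistant.
        text_index_thread = sdk.threads.create()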
    
        for query in queries:
            text_index_thread.write(query)
            print(query)
            run = assistant.run(text_index_thread)
            result = run.wait().message
            for part in result.parts:
                print(part)
            print("\n")
    
    
    if __name__ == "__main__":
        main()
    

    Where:

    • folder_id: ID of the folder where you created the service account.
    • auth: Service account API key you created earlier. You use an API key for authentication in the Yandex Cloud API. To learn more about authentication methods in Yandex Cloud ML SDK, see Authentication.
    • <file_path>: Path to the assistant-example.md file with source data for the search index.
  2. Run the assistant-text.py file to create an assistant and get answers to questions from it:

    python3 assistant-text.py
    

    Result:

    What services are included in Ridetech?
    Ridetech includes:
    * Online taxi ordering service
    * Yandex Drive carsharing service
    * Scooter rental service
    * Other promising services
    
    What is the adjusted EBITDA margin of ad services?
    Ad services have an adjusted EBITDA margin of 51.3%.
    
    For how many billion rubles did Yandex issue loans in 9 months to September 30, 2024?
    In the provided context, there is no information about how many billion rubles Yandex issued in loans over the nine months prior to September 30, 2024.
    

    As the result shows, the text search index handled only the first, simplest question correctly. The answer to the second question is incorrect, and the third question got no answer at all.
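
The script above takes the folder ID and API key as literal placeholders. One way to keep the key out of the source file is to read both values from environment variables. This is a minimal sketch under stated assumptions: the variable names `ASSISTANT_FOLDER_ID` and `ASSISTANT_API_KEY` and the `load_credentials` helper are hypothetical and not defined by Yandex Cloud ML SDK.

```python
import os


def load_credentials(env=None):
    """Return (folder_id, api_key) read from environment variables.

    ASSISTANT_FOLDER_ID and ASSISTANT_API_KEY are hypothetical names
    chosen for this sketch; the SDK does not define them.
    """
    env = os.environ if env is None else env
    names = ("ASSISTANT_FOLDER_ID", "ASSISTANT_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ASSISTANT_FOLDER_ID"], env["ASSISTANT_API_KEY"]


# Demonstration with a stand-in mapping instead of the real environment.
folder_id, api_key = load_credentials(
    {"ASSISTANT_FOLDER_ID": "example-folder", "ASSISTANT_API_KEY": "example-key"}
)
print(folder_id)
```

In the tutorial scripts, the returned values would replace the placeholders: `YCloudML(folder_id=folder_id, auth=api_key)`.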

Create an AI assistant with a hybrid search index

  1. Create a file named assistant-hybrid.py containing the following code:

    from yandex_cloud_ml_sdk import YCloudML
    from yandex_cloud_ml_sdk.search_indexes import (
        HybridSearchIndexType,
        StaticIndexChunkingStrategy,
        ReciprocalRankFusionIndexCombinationStrategy,
    )
    
    queries = [
        "What services are included in Ridetech?",
        "What is the adjusted EBITDA margin of ad services?",
        "For how many billion rubles did Yandex issue loans in 9 months to September 30, 2024?",
    ]
    
    
    def main():
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )
    
        file = sdk.files.upload("<file_path>")
        operation = sdk.search_indexes.create_deferred(
            [file],
            index_type=HybridSearchIndexType(
                chunking_strategy=StaticIndexChunkingStrategy(
                    max_chunk_size_tokens=512,
                    chunk_overlap_tokens=256,
                ),
                combination_strategy=ReciprocalRankFusionIndexCombinationStrategy(),
            ),
        )
        hybrid_index = operation.wait()
        hybrid_tool = sdk.tools.search_index(hybrid_index)
        model = sdk.models.completions("yandexgpt", model_version="rc")
        assistant = sdk.assistants.create(model, tools=[hybrid_tool])
        hybrid_index_thread = sdk.threads.create()
    
        for query in queries:
            hybrid_index_thread.write(query)
            print(query)
            run = assistant.run(hybrid_index_thread)
            result = run.wait().message
            for part in result.parts:
                print(part)
            print("\n")
    
    
    if __name__ == "__main__":
        main()
    

    Where:

    • folder_id: ID of the folder where you created the service account.
    • auth: Service account API key you created earlier. You use an API key for authentication in the Yandex Cloud API. To learn more about authentication methods in Yandex Cloud ML SDK, see Authentication.
    • <file_path>: Path to the assistant-example.md file with source data for the search index.
  2. Run the assistant-hybrid.py file to create an assistant and get answers to questions from it:

    python3 assistant-hybrid.py
    

    Result:

    What services are included in Ridetech?
    Ridetech includes:
    * Online taxi ordering service and distribution of such technologies abroad.
    * Yandex Drive carsharing service.
    * Scooter rental service.
    * Other promising services.
    
    What is the adjusted EBITDA margin of ad services?
    The adjusted EBITDA margin of ad services over the third quarter of 2024 was 6.5%.
    
    For how many billion rubles did Yandex issue loans in 9 months to September 30, 2024?
    Over the nine months prior to September 30, 2024, Yandex issued loans worth 6.5 billion rubles.
    

    As you can see, the hybrid search provided a better result: the answers to all three questions are correct.
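
The hybrid index above is configured with `ReciprocalRankFusionIndexCombinationStrategy`. The sketch below shows the generic reciprocal rank fusion formula, in which each document scores the sum of 1 / (k + rank) over all rankings it appears in. It is an illustration only: the tutorial does not describe how the service implements the strategy, the constant k = 60 is just a commonly used default, and the assumption that the two inputs are a keyword ranking and a vector ranking is ours. The chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the document's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by ID to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results of a keyword search and a vector search.
keyword_ranking = ["chunk_a", "chunk_b", "chunk_c"]
vector_ranking = ["chunk_c", "chunk_a", "chunk_d"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)  # ['chunk_a', 'chunk_c', 'chunk_b', 'chunk_d']
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one method still stays in the fused result.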

Refine your assistant

Extend the assistant so that, instead of only reporting the values it finds, it also analyzes the retrieved data and gives an evaluation based on it.

  1. Create a file named assistant-analyst.py containing the following code:

    from yandex_cloud_ml_sdk import YCloudML
    from yandex_cloud_ml_sdk.search_indexes import (
        HybridSearchIndexType,
        StaticIndexChunkingStrategy,
        ReciprocalRankFusionIndexCombinationStrategy,
    )
    
    query = (
        "What is the adjusted EBITDA margin of ad services?"
    )
    instruction = """
        You are a highly qualified investment assistant. 
        Your task is to help the user analyze companies based on financial reports for investors.
        At the input, you will be given a user query and pieces of reports, based on which you need to give an answer to the user's question and add an evaluation as an investment assistant.
         
        If you need to output the value of a metric in response to a user query, you should interpret the value. 
        Ideally, briefly explain the meaning of the metric and give an evaluation as to whether the metric value is large/small. 
    """
    
    
    def main():
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )
    
        file = sdk.files.upload("<file_path>")
        operation = sdk.search_indexes.create_deferred(
            [file],
            index_type=HybridSearchIndexType(
                chunking_strategy=StaticIndexChunkingStrategy(
                    max_chunk_size_tokens=1024,
                    chunk_overlap_tokens=512,
                ),
                combination_strategy=ReciprocalRankFusionIndexCombinationStrategy(),
            ),
        )
        hybrid_index = operation.wait()
        hybrid_tool = sdk.tools.search_index(hybrid_index)
        model = sdk.models.completions("yandexgpt", model_version="rc")
        assistant = sdk.assistants.create(
            model, tools=[hybrid_tool], instruction=instruction
        )
        hybrid_index_thread = sdk.threads.create()
    
        hybrid_index_thread.write(query)
        print(query)
        run = assistant.run(hybrid_index_thread)
        result = run.wait().message
        for part in result.parts:
            print(part)
        print("\n")
    
    
    if __name__ == "__main__":
        main()
    

    Where:

    • folder_id: ID of the folder where you created the service account.
    • auth: Service account API key you created earlier. You use an API key for authentication in the Yandex Cloud API. To learn more about authentication methods in Yandex Cloud ML SDK, see Authentication.
    • <file_path>: Path to the assistant-example.md file with source data for the search index.
  2. Run the assistant-analyst.py file to create an assistant and get its answer to the question:

    python3 assistant-analyst.py
    

    Result:

    What is the adjusted EBITDA margin of ad services?
    The adjusted EBITDA margin of ad services over the third quarter of 2024 was 6.5%.
     This is a significant improvement compared to the same period last year, when the profit margin was 3.3%.
     This growth indicates the efficiency of the segment and the improvement of its financial performance.
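
The hybrid scripts in this tutorial set `max_chunk_size_tokens` and `chunk_overlap_tokens`, first to 512 and 256, then to 1024 and 512. The sketch below illustrates what a fixed-size chunking scheme with overlap does in general. It is an assumption-laden illustration: the `chunk_tokens` helper is hypothetical, the placeholder tokens stand in for a real tokenizer, and the tutorial does not specify how the service splits text internally.

```python
def chunk_tokens(tokens, max_chunk_size, overlap):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be non-negative and smaller than max_chunk_size")
    step = max_chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_chunk_size])
        # Stop once the current chunk reaches the end of the input.
        if start + max_chunk_size >= len(tokens):
            break
        start += step
    return chunks


# 2048 placeholder tokens standing in for a tokenized document.
tokens = [f"t{i}" for i in range(2048)]
small = chunk_tokens(tokens, max_chunk_size=512, overlap=256)
large = chunk_tokens(tokens, max_chunk_size=1024, overlap=512)
print(len(small), len(large))  # 7 3
```

Larger chunks give the model more surrounding context per search hit, such as a whole table instead of a few rows, at the cost of fewer and coarser chunks. The overlap reduces the chance that a fact is cut in half at a chunk boundary.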
    
    
