Date: 2024-10-15

To implement this solution, HSE extracted and analyzed content from education program websites using the open-source BeautifulSoup library. This process included removing HTML/XML tags from the data and identifying the parent and child elements of each object.
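The extraction step described above might look roughly like the sketch below. It is an illustration under assumptions rather than HSE's actual code: the function name `extract_records`, the list of tags treated as content, and the record layout are all hypothetical. Only the BeautifulSoup calls themselves (`find_all`, `decompose`, `get_text`) are standard library API.

```python
from bs4 import BeautifulSoup

# Hypothetical choice of tags that carry readable program content.
CONTENT_TAGS = ["h1", "h2", "h3", "p", "li"]


def extract_records(html: str) -> list[dict]:
    """Strip markup and record each element's parent and child tags."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements that never contain readable text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    records = []
    for element in soup.find_all(CONTENT_TAGS):
        # get_text drops nested tags and joins the text pieces with spaces.
        text = element.get_text(" ", strip=True)
        if not text:
            continue
        records.append({
            "tag": element.name,
            "parent": element.parent.name if element.parent else None,
            # Direct child tags only (recursive=False skips deeper levels).
            "children": [child.name for child in element.find_all(recursive=False)],
            "text": text,
        })
    return records
```

Keeping the parent and child tag names next to the cleaned text preserves some of the page structure, which can help later when deciding which pieces of text belong to the same program description.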

The team then split all content into 1024-character chunks with a 200-character overlap, converted the chunks into embeddings using the Yandex Foundation Models Embedding API, and stored them in a ClickHouse® vector database hosted on HSE's own server. The database is around 500 MB in size. Whenever teachers update existing education programs or add new ones, the ClickHouse® database is updated as well.
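The chunking step can be sketched as follows. This is a minimal illustration, not HSE's code: the function name `chunk_text` is hypothetical, and it assumes that "overlap" means each chunk repeats the last 200 characters of the previous one.

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text,
        # so no trailing chunk consists only of repeated characters.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults a new chunk starts every 824 characters, so a sentence cut at a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.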

The staff developed the program finder in Python, embedded it into the HSE university website, and created a pipeline in which YandexGPT processes text queries and Yandex SpeechKit handles voice messages.
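One way to picture the entry point of such a pipeline is a small normalization step that turns either kind of message into text. The sketch below is an assumption made for illustration: `Query`, `to_text`, and the injected `transcribe` callable are hypothetical names, and `transcribe` merely stands in for a speech-to-text call such as one to Yandex SpeechKit, whose real API is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Query:
    """A visitor message: either typed text or recorded audio."""
    text: Optional[str] = None
    audio: Optional[bytes] = None


def to_text(query: Query, transcribe: Callable[[bytes], str]) -> str:
    """Return the query as text, transcribing audio only when needed."""
    if query.text is not None and query.text.strip():
        return query.text.strip()
    if query.audio:
        # `transcribe` is a placeholder for the speech-to-text service.
        return transcribe(query.audio).strip()
    raise ValueError("query has neither text nor audio")
```

Passing the transcriber in as an argument keeps the routing logic testable without network access.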

The chatbot is highly user-friendly. A website visitor asks a question as a text or voice message in which they specify their interests, achievements, age, intended studying location, final exam score, and any other information they see fit. If the user asks the question by voice, Yandex SpeechKit converts it into text. The text query then goes to the YandexGPT API, which highlights the key data needed for a precise answer. After that, the Yandex Foundation Models Embedding API vectorizes the query. Prompts are around 40,000 characters long at both input and output. The team then employs retrieval-augmented generation (RAG): the vectorized query is used to search the vector database for the closest matches, that is, the chunks describing the education programs that best fit the user. All retrieved chunks, along with the user query and the system prompt, go to YandexGPT, which in turn generates an answer to the query based on the information in the chunks.
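The retrieval and prompt-assembly steps can be sketched in plain Python. This is a simplified illustration under several assumptions: the similarity search runs in memory with cosine similarity instead of inside ClickHouse, the embeddings are already computed and passed in as lists of floats, and the names `cosine_similarity`, `top_k_chunks`, and `build_prompt`, as well as the prompt layout, are hypothetical. The source does not say which distance metric or prompt format HSE uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec: list[float],
                 index: list[tuple[str, list[float]]],
                 k: int = 3) -> list[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(system_prompt: str, chunks: list[str], user_query: str) -> str:
    """Combine system prompt, retrieved chunks, and the question into one prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {user_query}"
```

In a setup like the one described, the ranking would normally be done by the vector database itself, and the resulting prompt would be sent to the generation model. The sketch only shows how the pieces fit together.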

At first, the developers decided to provide users with information about suitable programs along with an additional description of those programs generated by YandexGPT. However, some issues arose at this stage. The team has not yet managed to train the network to correctly highlight key points in the answers: minor features persistently come to the fore. Therefore, the staff settled on providing only program details from the knowledge base.

The HSE team spent around two months working on the project, and much of that time went into creating design layouts and fine-tuning the user interface. Engineers at the university did all the work on their own, with support from Yandex Cloud experts.

Most of the project's difficulties were due to a lack of experience with generative AI. After a while, the developers realized that simply sending questions to the YandexGPT API was not enough, so they pivoted to a pipeline in which answers are based on information from a vector database. After that, development went much faster.