Integrating AI Studio with Data Catalog
You can use an AI assistant to search and analyze patterns in metadata catalogs deployed in Data Catalog. To do that, you need to connect the Data Catalog MCP server to MCP Hub. The server allows you to request the list of metadata catalogs, search through metadata, and obtain its lineage graph at the table and column level for use in the context of conversation with agents.
To set up integration with Data Catalog in AI Studio:
- Set up your infrastructure.
- Prepare the metadata catalog.
- Connect an external MCP server.
- Test a conversation with the agent.
Getting started
Sign up for Yandex Cloud and create a billing account:
- Navigate to the management console and log in to Yandex Cloud or create a new account.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and that its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account, create one and link a cloud to it.
If you have an active billing account, you can navigate to the cloud page.
Learn more about clouds and folders.
Required paid resources
The integration infrastructure cost includes a fee for Agent Atelier based on the number of tokens in the request and the response (see Yandex AI Studio pricing). You start paying for the agent as soon as you activate it.
Set up your infrastructure
Create a folder and network
Create a resource folder to host your metadata catalog.
- In the management console, select a cloud and click Create folder.
- Name your folder, e.g., data-folder.
- Select Create a default network. This will create a network with subnets in each availability zone.
- Click Create.
Learn more about clouds and folders.
Create a service account
- Navigate to data-folder.
- In the list of services, select Identity and Access Management.
- Click Create service account.
- Name the service account, e.g., sa-for-mcp-server.
- Click Add role and assign the following roles to the service account:
  - data-catalog.user for access to the metadata catalog resources.
  - serverless.mcpGateways.invoker for access to the MCP server in MCP Hub.
  - serverless.mcpGateways.anonymousInvoker for access to the external MCP server.
- Click Create.
Prepare the metadata catalog
Create a metadata catalog
- In the management console, select the resource folder where you want to create a metadata catalog.
- Select Yandex MetaData Hub.
- In the left-hand panel, select Data Catalog.
- Click Creating a catalog.
- In the Name field, enter the catalog name, test-sales.
- Click Create.
Note
When you create a metadata catalog, the metadata AI markup is on by default.
With this option enabled, the AI assistant suggests descriptions, domains, classifications and tags, glossaries and terms, and marks up your metadata using them. You can confirm, edit, or reject any suggestion your AI assistant makes by hovering over the AI icon next to the suggestion and selecting the action.
After the catalog is created, you can manage the AI markup on the Overview page or when updating the catalog.
Create a metadata source
- In the left-hand panel, select Data sources.
- Click Create data source.
- Specify test-sales-source as the source name.
- Select the type of the backend that will supply metadata for analysis. Once the source is created, you cannot change the database type. Available backends:
  - PostgreSQL
  - MySQL®
  - ClickHouse®
  - Yandex Data Transfer
  - WebSQL
  - Yandex StoreDoc/MongoDB
  - OpenSearch
  - Greenplum®
- Specify the source parameters for the selected database type:
  - Connection ID: Managed connection ID in Yandex Connection Manager.
  - Database name: Name of the database to ingest metadata from.
- Click Create.
Create and start a data ingestion
- In the left-hand panel, select Ingestions.
- Click Create ingestion.
- Specify the ingestion settings:
  - In the Name field, enter load-sales as the ingestion name.
  - Select the metadata source you created earlier.
  - Specify the ingestion configuration for the data source:
    - Select Manually for the ingestion schedule.
    - Optionally, under Data Filters, use regular expressions to specify which databases and database objects to include in or exclude from the ingestion.
  - Under Metadata Types, select the metadata types to extract from the source.
  - Optionally, under Data Profiling:
    - Select Enable Profiling to perform data profiling, i.e., analysis and collection of statistics on the data being extracted.
    - Select Table level only to skip per-column profiling. With this option on, data characteristics will only be collected for the table as a whole.
    - In the Max Workers field, specify the number of computing threads for profiling.
    - In the Sample Size field, specify the number of rows to sample for column profiling. This setting applies when the Use Sampling option is enabled.
    - In the Table size limit field, specify the table size in GB above which the table will be excluded from profiling.
    - In the Table row limit field, specify the number of rows above which the table will be excluded from profiling.
    - Select Enable field null count to get the number of rows with NULL for each column.
    - Select Enable distinct value count to get the number of unique values for each column.
    - Select Enable field min value to get the minimum value for each numeric column.
    - Select Enable field max value to get the maximum value for each numeric column.
    - Select Enable field mean value to get the mean value for each numeric column.
    - Select Enable field median value to get the median value for each numeric column.
    - Select Enables field value stddev to get the standard deviation value for each numeric column.
    - Select Enables field quintiles to get quantiles for each numeric column.
    - Select Enable distinct value frequency count to get the frequency of unique values for each column.
    - Select Enable field histogram to get a histogram for each numeric column.
    - Select Enable field sample values to get sample values for each column.
    - Select Enable query joining to dynamically combine SQL queries for faster profiling.
    - In the Limit field, specify the maximum number of rows to profile. If set to 0, all rows will be profiled.
  - Under Metadata Processing, select the image for metadata processing:
    - Enable Use File Cache to improve ingestion performance.
- Click Create.
- In the list of ingestions, open the menu in the row with your new ingestion and select Start.
  During ingestion, the AI assistant will automatically mark up the data. Once the ingestion completes successfully, it gets the Success status.
- To view ingested and marked-up data, select Metadata search in the left-hand panel.
  The page displays information about the data, i.e., data source, database, and tables.
Note
The AI assistant automatically creates entities for metadata markup (domains, glossaries, tags, classifications, and terms) and their descriptions. You can confirm, edit, or reject the markup suggested by your AI assistant by hovering over the AI icon next to the suggestion and selecting the action.
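The Data Filters step above relies on regular expressions to include or exclude databases and database objects. The sketch below illustrates one plausible way such filters behave. The function name `filter_objects`, the whole-name matching, and the rule that exclusion takes precedence over inclusion are assumptions made for illustration; check the Data Catalog documentation for the actual matching semantics before relying on a pattern.

```python
import re
from typing import Iterable, List


def filter_objects(names: Iterable[str], include: List[str], exclude: List[str]) -> List[str]:
    """Keep names that match any include pattern and no exclude pattern.

    Assumptions (not confirmed by the Data Catalog docs):
    - a pattern must match the whole object name;
    - an empty include list means "include everything";
    - exclusion wins over inclusion.
    """
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    kept = []
    for name in names:
        included = not include_res or any(r.fullmatch(name) for r in include_res)
        excluded = any(r.fullmatch(name) for r in exclude_res)
        if included and not excluded:
            kept.append(name)
    return kept


# Hypothetical object names used only to try the patterns out.
tables = ["sales.orders", "sales.orders_tmp", "hr.salaries", "sales.customers"]
selected = filter_objects(tables, include=[r"sales\..*"], exclude=[r".*_tmp"])
print(selected)  # ['sales.orders', 'sales.customers']
```

Testing patterns locally like this before starting an ingestion can help avoid pulling in temporary or unrelated tables.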
Connect an external MCP server
Connecting in AI Studio
- Navigate to data-folder.
- Select AI Studio.
- In the left-hand panel, select MCP servers and click Create MCP server. In the window that opens:
  - Under Add method, select Connect.
  - Under Tools, click Add tools. In the window that opens, configure the MCP server connection:
    - Transport: Streamable HTTP.
    - URL: https://datacatalog-consumer.mcp.cloud.yandex.net/mcp
    - Authorization type: Access token.
    - Under Authorization header, set the Value field to Bearer <IAM_token>. To do this, get an IAM token for the service account created earlier, then paste it into the field.
      Note
      The IAM token lifetime does not exceed 12 hours; however, we recommend requesting a token more often, e.g., every hour.
  - Click Connect.
  - In the Add tools window that opens, select all tools and click Add.
  - Under Server parameters:
    - In the Name field, enter a name for the new MCP server. Follow these naming requirements:
      - Length: between 3 and 63 characters.
      - It can only contain lowercase Latin letters, numbers, and hyphens.
      - It must start with a letter and cannot end with a hyphen.
    - Optionally, add a description and labels for the server you are creating by using the corresponding buttons.
    - In the Access field, select Private.
    - In the Service account field, select the service account you previously created.
    - Optionally, turn on the Enable logging option and configure the logging settings to keep a log of the MCP server you are creating.
  - Click Save.
- In the left-hand panel, select Agents and click Create agent.
- Specify the agent settings:
  - Name: Agent name.
  - Model: Language model.
  - Under Instructions, select a ready-made system instruction template for the agent or describe how the agent should behave and what it should do.
  - Under Tools:
    - Click Add and select Add MCP.
    - In the list, select the MCP server you created earlier and click Select.
    - In the Default behavior for all tools field, select Confirmation not needed.
- Click Create and continue.
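The MCP server naming requirements listed in the steps above (3 to 63 characters; lowercase Latin letters, numbers, and hyphens only; starts with a letter; does not end with a hyphen) can be checked before you fill in the form. This is a minimal sketch: the helper name `is_valid_server_name` is hypothetical, and the regular expression is simply one way to encode the stated rules, not the validation the console itself runs.

```python
import re

# First character: a lowercase letter.
# Middle: 1 to 61 lowercase letters, digits, or hyphens.
# Last character: a lowercase letter or digit (no trailing hyphen).
# Total length is therefore between 3 and 63 characters.
_NAME_RE = re.compile(r"[a-z][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_server_name(name: str) -> bool:
    """Return True if the name satisfies the naming rules stated in the guide."""
    return _NAME_RE.fullmatch(name) is not None


for candidate in ["datacatalog-mcp", "ab", "1-catalog", "catalog-", "Catalog"]:
    print(candidate, is_valid_server_name(candidate))
```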
Connecting to an external AI agent
- Get an IAM token for the service account you created earlier.
  Note
  The IAM token lifetime does not exceed 12 hours; however, we recommend requesting a token more often, e.g., every hour.
- Specify the Data Catalog MCP server configuration for your agent:
  {
    "mcpServers": {
      "yandex-cloud-datacatalog-consumer": {
        "type": "streamableHttp",
        "url": "https://datacatalog-consumer.mcp.cloud.yandex.net/mcp",
        "headers": {
          "Authorization": "Bearer <IAM_token>"
        }
      }
    }
  }
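Because the IAM token expires and the guide recommends requesting a new one roughly every hour, it may be convenient to generate the agent configuration from a script rather than paste the token by hand. The sketch below builds the same configuration structure shown above. The environment variable name `YC_IAM_TOKEN` and the helpers `build_mcp_config` and `token_needs_refresh` are hypothetical; how you obtain the token and where your agent expects the configuration file depend on your tooling.

```python
import json
import os
import time
from typing import Optional

MCP_URL = "https://datacatalog-consumer.mcp.cloud.yandex.net/mcp"
# The guide recommends requesting a new token about every hour.
REFRESH_AFTER_SECONDS = 3600


def build_mcp_config(iam_token: str) -> dict:
    """Build the MCP server configuration with the given IAM token."""
    if not iam_token:
        raise ValueError("IAM token is empty")
    return {
        "mcpServers": {
            "yandex-cloud-datacatalog-consumer": {
                "type": "streamableHttp",
                "url": MCP_URL,
                "headers": {"Authorization": f"Bearer {iam_token}"},
            }
        }
    }


def token_needs_refresh(issued_at: float, now: Optional[float] = None) -> bool:
    """Return True once the token is at least an hour old."""
    now = time.time() if now is None else now
    return now - issued_at >= REFRESH_AFTER_SECONDS


# YC_IAM_TOKEN is an assumed variable name; the placeholder is used if it is unset.
token = os.environ.get("YC_IAM_TOKEN", "<IAM_token>")
config = build_mcp_config(token)
print(json.dumps(config, indent=2))
```

Keeping the token out of the checked-in configuration and injecting it at generation time also reduces the chance of committing a live credential.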
Test a conversation with the agent
Tip
If using the agent in AI Studio, do the testing in the right-hand Agent testing panel.
- Start a conversation with the agent by specifying the data catalog ID as shown below:
  Use the marked-up data in the apah36iavgh5******** data catalog.
- Try the example prompts below. To answer them, the agent analyzes the marked-up data from Data Catalog. The examples assume the data contains sales-related information:
  - Write an SQL query to generate YoY sales analytics
  - Find all tables with user payment information
  - Which tables are marked as containing sensitive data?
  - Where does the customer_transactions table get its data from?
  - Help find the tables needed to calculate the user retention metric
  - Where can I find the website users' behavior data?
  - Which data should I use to analyze sales funnel conversion rate?
  - Show all dependencies of the transactions table to see how schema changes affect it