Creating a data ingestion
Note
This feature is in the Preview stage.
-
In the management console, select the resource folder you created the metadata catalog in.
-
Select Yandex MetaData Hub.
-
In the left-hand panel, select Data Catalog.
-
In the list that opens, select the metadata catalog where you want to create an ingestion.
-
In the left-hand panel, select Ingestions.
-
Click Create ingestion.
-
Specify the ingestion settings:
-
In the Name field, enter a unique name for the ingestion.
-
Optionally, describe the ingestion.
-
Select a data source or create one.
-
Specify the ingestion configuration for the data source:
-
Select an ingestion schedule:
-
Monthly: Select the dates and the ingestion start and end time.
-
Weekly: Select the days of the week and the ingestion start and end time.
Note
If you select Monthly or Weekly, the ingestion starts at the specified time and stops as soon as the new data has been ingested. If errors occur during ingestion, it restarts until the data has been ingested or the specified time window ends.
-
Daily: Select time intervals for ingestion.
-
Manually: For manual start only.
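The restart behavior described in the note for Monthly and Weekly schedules can be pictured as a retry loop bounded by the end of the time window. The following is a minimal sketch of that idea, not the service's implementation: `run_within_window`, `flaky_ingest`, and the pause between attempts are hypothetical names and assumptions introduced only for illustration.

```python
import time
from datetime import datetime, timedelta


def run_within_window(ingest_once, window_end, pause_seconds=0.0):
    """Call ingest_once() until it succeeds or the window ends.

    Returns (succeeded, attempts). Hypothetical helper: the real service
    may pace or limit restarts differently.
    """
    attempts = 0
    while datetime.now() < window_end:
        attempts += 1
        try:
            ingest_once()
            return True, attempts  # data ingested: stop early
        except Exception:
            time.sleep(pause_seconds)  # assumed pause before restarting
    return False, attempts  # window is over: give up


# Demo: a stand-in ingestion that fails twice, then succeeds.
attempts_seen = {"count": 0}


def flaky_ingest():
    attempts_seen["count"] += 1
    if attempts_seen["count"] < 3:
        raise RuntimeError("transient source error")


result = run_within_window(flaky_ingest, datetime.now() + timedelta(minutes=5))
print(result)  # (True, 3)
```

If the window has already ended when the loop starts, no attempt is made and the function returns `(False, 0)`.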
-
Optionally, under Data Filters, use regular expressions to specify which databases and database objects to include or exclude from the ingestion.
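To get a feel for how include and exclude patterns interact, you can test them locally before entering them under Data Filters. The sketch below is an illustration under assumptions: it matches each pattern against the whole object name and lets exclude patterns take precedence over include patterns. The service's exact matching rules are not stated here, so treat `select_objects` and the sample names as hypothetical.

```python
import re


def select_objects(names, include=(), exclude=()):
    """Return names that match any include pattern and no exclude pattern.

    Assumptions: full-string matching, an empty include list means
    "include everything", and exclude wins over include.
    """
    include_patterns = [re.compile(p) for p in include]
    exclude_patterns = [re.compile(p) for p in exclude]
    selected = []
    for name in names:
        if include_patterns and not any(p.fullmatch(name) for p in include_patterns):
            continue
        if any(p.fullmatch(name) for p in exclude_patterns):
            continue
        selected.append(name)
    return selected


# Hypothetical object names in "schema.table" form.
tables = ["sales.orders", "sales.orders_tmp", "hr.salaries", "sales.customers"]
filtered = select_objects(tables, include=[r"sales\..*"], exclude=[r".*_tmp"])
print(filtered)  # ['sales.orders', 'sales.customers']
```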
-
Under Metadata Types, select the metadata types to extract from the source.
-
Optionally, under Data Profiling:
- Select Enable Profiling to perform data profiling, i.e., analysis and collection of statistics on the data being extracted.
- Select Table level only to skip column-level profiling. With this option enabled, statistics are collected only for the table as a whole.
- In the Max Workers field, specify the number of computing threads for profiling.
- In the Sample Size field, specify the number of rows to sample for column profiling. This setting applies when the Use Sampling option is enabled.
- In the Table size limit field, specify the table size in GB above which the table will be excluded from profiling.
- In the Table row limit field, specify the number of rows above which the table will be excluded from profiling.
- Select Enable field null count to get the number of rows with NULL for each column.
- Select Enable distinct value count to get the number of unique values for each column.
- Select Enable field min value to get the minimum value for each numeric column.
- Select Enable field max value to get the maximum value for each numeric column.
- Select Enable field mean value to get the mean value for each numeric column.
- Select Enable field median value to get the median value for each numeric column.
- Select Enables field value stddev to get the standard deviation value for each numeric column.
- Select Enables field quintiles to get quantiles for each numeric column.
- Select Enable distinct value frequency count to get the frequency of unique values for each column.
- Select Enable field histogram to get a histogram for each numeric column.
- Select Enable field sample values to get sample values for each column.
- Select Enable query joining to dynamically combine SQL queries for faster profiling.
- In the Limit field, specify the maximum number of rows to profile. If set to 0, all rows will be profiled.
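To make the column-level statistics listed above more concrete, here is a rough sketch of what each one means for a single numeric column. It uses Python's standard `statistics` module and is not how the service computes profiles; `profile_column` and the sample values are hypothetical, and the sketch assumes at least two non-NULL values.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Compute illustrative profiling statistics for one numeric column.

    None stands in for NULL. Assumes at least two non-NULL values.
    """
    non_null = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": statistics.mean(non_null),
        "median": statistics.median(non_null),
        "stddev": statistics.stdev(non_null),  # sample standard deviation
        "quartiles": statistics.quantiles(non_null, n=4),  # three cut points
        "value_frequencies": Counter(non_null).most_common(),
    }


# Hypothetical column with one NULL and one repeated value.
column = [4, 8, None, 8, 15, 16, 23, 42]
profile = profile_column(column)
print(profile["null_count"], profile["distinct_count"], profile["median"])  # 1 6 15
```

Disabling an option in the console corresponds to leaving out the matching statistic, which is why turning off the more expensive ones can shorten profiling time on large tables.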
-
Under Metadata Processing, select the image for metadata processing:
- Enable Use File Cache to improve ingestion performance.
-
Click Create.