Hive Metastore clusters
Note
This feature is at the Preview stage.
You can create Hive Metastore clusters in Yandex MetaData Hub.
Hive Metastore
- Provides client applications with information on where to get the data for processing and how to interpret it.
- Retains table metadata between runs of short-lived compute clusters.
- Shares the data space between concurrently running clusters.
- Links together different ETL systems and tools for working with shared data and simplifies their deployment.
- Provides fault tolerance, scalable storage, and metadata backup.
- Simplifies sending logs and metrics, as well as the update and migration processes.
- Plays a key role in cloud data processing scenarios by enabling different tools (Spark, Trino, Hive) to access the same metadata.
Some Apache products, including Hive and Spark, use Hive Metastore to store and retrieve table metadata.
Why use Hive Metastore
When handling big data and analytics in the cloud, there is often a need to turn sets of files into tables you can easily work with using SQL. Metastore is a persistent database with a data dictionary. Persistence means that information is saved to disk and remains available after a shutdown or restart. A data dictionary contains definitions describing the structure and format of data. Metastore stores metadata about tables whose data physically resides in Yandex Object Storage: the location of data files, their arrangement, column structure, data types, partitioning, and so on. Essentially, Metastore creates an abstraction over raw files, turning them into logical tables you can manage with SQL.
One may compare this to cataloging books in a library. In a large library with thousands of books, you would have to check every shelf to find the book you need unless you had a catalog to help you quickly locate it. Metastore has a similar role when it comes to Object Storage data.
In relational databases (Oracle, PostgreSQL), the data dictionary is built into the DBMS itself. When you create a table in PostgreSQL, information about its structure is stored in system tables within the same database. However, in big data ecosystems, where files may be stored independently of processing tools, you need a dedicated service to store such information: Metastore.
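For example, here is a minimal PySpark sketch of how a set of Parquet files in Object Storage becomes a logical table whose definition lives in Metastore. The bucket path, table name, and columns are hypothetical placeholders:

```python
# A minimal sketch: registering files in Object Storage as a logical table.
# The bucket path, table name, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("metastore-table-example")
    .enableHiveSupport()          # use Hive Metastore as the data dictionary
    .getOrCreate()
)

# The table definition (schema, format, location, partitioning) is stored
# in Metastore; the data itself stays in Object Storage as Parquet files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id BIGINT,
        amount   DOUBLE,
        dt       DATE
    )
    USING parquet
    PARTITIONED BY (dt)
    LOCATION 's3a://<your-bucket>/warehouse/sales/'
""")

# Any engine connected to the same Metastore can now query the table by name.
spark.sql("SELECT dt, SUM(amount) FROM sales GROUP BY dt").show()
```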
Use cases
Metastore does not solve business problems on its own, yet it is an essential tool in a number of common use cases.
Operations with data from different analytical tools
Modern data processing architectures often use multiple tools to work with the same data in Object Storage. This is because different tools are optimal for different tasks. For example, you may want to use Apache Spark™ for bulk data processing and ETL (Extract, Transform, Load) and Trino for interactive analytics and quick queries.
If there is no single shared metadata store, each tool maintains its own copy of metadata, which creates problems when the data structure changes. If you add a new column to a table and update the metadata in Apache Spark™ without making the same update in Trino, queries from Trino will return incomplete data or produce errors.
Metastore prevents this by providing a single source of truth for all metadata. The table structure is described once, and all connected tools automatically get access to up-to-date information. This streamlines administration and significantly reduces the risk of metadata inconsistency errors.
For example, if a team of data analysts uses Trino for interactive queries while an engineering team uses Apache Spark™ for ETL, a single shared Metastore provides both teams with a consistent view of the data, ensuring accurate results.
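To illustrate the point, here is a hedged sketch that continues with the hypothetical sales table from the earlier example: a schema change made from Spark is recorded directly in the shared Metastore, so Trino and any other connected engine see it immediately:

```python
# A hedged illustration (table and column names are hypothetical): a schema
# change made from one engine is recorded in the shared Metastore, so every
# other engine connected to it sees the new column right away.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Add a column from Spark; the table definition is updated in Metastore itself.
spark.sql("ALTER TABLE sales ADD COLUMNS (currency STRING)")

# No separate metadata update is needed for Trino: its next query against
# `sales` resolves the table through the same Metastore and sees `currency`.
```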
Cluster lifecycle management
One of the key cloud computing advantages is that you pay only for resources you actually use. This is especially relevant for data processing tasks that only run occasionally.
A lot of Yandex Cloud customers use powerful compute clusters (Yandex Data Processing with Apache Spark™ or custom Apache Hadoop® clusters) exclusively for such tasks: report generation, overnight batch processing, analytical model updates, etc.
Such clusters may consist of hundreds of CPU cores and terabytes of RAM and are very expensive to use. Having these resources available all the time is not cost-effective, especially if they are only used for a few hours a day.
A better approach is to create temporary clusters for specific tasks and then delete them. However, in traditional Apache Hadoop® architectures, Metastore is a component within a cluster, so if you delete a cluster, you will lose all table metadata. With the next run, you would need to redefine table structure manually, which is an error-prone and labor-intensive process.
You can tackle this by creating a standalone managed Metastore cluster. It is independent of compute clusters and retains all metadata even after a compute cluster is deleted. A new cluster created for the next processing session connects to the same Metastore and gets access to all the table definitions.
In Yandex Cloud, many users implement this scenario with the help of Managed Service for Apache Airflow™, which is a workflow orchestration tool. Managed Service for Apache Airflow™ schedules the creation of powerful Yandex Data Processing clusters for data processing and then, once the computations are complete, deletes them to optimize costs. All metadata is retained in a separate managed Metastore cluster, ensuring a seamless user experience.
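The pattern can be sketched as an Airflow DAG. The helper functions below are hypothetical placeholders for whatever mechanism you use to manage clusters (the Yandex Cloud provider operators, the CLI, or an SDK), and the example assumes Airflow 2.4 or later; the point is only that the compute cluster is temporary while the Metastore it connects to is not:

```python
# A schematic DAG for the "temporary cluster, persistent Metastore" pattern.
# The helper functions are hypothetical placeholders, not a real API.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def create_dataproc_cluster(**context):
    """Create a temporary Yandex Data Processing cluster whose Spark is
    configured to use the standalone Metastore cluster (hypothetical helper)."""
    ...

def run_spark_job(**context):
    """Run the nightly ETL job; tables are resolved through Metastore."""
    ...

def delete_dataproc_cluster(**context):
    """Delete the compute cluster; table metadata stays in Metastore."""
    ...

with DAG(
    dag_id="nightly_etl_with_ephemeral_cluster",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    create = PythonOperator(task_id="create_cluster", python_callable=create_dataproc_cluster)
    process = PythonOperator(task_id="run_etl", python_callable=run_spark_job)
    delete = PythonOperator(task_id="delete_cluster", python_callable=delete_dataproc_cluster)

    create >> process >> delete
```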
Working with modern data formats for analytics
In recent years, new data formats have been developed, designed specifically for analytical tasks: Apache Iceberg, Delta Lake, Apache Hudi. Compared to traditional ones such as CSV or Parquet, these formats offer more capabilities and are more user-friendly.
Here are the features they provide:
- Atomic transactions for data writes.
- Data versioning and time travel.
- Schemas and schema evolution.
- Table optimization and size management.
- Query isolation from parallel writes.
To implement these features, formats such as Iceberg and Delta Lake rely on centralized metadata management. They need storage for information about table versions, transactions, schema changes, etc., and Metastore provides an optimal infrastructure for that.
Without Metastore, using these advanced formats would be a lot more complex, and some features would be entirely unavailable. With Metastore, you can leverage all the benefits of modern data formats without creating a custom infrastructure for metadata management.
In Yandex Cloud, Metastore is particularly useful for building data lakes and lakehouses (Data Lakehouse) based on the Delta Lake and Iceberg formats. It provides the infrastructure required to store these formats' metadata and makes them simple and reliable to use.
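As an illustration, a Spark session can register an Apache Iceberg catalog backed by Metastore with configuration along these lines. The catalog name, namespace, and Metastore host are placeholders, and the sketch assumes the Iceberg Spark runtime package is available on the cluster:

```python
# A hedged sketch: an Apache Iceberg catalog that keeps its metadata in
# Hive Metastore. The catalog name and the Metastore host are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-over-metastore")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hive")
    .config("spark.sql.catalog.lakehouse.uri", "thrift://<metastore-host>:9083")
    .getOrCreate()
)

# Iceberg stores a pointer to the current table metadata in Metastore, which
# enables atomic commits, snapshots, and schema evolution across engines.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.analytics.events (
        event_id BIGINT,
        payload  STRING,
        event_dt DATE
    ) USING iceberg
    PARTITIONED BY (event_dt)
""")
```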
Metastore integration with Yandex Cloud services
In Yandex Cloud, Metastore integrates with other services, enhancing their data capabilities and simplifying the creation of comprehensive solutions.
Yandex Data Processing and Metastore
Yandex Data Processing is a service for running distributed computations using Apache Spark™.
Connecting Yandex Data Processing to a managed Metastore cluster in Yandex Cloud is simple: when creating a cluster, specify the Metastore URI in the additional settings. Apache Spark™ will then automatically connect to Metastore and get access to all tables defined in it.
This offers many data management capabilities:
- Using Spark SQL to run complex analytical queries on Object Storage data.
- Using different Yandex Data Processing clusters with the same tables without duplicating definitions.
- Creating and deleting clusters as you need without losing table metadata.
For example, Yandex Data Processing can be used to create ETL pipelines that read data from various sources, transform it, and write it to tables defined in Metastore. This data then becomes available for analytics through any other service connected to the same Metastore.
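Below is a hedged PySpark sketch of such a pipeline. The bucket path, table, and column names are hypothetical, and on Yandex Data Processing the Metastore URI is normally supplied by the cluster settings, so the explicit config line is shown only to make the dependency visible:

```python
# A schematic ETL job: read raw files from Object Storage, transform them,
# and write the result into a table whose definition is stored in Metastore.
# Paths, table names, and the Metastore host are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("etl-into-metastore-table")
    # Usually set at the cluster level on Yandex Data Processing; shown here
    # only to make the dependency on Metastore explicit.
    .config("spark.hadoop.hive.metastore.uris", "thrift://<metastore-host>:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Read raw event files from Object Storage.
raw = spark.read.json("s3a://<your-bucket>/raw/events/")

# Aggregate events per day and type.
daily = (
    raw.withColumn("event_date", F.to_date("event_ts"))
       .groupBy("event_date", "event_type")
       .agg(F.count("*").alias("events_count"))
)

# The resulting table becomes visible to Trino or any other engine connected
# to the same Metastore.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
daily.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")
```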
Managed Service for Trino and Metastore
Trino is a distributed SQL engine for analytical queries. Trino can work with various data sources, including files in Object Storage. Yandex Cloud offers Managed Service for Trino with Metastore connectivity.
Trino uses a connector system to access different data sources; Metastore is accessed through the Hive connector. When creating a Managed Service for Trino cluster, you can add a Hive catalog and specify the Metastore URI, after which Trino gets access to all tables defined in Metastore.
Integrating Managed Service for Trino with Metastore is particularly useful for interactive analytics. Analysts can run SQL queries against Object Storage data without knowing its physical storage details. They work with table abstractions while Metastore and Trino handle all tasks related to data access.
For example, a business analyst can connect to Managed Service for Trino via WebSQL or a BI tool, run a complex analytical query against data processed using Yandex Data Processing, and get results in just a few seconds. There is no need to know file locations, partitioning methods, or storage formats: Trino retrieves all this information from Metastore.
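Such a query could be run from any SQL client connected to the cluster. The snippet below uses the trino Python package purely as an illustration; the host, catalog, schema, and table are placeholders that match the hypothetical ETL example above, and authentication details are omitted:

```python
# A hedged example of an interactive query against Object Storage data
# through Trino and Metastore. Host, user, catalog, and table names are
# placeholders; the `trino` PyPI package is assumed to be installed, and
# authentication settings are intentionally left out.
import trino

conn = trino.dbapi.connect(
    host="<trino-coordinator-host>",
    port=443,
    user="analyst",
    catalog="hive",        # the Hive catalog backed by Metastore
    schema="analytics",
    http_scheme="https",
)

cur = conn.cursor()
cur.execute("""
    SELECT event_date, event_type, events_count
    FROM daily_event_counts
    WHERE event_date >= DATE '2024-01-01'
    ORDER BY event_date
""")

for row in cur.fetchall():
    print(row)
```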
Some current aspects of using Metastore
There are currently several important aspects of using managed Metastore clusters in Yandex Cloud that you should consider when designing and deploying solutions.
The first one is service availability. Currently, Metastore only works with Yandex Object Storage and does not support connections to external S3-compatible storage. This means it cannot be used with data stored in services such as Amazon S3 or in MinIO deployed in a private data center.
Additionally, Metastore is only accessible via an internal VPC IP address and does not have a public DNS name. This provides additional security but requires all services connecting to Metastore to be in the same VPC or to have network access configured.
One more aspect to consider is network security. For Metastore to work properly, you need to configure security groups to allow the required network traffic. Otherwise, clusters may end up in the DEAD state, which makes it difficult to diagnose issues (see the security group setup guide).
For more information about Metastore, see the Apache® documentation.