Getting started with Apache Hive™ Metastore
In Yandex MetaData Hub, you can create Apache Hive™ Metastore clusters and use them to work with Yandex Data Processing clusters.
Getting started
- Navigate to the management console and log in to Yandex Cloud, or sign up if you do not have an account yet.
- If you do not have a folder yet, create one:
  - In the management console, select the appropriate cloud from the list on the left.
  - At the top right, click Create folder.
  - Give your folder a name. The naming requirements are as follows:
    - It must be from 2 to 63 characters long.
    - It can only contain lowercase Latin letters, numbers, and hyphens.
    - It must start with a letter and cannot end with a hyphen.
  - Optionally, specify the description for your folder.
  - Select Create a default network. This creates a network with subnets in each availability zone, along with a default security group that allows all network traffic.
  - Click Create.
- To link a service account to an Apache Hive™ Metastore cluster, assign the iam.serviceAccounts.user role or higher to your Yandex Cloud account.
  Note
  If you are unable to manage roles, contact your cloud or organization administrator.
- Set up a NAT gateway in the subnet that will host the Apache Hive™ Metastore and Yandex Data Processing clusters.
- Create a security group for the Apache Hive™ Metastore and Yandex Data Processing clusters.
- Add Apache Hive™ Metastore cluster rules to the security group:
  - For incoming client traffic:
    - Port range: 30000-32767
    - Protocol: Any
    - Source: CIDR
    - CIDR blocks: 0.0.0.0/0
  - For incoming load balancer traffic:
    - Port range: 10256
    - Protocol: Any
    - Source: Load balancer healthchecks
- Add Yandex Data Processing cluster rules to the security group:
  - One inbound and one outbound rule for service traffic:
    - Port range: 0-65535
    - Protocol: Any
    - Source/Destination name: Security group
    - Security group: Current
  - A separate rule for outgoing HTTPS traffic to all addresses. This allows you to use Yandex Object Storage buckets, UI Proxy, and autoscaling of Yandex Data Processing subclusters.
    - Port range: 443
    - Protocol: TCP
    - Destination name: CIDR
    - CIDR blocks: 0.0.0.0/0
  - A rule that allows access to NTP servers for time syncing:
    - Port range: 123
    - Protocol: UDP
    - Destination name: CIDR
    - CIDR blocks: 0.0.0.0/0
- Create a service account with the dataproc.agent, dataproc.provisioner, and managed-metastore.integrationProvider roles.
- Create an Object Storage bucket to interact with a Yandex Data Processing cluster.
- In the network you created earlier, create a Yandex Data Processing cluster. In the settings, specify:
  - SPARK and YARN services.
  - Service account you created earlier.
  - spark:spark.sql.hive.metastore.sharedPrefixes property with the com.amazonaws,ru.yandex.cloud value. It is required for PySpark jobs and integration with Apache Hive™ Metastore.
  - Bucket you created earlier.
  - Security group you configured earlier.
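The security group rules listed in the steps above can be summarized as data, which makes it easier to review them before entering them in the console. The following is a minimal Python sketch, not a Yandex Cloud API: the `Rule` class, its field names, the `port_range` helper, and the peer labels `"self"` and `"lb-healthchecks"` are all illustrative assumptions, and peers are compared as plain strings rather than by real CIDR matching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One security group rule. Field names are illustrative, not a Yandex Cloud schema."""
    direction: str   # "ingress" or "egress"
    ports: range     # ports covered by the rule
    protocol: str    # "ANY", "TCP", or "UDP"
    peer: str        # "0.0.0.0/0", "self" (current security group), or "lb-healthchecks"
    purpose: str


def port_range(first: int, last: Optional[int] = None) -> range:
    """Build a range that includes both the first and the last port."""
    return range(first, (first if last is None else last) + 1)


RULES = [
    # Apache Hive Metastore cluster rules
    Rule("ingress", port_range(30000, 32767), "ANY", "0.0.0.0/0", "client traffic"),
    Rule("ingress", port_range(10256), "ANY", "lb-healthchecks", "load balancer health checks"),
    # Yandex Data Processing cluster rules
    Rule("ingress", port_range(0, 65535), "ANY", "self", "service traffic"),
    Rule("egress", port_range(0, 65535), "ANY", "self", "service traffic"),
    Rule("egress", port_range(443), "TCP", "0.0.0.0/0", "HTTPS"),
    Rule("egress", port_range(123), "UDP", "0.0.0.0/0", "NTP"),
]


def is_allowed(direction: str, port: int, protocol: str, peer: str) -> bool:
    """Return True if at least one rule in RULES matches the described traffic."""
    return any(
        rule.direction == direction
        and port in rule.ports
        and rule.protocol in ("ANY", protocol)
        and rule.peer == peer
        for rule in RULES
    )
```

In this simplified model, `is_allowed("egress", 443, "TCP", "0.0.0.0/0")` is true, while `is_allowed("egress", 9083, "TCP", "0.0.0.0/0")` is false, because none of the six rules above covers port 9083.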
Create an Apache Hive™ Metastore cluster
- In the management console, go to the folder you created earlier.
- Select Yandex MetaData Hub.
- In the left-hand panel, select Metastore.
- Click Create cluster.
- Enter a name for the cluster. It must be unique within the folder.
- Select a service account under which the Apache Hive™ Metastore cluster will interact with other Yandex Cloud services, or create a new one.
- Under Network settings, select the network and subnet you created earlier. Specify the security group you configured previously.
- Optionally, under Logging, enable logging, select the minimum logging level, and specify the folder or log group.
- If required, enable protection of the cluster from accidental deletion by a user.
- Click Create.
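Before filling in the console form, it can help to collect the values the steps above ask for and check that nothing is missing. The sketch below is a local checklist only and does not call any Yandex Cloud API. The class name, field names, and the rule that enabled logging needs a destination are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetastoreClusterSettings:
    """Values requested by the console form. All field names are hypothetical."""
    name: str
    service_account: str
    network: str
    subnet: str
    security_group: str
    log_level: Optional[str] = None        # None means logging stays disabled
    log_destination: Optional[str] = None  # folder or log group
    deletion_protection: bool = False


def missing_settings(settings: MetastoreClusterSettings, existing_names: set) -> list:
    """List the problems found in this simplified model of the form."""
    problems = []
    for field_name in ("name", "service_account", "network", "subnet", "security_group"):
        if not getattr(settings, field_name).strip():
            problems.append(f"{field_name} is empty")
    # The cluster name must be unique within the folder.
    if settings.name in existing_names:
        problems.append("name is already used in this folder")
    # Assumption: when logging is enabled, a folder or log group must be given.
    if settings.log_level is not None and not settings.log_destination:
        problems.append("logging is enabled but no folder or log group is set")
    return problems
```

An empty list from `missing_settings` means every required value in this checklist has been provided.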
Connect the Apache Hive™ Metastore cluster to the Yandex Data Processing cluster
- In the Yandex Data Processing cluster you created earlier, specify the following property:
  spark:spark.hive.metastore.uris : thrift://<Apache Hive™ Metastore_cluster_IP_address>:9083
  To find out the Apache Hive™ Metastore cluster IP address, select Yandex MetaData Hub in the management console and then select Metastore in the left-hand panel. Copy the value from the IP address column for your cluster.
- Add the following outgoing traffic rule to the security group:
  - Port range: 9083
  - Protocol: Any
  - Destination name: CIDR
  - CIDR blocks: 0.0.0.0/0
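The property value above follows a fixed pattern, so it can be built and sanity-checked from the cluster IP address. The following is a minimal Python sketch using only the standard library. The function names are hypothetical, an IPv4 address is assumed, and the reachability check only tests that a TCP connection to port 9083 opens from the machine where the script runs; it does not speak the Thrift protocol.

```python
import ipaddress
import socket

METASTORE_PORT = 9083  # port used in the property and in the outgoing traffic rule
PROPERTY_KEY = "spark:spark.hive.metastore.uris"


def metastore_property(ip: str) -> tuple:
    """Return the (key, value) pair to set on the Yandex Data Processing cluster."""
    # IPv4Address raises a ValueError subclass if the address is malformed.
    address = ipaddress.IPv4Address(ip)
    return PROPERTY_KEY, f"thrift://{address}:{METASTORE_PORT}"


def can_reach(ip: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the Metastore port can be opened."""
    try:
        with socket.create_connection((ip, METASTORE_PORT), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `metastore_property("10.0.0.5")` yields the key `spark:spark.hive.metastore.uris` with the value `thrift://10.0.0.5:9083`; the address here is a placeholder, not a real cluster.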
What's next
- Work with tables using Apache Hive™ Metastore.
- Use Apache Hive™ Metastore to move data between Yandex Data Processing clusters.
- Store tabular data in Apache Hive™ Metastore when using Apache Airflow™.
- Export and import Hive metadata in an Apache Hive™ Metastore cluster.
Apache® and Apache Hive™