Getting started with Hive Metastore
Note
This feature is in the Preview stage.
In Yandex MetaData Hub, you can create Hive Metastore clusters and use them to work with Yandex Data Processing clusters.
Getting started
- Go to the management console and log in to Yandex Cloud, or sign up if you have not done so yet.
- If you do not have a folder yet, create one:
  - In the management console, select the appropriate cloud from the list on the left.
  - At the top right, click Create folder.
  - Enter the folder name. The naming requirements are as follows:
    - The name must be from 3 to 63 characters long.
    - It may contain lowercase Latin letters, numbers, and hyphens.
    - The first character must be a letter and the last character cannot be a hyphen.
  - (Optional) Enter a description of the folder.
  - Select Create a default network. This will create a network with subnets in each availability zone. A default security group that allows all network traffic will also be created in this network.
  - Click Create.
- To link your service account to a Metastore cluster, make sure your Yandex Cloud account has the iam.serviceAccounts.user role or higher.
- Set up a NAT gateway in the subnet that will host the Metastore and Yandex Data Processing clusters.
- Create a security group for the Metastore and Yandex Data Processing clusters.
- Add Metastore cluster rules to the security group:
  - For incoming client traffic:
    - Port range: 30000-32767
    - Protocol: Any
    - Source: CIDR
    - CIDR blocks: 0.0.0.0/0
  - For incoming load balancer traffic:
    - Port range: 10256
    - Protocol: Any
    - Source: Load balancer healthchecks
- Add Yandex Data Processing cluster rules to the security group:
  - One rule for inbound and another one for outbound service traffic:
    - Port range: 0-65535
    - Protocol: Any
    - Source/Destination name: Security group
    - Security group: Current
  - A separate rule for outgoing HTTPS traffic to all addresses. This will allow you to use Yandex Object Storage buckets, UI Proxy, and autoscaling of Yandex Data Processing subclusters.
    - Port range: 443
    - Protocol: TCP
    - Destination name: CIDR
    - CIDR blocks: 0.0.0.0/0
  - Rule that allows access to NTP servers for time syncing:
    - Port range: 123
    - Protocol: UDP
    - Destination name: CIDR
    - CIDR blocks: 0.0.0.0/0
- Create a service account with the dataproc.agent, dataproc.provisioner, and managed-metastore.integrationProvider roles.
- Create an Object Storage bucket to interact with a Yandex Data Processing cluster.
- In the network you created earlier, create a Yandex Data Processing cluster. In the settings, set:
  - SPARK and YARN services.
  - Service account you previously created.
  - spark:spark.sql.hive.metastore.sharedPrefixes property to com.amazonaws,ru.yandex.cloud. Required for PySpark jobs and integration with Metastore.
  - Bucket you created earlier.
  - Security group you configured earlier.
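
The security group rules listed in the steps above can be written down as plain data, which makes it easier to check them before entering them in the management console. The following is a minimal sketch, not a Yandex Cloud API: the Rule class, the is_allowed helper, and the peer labels "self" and "lb-healthchecks" are hypothetical names introduced here for illustration, and peers are compared by exact label rather than by real CIDR matching.

```python
from dataclasses import dataclass


# Hypothetical local model of the rules above; this is not a Yandex Cloud API.
@dataclass(frozen=True)
class Rule:
    direction: str  # "ingress" or "egress"
    ports: tuple    # inclusive (first, last) port range
    protocol: str   # "ANY", "TCP", or "UDP"
    peer: str       # CIDR block, "self" (current security group), or "lb-healthchecks"


RULES = [
    # Metastore cluster rules
    Rule("ingress", (30000, 32767), "ANY", "0.0.0.0/0"),
    Rule("ingress", (10256, 10256), "ANY", "lb-healthchecks"),
    # Yandex Data Processing cluster rules
    Rule("ingress", (0, 65535), "ANY", "self"),
    Rule("egress", (0, 65535), "ANY", "self"),
    Rule("egress", (443, 443), "TCP", "0.0.0.0/0"),
    Rule("egress", (123, 123), "UDP", "0.0.0.0/0"),
]


def is_allowed(rules, direction, port, protocol, peer):
    """Return True if any rule matches the direction, port, protocol, and peer label."""
    for rule in rules:
        first, last = rule.ports
        if (rule.direction == direction
                and first <= port <= last
                and rule.protocol in ("ANY", protocol)
                and rule.peer == peer):
            return True
    return False
```

For example, is_allowed(RULES, "egress", 443, "TCP", "0.0.0.0/0") is True in this model, while the same call with port 9083 is False, because none of the rules above covers outgoing traffic on that port to external addresses.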
Create a Metastore cluster
- In the management console, go to the folder you created earlier.
- Select Yandex MetaData Hub.
- In the left-hand panel, select Metastore.
- Click Create cluster.
- Enter a name for the cluster. It must be unique within the folder.
- Select a service account under which the Metastore cluster will interact with other Yandex Cloud services, or create a new one.
- Under Network settings, select the network and subnet you created earlier. Specify the security group you configured previously.
- Optionally, under Logging, enable logging, select the minimum logging level, and specify the folder or log group.
- If required, enable protection of the cluster from accidental deletion by a user.
- Click Create.
Connect the Metastore cluster to the Yandex Data Processing cluster
- In the Yandex Data Processing cluster you created earlier, specify the following property:
  spark:spark.hive.metastore.uris : thrift://<Metastore_cluster_IP_address>:9083
  To find out the Metastore cluster IP address, select Yandex MetaData Hub in the management console, then Metastore in the left-hand panel, and open the relevant cluster. You will see the cluster IP address under General information.
- Add the following outgoing traffic rule to the security group:
  - Port range: 9083
  - Protocol: Any
  - Source: CIDR
  - CIDR blocks: 0.0.0.0/0
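
The connection property and the port 9083 rule can be checked with a short script. This is a minimal sketch under stated assumptions: the helper names metastore_property and can_reach are hypothetical, the Metastore address is assumed to be IPv4, and the reachability check is only meaningful when run from a host inside the same network, such as a Yandex Data Processing cluster host. It uses the Python standard library only.

```python
import ipaddress
import socket

METASTORE_PORT = 9083  # Thrift port from the property and security group rule above


def metastore_property(ip_address: str) -> dict:
    """Build the cluster property for a Metastore cluster IPv4 address."""
    ipaddress.IPv4Address(ip_address)  # raises ValueError for a malformed address
    return {"spark:spark.hive.metastore.uris": f"thrift://{ip_address}:{METASTORE_PORT}"}


def can_reach(ip_address: str, port: int = METASTORE_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the address and port succeeds."""
    try:
        with socket.create_connection((ip_address, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling metastore_property with the address shown under General information yields the key and value to enter in the cluster properties. If can_reach returns False for that address, the outgoing rule for port 9083 is a reasonable first thing to check.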