Connecting to a Yandex Data Processing cluster
After creating a Yandex Data Processing cluster, you can connect to hosts of subclusters:
- Over the internet, if you have public access configured for your subcluster.
- From a Yandex Cloud virtual machine located on the same cloud network.
Configuring security groups
Security groups might block cluster connections. If this happens, edit the group rules.
Alert
When setting up security groups, do not change the rules for service traffic. This may cause the cluster to fail.
Rule settings depend on the connection method you select:
- To connect to subcluster hosts with public access from cloud networks and the internet, configure the cluster security groups to allow incoming traffic from any IP address on port 22. To do this, create the following rule for incoming traffic:
  - Port range: 22.
  - Protocol: TCP.
  - Source: CIDR.
  - CIDR blocks: 0.0.0.0/0.
- To connect to a cluster from an intermediate VM:
  - Configure the security group where the VM is located to allow connections to the VM and traffic between the VM and subcluster hosts. To do this, create the following rules:
    - For incoming traffic:
      - Port range: 22.
      - Protocol: TCP.
      - Source: CIDR.
      - CIDR blocks: 0.0.0.0/0.
    - For outgoing traffic:
      - Port range: 22.
      - Protocol: TCP.
      - Destination name: CIDR.
      - CIDR blocks: Address range of the subnet in which the cluster hosts are located. If subclusters are in different subnets, create this rule for each subnet.
  - Configure the cluster security groups to allow incoming traffic on port 22 from the security group where the VM is located. To do this, create the following rule for incoming traffic:
    - Port range: 22.
    - Protocol: TCP.
    - Source: CIDR.
    - CIDR blocks: Address range of the subnet in which the cluster hosts are located.
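The rules for the intermediate VM's security group can also be written as command-line arguments. The sketch below only assembles and prints commands; it does not change anything. It assumes the Yandex Cloud CLI (`yc`) and the `--add-rule` syntax of `yc vpc security-group update-rules`, so check `yc vpc security-group update-rules --help` before running what it prints. The group name `my-vm-sg` and the subnet ranges are placeholders, not values from this guide.

```shell
#!/bin/sh
# Build one --add-rule value for a TCP rule (assumed yc CLI rule syntax).
# Arguments: direction (ingress|egress), port, CIDR block.
sg_rule_spec() {
    printf 'direction=%s,port=%s,protocol=tcp,v4-cidrs=[%s]\n' "$1" "$2" "$3"
}

# Print the rules for the VM's security group: SSH in from any address,
# SSH out to every cluster subnet passed as an argument.
vm_group_rules() {
    sg_rule_spec ingress 22 0.0.0.0/0
    for cidr in "$@"; do
        sg_rule_spec egress 22 "$cidr"
    done
}

# Hypothetical group name and subnet ranges; replace with your own.
SG_NAME="my-vm-sg"
vm_group_rules 10.128.0.0/24 10.129.0.0/24 | while read -r rule; do
    # Dry run: print each command instead of executing it.
    echo "yc vpc security-group update-rules --name $SG_NAME --add-rule \"$rule\""
done
```

Passing one CIDR per subnet mirrors the requirement to create the outgoing rule for each subnet that contains cluster hosts.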
To use UI Proxy, add rules to the subcluster host's security group to allow incoming traffic on port 443:
- Port range: 443.
- Protocol: TCP.
- Source: CIDR.
- CIDR blocks: 0.0.0.0/0.
If the connection is established via an intermediate VM, add rules to the subcluster host's security group to allow connections via this VM:
- For incoming traffic:
  - Port range: 443.
  - Protocol: TCP.
  - Source: CIDR.
  - CIDR blocks: 0.0.0.0/0.
- For outgoing traffic:
  - Port range: 443.
  - Protocol: TCP.
  - Destination name: CIDR.
  - CIDR blocks: Address range of the subnet in which the subcluster host is located.
When using port forwarding, add rules to the intermediate VM security group that allow incoming and outgoing traffic on the ports of the required components:
- Port range: <component_port>.
  Port numbers for Yandex Data Processing components are shown in the table:

  Service                          Port
  HBase Master                     16010
  HBase REST                       8085
  HDFS Name Node                   9870
  Hive Server2                     10002
  Livy                             8998
  MapReduce Application History    19888
  Oozie                            11000
  Spark History                    18080
  YARN Application History         8188
  YARN Resource Manager            8088
  Zeppelin                         8890

- Protocol: TCP.
- Source: CIDR.
- CIDR blocks: 0.0.0.0/0.
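As an illustration of how the component ports are used, the sketch below maps a component name to its port from the table and prints an OpenSSH local port forwarding command (`ssh -N -L`) that goes through the intermediate VM. The function names `component_port` and `forward_command` are hypothetical helpers, and the host and login values are placeholders. The command is printed, not run.

```shell
#!/bin/sh
# Map a component name to its port, using the table in this section.
component_port() {
    case "$1" in
        "HBase Master") echo 16010 ;;
        "HBase REST") echo 8085 ;;
        "HDFS Name Node") echo 9870 ;;
        "Hive Server2") echo 10002 ;;
        "Livy") echo 8998 ;;
        "MapReduce Application History") echo 19888 ;;
        "Oozie") echo 11000 ;;
        "Spark History") echo 18080 ;;
        "YARN Application History") echo 8188 ;;
        "YARN Resource Manager") echo 8088 ;;
        "Zeppelin") echo 8890 ;;
        *) echo "unknown component: $1" >&2
           return 1 ;;
    esac
}

# Print an ssh local-forwarding command through the intermediate VM.
# Arguments: component name, cluster host FQDN, VM login (user@address).
forward_command() {
    port=$(component_port "$1") || return 1
    echo "ssh -N -L ${port}:$2:${port} $3"
}

# Placeholder host and login; replace with real values.
forward_command "YARN Resource Manager" "<host_FQDN>" "<vm_user>@<vm_public_IP>"
```

With such a tunnel open, the component's web interface would be reachable at `localhost` on the same port number, provided the security group rules above allow the traffic.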
Note
You can specify more detailed rules for your security groups, e.g., to allow traffic only in specific subnets.
Security groups must be configured correctly for all subnets where cluster hosts will be located.
For more information about security groups, see Security groups.
Yandex Data Processing host FQDN
To connect to a host, you need its fully qualified domain name (FQDN). You can view it in the management console:
- Go to the cluster page.
- Go to Hosts.
- Copy the Host FQDN column value.
Connecting to a Yandex Data Processing host via SSH
To connect to a Yandex Data Processing host, make sure the SSH key you specified when creating the cluster is available on the local machine or intermediate VM. You can either copy the key to the machine you connect from or make it available there through an SSH agent.
- If you do not have public access configured for your subcluster, connect to the intermediate VM via SSH.
- Launch the SSH agent:
    eval "$(ssh-agent -s)"
- Add the required key to the list of those available to the agent:
    ssh-add ~/.ssh/example-key
- Open an SSH connection to the Yandex Data Processing host by specifying its FQDN, or its IP address if public access is enabled for the host. For image version 2.0, specify the ubuntu user; for image version 1.4, specify the root user. For example:
    ssh ubuntu@rc1b-dataproc-m-fh4y4nur********.mdb.yandexcloud.net
  Result:
    ubuntu@rc1b-dataproc-m-fh4y4nur********:~#
- Make sure that Hadoop commands are executed, for example:
    hadoop version
  Result:
    Hadoop 2.8.5
    Subversion https://github.yandex-team.ru/mdb/bigtop.git -r 78508f2a4b4f3dc8b3d295ccb50a45a4********
    Compiled by robot-pgaas-ci on 2019-04-16T10:35Z
    Compiled with protoc 2.5.0
    From source with checksum 9942ca5c745417c14e31883********
    This command was run using /usr/lib/hadoop/hadoop-common-2.8.5.jar
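The user selection and the agent check from the steps above can be sketched as a small script. This is only an illustration: `dataproc_ssh_user` and `dataproc_ssh_command` are hypothetical helper names, only the two image versions named in this guide are handled, and the connection command is printed, not executed.

```shell
#!/bin/sh
# Pick the login user for an image version, as listed in the steps above.
dataproc_ssh_user() {
    case "$1" in
        2.0) echo ubuntu ;;
        1.4) echo root ;;
        *) echo "no login user is listed here for image version $1" >&2
           return 1 ;;
    esac
}

# Print the ssh command for a host FQDN or IP address.
# Arguments: image version, host FQDN or IP address.
dataproc_ssh_command() {
    user=$(dataproc_ssh_user "$1") || return 1
    echo "ssh ${user}@$2"
}

# ssh-add -l exits non-zero when no agent is reachable or it holds no keys.
if ssh-add -l >/dev/null 2>&1; then
    echo "agent has at least one key loaded"
else
    echo "agent is not running or has no keys; run ssh-agent and ssh-add first"
fi

dataproc_ssh_command 2.0 "rc1b-dataproc-m-fh4y4nur********.mdb.yandexcloud.net"
```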
Connecting from graphical IDEs
Connections were tested in the following environment:
- Ubuntu 20.04, DBeaver: 22.2.4
- macOS Monterey 12.7:
  - JetBrains DataGrip: 2023.3.4
  - DBeaver Community: 24.0.0
To use graphical IDEs, save a certificate
JetBrains DataGrip
- Create a data source:
  - Select File → New → Data Source → Apache Hive.
    Note
    Select the data source depending on the Yandex Data Processing component you are connecting to:
    - Hive: Select Apache Hive.
    - Spark: Select Apache Spark.
    The list of settings does not change.
  - Specify the connection parameters on the General tab:
    - Host: FQDN of the cluster master host or its public IP address.
    - If connecting for the first time, click Download to download the connection driver.
  - On the SSH/SSL tab:
    - Enable the Use SSL setting and specify the SSL connection parameters:
      - CA file: Downloaded SSL certificate for the connection.
      - Client key file, Client key password: File with the private key required to connect to the Yandex Data Processing cluster and its password.
    - (Optional) To connect via an intermediate VM, configure the parameters of the SSH tunnel:
      - Select Use SSH tunnel, create an SSH configuration, and specify the parameters:
        - Host: IP address of the VM.
        - User name: VM user's name.
        - Private key file, Passphrase: File with the private key required to connect to the VM and its password.
      - Click Test Connection to test the connection to the VM from DataGrip.
      - Click OK to save the configuration.
- Click Test Connection to test the connection. If the connection is successful, you will get the OK connection status and information about the DBMS and driver.
- Click OK to save the data source.
DBeaver
- Download the SSH key to the local machine or VM to connect to a Yandex Data Processing cluster.
- Create a new DB connection:
  - In the Database menu, select New connection.
  - Select a data source from the DB list depending on the configuration of the Yandex Data Processing cluster you are connecting to:
    - If the cluster uses Hive, select Apache Hive.
    - If the cluster has only Spark, with the Thrift server enabled, select Apache Spark.
  - Click Next.
  - On the SSH tab, enable the Use SSH tunnel setting and specify the parameters:
    - Host/IP: FQDN (to connect via an intermediate VM) or public IP address of the master host.
    - Username: Enter the username:
      - For version 2.0: ubuntu
      - For version 1.4: root
    - Authentication method: Public key.
    - Secret key: Path to the cluster private key file.
    - Passphrase: Private key password.
    - (Optional) To connect via an intermediate VM, enable the Use jump server setting and specify the parameters:
      - Host/IP: Public IP address of the VM for connection.
      - Username: Username for connecting to the VM.
      - Authentication method: Public key.
      - Secret key: Path to the VM private key file.
      - Passphrase: Private key password.
- Click Test connection ... to test the connection. If the connection is successful, you will see the connection status and information about the DBMS and driver.
- Click Finish to save the database connection settings.
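If the test connection fails, it can help to try the same tunnel parameters outside the IDE. The sketch below prints an OpenSSH client configuration that uses `ProxyJump` to reach the master host through the intermediate VM, with a separate key for each hop. The helper name `jump_config`, the host aliases `dataproc-vm` and `dataproc-master`, and the key path `~/.ssh/vm-key` are assumptions made for illustration.

```shell
#!/bin/sh
# Print an OpenSSH client config built from the jump server and tunnel values.
# Arguments: VM address, VM user, VM key path,
#            master host FQDN, cluster user, cluster key path.
# The aliases dataproc-vm and dataproc-master are hypothetical.
jump_config() {
    cat <<EOF
Host dataproc-vm
    HostName $1
    User $2
    IdentityFile $3

Host dataproc-master
    HostName $4
    User $5
    IdentityFile $6
    ProxyJump dataproc-vm
EOF
}

# Placeholder values; replace with your own addresses and key paths.
jump_config "<vm_public_IP>" "<vm_user>" "~/.ssh/vm-key" \
    "<master_FQDN>" ubuntu "~/.ssh/example-key"
```

Saving the output to a file and running `ssh -F <file> dataproc-master` should open a session on the master host if the keys, usernames, and security group rules are correct.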