Using a Yandex Lockbox secret in a PySpark job to connect to Yandex Managed Service for PostgreSQL
You can use a Yandex Lockbox secret to connect to a Yandex Managed Service for PostgreSQL cluster from a PySpark job in Yandex Managed Service for Apache Spark™. To do this, grant the Yandex Managed Service for Apache Spark™ cluster service account access to the secret. Yandex Connection Manager creates the secret automatically when creating the Managed Service for PostgreSQL user.
For the PySpark job, we use a Python script stored in a Yandex Object Storage bucket. The script retrieves the user password from the secret and uses it to connect to the Managed Service for PostgreSQL cluster.
To get information from the secret and connect to the Managed Service for PostgreSQL cluster:
- Set up your infrastructure.
- Configure permissions for access to the user's secret.
- Prepare a PySpark job.
- Run the job and check the result.
If you no longer need the resources you created, delete them.
Getting started
Sign up for Yandex Cloud and create a billing account:
- Navigate to the management console and log in to Yandex Cloud or create a new account.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and that it has the `ACTIVE` or `TRIAL_ACTIVE` status. If you do not have a billing account, create one and link a cloud to it.

If you have an active billing account, you can create or select a folder for your infrastructure on the cloud page.

Learn more about clouds and folders here.
Required paid resources
- Yandex Managed Service for Apache Spark™ cluster: computing resources of cluster components (see Yandex Managed Service for Apache Spark™ pricing).
- Managed Service for PostgreSQL cluster, which includes the use of computing resources allocated to hosts, storage and backup size (see Managed Service for PostgreSQL pricing).
- Object Storage bucket: use of storage, data operations (see Object Storage pricing).
- NAT gateway: hourly use of the gateway and its outgoing traffic (see Virtual Private Cloud pricing).
- Public IP addresses if public access is enabled for cluster hosts (see Virtual Private Cloud pricing).
There is no charge for using Yandex Lockbox secrets created with Connection Manager.
Set up your infrastructure
- Create a service account named `spark-agent` and assign it the following roles:
  - `managed-spark.integrationProvider`: for Yandex Managed Service for Apache Spark™ integration with other services.
  - `storage.editor`: for access to the PySpark job file in the Object Storage bucket.
- Create an Object Storage bucket.
- Grant the `READ` permission for the new bucket to the `spark-agent` service account.
- Create a cloud network named `spark-network`. This will automatically create three subnets in different availability zones.
- In `spark-network`, create a security group named `spark-sg` for the Yandex Managed Service for Apache Spark™ cluster. It must allow outgoing TCP connections on:
  - Port `6432`, to connect to PostgreSQL.
  - Port `443`, to download Maven packages.
- In `spark-network`, create a security group named `pg-sg` for the Managed Service for PostgreSQL cluster. It must allow incoming TCP connections on port `6432`.
- Create a Yandex Managed Service for Apache Spark™ cluster with the following settings:
  - Service account: `spark-agent`
  - Network: `spark-network`
  - Security group: `spark-sg`
- Set up a NAT gateway for the subnet where the Yandex Managed Service for Apache Spark™ cluster was created. The NAT gateway is required to download the PostgreSQL JDBC driver from the Maven repository.
- Create a Managed Service for PostgreSQL cluster with the following settings:
  - Network: `spark-network`
  - Security group: `pg-sg`
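With the security groups in place, you can sanity-check connectivity before running any jobs. Below is a minimal Python sketch, assuming it is run from a host inside `spark-network` and that `<PostgreSQL_host_FQDN>` is replaced with a real host name; it simply tests whether a TCP port accepts connections (e.g., port 6432 for PostgreSQL, or port 443 on a Maven mirror for package downloads):

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, timeouts, and refused connections
        return False


# Usage (substitute a real host FQDN):
# port_is_reachable("<PostgreSQL_host_FQDN>", 6432)
```

A `False` result usually points at a missing security group rule or a routing problem (e.g., no NAT gateway for outbound traffic) rather than at the database itself.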
Configure permissions for access to a Managed Service for PostgreSQL user's secret
**Management console**

- In the management console, select the folder where you created the infrastructure.
- Go to Managed Service for PostgreSQL.
- Click the name of your cluster and select the Users tab.
- Click the connection ID next to the user you need. This will take you to the Connection Manager connection page for the selected user.
- On the page that opens, click the secret ID. This will open the page of the Yandex Lockbox secret storing the PostgreSQL user password.
- In the left-hand panel, select Access bindings and click Assign roles.
- Select the `spark-agent` service account to grant it access to the secret.
- Click Add role and select `lockbox.payloadViewer`.
- Click Save.
**CLI**

If you do not have the Yandex Cloud CLI yet, install and initialize it.
The folder used by default is the one specified when you created the CLI profile. To change the default folder, use the `yc config set folder-id <folder_ID>` command. You can also set a different folder for any specific command using the `--folder-name` or `--folder-id` parameter.
To configure access to the Managed Service for PostgreSQL user secret:
- Get the secret ID from the Connection Manager connection:

  ```bash
  yc metadata-hub connection-manager connection list \
    --mdb-cluster-id <PostgreSQL_cluster_ID>
  ```

  You can get the cluster ID with the list of clusters in the folder.

- Configure access to the user secret:

  ```bash
  yc lockbox secret add-access-binding <secret_ID> \
    --role lockbox.payloadViewer \
    --service-account-name spark-agent
  ```

  The `spark-agent` service account will get access to the Managed Service for PostgreSQL user secret.
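If you script these steps, adding `--format json` to the listing command makes the output machine-readable. Below is a hedged Python sketch of extracting a secret ID from such output; the field names in `sample_output` are illustrative guesses, not the documented schema, so check the actual command output before relying on them:

```python
import json

# Hypothetical example of `yc metadata-hub connection-manager connection list
# --format json` output; real field names may differ from these guesses.
sample_output = """
[
  {
    "id": "cbq1abcdexample",
    "name": "pg-user-connection",
    "lockbox_secret_id": "e6q0example"
  }
]
"""

connections = json.loads(sample_output)
secret_ids = [c["lockbox_secret_id"] for c in connections]
print(secret_ids[0])  # e6q0example
```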
Prepare a PySpark job
- Create a local file named `job-mpg-connection-with-secret.py` and paste the following script into it:

  ```python
  import json
  import os
  import urllib.request

  from pyspark.sql import SparkSession


  def get_secret(secret_id: str) -> dict:
      """Get a secret from Yandex Lockbox."""
      # Get an IAM token via the metadata service
      metadata_host = os.environ.get("YC_METADATA_ADDR", "169.254.169.254")
      token_url = f"http://{metadata_host}/computeMetadata/v1/instance/service-accounts/default/token"
      req = urllib.request.Request(token_url)
      req.add_header("Metadata-Flavor", "Google")
      with urllib.request.urlopen(req, timeout=10) as resp:
          token = json.loads(resp.read())["access_token"]

      # Get the Yandex Lockbox secret payload
      secret_url = f"https://payload.lockbox.api.cloud.yandex.net/lockbox/v1/secrets/{secret_id}/payload"
      req = urllib.request.Request(secret_url)
      req.add_header("Authorization", f"Bearer {token}")
      with urllib.request.urlopen(req, timeout=30) as resp:
          payload = json.loads(resp.read())
      return {e["key"]: e["textValue"] for e in payload["entries"]}


  # Retrieve the secret
  secrets = get_secret("<secret_ID>")

  # Initialize a Spark session
  spark = SparkSession.builder.appName("db-reader").getOrCreate()

  # Connect to the Managed Service for PostgreSQL cluster
  df = spark.read \
      .format("jdbc") \
      .option("url", "jdbc:postgresql://<host_FQDN>:6432/<DB_name>") \
      .option("driver", "org.postgresql.Driver") \
      .option("user", "<username>") \
      .option("password", secrets["postgresql_password"]) \
      .option("dbtable", "<table_name>") \
      .load()

  # Show the first 20 rows of the table
  df.show()
  ```
- In the script file, specify the following settings:
  - Secret ID.
  - Host FQDN to connect to the Managed Service for PostgreSQL cluster.
  - Database name.
  - Table name.
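The key step in the script is turning the Lockbox payload (a list of entries) into a plain dictionary. The standalone sketch below reproduces that transformation with a mock payload in place of the real API response; the entry keys here are made up, and the tutorial script expects a key named `postgresql_password`, so the actual keys depend on what Connection Manager stored in your secret:

```python
# Mock of a Yandex Lockbox payload response; the real payload comes from
# the /lockbox/v1/secrets/<secret_ID>/payload endpoint, and the keys below
# are illustrative only.
payload = {
    "entries": [
        {"key": "user", "textValue": "app_user"},
        {"key": "password", "textValue": "s3cr3t"},
    ]
}

# Same transformation as in get_secret(): entry list -> {key: textValue} dict
secrets = {e["key"]: e["textValue"] for e in payload["entries"]}

print(secrets["password"])  # s3cr3t
```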
Run the job and check the result
**Management console**

- Upload the `job-mpg-connection-with-secret.py` file to the bucket.
- In the Yandex Managed Service for Apache Spark™ cluster, create a job with the following settings:
  - Job type: PySpark
  - Main python file: `s3a://<bucket_name>/job-mpg-connection-with-secret.py`
  - Packages: `org.postgresql:postgresql:42.7.3`
- Wait for the PySpark job you created to change its status to Done.
- The job logs will display rows from the table specified in the script.
**CLI**

- Upload the `job-mpg-connection-with-secret.py` file to the bucket:

  ```bash
  yc storage s3api put-object \
    --body <local_file_path> \
    --bucket <bucket_name> \
    --key "job-mpg-connection-with-secret.py"
  ```

- Run the job:

  ```bash
  yc managed-spark job create-pyspark \
    --cluster-id <cluster_ID> \
    --name "<job_name>" \
    --main-python-file-uri s3a://<bucket_name>/job-mpg-connection-with-secret.py \
    --packages org.postgresql:postgresql:42.7.3
  ```

  You can get the cluster ID with the list of clusters in the folder.

  The command will return the job ID. Save it, as you will need it to get the job execution logs.

- Get the job execution logs:

  ```bash
  yc managed-spark job log <job_ID> \
    --cluster-id <cluster_ID>
  ```

  The job logs will display rows from the table specified in the script.
Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:
- Yandex Managed Service for Apache Spark™ cluster.
- Managed Service for PostgreSQL cluster.
- Object Storage bucket. Before deleting the bucket, delete all its objects.
- NAT gateway.