Getting data from external sources using named queries
You can use named queries to retrieve data from external systems through PXF in Yandex MPP Analytics for PostgreSQL.
A named query is a prebuilt SQL query stored in the mdb_toolkit.pxf_named_queries table of the Yandex MPP Analytics for PostgreSQL cluster's system database. You specify the query name reference when creating an external table.
Named queries allow you to join tables and aggregate data in an external source when creating a view is not an option. With all computations performed externally, the cluster operates more efficiently.
Named queries can be used with data sources connected to a Yandex MPP Analytics for PostgreSQL cluster through a JDBC connector.
To get data from an external source using a named query:
- Prepare the test data.
- Create an external data source.
- Create a named query.
- Create an external table and get data from it.
If you no longer need the resources you created, delete them.
Required paid resources
The support cost for this solution includes:
- Fee for a Yandex MPP Analytics for PostgreSQL cluster, specifically computing resources allocated to cluster hosts, as well as the size of storage and backups (see Yandex MPP Analytics for PostgreSQL pricing).
- Fee for a Yandex Managed Service for PostgreSQL cluster, specifically computing resources allocated to cluster hosts, as well as the size of storage and backups (see Managed Service for PostgreSQL pricing).
- Fee for hourly usage of NAT gateways and outgoing traffic they handle (see Yandex Virtual Private Cloud pricing).
- Fee for using public IP addresses (see Virtual Private Cloud pricing).
Getting started
Set up the infrastructure:
-
Create a Yandex MPP Analytics for PostgreSQL cluster with any suitable configuration.
-
In the Yandex MPP Analytics for PostgreSQL cluster subnet, set up a NAT gateway and create a security group allowing all incoming and outgoing traffic from all addresses.
-
Create a Managed Service for PostgreSQL cluster with publicly available hosts.
-
In the Managed Service for PostgreSQL cluster subnet, set up a NAT gateway and create a security group allowing all incoming and outgoing traffic from all addresses.
-
If you do not have Terraform yet, install it.
-
Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.
-
Configure and initialize a provider. There is no need to create a provider configuration file manually, you can download it
. -
Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
-
Download the
pxf-named-queries-infrastructure.tffile to the working directory. This file describes:- Networks.
- Subnets.
- NAT gateways.
- Security groups.
- Yandex MPP Analytics for PostgreSQL cluster.
- Managed Service for PostgreSQL cluster.
-
Specify the following in the file:
mgp_password: Greenplum® user password.mgp_version: Greenplum® version.mpg_password: PostgreSQL database user password.mpg_version: PostgreSQL version.
-
To verify that the config files are correct, run the command below:
terraform validateIf there are any errors in the configuration files, Terraform will point them out.
-
Create an infrastructure:
-
Run this command to view the planned changes:
terraform planIf you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
-
If everything looks correct, apply the changes:
-
Run this command:
terraform apply -
Confirm updating the resources.
-
Wait for the operation to complete.
-
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console
. -
Prepare the test data
-
Connect to the PostgreSQL database.
-
Create a table named
customersand populate it with test data:CREATE TABLE customers(id int, name text, city text); INSERT INTO customers VALUES (111, 'Benjamin', 'Chicago'); INSERT INTO customers VALUES (222, 'Mary', 'New York'); INSERT INTO customers VALUES (333, 'Paul', 'Boston'); INSERT INTO customers VALUES (444, 'Peter', 'Denver'); INSERT INTO customers VALUES (555, 'Natalie', 'Austin'); -
Check the result:
SELECT * FROM customers;id | name | city ----+---------+------------- 111 | Benjamin | Chicago 222 | Mary | New York 333 | Paul | Boston 444 | Peter | Denver 555 | Natalie | Austin -
Create a table named
ordersand populate it with test data:CREATE TABLE orders(customer_id int, amount int, year int); INSERT INTO orders VALUES (111, 12, 2018); INSERT INTO orders VALUES (222, 234, 2019); INSERT INTO orders VALUES (333, 34, 2018); INSERT INTO orders VALUES (444, 456, 2019); INSERT INTO orders VALUES (555, 56, 2021); -
Check the result:
SELECT * FROM orders;customer_id | amount | year ------------+--------+------ 111 | 12 | 2018 222 | 234 | 2019 333 | 34 | 2018 444 | 456 | 2019 555 | 56 | 2021
Create an external data source
In the Yandex MPP Analytics for PostgreSQL cluster, create an external data source with the following settings:
- Name:
pgserver. - Driver:
org.postgresql.Driver. - Url:
jdbc:postgresql://c-<PostgreSQL_cluster_ID>.rw.mdb.yandexcloud.net:6432/<DB_name_in_PostgreSQL_cluster>. - User:
<PostgreSQL_username>.
You can get the cluster ID with the list of clusters in the folder.
Create a named query
To create a named query, add it to the mdb_toolkit.pxf_named_queries table:
-
Connect to Greenplum®.
-
Run this query:
INSERT INTO mdb_toolkit.pxf_named_queries (pxf_profile, name, query) VALUES ( 'pgserver', 'my_query', 'SELECT c.name, o.year FROM customers c JOIN orders o ON c.id = o.customer_id;' );Where:
pgserver: Data source name.my_query: Named query name.
As an example of a named query, we will use one that joins the
customersandorderstables by theid = customer_idfield and returns the customer name (c.name) along with the year of their orders (o.year). -
Check the result:
SELECT * FROM mdb_toolkit.pxf_named_queries;pxf_profile | name | query ------------+-------------+-------------------------------------------- pgserver | my_query | SELECT c.name, o.year | | FROM customers c | | JOIN orders o ON c.id = o.customer_id;The
querycolumn should contain the text of the named query.
Create an external table and get data from it
-
Connect to Greenplum®.
-
Create an external table named
pxf_named_query. It will reference the data the named query retrieves from thecustomersandorderstables in the PostgreSQL DB:CREATE READABLE EXTERNAL TABLE pxf_named_query(name text, year int) LOCATION ('pxf://query:my_query?PROFILE=JDBC&SERVER=pgserver') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');Learn more about the SQL syntax for creating external tables here.
-
Get the data:
SELECT * FROM pxf_named_query;Result:
name | year --------+------ Benjamin | 2018 Mary | 2019 Paul | 2018 Peter | 2019 Natalie | 2021Tip
If the query fails and returns an error, wait a few minutes and try again. The changes may not have been applied yet.
Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:
-
In the terminal window, go to the directory containing the infrastructure plan.
Warning
Make sure the directory has no Terraform manifests with the resources you want to keep. Terraform deletes all resources that were created using the manifests in the current directory.
-
Delete resources:
-
Run this command:
terraform destroy -
Confirm deleting the resources and wait for the operation to complete.
All the resources described in the Terraform manifests will be deleted.
-
Greenplum® and Greenplum Database® are registered trademarks or trademarks of Broadcom Inc. in the United States and/or other countries.