Connecting to an external file server
Greenplum® Parallel File Servergpfdist
) is a utility used to read and write data from files located on remote servers. It is installed on each segment host of a Managed Service for Greenplum® cluster and provides parallel data loading by distributing it across segments either evenly or according to the distribution key set. This improves performance when handling large amounts of external data.
gpfdist
works with any delimited text files as well as compressed gzip and bzip2 files.
To read or write files on an external server:
- Install and run
gpfdist
as part of the Greenplum® Loader or Greenplum® Database package on the remote server hosting your target files. - Create an external table in the Greenplum® database to reference these files.
Running gpfdist
Note
Downloading and using software from the VMware website is not part of the Yandex Managed Service for Greenplum® Terms of Use
-
Download and install the Greenplum® Loader package from the VMware website
or the Greenplum® Database package from the Yandex Object Storage bucket by following this guide. -
Run
gpfdist
:gpfdist -d <data_file_directory> -p <connection_port> -l <log_file_path>
Where:
<data_file_directory>
: Local path to the directory with files for reading/writing data through an external table.<connection_port>
: Port the utility will use to operate. Default:8080
.<log_file_path>
: (Optional) Path to the filegpfdist
will write its logs to.
To distribute the network's workload, you can run several
gpfdist
instances on the same server by specifying different directories and connection ports, e.g.:gpfdist -d /var/load_files1 -p 8081 -l /home/gpadmin/log1 & \ gpfdist -d /var/load_files2 -p 8082 -l /home/gpadmin/log2 &
-
Make sure that files from the specified directory are available on the specified port from Yandex Cloud. To do this, run the following command from a VM in Yandex Cloud:
wget http://hostname:port/filename
Creating an external table using gpfdist
SQL query syntax to create an external table:
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
(<column_name> <data_type> [, ...])
LOCATION('gpfdist://<path_to_file_on_remote_server>' [, ...])
FORMAT '[TEXT|CSV|CUSTOM]';
Where:
<table_name>
: Name of the external table that will be created in the Greenplum® database.<column_name>
: Table column name.<data_type>
: Table column data type.<path_to_file_on_remote_server>
: Address of the servergpfdist
runs on, connection port, and file path. You can specify a particular file or a mask using the asterisk symbol (*).
The WRITABLE
option allows writing data to an external object. To read data from an external object, create an external table with the READABLE
option.
Examples for creating external tables
-
Creating an external table with data from
file.csv
on thehostname
server:CREATE EXTERNAL TABLE tableName (id int) LOCATION('gpfdist://hostname:8080/file.csv') FORMAT 'CSV' (DELIMITER ',');
-
Creating an external table consolidating data from all
txt
files, with|
for a separator and space forNULL
, on thehostname1
andhostname2
servers:CREATE EXTERNAL TABLE tableName (...) LOCATION('gpfdist://hostname1:8081/*.txt', 'gpfdist://hostname2:8081/*.txt') FORMAT 'TEXT' (DELIMITER '|' NULL ' ');
Greenplum® and Greenplum Database® are registered trademarks or trademarks of Broadcom Inc. in the United States and/or other countries.