Estimating the number of unique elements using the HyperLogLog algorithm
You can estimate the number of unique elements in a table column (set cardinality) using the built-in Managed Service for Greenplum® functions that implement the HyperLogLog algorithm
This algorithm converts the original set into a special object called HLL. The cardinality of the HLL object corresponds to the cardinality of the original set and can be estimated in constant time. This algorithm can quickly estimate the number of unique elements in a column, with the relative estimation error decreasing as the set size increases.
In Managed Service for Greenplum®, the HyperLogLog algorithm is implemented by these two functions:
gp_hyperloglog_accum(<column_name>)
: Applies the HyperLogLog algorithm to the provided column. This function returns an HLL with thegp_hyperloglog_estimator
data type.gp_hyperloglog_get_estimate(<HLL>)
: Converts the provided HLL to a numeric estimation.
In addition, you can use auxiliary functions:
-
gp_hyperloglog_merge(<HLL1>, <HLL2>)
: Returns the HLL resulting from conjunction (disjunction) of the source HLLs. -
gp_hyperloglog_add_item_agg_default(<HLL>, <value>)
: Checks if the value belongs to the set whose HLL was provided in the parameters:- If the value belongs to the set or is a
null
value, the function returns the HLL of the original set. - If the value does not belong to the set, the function returns the updated HLL with the specified value added to the set.
- If the value belongs to the set or is a
Getting the HLL of a column
Run this request:
SELECT gp_hyperloglog_accum(<column_name>) FROM <table_name>;
The output will contain a string.
Getting an estimate of the number of unique elements in a column
Run this request:
SELECT gp_hyperloglog_get_estimate(gp_hyperloglog_accum(<column_name>)) FROM <table_name>;
The output will contain a real number.
Getting an estimate of the number of unique elements in a conjunction of two columns
Run this request:
SELECT gp_hyperloglog_get_estimate(
gp_hyperloglog_merge(gp_hyperloglog_accum(<column_1>), gp_hyperloglog_accum(<column_2>)))
FROM <table_names>;
The output will contain a real number.
Including a new element in an existing estimate
Run this request:
SELECT gp_hyperloglog_get_estimate(
gp_hyperloglog_add_item_agg_default(gp_hyperloglog_accum(<column_name>), <new_element_value>))
FROM <table_name>;
If the new element is not unique for this column or has a null
value, the returned estimate will be the same as if you requested the number of unique elements in the column. Otherwise, the estimate will be incremented by approximately 1
.
Creating and using a pre-aggregation table
Pre-aggregation allows you to estimate the number of unique values in a multidimensional dataset in advance. This way, you can retrieve the estimate for any dimension whenever you need. Furthermore, it eliminates the need to calculate the HLL for each query, providing an estimate in constant time, O(1).
To use this method:
-
Create a pre-aggregation table and populate it with data:
CREATE TABLE agg_table(column1, column2, ..., estimate gp_hyperloglog_estimator); INSERT INTO agg_table(column1, column2, ..., estimate) SELECT column1, column2, ..., gp_hyperloglog_accum(estimated_col) FROM original_table GROUP BY column1, column2, ...;
Where:
agg_table
: Name of the pre-aggregation table.column1, column2, ...,
: Names of columns in the original table, other than the column you do the estimate for.estimate
: Name of the column with the estimate in the pre-aggregation table.estimated_col
: Name of the estimated column in the original table.original_table
: Name of the original table.
-
Create a new aggregate function
based ongp_hyperloglog_merge
:CREATE AGGREGATE gp_hyperloglog_merge(gp_hyperloglog_estimator) ( SFUNC=gp_hyperloglog_merge, STYPE=gp_hyperloglog_estimator, FINALFUNC=gp_hyperloglog_comp, COMBINEFUNC=gp_hyperloglog_merge );
-
Get an estimate of the number of unique elements for a particular column, e.g.,
column1
:SELECT column1, gp_hyperloglog_get_estimate(gp_hyperloglog_merge(estimate)) FROM agg_table GROUP BY column1;
Examples
The following examples use tables populated with integers using the generate_series function
Estimating the number of unique elements
Let's assume we have a table with a list of customers in different cities:
CREATE TABLE clients(city_id int, client_id int);
INSERT INTO clients (city_id, client_id)
VALUES (generate_series(1, 10), generate_series(1, 10000));
To estimate how many unique customers there are in each city, run the following query:
SELECT city_id, gp_hyperloglog_get_estimate(gp_hyperloglog_accum(client_id))
FROM clients
GROUP BY city_id
ORDER BY city_id;
city_id | gp_hyperloglog_get_estimate
---------+-----------------------------
1 | 988.212360350532
2 | 1002.0269799313
3 | 1006.27966975993
4 | 998.837197408661
5 | 1004.1525443185
6 | 995.648631718888
7 | 1003.08887802751
8 | 997.775023703087
9 | 1005.21614155965
10 | 1002.0269799313
(10 rows)
Estimating the number of unique elements in a conjunction of columns
Let's assume we have two tables:
CREATE TABLE viewers(index_id int, viewer_id int);
INSERT INTO viewers (index_id, viewer_id)
VALUES (generate_series(1, 1000), generate_series(1, 1000));
CREATE TABLE purchasers(index_id int, purchaser_id int);
INSERT INTO purchasers (index_id, purchaser_id)
VALUES (generate_series(1, 1000), generate_series(501, 1500));
In this example, viewer_id
is the ID of the product website page visitor, and purchaser_id
is the ID of the product buyer. Both of these values are customer IDs, i.e., they belong to the same set.
To estimate how many customers have bought the product after visiting the website, i.e., the number of non-unique elements in the conjunction of columns:
-
Estimate the number of unique entries in the conjunction of the
viewer_id
andpurchaser_id
columns:SELECT gp_hyperloglog_get_estimate( gp_hyperloglog_merge(gp_hyperloglog_accum(viewer_id), gp_hyperloglog_accum(purchaser_id))) FROM viewers, purchasers;
gp_hyperloglog_get_estimate ----------------------------- 1498.48297183718 (1 row)
-
Subtract the resulting estimate from the total number of elements in both columns:
2000 − 1498 = 502
.
Creating a pre-aggregation table
Let's assume we have a table with a list of customers in different cities and stores:
CREATE TABLE clients(city_id int, shop_id int, client_id int);
INSERT INTO clients (city_id, shop_id, client_id)
VALUES (generate_series(1, 5), generate_series(1, 10), generate_series(1, 10000));
To get a quick estimate of the number of unique customers by city or store:
-
Create a pre-aggregation table and populate it with data:
CREATE TABLE agg_clients(city_id int, shop_id int, unique_clients gp_hyperloglog_estimator); INSERT INTO agg_clients(city_id, shop_id, unique_clients) SELECT city_id, shop_id, gp_hyperloglog_accum(client_id) FROM clients GROUP BY city_id, shop_id;
-
Create an aggregate function based on
gp_hyperloglog_merge
:CREATE AGGREGATE gp_hyperloglog_merge(gp_hyperloglog_estimator) ( SFUNC=gp_hyperloglog_merge, STYPE=gp_hyperloglog_estimator, FINALFUNC=gp_hyperloglog_comp, COMBINEFUNC=gp_hyperloglog_merge );
-
Get an estimate of the number of unique customers:
-
By city:
SELECT city_id, gp_hyperloglog_get_estimate(gp_hyperloglog_merge(unique_clients)) FROM agg_clients GROUP BY city_id ORDER BY city_id;
city_id | gp_hyperloglog_get_estimate --------+----------------------------- 1 | 1984.47308188562 2 | 2002.54325709491 3 | 1993.50633730345 4 | 2009.32588394612 5 | 2012.71700838494 (5 rows)
-
By store:
SELECT shop_id, gp_hyperloglog_get_estimate(gp_hyperloglog_merge(unique_clients)) FROM agg_clients GROUP BY shop_id ORDER BY shop_id;
shop_id | gp_hyperloglog_get_estimate --------+----------------------------- 1 | 988.212360350532 2 | 1002.0269799313 3 | 1006.27966975993 4 | 998.837197408661 5 | 1004.1525443185 6 | 995.648631718888 7 | 1003.08887802751 8 | 997.775023703087 9 | 1005.21614155965 10 | 1002.0269799313 (10 rows)
-