Estimating the number of unique elements using the HyperLogLog algorithm

Written by

Updated at May 22, 2025

Getting the HLL of a column
Getting an estimate of the number of unique elements in a column
Getting an estimate of the number of unique elements in a conjunction of two columns
Including a new element in an existing estimate
Creating and using a pre-aggregation table
Examples

You can estimate the number of unique elements in a table column (set cardinality) using the built-in Managed Service for Greenplum® functions that implement the HyperLogLog algorithm.

This algorithm converts the original set into a special object called HLL. The cardinality of the HLL object corresponds to the cardinality of the original set and can be estimated in constant time. This algorithm can quickly estimate the number of unique elements in a column, with the relative estimation error decreasing as the set size increases.

In Managed Service for Greenplum®, the HyperLogLog algorithm is implemented by these two functions:

gp_hyperloglog_accum(<column_name>): Applies the HyperLogLog algorithm to the provided column. This function returns an HLL with the gp_hyperloglog_estimator data type.
gp_hyperloglog_get_estimate(<HLL>): Converts the provided HLL to a numeric estimation.

In addition, you can use auxiliary functions:

gp_hyperloglog_merge(<HLL1>, <HLL2>): Returns the HLL resulting from conjunction (disjunction) of the source HLLs.
gp_hyperloglog_add_item_agg_default(<HLL>, <value>): Checks if the value belongs to the set whose HLL was provided in the parameters:
- If the value belongs to the set or is a null value, the function returns the HLL of the original set.
- If the value does not belong to the set, the function returns the updated HLL with the specified value added to the set.

Getting the HLL of a column

Run this request:

SELECT gp_hyperloglog_accum(<column_name>) FROM <table_name>;

The output will contain a string.

Getting an estimate of the number of unique elements in a column

Run this request:

SELECT gp_hyperloglog_get_estimate(gp_hyperloglog_accum(<column_name>)) FROM <table_name>;

The output will contain a real number.

Getting an estimate of the number of unique elements in a conjunction of two columns

Run this request:

SELECT gp_hyperloglog_get_estimate(
    gp_hyperloglog_merge(gp_hyperloglog_accum(<column_1>), gp_hyperloglog_accum(<column_2>)))
FROM <table_names>;

The output will contain a real number.

Including a new element in an existing estimate

Run this request:

SELECT gp_hyperloglog_get_estimate(
    gp_hyperloglog_add_item_agg_default(gp_hyperloglog_accum(<column_name>), <new_element_value>))
FROM <table_name>;

If the new element is not unique for this column or has a null value, the returned estimate will be the same as if you requested the number of unique elements in the column. Otherwise, the estimate will be incremented by approximately 1.

Creating and using a pre-aggregation table

Pre-aggregation allows you to estimate the number of unique values in a multidimensional dataset in advance. This way, you can retrieve the estimate for any dimension whenever you need. Furthermore, it eliminates the need to calculate the HLL for each query, providing an estimate in constant time, O(1).

To use this method:

Create a pre-aggregation table and populate it with data:
```
CREATE TABLE agg_table(column1, column2, ..., estimate gp_hyperloglog_estimator);

INSERT INTO agg_table(column1, column2, ..., estimate)
SELECT column1, column2, ..., gp_hyperloglog_accum(estimated_col)
FROM original_table
GROUP BY column1, column2, ...;
```
Where:
- agg_table: Name of the pre-aggregation table.
- column1, column2, ...,: Names of columns in the original table, other than the column you do the estimate for.
- estimate: Name of the column with the estimate in the pre-aggregation table.
- estimated_col: Name of the estimated column in the original table.
- original_table: Name of the original table.

Create a new aggregate function based on gp_hyperloglog_merge:

CREATE AGGREGATE gp_hyperloglog_merge(gp_hyperloglog_estimator) (
SFUNC=gp_hyperloglog_merge,
STYPE=gp_hyperloglog_estimator,
FINALFUNC=gp_hyperloglog_comp,
COMBINEFUNC=gp_hyperloglog_merge
);

Get an estimate of the number of unique elements for a particular column, e.g., column1:

SELECT column1, gp_hyperloglog_get_estimate(gp_hyperloglog_merge(estimate))
FROM agg_table
GROUP BY column1;

Examples

The following examples use tables populated with integers using the generate_series function.

Estimating the number of unique elements

Let's assume we have a table with a list of customers in different cities:

CREATE TABLE clients(city_id int, client_id int);
INSERT INTO clients (city_id, client_id)
VALUES (generate_series(1, 10), generate_series(1, 10000));

To estimate how many unique customers there are in each city, run the following query:

SELECT city_id, gp_hyperloglog_get_estimate(gp_hyperloglog_accum(client_id))
FROM clients
GROUP BY city_id
ORDER BY city_id;

 city_id | gp_hyperloglog_get_estimate
---------+-----------------------------
       1 |            988.212360350532
       2 |             1002.0269799313
       3 |            1006.27966975993
       4 |            998.837197408661
       5 |             1004.1525443185
       6 |            995.648631718888
       7 |            1003.08887802751
       8 |            997.775023703087
       9 |            1005.21614155965
      10 |             1002.0269799313
(10 rows)

Estimating the number of unique elements in a conjunction of columns

Let's assume we have two tables:

CREATE TABLE viewers(index_id int, viewer_id int);
INSERT INTO viewers (index_id, viewer_id)
VALUES (generate_series(1, 1000), generate_series(1, 1000));

CREATE TABLE purchasers(index_id int, purchaser_id int);
INSERT INTO purchasers (index_id, purchaser_id)
VALUES (generate_series(1, 1000), generate_series(501, 1500));

In this example, viewer_id is the ID of the product website page visitor, and purchaser_id is the ID of the product buyer. Both of these values are customer IDs, i.e., they belong to the same set.

To estimate how many customers have bought the product after visiting the website, i.e., the number of non-unique elements in the conjunction of columns:

Estimate the number of unique entries in the conjunction of the viewer_id and purchaser_id columns:

SELECT gp_hyperloglog_get_estimate(
    gp_hyperloglog_merge(gp_hyperloglog_accum(viewer_id), gp_hyperloglog_accum(purchaser_id)))
FROM viewers, purchasers;

gp_hyperloglog_get_estimate
-----------------------------
            1498.48297183718
(1 row)

Subtract the resulting estimate from the total number of elements in both columns: 2000 − 1498 = 502.

Creating a pre-aggregation table

Let's assume we have a table with a list of customers in different cities and stores:

CREATE TABLE clients(city_id int, shop_id int, client_id int);
INSERT INTO clients (city_id, shop_id, client_id)
VALUES (generate_series(1, 5), generate_series(1, 10), generate_series(1, 10000));

To get a quick estimate of the number of unique customers by city or store:

Create a pre-aggregation table and populate it with data:

CREATE TABLE agg_clients(city_id int, shop_id int, unique_clients gp_hyperloglog_estimator);

INSERT INTO agg_clients(city_id, shop_id, unique_clients)
SELECT city_id, shop_id, gp_hyperloglog_accum(client_id)
FROM clients
GROUP BY city_id, shop_id;

Create an aggregate function based on gp_hyperloglog_merge:

CREATE AGGREGATE gp_hyperloglog_merge(gp_hyperloglog_estimator) (
SFUNC=gp_hyperloglog_merge,
STYPE=gp_hyperloglog_estimator,
FINALFUNC=gp_hyperloglog_comp,
COMBINEFUNC=gp_hyperloglog_merge
);

Get an estimate of the number of unique customers:

By city:

SELECT city_id, gp_hyperloglog_get_estimate(gp_hyperloglog_merge(unique_clients))
FROM agg_clients
GROUP BY city_id
ORDER BY city_id;

city_id | gp_hyperloglog_get_estimate 
--------+-----------------------------
      1 |            1984.47308188562
      2 |            2002.54325709491
      3 |            1993.50633730345
      4 |            2009.32588394612
      5 |            2012.71700838494
(5 rows)

By store:

SELECT shop_id, gp_hyperloglog_get_estimate(gp_hyperloglog_merge(unique_clients))
FROM agg_clients
GROUP BY shop_id
ORDER BY shop_id;

shop_id | gp_hyperloglog_get_estimate 
--------+-----------------------------
      1 |            988.212360350532
      2 |             1002.0269799313
      3 |            1006.27966975993
      4 |            998.837197408661
      5 |             1004.1525443185
      6 |            995.648631718888
      7 |            1003.08887802751
      8 |            997.775023703087
      9 |            1005.21614155965
     10 |             1002.0269799313
(10 rows)

Estimating the number of unique elements using the HyperLogLog algorithm

Getting the HLL of a columnGetting the HLL of a column

Getting an estimate of the number of unique elements in a columnGetting an estimate of the number of unique elements in a column

Getting an estimate of the number of unique elements in a conjunction of two columnsGetting an estimate of the number of unique elements in a conjunction of two columns

Including a new element in an existing estimateIncluding a new element in an existing estimate

Creating and using a pre-aggregation tableCreating and using a pre-aggregation table

ExamplesExamples

Estimating the number of unique elementsEstimating the number of unique elements

Estimating the number of unique elements in a conjunction of columnsEstimating the number of unique elements in a conjunction of columns

Creating a pre-aggregation tableCreating a pre-aggregation table

Was the article helpful?

Getting the HLL of a column

Getting an estimate of the number of unique elements in a column

Getting an estimate of the number of unique elements in a conjunction of two columns

Including a new element in an existing estimate

Creating and using a pre-aggregation table

Examples

Estimating the number of unique elements

Estimating the number of unique elements in a conjunction of columns

Creating a pre-aggregation table