JOIN in Yandex DataLens
You can join data via the dataset creation interface by dragging tables to the workspace and configuring links between them using the JOIN operator. For YTsaurus tables, a join like this causes an error due to the table storage structure and default query logic in YTsaurus. To avoid this error, use an SQL query to the source.
Storing table data
The table storage schema in YTsaurus fundamentally differs from that in ClickHouse®:
-
In YTsaurus, the basic data storage primitive is a static table with rows grouped into chunks that are randomly distributed across a cluster. By default, the data is distributed across shards based on a sharding key. Static tables are highly inefficient for point data reads.
Note
In ClickHouse®, sharding ensures locality of rows with a single sharding expression value per VM.
-
YTsaurus supports sorted tables: a table schema may factor in its row sorting by a column prefix. These columns are called key columns. This is a more efficient storage primitive: a dynamic table.
Note
In YTsaurus, sorting ensures locality of rows with a single key value per chunk (or in a set of successive chunks).
Simple SELECT query in CHYT
A SELECT query without a JOIN is basically a streaming query. It reads data from a single source and performs row processing. You can apply it to function and aggregation fields.
In CHYT, this query is implemented as follows:
- A clique instance gets a query. The instance is called the query coordinator, as it determines the subsequent execution plan.
- The coordinator sends an rewritten query to each instance over the internal TCP. In this query, the input table name is replaced with an encoded sequence of
chunk slice, wherechunk sliceis a certain range within a chunk. (The range is set using border row numbers or keys.) - Each instance runs the rewritten query and returns its result to the coordinator for its final aggregation (if required) and for sending a response to the user.
Types of queries with a JOIN in ClickHouse®
Same as a simple query, a SELECT query with a JOIN is sent to a clique instance called the query coordinator. The query coordinator determines the execution plan, while the load distribution between instances depends on how lhs JOIN rhs USING/ON is run:
-
In ClickHouse®, a distributed local JOIN is run by default. If tables are sharded in the same way, a pair of matching keys cannot end up on different VMs. Therefore,
lhsandrhsare interpreted on each instance as corresponding local tables. In this case, the query the coordinator rewrites into fragments can be run independently on each instance.Note
For this method, tables must have the same sharding schema.
-
A GLOBAL JOIN is executed if you use the
GLOBALkeyword withJOIN. The query coordinator fully executes and materializes the right-hand side,rhs. Next, its serialized representation is sent along with the query to the instances. They use this representation to retrieve the right-hand side in their memory. For more information, see this guide .Note
- This method is efficient if
rhsis relatively small and there are relatively few instances. Otherwise, you may lack resources when runningrhson the coordinator or when the coordinator sends a serialized representation with a subquery across the network. - This method does not require meeting any additional conditions for table data storage or sharding consistency.
- This method is efficient if
-
JOIN via subqueries: ClickHouse® allows enclosing
lhsand/orrhsin brackets, which significantly affects the execution plan:Enclosing `lhs` in bracketsEnclosing `rhs` in brackets- The coordinator executes the LHS independently.
- The RHS is executed independently.
- The RHS is placed in-memory in a hash table.
- A
JOINis only fully executed on the coordinator.
-
The coordinator executes the LHS independently as if the query is SELECT
lhs. -
The coordinator sends its queries to instances leaving
JOIN(rhs) as is. -
Each instance executes
rhsindependently.Note
This may significantly increase the load, since each instance will materialize the RHS independently. Therefore, the ClickHouse® security algorithm prohibits this behavior by default returning the
Double-distributed IN/JOIN subqueries is deniederror.
Features of queries with a JOIN in CHYT
A SELECT query with a JOIN in CHYT runs as a Sorted JOIN by default. It uses the same column sorting schema. The standard lhs JOIN rhs USING/ON statement is used.
In this case, additional restrictions are imposed on lhs and rhs:
- The
lhsandrhsmust be sorted tables. - The
USING/ONclause must only use sorted columns.
For example:
- Let's assume that
lhsis sorted by thel1, l2, ..., lncolumns, andrhs, by ther1, r2, ..., rmcolumns. - The
JOINONclause should look like a set ofl1 = r1 , ..., lk = rkequations for a certaink(the equations can be listed in any order). - This can be represented as a set of equations in the
ONclause and as a set of general key columns in theUSINGstatement, but not as a set of equations in theWHEREclause.
If these conditions are met, the query coordinator generates pairs of matching ranges from lhs and rhs and distributes them across the instances in subqueries.
Otherwise, an error is returned. In this case, you should either use GLOBAL JOIN or enclose rhs in a subquery.
ClickHouse® is a registered trademark of ClickHouse, Inc