Sharding in Managed Service for ClickHouse®
Sharding is a horizontal cluster scaling strategy that spreads parts of a single ClickHouse® database across different shards. A shard consists of one or more replica hosts. A write or read request for the shard can be sent to any of its replicas, as there is no dedicated master. For inserts, data is copied from the replica that ran the INSERT request to other replicas in the shard in asynchronous mode.
Advantages of sharding
Sharding is often used in the following cases:
- When you expect very frequent database queries and rapid data growth.
- When your application requires more and more resources, but a replicated cluster can no longer be scaled vertically, i.e., by increasing the computing power of the host resources, such as disks, RAM, and CPU.
With sharding, you can:
-
Overcome technical limitations.
If you need to work with large datasets, your data storage infrastructure may reach the maximum capacity of commercially available hardware, e.g., the disk subsystem will indicate low IOPS. If your application is running close to the hardware limits, sharding may help. In this case, data reads will be concurrent.
-
Increase availability.
Sharding allows you to isolate individual host or replica failures. Without sharding, if a single host or a few replicas fail, you may lose access to the entire dataset they contain. Conversely, if one shard out of five fails, 80% of the table data will still be available.
-
Improve query performance.
Queries compete for the computing resources of cluster hosts, which may reduce the query processing rate. This rate drop usually becomes obvious as the number of reads or CPU time per query increases. However, in a sharded cluster, queries to the same table can run in parallel without competing for shared resources, which reduces the query processing time.
Use of sharding
To distribute data across shards, create a distributed table on the Distributed engine
When running the INSERT query, ClickHouse® uses a sharding key to determine the target shard. Its value determines which shard will be queried. The sharding key is similar to the partitioning keyINSERT queries, the SELECT queries send subqueries to all shards in the cluster, regardless of how data is distributed across the shards.
ClickHouse® offers two different approaches to operating distributed tables with flexible data distribution in a cluster:
-
You can create a distributed table that uses all shards in a cluster (see the example here).
-
You can create a distributed table that uses a group of shards in a cluster (see the example here and also here for the advanced sharding example). Such a group includes only some shards in a cluster.
In this case, you can:
- Place data tables on the same cluster shards the distributed table uses.
- Place data tables in one shard group, and the distributed table that routes queries to these data tables, in another shard group.
For example, you can set up the following shard configuration within a single ClickHouse® cluster using these methods:
- Group
Acomprised of two shards withs2.smallhosts: Used as the main one for a distributed table with a low load. Data in the distributed table is stored in the same shard group. - Group
Bcomprised of two shards withs2.mediumhosts: Used as the main one for a distributed table with a constant high load. Data in the distributed table is stored in another group,C, comprised of five shards with high-performancem2.largehosts.
To learn more about using distributed tables, see this ClickHouse® guide
Sharding specifics in Managed Service for ClickHouse®
Managed Service for ClickHouse® manages shards as follows:
-
When you create a cluster, a shard named
shard1is automatically added to it. This shard includes all hosts in the cluster. When you create a multi-host cluster, replication support is automatically enabled. -
You can add as many shards as you need to the created cluster.
To make the most of sharding, your cluster should have at least two shards. Using a distributed table and a single shard is equivalent to using replication without sharding. It only protects you against data loss when one or multiple shard replica hosts fail; however, it does not provide distributed data storage.
-
You can add hosts to a shard.
Multi-host shards require enabled replication, thus:
- In clusters with a multi-host shard, ClickHouse® Keeper or ZooKeeper replication is already enabled, which means you can immediately add hosts to the shard.
- In clusters with single-host shards, you should first turn on the ClickHouse® Keeper or ZooKeeper coordination service and only then proceed to add hosts to the shard.
For more information about replication, ClickHouse® Keeper, and ZooKeeper, see this guide.
-
Shard IDs in Yandex Cloud are different from their IDs in ClickHouse®:
- Managed Service for ClickHouse® clusters use shard names.
- ClickHouse® uses numeric IDs that match the alphabetic order of shard names in Managed Service for ClickHouse®, e.g.,
A-shard,B-shard,shard10,shard100.
Keep this in mind if your application accesses shards by their IDs when writing and reading distributed table data.
To learn how IDs in Yandex Cloud and ClickHouse® match, run this query:
SELECT substitution AS shard_name, shardNum() AS shard_number FROM cluster('{cluster}', system.macros) WHERE macro = 'shard' ORDER BY shard_nameFor example, in a cluster with the
shard1,shard2,shard3,shard4, andshard100shards, the matching will look as follows:┌─shard_name─┬─shard_number─┐ │ shard1 │ 1 │ └────────────┴──────────────┘ ┌─shard_name─┬─shard_number─┐ │ shard100 │ 2 │ └────────────┴──────────────┘ ┌─shard_name─┬─shard_number─┐ │ shard2 │ 3 │ └────────────┴──────────────┘ ┌─shard_name─┬─shard_number─┐ │ shard3 │ 4 │ └────────────┴──────────────┘ ┌─shard_name─┬─shard_number─┐ │ shard4 │ 5 │ └────────────┴──────────────┘
Use cases
ClickHouse® is a registered trademark of ClickHouse, Inc