Sharding in Managed Service for ClickHouse®
Sharding is a horizontal cluster scaling strategy that puts parts of a single ClickHouse® database on different shards. A shard consists of one or more replica hosts. A write or read request for the shard can be sent to any of its replicas, as there is no dedicated master. When data is inserted, it is taken from the replica on which the `INSERT` request was executed and copied to the other replicas in the shard asynchronously.
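For context, each shard typically stores its slice of the data in a replicated table that exists on every shard. Below is a minimal sketch with a hypothetical `hits` table; `{shard}` and `{replica}` are standard ClickHouse® macros that each host fills in with its own values:

```sql
-- A sketch, not a prescribed schema: a shard-local table replicated
-- across the hosts of its shard. INSERTs into it are copied to the
-- other replicas of the same shard asynchronously.
CREATE TABLE hits ON CLUSTER '{cluster}'
(
    user_id    UInt64,
    event_time DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits', '{replica}')
ORDER BY (user_id, event_time);
```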
Advantages of sharding
Sharding is often used in the following cases:
- When you expect very frequent database queries and rapid data growth.
- When your application requires more and more resources, but a replicated cluster can no longer be scaled vertically, i.e., by increasing the computing power of its hosts, such as disks, RAM, and CPU.
Sharding can help you:
- Overcome technical limitations. If you need to work with large datasets, your data storage needs may push commercially available hardware to its limits (for example, the disk subsystem may show poor IOPS figures). If the application is running at hardware performance limits, it may be a good idea to distribute data across shards; read operations will then be performed concurrently.
- Improve fault tolerance. Sharding allows you to isolate the failures of individual hosts or replicas. Without sharding, when one host or a few replicas fail, you lose access to the entire dataset they contain. Conversely, if one shard out of five fails, 80% of the table data is still available.
- Improve query performance. Queries compete with each other for the computing resources of cluster hosts, which can reduce the rate of query processing. This drop usually becomes apparent as the number of read operations or the CPU time per query grows. In a sharded cluster, where queries to the same table can run in parallel, there is no such competition for shared resources, which reduces query processing time.
Use of sharding
To distribute data across shards, create a distributed table based on the Distributed engine. When running an `INSERT` query, ClickHouse® uses the sharding key to determine where the data should be placed: its value determines which shard the query will target. The sharding key is similar to the partitioning key. Unlike `INSERT` queries, `SELECT` queries send subqueries to all shards in the cluster, regardless of how the data is distributed across them.
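As a minimal sketch (building on the hypothetical `hits` table above), a distributed table that shards rows by a hash of `user_id` could look like this; the table and column names are assumptions, not part of the service setup:

```sql
-- Route INSERTs through the distributed table: cityHash64(user_id)
-- picks the target shard for every row. SELECTs fan out to all shards.
CREATE TABLE hits_distributed ON CLUSTER '{cluster}' AS hits
ENGINE = Distributed('{cluster}', default, hits, cityHash64(user_id));

INSERT INTO hits_distributed VALUES (42, now());
SELECT count() FROM hits_distributed;
```

Hashing a high-cardinality column spreads rows roughly evenly across shards while keeping all rows with the same `user_id` on the same shard.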
ClickHouse® offers two different approaches to operating distributed tables with flexible data distribution in a cluster:
- You can create a distributed table that uses all shards in a cluster (see the example here).
- You can create a distributed table that uses a group of shards in a cluster (see the regular sharding example here and the advanced sharding example here). Such a group includes only some of the shards in a cluster. In this case, you can:
  - Place the tables with data on the same cluster shards that are used by the distributed table.
  - Place the tables with data in one shard group, and the distributed table that routes queries to them in another shard group.
For example, you can set up the following shard configuration within a single ClickHouse® cluster using these methods:
- Group `A`, comprised of two shards with `s2.small` hosts: used as the main group for a distributed table with a low load. The data of this distributed table is stored in the same shard group.
- Group `B`, comprised of two shards with `s2.medium` hosts: used as the main group for a distributed table with a constant high load. The data of this distributed table is stored in another group, `C`, comprised of five shards with high-performance `m2.large` hosts.
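A hedged sketch of the group `B`/`C` layout above, assuming (as in the sharding examples referenced earlier) that a shard group name can be used in place of the cluster name; the group names `sgroup_b` and `sgroup_c` and all table and column names are hypothetical:

```sql
-- The data tables live on the shards of group C ('sgroup_c'); the
-- distributed table is created on the hosts of group B ('sgroup_b')
-- and routes queries to the hits tables in group C.
CREATE TABLE hits_distributed ON CLUSTER 'sgroup_b'
(
    user_id    UInt64,
    event_time DateTime
)
ENGINE = Distributed('sgroup_c', default, hits, cityHash64(user_id));
```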
To learn more about working with distributed tables, see the ClickHouse® documentation.
Sharding management in Managed Service for ClickHouse®
Managed Service for ClickHouse® manages shards as follows:
- When you create a cluster, a shard named `shard1` is automatically added to it. This shard includes all hosts in the cluster. When you create a cluster with multiple hosts, replication support is automatically enabled.
- You can add the appropriate number of shards to the created cluster. To fully benefit from all advantages of sharding, your cluster should have at least two shards. Using a distributed table with a single shard is equivalent to using replication without sharding: it only protects against data loss when one or more shard replica hosts fail, but it does not provide distributed data storage. (You can inspect the resulting shard layout with the `system.clusters` query shown after this list.)
- You can add hosts to a shard. Shards with multiple hosts require replication to be running. Therefore:
  - Clusters with a multi-host shard already have ClickHouse® Keeper or ZooKeeper replication running, which means you can immediately add hosts to the shard.
  - In clusters with single-host shards, you first need to enable fault tolerance using ZooKeeper and only then add hosts to the shard.

  For more information about replication, ClickHouse® Keeper, and ZooKeeper, see Replication.
- Shard IDs in Yandex Cloud are different from their IDs in ClickHouse®:
  - Managed Service for ClickHouse® clusters use shard names.
  - ClickHouse® uses numeric IDs that follow the alphabetical (lexicographic) order of shard names in Managed Service for ClickHouse®, e.g., `A-shard`, `B-shard`, `shard10`, `shard100`.

  Keep this in mind if your application accesses shards by their IDs when writing and reading distributed table data.
To learn how IDs in Yandex Cloud and ClickHouse® match, run this query:

```sql
SELECT substitution AS shard_name, shardNum() AS shard_number
FROM cluster('{cluster}', system.macros)
WHERE macro = 'shard'
ORDER BY shard_name;
```
For example, in a cluster with the `shard1`, `shard2`, `shard3`, `shard4`, and `shard100` shards, the matching will look as follows:

```text
┌─shard_name─┬─shard_number─┐
│ shard1     │            1 │
└────────────┴──────────────┘
┌─shard_name─┬─shard_number─┐
│ shard100   │            2 │
└────────────┴──────────────┘
┌─shard_name─┬─shard_number─┐
│ shard2     │            3 │
└────────────┴──────────────┘
┌─shard_name─┬─shard_number─┐
│ shard3     │            4 │
└────────────┴──────────────┘
┌─shard_name─┬─shard_number─┐
│ shard4     │            5 │
└────────────┴──────────────┘
```
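As mentioned in the list above, you can also inspect a cluster's shard layout directly. This is a generic ClickHouse® query against the built-in `system.clusters` table, not a managed-service feature:

```sql
-- List every shard and replica the cluster topology exposes,
-- including the numeric shard IDs that ClickHouse® assigns.
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters;
```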
ClickHouse® is a registered trademark of ClickHouse, Inc.