A model for choosing the number of shards and replicas for optimal efficiency

2020/08

In the design of large scale systems, hardware cost plays an important role. Especially in the era of big data and an ever-increasing amount of data to process. One facet of scaling is scaling out (also known as horizontal scaling) which means distributing data on multiple machines to improve performance and reliability.

👯‍♂️ Replication vs Sharding

There are two common ways to distribute data across multiple nodes, replication and sharding.1

Replication is the concept of keeping the same data on multiple servers (each copy is of the data is known as replica).

There are two primary reasons for replicating data. One is to increase the reliability of a system, the other is to improve performance.2

  • Reliability increases with the number of copies as the chance of losing data decreases as the number of copies increases.
  • Performance is also proportional to the number of servers as using more servers usually means more processing capacity.

Sharding (also known as partitioning) is a way of dividing a large dataset into smaller ones (shards) so that it could be split between multiple machines. There are multiple reasons to do that, but we’ll focus here only on performance reasons.

replication and sharding

💵 Efficiency

For this article, I’ll define efficiency as a measure inverse proportional to the cost of the system (number of servers NN) that can satisfy some pre-defined reliability and performance requirements (Σ\Sigma).

efficiency∝ΣN\text{efficiency} \propto \frac{\Sigma}{N}

I have defined two dimensions for vertical scaling. Also, claiming that both may improve performance.

❓ Problem

I want to focus on the following problem:

How to choose the number of shards, and the number of replicas such that efficiency is optimal (satisfy performance requirements with the lowest number of servers)?

When a share-nothing system is close to its performance capacity I’ve seen the following approach to be used in practice:

  • if better write performance is needed → increase the number of shards
  • if better read performance is needed → increase the replication factor (number of replicas)

I want to argue that the choice isn’t necessarily binary, and propose a model for reasoning about this problem.

🛀 Assumptions

The goal of this post is to explore a way of thinking about this problem rather than explore the full problem space of horizontal scaling. To keep it simple we’ll model an abstract system with the following properties:

  • shared-nothing architecture; ignores complexities related to cross-shard communication
  • asynchronous circular replication; ignores replica synchronization cost on writes
  • reads and writes have the same cost, and it is constant
  • physical replication and sharding is considered (server level/no virtual sharding)

These properties aren’t necessarily of a fictional system. In-memory key-value databases come quite close to this description.

See 3 for a list of complexities that arise in a shared-nothing architecture.

🖋 Notations

  • NN number of servers
  • Σw\Sigma_w write throughput requirements
  • Σr\Sigma_r read throughput requirements
  • CC server processing capacity / operations per second it can handle
  • C=Cw+CrC=C_w + C_r, where CwC_w write capacity and CrC_r read capacity
  • RF\text{RF} replication factor (number of copies)
  • FF number of acceptable replica failures
  • S=N/RFS=N/\text{RF} number of shards

🖼 A visual example that motivated this article

Let’s imagine an initial system that is consists of 2 shards with a replication factor of 3.

system with 2 shards and replication factor of 3

A server is designed to handle 11 operations per second (CC), which are divided into 9 writes per second and 2 reads per second.

The total write capacity of the system is the number of writes an instance can handle multiplied by the number of shards. Replicas have to repeat the same write operations, they don’t contribute in any way to the total write capacity (hachured background). In other words, this is the sum of the heights of all red shaded rectangles.

Σw=S∗Cw=2∗9=18\Sigma_w = S * C_w = 2 * 9 = 18

Total read capacity is given by the number of replicas multiplied by the number of reads a single instance handles. Sum of all green shaded rectangles heights.

Σr=(S∗RF)∗Cr=2∗3∗2=12\Sigma_r = (S * \text{RF}) * C_r = 2 * 3 * 2 = 12

Imagine we have the following requirement now: the system should handle twice as many reads (Σr=24\Sigma_r = 24). Probably doubling the number of replicas RF\text{RF} comes first to the mind.

system with 2 shards and replication factor of 6

What happened here is we double the sum of green shaded rectangles heights. Indeed, it gives us twice the read capacity.

Σr=(S∗RF)∗Cr=(2∗6)∗2=24\Sigma_r = (S * \text{RF}) * C_r = (2 * 6) * 2 = 24

By doubling the number of replicas we also doubled the amount of servers that are used.

N=S∗RF=2∗6=12N = S * \text{RF} = 2 * 6 = 12

Giving a total of 12 instances.

But are there other options? More efficient maybe? Compare the following drawing to the previous ones.

system with 3 shards and replication factor of 3

Instead of doubling the number of replicas, I added a new shard instead.

Let’s evaluate how this system would perform.

This deployment still has to handle the same amount of writes as the initial one, but now the writes are spread onto 3 shards instead of 2. This means that each shard has to do only 6 writes per second.

Σw=(S∗Cw)=3∗6=18\Sigma_w = (S * C_w) = 3 * 6 = 18

A server was handling 11 operations per second before. If now each instance processes only 6 writes per second it means that there is room for 5 reads per second per instance instead of just 2.

Σr=(S∗RF)∗Cr=(3∗3)∗5=45\Sigma_r = (S * \text{RF}) * C_r = (3 * 3) * 5 = 45

By adding a shard instead of doubling the number of replicas not only the system can handle more load it is also more efficient!

N=S∗RF=3∗3=9N = S * \text{RF} = 3 *3 = 9

Only 9 servers are needed in this case. More performance for less money. 🧙‍♂️

Note: This result doesn’t mean that always adding more shards is the correct solution, I’ll show why later.

🔮 A shot at formalizing a model

Hopefully, the above example convinced you that it is worth thinking about this problem and there is value to be extracted if it is approached correctly.

Let’s return to the initial problem and formalize it a bit.

How to choose the number of shards, and the number of replicas such that efficiency is optimal?

We want to optimize for smallest NN such that performance requirements are satisfied.

The maximum write throughput the system can handle is number shards multiplied by shard capacity. Since the number of replicas doesn’t influence write capacity shard capacity becomes processing server capacity.

Σw=S∗Cw\Sigma_w = S * C_w

Read throughput is limited by the total number of replicas (we ignore failure tolerance (FF) for a moment).

Σr=(S∗RF)∗Cr\Sigma_r = (S * \text{RF}) * C_r

Let’s recall some previous notations (S=N/RFS=N/\text{RF}, C=Cr+CwC = C_r + C_w) and do some substitutions.

ΣwS+ΣrS∗RF=C\frac{\Sigma_w}{S} + \frac{\Sigma_r}{S*\text{RF}} = CΣwNRF+ΣrNRF∗RF=C\frac{\Sigma_w}{\frac{N}{\text{RF}}} + \frac{\Sigma_r}{\frac{N}{\text{RF}}*\text{RF}} = CN=⌈Σw∗RF+ΣrC⌉N = \Big\lceil\frac{\Sigma_w*\text{RF} + \Sigma_r}{C}\Big\rceil

Note that RF\text{RF} appears only in the numerator. This implies a linear relation to N. We would never want to increase it.

If failure tolerance is to be ignored, then RF=1 is what we want. It makes sense intuitively as well. Replicas don’t contribute to write capacity in any way. And replicas read capacity is limited by node capacity, minus the capacity used for writes. Adding shards, however, contributes to both. Write capacity is spread between more nodes, this frees up capacity from every single node that can be used to process reads.

Let’s introduce failure tolerance into the model. FF will denote the number of replicas failures we want to tolerate.

Σr=N∗(RF−F)RF∗Cr\Sigma_r = \frac{N * (\text{RF} - F)}{\text{RF}} * C_rN=⌈Σw∗RFC+Σr∗RF(RF−F)∗C⌉N = \Big\lceil \frac{\Sigma_w*\text{RF}}{C} + \frac{\Sigma_r * \text{RF}}{(\text{RF}-F)*C} \Big\rceil

When F>1F \gt 1, the RF\text{RF} at which the minimum NN is achieved varies depending on the mix of the write and read workload, and the desired FF!

🧠 Conclusion

All models are wrong, but some are useful. –George E. P. Box

Real-world systems are usually much more complicated compared to what I have tried to model here.3 However, I believe this simple model to be a strong pillar for reasoning about scaling shared-nothing systems when efficiency is a goal.

I wasn’t able to find a lot of related work, if you have any relevant references please send them @nvartolomei.

References:


  1. M. Kleppmann. Designing Data-Intensive Applications. O’Reilly Media, 2017.↩
  2. M. van Steen and A.S. Tanenbaum, Distributed Systems, 3rd ed., distributed-systems.net, 2017.↩
  3. Michael Stonebraker: ”The Case for Shared Nothing” IEEE Database Engineering Bulletin, volume 9, number 1, pages 4–9, March 1986.↩