Performance Theory 101 for insert throughput scaling with CrateDB

  Monday, January 28, 2019

This is a short introduction to how a bit of performance theory can be applied when tuning insert throughput in CrateDB.

Two axes

The first thing to understand is that throughput is made up of two axes: latency and concurrency.

Latency

Latency is the amount of time a single operation requires from start to finish. For an insert statement that is the time it takes to parse, analyze, plan, and index a document (CPU) and to write it (disk IO). In a cluster, documents are also routed to the right nodes and replicated, so network IO is involved as well.

Good practice is to measure latency both on the server side and on the client side. If you only observe it on the server side you may get an overly optimistic picture: the server can only start measuring once it begins processing a request, so server-side numbers are blind to the time it takes for the server to accept and start processing requests.
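As an illustration, the following Python sketch measures the client-side round trip of a single bulk insert against CrateDB's HTTP endpoint and compares it with the duration reported by the server. The host, port, table name, and columns are assumptions made up for this example.

    import json
    import time
    import urllib.request

    # Assumptions for this sketch: CrateDB listens on localhost:4200 and a table
    # "sensor_readings" (id, ts, value) already exists.
    CRATE_URL = "http://localhost:4200/_sql"

    def bulk_insert(rows):
        payload = json.dumps({
            "stmt": "INSERT INTO sensor_readings (id, ts, value) VALUES (?, ?, ?)",
            "bulk_args": rows,
        }).encode("utf-8")
        req = urllib.request.Request(
            CRATE_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        started = time.perf_counter()
        with urllib.request.urlopen(req) as resp:
            body = json.loads(resp.read())
        round_trip_ms = (time.perf_counter() - started) * 1000
        # The server-reported "duration" only covers processing on the server,
        # so it is usually smaller than the round trip measured on the client.
        print(f"client: {round_trip_ms:.1f}ms, server: {body.get('duration')}ms")

    bulk_insert([[i, 1548633600000 + i, i * 0.1] for i in range(1000)])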

As an example: if a single bulk insert into CrateDB takes about 50ms, you'll max out at 20 bulk inserts per second with a single sequential ingest process.

Insert 1 |-----|
Insert 2        |-----|
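In other words, with a single sequential ingest process the throughput ceiling is simply the inverse of the per-operation latency:

    latency_s = 0.050             # measured latency of one bulk insert (50ms)
    max_ops_per_s = 1 / latency_s
    print(max_ops_per_s)          # 20.0 bulk inserts per second, no matter how fast the client loops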

To improve this you can try to decrease the latency of a single operation:

  • Upgrade the CPU (single thread performance)
  • Upgrade to a faster disk
  • Try to decrease network latency

Once you’re done optimizing these, there is the second axis. In the case of CrateDB it is usually the more important one for achieving high throughput.

Concurrency

Executing operations concurrently means that you continue starting operations without waiting for the prior operations to complete.

So a client might invoke two inserts concurrently, which the CrateDB cluster will process at roughly the same time:

Insert 1 |-----|
Insert 2 |-----|

The inserts still take 50ms each to complete, but you’re able to achieve 40 bulk inserts per second instead of the former 20.
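A simple way to get this kind of overlap from a single client process is a thread pool. The sketch below issues bulk inserts concurrently, reusing the same hypothetical HTTP endpoint and table as in the earlier example.

    import json
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    CRATE_URL = "http://localhost:4200/_sql"  # assumption, as in the earlier sketch

    def bulk_insert(rows):
        payload = json.dumps({
            "stmt": "INSERT INTO sensor_readings (id, ts, value) VALUES (?, ?, ?)",
            "bulk_args": rows,
        }).encode("utf-8")
        req = urllib.request.Request(
            CRATE_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()

    batches = [
        [[b * 1000 + i, 1548633600000 + i, i * 0.1] for i in range(1000)]
        for b in range(100)
    ]

    started = time.perf_counter()
    # With 2 workers, two inserts are in flight at any time; each still takes
    # ~50ms, but the achieved rate roughly doubles compared to a serial loop.
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(bulk_insert, batches))
    elapsed = time.perf_counter() - started
    print(f"{len(batches) / elapsed:.1f} bulk inserts per second")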

There is of course a catch to this: Contention. That is where the Universal Law of Computational Scalability comes into play.

Some of the resources required to process inserts are shared. On a single CrateDB node the most obvious is the disk.

CPU resources also become contended once all cores are utilized, and there is some scheduling overhead. In addition, there is contention due to per-node bookkeeping, and within a single shard there is further contention due to synchronization.

In CrateDB all of these contention points can be avoided (in the case of inserts) by scaling out. Once you add more nodes you have more individual disks and more CPUs, with no contention between them. You can also configure tables to have several shards, so there is no synchronization contention between the shards either.
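For example, the number of shards is fixed when a table is created. A sketch of such a DDL statement, again using the made-up table from the earlier examples (verify the exact type names against your CrateDB version):

    import json
    import urllib.request

    # Assumption: the same local CrateDB HTTP endpoint as in the earlier sketches.
    stmt = """
    CREATE TABLE sensor_readings (
        id INTEGER,
        ts TIMESTAMP,
        value DOUBLE
    ) CLUSTERED INTO 6 SHARDS
    WITH (number_of_replicas = 0)
    """
    payload = json.dumps({"stmt": stmt}).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:4200/_sql", data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()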

As a result, you should see a linear increase in throughput as you add more nodes.

The exceptions to this are inserts that cause schema updates, or inserts into partitioned tables that create new partitions. These go through the active master node and are blocked by cluster state updates.

In addition, if you go from a single node to 3 nodes you won’t see a linear increase, as there will be additional network IO involved (and you might start using replicas, which increases latency further).

The important thing while optimizing for high throughput is to not overload any single node with too many concurrent requests.

Queuing theory

If the concurrency is too high, both the latency of individual requests and the overall throughput will get worse. This is where queuing theory starts to come into play.

If all resources are used up, new requests have to be queued and wait for resources to become available again. This means the individual request latency goes up. If the system is too overloaded the throughput will actually start decreasing.

The sweet spot is said to be around 70% system utilization.
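As a rough, textbook-style illustration (a simple M/M/1 approximation, not a model of CrateDB internals), the time a request spends in the system grows steeply as utilization approaches 100%:

    # Simple M/M/1 approximation: response time inflates by 1 / (1 - utilization).
    service_time_ms = 50.0  # latency of one bulk insert on an idle system (assumption)

    for utilization in (0.5, 0.7, 0.9, 0.95, 0.99):
        expected_response_ms = service_time_ms / (1 - utilization)
        print(f"{utilization:.0%} utilized -> ~{expected_response_ms:.0f}ms per request")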

In CrateDB there are a couple of metrics that are important to monitor to make sure that the utilization is not too high, or, put the other way around, to judge whether it is safe to scale up the number of clients.

The thread pool queue sizes are exposed via JMX and in the sys.nodes table. If they start filling up, you know that you’re overloading your cluster.
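For instance, the per-pool queue and rejection counters can be pulled from sys.nodes with a query along these lines (the layout of the thread_pools column is an assumption based on the documented schema; verify it against your CrateDB version):

    import json
    import urllib.request

    # Assumption: local CrateDB HTTP endpoint; thread_pools is an array of objects
    # with "name", "active", "queue" and "rejected" counters per pool.
    stmt = """
    SELECT name,
           thread_pools['name'] AS pools,
           thread_pools['queue'] AS queued,
           thread_pools['rejected'] AS rejected
    FROM sys.nodes
    """
    payload = json.dumps({"stmt": stmt}).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:4200/_sql", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for row in json.loads(resp.read())["rows"]:
            print(row)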

The operation latencies are also exposed via JMX and via the sys.jobs_metrics table. If they go up, that is another sign that you’re overloading your cluster.
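Similarly, per-statement-type latency metrics can be read from sys.jobs_metrics. The exact columns vary between CrateDB versions, so treat the query below as a sketch:

    import json
    import urllib.request

    # Assumptions: local CrateDB HTTP endpoint and the sys.jobs_metrics columns
    # (node, classification, total_count, sum_of_durations) as documented;
    # collecting these metrics requires the stats collector to be enabled.
    stmt = """
    SELECT node['name'] AS node,
           classification['type'] AS statement_type,
           total_count,
           sum_of_durations
    FROM sys.jobs_metrics
    ORDER BY sum_of_durations DESC
    """
    payload = json.dumps({"stmt": stmt}).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:4200/_sql", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for row in json.loads(resp.read())["rows"]:
            print(row)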

(But as mentioned before, it is best to also monitor the round-trip time latencies on the client for a full picture)

Wrap up

Throughput depends on both latency and concurrency. You can measure the latency on a mostly idle cluster to get a baseline from which you can then calculate the required concurrency to reach a desired throughput.

For example, given a latency of 0.5ms, a single sequential process can achieve (1s / 0.5ms) 2000 requests per second. If your target is 10000 requests per second, you’d need at least 5 concurrent ingest processes.
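The same back-of-the-envelope calculation in code, using the numbers from the example:

    import math

    latency_s = 0.0005        # 0.5ms baseline latency per request
    target_per_s = 10_000     # desired throughput

    per_process = 1 / latency_s                      # 2000 requests/s for one sequential process
    required = math.ceil(target_per_s / per_process)
    print(per_process, required)                     # 2000.0 -> at least 5 concurrent processes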

To gauge whether the CrateDB cluster can handle the number of operations, it is necessary to have proper monitoring in place. Basic system metrics, JVM metrics, the thread pool queue sizes, and the query latencies from both the server and the client side should be measured to get a full picture.

If you observe that the latencies start to go up, you know that the system has started queuing requests, in which case increasing the concurrency further will be detrimental to the system’s overall performance.