Benchmarking Beyond Performance: A Case Study with Aerospike and Cassandra
Database benchmarking is a well-known method with a long track record for comparing the performance of two or more systems. Classically, performance in this context means either throughput, per-query latency, or a combination of both. Yet these are by far not the only metrics that matter in a system comparison. For practitioners such as Site Reliability Engineers, the important capabilities of a distributed database system are not limited to classical performance metrics, but also include the ability to scale out with increasing workload and the ability to recover quickly from node failures.
In this article, we revisit the conceptual and technical work benchANT has done to compare the performance of Aerospike and Cassandra under a recovery and a scale-out scenario. The technical details of the study, the configuration of the competitors, the methodology of the project, and the performance results are contained in a [whitepaper released by Aerospike](https://aerospike.com/resources/benchmarks/aerospike-vs-cassandra-benchmark/). While we briefly summarize the results and the characteristics of the individual systems here as well, this blog post focuses on detailing the need for evaluating the resilience and scale-out capabilities of distributed database systems.
Why Resilience Benchmarking?
Distributed databases rely on two mechanisms to handle distinct challenges: sharding and replication. (i) Sharding distributes data over multiple nodes (servers) so that each node on average hosts a fraction of the overall data set (1/n in the case of n nodes). This approach increases the physical capacities of the system, e.g., the total amount of storage, and, depending on the queries, may also increase the parallel workload (throughput) the system can handle. Sharding is a core mechanism for enabling the scalability of a distributed database system. Yet, sharding does not introduce any redundancy, so that in case of a node failure 1/n-th of the total data set will be temporarily unavailable or even permanently lost. As the probability that at least one out of n nodes fails grows with the number of nodes, large-scale systems are more likely to experience a failure than a single-node system.
(ii) Replication ensures that multiple copies of the same data item (row, cell, block, ...) are kept in the system, usually on different nodes, so that in case of a node failure the data item is not lost. In addition, replicas can be used to improve the read bandwidth of the system. On the downside, however, using replicas requires the system to keep them in sync and makes the distributed system subject to the trade-offs of the CAP theorem. Of course, replication does not prevent node failures; it only mitigates their impact, in the sense that data remains available rather than becoming temporarily unavailable or even permanently lost.
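To make the interplay of the two mechanisms concrete, the following minimal sketch (a hypothetical illustration, not taken from either database's codebase) maps each key to a primary node via hashing and places additional replicas on the following nodes in the ring, showing why a node failure leaves the data reachable as long as a replica survives.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]
REPLICATION_FACTOR = 2  # each data item is kept on 2 distinct nodes

def placement(key: str, nodes=NODES, rf=REPLICATION_FACTOR):
    """Map a key to its primary node plus rf-1 replica nodes (simplified hash placement)."""
    primary = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(rf)]

def readable_after_failure(key: str, failed_node: str) -> bool:
    """A key stays readable as long as at least one of its replica nodes survives."""
    return any(node != failed_node for node in placement(key))

# Without replication (rf=1), losing "node-2" would make roughly 1/4 of all keys
# unavailable; with rf=2, every key whose primary is "node-2" still has a copy
# on the next node in the ring.
print(placement("user:42"))
print(readable_after_failure("user:42", failed_node="node-2"))
```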
Resilience Metrics
Sharding only works reasonably well when the shards are distributed roughly equally over the available nodes. Adding a new node to a system (scaling out), but also removing a node (scaling in), requires redistributing the existing shards over the new set of nodes. During this redistribution phase, the cluster has to perform additional work, which reduces the resources available to process user requests. That is, even though a scale-out will eventually make more resources available to users, reaching this scaled-out state temporarily reduces performance. In consequence, scaling out a system too late (when its load is already too high) may lead to a collapse instead of better performance. Nevertheless, scaling out or scaling in happens in a controlled manner: some external entity, e.g., an administrator, decides when to scale, and heuristics in the system can decide how quickly to rearrange the shards over the nodes.
In the case of a node failure, such a controlled procedure is not possible. A node goes down and the system has to react immediately. Its heuristics can balance the risk of future data loss against a current unavailability, though. In any case, however, existing shards not only need to be rebalanced, but the replication degree of some shards also has to be re-established. Finally, due to the node failure, fewer compute resources are available, so the overall compute capabilities of the system are further reduced. Even worse, this may happen during a phase of high load.
To describe the hit a system takes during scale-out, scale-in, and node failure, a set of metrics is available. These are:
- Performance drop: this metric describes the decrease in throughput / increase in latency while the scale-out / scale-in / recovery process is ongoing. It compares, for instance, throughput and latency against the respective steady-state metrics as measured right before changing the system configuration (pre-incident performance).
- Post-incident performance: this metric captures the performance (throughput / latency) the system is able to achieve once the scale-out / scale-in / recovery has completed. It is usually compared to the pre-incident performance.
Our research paper *King Louie: Reproducible Availability Benchmarking of Cloud-hosted DBMS* discusses these aspects in much more detail.
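As a minimal illustration of how these two metrics can be derived from monitoring data, the sketch below (our own simplification, not the King Louie implementation) compares throughput samples taken before, during, and after an incident window.

```python
from statistics import mean

def resilience_metrics(throughput_ts, incident_start, incident_end):
    """Compute performance drop and post-incident performance from a throughput time series.

    throughput_ts: list of (timestamp_seconds, ops_per_second) samples
    incident_start/incident_end: boundaries of the scale-out / scale-in / recovery window
    """
    pre = [v for t, v in throughput_ts if t < incident_start]
    during = [v for t, v in throughput_ts if incident_start <= t <= incident_end]
    post = [v for t, v in throughput_ts if t > incident_end]

    pre_perf = mean(pre)  # steady state right before the configuration change
    return {
        "performance_drop_pct": 100 * (1 - mean(during) / pre_perf),
        "post_incident_ratio": mean(post) / pre_perf,  # > 1.0 after a successful scale-out
    }
```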
The benchANT Approach
benchANT runs a unique multi-cloud, multi-database, multi-workload database benchmarking framework. It is able to issue multiple configurable workloads against different (types of) database management systems, which may be operated either as Database as a Service (DBaaS) or on Infrastructure as a Service (IaaS). The framework handles all the necessary steps, from acquiring the required database instances from different cloud providers, over creating the required tables, to installing software and triggering the workload against the database. Beyond this, its monitoring observes the involved servers and databases and collects all metrics as time series. Last but not least, the framework's run-time collects the timestamps of phase change events. Such an event occurs, for instance, when a database instance has started, when the load phase has completed, or when the workload, and with it the benchmark, has completed.
To support scaling and resilience benchmarks, the benchANT framework can trigger a scale-out or scale-in at specific points in time, e.g., 120 minutes into a benchmark run, or once certain conditions are fulfilled, e.g., the database contains more than 100 million objects. The respective sequence of steps is summarized in Figure 1. The steps executed for emulating a resource failure are very similar, but not identical. The difference between a scale-in and a failure case is mostly that in the failure case the respective node is forcibly stopped without letting the database cluster know in advance. In contrast, a scale-in first triggers the database API and clears the respective database node before it is shut down.
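Such a scenario can be thought of as a small, declarative description of when to intervene and how. The snippet below is a hypothetical sketch of such descriptions; the field names are illustrative and not the actual benchANT API.

```python
# Hypothetical scenario descriptions (illustrative only, not the actual benchANT API).

scale_out_after_2h = {
    "action": "scale_out",          # add one node via the database's own API
    "trigger": {"type": "elapsed_time", "minutes": 120},
    "nodes_to_add": 1,
}

scale_out_on_data_volume = {
    "action": "scale_out",
    "trigger": {"type": "condition", "metric": "object_count", "threshold": 100_000_000},
    "nodes_to_add": 1,
}

node_failure = {
    "action": "kill_node",          # forcibly stop the VM without draining the node first
    "trigger": {"type": "operations_executed", "fraction": 0.2},
    "graceful": False,              # a scale-in would instead clear the node before shutdown
}
```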
While scaling out / in is supported for both DBaaS- and IaaS-based database deployments, IaaS-based deployments additionally allow triggering failure scenarios. Forcing node failures is more difficult for DBaaS offerings, as the DBaaS provider needs to support this capability. For instance, MongoDB Atlas offers limited capabilities to test primary fail-over.
In addition to the simple scenarios sketched above, the benchANT framework allows combining simple scenarios into more complex ones: for instance, it is possible to have a scale-out scenario follow a failure scenario in order to evaluate how well a cluster recovers once the original number of nodes has been restored.
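Continuing the hypothetical notation from above, composing scenarios then amounts to ordering the individual steps:

```python
# Hypothetical composite scenario (illustrative only): first inject a node failure,
# then later add a node back and observe how well the cluster recovers.
failure_then_recover = [
    {
        "action": "kill_node",
        "trigger": {"type": "operations_executed", "fraction": 0.2},
    },
    {
        "action": "scale_out",
        "trigger": {"type": "elapsed_time_after_previous", "minutes": 60},
        "nodes_to_add": 1,
    },
]
```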
Case Study: Aerospike v8 vs Cassandra v5
In a case study conducted for Aerospike Inc., benchANT compares Aerospike Enterprise 8.0.4 with Apache Cassandra 5.0.3 in three different scenarios. All of them use an extended YCSB workload with 50% reads (by primary key), 40% inserts, 6% updates, and 4% deletes. Further, all scenarios use a 4-node Aerospike cluster with a replication factor of 2 and a 6-node Cassandra cluster with a replication factor of 3. The replication factors conform to best practices for each respective database, and the differing cluster sizes are an immediate consequence of this. All evaluations are run on AWS EC2 using Virtual Machines of type i4g.4xlarge for the database cluster.
The baseline scenario aims at finding the maximum throughput for this workload that can be achieved under a fixed client-side P99 latency threshold for read operations (1ms for Aerospike, 10ms for Cassandra). For running the workload, each database is initialized with 100GB of data; then, 4.4B operations are executed so that at the end of the benchmark, the database contains around 3.4TB of data. It turns out that 200 parallel clients for Aerospike and 400 parallel clients for Cassandra maximize throughput while keeping the latencies below the requested thresholds. The experiments show that Aerospike achieves much higher throughput than Cassandra (480kOps/s vs 138kOps/s).
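Conceptually, finding this maximum is a sweep over the number of parallel clients: increase the client count as long as the measured client-side P99 read latency stays below the threshold, and keep the best run. The following is a simplified sketch of such a search loop (our own illustration, not the procedure prescribed in the whitepaper; `run_benchmark` is a placeholder for one full benchmark execution).

```python
def find_max_throughput(run_benchmark, client_counts, p99_read_limit_ms):
    """Sweep parallel client counts and keep the highest throughput whose P99 read
    latency stays below the limit. run_benchmark(clients) is assumed to return
    (throughput_ops_per_s, p99_read_latency_ms) for one full benchmark run."""
    best = None
    for clients in client_counts:
        throughput, p99_read = run_benchmark(clients)
        if p99_read <= p99_read_limit_ms and (best is None or throughput > best[1]):
            best = (clients, throughput)
    return best  # e.g. (200, 480_000) for Aerospike under a 1 ms P99 read limit

# Example usage (hypothetical): sweep 100..500 clients against a 10 ms limit for Cassandra.
# best_clients, best_throughput = find_max_throughput(run_cassandra, range(100, 501, 100), 10)
```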
The other two scenarios concern the system behaviour during (i) a scale-out (adding one node to the database cluster) and (ii) a node failure (forcibly removing one node from the cluster). For this evaluation we use a workload intensity of 25%-30% of the baseline workload per competitor. For both scenarios the failure / start-up event is triggered once 20% of the operations of the regular workload had been executed, i.e. when about 800GB of data resided in the database. Both databases handle both scenarios without any unavailability. Table 1 summarizes the results for both the resilience and the scale-out evaluation.
| Metric | Scale-out Aerospike | Scale-out Cassandra | Resilience Aerospike | Resilience Cassandra |
|---|---|---|---|---|
| Data redistribution time [h] | 3.5 | 0.9 | 4.9 | 1.75 |
| Throughput [kOps/s] | 130 | 44 | 130 | 44 |
| Read P99 [ms] | 0.5 - 1.2 | 1.5 - 1.9 | 0.55 | 1.6 - 2.3 |
| Insert P99 [ms] | 0.3 | 0.9 | 0.4 | 1.0 - 1.3 |
| Delete P99 [ms] | 0.3 | 0.8 | 0.4 | 1.0 - 1.3 |
| Update P99 [ms] | 0.7 - 1.4 | 0.8 | 0.7 - 0.8 | 1.0 - 1.3 |
While scaling out, Aerospike shows stable latencies of around 0.3ms for insert and delete operations, while Apache Cassandra is up to 3x slower, but still sub-millisecond. For update operations Aerospike shows partially spiky behaviour with P99 latencies between 0.5ms and 1.2ms, while Apache Cassandra shows more stable P99 latencies at around 0.8ms. For read operations both systems show spiky behaviour: Aerospike's P99 latencies range between 0.5ms and 1.2ms, Apache Cassandra's between 1.5ms and 1.9ms.
In the resilience case, the P99 latencies for Aerospike are consistently better than for Apache Cassandra. For insert and delete operations, Apache Cassandra's P99 latencies are 2.5x-3.25x higher than Aerospike's. For read operations the difference grows to 2.9x-4.55x; for update operations, it is only 1.25x-1.86x.
Figure 2 illustrates the throughput measured for Cassandra over the entire benchmark runtime. Despite the relatively low load, there is a clearly visible drop in performance when the cluster node fails. After data redistribution has completed, Cassandra's throughput rises again, but with only five cluster nodes it does not reach the level it previously had with six.
Conclusions
Benchmarking methodologies that focus solely on throughput and latency capture only a partial view of database performance. Real-world database systems must sustain service continuity under fluctuating workloads, infrastructure changes, and unexpected node failures. To reflect these operational realities, benchmarking must incorporate resilience-oriented dimensions such as recovery behavior, redistribution efficiency, and post-incident stability. These metrics provide a more accurate characterization of how systems behave under stress and, consequently, how suitable they are for production environments with strict availability requirements.
In this article, we sketched how such elaborate evaluations can be executed by the benchANT benchmarking framework. Using the framework, we compared Aerospike and Cassandra under controlled scale-out and failure scenarios, measuring metrics such as recovery time, post-incident performance, and stability. The evaluation illustrates the benefits of this broader methodological scope. By systematically introducing scale-out and failure events, the study reveals differences not only in steady-state efficiency but also in adaptive behavior: Aerospike demonstrates higher overall throughput and lower latency under both normal and disturbed conditions, whereas Cassandra recovers faster in redistribution phases but suffers greater transient performance loss.