Story: How Daimler TSS Identified a Fast, High-Availability Database for Their Cloud2Go Project!
Which database management system delivers a reliable throughput of 1,000 ops/s even if one or more database nodes fail?
We investigated this question with Daimler TSS in our first PoC.
We ran 90 performance benchmarks to identify
- a high-performance database,
- in a highly available multi-node setup
- on “right-sized”, private cloud VMs
for the customer's specific IoT use case and its time-series workload.
About Daimler TSS
Daimler TSS GmbH is an internal IT partner of Mercedes-Benz Group AG. Daimler TSS develops exclusive and innovative IT solutions in the field of mobility.
Daimler TSS is headquartered in Ulm, Germany and employs approximately 1,200 people.
The Challenge: High-Availability Performance
IoT and Big Data require highly scalable and (geo-)distributed data management. Many NoSQL database management systems promise to meet these requirements for scaling and distribution, along with elasticity and high availability.
In the Daimler TSS "Cloud2Go" project, we worked with the IT team to define the following database and cloud infrastructure requirements.
- {SLA #1} Guaranteed database throughput of 1,000 ops/s.
- {SLA #2} Cost-efficient high-availability setup that guarantees no data loss.
Many NoSQL databases promise a throughput of 1,000 ops/s and reach it at peak without problems.
- But which DBMS guarantees it at all times?
- Even if one or more DBMS nodes fail?
- And this with a write-heavy workload in this IoT time-series scenario?
- And this with the most cost-efficient minimal infrastructure possible?
The Solution: Benchmarking with Failure-Injection
benchANT solved this challenge in successive performance measurement series with different targets. This procedure is typical for most performance evaluations, as one steadily approaches the goal by measuring and narrowing down.
Phase 1: Performance benchmarking DBMS candidates
Together with Daimler TSS, we selected 7 potential DBMS candidates.
These databases were installed on different cloud setups with different node counts, VM sizes, and storage backends, and the modeled IoT workload was run using the YCSB benchmark. The performance results were then collected and graphed.
Note: This entire benchmarking process is mostly automated using benchANT's benchmarking framework. Benchmarking is therefore possible without a large investment of time or consulting costs.
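The combination of DBMS candidates, node counts, VM sizes, and storage backends forms a benchmark matrix. A minimal sketch of how such a matrix could be enumerated is shown here; `BenchmarkConfig`, `build_matrix`, and all values are hypothetical placeholders and do not reflect benchANT's framework or the setups used in the project.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkConfig:
    """One benchmark run: a DBMS deployed on one specific cloud setup."""
    dbms: str
    nodes: int
    vm_size: str
    storage: str


def build_matrix(dbms_names, node_counts, vm_sizes, storage_backends):
    """Return one BenchmarkConfig per combination of the four dimensions."""
    return [
        BenchmarkConfig(d, n, v, s)
        for d, n, v, s in product(dbms_names, node_counts, vm_sizes, storage_backends)
    ]


# Placeholder values, not the ones used in the project.
matrix = build_matrix(
    dbms_names=["dbms-a", "dbms-b"],
    node_counts=[3, 5],
    vm_sizes=["small", "medium"],
    storage_backends=["local-ssd", "network"],
)
print(len(matrix))  # 2 * 2 * 2 * 2 = 16 configurations
```

In practice the full cross product grows quickly, which is one reason a measurement series is usually narrowed down step by step rather than run exhaustively.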
Phase 2: High-Availability Benchmarking with Favorite DBMSs
Based on the performance results, we selected the most promising DBMS candidates and subjected them to a chaos-testing evaluation. We re-ran the performance benchmarks above, but injected typical failures into the setup during each benchmark run. For example, we forced a DBMS node to fail so that it had to be restarted. Afterwards, we analyzed what impact the failure injection had on the performance of the setup.
Note: We obtained very surprising insights into the fault-proneness and high availability of a wide variety of database management systems. Unfortunately, the actual behavior does not match the vendors' promises in some cases. For example, two databases were not able to restart the failed node during the write-heavy workload because the consistency of the failed node could not be restored.
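The impact analysis can be illustrated with a small sketch that splits the throughput samples of one run into the phases before, during, and after an injected failure and checks them against SLA #1. The function name, the sample format, and the numbers are assumptions made for illustration; they are not measurements or tooling from the project.

```python
from statistics import mean

SLA_OPS_PER_SEC = 1_000  # SLA #1 from the requirements


def failure_impact(samples, fail_at, recovered_at, sla=SLA_OPS_PER_SEC):
    """Summarize throughput before, during, and after an injected failure.

    samples: list of (second, ops_per_sec) tuples from one benchmark run.
    fail_at / recovered_at: seconds at which the node was killed and was
    back in service. Each of the three phases must contain a sample.
    """
    before = [ops for t, ops in samples if t < fail_at]
    during = [ops for t, ops in samples if fail_at <= t < recovered_at]
    after = [ops for t, ops in samples if t >= recovered_at]
    return {
        "mean_before": mean(before),
        "min_during": min(during),
        "mean_after": mean(after),
        "seconds_below_sla": sum(1 for _, ops in samples if ops < sla),
        "sla_met": all(ops >= sla for _, ops in samples),
    }


# Made-up run: a node is killed at t=3 and is back in service at t=6.
run = [(0, 1500), (1, 1480), (2, 1510), (3, 700),
       (4, 950), (5, 1100), (6, 1450), (7, 1490)]
report = failure_impact(run, fail_at=3, recovered_at=6)
print(report)
```

A run like this one would reach well over 1,000 ops/s at peak and still violate the SLA, because throughput drops below the threshold while the node is down.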
Phase 3: Performance/Cost Optimization
The remaining two candidates were optimized with different DBMS configurations in further performance benchmarks (configuration tuning).
This allowed us to determine a cost-efficient setup that met the SLAs and had a performance/cost ratio approximately 30% better than the initial setup.
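One way to read such a performance/cost ratio is throughput per unit of infrastructure cost. The sketch below assumes that definition. The source does not state how the ratio was calculated, and the numbers are purely illustrative, chosen so that the example yields a 30% gain.

```python
def perf_cost_ratio(ops_per_sec, cost_per_hour):
    """Throughput delivered per unit of hourly infrastructure cost."""
    if cost_per_hour <= 0:
        raise ValueError("cost_per_hour must be positive")
    return ops_per_sec / cost_per_hour


def relative_improvement(initial, tuned):
    """Fractional gain of the tuned ratio over the initial one."""
    return (tuned - initial) / initial


# Illustrative numbers only; not measurements from the project.
initial = perf_cost_ratio(ops_per_sec=1200, cost_per_hour=2.0)  # 600 ops/s per cost unit
tuned = perf_cost_ratio(ops_per_sec=1170, cost_per_hour=1.5)    # 780 ops/s per cost unit
print(f"{relative_improvement(initial, tuned):.0%}")
```

Note that the tuned setup in this example delivers slightly less raw throughput than the initial one. It still stays above the 1,000 ops/s SLA while costing less, which is what improves the ratio.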
The Results: Testing shows the Truth
The findings of these performance and high availability evaluations provided Daimler TSS and us with some insights:
- NoSQL databases are very heterogeneous and often behave differently
- Vendor claims are not to be trusted in every workload scenario
- Enormous optimizations can be made through tuning and testing
With this performance data, Daimler TSS had a sound, well-founded basis for decisions on the further course of the project. The results were presented at an internal Daimler TSS conference.
The total effort on the part of Daimler TSS was also extremely low, at about 3 person-days from kickoff to the final presentation of the results.
Further information can be found in this project article of the University of Ulm.