Customized Workloads with the YCSB
The Yahoo! Cloud Serving Benchmark (YCSB) is a well-known benchmark suite for measuring and evaluating the performance of database management systems (DBMS). It was originally developed by Yahoo! to prove the superiority of their PNUTS DBMS. Over time, it has been adopted by several other database vendors, many of which provide bindings for their respective DBMS. In addition, multiple vendors use it to run market comparisons of their DBMS against one or more competitors.
One of the major advantages of YCSB is the very simple workload it uses: all operations that ship with YCSB are CRUD operations, and all of them use a row's or document's primary key. On the upside, this simplicity means that YCSB can be implemented for many different DBMS. On the downside, the YCSB default workloads do not use any even marginally sophisticated DBMS features. For instance, all data types are strings and not even secondary indexes are used.
In this article, we sketch how to exploit YCSB to build different workloads and to use other types of queries. While such an approach still requires work, it relieves the programmer from building the features beyond workload handling that are nevertheless important for benchmarking: metric collection, thread handling, and start-up and tear-down handling.
Overview
The Yahoo! Cloud Serving Benchmark (YCSB) is a well-known and widely used database benchmark suite distributed as open-source software. It allows measuring the performance of numerous modern NoSQL and SQL database management systems with simple database operations on synthetically generated data. The YCSB also lends itself to performance comparisons of multi-node database systems on distributed infrastructures such as the public cloud. Because it uses only simple operations, the YCSB can be used to compare many architecturally different databases and to measure the baseline performance of different database configurations under different workloads.
As the name suggests, the YCSB was developed in 2010 by the then Internet giant Yahoo! Their aim was to create a standardized benchmark for the purpose of comparing the Yahoo-internal database "PNUTS" with other NoSQL databases. The associated research paper was also published in 2010 and has since been cited over 3,500 times.
Since then, the YCSB has continuously been enhanced and updated; mostly, support for further database management systems (DBMS) has been added. Sporadically, attempts were made to introduce support for new workloads. Unfortunately, the official GitHub repository shows little activity and no consistent development, so that many different, often incompatible, forks of the library are available. Almost all popular database vendors maintain their own fork of the YCSB. benchANT also maintains its own fork, which differs from the main repository mostly in that additional database bindings have been added and driver versions have been upgraded in many cases.
As shown in our 2023 recap, the YCSB is a core building block of the work done at benchANT. What is not explicitly stated in the recap is that benchANT often uses the YCSB framework and associated tooling to implement custom workloads. In this article, we discuss how YCSB can be used to implement custom workloads. We will not touch upon aspects of YCSB that are unrelated to this. For an extensive description of YCSB from a user perspective, please refer to our earlier blog article.
YCSB Building Blocks
A database benchmark suite, such as the YCSB, provides a framework that automates essential tasks in a database benchmarking process such as:
- The connectivity to the database via the database drivers.
- The definition of a workload with the essential parameters.
- The execution of the workload on the database, including:
  - the handling of concurrency and hence the emulation of multiple database users
  - the timing of request submission to support constant throughput
- The collection and storage of performance data.
The YCSB is implemented in Java and built in a rather modular way. Many parts of the framework are configurable, and many others can be replaced, enhanced, or exchanged. Yet the YCSB was never designed to be fully flexible and, in consequence, some enhancements are more laborious than others.
From an architectural perspective, the YCSB consists of four core parts: database bindings, workload generator, thread handling, and metric collection (not shown in the diagram are further components in the YCSB core that compute further high-level metrics such as throughput and take care of e.g. initialization and clean-up).
Technically, the YCSB bootstrap process reads the configuration to determine which database binding to use for the benchmark and how many clients (threads) to emulate. The configuration also determines which workload generator to use. The core then takes care of initializing the workload generator, the right number of threads, and the database binding. During this process, the database binding is wrapped into an API-compatible wrapper that intercepts all invocations of the database binding and is therefore able to measure the execution time as well as the return status (SUCCESS vs FAILED) on a per-query basis.
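The wrapping idea can be illustrated with a minimal sketch. The types below (CallStatus, KeyValueBinding, Measurement, MeasuringBinding) are our own simplified stand-ins with a single read method; they are not the actual YCSB classes, whose API has five methods and a richer measurement subsystem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Outcome of a single database call (simplified stand-in, not the YCSB type). */
enum CallStatus { SUCCESS, FAILED }

/** Hypothetical binding API reduced to one operation for brevity. */
interface KeyValueBinding {
  CallStatus read(String table, String key);
}

/** One measured invocation: operation name, latency, and status. */
final class Measurement {
  final String operation;
  final long latencyNanos;
  final CallStatus status;

  Measurement(String operation, long latencyNanos, CallStatus status) {
    this.operation = operation;
    this.latencyNanos = latencyNanos;
    this.status = status;
  }
}

/** Exposes the same API as the wrapped binding and records every call. */
final class MeasuringBinding implements KeyValueBinding {
  private final KeyValueBinding delegate;
  private final List<Measurement> measurements = new ArrayList<>();

  MeasuringBinding(KeyValueBinding delegate) {
    this.delegate = delegate;
  }

  @Override
  public CallStatus read(String table, String key) {
    return measure("READ", () -> delegate.read(table, key));
  }

  /** Times the call and stores latency and status before returning the status. */
  private CallStatus measure(String operation, Supplier<CallStatus> call) {
    long start = System.nanoTime();
    CallStatus status;
    try {
      status = call.get();
    } catch (RuntimeException e) {
      status = CallStatus.FAILED; // an exception in the binding counts as a failed call
    }
    long elapsed = System.nanoTime() - start;
    synchronized (measurements) {
      measurements.add(new Measurement(operation, elapsed, status));
    }
    return status;
  }

  /** Returns a snapshot of all measurements recorded so far. */
  List<Measurement> measurements() {
    synchronized (measurements) {
      return new ArrayList<>(measurements);
    }
  }
}
```

Because the wrapper implements the same interface as the binding, the workload does not need to know that it is being measured. This is also the reason why a changed binding API requires a matching change in the wrapper.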
Workload Handling
With this basic understanding of the YCSB architecture, we now dive a bit deeper into workload handling. At the bottom of every workload lie the threads that represent database users. Each thread runs a main loop that repeatedly invokes the workload generator. The exact sequence of steps depends on the configuration; yet, in every round, the workload is invoked. The next listing illustrates this:
while (notDone) {
  waitForNextInvocation()
  workload.doTransaction()
  checkIfDone()
}
In the next step, the workload implementation chooses the operation to perform and generates the parameters for that operation. Finally, it invokes one of the five methods offered by the database binding: read, update, insert, scan, delete.
Workload.doTransaction {
  op := chooseOperation
  params := generateParameters(op)
  invoke(db, op, params)
}
The database binding would then execute the operation and return a status code.
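The two listings can be put together in a small, runnable Java sketch. All names here (OpKind, CountingDb, TwoOpWorkload, ClientThreadSketch) are hypothetical stand-ins of ours, the workload is reduced to two operation types, and the pacing logic is only one plausible way to approximate constant throughput; the actual YCSB client code differs in detail. In YCSB, one such loop runs in each client thread.

```java
import java.util.Random;
import java.util.concurrent.locks.LockSupport;

/** Two of the five operation types are enough to show the principle. */
enum OpKind { READ, UPDATE }

/** Stand-in for a database binding that only counts the calls it receives. */
final class CountingDb {
  final int[] calls = new int[OpKind.values().length];

  boolean execute(OpKind op, String key) {
    calls[op.ordinal()]++;
    return true;
  }
}

/** Chooses an operation, generates its parameters, and invokes the binding. */
final class TwoOpWorkload {
  private final Random random;
  private final double readProportion;
  private final int recordCount;

  TwoOpWorkload(long seed, double readProportion, int recordCount) {
    this.random = new Random(seed);
    this.readProportion = readProportion;
    this.recordCount = recordCount;
  }

  boolean doTransaction(CountingDb db) {
    OpKind op = random.nextDouble() < readProportion ? OpKind.READ : OpKind.UPDATE;
    String key = "user" + random.nextInt(recordCount); // uniformly chosen primary key
    return db.execute(op, key);
  }
}

/** Main loop of one emulated database user. */
final class ClientThreadSketch {
  /** Runs operationCount transactions; targetOpsPerSec <= 0 disables pacing. */
  static int run(TwoOpWorkload workload, CountingDb db, int operationCount, double targetOpsPerSec) {
    long intervalNanos = targetOpsPerSec > 0 ? (long) (1_000_000_000L / targetOpsPerSec) : 0L;
    long start = System.nanoTime();
    int done = 0;
    while (done < operationCount) {
      // waitForNextInvocation: sleep until the scheduled start time of this operation
      long deadline = start + done * intervalNanos;
      long wait;
      while ((wait = deadline - System.nanoTime()) > 0) {
        LockSupport.parkNanos(wait);
      }
      workload.doTransaction(db);
      done++; // checkIfDone is the loop condition
    }
    return done;
  }
}
```

Scheduling each operation against a fixed start time, rather than sleeping a fixed interval after each call, keeps the average request rate close to the target even when individual calls are slow.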
Approaches to Workload Customization
The standard workload implementation of YCSB is found in the CoreWorkload class, and the straightforward approach to customizing YCSB workloads is to adapt the configuration parameters of that workload. This workload implementation offers a wide range of parameters that, amongst others, influence the number of data items, their internal structure (number of fields), and their size. Other parameters tweak the ratio of operation types, e.g. the portion of read operations, and the request distribution for read operations (zipfian distribution vs. uniform vs. hot data). These means alone allow performance engineers to apply YCSB to different scenarios ranging from read-only to insert-heavy.
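As an illustration, a read-heavy configuration could be assembled as follows. The property names are the ones we know from the upstream CoreWorkload; the values are purely illustrative, the package name of the workload class differs between YCSB versions and forks, and the proportion check is our own sanity check rather than part of YCSB. Please verify the names against the fork in use.

```java
import java.util.Properties;

final class CoreWorkloadConfigSketch {
  /** Builds an illustrative read-heavy parameter set for CoreWorkload. */
  static Properties readHeavy() {
    Properties p = new Properties();
    // Package name as in recent upstream versions; older versions use com.yahoo.ycsb.
    p.setProperty("workload", "site.ycsb.workloads.CoreWorkload");
    p.setProperty("recordcount", "1000000");     // number of data items
    p.setProperty("operationcount", "5000000");  // operations in the run phase
    p.setProperty("fieldcount", "10");           // fields per data item
    p.setProperty("fieldlength", "100");         // size of each (string) field
    p.setProperty("readproportion", "0.95");
    p.setProperty("updateproportion", "0.05");
    p.setProperty("insertproportion", "0");
    p.setProperty("scanproportion", "0");
    p.setProperty("requestdistribution", "zipfian");
    return p;
  }

  /** Our own sanity check: sums the operation proportions, which should add up to 1. */
  static double proportionSum(Properties p) {
    double sum = 0.0;
    for (String key : new String[] {
        "readproportion", "updateproportion", "insertproportion", "scanproportion"}) {
      sum += Double.parseDouble(p.getProperty(key, "0"));
    }
    return sum;
  }
}
```

The same key-value pairs can equally be written into a workload properties file and passed to the YCSB client.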
In all cases, however, the major downsides of the YCSB default workload still apply: all operations are basically CRUD operations based on a data item's primary key. Further, none of the operations makes use of any even marginally sophisticated DBMS features, and all data types are strings.
Custom Workload Classes
If the parametrization of CoreWorkload is not sufficient and greater flexibility in workload characteristics is needed, YCSB supports the definition of custom workload classes. Its configuration mechanism even supports loading and using new workload classes out of the box. The YCSB suite ships with a set of different data generators that support developers of custom workload classes in creating data of different distributions. These include, for instance, generators for uniform and exponential distributions.
Despite this flexibility, building custom workload classes comes with some limitations that mostly result from the API YCSB imposes on database bindings. The API provides exactly five methods (read, insert, update, scan, delete) with a fixed set of parameters. That is, when a new workload class is supposed to be used with existing database bindings, the data generated by this workload class needs to conform to the semantics set by the API. In consequence, using a custom workload class helps with building specific workloads, but cannot overcome the basic limitations.
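A custom workload class of this kind could look roughly like the following sketch. StringBinding and WorkloadSketch are simplified stand-ins of ours for the YCSB binding API and workload base class (the real signatures carry more parameters), and both the workload and its recentreads.mean property are hypothetical. The workload reads recently inserted keys more often than old ones by drawing the key's age from an exponential distribution, yet it still only exchanges strings and primary keys with the binding.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Random;

/** Simplified stand-in for the string-based binding API (two of the five methods). */
interface StringBinding {
  boolean insert(String table, String key, Map<String, String> values);
  boolean read(String table, String key);
}

/** Simplified stand-in for the workload base class. */
abstract class WorkloadSketch {
  abstract void init(Properties p);
  abstract boolean doInsert(StringBinding db);      // used in the load phase
  abstract boolean doTransaction(StringBinding db); // used in the run phase
}

/** Hypothetical workload that prefers recently inserted keys when reading. */
final class RecentReadsWorkload extends WorkloadSketch {
  private final Random random = new Random(42);
  private String table;
  private double meanAge;
  private int inserted;

  @Override
  void init(Properties p) {
    table = p.getProperty("table", "usertable");
    // "recentreads.mean" is a made-up property name for this sketch.
    meanAge = Double.parseDouble(p.getProperty("recentreads.mean", "10"));
  }

  @Override
  boolean doInsert(StringBinding db) {
    Map<String, String> values = new HashMap<>();
    values.put("field0", "value" + inserted); // the API only transports strings
    return db.insert(table, "user" + inserted++, values);
  }

  @Override
  boolean doTransaction(StringBinding db) {
    if (inserted == 0) {
      return false; // nothing to read yet
    }
    // Exponentially distributed age, capped at the oldest existing record.
    double sample = -meanAge * Math.log(1.0 - random.nextDouble());
    int age = (int) Math.min(inserted - 1, sample);
    return db.read(table, "user" + (inserted - 1 - age));
  }
}
```

Note that the workload changes which keys are accessed and how often, but not what kind of operation reaches the database: every call is still a primary-key read or insert with string values.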
Custom Workload Classes with Modified Parameter Semantics
When new workload classes do not have to work with existing bindings, it is an option to overload the semantics of the API parameters and to build a new database binding that understands the new semantics. For instance, YCSB ships with a RestWorkload that can be used together with the rest binding and that is meant for benchmarking REST endpoints. Both workload and binding use the primaryKey parameter to exchange information about the endpoint of the REST service. Similarly, YCSB also ships with a TimeSeriesWorkload class that uses the existing API and an instruction-to-parameter mapping in order to send information about time-series queries and inserts to the database binding. Yet, none of the bindings from the official YCSB repository supports the semantics imposed by the TimeSeriesWorkload.
Using a sufficiently complex mapping from YCSB API to a new parameter semantics enables users to circumvent most of the limitations. Yet, the drawback of changing the semantics of the parameters of the standard binding API is that in the end, it may not be clear to a user which binding works well with which workload. Further, the API remains limited to its current parameters and transferring more complex information requires the definition of a complex mapping (as done by the TimeSeriesWorkload class) that may cause problems for adopters. This difficulty is not only a programming challenge, as for the sake of benchmark comparability, all bindings need to basically do the same thing. Finally, it is difficult to have one binding support different types of workloads that all have their custom parameter semantics.
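To make the trade-off concrete, the following sketch squeezes a range query into the key parameter of a read call. The encoding (metric, colon, two timestamps separated by a comma) and the class names are invented for this illustration; this is not the mapping used by TimeSeriesWorkload or RestWorkload.

```java
/** A decoded range query (hypothetical type for this sketch). */
final class RangeQuery {
  final String metric;
  final long fromMillis;
  final long toMillis;

  RangeQuery(String metric, long fromMillis, long toMillis) {
    this.metric = metric;
    this.fromMillis = fromMillis;
    this.toMillis = toMillis;
  }
}

/** Invented convention that overloads the key parameter: "<metric>:<from>,<to>". */
final class RangeQueryKeyCodec {
  /** Used by the workload to pack the query into the key string. */
  static String encode(String metric, long fromMillis, long toMillis) {
    return metric + ":" + fromMillis + "," + toMillis;
  }

  /** Used by the binding to unpack the query again. */
  static RangeQuery decode(String key) {
    int colon = key.lastIndexOf(':');
    int comma = key.indexOf(',', colon + 1);
    if (colon < 0 || comma < 0) {
      // A binding receives such a key when it is paired with an ordinary workload.
      throw new IllegalArgumentException("not an encoded range query: " + key);
    }
    return new RangeQuery(
        key.substring(0, colon),
        Long.parseLong(key.substring(colon + 1, comma)),
        Long.parseLong(key.substring(comma + 1)));
  }
}
```

Nothing in the method signatures tells a user that this binding expects encoded keys. A mismatch between workload and binding only shows up at run time, which is exactly the drawback described above.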
Custom Workload Classes with Custom Binding API
An alternative and, in our opinion, preferable approach is to define new binding APIs when the workload requires them. This makes the semantics visible and makes it clear which binding supports which kinds of workloads. It also allows passing non-string data between workload class and binding, something not foreseen by the YCSB API. In addition, it allows a database binding to support different types of workloads with different semantics, something that is not possible when a binding is forced to change the semantics of a parameter.
On the downside, this approach stretches the modularity of YCSB beyond its limits. More precisely, implementing this approach requires not only implementing a workload class and a binding, but also adapting several system classes so that they can support the modified binding APIs. In particular, this affects the metric collection, which wraps the database binding and therefore needs to be tailored to the new binding API.
Very recently, we used this approach to extend YCSB with two very different workloads, one querying time-series databases and the other querying relational and document-oriented DBMS with typed columns / document fields using WHERE clauses and secondary indexes. In both cases we defined a specific interface the bindings supporting the respective workload would have to implement. The interface defines and documents the methods and hence, separates the APIs of different workload classes from each other so that a single binding can support multiple different workloads.
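The general pattern can be sketched as follows. The interfaces and method names below are illustrative inventions and not the interfaces from our fork: each workload type gets its own typed interface, a binding implements whichever interfaces it supports, and a workload checks for its interface before the benchmark starts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical binding API for a time-series workload (typed parameters). */
interface TimeSeriesApi {
  boolean insertPoint(String metric, long timestampMillis, double value);
  List<Double> queryRange(String metric, long fromMillis, long toMillis);
}

/** Hypothetical binding API for a workload that filters on a typed column. */
interface TypedQueryApi {
  boolean insertRow(String table, int id, int age);
  int countWhereAgeAtLeast(String table, int minAge);
}

/** A single binding may support several workload types side by side. */
final class InMemoryBinding implements TimeSeriesApi, TypedQueryApi {
  private final Map<String, TreeMap<Long, Double>> series = new HashMap<>();
  private final Map<String, Map<Integer, Integer>> tables = new HashMap<>();

  @Override
  public boolean insertPoint(String metric, long timestampMillis, double value) {
    series.computeIfAbsent(metric, m -> new TreeMap<>()).put(timestampMillis, value);
    return true;
  }

  @Override
  public List<Double> queryRange(String metric, long fromMillis, long toMillis) {
    TreeMap<Long, Double> points = series.get(metric);
    if (points == null) {
      return new ArrayList<>();
    }
    // Both bounds inclusive; values come back ordered by timestamp.
    return new ArrayList<>(points.subMap(fromMillis, true, toMillis, true).values());
  }

  @Override
  public boolean insertRow(String table, int id, int age) {
    tables.computeIfAbsent(table, t -> new HashMap<>()).put(id, age);
    return true;
  }

  @Override
  public int countWhereAgeAtLeast(String table, int minAge) {
    Map<Integer, Integer> rows = tables.get(table);
    int count = 0;
    if (rows != null) {
      for (int age : rows.values()) {
        if (age >= minAge) {
          count++;
        }
      }
    }
    return count;
  }
}

/** Lets a workload fail fast when paired with an unsuitable binding. */
final class BindingApis {
  static <T> T require(Object binding, Class<T> api) {
    if (!api.isInstance(binding)) {
      throw new IllegalArgumentException(
          binding.getClass().getSimpleName() + " does not support " + api.getSimpleName());
    }
    return api.cast(binding);
  }
}
```

A time-series workload would call BindingApis.require(binding, TimeSeriesApi.class) during initialization. The interface then documents the contract, and an unsuitable combination of workload and binding is rejected before the first operation is issued.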
Summary
In this article, we revisited YCSB and its software architecture in order to understand how YCSB can be exploited for specifying custom workloads. While the standard workload shipping with YCSB provides many configuration parameters for customization, it fixes the structure / schema of the tables / documents, the data types used (currently only strings), as well as the way data is accessed (currently only by primary key).
Whenever another approach to data and data structure is required, a user also has to provide a custom implementation of a workload. Often, the custom workload also requires transferring additional information to the database bindings. This can be done in multiple ways, among which we prefer defining custom APIs for the database bindings. This approach offers maximum flexibility and also documents the relationship between workloads and database bindings well. On the downside, it requires changes to YCSB core classes, which feels a bit heavyweight. We are therefore working on a modification of the YCSB core that better supports flexible APIs without jeopardizing the essential features YCSB provides, mainly the handling of parallel workloads.