Telegraf-TS: A YCSB-based Benchmark for Telegraf Time-series Data

Benchmarking suites such as TPC-C, YCSB, and TSBS make use of synthetic data. This makes them very flexible and broadly applicable (within their application domain), but it also makes their results less precise when evaluating very specific use cases.

A common use case is the storage, retrieval, and analysis of IT monitoring data, which is used to supervise the functioning of IT infrastructures, but also for error prediction and intrusion detection. Here, Telegraf is a popular tool for collecting that data and storing it in a data backend.

We introduce Telegraf-TS, a new benchmark built on YCSB, intended for supporting benchANT customers in choosing the right Telegraf data backend. It is specifically tailored to evaluate the performance of different types of analytical queries over data stored by Telegraf.

Introduction

Database benchmarking is a clearly defined method for analysing, measuring, and comparing performance metrics of database management systems, usually with the goal of evaluating the efficiency and the performance-cost ratio of different database technologies and database configurations.

The Yahoo! Cloud Serving Benchmark (YCSB) is a simple and generic benchmarking suite that aims to support many different databases and data schemas. On its own, however, it is not well suited to evaluating sophisticated features of specific products. Furthermore, it makes use of synthetic data, which limits the reliability of its results for specific scenarios.

Monitoring is a recurring problem in IT operations. Monitoring tools are available in many different flavors and periodically create multi-dimensional data. Many of them support a large number of data sinks, including time-series databases such as InfluxDB, Azure Data Explorer, and AWS Timestream. One popular monitoring tool is Telegraf, a monitoring agent with many database connectors, called output plugins.

In a recent article, we sketched different strategies for extending YCSB with new workloads while still relying on YCSB core features such as latency measurement, timing, and thread handling. In the remainder of this article, we detail a new YCSB extension called Telegraf-TS, built to work on time-series data stored by the Telegraf monitoring agent.

Benchmarking Scopes

In case you are wondering why one would want to evaluate databases filled with Telegraf data, the most obvious answer is that you are already monitoring your servers and IoT fleet using Telegraf, but have found that the database backend you use to visualize and analyze the data does not fulfil all your expectations. A related reason is that you plan to migrate your storage backend to a DBaaS offering and are evaluating the QoS of different providers.

In these cases, users are usually interested in two questions: (i) How many parallel Telegraf instances and metrics can a (time-series) database instance sustain? In other words, what is a reasonable overall ingestion rate for a given database sizing, and when should more resources be acquired? (ii) How do different database technologies and database configurations differ with respect to their analytics performance? Here, different query languages and querying capabilities come into play, as does the configuration of the database instance.

In this article, we focus more on question (ii). More precisely, we present the Telegraf-TS extension to YCSB that evaluates the latency of different types of queries.

Data Ingestion using Telegraf

Before Telegraf-TS, which is purely read-only, can be run, any database under test needs to be filled with data. A natural approach to this ingestion step is to use Telegraf itself. This is an out-of-band mechanism with respect to the Telegraf-TS benchmark. This mechanism can also collect performance metrics for the ingestion and so help answer question (i) from above. While this is out of scope for this article, it is worth noting that multiple Telegraf instances will be needed to create a significant load on the database and to collect a minimum of useful content.

It is also important to understand that Telegraf collects monitoring data through a set of input plugins. The collected data is processed by a set of processors and filters, and finally sent to a set of output plugins. Telegraf ships with a rich selection of output plugins supporting a large number of different backends. It is worth noting that using Telegraf for data ingestion has the consequence that the data schema used for each backend depends on the output plugin. While it is desirable that the developers of the respective output plugin have used a data model that suits the backend, this is not guaranteed. Nevertheless, this does not render the results of a benchmark such as Telegraf-TS useless. On the contrary, users applying Telegraf in production have to deal with the data model the Telegraf plugin uses and will also suffer from performance problems caused by broken data models.

Telegraf also supports a large number of input plugins that capture different kinds of metrics of a server. For production benchmarks we ran with Telegraf-TS, we activated the following plugins. The more plugins are activated, the more data will be generated by one Telegraf instance.

plugin	description
internal	collects metrics about the Telegraf agent itself
cpu	gathers metrics on the system CPUs
disk	gathers metrics about disk usage
diskio	gathers metrics about disk traffic and timing
kernel	gathers info about the kernel that doesn't fit into other plugins
mem	collects system memory metrics
processes	gathers info about the total number of processes and groups them by status (zombie, sleeping, running, etc.)
swap	collects system swap metrics
cgroup	will capture specific statistics per cgroup
conntrack	collects statistics from Netfilter's conntrack-tool
iptables	gathers packet and byte counters for rules within a set of tables and chains from the Linux iptables firewall
net	gathers metrics about network interface and protocol usage (Linux only)
netstat	collects TCP connections state and UDP socket counts by using lsof
procstat	can be used to monitor the system resource usage of one or more processes
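
To illustrate how such a plugin selection translates into a Telegraf configuration, the following minimal sketch renders one `[[inputs.<name>]]` section per plugin from the table. The helper name `render_input_sections` is hypothetical and not part of Telegraf or Telegraf-TS. Note that several of these plugins, procstat among them, need plugin-specific options (for example, which processes to watch) that are not shown here, so the output is a starting skeleton rather than a complete configuration.

```python
# Plugin names are taken from the table above; everything else is illustrative.
INPUT_PLUGINS = [
    "internal", "cpu", "disk", "diskio", "kernel", "mem", "processes",
    "swap", "cgroup", "conntrack", "iptables", "net", "netstat", "procstat",
]


def render_input_sections(plugins):
    """Return config text with one empty [[inputs.<name>]] section per plugin."""
    return "\n\n".join(f"[[inputs.{name}]]" for name in plugins) + "\n"


if __name__ == "__main__":
    # Print the skeleton; plugin-specific options still have to be added by hand.
    print(render_input_sections(INPUT_PLUGINS))
```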

Approach to Implementation

To implement Telegraf-TS, we added a new YCSB Workload class that iteratively executes a set of pre-defined queries with a configurable iteration count. The user can choose between a breadth-first sequence (Q01, Q01, ..., Q02, Q02, ..., Q03, ...) and a depth-first sequence (Q01, Q02, ..., Q20, Q01, ...).
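
The two orderings can be sketched as follows. This is an illustration in Python, not the code of the actual workload class (YCSB itself is written in Java); the function names are hypothetical.

```python
def query_ids(count=20):
    """Return query identifiers Q01 .. Q<count>."""
    return [f"Q{i:02d}" for i in range(1, count + 1)]


def breadth_first(queries, iterations):
    """Repeat each query `iterations` times before moving on: Q01, Q01, ..., Q02, Q02, ..."""
    return [q for q in queries for _ in range(iterations)]


def depth_first(queries, iterations):
    """Run the whole query list once per iteration: Q01, Q02, ..., Q01, Q02, ..."""
    return [q for _ in range(iterations) for q in queries]


if __name__ == "__main__":
    selected = query_ids(3)
    print(breadth_first(selected, 2))
    print(depth_first(selected, 2))
```

Both orderings execute the same number of queries overall; they differ only in how executions of the same query are spaced out over the run.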

The benchmark supports up to 20 different queries, and the ones to be used can be set via a configuration option. The benchmark supports all of YCSB's standard configuration parameters, but should not be used with more than a single thread. Also, as with standard YCSB, the database binding needs to be chosen and configured. Apart from that, only one additional benchmark-specific configuration parameter exists: timespan defines the time-interval filter to be used in all queries. More precisely, all queries will consider only data points with timestamps > now() - timespan.
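
The semantics of the timespan filter can be made concrete with a small sketch. In the benchmark the filter is evaluated by the database as part of each query; the Python function below (a hypothetical name) only mirrors the predicate timestamp > now() - timespan, including its strict inequality.

```python
from datetime import datetime, timedelta, timezone


def within_timespan(timestamp, timespan, now=None):
    """Return True if `timestamp` is strictly newer than `now - timespan`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return timestamp > now - timespan


if __name__ == "__main__":
    # Fixed reference time so that the example is reproducible.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    five_minutes = timedelta(minutes=5)
    print(within_timespan(now - timedelta(minutes=4), five_minutes, now))
    print(within_timespan(now - timedelta(minutes=6), five_minutes, now))
```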

Regarding the database bindings, we chose the approach named Custom Workload Classes with Custom Binding API in our previous article. In brief, we define a new interface with a single method, tsQuery, that all bindings must implement. It takes the query type (Q01 ... Q20) and the timespan as parameters.
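
The shape of this binding interface can be sketched as follows. Only the method name tsQuery and its two parameters come from the description above; the class names, the Python rendering (`ts_query`), the parameter types, and the dummy binding are assumptions made for illustration. The timing loop is likewise only a stand-in, since in the real benchmark latency measurement is handled by the YCSB core.

```python
import time
from abc import ABC, abstractmethod


class TsBinding(ABC):
    """Hypothetical stand-in for the binding interface with its single method."""

    @abstractmethod
    def ts_query(self, query_type, timespan_seconds):
        """Run the query identified by `query_type` (e.g. "Q01") over the timespan."""


class RecordingBinding(TsBinding):
    """Dummy binding that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def ts_query(self, query_type, timespan_seconds):
        self.calls.append((query_type, timespan_seconds))


def run_queries(binding, sequence, timespan_seconds):
    """Call the binding once per entry and return (query, latency in seconds) pairs."""
    latencies = []
    for query_type in sequence:
        start = time.perf_counter()
        binding.ts_query(query_type, timespan_seconds)
        latencies.append((query_type, time.perf_counter() - start))
    return latencies


if __name__ == "__main__":
    for query_type, latency in run_queries(RecordingBinding(), ["Q01", "Q02"], 300):
        print(query_type, f"{latency:.6f}s")
```

Keeping the interface this narrow means that each binding is free to translate a query type into whatever query language its database offers.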

Currently, Telegraf-TS ships with three different bindings: AWS Timestream, Azure Data Explorer, and InfluxDB.

Query Types

The following table provides a high-level overview of the 20 different queries and shows the functions and clauses used by each of them. While the representation adopts an SQL-like syntax and is very close to language constructs supported by AWS Timestream and InfluxDB, it has to be stressed that each database comes with its own (dialect of a) query language, and queries need to be adapted to each case individually. This is not specific to Telegraf-TS; it is also the case for vanilla YCSB and many other database benchmarks.

query	functions	clauses
Q01	count
Q02	count	WHERE “<”
Q03		ORDER BY, LIMIT
Q04		WHERE “LIKE”
Q05		LIMIT
Q06	distinct	WHERE “<”
Q07	count	WHERE “<”, GROUP BY
Q08	sum
Q09	sqrt, avg, sin, cos, pow, round
Q10	avg	WHERE “<”
Q11	min	WHERE “<”
Q13	max, min	WHERE “<”, UNION ALL
Q14	count, sum
Q15	sum	WHERE “<”, “in”
Q16	count	WHERE “=”
Q17	all	WHERE “=”
Q19	date_trunc, max, count	GROUP BY, ORDER BY
Q20	median	WHERE “=”

All queries shown in the table restrict the results using one or multiple filters. An example of such a filter is a WHERE clause filtering by a specific host. This is the kind of WHERE clause symbolized in the table. In addition, all queries make use of a time filter related to the timespan parameter, which is not listed in the table. This time filter restricts the results to a time interval relative to the current timestamp, e.g., the last 5 minutes.

All queries presented in the table exclusively query the procstat table and the metrics stored therein. Despite that, depending on the use case, it may still be useful to activate further input plugins in Telegraf, as we did in our evaluations.
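
How the host filter and the time filter combine can be sketched by composing an SQL-like query string. The query text below is illustrative only and is not the query used by any of the bindings: the exact syntax differs per database, the column name `host` and the value `server-01` are assumptions, and values are interpolated directly into the string purely to keep the example short.

```python
def build_query(select_expr, host, timespan, table="procstat"):
    """Compose an SQL-like query with a host filter and a relative time filter."""
    return (
        f"SELECT {select_expr} FROM {table} "
        f"WHERE host = '{host}' AND time > now() - {timespan}"
    )


if __name__ == "__main__":
    # A count restricted by an equality filter, in the spirit of Q16 in the table.
    print(build_query("count(*)", "server-01", "5m"))
```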

Summary

So far, the article has detailed the use case and intention behind building Telegraf-TS. It should help benchANT customers find a suitable storage backend for their monitoring solution built on Telegraf. It is worth mentioning that applying a similar approach to other monitoring frameworks is only a small step, be they based on Prometheus, Elastic's monitoring stack, or others.

In case you have been waiting for some data and exciting insights into the differences between Telegraf backends, we have to disappoint you. The extensions to YCSB presented here, as well as additional tooling for Telegraf-based data ingestion, have been implemented as part of a customer project, and the data cannot be shared. Overall, the results can be described as truly surprising.