Benchmarking Vector Databases
The ability to store and search vectors, also called embeddings, is crucial in many fields of artificial intelligence (AI). For that reason, vector databases and vector extensions for established general-purpose databases have been a major topic in the database world and beyond for quite a while now.
When selecting a database technology for vector workloads, many selection criteria are the same as in other application domains for databases. Nevertheless, the heuristic nature of vector search introduces further criteria, which in turn impose additional requirements on benchmarking vector databases.
Introduction
Vector databases and vector extensions for established general-purpose databases were a major topic in the database world in 2023, and the buzz has not stopped in 2024. In brief, this is a consequence of the AI wave that has hit the IT landscape, forcing all decision makers to focus more on AI: established products are enhanced with AI features and new AI-based products are pushed onto the market; new architectural patterns such as retrieval-augmented generation (RAG) have emerged and rely on databases that provide low latency and high parallelism when accessing vector data.
This post is only a very short summary of the topic with a focus on benchmarking. For those who want to dig deeper, we recommend this brilliant blog series by Prashanth Rao.
In brief, vector-enabled databases offer a special data type (a vector). This vector, also called an embedding, is, as Aerospike puts it in their Aerospike Vector Search FAQ, "a statistical representation of a piece of data [...] produced by machine learning models [that] can represent a word, document, image, video, song".
Classically, database queries aim at finding items that were previously stored in the database by filtering the data set by certain criteria. In contrast, a vector search query aims at finding vectors stored in the database that are similar to a vector passed as a query parameter; or, to put it more technically, the aim is to find the nearest neighbors of that vector. To cope with large data sets, special types of database indexes exist for vector columns. Probably all of them turn the nearest neighbor search into a heuristic process (Approximate Nearest Neighbor, ANN).
Existing database systems differ with respect to which kinds of indexes and index parameters they support, but also with respect to the maximum number of dimensions a vector can have and the metrics users can choose from to quantify vector similarity.
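To make the contrast concrete, here is a minimal sketch of an exact (non-approximate) nearest neighbor search in plain Python. It assumes Euclidean distance and a tiny in-memory dictionary of made-up vectors; the function name and the data are ours for illustration and do not correspond to any database API. ANN indexes exist precisely because this full scan becomes too slow for large data sets.

```python
import heapq
import math

def exact_nearest_neighbors(query, vectors, k):
    """Return the ids of the k stored vectors closest to `query` (Euclidean)."""
    # Brute force: compute the distance from the query to every stored vector.
    scored = ((math.dist(query, vec), vec_id) for vec_id, vec in vectors.items())
    # Keep only the k smallest distances.
    return [vec_id for _, vec_id in heapq.nsmallest(k, scored)]

# Made-up three-dimensional "embeddings", for illustration only.
stored = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.4],
    "bus": [0.1, 0.8, 0.5],
}

print(exact_nearest_neighbors([0.9, 0.1, 0.05], stored, k=2))
```

The cost of this scan grows linearly with the number of stored vectors and with their dimensionality, which is the motivation for the index structures discussed in this post.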
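The choice of metric matters because different metrics can rank the same candidates differently. The following sketch, using hand-picked two-dimensional vectors of our own, compares three commonly offered metrics (Euclidean distance, cosine similarity, dot product); which of them a given database supports has to be checked in its documentation.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Compares directions only; vector lengths cancel out.
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

query = [1.0, 0.0]
same_direction_long = [10.0, 0.0]  # same direction as the query, but far away
nearby_but_rotated = [1.0, 0.5]    # close to the query, slightly different direction

for name, candidate in [("long", same_direction_long), ("rotated", nearby_but_rotated)]:
    print(
        name,
        "euclidean:", round(euclidean_distance(query, candidate), 3),
        "cosine:", round(cosine_similarity(query, candidate), 3),
        "dot:", round(dot_product(query, candidate), 3),
    )
```

Here Euclidean distance prefers the nearby vector, while cosine similarity and the dot product prefer the long vector pointing in the same direction. A benchmark should therefore use the metric the embedding model was designed for.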
Selection Criteria
When selecting a backend for vector data, a number of criteria need to be considered. As in any other application domain, these include standard metrics such as total cost of ownership and performance in its various flavors (maximum throughput, tail latencies, etc.), but also technical compatibility with respect to, for instance, the data model. Other common criteria include the available operational models (on-prem vs. cloud self-hosted vs. Database-as-a-Service (DBaaS)) and whether the code base is open-source software.
For storing vector data, obviously a database is required that supports vector data types. Superlinked has done a great job collecting and classifying such databases with their various capabilities and constraints in their VectorDB comparison. Hence, we will not discuss individual database technologies here.
While some database systems come with built-in capabilities to create embeddings or hooks to plug in external embedding tools, others do not. And even where this is supported, the question remains what impact this additional step has on ingestion and query performance.
One of several constraints that need to be considered is the maximum supported vector dimensionality. The required dimensionality depends on the chosen embedding, and hence on the data set and the technology or model used for creating the embedding. Yet, even if a database supports vectors up to the required dimensionality, this does not necessarily mean that it is well suited for such high (or low) dimensions. The sweet spot of a specific database may lie somewhere else.
A further important criterion is the use of indexes on the vector columns. Multiple index types with different properties are known, such as Inverted File Index (IVF) and Hierarchical Navigable Small World (HNSW). Which index to use again depends on the data set, particularly its dimensionality and size. Yet, as the use of indexes makes the search approximate (ANN), it is important to understand whether the application requirements allow for lower recall and lower precision. Hence, the possible trade-offs, for instance between precision and memory demands, are further decision criteria.
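Dimensionality also drives resource demands. As a rough back-of-the-envelope sketch, the raw size of the vectors alone can be estimated as shown below. The figures are our assumptions for illustration: 4 bytes per component (32-bit floats), one million vectors, and 1536 dimensions. Index structures and database overhead come on top and are not modeled here.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_component=4):
    """Raw size of the vector data only, without index or storage overhead."""
    return num_vectors * dimensions * bytes_per_component

# Illustrative numbers, not taken from any specific product or data set.
size = raw_vector_bytes(num_vectors=1_000_000, dimensions=1536)
print(f"{size} bytes = {size / 2**30:.2f} GiB")
```

Doubling the dimensionality doubles this figure, which is one reason why a database that merely accepts a given dimensionality is not necessarily a good fit for it.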
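The quality loss introduced by an ANN index is commonly quantified as recall at k: the fraction of the true k nearest neighbors that the approximate search actually returns. A minimal sketch follows; the id lists are invented for illustration, and in a real benchmark the exact result would come from a brute-force search over the same data.

```python
def recall_at_k(approximate_ids, exact_ids, k):
    """Fraction of the true top-k neighbors contained in the ANN top-k result."""
    truth = set(exact_ids[:k])
    found = truth.intersection(approximate_ids[:k])
    return len(found) / len(truth)

# Invented result lists: ids ordered from nearest to farthest.
exact = [17, 4, 42, 8, 15]     # ground truth from an exact search
approx = [17, 42, 4, 99, 23]   # what a hypothetical ANN index returned

print(recall_at_k(approx, exact, k=5))  # 3 of the 5 true neighbors were found
```

Averaging this value over many queries yields a single recall figure that can be reported next to latency and throughput for each index configuration.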
The Benchmarking Perspective
Overall, the benchmarking process for vector databases is not much different from benchmarking other database systems. That is, the process of first loading data into the database (load phase) and then issuing operations against the database (run phase) remains the same, provided that benchmarking of the respective database system is allowed at all.
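The two phases can be sketched in a few lines of Python. Everything here is an assumption for illustration: `InMemoryStore` is a hypothetical stand-in for a real database client, the data is random, and queries are issued sequentially from a single thread, whereas real benchmarking tools typically use many concurrent clients.

```python
import math
import random
import time

class InMemoryStore:
    """Hypothetical stand-in for a database client; not a real driver API."""

    def __init__(self):
        self.vectors = []

    def insert(self, vector):
        self.vectors.append(vector)

    def search(self, query, k):
        # Exact search by full scan, sorted by Euclidean distance.
        return sorted(self.vectors, key=lambda v: math.dist(query, v))[:k]

def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]

def run_benchmark(store, data, queries, k=10):
    # Load phase: ingest the whole data set and time it.
    load_start = time.perf_counter()
    for vector in data:
        store.insert(vector)
    load_seconds = time.perf_counter() - load_start

    # Run phase: issue the queries one by one and record each latency.
    latencies = []
    run_start = time.perf_counter()
    for query in queries:
        start = time.perf_counter()
        store.search(query, k)
        latencies.append(time.perf_counter() - start)
    run_seconds = time.perf_counter() - run_start

    latencies.sort()
    return {
        "load_seconds": load_seconds,
        "queries_per_second": len(queries) / run_seconds,
        "p50_seconds": percentile(latencies, 0.50),
        "p99_seconds": percentile(latencies, 0.99),
    }

rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(1000)]
queries = [[rng.random() for _ in range(8)] for _ in range(50)]
print(run_benchmark(InMemoryStore(), data, queries))
```

Reporting a tail percentile such as p99 alongside throughput matters because averages hide the slow queries that users notice most.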
In contrast to other application domains, the data model appears relatively simple, as it essentially consists of a single column of a vector data type. The queries are also very simple at the database level: while they may apply filters on non-vector columns, standard vector workloads usually involve neither joins, as in OLAP, nor aggregation functions, as in time-series workloads, nor any other complex constructs.
Nevertheless, the data model appears simple only on the surface. Beyond that, the structure of the data, i.e. the number of dimensions, the density, and the similarity metric, strongly influences which indexes and heuristics work well and which do not, as does the number of data points. For instance, SingleStore recommends not using any index if the data set has fewer than a few million data points and queries use selective filters on non-vector columns:
"if you have a smaller set of vectors, say less than a few million, and selective filters in your queries on other non-vector columns, you're often better off not using ANN indexes" [SingleStore]
Therefore, it is of utmost importance to base benchmarks on a data set that is as similar as possible to the data set that will be used in production later on. Ideally, you can even use your production data set for the benchmarks, but in any case you should use the same embedding. Also, each database system under test should be tuned and evaluated independently. It is likely that if a quantized IVF index works well for database system A, it will not be terrible for system B. Yet, we can still expect differences in the implementations and therefore in the behaviour, so that a flat IVF index may even be better for system B. As with other benchmarks, the distribution of query types should match the distribution in production.
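Matching the production query distribution can be as simple as drawing operations from a weighted mix. The sketch below assumes a hypothetical mix of three operation types with made-up shares; in practice these numbers should be derived from production traces or query logs.

```python
import random
from collections import Counter

# Hypothetical production mix; the operation names and shares are illustrative.
QUERY_MIX = {
    "pure_vector_search": 0.70,
    "filtered_vector_search": 0.25,
    "insert": 0.05,
}

def sample_operations(mix, count, seed=0):
    """Draw `count` operation names so that their frequencies follow `mix`."""
    rng = random.Random(seed)  # fixed seed keeps benchmark runs repeatable
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=count)

operations = sample_operations(QUERY_MIX, 10_000)
print(Counter(operations))
```

Using a fixed seed means every database system under test sees the same sequence of operations, which keeps the comparison fair.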
The Benchmarking Landscape
The fact that many different ad-hoc benchmark results are being published on the Internet, e.g. by Tembo, by Adesso, and by others, demonstrates the desire to better understand certain trade-offs when working with embeddings and when applying vector databases. Further, the question of when to use a specialized vector database over a general-purpose database with vector extensions is one that can be answered by rigorous benchmarking.
These days, several benchmarking libraries and suites are available for vector databases. You may want to check out our article on Benchmarking Suites to get the full picture. Nevertheless, none of them has seen wide uptake so far. This is surprising, as vector databases have been around for more than a decade, even though they have attracted a lot of attention only over the last couple of years.
As of now, VectorDBBench, published by Zilliz, seems to be the vector benchmarking library that comes closest to a lowest common denominator. Nevertheless, this suite also lacks much of the flexibility and configurability of benchmarking tools such as YCSB or TSBS, as well as the rigor of the TPC benchmarks.
In conclusion, we can expect further development and progress in this field in the future.