NoSQL Comparison: MongoDB vs ScyllaDB (Part I)
When choosing a NoSQL database for your business, the options can be overwhelming. One of the most popular choices is MongoDB, known for its easy use and mature community. But the highly performance-oriented ScyllaDB is one of the rising challengers.
In this report, Dr. Daniel Seybold from benchANT takes a closer technical look at both databases, comparing their features, architectures, performance and scalability.
By the end of this article, you'll have a foundational understanding you can use to assess which database is the best fit for your project.
For the benchmark study with 133 measurement results, see this article!
Table of Contents
- Chapter 1: Executive Summary
- Chapter 2: Findings
- Chapter 3: Overview
- Chapter 4: Data Model
- Chapter 5: Query Language
- Chapter 6: Distributed Architectures
- Chapter 7: Conclusion
- Appendix
- About the Author: Dr. Daniel Seybold
Executive Summary
The continuously evolving database landscape provides a variety of solutions for the most diverse use cases. Today, we analyze and compare two NoSQL database systems – the very popular document-oriented MongoDB and the rising wide-column oriented ScyllaDB – from an independent and technical angle.
Both database systems address similar use cases in the eCommerce, social media, mobile, gaming, financial, real-time analytics and Internet of Things domain.
But when it comes to the technical side, you can see the different approaches and focus. While MongoDB puts a lot of emphasis on its ecosystem and flexible data structures, ScyllaDB aims for predictable high throughput and low latency. Everything is tailored to performance and efficient resource usage.
The first key difference is the data model and the resulting query interface. The document data model of MongoDB favors schemaless data and comes with its own powerful query language. ScyllaDB applies a schema for its wide-column data model as a query interface, ScyllaDB extends the Cassandra Query Language (CQL) that provides less query features but ensures more predictable performance and infinite scalability.
The implementation of the storage and distributed architecture represents the second major difference between both databases. Regarding the storage architecture, MongoDB applies a threaded B+-Tree storage approach, while ScyllaDB follows an LSM tree approach with a shard-per-core approach.
Regarding the distributed architecture, MongoDB comes with two cluster modes: a replica set cluster targets high-availability, while a sharded cluster targets horizontal scalability and high-availability. A sharded cluster requires additional node types, router and config nodes, resulting in an increased operational complexity. ScyllaDB implements a homogeneous multi-primary architecture where nodes are equal, resulting in a simplified operability and scalability.
In summary, both database systems have their advantages: MongoDB for applications with flexible data structures that require complex queries and ScyllaDB for applications of large-scale data sets that require predictable low latency and high throughput.
Comparison Findings: MongoDB vs ScyllaDB
Comparing two distributed database systems, in a fair and in-depth manner, requires a thorough understanding of distributed systems and database domains as well as practical experience with the target database systems.
Based on our systems research background and the daily performance measurements of various distributed database systems at benchANT, we compared MongoDB and ScyllaDB from a scientifically objective and technical angle. Here are the findings of our extensive analysis:
1. Data Model: MongoDB applies a JSON-based document data model, ScyllaDB a wide-column data model. Both data models support a broad range of use cases, while their applicability for large volume and data-intensive applications requires an in-depth evaluation. Read more.
2. Query Language: MongoDB has developed its own "mongosh" query syntax, which is based on JavaScript concepts with a high functionality. ScyllaDB has extended the Cassandra Query Language (CQL), which has the same syntax including extensions, which need to be considered in the data modeling phase. Moreover, ScyllaDB also provides Alternator, a fully compatible AWS DynamoDB API Read more.
3. Architecture: The technical architectures of both databases are analyzed regarding their storage architecture and distribution concepts to enable high availability and horizontal scalability. The commonalities (such as using write-ahead-logging) as well as the differences of their internal storage architecture (B+-Tree of MongoDB and the LSM trees of ScyllaDB) are discussed. Regarding the distributed architecture, the replica set and sharded cluster modes for MongoDB are analyzed and compared to the multi-primary architecture of ScyllaDB. These aspects are further analyzed regarding their operational complexity and expected performance and scalability impact. Read more.
4. Use Cases & Customers: Both MongoDB and ScyllaDB position themselves as “general purpose” databases for manyfold use cases and the ability to handle large amounts of data, provide high throughput, low latencies, high availability and high consistency for many of the world's largest companies. In summary, ScyllaDB is favored for use cases with considerable high throughput, large data sets and low latency requirements, while MongoDB is favored for flexible data structures. Read more.
5. Products: Both MongoDB and ScyllaDB have a similar product portfolio consisting of a community / open-source version respectively, a commercial enterprise featured version, and a cloud-ready fully managed Database-as-a-Service product. Here, MongoDB Atlas also supports a serverless mode, while ScyllaDB Cloud recently released this feature in a beta version.
6. Data Consistency: Both MongoDB and ScyllaDB are using write-ahead-logging and have comparable, but not identical consistency options. A detailed, informative comparison table is provided. Read more.
7. Benchmarking: There are multiple open source benchmark suites available that enable a comparative benchmarking study. Read more. A comprehensive performance study comparing the DBaaS offers of MongoDB and ScyllaDB are published on a separate study, which can be found here.
8. Operations: Both database technologies provide supportive tooling to automate operational tasks on self-managed resources on premise or in the cloud, and on Kubernetes. In addition, integration into analytical frameworks is provided. Read more
9. Open source: Both databases have strong open source roots. MongoDB originally used AGPL licenses and enjoyed massive growth due to it. Several years ago they switched to a new SSPL license, which isn’t considered OSS officially, but free usage and source availability are provided. ScyllaDB was founded by the KVM hypervisor inventors with strong roots in open source. ScyllaDB has always used the AGPL license.
An Overview about MongoDB & ScyllaDB
MongoDB is one of the most popular NoSQL databases on the market, with a large and active community of users and developers. According to DB-Engines, MongoDB is currently ranked as the most popular NoSQL database and the 5th most popular database overall.
Compared to MongoDB, ScyllaDB is a relatively new NoSQL database. It has gained popularity recently for its close-to-the metal design and promising performance results. It was built specifically for applications that require high throughput and predictable low latency.
Both databases, MongoDB and ScyllaDB, address similar use cases that cover eCommerce, social media, mobile and gaming applications, financial applications, real-time analytics and Internet of Things.
Some well-known customer stories of MongoDB are eBay for their data analytics pipeline, Cisco for their web-scale analytics platform and Sega for their mobile games like Sonic Forces.
Well known ScyllaDB customer stories are Strava for powering its real-time exercise tracking for 100M+ athletes, Expedia for its geography-based travel recommendations and Palo Alto Networks which operates more than 1000 ScyllaDB clusters(!) for network security detection.
Numberly uses both database systems for different use cases, MongoDB for web backends and real-time queries over unstructured data and ScyllaDB for latency sensitive data and mixed batch and real-time workloads.
TRACTIAN evaluated MongoDB, ScyllaDB and PostgreSQL as candidates for their real-time machine learning environment and analytical dashboards to support their growing workloads. As a result, they migrated from MongoDB to ScyllaDB because they can achieve a higher throughput and more than 10 times decreased P90 latency.
Discord moved from MongoDB to Cassandra to handle the storage of billions of messages. But with the continuously growing workloads, they were still facing performance issues that ScyllaDB has been able to solve. After the migration, they replaced 177 Cassandra nodes with 72 ScyllaDB nodes and at the same time significantly improved P99 read and write latencies.
Data Model Comparison of MongoDB and ScyllaDB
In the last two chapters, we saw many similarities in the use cases and products between the two NoSQL databases, MongoDB and ScyllaDB, both designed for large distributed deployments and high scalability.
Selecting the optimal database technology for your data intensive use case also depends on the supported data model. Depending on the application data, the selection of a database with a suitable data model such as relational, document-oriented, wide-column, time-series, etc. will affect the effort on the client side to transform the data into the target database data model. Moreover, the data model can also have an impact on performance, as outlined in the upcoming section. For instance, the wide-column data model usually enables highly efficient data compression and makes retrieving only a portion of an entity very efficient.
For this reason, we briefly look at the differences between the MongoDB document data model and the ScyllaDB wide-column data model and show how the same example data could look on the two data models.
A basic difference between these data models is that MongoDB’s document data model is in general schemaless, while ScyllaDB’s wide column data models apply a schema. A schemaless database comes with the advantages of no upfront data model planning and without complex schema updates. However, it does not impose any rules regarding the data types and structure, which might result in messy data and additional overhead on the application layer. A database enforcing a schema comes with strongly enforced data types and less ambiguity when connecting and moving information between systems. In addition, it makes it easier to understand and debug the application code for the developers. But it requires upfront planning, and schema changes need to be considered wisely.
MongoDB Data Model - BSON Documents
MongoDB uses a document-based data model, where data is stored in a BSON (binary JSON) format. BSON is a binary representation of JSON data, which allows for efficient storage and retrieval of data. In MongoDB, a collection is the equivalent of a table in a relational database, and documents are the equivalent of rows. Each document in a collection can have a different set of fields, and fields can contain nested documents and arrays.
Schemaless document data models provide the flexibility to store any kind of data, but this data is usually not normalized and JOINs are not a common operation. But JOIN-like operations are supported, for instance with the "$lookup" command of MongoDB that joins two collections.
Indexes are defined per collection and are also possible over complex data structures. Collections are a collection of documents with the same structure. It is possible to create more than one collection per database.
ScyllaDB Data Model - Wide-Column Stores
ScyllaDB, like Apache Cassandra and DynamoDB, relies on a wide-column store, where data is stored in tables that consist of rows and columns. ScyllaDB’s data model can also be considered as key-key-value to reflect the partitioning and clustering keys. The partition key can be composed of one or multiple columns and is part of the primary key. It is used to create data partitions that are distributed across the available nodes. The cluster keys reflect additional columns that are added to the primary key. They allow you to sort each row physically inside the partition. An example for a cluster key is using the timestamp as time.
The following examples show the basic differences in the MongoDB and ScyllaDB data models for a "movies" data model with its respective attributes.
Query Language Analysis of MongoDB and ScyllaDB
When it comes to the operational usage, both developers and data experts need a powerful query language to perform database operations. While SQL is the most popular query language in RDBMS, MongoDB and ScyllaDB have their own query language with limitations and differences compared to the SQL standard.
Mongosh - the MongoDB query language
MongoDB has developed its own powerful query language mongosh
for the JSON data model with a syntax similar to JavaScript. The capabilities of this language are:
- Single or bulk operations for reading, writing and updating documents.
- Support for filtering, projection, sorting, and aggregation operations.
- Support for querying nested JSON documents, arrays.
- Support for geospatial and text-based queries.
- Support for advanced query options such as explain plans and cursor-based pagination.
- Support for joins using the $lookup command.
As powerful as this language is, it is not comparable by its syntax to SQL. For learning it, there are many helpful tutorials and tools like "MongoDB Compass" to translate SQL to mongosh, and backwards or deal with data operations.
Also, the MongoDB Docs provide compatible drivers for a long list of programming languages (C#, Go, Java (sync/async), Engine, Node.js, Perl, PHP, Python, Ruby, Scala).
Cassandra Language Query - Extended by ScyllaDB
ScyllaDB's query language, also known as the CQL (Cassandra Query Language), is based on the SQL data model and uses a syntax similar to SQL. SELECT, INSERT, UPDATE, DELETE commands with WHERE conditions can be used 1:1 from SQL, as well as the CREATE and ALTER commands for table operations.
Some of the key features of the CQL include:
- Single or bulk operations for reading, writing and updating rows.
- Support for filtering, projection, sorting, and aggregation operations.
- Support for advanced query options such as Explain plans and paging.
However, one must be careful here regarding the dimensions, since a relational database corresponds to a keyspace, and a table to a column family. The CQL does not support JOIN operations but recommends data denormalization instead. This makes it important that data should be written to column families, as it will be read afterward. This means that in comparison to normalized SQL tables, you store data denormalized in ScyllaDB and even multiple times if you need it for other read operations in another data structure. Sometimes it is necessary to use an (existing) data analytics or processing engine to get the same query capabilities as SQL.
ScyllaDB has extended the CQL language capabilities with several further commands. They were developed to give some extra enterprise usability for handling caching, large data sets, inconsistencies and improving query performance.
ScyllaDB provides drivers for CQL for Java, Go, Python, Rust and C/C++, but is also compatible with other Apache Cassandra CQL drivers.
Besides the CQL API, ScyllaDB also supports the AWS DynamoDB API via the ScyllaDB Alternator. It enables the deployment of an open source and fully DynamoDB compatible database with JSON-encoded HTTP requests. Alternator is Open Source and enables the operation in a multi-cloud context to escape the AWS vendor lock-in. Moreover, ScyllaDB with Alternator offers lower cost and higher performance compared to DynamoDB. With the Alternator API, it is possible to run applications written for DynamoDB unmodified on ScyllaDB, too. This makes a migration from DynamoDB to ScyllaDB easy and fast.
Operation | SQL | CQL (ScyllaDB) | mongosh |
---|---|---|---|
SELECT | Yes | Yes | Yes |
INSERT | Yes | Yes | Yes |
UPDATE | Yes | Yes | Yes |
DELETE | Yes | Yes | Yes |
Primary Keys | Yes | Yes | Yes |
Secondary Keys | Yes | Yes | Yes |
Foreign Keys | Yes | No, use denormalized data | No, use denormalized data |
JOIN | Yes | No, use denormalized data or 3rd party tools | Yes, use $lookup, but it can be performance costly |
WHERE | Yes | Yes | Yes |
ORDER BY | Yes | Yes, but only to clustering columns | Yes |
Aggregation Functions | Yes | Yes, but only at partition key level, or clustering column level | Yes |
A Performance Viewpoint on the MongoDB and ScyllaDB Data Models
- Data model and query functionalities have a considerable impact on performance.
- MongoDB offers a more flexible and feature rich query interface that is associated with the risk of unpredictable performance. Client applications can execute complex queries that trigger costly but non-obvious internal operations such as table scans.
- ScyllaDB offers a less extensive query interface to align with its goal of achieving predictable performance. This means that only a selected set of range and aggregation query types is supported, and additional capabilities need to be implemented on the client side. Therefore, ScyllaDB is expected to deliver stable performance numbers, while MongoDB favors flexible queries over easily predictable performance.
- ScyllaDB offers prepared statements which utilize the schema and allow the database to prepare for serialization better. Schema provides structure when you plan user-defined data types, materialized views and indexing.
MongoDB and ScyllaDB, Distributed Architectures With Different Approaches
Both database technologies promise a high-available, performant and scalable architecture. But the way they achieve these objectives is much more different than you might think at first glance. For instance, an experience report demonstrates how ScyllaDB can easily be operated on AWS EC2 spot instances thanks to its distributed architecture, while MongoDB’s distributed architecture would make this a very challenging task. To highlight these differences, we provide an in-depth discussion of the internal storage architecture and the distributed architectures enabling high-availability and horizontal scalability.
A Performance Viewpoint on the Storage Architecture of MongoDB and ScyllaDB
A basic commonality is that both databases are implemented in C++ and recommend the use of the XFS filesystem. Moreover, MongoDB and ScyllaDB are building upon the write-ahead-logging concept, Commit Log in ScyllaDB terminology and Oplog in MongoDB terminology. With write-ahead-logging, all operations are written to a log table before the operation is executed. The write-ahead-log serves as a source to replicate the data to other nodes, and it is used to restore data in case of failures because it is possible to replay
the operations to restore the data.
MongoDB uses as default storage engine a B+-Tree index (Wired Tiger) for data storage and retrieval. B+-Tree indexes are balanced tree data structures that store data in a sorted order, making it easy to perform range-based queries. MongoDB supports multiple indexes on a collection, including compound indexes, text indexes, and geospatial indexes. Indexing of array elements and nested fields, allowing for efficient queries on complex data structures, are also possible. In addition, the enterprise version of MongoDB supports an in-memory storage engine for low latency workloads.
ScyllaDB divides data into shards by assigning a fragment of the total data in a node to a specific CPU, along with its associated memory (RAM) and persistent storage (such as NVMe SSD). The internal storage engine of ScyllaDB follows the write-ahead-logging concept by applying a disk persistent commit log together with memory based memtables that are flushed to disk over time. ScyllaDB supports primary, secondary and composite indexes, both local per node and global per cluster. The primary index consists of a hashing ring where the hashed key and the corresponding partition are stored. And within the partition, ScyllaDB finds the row in a sorted data structure (SSTable), which is a variant of the LSM-Tree. The secondary index is maintained in an index table. When a secondary index is queried, ScyllaDB first retrieves the partition key, which is associated with the secondary key, and afterward the data value for the secondary key on the right partition.
These different storage architectures result in a different usage of the available hardware to handle the workload. MongoDB does not pin internal threads to available CPU cores but applies an unbound approach to distributed threads to cores. With modern NUMA-based CPU architectures, this can cause a performance degradation, especially for large servers because threads can dynamically be assigned to cores on different sockets with different memory nodes.
In contrast, ScyllaDB follows a shard per core approach that allows it to pin the responsible threads to specific cores and avoids switching between different cores and memory spaces. In consequence, the shard key needs to be selected carefully to ensure an equal data distribution across the shards and to prevent hot shards. Moreover, ScyllaDB comes with an I/O scheduler that provides built-in priority classes for latency sensitive and insensitive queries, as well as the coordinated I/O scheduling across the shards on one node to maximize disk performance. Finally, ScyllaDB’s install scripts come with a performance auto-tuning step by applying the optimal database configuration based on the available resources. In consequence, a clear performance advantage of ScyllaDB can be expected.
ScyllaDB allows the user to control whether data should reside in the DB cache or bypass it for rarely accessed partitions. ScyllaDB allows the client to reach the node and CPU core (shard) that owns the data. This provides lower latency, consistent performance and perfect load balancing. ScyllaDB also provides ‘workload prioritization’ which provides the user different SLAs for different workloads to guarantee lower latency for certain, crucial workloads.
The MongoDB Distributed Architecture - Two Operation Modes For High-Availability and Scalability
The MongoDB architecture offers two cluster modes that are described in the following sections: a replica set cluster targets high-availability, while a sharded cluster targets horizontal scalability and high-availability.
Replica Set Cluster: High-Availability with Limited Scalability
The MongoDB architecture enables high-availability by the concept of replica sets. MongoDB replica sets follow the concept of Primary-Secondary nodes, where only the primary handles the WRITE operations. The secondaries hold a copy of the data and can be enabled to handle READ operations only. A common replica set deployment consists of two secondaries, but additional secondaries can be added to increase availability or to scale read-heavy workloads. MongoDB supports up to 50 secondaries within one replica set. Secondaries will be elected as primary in case of a failure at the former primary.
Regarding geo-distribution, MongoDB supports geo-distributed deployments for replica sets to ensure high-availability in case of data center failures. In this context, secondary instances can be distributed across multiple data centers, as shown in the following figure. In addition, secondaries with limited resources or network constraints can be configured with a priority to control their electability as primary in case of a failure.
Sharded Cluster: Horizontal Scalability and High-Availability with Operational Complexity
MongoDB supports horizontal scaling by sharding data across multiple primary instances to cope with write intensive workloads and growing data sizes. In a sharded cluster, each replica set consisting of one primary and multiple secondaries represents a shard. Since MongoDB 4.4 secondaries can also be used to handle read requests by using the hedged read option.
To enable sharding, additional MongoDB node types are required: query routers (mongos) and config servers. A mongos instance acts as a query router, providing an interface between client applications and the sharded cluster. In consequence, clients never communicate directly with the shards, but always via the query router. Query routers are stateless and lightweight components that can be operated on dedicated resources or together with the client applications. It is recommended to deploy multiple query routers to ensure the accessibility of the cluster because the query routers are the direct interface for the client drivers. There is no limit to the number of query routers, but as they communicate frequently with the config servers, it should be noted that too many query routers can overload the config servers. Config servers store the metadata of a sharded cluster, including state and organization for all data and components. The metadata includes the list of chunks on every shard and the ranges that define the chunks. Config servers need to be deployed as a replica set itself to ensure high availability.
Data sharding in MongoDB is done at the collection level, and a collection can be sharded based on a shard key. MongoDB uses a shard key to determine which documents belong on which shard. Common shard key choices include the _id field and a field with a high cardinality, such as a timestamp or user ID. MongoDB supports three sharding strategies: range based, hash based and zone based.
Ranged sharding partitions documents across shards according to the shard key value. This keeps documents with shard key values close to one another and works well for range-based queries, e.g. on time series data. Hashed sharding guarantees a uniform distribution of writes across shards, which favors write workloads. Zoned sharding allows developers to define custom sharding rules, for instance to ensure that the most relevant data reside on shards that are geographically closest to the application servers
Also, sharded clusters can be deployed in a geo-distributed setup to overcome data center failures, as depicted in the following figure.
The ScyllaDB Architecture - Multi-Primary for High-Availability and Horizontal Scalability
Unlike MongoDB, ScyllaDB does not follow the classical RDBMS architectures with one primary node and multiple secondary nodes, but uses a decentralized structure, where all data is systematically distributed and replicated across multiple nodes forming a cluster. This architecture is commonly referred to as multi-primary architecture.
A cluster is a collection of interconnected nodes organized into a virtual ring architecture, across which data is distributed. The ring is divided into vNodes, which represent a range of tokens assigned to a physical node, and are replicated across physical nodes according to the replication factor set for the keyspace. All nodes are considered equal, in a multi-primary sense. Without a defined leader, the cluster has no single point of failure. Nodes can be individual on-premises servers, or virtual servers (public cloud instances) composed of a subset of hardware on a larger physical server. On each node, data is further partitioned into shards. Shards operate as mostly independently operating units, known as a “shared nothing” design. This greatly reduces contention and the need for expensive processing locks.
All nodes communicate with each other via the gossip protocol. This protocol decides in which partition which data is written and searches for the data records in the right partition using the indexes.
When it comes to scaling, ScyllaDB’s architecture is made for easy horizontal sharding across multiple servers and regions. Sharding in ScyllaDB is done at the table level, and a table can be sharded based on a partition key. The partition key can be a single column or a composite of multiple columns. ScyllaDB also supports range-based sharding, where rows are distributed across shards based on the partition key value range, as well as hash-based sharding for equally distributing data and to avoid hot spots.
Additionally, ScyllaDB allows for data to be replicated across multiple data centers for higher availability and lower latencies. In this multi-data-center or multi-region setup, the data between data centers is asynchronously replicated.
On the client side, applications may or may not be aware of the multi datacenter deployment, and it is up to the application developer to decide on the awareness to fallback data-center(s). This can be configured via the read and write consistency options that define if queries are executed against a single data center or across all data centers. Find more details in Data Consistency Options of MongoDB and ScyllaDB. Load balancing in a multi datacenter setup depends on the available settings within the specific programming language driver.
A Comparative Scalability Viewpoint on the Distributed Architectures of MongoDB and ScyllaDB
When it comes to scalability, the significantly different distribution approaches of both ScyllaDB and MongoDB need to be considered, especially for self-managed clusters running on-premises or on IaaS. MongoDB’s architecture easily allows scaling read-heavy workloads by increasing the number of secondaries in a replica set.
Yet, for scaling workloads with a notable write proportion, the replica sets need to be transformed into a sharded replica set and this comes with several challenges. First, two additional MongoDB services are required: n query routers (mongos) and a replica set of config servers to ensure high availability. Consequently, considerably more resources are required to enable sharding in the first place. Moreover, the operational complexity clearly increases. For instance, a sharded cluster with three shards requires a replica set of three mongos instances, a replica set of three config servers and three shards – each shard consisting of one primary and at least two secondaries.
The second challenge is the repartitioning of data in the sharded cluster. Here, MongoDB applies a constantly running background task that autonomously triggers the redistribution of data across the shards. The repartitioning does not take place as soon as a new shard is added to the cluster, but when certain internal thresholds are reached. Consequently, increasing the number of shards will immediately scale the cluster, but may have a delayed scaling effect. Surprisingly, until MongoDB Version 5.0, MongoDB engineers themselves recommended to not shard, but rather to scale vertically with bigger machines if possible.
Scaling a ScyllaDB cluster is comparably easy and transparent for the user thanks to ScyllaDB’s multi-primary architecture. Here, each node is equal, and no additional services are needed to scale the cluster to hundreds of nodes. Moreover, data repartitioning is triggered as soon as a new node is added to the cluster.
In this context, ScyllaDB offers clear advantages over MongoDB. First, thanks to the consistent hashing approach, data does not need to be repartitioned across the full cluster, only across a subset of nodes. Second, the partitioning starts with adding the new node, which eases the timing of the scaling action. This is important, since repartitioning will put some additional load on the cluster and should be avoided at peak workload phases.
The main scalability differences are summarized in the following table:
Replication | Horizontal Read Scalability | Data Partitioning Process | Horizontal Write Scalability | # of required in- stances for a cluster with 3 data serving nodes | |
---|---|---|---|---|---|
ScyllaDB | ✅ | ✅ | implicitly with the addition/removal of new nodes | ✅ | 3 nodes |
MongoDB Replica Set | ✅ | ✅ | ❌ | ❌ | 3 nodes (1 replica set with 1 primary + 2 secondaries) |
MongoDB Sharded Cluster | ✅ | ✅ | implicitly triggered by a background process | 14 nodes (3 sharded replica sets, 9 nodes total + 1 config server replica, 3 nodes total + 2 query routers) |
Conclusion and Outlook
When you compare two distributed NoSQL databases, you always discover some parallels, but also numerous considerable differences. This is also the case here with MongoDB vs ScyllaDB. Both databases address similar use cases and have a similar product and community strategy.
But when it comes to the technical side, you can see the different approaches and focus. Both databases are built for enabling high availability through a distributed architecture. But when it comes to the target workloads, MongoDB enables easily getting started with single node or replica set deployments that fit well for small and medium workloads, while addressing large workloads and data sets becomes a challenge due to the technical architecture. ScyllaDB clearly addresses performance critical workloads that demand for easy and high scalability, high throughput, low and stable latency, and everything in a multi datacenter deployment. This is also shown by data intensive use cases of companies such Discord, Numberly or TRACTIAN that migrated from MongoDB to ScyllaDB to successfully solve performance problems.
And to provide further insights into their respective performance capabilities, we provide a transparent and reproducible performance comparison in a follow-up article that investigates the performance, scalability, and costs for MongoDB Atlas and ScyllaDB Cloud.
Appendix - Data Consistency Options of MongoDB and ScyllaDB
Data consistency is an important aspect of distributed systems. It refers to the guarantees to which replicated data in a distributed system is identical across all nodes. Consistency is particularly important in systems that allow multiple users to access and update data concurrently, as it ensures that all users see the same data at the same time.
Database systems usually come with tunable consistency settings to enable stricter or lower consistency, depending on different use cases. Stricter consistency levels need more internal database operations to verify the data consistency. These internal operations need resources and have a direct influence on the performance and latencies of the database systems. The consistency settings can also have a direct impact on the availability and error tolerance of the database system. In consequence, in-depth knowledge of the consistency settings, the default as well as the tunable levels, is required for the database administrator as well as for the application developers.
When it comes to comparing consistency levels of different NoSQL database technologies, it can get tricky due to different architectures and semantics, and a 1:1 comparison is hardly possible.
The following two tables provide an overview of comparable client consistency configurations for ScyllaDB and MongoDB, separated into write and read consistency settings, with a description of their implications with respect to consistency and availability. Moreover, the stricter consistency settings also come with an increased performance overhead, the extent of which must be analyzed individually for each database system. More details, including also multi-datacenter consistency settings, can be found in the respective documentation of ScyllaDB and MongoDB. Here it is worth emphasizing that ScyllaDB offers dedicated consistency configurations for geo-distributed clusters to allow the users to tune for improved performance or stronger consistency, e.g. LOCAL_QUORUM vs. QUORUM in a geo-distributed deployment. In contrary, MongoDB does not offer datacenter aware consistency settings.
ScyllaDB Read Consistency Levels | MongoDB Read Concern Levels | Description |
---|---|---|
ANY / ONE | local | weak consistency, high availability |
LOCAL_ONE | n/a | weak consistency, high availability (within one datacenter) |
TWO | n/a | medium consistency, medium availability (in large clusters) |
QUORUM | majority | medium consistency, medium availability in large clusters, low availability in small clusters |
LOCAL_QUORUM | n/a | medium consistency, medium availability in large clusters, low availability in small clusters (within one datacenter) |
THREE | n/a | strong consistency, medium availability in large clusters, low availability in small clusters |
ScyllaDB Write Consistency Levels | MongoDB Write Concern Levels | Description |
---|---|---|
n/a | w:0 / journal: false | weakest consistency, highest availability |
ANY / ONE | w:1 / journal: true | weak consistency, high availability (in large clusters) |
LOCAL_ONE | n/a | weak consistency, high availability (in large clusters) within one datacenter |
TWO | w:2 / journal: true | medium consistency, medium availability (in large clusters) |
QUORUM | majority /journal: true | medium consistency, medium availability in large clusters, low availability in small clusters |
LOCAL_QUORUM | n/a | medium consistency, medium availability in large clusters, low availability in small clusters (within one datacenter) |
THREE | w:3 | strong consistency, medium availability (in large clusters) |
ALL | w: # of nodes in the replica set / journal: true | strongest consistency, low availability |
Appendix - Benchmarking MongoDB and ScyllaDB
Based on the previous conceptual analysis, it can be expected that ScyllaDB provides better performance and scalability compared to MongoDB. But of course, these assumptions need to be validated by realistic performance benchmarks. Further, both database vendors highlight the need for benchmarking, e.g. “Benchmarking will help you fail fast and recover fast before it’s too late. (ScyllaDB)” or “… measure everything, assume nothing … (MongoDB)”.
Since both database technologies have a strong community and open-source focus, there are multiple benchmark suites that provide support for ScyllaDB and MongoDB benchmarking. In the following, we present popular open-source benchmark suites that are actively maintained. To evaluate ScyllaDB and MongoDB, these benchmarks can be executed manually or in an automated manner by using the benchANT platform.
The following table lists a set of available benchmark suites along with their supported workload types, since a benchmark suite can support different workload types. A workload is commonly defined as the data structure and the query types that are executed during the benchmark run. Depending on the workload, the queries can range from simple create-read-update-delete (CRUD) operations to more complex analytical queries that combine scalar and aggregate functions. Generally, these benchmark suites can be applied for all previously described use cases.
Benchmark Suite | Workloads | Supported Database |
---|---|---|
Yahoo Cloud Serving Benchmark (YCSB) | CRUD | ScyllaDB / MongoDB |
Time Series Benchmark Suite (TSBS) | Time Series | ScyllaDB (via Cassandra interface) / MongoDB |
NoSQLBench | CRUD / Time Series / Tabular | ScyllaDB (via Cassandra interface) / MongoDB |
Genny | CRUD / Time Series / Batch Inserts / Micro Operations | MongoDB |
mongo-perf | Micro Operations | MongoDB |
py-tpcc | eCommerce based on TPC-C specification | MongoDB |
TLP-Stress | CRUD / Time Series / Views / Collections / Counters | ScyllaDB (via Cassandra interface) |
Cassandra Stress | CRUD / Time Series / Views / Collections / Counters | ScyllaDB (via Cassandra interface) including shard awareness and lightweight transactions |
ClickBench | OLAP | MongoDB |
All of these benchmarks are designed to be extensible regarding the supported database systems, but also regarding the supported workloads. The extension effort heavily depends on the specific benchmark suite.
No independent and transparent performance comparison has been published yet for MongoDB and ScyllaDB, especially regarding their DBaaS offers, except for the following two exceptions.
An actual use case driven benchmark between MongoDB and ScyllaDB is described by engineers of TRACTIAN who migrated from MongoDB to ScyllaDB because they were able to decrease the P90 latency by more than 10 times and increase the achieved throughput.
The benchANT Database Ranking applies a scientifically proven benchmark process to provide an independent and open ranking that contains over 100 performance measurements for some of the most popular databases on different cloud providers. These measurements are based on non-production workloads in a few scaling sizes for similar, fair settings.
Over the entire measurements on AWS EC2, ScyllaDB in scaling size xLarge achieves the highest overall throughput with 204K ops/s and impressive write latencies. For MongoDB, the largest measured scaling size is large, where it archives an average throughput of 21k ops/s, and it ranges on position 14 on AWS EC2. Both databases show good read latencies at their respective throughput results.
Appendix - First-Hand Operational Experiences of MongoDB and ScyllaDB
When it comes to operating a (distributed) database, additional tool support besides a CLI is desirable. Thankfully, both ScyllaDB and MongoDB provide a rich set of additional tooling to ease the deployment and operation of database clusters in the cloud. In the following, we summarize the official tools; note that numerous community provided operational tools are also available.
ScyllaDB Operational Tools
In order to deploy a ScyllaDB cluster on premises or in the cloud, ScyllaDB offers supportive installation scripts as well as integrations into common DevOps tools such as Ansible or Terraform. An aspect worth mentioning is that the installation scripts automatically tune the ScyllaDB configuration for the available storage and network resources.
Since Kubernetes has become a common target to deploy databases to, operators to facilitate the operation of databases on Kubernetes are desirable. ScyllaDB provides an open source operator that supports Kubernetes, Google Kubernetes Engine and AWS Elastic Kubernetes Service (experimental).
In addition, ScyllaDB also provides the ScyllaDB Manager to support repetitive tasks such as deployments, backups or data maintenance tasks. ScyllaDB Manager is free up to five nodes and supports unlimited nodes in the enterprise version.
The ScyllaDB Monitoring Stack is built upon open-source technologies and enables the comprehensive collection and visualization of metrics, events and logs. It can also be easily integrated with the Cluster Manager.
In addition, ScyllaDB offers multiple integrations into analytical frameworks such as Apache Spark or Presto.
MongoDB Operational Tools
Besides the manual installation via a package manager, the deployment of MongoDB is supported by Ansible modules. Provisioning of MongoDB Atlas instances is supported by a Terraform provider.
Kubernetes' deployments are supported by an open-source community operator and an enterprise operator. The enterprise operator provides a richer feature set, for instance backup support.
Similar to the Cluster Manager of ScyllaDB, MongoDB offers the enterprise tool Ops Manager to support deployment, operation, monitoring and data maintenance tasks for self-managed MongoDB deployments on premise or in the cloud.
In addition, MongoDB provides additional services to ease the integration with analytical tasks, for instance a BI Connector for running SQL queries on top of MongoDB, an Apache Spark Connector, MongoDB Compass to inspect and query data, MongoDB Charts to visualize data.
About Dr. Daniel Seybold
Dr. Daniel Seybold is the co-founder and CTO of benchANT, a company that specializes in benchmarking and performance testing of databases. Daniel started his career as a researcher with a focus on distributed systems and databases. He has extensive experience in the field of database performance testing and has been working with NoSQL databases such as MongoDB, Cassandra and ScyllaDB for more than a decade. During his academic career he published over 20 papers on cloud and database performance related topics on renowned scientific conferences and completed his PhD with the thesis An automation-based approach for reproducible evaluations of distributed DBMS on elastic infrastructures.
These research results are the technical foundation of benchANT which pursues the goal of supporting organizations in selecting the right database for their use case.
From his point of view, there is no "best" database, but only a better-suited and more efficient database solution for each use case.