Disk-based key-value stores

Mahesh Senniappan
Jan 27, 2019

When it comes to key-value stores that provide low-latency, high-throughput access to data, Redis is the first thing that comes to mind. It is widely used in a multitude of use cases: session stores, application caches and even as a primary key-value database. Redis ticks a lot of these boxes because it is memory based and can provide crazy fast access to data, performing operations in a fraction of the time taken by disk-based data stores. See this page on Redis's website to understand what I mean by crazy fast.

Once you move past session stores and application caches, there are a lot of uses for key-value stores even in big data and web-scale applications. For example, a Spark job can use a key-value store for fast lookups as part of a larger data pipeline or a social network can use it as a front for a relational database to store profiles of millions of users for fast access. In both these examples, it is paramount for the key-value store to provide low-latency, high-throughput access to data.

There is only one caveat: if there is one thing we have learnt from cloud services, it is that memory is not cheap. For example, at the time of writing (Jan 2019), this is how two machines with similar CPUs but different memory capacities are priced (on-demand) in the AWS US East region. As you can see, the instance with more memory costs about 30% more.

+----------+------+--------------+----------------+
| Instance | vCPU | Memory (GiB) | Price/hour ($) |
+----------+------+--------------+----------------+
| m5.large |    2 |            8 |          0.096 |
| r5.large |    2 |           16 |          0.126 |
+----------+------+--------------+----------------+

Let's say we have a dataset with 50 million keys, each with a string value of size 1 kB. At least 47 GiB of memory is required to store just the values (an over-simplified estimate). That would require three r5.large instances at a total cost of $0.378/hour, or roughly $63/week, and a fault-tolerant setup would at least double that. The purpose of this blog post is to find out whether open-source, SSD-based key-value stores can be an alternative to the costlier memory-based solutions. There is no question that memory lookups will always be faster than SSD lookups. But how much of a performance compromise are we talking about?
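
For a quick back-of-the-envelope check of those numbers, here is the arithmetic spelled out (a rough sketch; it ignores key sizes and per-entry overhead, so the real footprint would be higher):

# Rough memory estimate for 50 million string values of 1 kB each.
num_values = 50_000_000
value_size_bytes = 1024
total_gib = num_values * value_size_bytes / (1024 ** 3)
print(f"values alone: {total_gib:.1f} GiB")   # ~47.7 GiB

# Cost of three r5.large instances (16 GiB each) at the on-demand price above.
weekly_cost = 3 * 0.126 * 24 * 7
print(f"weekly cost: ${weekly_cost:.2f}")     # ~$63.50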

SSD based key-value stores

Gone are the days of magnetic tapes and spinning disks. SSDs are the next best thing after main memory for storing data when low-latency access matters. Look at this list of latency numbers every programmer should know to understand how fast SSDs can be (about 150μs for a random read). They are also widely supported by cloud providers and are getting cheaper every day. For comparison, the 50 million values mentioned earlier could be stored on a single node with sufficient SSD storage for roughly a third of the total cost. So, can they be a viable alternative to memory-based key-value stores? Read on.

There has been a lot of research and development in disk-based key-value stores in the past few years, producing projects like LevelDB, RocksDB, LMDB, BadgerDB, etc. In fact, they have become so popular that Redis and Memcached, the two most popular in-memory key-value stores, now offer disk-based solutions: Redis on Flash and Extstore respectively. I am going to take RocksDB as the example here because it is very popular, it is open source, and it is used as the storage backend in a lot of projects.

RocksDB Performance

The RocksDB team publishes performance benchmarks, run on an AWS i3 instance with NVMe SSD storage and optimized for low-latency access. Random writes clock in at around 56 micros/op, which translates (1/(56 * 1e-6)) to roughly 17k ops/sec, and random reads are close behind at around 64 micros/op, or about 15k ops/sec. Note that if regular SSD-backed instances were used instead of NVMe storage, the latency numbers would be a bit higher. So, theoretically, if you run RocksDB and your application in two different availability zones, you can expect latencies of around 1ms to 3ms for key lookups from application code. Is it reasonable to expect these numbers?
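
The conversion from per-operation latency to throughput is just the reciprocal; a quick sanity check of the published numbers:

# Convert the published per-operation latencies into throughput.
write_latency_sec = 56e-6   # random writes: 56 micros/op
read_latency_sec = 64e-6    # random reads: 64 micros/op
print(f"writes: {1 / write_latency_sec:,.0f} ops/sec")  # ~17,857 ops/sec
print(f"reads:  {1 / read_latency_sec:,.0f} ops/sec")   # ~15,625 ops/sec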

Ardb (with RocksDB storage) running on an m5.large instance was used for the benchmarks, writing and reading 1 million keys. The instance has 2 cores and 8 GiB of memory and was configured with General Purpose SSD (gp2) storage. Ardb was configured to use 2 threads to match the CPU cores in the instance. Benchmarks were run from a different instance (r5.large) in the same availability zone.

Benchmark command (1 million SET and GET requests with 1024-byte values, keys drawn at random from a keyspace of 1 million):

redis-benchmark -h <IP_ADDRESS> -t set,get -r 1000000 -n 1000000 -d 1024

Benchmark Results:

(a few lines were truncated for brevity)

====== SET ======
1000000 requests completed in 25.53 seconds
50 parallel clients
1024 bytes payload
keep alive: 1
19.34% <= 1 milliseconds
99.83% <= 5 milliseconds
99.96% <= 10 milliseconds
99.99% <= 26 milliseconds
100.00% <= 203 milliseconds
39171.14 requests per second
====== GET ======
1000000 requests completed in 19.80 seconds
50 parallel clients
1024 bytes payload
keep alive: 1
51.08% <= 1 milliseconds
99.93% <= 2 milliseconds
99.99% <= 7 milliseconds
100.00% <= 203 milliseconds
50507.60 requests per second

For comparison, Redis running in the same instance produced the following results.

====== SET ======
1000000 requests completed in 19.50 seconds
50 parallel clients
1024 bytes payload
keep alive: 1
61.89% <= 1 milliseconds
99.82% <= 2 milliseconds
99.93% <= 3 milliseconds
99.99% <= 8 milliseconds
100.00% <= 203 milliseconds
51284.68 requests per second
====== GET ======
1000000 requests completed in 19.31 seconds
50 parallel clients
1024 bytes payload
keep alive: 1
67.99% <= 1 milliseconds
99.97% <= 2 milliseconds
99.99% <= 3 milliseconds
100.00% <= 203 milliseconds
51783.96 requests per second

Here is how the results look side-by-side.

+-----------------+----------+----------------+
| Datastore       | Redis    | Ardb + RocksDB |
+-----------------+----------+----------------+
| SET (req/sec)   | 51284.68 | 39171.14       |
| GET (req/sec)   | 51783.96 | 50507.60       |
| P99 SET (ms)    | 2        | 5              |
| P99.9 SET (ms)  | 3        | 10             |
| P99.99 SET (ms) | 8        | 26             |
| P99.9 GET (ms)  | 2        | 2              |
| P99.99 GET (ms) | 3        | 7              |
+-----------------+----------+----------------+

The performance numbers shown above can be either encouraging or discouraging depending on the use case and requirements, but it is important to remember the trade-off between performance and cost.

As you can see, Ardb with the RocksDB storage engine delivers millisecond latencies for most operations, with write throughput and P99.99 latency being the exceptions. RocksDB provides a lot of configuration options to tune performance, and Provisioned IOPS SSD volumes can also be used to improve it. There is also the possibility of running multiple instances of Ardb to make effective use of all the cores in the machine. While you can do the same with Redis, it will soon run out of memory, which is not the case with disk space.

Storage libraries and Embedded databases

If you look at the descriptions of RocksDB, LevelDB or BadgerDB, they are called either storage engines or embedded databases. Practically, what this means is that these can't run as a typical database server. They are meant to be used as a library from your application code; there, I said it. So you would bundle these "libraries" as part of an application's code and deploy them in an environment with access to SSD storage. This is a radically different type of application architecture and comes with advantages and disadvantages of its own. But what if you are using Redis and would like to try or switch to one of these storage libraries?
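
To make the "library, not server" point concrete, here is a minimal sketch of what embedding such a store looks like, using the python-rocksdb bindings (the path and keys below are made up for illustration):

import rocksdb  # python-rocksdb bindings for the RocksDB library

# The "database" is just a directory on local SSD that this process opens
# directly; there is no server to connect to.
db = rocksdb.DB("/data/profiles.db", rocksdb.Options(create_if_missing=True))

# Keys and values are plain byte strings.
db.put(b"user:42", b'{"name": "Jane", "plan": "pro"}')
print(db.get(b"user:42"))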

Redis protocol to the rescue

Redis uses a protocol called RESP (REdis Serialization Protocol) for client-server communication. While it is meant for Redis, it is flexible enough to be used in other projects where client-server communication is required, especially in a key-value store. The added advantage is that the existing tool set around Redis can be reused: redis-cli and the language-specific client libraries.

So the solution is to take one of the storage engine libraries mentioned above for storage and add a TCP communication layer that speaks the Redis protocol, which gives you a key-value store. Luckily, there are a few projects that have already done this, and Ardb is one of them. It is as simple as building the project with the storage engine of your choice, and it can then be deployed on a server as a key-value store. If there is existing code that uses Redis, it just needs to be pointed at the new server.
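
For example, application code that already uses a standard Redis client only needs a different host name; a sketch using redis-py (the host and port below are placeholders for wherever Ardb is deployed):

import redis

# Point the existing Redis client at the Ardb server; Ardb speaks the same
# RESP protocol, so SET/GET calls work unchanged.
client = redis.Redis(host="ardb.internal.example.com", port=16379)
client.set("user:42", "roughly 1 kB of profile data ...")
print(client.get("user:42"))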

High availability and Replication

Although it is simple enough to get started with these storage engines using projects like Ardb, that alone doesn't help if your goal is a highly available (HA) system. Replication and sharding are required for high availability, and there is no built-in feature in either the storage engine libraries or the Redis protocol wrappers (Ardb) to set up an HA system.

This is exactly the problem Netflix was trying to solve, and Dynomite was created as a result. To quote the Dynomite documentation, its design goal is:

Dynomite’s design goal is to turn those single-server datastore solutions into peer-to-peer, linearly scalable, clustered systems while still preserving the native client/server protocols of the datastores, e.g., Redis protocol.

The default implementation of Dynomite supports both the Redis and Memcached protocols, and it can also be extended to support other protocols if needed. Note that I said protocols instead of the actual data stores. This means that any data store can be plugged into Dynomite as long as it speaks one of these two protocols. Here is how the architecture looks, straight from the Dynomite documentation.

Image source: https://github.com/Netflix/dynomite/wiki/Architecture

Dynomite sits in front of the data store and replicates the data across multiple nodes in multiple data centers. While Dynomite itself is standalone, it can be used along with other tools in the ecosystem, like the Dyno client and Dynomite-manager, for additional benefits.
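
Because Dynomite preserves the Redis protocol, the application side stays the same: you write to the nearest Dynomite node and it replicates to its peers. A rough sketch (the node addresses and port are made up; use whatever your Dynomite cluster is configured with):

import redis

# Write through the Dynomite node in one data center ...
west = redis.Redis(host="dynomite-us-west.example.com", port=8102)
west.set("user:42", "profile blob")

# ... and, once replicated, read it back through a peer node in another
# data center using the same Redis protocol.
east = redis.Redis(host="dynomite-us-east.example.com", port=8102)
print(east.get("user:42"))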

Conclusion

There will always be an implicit trade-off between performance and cost, and I have to acknowledge that memory-based data stores will always be faster than disk-based solutions. Having said that, it is not always possible to depend on memory-based solutions because of their cost and their ephemeral nature. Disk-based key-value stores are a viable alternative as they can provide acceptable performance for a fraction of the cost, especially if the values stored in the KV store are large and kept for long periods. We also looked at how solutions that each address a specific concern (RocksDB, Ardb and Dynomite) can be combined to produce a highly available storage solution.
