Key-value stores are used by companies such as Facebook and Twitter to improve the performance of web applications with a high read-to-write ratio. They operate as caches for frequently requested content or data that is costly to obtain, such as the result of a computationally expensive database query. We study two design problems associated with key-value stores.
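As a purely illustrative picture of this usage pattern, the sketch below shows cache-aside access, in which the application consults the key-value store before falling back to an expensive query. The function names and the in-process dictionary standing in for the store are placeholders, not any particular system's API.

```python
# Minimal cache-aside sketch: the key-value store absorbs repeated reads of
# content that is expensive to recompute. Everything here is a placeholder.

cache = {}  # stands in for the key-value store (normally a separate server)

def run_expensive_query(key):
    # Placeholder for a computationally expensive database query.
    return f"result-for-{key}"

def get(key):
    if key in cache:                      # hit: cheap lookup
        return cache[key]
    value = run_expensive_query(key)      # miss: pay the full fetch time
    cache[key] = value                    # populate the cache for later reads
    return value

print(get("user:42"))  # first access runs the query
print(get("user:42"))  # second access is served from the cache
```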
The first problem we consider is the design of eviction policies that are particularly suited to the constraints of a key-value store. Current implementations use Least Recently Used (LRU), a popular and simple eviction policy. However, LRU does not take into consideration the time to obtain an item from its source (referred to as fetch time), which can vary widely. If the fetch times for items stored in a cache vary significantly, a more sophisticated eviction algorithm such as GreedyDual-Size (GDS) provides better performance in terms of total fetch time. But GDS can be costly to implement. We propose an eviction policy called Cost Adaptive Multi-queue eviction Policy (CAMP) that closely approximates GDS's caching performance while being as fast as LRU to implement. We show that CAMP's competitive ratio is within a factor of (1 + epsilon) of GDS's competitive ratio, where epsilon is a parameter that depends on the number of bits of precision used to compute the eviction priority of cached items.
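To make the tradeoff concrete, the sketch below implements the classic GreedyDual-Size priority rule (priority = inflation value + cost/size) and optionally rounds the cost/size ratio down to a few significant bits. The rounding step is meant only to illustrate the kind of limited-precision priorities that let an algorithm approximate GDS cheaply; the class names, parameters, and the heap-based bookkeeping are illustrative choices, not the data structures CAMP itself uses.

```python
import heapq
import math

def round_ratio(value, bits):
    """Keep only the top `bits` significant bits of a cost/size ratio.

    This mimics limited-precision priorities; CAMP's exact rounding rule
    may differ in its details."""
    if value <= 0.0:
        return 0.0
    mantissa, exponent = math.frexp(value)   # value == mantissa * 2**exponent
    mantissa = math.floor(mantissa * (1 << bits)) / (1 << bits)
    return math.ldexp(mantissa, exponent)

class GreedyDualSize:
    """Illustrative GreedyDual-Size cache with optional priority rounding."""

    def __init__(self, capacity, precision_bits=None):
        self.capacity = capacity
        self.precision_bits = precision_bits
        self.used = 0
        self.inflation = 0.0                 # the running "L" value in GDS
        self.entries = {}                    # key -> (priority, size)
        self.heap = []                       # (priority, key); stale entries skipped

    def _priority(self, cost, size):
        ratio = cost / size
        if self.precision_bits is not None:
            ratio = round_ratio(ratio, self.precision_bits)
        return self.inflation + ratio

    def access(self, key, size, cost):
        if key in self.entries:
            prio = self._priority(cost, size)        # hit: refresh the priority
            self.entries[key] = (prio, self.entries[key][1])
            heapq.heappush(self.heap, (prio, key))
            return "hit"
        while self.used + size > self.capacity and self.entries:
            prio, victim = heapq.heappop(self.heap)
            if victim in self.entries and self.entries[victim][0] == prio:
                self.inflation = prio                # GDS raises L to the evicted priority
                self.used -= self.entries[victim][1]
                del self.entries[victim]
        prio = self._priority(cost, size)
        self.entries[key] = (prio, size)
        self.used += size
        heapq.heappush(self.heap, (prio, key))
        return "miss"

cache = GreedyDualSize(capacity=100, precision_bits=4)
print(cache.access("a", size=60, cost=5.0))   # miss
print(cache.access("b", size=60, cost=1.0))   # miss; evicts "a" to make room
```

With `precision_bits=None` this behaves like plain GDS; with a small number of bits, priorities collapse onto few distinct values, which is, at a high level, what makes it possible to replace the heap with a handful of FIFO queues and achieve LRU-like speed.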
In addition to eviction decisions, key-value stores also typically manage the placement of data objects in memory. The current state of the art uses a technique called slab allocation, in which items are assigned to one of several LRU queues according to their size. To handle changing workloads, these queues must be dynamically resized, and current schemes handle this resizing in an ad hoc manner. We propose a variant of CAMP that manages its own memory layout and show that, if it is given a modest amount of additional memory to account for fragmentation, it is competitive against an offline optimal algorithm that does not specify layout.
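For context, the sketch below shows the basic slab-style layout described above: each item is routed by size to one of several LRU queues, each with its own slot budget. The size classes and slot budgets are hypothetical, and the hard part, dynamically resizing the per-class budgets as the workload shifts, is exactly what this sketch leaves out.

```python
from collections import OrderedDict

# Hypothetical size classes (bytes per slot); real allocators grow them by a
# configurable factor.
SIZE_CLASSES = [64, 128, 256, 512, 1024]

def size_class(item_size):
    """Return the smallest class whose slot fits the item."""
    for c in SIZE_CLASSES:
        if item_size <= c:
            return c
    raise ValueError("item larger than the biggest slab class")

class SlabbedLRUCache:
    """One LRU queue per size class, each with a fixed slot budget."""

    def __init__(self, slots_per_class):
        self.queues = {c: OrderedDict() for c in SIZE_CLASSES}
        self.slots = dict(slots_per_class)

    def put(self, key, item_size, value):
        c = size_class(item_size)
        q = self.queues[c]
        if key in q:
            q.move_to_end(key)              # refresh recency
        elif len(q) >= self.slots[c]:
            q.popitem(last=False)           # evict the LRU item of this class
        q[key] = value

    def get(self, key, item_size):
        # A real store would record each key's class instead of recomputing it.
        q = self.queues[size_class(item_size)]
        if key in q:
            q.move_to_end(key)
            return q[key]
        return None

cache = SlabbedLRUCache({64: 100, 128: 100, 256: 50, 512: 25, 1024: 10})
cache.put("a", 90, "payload")               # lands in the 128-byte class
```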
The second problem we investigate is the design of memory hierarchies that use multiple types of memory technology for caching. Advances in storage technology have introduced many new types of storage media, presenting a system designer with a wide array of options when designing caching middleware. We provide a systematic way to use knowledge about the frequencies of read and write requests to individual data items in order to determine the optimal cache configuration. Replicating a data item in more than one memory bank can benefit the overall performance of the system by speeding recovery in the event of a memory failure. The key design question we address is how best to assign data items to memory banks, given the option of replicating objects, so as to maximize performance. Our performance model takes into account retrieval, update, and recovery time. We study two variants of this problem. In the first variant, which we call the cache configuration problem, we have a fixed budget and must decide which types of storage media to purchase, how much of each to buy, and how to place data objects in the system once the capacity of each storage medium is determined. In the second variant, which we call the subset assignment problem, the storage hardware has already been purchased and we are solely concerned with data placement.
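The flavor of the underlying optimization can be seen in the toy cost model below: given per-item read and write frequencies, it scores every non-empty subset of memory banks by expected retrieval, update, and recovery cost and picks the cheapest. All latencies, failure probabilities, and the source-fetch cost are made-up numbers, capacity constraints are ignored, and the formulas are a simplification rather than the model used in our analysis.

```python
from itertools import combinations

# Hypothetical per-bank parameters: read/write latency (microseconds) and an
# independent failure probability over the planning horizon.
BANKS = {
    "dram": {"read": 0.1,   "write": 0.1,   "fail": 0.02},
    "nvm":  {"read": 0.5,   "write": 2.0,   "fail": 0.01},
    "ssd":  {"read": 100.0, "write": 300.0, "fail": 0.005},
}
SOURCE_FETCH = 1_000_000.0   # cost of regenerating the item from its source

def expected_cost(reads, writes, subset):
    """Expected service cost of placing one item on `subset` of banks.

    Reads go to the fastest replica, writes must update every replica, and
    the item is refetched from the source only if all replicas fail."""
    read_cost = reads * min(BANKS[b]["read"] for b in subset)
    write_cost = writes * sum(BANKS[b]["write"] for b in subset)
    all_fail = 1.0
    for b in subset:
        all_fail *= BANKS[b]["fail"]
    return read_cost + write_cost + all_fail * SOURCE_FETCH

def best_subset(reads, writes):
    """Enumerate all non-empty subsets of banks (capacities ignored here)."""
    names = list(BANKS)
    candidates = [s for k in range(1, len(names) + 1)
                  for s in combinations(names, k)]
    return min(candidates, key=lambda s: expected_cost(reads, writes, s))

print(best_subset(reads=10_000, writes=10))   # read-heavy item
print(best_subset(reads=10, writes=10_000))   # write-heavy item
```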
Both problems are NP-hard since they are generalizations of the knapsack problem. We make the reasonable practical assumption that there are many more data items than storage media, and that each storage medium is orders of magnitude larger than any single data item. These assumptions allow us to efficiently find nearly optimal solutions. For the cache configuration problem, we show that the problem is equivalent to the multiple-choice knapsack problem. We provide results from an empirical study that evaluates our algorithm in the context of a memory hierarchy for a key-value store as well as a host-side cache that stores disk pages. The results show that selective replication is appropriate at certain failure rates, but that replicating data items is not advantageous when failure rates are very low. For the subset assignment problem, we devise an algorithm loosely based on the cycle canceling algorithm for the minimum cost flow problem and give theoretical bounds on its running time. Our algorithm solves the linear programming relaxation in time O(exp(d(d+1)) poly(d) n log(n) log(nC) log(Z)), where d is the number of storage media, n the number of distinct data items that can be requested, Z the maximum size of any object, and C the maximum cost of storing an item.
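To illustrate the multiple-choice knapsack structure mentioned above, the sketch below solves a tiny instance with the textbook dynamic program: exactly one (price, benefit) option is chosen from every group, subject to a budget. The groups and numbers are hypothetical, and this pseudo-polynomial DP is shown only to convey the problem structure; it is not the algorithm analyzed in this work, which relies on the few-media, many-items assumptions stated above.

```python
def multiple_choice_knapsack(groups, budget):
    """Textbook DP for multiple-choice knapsack: pick exactly one
    (cost, value) option from each group without exceeding the budget.
    In the cache configuration analogy, a group could list the candidate
    capacities (and resulting benefit) for one storage medium."""
    NEG = float("-inf")
    best = [NEG] * (budget + 1)
    best[0] = 0.0
    for options in groups:
        new_best = [NEG] * (budget + 1)
        for spent in range(budget + 1):
            if best[spent] == NEG:
                continue
            for cost, value in options:
                if spent + cost <= budget:
                    new_best[spent + cost] = max(new_best[spent + cost],
                                                 best[spent] + value)
        best = new_best
    return max(best)

# Hypothetical example: three media, each offering a few capacity tiers as
# (price, expected benefit) pairs; (0, 0.0) means "buy none of this medium".
groups = [
    [(0, 0.0), (4, 10.0), (8, 18.0)],    # e.g. DRAM tiers
    [(0, 0.0), (2, 5.0), (6, 12.0)],     # e.g. NVM tiers
    [(0, 0.0), (1, 2.0), (3, 5.0)],      # e.g. SSD tiers
]
print(multiple_choice_knapsack(groups, budget=10))
```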