Old Performance - bitfaster/BitFaster.Caching Wiki

Performance

DISCLAIMER: Always measure performance in the context of your application. The results provided here are intended as a guide.

The cache replacement policy must maximize the cache hit rate, and minimize the computational and space overhead involved in implementing the policy. Below an analysis of hit rate vs cache size, latency and throughput is provided.

ConcurrentLru Hit rate

The charts below show the relative hit rate of classic LRU vs Concurrent LRU on a Zipfian distribution of input keys, with parameter s = 0.5 and s = 0.86 respectively. If there are N items, the probability of accessing an item numbered i or less is (i / N)^s.

Here N = 50000, and we take 1 million sample keys. The hit rate is the number of times we get a cache hit divided by 1 million. This test was repeated with the cache configured to different sizes expressed as a percentage N (e.g. 10% would be a cache with a capacity 5000). ConcurrentLru is configured with the default FavorFrequencyPartition to allocate internal queue capacity (see here for details).

As above, but interleaving a sequential scan of every key (aka sequential flooding). In this case, ConcurrentLru performs significantly better across the board, and is therefore more resistant to scanning.

These charts summarize the percentage increase in hit rate for ConcurrentLru vs LRU. Increase is in hit rate is significant at lower cache sizes, outperforming the classic LRU by over 150% when s = 0.5 in the best case for both Zipf and scan access patterns.

ConcurrentLru Latency

In these benchmarks, a cache miss is essentially free. These tests exist purely to compare the raw execution speed of the cache bookkeeping code. In a real setting, where a cache miss is presumably quite expensive, the relative overhead of the cache will be very small.

Benchmarks are based on BenchmarkDotNet, so are single threaded. The ConcurrentLru family of classes are composed internally of ConcurrentDictionary.GetOrAdd and ConcurrentQueue.Enqueue/Dequeue method calls, and scale well to concurrent workloads.

Benchmark results below are from a workstation with the following config:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
  [Host]             : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT
  .NET 6.0           : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  .NET Framework 4.8 : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT

The relative ranking of each cache implementation is stable across .NET Framework/Core/5/6 and on the CPU architectures available in Azure (e.g. Intel Skylake, AMD Zen). Absolute performance can vary.

What are FastConcurrentLru/FastConcurrentTLru?

These are classes that execute with the hit counting logic eliminated (via JIT). If hit counts are not required, this makes the code around 10% faster.

Lookup keys with a Zipf distribution

Take 1000 samples of a Zipfian distribution over a set of keys of size N and use the keys to lookup values in the cache. If there are N items, the probability of accessing an item numbered i or less is (i / N)^s.

s = 0.86 (yields approx 80/20 distribution)
N = 500

Cache size = N / 10 (so we can cache 10% of the total set). ConcurrentLru has approximately the same computational overhead as a standard LRU in this single threaded test.

Method Mean Error StdDev Ratio RatioSD
ClassicLru 175.7 ns 2.75 ns 2.43 ns 1.00 0.00
FastConcurrentLru 180.2 ns 2.55 ns 2.26 ns 1.03 0.02
ConcurrentLru 189.1 ns 3.14 ns 2.94 ns 1.08 0.03
FastConcurrentTLru 261.4 ns 4.53 ns 4.01 ns 1.49 0.04
ConcurrentTLru 266.1 ns 3.96 ns 3.51 ns 1.51 0.03

Raw Lookup speed

In this test the same items are fetched repeatedly, no items are evicted. Representative of high hit rate scenario, when there are a low number of hot items.

  • ConcurrentLru family does not move items in the queues, it is just marking as accessed for pure cache hits.
  • Classic Lru must maintain item order, and is internally splicing the fetched item to the head of the linked list.
  • MemoryCache and ConcurrentDictionary represent a pure lookup. This is the best case scenario for MemoryCache, since the lookup key is a string (if the key were a Guid, using MemoryCache adds string conversion overhead).

FastConcurrentLru does not allocate and is approximately 5-10x faster than System.Runtime.Caching.MemoryCache or the newer Microsoft.Extensions.Caching.Memory.MemoryCache.

Method Runtime Mean StdDev Ratio Allocated
ConcurrentDictionary .NET 6.0 7.783 ns 0.0720 ns 1.00 -
FastConcurrentLru .NET 6.0 9.773 ns 0.0361 ns 1.26 -
ConcurrentLru .NET 6.0 13.615 ns 0.0606 ns 1.75 -
FastConcurrentTLru .NET 6.0 25.480 ns 0.0935 ns 3.28 -
ConcurrentTLru .NET 6.0 29.890 ns 0.2107 ns 3.84 -
ClassicLru .NET 6.0 54.422 ns 0.2935 ns 7.00 -
RuntimeMemoryCacheGet .NET 6.0 115.016 ns 0.6619 ns 14.79 32 B
ExtensionsMemoryCacheGet .NET 6.0 53.328 ns 0.2130 ns 6.85 24 B
ConcurrentDictionary .NET Framework 4.8 13.644 ns 0.0601 ns 1.00 -
FastConcurrentLru .NET Framework 4.8 14.639 ns 0.0892 ns 1.07 -
ConcurrentLru .NET Framework 4.8 17.008 ns 0.2538 ns 1.25 -
FastConcurrentTLru .NET Framework 4.8 43.854 ns 0.0827 ns 3.22 -
ConcurrentTLru .NET Framework 4.8 47.954 ns 1.2772 ns 3.52 -
ClassicLru .NET Framework 4.8 62.683 ns 0.8105 ns 4.60 -
RuntimeMemoryCacheGet .NET Framework 4.8 287.627 ns 1.3691 ns 21.08 32 B
ExtensionsMemoryCacheGet .NET Framework 4.8 114.511 ns 0.5902 ns 8.39 24 B

ConcurrentLru Throughput

In this test, we generate 2000 samples of 500 keys with a Zipfian distribution (s = 0.86). Caches have size 50. From N concurrent threads, fetch the sample keys in sequence (each thread is using the same input keys). The principal scalability limit in concurrent applications is the exclusive resource lock. As the number of threads increases, ConcurrentLru significantly outperforms an LRU implemented with a short lived exclusive lock used to synchronize the linked list data structure.

This test was run on a Standard D16s v3 Azure VM (16 cpus), with .NET Core 3.1.

image

⚠️ **GitHub.com Fallback** ⚠️