
2023-11-16

How DoorDash Rearchitected its Cache to Improve Scalability and Performance

DoorDash has restructured its caching system, previously a mix of different technologies across its microservices, into a unified, multi-layered cache. The new system provides a generic mechanism that addresses issues arising from the fragmented approach, such as cache staleness and heavy reliance on Redis. The redesign was driven by the need for improved performance and scalability, particularly for the DashPass service, which was struggling under increased traffic.

The new caching system is built on a multi-layered design with three distinct layers:

  1. Request Local Cache: Data is stored in a hash map and lasts only as long as the request does.
  2. Local Cache: Utilizes Caffeine to share data among all workers within the same Java Virtual Machine.
  3. Redis Cache: Visible to all pods within the same Redis cluster, leveraging Redis Lettuce for operations.
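The read path through these three layers can be sketched roughly as follows. This is a simplified Python illustration of the pattern only: the real system is JVM-based (Caffeine for the local cache, Redis via Lettuce for the shared layer), and `fetch_from_source` is a hypothetical stand-in for the source of truth.

```python
import time

class MultiLayerCache:
    """Simplified sketch of a three-layer read-through cache.

    Layer 1: request-local dict (lives only as long as one request).
    Layer 2: process-local cache with a TTL (Caffeine's role in the
             real system).
    Layer 3: shared cache visible to all instances (Redis's role),
             modeled here as a plain dict purely for illustration.
    """

    def __init__(self, local_ttl_seconds=5.0):
        self.local_cache = {}   # key -> (value, expires_at)
        self.local_ttl = local_ttl_seconds
        self.shared_cache = {}  # stands in for Redis

    def get(self, key, request_cache, fetch_from_source):
        # Layer 1: request-local cache -- consistent within one request.
        if key in request_cache:
            return request_cache[key]

        # Layer 2: process-local cache, honoring its TTL.
        entry = self.local_cache.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.monotonic() < expires_at:
                request_cache[key] = value
                return value

        # Layer 3: shared cache (Redis in the real system).
        if key in self.shared_cache:
            value = self.shared_cache[key]
        else:
            # Miss on all layers: go to the source of truth.
            value = fetch_from_source(key)
            self.shared_cache[key] = value

        # Populate the faster layers on the way back out.
        self.local_cache[key] = (value, time.monotonic() + self.local_ttl)
        request_cache[key] = value
        return value
```

A fresh `request_cache` dict would be created per request, which is what guarantees that layer-1 entries never outlive the request that populated them.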

This design allows runtime control over each layer, including toggling the cache on or off, setting time-to-live (TTL) values, and operating in shadow mode to compare a percentage of cache requests with the source of truth. The system also supports metrics collection for hits, misses, and latency, as well as logging capabilities.
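The shadow-mode comparison described above might look roughly like this. This is a hedged sketch, not DoorDash's implementation: `cache_get`, `fetch_from_source`, and `record_mismatch` are hypothetical stand-ins, and in production the mismatch would be reported as a metric rather than a log line.

```python
import random

def shadow_read(key, cache_get, fetch_from_source, sample_rate=0.01,
                record_mismatch=print):
    """Serve the cached value, and for a sampled fraction of requests
    compare it against the source of truth (shadow-mode sketch)."""
    cached = cache_get(key)
    if random.random() < sample_rate:
        fresh = fetch_from_source(key)
        if fresh != cached:
            # A real system would emit a staleness metric here.
            record_mismatch(f"stale cache entry for {key!r}")
    return cached
```

Because only a small sample of reads is double-checked, shadow mode measures staleness without doubling the load on the source of truth.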

After a successful rollout for DashPass, the system was gradually introduced to other services within DoorDash, accompanied by clear guidelines on its use.

The original article is “How DoorDash Rearchitected its Cache to Improve Scalability and Performance”.

Balancing the trade-offs between data consistency and latency in a multi-layered cache system like DoorDash’s involves several strategies:

  1. Layered TTL Management: Each layer in the cache hierarchy can have different time-to-live (TTL) settings. Shorter TTLs in the faster, local layers keep data fresher, while longer TTLs in the shared layer reduce misses and thus latency.
  2. Stale Data Handling: Implementing strategies to handle stale data, such as using a write-through cache or employing background refreshes, can help maintain consistency across layers.
  3. Cache Invalidation: Sophisticated invalidation mechanisms ensure that when data changes, all relevant cache layers are updated or invalidated to maintain consistency.
  4. Request Local Caching: By using request local caches, DoorDash ensures that within a single request, data remains consistent, reducing the risk of serving stale data within that context.
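The write-through approach mentioned in point 2 can be sketched in a few lines. This is an illustrative sketch, not DoorDash's API: it assumes dict-like cache layers and a `store_to_source` callable standing in for the database write.

```python
def write_through(key, value, store_to_source, cache_layers):
    """Write-through sketch: persist to the source of truth first,
    then update every cache layer so subsequent reads at any layer
    see the new value rather than a stale entry."""
    store_to_source(key, value)
    for layer in cache_layers:
        layer[key] = value
```

Writing to the source of truth before the caches means a failed write leaves the caches stale but never ahead of the database.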

For cache invalidation, DoorDash likely employs a combination of strategies to manage the complexity:

  1. Event-Driven Invalidation: Using an event-driven architecture where changes in the data source trigger cache invalidation events across all layers.
  2. Versioning: Implementing versioned cache keys can help invalidate only the affected parts of the cache when data changes.
  3. Scheduled Invalidation: Regularly scheduled invalidation processes can clear out stale data, especially for less frequently updated information.
  4. Active Monitoring: Monitoring cache hit rates and patterns can help identify when data might be stale and trigger a proactive refresh.
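Versioned cache keys (point 2 above) are a common pattern worth illustrating: embedding a version number in the key lets a whole namespace be invalidated at once by bumping the version, with old entries simply aging out. A minimal sketch, with names chosen for illustration:

```python
# Per-namespace version counters; in practice these would live in a
# shared store (e.g. Redis) so all instances agree on the version.
version_counters = {}

def versioned_key(namespace, key):
    """Build a cache key that embeds the namespace's current version."""
    version = version_counters.get(namespace, 0)
    return f"{namespace}:v{version}:{key}"

def invalidate_namespace(namespace):
    """Bump the version, making every existing key in the namespace
    unreachable without deleting entries one by one."""
    version_counters[namespace] = version_counters.get(namespace, 0) + 1
```

The trade-off is that orphaned entries from old versions linger until their TTL expires, so this pattern pairs naturally with the TTL management discussed earlier.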

The inclusion of metrics and logging is crucial for several reasons:

  1. Performance Tuning: Metrics on cache hits, misses, and latency help tune the cache layers for optimal performance.
  2. Troubleshooting: Logging provides insights into cache operations, making it easier to troubleshoot issues like cache poisoning or unexpected misses.
  3. Capacity Planning: Metrics can inform decisions about scaling the cache infrastructure and planning for capacity based on actual usage patterns.
  4. Health Checks: Regular health checks based on metrics can alert to issues such as cache server overloads or network latencies affecting cache performance.
  5. Behavioral Insights: Understanding how different services use the cache can lead to better design and configuration, ensuring that the caching strategy aligns with actual usage scenarios.
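The bookkeeping behind points 1 and 3 reduces to simple counters. A minimal sketch, assuming in-process counters purely for illustration; a production system would report to a metrics backend instead:

```python
class CacheMetrics:
    """Minimal hit/miss/latency bookkeeping for one cache layer."""

    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.total_latency = 0.0  # seconds

    def record(self, hit, latency_seconds):
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self.total_latency += latency_seconds

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracked per layer, a falling hit rate or rising latency is exactly the signal that drives the tuning, capacity-planning, and alerting uses listed above.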