Skip to content

ADR-012: Response Cache Strategy

Status: Superseded by ADR-017 Date: 2026-03-20 Deciders: AI Engineering NAMER

Every request to the AI Gateway currently hits upstream LLM providers directly. There is no response caching layer, which means:

  • Identical prompts (common in testing, retries, and templated workflows) incur full provider costs each time.
  • Latency for repeated queries is unnecessarily high.
  • During load spikes, all traffic fans out to providers with no local absorption.

Portkey Gateway OSS has built-in support for response caching via a Redis backend, activated by setting CACHE_STORE=redis and REDIS_URL environment variables.

Deploy an Amazon ElastiCache Redis cluster within the existing VPC private subnets and configure the Portkey Gateway to use it for exact-match response caching.

  • Engine: Redis 7.1 (ElastiCache)
  • Node type: cache.t4g.micro (single-node for dev, scalable for prod)
  • Eviction: allkeys-lru (least recently used eviction when memory is full)
  • Encryption: At-rest and in-transit (TLS) enabled
  • Network: Private subnets only; security group allows TCP 6379 from ECS tasks only
OptionProsCons
ElastiCache Redis (chosen)Native Portkey support; sub-ms latency; mature managed service; LRU eviction built inAdditional infrastructure; no semantic matching
DynamoDB DAXServerless scalingNot supported by Portkey; wrong access pattern for KV cache
CloudFront cachingEdge distributionPOST requests not cacheable by default; header-based cache keys are fragile for LLM payloads
No cacheZero complexityEvery request hits providers; highest cost and latency

Positive:

  • Reduced provider costs for repeated or templated queries.
  • Lower latency for cache hits (sub-millisecond Redis vs. hundreds of milliseconds for provider round-trips).
  • Foundation for future semantic caching (similarity-based cache matching).

Negative:

  • Additional infrastructure cost (~$12/month for a single cache.t4g.micro node).
  • Cache invalidation complexity: stale responses are evicted only by LRU or TTL.
  • TLS overhead on every cache read/write (negligible in practice).
  • Semantic caching: match semantically similar prompts to cached responses using embedding similarity. Portkey supports this as an upgrade path.
  • Multi-node replication group with automatic failover for production workloads.
  • CloudWatch metrics for cache hit rate monitoring and alerting.