Semantic Caching

Last updated April 17, 2026

Reduce latency and costs by automatically reusing responses for repeated and similar prompts.

Overview

Semantic Caching is built into Cencori and runs automatically for eligible requests.

When a new request is semantically close to one your project has already sent, Cencori can return a cached response instead of calling the upstream model again.

Benefits

  1. Lower latency: Responses can return much faster on cache hits.
  2. Lower cost: Cache hits avoid an additional model generation call.
  3. More consistency: Repeated prompts return stable results.

When Cache Applies

Semantic cache is currently applied to:

  • Non-streaming requests
  • Requests without tool/function calls
  • Requests scoped to the same project and compatible generation settings
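The eligibility rules above can be sketched as a simple client-side check. The request shape and field names here are illustrative assumptions, not a documented Cencori type; only the conditions themselves come from this page:

```typescript
// Illustrative request shape; not a documented Cencori schema.
interface GenerationRequest {
  projectId: string;
  stream: boolean;
  tools?: unknown[];
}

// Mirror the eligibility conditions listed above: streaming and
// tool/function-calling requests bypass the semantic cache.
function isCacheEligible(req: GenerationRequest): boolean {
  const usesTools = req.tools !== undefined && req.tools.length > 0;
  return !req.stream && !usesTools;
}
```

Project scoping and generation-setting compatibility are enforced server-side, so they are not modeled in this sketch.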

Dashboard Configuration

No cache toggle is required in the dashboard today. Caching is automatic.

To improve cache hit rates:

  1. Keep model selection stable for repeated workloads.
  2. Use consistent system prompts and instruction structure.
  3. Avoid unnecessary randomness for deterministic tasks.
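The three tips above can be captured in how you construct requests. A minimal sketch, assuming an OpenAI-style chat payload; the field names and the model name are placeholders, not part of the Cencori API:

```typescript
// Illustrative payload shape; field names are assumptions.
interface ChatRequest {
  model: string;
  system: string;
  prompt: string;
  temperature: number;
  stream: boolean;
}

// Keep everything except the user prompt fixed so repeated
// workloads produce near-identical requests and can hit the cache.
function cacheFriendlyRequest(prompt: string): ChatRequest {
  return {
    model: "example-model",                  // 1. stable model selection
    system: "You are a helpful assistant.",  // 2. consistent system prompt
    prompt,
    temperature: 0,                          // 3. no unnecessary randomness
    stream: false,                           // streaming bypasses cache
  };
}
```

Centralizing request construction in one helper like this keeps settings from drifting between call sites, which is what fragments cache keys in practice.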

Cache Status Header

You can verify the cache status of any response using the X-Cencori-Cache header:

  Value          Description
  HIT            Served from cache based on an exact repeat request.
  SEMANTIC-HIT   Served from cache based on similar request meaning.
  MISS           Served from model generation and then cached.
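Reading this header in client code might look like the sketch below. Only the header name and its three values come from this page; the helper itself is an assumption about how you might consume it:

```typescript
// Possible values of the X-Cencori-Cache response header,
// plus a fallback for absent or unrecognized values.
type CacheStatus = "HIT" | "SEMANTIC-HIT" | "MISS" | "UNKNOWN";

// Read the cache status from a fetch Response's headers.
// Headers.get() is case-insensitive per the Fetch standard.
function cacheStatus(headers: Headers): CacheStatus {
  const value = headers.get("X-Cencori-Cache");
  if (value === "HIT" || value === "SEMANTIC-HIT" || value === "MISS") {
    return value;
  }
  return "UNKNOWN";
}

// Example: a response served from cache by meaning similarity.
const headers = new Headers({ "X-Cencori-Cache": "SEMANTIC-HIT" });
console.log(cacheStatus(headers)); // "SEMANTIC-HIT"
```

Logging this value per request is a simple way to measure your actual cache hit rate before and after applying the tips above.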

Retention (TTL)

Cached responses are retained for 1 hour by default.

Current Limits

  • Streaming requests bypass cache.
  • Tool/function-calling requests bypass cache.
  • Per-request cache controls are not exposed yet.