|

Semantic Caching

Reduce latency and costs by automatically reusing responses for repeated and similar prompts.

Overview

Semantic Caching is built into Cencori and runs automatically for eligible requests.

When a new request is close to one your project has already asked, Cencori can return a cached response instead of calling the upstream model again.

Benefits

Lower latency: Responses can return much faster on cache hits.
Lower cost: Cached hits avoid an additional model generation call.
More consistency: Repeated prompts return stable results.

When Cache Applies

Semantic cache is currently applied to:

Non-streaming requests
Requests without tool/function calls
Requests scoped to the same project and compatible generation settings

Dashboard Configuration

No cache toggle is required in the dashboard today. Caching is automatic.

To improve cache hit rates:

Keep model selection stable for repeated workloads.
Use consistent system prompts and instruction structure.
Avoid unnecessary randomness for deterministic tasks.

Cache Status Header

You can verify the cache status of any response using the X-Cencori-Cache header:

Value	Description
`HIT`	Served from cache based on an exact repeat request.
`SEMANTIC-HIT`	Served from cache based on similar request meaning.
`MISS`	Served from model generation and then cached.

Retention (TTL)

Cached responses are retained for 1 hour by default.

Current Limits

Streaming requests bypass cache.
Tool/function-calling requests bypass cache.
Per-request cache controls are not exposed yet.

Did you like the content?

Routing & FailoverConfigure smart routing rules, automatic failover, and load balancing for high-availability AI.

Content FilteringFilter harmful output from AI models. Configure thresholds for hate, violence, and self-harm.

On This Page

Overview Benefits When Cache Applies Dashboard Configuration Cache Status Header Retention (TTL)Current Limits