|
Semantic Caching
Reduce latency and costs by automatically reusing responses for repeated and similar prompts.
Overview
Semantic Caching is built into Cencori and runs automatically for eligible requests.
When a new request is close to one your project has already asked, Cencori can return a cached response instead of calling the upstream model again.
Benefits
- Lower latency: Responses can return much faster on cache hits.
- Lower cost: Cached hits avoid an additional model generation call.
- More consistency: Repeated prompts return stable results.
When Cache Applies
Semantic cache is currently applied to:
- Non-streaming requests
- Requests without tool/function calls
- Requests scoped to the same project and compatible generation settings
Dashboard Configuration
No cache toggle is required in the dashboard today. Caching is automatic.
To improve cache hit rates:
- Keep model selection stable for repeated workloads.
- Use consistent system prompts and instruction structure.
- Avoid unnecessary randomness for deterministic tasks.
Cache Status Header
You can verify the cache status of any response using the X-Cencori-Cache header:
Retention (TTL)
Cached responses are retained for 1 hour by default.
Current Limits
- Streaming requests bypass cache.
- Tool/function-calling requests bypass cache.
- Per-request cache controls are not exposed yet.

