Semantic Caching
Last updated April 17, 2026
Reduce latency and costs by automatically reusing responses for repeated and similar prompts.
Overview
Semantic Caching is built into Cencori and runs automatically for eligible requests.
When a new request is semantically similar to one your project has already sent, Cencori can return the cached response instead of calling the upstream model again.
Benefits
- Lower latency: Responses can return much faster on cache hits.
- Lower cost: Cache hits avoid an upstream model generation call.
- More consistency: Repeated prompts return stable results.
When Cache Applies
The semantic cache currently applies to:
- Non-streaming requests
- Requests without tool/function calls
- Requests scoped to the same project and compatible generation settings
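The eligibility rules above can be sketched as a simple check. This is an illustrative sketch only: the field names (`stream`, `tools`) are assumptions standing in for the actual request schema, not the real server-side logic.

```python
# Hypothetical sketch of semantic-cache eligibility, mirroring the
# rules listed above. Field names are illustrative assumptions,
# not the actual Cencori request schema.

def is_cache_eligible(request: dict) -> bool:
    """Return True if a request could be served from the semantic cache."""
    if request.get("stream", False):
        return False  # streaming requests bypass the cache
    if request.get("tools"):
        return False  # tool/function-calling requests bypass the cache
    return True

# A plain, non-streaming request with no tools is eligible.
print(is_cache_eligible({"model": "example-model", "stream": False}))
```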
Dashboard Configuration
No cache toggle is required in the dashboard today. Caching is automatic.
To improve cache hit rates:
- Keep model selection stable for repeated workloads.
- Use consistent system prompts and instruction structure.
- Avoid unnecessary randomness for deterministic tasks.
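One way to follow these tips in client code is to build requests through a single helper, so the model, system prompt, and sampling settings stay identical across calls. A minimal sketch, assuming a generic chat-style payload (the field names and model name are illustrative, not Cencori-specific):

```python
# Hypothetical sketch: a single request builder keeps the model,
# system prompt, and temperature stable, which improves the chance
# that repeated or similar prompts hit the cache.

def build_request(user_input: str) -> dict:
    return {
        "model": "example-model",   # fixed model for this workload
        "temperature": 0,           # no unnecessary randomness for deterministic tasks
        "messages": [
            # identical system prompt on every call
            {"role": "system", "content": "You are a support assistant."},
            {"role": "user", "content": user_input.strip()},
        ],
    }
```

Normalizing the user input (here, a simple `strip()`) also helps trivially different prompts resolve to the same cached entry.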
Cache Status Header
You can verify the cache status of any response using the X-Cencori-Cache header:
| Value | Description |
|---|---|
| HIT | Served from cache based on an exact repeat request. |
| SEMANTIC-HIT | Served from cache based on similar request meaning. |
| MISS | Served from model generation and then cached. |
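Checking the header in client code might look like the sketch below. The header name comes from the table above; the `headers` dict stands in for whatever response-headers object your HTTP client exposes.

```python
# Sketch: interpreting the X-Cencori-Cache response header.
# `headers` represents the HTTP response headers from your client library.

CACHE_STATES = {"HIT", "SEMANTIC-HIT", "MISS"}

def cache_status(headers: dict) -> str:
    """Return the cache status, defaulting to MISS if absent or unrecognized."""
    value = headers.get("X-Cencori-Cache", "MISS")
    return value if value in CACHE_STATES else "MISS"

def was_cached(headers: dict) -> bool:
    """True when the response was served from cache (exact or semantic hit)."""
    return cache_status(headers) in {"HIT", "SEMANTIC-HIT"}
```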
Retention (TTL)
Cached responses are retained for 1 hour by default.
Current Limits
- Streaming requests bypass cache.
- Tool/function-calling requests bypass cache.
- Per-request cache controls are not exposed yet.