AI Gateway
Prompt Cache
Last updated May 6, 2026
Reduce costs and latency with Exact and Semantic caching.
Prompt Cache can cut AI costs by 30-60% and reduce latency on cache hits by up to 95%.
How It Works
Exact Match uses Redis for sub-10ms hits. Semantic Match uses pgvector to match similar prompts.
When a request arrives, the gateway first checks the exact cache. On a miss, it checks the semantic cache. If neither matches, it calls the provider and stores the response for future requests.
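The three-step flow above can be sketched as follows. The function names and in-memory stores are illustrative stand-ins (a `Map` for Redis, a callback for the pgvector lookup), not the gateway's actual internals:

```typescript
type CacheResult = { response: string; source: "exact" | "semantic" | "provider" };

// Sketch of the lookup cascade: exact cache, then semantic cache, then provider.
async function cachedChat(
  prompt: string,
  exactCache: Map<string, string>,                // stands in for Redis
  semanticLookup: (p: string) => string | null,   // stands in for pgvector
  callProvider: (p: string) => Promise<string>,
): Promise<CacheResult> {
  const exact = exactCache.get(prompt);
  if (exact !== undefined) return { response: exact, source: "exact" };

  const semantic = semanticLookup(prompt);
  if (semantic !== null) return { response: semantic, source: "semantic" };

  const response = await callProvider(prompt);
  exactCache.set(prompt, response); // store for future exact hits
  return { response, source: "provider" };
}
```

Only the final branch pays provider latency and token cost; the two cache branches return without a network round trip to the model.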
Features
Exact Match is Redis-backed and keys on a SHA-256 hash of the normalized prompt. Matching is case-insensitive and whitespace-normalized.
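A hypothetical reconstruction of that key derivation: lowercase the prompt, collapse whitespace, then hash. The exact normalization rules the gateway applies may differ:

```typescript
import { createHash } from "node:crypto";

// Illustrative exact-match key: case-insensitive, whitespace-normalized,
// then SHA-256 hashed. Assumed normalization, not the documented internals.
function exactCacheKey(prompt: string): string {
  const normalized = prompt.toLowerCase().trim().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized).digest("hex");
}
```

Under this scheme, "Hello world" and "  hello   WORLD " hash to the same key and hit the same cache entry.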
Semantic Match embeds prompts with text-embedding-004 and compares them against a configurable similarity threshold (default 0.95). Best for paraphrased prompts.
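The threshold comparison is typically cosine similarity between embedding vectors. A minimal sketch, assuming cosine similarity is the metric (real embeddings would come from text-embedding-004; these helpers are illustrative):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A cached entry counts as a semantic hit when similarity meets the threshold.
function isSemanticHit(query: number[], cached: number[], threshold = 0.95): boolean {
  return cosineSimilarity(query, cached) >= threshold;
}
```

A higher threshold means fewer but safer hits; lowering it trades accuracy for hit rate.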
Smart Temperature Handling caches only low-temperature responses; the default max_cacheable_temperature is 0.2, and requests with higher temperatures skip the cache.
Model Exclusion lets you exclude specific models, such as o3-mini, o1, and claude-3-opus, from caching.
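The two features above combine into a single cacheability decision. An illustrative predicate, with field names mirroring the configuration keys in the Configuration section:

```typescript
interface CacheConfig {
  excluded_models: string[];
  max_cacheable_temperature: number;
}

// Sketch: a response is cached only if the model is not excluded and the
// request temperature is at or below the configured cap.
function isCacheable(model: string, temperature: number, config: CacheConfig): boolean {
  if (config.excluded_models.includes(model)) return false;
  return temperature <= config.max_cacheable_temperature;
}
```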
Configuration
In your project, navigate to Settings and then Cache.
{
  "cache_enabled": true,
  "exact_match_enabled": true,
  "semantic_match_enabled": true,
  "ttl_seconds": 3600,
  "similarity_threshold": 0.95,
  "max_entries": 10000,
  "excluded_models": ["o3-mini"],
  "max_cacheable_temperature": 0.2
}
Analytics
Track Hit Rate, Tokens Saved, Cost Saved, and Active Entries. Time ranges include Last 1 Hour, Last 24 Hours, Last 7 Days, and Last 30 Days.
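For reference, the Hit Rate metric reduces to hits over total requests. A sketch of that computation (the dashboard's exact formula is not documented here):

```typescript
// Hit Rate as a fraction of all requests served from either cache tier.
// Returns 0 rather than NaN when no requests have been made yet.
function hitRate(hits: number, totalRequests: number): number {
  return totalRequests === 0 ? 0 : hits / totalRequests;
}
```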
Examples
Enable Exact Match for production use:
const response = await cencori.chat({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Hello' }],
  cache: { mode: 'exact' }
});
Enable Semantic Match for cost savings:
const response = await cencori.chat({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarize' }],
  cache: { mode: 'semantic', threshold: 0.95 }
});
Best Practices
Enable Exact Match first. Start Semantic Match with a high similarity threshold and lower it gradually. Exclude reasoning models from caching. Use a shorter TTL for frequently changing data.
Limits
Default TTL is 1 hour (3600 seconds). Default max entries is 10,000. Default similarity threshold is 0.95. Default max_cacheable_temperature is 0.2.