AI Gateway

Prompt Cache

Last updated May 6, 2026

Reduce costs and latency with Exact and Semantic caching.

Prompt Cache reduces AI costs by 30-60% and cuts latency by up to 95% by serving repeated and similar prompts from cache instead of calling the provider.

How It Works

Exact Match uses Redis and serves hits in under 10 ms. Semantic Match uses pgvector to match prompts that are similar but not identical.

When a request arrives, the gateway first checks the exact cache. On an exact miss, it checks the semantic cache. If the semantic cache also misses, the request goes to the provider and the response is stored for future requests.
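The lookup flow above can be sketched as follows. This is illustrative only: the in-memory `Map` and array stand in for Redis and pgvector, and `embed` is a caller-supplied embedding function, not the gateway's internals.

```typescript
type CacheResult = { hit: boolean; source?: "exact" | "semantic"; response?: string };

const exactCache = new Map<string, string>();                            // stands in for Redis
const semanticCache: { embedding: number[]; response: string }[] = [];   // stands in for pgvector

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function lookup(prompt: string, embed: (p: string) => number[], threshold = 0.95): CacheResult {
  // 1. Exact cache first, keyed by the normalized prompt.
  const key = prompt.trim().toLowerCase().replace(/\s+/g, " ");
  const exact = exactCache.get(key);
  if (exact !== undefined) return { hit: true, source: "exact", response: exact };

  // 2. Semantic cache next: any stored entry above the similarity threshold.
  const e = embed(prompt);
  for (const entry of semanticCache) {
    if (cosine(e, entry.embedding) >= threshold) {
      return { hit: true, source: "semantic", response: entry.response };
    }
  }

  // 3. Both miss: the caller invokes the provider and stores the response.
  return { hit: false };
}
```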

Features

Exact Match is Redis-backed and keys entries by a SHA-256 hash of the normalized prompt. Matching is case-insensitive and whitespace-normalized.
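Key derivation could look like the sketch below (a hypothetical helper: the exact normalization rules the gateway applies may differ, but case-folding and whitespace collapsing follow the description above).

```typescript
import { createHash } from "node:crypto";

// Normalize the prompt (lowercase, trim, collapse runs of whitespace),
// then hash with SHA-256 to produce a fixed-size cache key.
function exactCacheKey(prompt: string): string {
  const normalized = prompt.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized).digest("hex");
}
```

Because the key is derived from the normalized text, `"Hello  World"` and `" hello world "` map to the same entry.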

Semantic Match embeds prompts with text-embedding-004 and matches against a configurable similarity threshold (default 0.95). Best for paraphrased prompts.

Smart Temperature Handling caches only low-temperature responses, where outputs are near-deterministic. The default max_cacheable_temperature is 0.2; requests with higher temperatures skip the cache.

Model Exclusion lets you exclude specific models like o3-mini, o1, and claude-3-opus.
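The two gates above (temperature and model exclusion) could be combined into a single cacheability check. This is a hypothetical helper, not the gateway's actual API; the field names mirror the configuration example below.

```typescript
interface CacheConfig {
  excluded_models: string[];
  max_cacheable_temperature: number;
}

// A response is stored only if the model is not excluded and the
// request temperature is at or below the configured maximum.
function isCacheable(model: string, temperature: number, cfg: CacheConfig): boolean {
  if (cfg.excluded_models.includes(model)) return false; // e.g. o3-mini, o1, claude-3-opus
  return temperature <= cfg.max_cacheable_temperature;   // default 0.2
}
```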

Configuration

Navigate to Project and then Cache in Settings.

{
  "cache_enabled": true,
  "exact_match_enabled": true,
  "semantic_match_enabled": true,
  "ttl_seconds": 3600,
  "similarity_threshold": 0.95,
  "max_entries": 10000,
  "excluded_models": ["o3-mini"],
  "max_cacheable_temperature": 0.2
}

Analytics

Track Hit Rate, Tokens Saved, Cost Saved, and Active Entries. Time ranges include Last 1 Hour, Last 24 Hours, Last 7 Days, Last 30 Days.
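For intuition, the dashboard metrics above can be derived from raw counters roughly as follows. Field names here are hypothetical; the actual aggregation happens server-side.

```typescript
interface CacheStats {
  hits: number;                   // requests served from cache
  misses: number;                 // requests forwarded to the provider
  tokensServedFromCache: number;  // output tokens that were not regenerated
  costPerMTokUsd: number;         // provider price per million tokens (assumed)
}

// Hit Rate: fraction of requests answered from cache.
function hitRate(s: CacheStats): number {
  const total = s.hits + s.misses;
  return total === 0 ? 0 : s.hits / total;
}

// Cost Saved: cached tokens priced at the provider's per-million-token rate.
function costSavedUsd(s: CacheStats): number {
  return (s.tokensServedFromCache / 1_000_000) * s.costPerMTokUsd;
}
```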

Examples

Enable Exact Match for production use:

const response = await cencori.chat({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Hello' }],
  cache: { mode: 'exact' }
});

Enable Semantic Match for cost savings:

const response = await cencori.chat({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarize' }],
  cache: { mode: 'semantic', threshold: 0.95 }
});

Best Practices

Enable Exact Match first, since it is cheapest and never returns a false positive. Start Semantic Match with a high threshold and lower it gradually. Exclude reasoning models, whose outputs vary between runs. Lower the TTL for data that changes frequently.

Limits

The default TTL is 1 hour. The maximum number of entries is 10,000. The default similarity threshold is 0.95, and the default max cacheable temperature is 0.2.