Prompt Injection Protection

Understand and prevent prompt injection attacks that attempt to manipulate AI behavior.

What is Prompt Injection?

Prompt injection is a security vulnerability where malicious users craft inputs that manipulate an AI model's behavior, bypassing intended restrictions or extracting sensitive information.

Example Attack:

attack-example

Without protection, the AI might comply and expose sensitive context.

Common Attack Vectors

1. Instruction Override

Attempts to replace system instructions:

override-attack

2. Context Extraction

Tries to reveal hidden prompts:

extraction-attack

3. Jailbreak Attempts

Bypasses safety guardrails:

jailbreak-attack

4. Delimiter Confusion

Uses special characters to break context:

delimiter-attack

How Cencori Protects Against Prompt Injection

1. Pattern Detection

Cencori scans inputs for known malicious patterns like "ignore previous instructions", "system prompt", "jailbreak", and common attack keywords.

2. Semantic Analysis

Uses ML models to detect inputs that semantically resemble instruction overrides, even if they use novel phrasing.

3. Character Anomaly Detection

Flags inputs with suspicious delimiter usage, excessive special characters, or unusual formatting.

4. Behavioral Scoring

Assigns a risk score to each request. High-risk requests are blocked automatically.

When Injection is Detected

If prompt injection is detected, Cencori blocks the request before it reaches the AI provider:

blocked-response.json

Handling Injection Detection

handle-injection.ts

Handling False Positives

Legitimate inputs might occasionally trigger detection:

Example: Technical Discussion

"How do I ignore previous context in my chatbot implementation?"

This is a legitimate question but contains the trigger phrase "ignore previous".

If you experience false positives:

  • View the incident in your dashboard to see the risk score
  • Adjust sensitivity settings (Enterprise feature)
  • Whitelist specific patterns for your use case

Best Practices for Developers

  • Never trust user input: Always send it through Cencori for scanning
  • Separate system prompts: Use the system role, never concatenate with user input
  • Monitor incidents: Review blocked injection attempts weekly
  • Educate users: Make it clear that manipulative prompts won't work
  • Test your defenses: Try common attacks in test environment

Testing Prompt Injection Protection

Try these sample attacks in your test environment to verify protection:

test-attacks.ts