# Content Filtering
Automatically filter harmful, inappropriate, or policy-violating content in AI requests and responses.
## What is Content Filtering?
Content filtering prevents your AI application from processing or generating harmful content, including:
- Hate speech and discrimination
- Violence and graphic content
- Sexual or adult content
- Self-harm and dangerous activities
- Illegal activities
## Filter Categories
| Category | Description | Examples |
|---|---|---|
| Hate Speech | Attacks based on identity | Racial slurs, religious attacks |
| Violence | Graphic or threatening content | Violent threats, gore |
| Sexual Content | Adult or explicit material | NSFW imagery, explicit text |
| Self-Harm | Dangerous behavior encouragement | Suicide methods, self-injury |
| Illegal Activity | Instructions for crimes | Drug manufacturing, theft |
| Profanity | Offensive language | Curse words, slurs |
## How Content Filtering Works
### 1. Input Scanning
Before sending the request to the AI provider, Cencori scans the user's prompt for harmful content.
### 2. Classification
ML models categorize content and assign severity scores (low, medium, high, critical).
### 3. Policy Enforcement
Based on your configured policy, the request is either blocked, flagged, or allowed with warnings.
### 4. Output Monitoring
AI responses are also scanned. If harmful content is generated, it's blocked before reaching the user.
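The four steps above can be sketched end to end. Everything below — the `Finding` type, the keyword stand-in for the ML classifier, and the threshold logic — is a hypothetical illustration, not Cencori's actual implementation:

```typescript
// Hypothetical sketch of the scan → classify → enforce pipeline described above.
type Severity = "low" | "medium" | "high" | "critical";

interface Finding {
  category: string; // e.g. "violence", "hate_speech"
  severity: Severity;
}

// Stand-in classifier: a real deployment calls an ML model, not a regex.
function classify(text: string): Finding[] {
  const findings: Finding[] = [];
  if (/\bkill\b/i.test(text)) {
    findings.push({ category: "violence", severity: "high" });
  }
  return findings;
}

const rank: Record<Severity, number> = { low: 0, medium: 1, high: 2, critical: 3 };

// Policy enforcement: block when any finding meets the configured threshold,
// flag when findings exist but fall below it.
function enforce(findings: Finding[], threshold: Severity): "allow" | "flag" | "block" {
  if (findings.length === 0) return "allow";
  return findings.some(f => rank[f.severity] >= rank[threshold]) ? "block" : "flag";
}

// Input scanning (step 1) and output monitoring (step 4) run the same check,
// once on the prompt and once on the model's response.
function checkText(text: string, threshold: Severity): "allow" | "flag" | "block" {
  return enforce(classify(text), threshold);
}
```

Because the same check runs on both the prompt and the response, harmful output is caught even when the input looked benign.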
## Filtering Policy Modes
| Mode | Behavior | Use Case |
|---|---|---|
| Strict | Block all harmful content | Public apps, children's apps |
| Moderate | Block high/critical only | General purpose apps |
| Permissive | Log only, don't block | Internal tools, research |
| Custom | Define your own rules | Enterprise use cases |
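A minimal sketch of how the first three modes could map to enforcement decisions — the mode names come from the table above, but the threshold logic is an assumption, and Custom mode (user-defined rules) is omitted:

```typescript
// Hypothetical mapping of policy modes to enforcement behavior.
type Severity = "low" | "medium" | "high" | "critical";
type Mode = "strict" | "moderate" | "permissive";
type Action = "block" | "log";

const rank: Record<Severity, number> = { low: 0, medium: 1, high: 2, critical: 3 };

function decide(mode: Mode, severity: Severity): Action {
  switch (mode) {
    case "strict":
      return "block"; // block all harmful content
    case "moderate":
      // block high/critical only; lower severities are logged
      return rank[severity] >= rank["high"] ? "block" : "log";
    case "permissive":
      return "log"; // log only, never block
  }
  // "custom" mode would consult user-defined rules and is not modeled here.
}
```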
## When Content is Blocked
`blocked-response.json`
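The example file itself was not included here; a blocked response plausibly looks something like the following (the field names, such as `type` and `request_id`, are illustrative, not Cencori's documented schema):

```json
{
  "error": {
    "type": "content_filter_violation",
    "message": "Request blocked by content filtering policy",
    "category": "violence",
    "severity": "high",
    "request_id": "req_abc123"
  }
}
```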
## Handling Content Filter Violations
`handle-filter.ts`
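The original `handle-filter.ts` was not included; here is a sketch of a handler, assuming violations surface as an error object with `type` and `category` fields (both hypothetical names):

```typescript
// Hypothetical sketch: map a content-filter error to a clear user-facing message.
// The error shape (type, category) is assumed, not Cencori's documented schema.
interface FilterError {
  type: string;
  category?: string;
}

const userMessages: Record<string, string> = {
  violence: "Your request contained violent content and was blocked.",
  hate_speech: "Your request contained hateful content and was blocked.",
};

function handleFilterError(err: FilterError): string {
  if (err.type !== "content_filter_violation") {
    return "An unexpected error occurred. Please try again.";
  }
  // Prefer a category-specific message; fall back to a generic policy notice.
  return userMessages[err.category ?? ""] ??
    "Your request was blocked by our content policy.";
}
```

Showing a specific, policy-oriented message (rather than a raw error) follows the best practice of explaining violations clearly to users.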
## Custom Filtering Rules (Enterprise)
Enterprise customers can define custom rules:
- Industry-specific terms (e.g., medical terminology that's acceptable in healthcare)
- Company-specific blocklists
- Domain-specific allowlists
- Regional language variations
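An illustrative shape for such rules, applied as a pre-filter before classification — all names here are hypothetical, not Cencori's configuration API:

```typescript
// Hypothetical custom-rules shape for an enterprise deployment.
interface CustomRules {
  allowlist: string[]; // domain-specific terms that should not be flagged
  blocklist: string[]; // company-specific terms to always block
  locale?: string;     // regional language variation
}

const healthcareRules: CustomRules = {
  allowlist: ["hemorrhage", "overdose"], // acceptable medical terminology
  blocklist: ["internal-codename-x"],
  locale: "en-GB",
};

// Simplified pre-filter: blocklist wins, allowlist short-circuits,
// everything else falls through to the normal ML classifier.
function prefilter(text: string, rules: CustomRules): "allow" | "block" | "classify" {
  const lower = text.toLowerCase();
  if (rules.blocklist.some(t => lower.includes(t))) return "block";
  if (rules.allowlist.some(t => lower.includes(t))) return "allow";
  return "classify";
}
```

A production system would likely suppress individual findings rather than allow the whole text, but the precedence order (blocklist, then allowlist, then classifier) is the core idea.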
## Monitoring Filter Activity
View all content filter incidents in your dashboard:
1. Navigate to your project dashboard
2. Click "Security" in the sidebar
3. Filter by "Content Filter Violation"
4. View the breakdown by:
   - Category (violence, hate speech, etc.)
   - Severity level
   - Trends over time
## Best Practices
- Start with Moderate mode and adjust based on your app's audience
- Show clear error messages to users explaining policy violations
- Review filter incidents weekly to identify abuse patterns
- For creative writing apps, consider Permissive mode with output filtering
- Test edge cases with your specific content
- Combine with prompt injection protection for comprehensive security

