Content Filtering

Automatically filter harmful, inappropriate, or policy-violating content in AI requests and responses.

What is Content Filtering?

Content filtering prevents your AI application from processing or generating harmful content, including:

  • Hate speech and discrimination
  • Violence and graphic content
  • Sexual or adult content
  • Self-harm and dangerous activities
  • Illegal activities

Filter Categories

| Category | Description | Examples |
| --- | --- | --- |
| Hate Speech | Attacks based on identity | Racial slurs, religious attacks |
| Violence | Graphic or threatening content | Violent threats, gore |
| Sexual Content | Adult or explicit material | NSFW imagery, explicit text |
| Self-Harm | Encouragement of dangerous behavior | Suicide methods, self-injury |
| Illegal Activity | Instructions for crimes | Drug manufacturing, theft |
| Profanity | Offensive language | Curse words, slurs |

How Content Filtering Works

1. Input Scanning

Before sending to the AI provider, Cencori scans the user's prompt for harmful content.

2. Classification

ML models categorize content and assign severity scores (low, medium, high, critical).
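
As a mental model, a classification result might look like the sketch below. The type and field names (including `confidence`) are illustrative assumptions, not Cencori's actual schema.

```ts
// Illustrative types only; these names are assumptions, not the real API.
type FilterCategory =
  | "hate_speech"
  | "violence"
  | "sexual_content"
  | "self_harm"
  | "illegal_activity"
  | "profanity";

type Severity = "low" | "medium" | "high" | "critical";

interface ClassificationResult {
  category: FilterCategory;
  severity: Severity;
  confidence: number; // assumed field: model confidence between 0 and 1
}
```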

3. Policy Enforcement

Based on your configured policy, the request is either blocked, flagged, or allowed with warnings.
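
For intuition, here is a minimal sketch of that decision in TypeScript, using the mode and severity definitions from this page (Custom mode omitted). The `enforce` function is hypothetical, not Cencori's implementation.

```ts
type Mode = "strict" | "moderate" | "permissive";
type Severity = "low" | "medium" | "high" | "critical";
type Action = "block" | "flag" | "allow";

// Hypothetical sketch of the policy decision described above;
// the real enforcement logic lives inside Cencori.
function enforce(mode: Mode, severity: Severity): Action {
  switch (mode) {
    case "strict":
      return "block"; // block all harmful content
    case "moderate":
      // block only high/critical; allow the rest with a warning flag
      return severity === "high" || severity === "critical" ? "block" : "flag";
    case "permissive":
      return "allow"; // log only, never block
  }
}
```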

4. Output Monitoring

AI responses are also scanned. If harmful content is generated, it's blocked before reaching the user.
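
Putting the four steps together, the request lifecycle looks roughly like the sketch below. `scan` and `callProvider` are hypothetical placeholders, not Cencori's SDK.

```ts
type Action = "block" | "flag" | "allow";

// Placeholder stubs: in practice these are Cencori's scanning service
// and your AI provider call. Both names here are hypothetical.
declare function scan(text: string): Promise<{ action: Action }>;
declare function callProvider(prompt: string): Promise<string>;

// Sketch of the full lifecycle: scan the prompt, call the provider,
// then scan the generated response before returning it.
async function guardedCompletion(prompt: string): Promise<string> {
  const input = await scan(prompt); // steps 1-2: scan and classify
  if (input.action === "block") {
    throw new Error("Request blocked by content filter"); // step 3
  }

  const response = await callProvider(prompt);

  const output = await scan(response); // step 4: output monitoring
  if (output.action === "block") {
    throw new Error("Response blocked before reaching the user");
  }
  return response;
}
```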

Filtering Policy Modes

| Mode | Behavior | Use Case |
| --- | --- | --- |
| Strict | Block all harmful content | Public apps, children's apps |
| Moderate | Block high/critical only | General purpose apps |
| Permissive | Log only, don't block | Internal tools, research |
| Custom | Define your own rules | Enterprise use cases |
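
The mode is typically set per project. As a rough sketch of what that configuration could look like (the field names are assumptions, not the real schema):

```ts
// Hypothetical project configuration; field names are illustrative.
const contentFilterConfig = {
  mode: "moderate",      // "strict" | "moderate" | "permissive" | "custom"
  logAllIncidents: true, // also record flagged and allowed events
};
```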

When Content is Blocked

blocked-response.json
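
When a request or response violates your policy, your application receives an error instead of a completion. The payload below is an illustrative sketch: the error code and field names are assumptions, not the exact API response.

```json
{
  "error": {
    "code": "content_filter_violation",
    "message": "Request blocked by content filter",
    "category": "hate_speech",
    "severity": "high"
  }
}
```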

Handling Content Filter Violations

handle-filter.ts
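
One way to handle such an error and show users a clear message. This is a sketch that assumes the illustrative payload above; `cencori.chat` is a hypothetical client method, not a confirmed SDK call.

```ts
// Sketch of client-side handling: catch filter violations and return a
// clear, policy-specific message instead of surfacing a raw API error.
declare const cencori: { chat(prompt: string): Promise<string> };

interface FilterError {
  code: string;
  category?: string;
}

async function sendPrompt(prompt: string): Promise<string> {
  try {
    return await cencori.chat(prompt);
  } catch (err) {
    const details = (err as { error?: FilterError }).error;
    if (details?.code === "content_filter_violation") {
      // Explain the policy violation so the user can rephrase.
      return `Your message was blocked (${details.category ?? "policy violation"}). Please rephrase and try again.`;
    }
    throw err; // unrelated errors propagate unchanged
  }
}
```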

Custom Filtering Rules (Enterprise)

Enterprise customers can define custom rules, as sketched after this list:

  • Industry-specific terms (e.g., medical terminology that's acceptable in healthcare)
  • Company-specific blocklists
  • Domain-specific allowlists
  • Regional language variations
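
A sketch of what such a rule set could look like; the structure and field names are illustrative assumptions, not the actual enterprise configuration format.

```ts
// Illustrative rule set; names and structure are assumptions.
const customRules = {
  // Industry-specific allowlist, e.g. medical terms acceptable in healthcare
  allow: ["laceration", "intubation"],
  // Company-specific blocklist
  block: ["project-codename"],
  // Regional language variations to account for
  locales: ["en-US", "en-GB"],
};
```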

Monitoring Filter Activity

View all content filter incidents in your dashboard:

  1. Navigate to your project dashboard
  2. Click "Security" in the sidebar
  3. Filter by "Content Filter Violation"
  4. View breakdown by:
    • Category (violence, hate speech, etc.)
    • Severity level
    • Trends over time

Best Practices

  • Start with Moderate mode and adjust based on your app's audience
  • Show clear error messages to users explaining policy violations
  • Review filter incidents weekly to identify abuse patterns
  • For creative writing apps, consider Permissive mode with output filtering
  • Test edge cases with your specific content
  • Combine with prompt injection protection for comprehensive security