Toxicity Detection flags whether a response contains hateful or toxic content.
Categories of toxicity
Toxicity Detection covers the following types of toxic content:
- Hate Speech: Statements that demean, dehumanize, or attack individuals or groups based on identity factors like race, gender, or religion.
- Offensive Content: Vulgar, abusive, or overly profane language used to provoke or insult.
- Sexual Content: Explicit or inappropriate sexual statements that may be offensive or unsuitable in context.
- Violence or Harm: Advocacy or description of physical harm, abuse, or violent actions.
- Illegal or Unethical Guidance: Instructions or encouragement for illegal or unethical actions.
- Manipulation or Exploitation: Language intended to deceive, exploit, or manipulate individuals for harmful purposes.
Calculation method
Toxicity detection is computed by a dedicated model, described below.
Model Architecture
The detection system employs a Small Language Model (SLM) that leverages both open-source and internal datasets to identify various forms of toxic content across multiple categories.
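For intuition, the detection step is a standard text-classification pass over the response. The sketch below uses a public toxicity classifier from the Hugging Face Hub (unitary/toxic-bert) purely as a stand-in; it is not the internal SLM described above, and the 0.5 threshold is an assumption for illustration.

```python
# Illustrative only: a generic text-classification pass over a model response.
# "unitary/toxic-bert" is a public stand-in from the Hugging Face Hub, not the
# internal SLM described above; the 0.5 threshold is likewise an assumption.
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

response = "You are completely worthless and everyone hates you."

# top_k=None returns a score for every category (toxic, insult, threat, ...).
scores = classifier([response], top_k=None)[0]

THRESHOLD = 0.5
flagged = [(s["label"], round(s["score"], 3)) for s in scores if s["score"] >= THRESHOLD]
print("flagged categories:", flagged or "none")
```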
Performance Metrics
The model achieves 96% accuracy when evaluated against comprehensive validation sets drawn from the following established datasets:
- Toxic Comment Classification Challenge: Open-source dataset for toxic content detection
- Jigsaw Unintended Bias: Dataset focused on identifying biased toxic content
- Jigsaw Multilingual: Multi-language toxic content classification
Optimizing your AI system
Addressing Toxicity in Your System
Identify responses that contain toxic content and take preventive measures to ensure safe and appropriate AI interactions:
- Implement guardrails: Flag toxic responses before they are served so harmful content never reaches users (see the sketch below).
- Fine-tune models: Adjust model behavior through fine-tuning to reduce toxic outputs.
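As a concrete illustration of the guardrail idea, here is a minimal sketch. The `toxicity_score` callable, the fallback message, and the 0.5 threshold are assumptions for illustration; substitute whatever detector and cutoff you actually use.

```python
# Hypothetical guardrail: gate a response before it is served to the user.
# `toxicity_score` is a placeholder for whichever toxicity detector you use.
from typing import Callable

THRESHOLD = 0.5  # assumed cutoff; lower it to block more aggressively
FALLBACK = "I can't share that response. Let me try to help another way."

def guarded_reply(response: str, toxicity_score: Callable[[str], float]) -> str:
    """Serve the response only if it passes the toxicity check."""
    score = toxicity_score(response)
    if score >= THRESHOLD:
        # Keep a record of blocked responses; they make useful fine-tuning data.
        print(f"[guardrail] blocked response (toxicity={score:.2f})")
        return FALLBACK
    return response

# Usage with stubbed scorers (replace the lambdas with a real detector):
print(guarded_reply("Here is a helpful answer.", lambda text: 0.02))
print(guarded_reply("You absolute idiot.", lambda text: 0.91))
```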
Performance Benchmarks
We evaluated Toxicity Detection against gold labels on the “test” split of the rungalileo/toxicity dataset using top frontier models. This dataset was created by combining the toxicity labels in Arsive/toxicity_classification_jigsaw.

| Model | F1 (True) |
|---|---|
| GPT-4.1 | 0.88 |
| GPT-4.1-mini (judges=3) | 0.87 |
| Claude Sonnet 4.5 | 0.87 |
| Gemini 3 Flash | 0.87 |
GPT-4.1 Classification Report
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| False | 0.84 | 0.96 | 0.90 |
| True | 0.95 | 0.82 | 0.88 |
Confusion Matrix (Normalized)
|  | Predicted True | Predicted False |
|---|---|---|
| Actual True | 0.817 | 0.183 |
| Actual False | 0.044 | 0.956 |
Benchmarks based on the rungalileo/toxicity dataset. Performance may vary by use case.
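If you want to reproduce these kinds of figures for your own evaluations, the F1 score, classification report, and normalized confusion matrix can be computed with scikit-learn once you have boolean gold labels and model judgments. A minimal sketch (the label arrays below are placeholders, not the benchmark data):

```python
# Placeholder arrays stand in for gold labels and model judgments; this only
# shows how F1 (True), the classification report, and the normalized confusion
# matrix above are computed, not the benchmark data itself.
from sklearn.metrics import classification_report, confusion_matrix, f1_score

y_true = [True, True, False, False, True, False, False, True]   # gold toxicity labels
y_pred = [True, False, False, False, True, False, True, True]   # model judgments

print("F1 (True):", round(f1_score(y_true, y_pred, pos_label=True), 2))
print(classification_report(y_true, y_pred, target_names=["False", "True"]))

# Rows are actual classes, columns are predicted classes; each row sums to 1.
print(confusion_matrix(y_true, y_pred, labels=[True, False], normalize="true"))
```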
Related Resources
If you would like to dive deeper or start implementing Toxicity Detection, check out the following resources:
Examples
- Toxicity Examples - Log in and explore the “Toxicity” Log Stream in the “Preset Metric Examples” Project to see this metric in action.