
Toxicity Detection flags whether a response contains hateful or toxic information.

Categories of toxicity

Toxicity Detection covers the following types of toxic content:

Hate Speech: Statements that demean, dehumanize, or attack individuals or groups based on identity factors like race, gender, or religion.
Offensive Content: Vulgar, abusive, or overly profane language used to provoke or insult.
Sexual Content: Explicit or inappropriate sexual statements that may be offensive or unsuitable in context.
Violence or Harm: Advocacy or description of physical harm, abuse, or violent actions.
Illegal or Unethical Guidance: Instructions or encouragement for illegal or unethical actions.
Manipulation or Exploitation: Language intended to deceive, exploit, or manipulate individuals for harmful purposes.

Calculation method

Toxicity detection is computed through a specialized process:
1. Model Architecture: The detection system employs a Small Language Model (SLM) trained on both open-source and internal datasets to identify various forms of toxic content across multiple categories.

2. Performance Metrics: The model achieves a 96% success rate when evaluated against validation sets drawn from multiple established datasets.

3. Validation Sources: The system’s effectiveness is verified against industry-standard benchmarks, including the Toxic Comment Classification Challenge, the Jigsaw Unintended Bias dataset, and the Jigsaw Multilingual dataset for robust cross-lingual detection.

Toxic Comment Classification Challenge: Open-source dataset for toxic content detection.
Jigsaw Unintended Bias: Dataset focused on identifying biased toxic content.
Jigsaw Multilingual: Multi-language toxic content classification.
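
The internal SLM itself is not publicly exposed, but the general approach (a small classifier that scores a response across toxicity categories and flags it when any score is high) can be sketched with the open-source Detoxify models, whose checkpoints are trained on the same three Jigsaw datasets listed above. This is an illustrative stand-in, not Galileo's implementation; the flag_toxicity helper and the 0.5 threshold are assumptions.

```python
# Illustrative sketch only: Detoxify ("pip install detoxify") is an open-source
# stand-in for the internal SLM described above, NOT Galileo's implementation.
# Its "original", "unbiased", and "multilingual" checkpoints correspond to the
# three Jigsaw datasets listed in this section. The 0.5 threshold is assumed.
from detoxify import Detoxify

model = Detoxify("original")  # checkpoint trained on the Toxic Comment Classification Challenge

def flag_toxicity(response: str, threshold: float = 0.5) -> dict:
    """Score a response across toxicity categories and flag it if any score crosses the threshold."""
    scores = model.predict(response)  # e.g. {"toxicity": 0.01, "insult": 0.002, ...}
    return {
        "scores": scores,
        "is_toxic": any(score >= threshold for score in scores.values()),
    }

print(flag_toxicity("Thanks, that explanation was really clear."))
```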

Optimizing your AI system

Addressing Toxicity in Your System

When toxic content is detected in your system, consider these approaches:
Implement guardrails: Flag toxic responses before they are served to users (see the sketch after this list).
Fine-tune models: Adjust model behavior to reduce toxic outputs.
In both cases, the goal is to identify responses that contain toxic content and take preventive measures to ensure safe and appropriate AI interactions.
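
Below is a minimal sketch of the guardrail approach, reusing the illustrative flag_toxicity scorer from the previous section. The generate_response function and the fallback message are hypothetical placeholders, and blocking versus regenerating is an application-level choice rather than part of the metric.

```python
# Minimal guardrail sketch. generate_response and SAFE_FALLBACK are hypothetical
# placeholders; flag_toxicity is the illustrative scorer sketched earlier.
SAFE_FALLBACK = "I can't help with that request."

def generate_response(prompt: str) -> str:
    # Stand-in for your actual LLM call.
    return "A polite, on-topic answer."

def guarded_response(prompt: str) -> str:
    """Generate a response, but withhold it if it is flagged as toxic."""
    response = generate_response(prompt)
    verdict = flag_toxicity(response)  # check toxicity before serving
    if verdict["is_toxic"]:
        # Log the flagged response for review instead of serving it.
        print(f"Blocked toxic response; scores={verdict['scores']}")
        return SAFE_FALLBACK
    return response

print(guarded_response("Explain how toxicity detection works."))
```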

Performance Benchmarks

We evaluated Toxicity Detection against gold labels on the “test” split of the rungalileo/toxicity dataset using top frontier models. This dataset was created by combining the toxicity labels in Arsive/toxicity_classification_jigsaw.
Model                        F1 (True)
GPT-4.1                      0.88
GPT-4.1-mini (judges=3)      0.87
Claude Sonnet 4.5            0.87
Gemini 3 Flash               0.87
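
For readers who want to reproduce this kind of evaluation, the sketch below computes F1 on the positive ("True") class with scikit-learn. The dataset column names ("text", "label") and the judge function are assumptions; the actual evaluation harness is not documented here.

```python
# Sketch of computing an F1 (True) score like the ones above.
# The column names ("text", "label") and the judge function are assumptions.
from datasets import load_dataset
from sklearn.metrics import f1_score

test_set = load_dataset("rungalileo/toxicity", split="test")

def judge_is_toxic(text: str) -> bool:
    # Placeholder for the model being evaluated (e.g. a frontier-model judge).
    return flag_toxicity(text)["is_toxic"]

y_true = [bool(row["label"]) for row in test_set]            # gold labels (assumed column name)
y_pred = [judge_is_toxic(row["text"]) for row in test_set]   # predictions (assumed column name)

print("F1 (True):", round(f1_score(y_true, y_pred, pos_label=True), 2))
```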

GPT-4.1 Classification Report

          Precision   Recall   F1-Score
False     0.84        0.96     0.90
True      0.95        0.82     0.88

Confusion Matrix (Normalized)

                  Predicted True   Predicted False
Actual True       0.817            0.183
Actual False      0.044            0.956
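
The report and matrix above can be regenerated with scikit-learn from the same y_true and y_pred lists used in the F1 sketch; note that in a row-normalized confusion matrix each row sums to 1, so the diagonal entries are the per-class recall values (0.817 and 0.956 here).

```python
# Sketch of reproducing the tables above, reusing the y_true / y_pred lists
# from the previous sketch (assumed to be available).
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision / recall / F1, matching the classification report layout.
print(classification_report(y_true, y_pred, digits=2))

# Row-normalized confusion matrix: each row (actual class) sums to 1,
# so the diagonal entries equal the per-class recall values.
print(confusion_matrix(y_true, y_pred, labels=[True, False], normalize="true"))
```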
Benchmarks based on the rungalileo/toxicity dataset. Performance may vary by use case.
If you would like to dive deeper or start implementing Toxicity Detection, check out the following resources:

Examples

  • Toxicity Examples - Log in and explore the “Toxicity” Log Stream in the “Preset Metric Examples” Project to see this metric in action.