
Sexism Detection flags whether a response contains sexist content. The output is a binary classification: the response is either sexist or not sexist.

Calculation method

Sexism detection is computed through a specialized process:
1. Model Architecture

The detection system is built on a Small Language Model (SLM) that combines training from both open-source datasets and carefully curated internal datasets to identify various forms of sexist content.
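For illustration, the sketch below shows how a small binary text classifier of this kind could be applied to a response using the Hugging Face `transformers` pipeline. The model id and label names are hypothetical placeholders; the production SLM described above is internal and not published.

```python
# A minimal sketch, assuming a hypothetical checkpoint and label scheme;
# the production SLM is internal, so this is illustrative only.
from transformers import pipeline

# "your-org/sexism-detector-slm" is a placeholder model id, not a real checkpoint.
classifier = pipeline("text-classification", model="your-org/sexism-detector-slm")

response_text = "Example model response to evaluate."
result = classifier(response_text)[0]       # e.g. {"label": "sexist", "score": 0.97}

# Map the classifier label to the metric's binary output.
is_sexist = result["label"] == "sexist"     # assumed label name
print(is_sexist, round(result["score"], 3))
```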
2. Performance Validation

The model demonstrates robust detection capabilities, with an 83% accuracy rate when tested against the Explainable Detection of Online Sexism (EDOS) dataset, a widely recognized benchmark for sexism detection.
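As a rough illustration of this validation step, the sketch below computes accuracy against gold labels from a local copy of the EDOS data. The file path, column names, and the `predict_sexist` stub are placeholders; the actual evaluation pipeline and model are internal.

```python
# A minimal sketch of the validation step, under the assumption that a labelled
# EDOS CSV is available locally and that a detector callable exists.
import pandas as pd
from sklearn.metrics import accuracy_score

def predict_sexist(text: str) -> bool:
    """Placeholder: replace with a call to your sexism detector; returns True when flagged."""
    raise NotImplementedError

edos = pd.read_csv("edos_labelled_aggregated.csv")    # hypothetical local copy of EDOS
gold = (edos["label_sexist"] == "sexist").tolist()    # assumed gold-label column
preds = [predict_sexist(text) for text in edos["text"]]

print(f"Accuracy: {accuracy_score(gold, preds):.2%}")  # the SLM reaches ~83% on this benchmark
```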

Optimizing your AI system

Addressing Sexism in Your System

When sexist content is detected in your system, consider these approaches:
Implement guardrails: Flag responses before they are served so sexist content never reaches users (see the guardrail sketch after this list).
Fine-tune models: Adjust model behavior to reduce sexist outputs.
Monitor responses: Identify responses that contain sexist content and take preventive measures to ensure fair and unbiased AI interactions.
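The sketch below shows one way a guardrail could wrap the detector: the detection call is passed in as a function, and flagged responses are replaced with a safe fallback before being served. The function names and fallback message are illustrative, not part of the product API.

```python
# A minimal guardrail sketch, assuming the detector is exposed as a callable
# that returns True when a text is classified as sexist.
from typing import Callable

SAFE_FALLBACK = "I'm not able to share that response. Let me rephrase it respectfully."

def guarded_response(candidate: str, detect_sexism: Callable[[str], bool]) -> str:
    """Check a candidate response before it is served; block it when sexism is detected."""
    if detect_sexism(candidate):
        # Block the flagged response (and optionally log it for review or fine-tuning data).
        return SAFE_FALLBACK
    return candidate

# Example usage with a trivial stand-in detector (replace with the real metric call).
print(guarded_response("Hello! How can I help today?", detect_sexism=lambda text: False))
```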

Performance Benchmarks

We evaluated Sexism Detection against gold labels on the “test” split of the TomData/TG-sexism_balanced open-source dataset using top frontier models.
| Model | F1 (True) |
| --- | --- |
| GPT-4.1 | 0.91 |
| GPT-4.1-mini (judges=3) | 0.89 |
| Claude Sonnet 4.5 | 0.87 |
| Gemini 3 Flash | 0.89 |

GPT-4.1 Classification Report

| | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| False | 0.93 | 0.88 | 0.90 |
| True | 0.88 | 0.94 | 0.91 |
Confusion Matrix (Normalized)

| Actual \ Predicted | True | False |
| --- | --- | --- |
| True | 0.938 | 0.062 |
| False | 0.123 | 0.877 |
Benchmarks based on the TomData/TG-sexism_balanced open-source dataset. Performance may vary by use case.
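If you want to reproduce figures like these for your own judge, the sketch below shows how the reported numbers map onto standard scikit-learn calls: F1 on the True class, a per-class classification report, and a row-normalized confusion matrix. The `gold` and `preds` lists are placeholders for labels and predictions on the dataset's test split.

```python
# A minimal sketch of the benchmark computation, assuming boolean gold labels
# and predictions from the dataset's test split (placeholder values below).
from sklearn.metrics import classification_report, confusion_matrix, f1_score

gold = [True, False, True, False]   # gold labels (placeholder)
preds = [True, False, True, True]   # judge predictions (placeholder)

print(f"F1 (True): {f1_score(gold, preds, pos_label=True):.2f}")
print(classification_report(gold, preds, target_names=["False", "True"]))

# Rows are actual classes (True first), normalized so each row sums to 1.
print(confusion_matrix(gold, preds, labels=[True, False], normalize="true"))
```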
If you would like to dive deeper or start implementing Sexism Detection, check out the following resources:

Examples

  • Sexism Examples - Log in and explore the “Sexism” Log Stream in the “Preset Metric Examples” Project to see this metric in action.