Conversation Quality is a binary metric that assesses whether a chatbot interaction left the user feeling satisfied and positive or frustrated and dissatisfied, based on tone, engagement, and overall experience.
Conversation Quality at a glance
| Property | Description |
|---|---|
| Name | Conversation Quality |
| Category | Agentic AI |
| Can be applied to | Session |
| LLM-as-a-judge Support | ✅ |
| Luna Support | ❌ |
| Protect Runtime Protection | ❌ |
| Value Type | Boolean shown as a percentage confidence score |
When to use this metric
When to Use This Metric
Score interpretation
Expected Score: 80%-100%.060%100%
Poor
Many conversations indicate frustration, impatience, or dissatisfaction directed at the botFair
Excellent
Most conversations reflect positive user sentiment, polite engagement, and satisfactionHow to improve Conversation Quality scores
Some techniques to improve Conversation Quality scores are:- Ensure bots provide clear, empathetic, and concise responses
- Detect and mitigate repeated clarification loops
- Train models to de-escalate external frustration effectively
- Log complete sessions to allow accurate tone assessment
- Mislabeling external frustration as bot-directed
- Incomplete logs
- Abrupt session truncation
Performance Benchmarks
We evaluated Conversation Quality against human expert labels on an internal dataset of agentic conversation samples using top frontier models.| Model | F1 (True) |
|---|---|
| GPT-4.1 | 0.89 |
| GPT-4.1-mini (judges=3) | 0.85 |
| Claude Sonnet 4.5 | 0.85 |
| Gemini 3 Flash | 0.88 |
GPT-4.1 Classification Report
| Precision | Recall | F1-Score | |
|---|---|---|---|
| False | 0.91 | 0.83 | 0.87 |
| True | 0.85 | 0.93 | 0.89 |
Confusion Matrix (Normalized)
Predicted
True
False
Actual
True
0.925
0.075
False
0.173
0.827
0.01.0
Benchmarks based on internal evaluation dataset. Performance may vary by use case.
Related Resources
If you would like to dive deeper or start implementing Conversation Quality, check out the following resources:Examples
- Conversation Quality Examples - Log in and explore the “Conversation Quality” Log Stream in the “Preset Metric Examples” Project to see this metric in action.