Context Relevance (Query Adherence)

Context Relevance measures whether your retrieved context, taken together, contains enough information to fully answer the user query.

Context Relevance scores reflect confidence in that sufficiency: high values indicate strong confidence that the retrieved context is enough to fully answer the question. Low values are a sign that you need to increase your Top K, modify your retrieval strategy, or use better embeddings.

Context Relevance differs from Context Adherence: Context Relevance evaluates whether the retrieved context contains what is needed to answer the user’s query, whereas Context Adherence measures how well the response aligns with the provided context.

Chunk Relevance vs. Context Relevance

Chunk Relevance evaluates each chunk individually: does this chunk contain anything useful for answering the query?
Context Relevance evaluates the retrieved context as a whole: do all of these chunks, taken together, cover everything needed to answer the query end-to-end?

Use Chunk Relevance when you’re tuning chunking or reranking (which chunks should show up at all), and Context Relevance when you’re deciding if retrieval “succeeded” for a given query and whether to adjust Top K, retriever configuration, or fallback behavior.
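The difference between the two granularities can be illustrated with a toy sketch. It assumes, purely for illustration, that a query can be reduced to a set of required facts and that each chunk can be reduced to the set of facts it states. The function names and the fact labels are hypothetical and are not how the metric is computed in practice, where a model judges the text.

```python
from typing import List, Set


def chunk_relevance(required: Set[str], chunk_facts: Set[str]) -> bool:
    """Toy per-chunk check: the chunk covers at least one required fact."""
    return bool(required & chunk_facts)


def context_relevance(required: Set[str], chunks: List[Set[str]]) -> bool:
    """Toy whole-context check: the chunks together cover every required fact."""
    covered: Set[str] = set().union(*chunks) if chunks else set()
    return required <= covered


# Hypothetical query: "How long is the refund window and how is the refund paid?"
required = {"refund_window", "refund_method"}
chunks = [
    {"refund_window"},   # useful, but only answers half of the query
    {"shipping_time"},   # off-topic for this query
]

per_chunk = [chunk_relevance(required, c) for c in chunks]
whole = context_relevance(required, chunks)
print(per_chunk, whole)
```

Here the first chunk is individually relevant, yet the context as a whole is still insufficient because nothing covers the refund method. Adding a chunk that states it would flip the whole-context check without changing the per-chunk results for the existing chunks.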

Reading Context Relevance with Context Precision

High Context Relevance & High Context Precision: Retrieved context is both sufficient and mostly noise-free — focus next on generation quality and grounding.
High Context Relevance & Low Context Precision: The right information is present but mixed with a lot of irrelevant chunks — keep your recall but prune noise (better filters, reranking, or a lower Top K).
Low Context Relevance & High Context Precision: Most chunks are on-topic, but together they still miss pieces needed for a full answer — broaden retrieval (higher Top K, alternate retriever, or additional data sources).
Low Context Relevance & Low Context Precision: Retrieval is both incomplete and noisy — revisit embeddings, indexing, and query formulation end-to-end.
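The four combinations above can be sketched as a small triage helper. This is a minimal sketch, assuming both metrics are available as scores between 0 and 1. The function name and the 0.5 cutoff are hypothetical choices for illustration, not product defaults, and should be tuned on your own data.

```python
def retrieval_diagnosis(
    context_relevance: float,
    context_precision: float,
    threshold: float = 0.5,  # hypothetical cutoff between "low" and "high"
) -> str:
    """Map a (Context Relevance, Context Precision) pair to a suggested next step."""
    high_relevance = context_relevance >= threshold
    high_precision = context_precision >= threshold

    if high_relevance and high_precision:
        return "sufficient and clean: focus on generation quality and grounding"
    if high_relevance:
        return "sufficient but noisy: prune with filters, reranking, or a lower Top K"
    if high_precision:
        return "on-topic but incomplete: broaden retrieval (higher Top K, more sources)"
    return "incomplete and noisy: revisit embeddings, indexing, and query formulation"


print(retrieval_diagnosis(0.9, 0.3))
```

Keeping the threshold as a parameter makes it easy to apply a stricter bar to queries where an incomplete answer is costly.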

Best practices

Use for Results Assessment

Use Context Relevance to assess the quality of your retrieval system’s results and to determine whether the retrieved context covers what each query asks for.

Combine with Other Metrics

Use Context Relevance alongside Context Adherence, correctness, and completeness metrics for a comprehensive view of response quality.

Performance Benchmarks

We evaluated Context Relevance against human expert labels on an internal dataset of RAG samples using several frontier models.

Model	F1 (True)
GPT-4.1	0.82
GPT-4.1-mini (judges=3)	0.85
Claude Sonnet 4.5	0.81
Gemini 3 Flash	0.81

GPT-4.1 Classification Report

	Precision	Recall	F1-Score
False	0.82	0.99	0.89
True	0.97	0.71	0.82
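As a sanity check on how the columns relate, F1 is the harmonic mean of precision and recall. The sketch below recomputes the F1-Score for the True class from the rounded precision and recall in the table. The helper name is illustrative. Because the published values are rounded to two decimals, recomputing from them can differ from a reported figure in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values for the True class from the GPT-4.1 classification report.
f1_true = f1_score(precision=0.97, recall=0.71)
print(round(f1_true, 2))  # 0.82
```

The gap between the two inputs is the practical takeaway: with this judge, a True prediction is rarely wrong (precision 0.97), but about 29% of the contexts that experts labeled sufficient are predicted False (recall 0.71).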

Confusion Matrix (Normalized)

	Predicted True	Predicted False
Actual True	0.708	0.292
Actual False	0.014	0.986

Benchmarks based on internal evaluation dataset. Performance may vary by use case.

If you would like to dive deeper or start implementing Context Relevance, check out the following resources:

Examples

Context Relevance Examples - Log in and explore the “Context Relevance” Log Stream in the “Preset Metric Examples” Project to see this metric in action.