Evaluation Scales

When evaluating the effectiveness of a search system, it's important to assess how relevant the retrieved documents are to a given query. Relevance judgments are expressed using specific evaluation scales. These scales define the granularity and semantics of what is considered "relevant."

Below are the relevance scales available for evaluation runs.

Binary Relevance Scale

The binary relevance scale is the simplest form of relevance measurement. Each document is judged as either:

0 → Not relevant
1 → Relevant

This scale is widely used in metrics like Precision, Recall, and F1, which count each retrieved document as either relevant or not and ignore the degree of relevance.
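As a rough illustration of how binary judgments feed these metrics, the sketch below computes Precision, Recall, and F1 for a single query. The function name `precision_recall_f1` and the document IDs are hypothetical, and the sketch assumes the full set of relevant documents is known; it is not taken from any particular evaluation tool.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for one query from binary judgments."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    # A "hit" is a retrieved document that was judged relevant (score 1).
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: 4 retrieved documents, 3 documents judged relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7"}
p, r, f1 = precision_recall_f1(retrieved, relevant)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here two of the four retrieved documents are relevant and two of the three relevant documents were retrieved. How strongly relevant each hit is plays no part in the result.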

Note

When binary metrics are used with graded or detailed scales, any score greater than 0 is treated as relevant (normalized to 1).
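The normalization described in this note can be sketched as follows. The helper names `binarize` and `precision_at_k` are hypothetical, and treating unjudged documents as not relevant is an assumption made for the example rather than documented behavior.

```python
def binarize(judgments):
    """Map graded or detailed scores to binary: any score > 0 becomes 1."""
    return {doc_id: 1 if score > 0 else 0 for doc_id, score in judgments.items()}


def precision_at_k(ranking, binary_judgments, k):
    """Fraction of the top-k results that are relevant under binary judgments."""
    top_k = ranking[:k]
    if not top_k:
        return 0.0
    # Assumption: documents without a judgment count as not relevant.
    return sum(binary_judgments.get(doc_id, 0) for doc_id in top_k) / len(top_k)


# Hypothetical judgments on the 0-4 graded scale.
graded = {"d1": 3, "d2": 0, "d3": 1, "d4": 4}
binary = binarize(graded)  # {"d1": 1, "d2": 0, "d3": 1, "d4": 1}

ranking = ["d1", "d2", "d3", "d5"]
p_at_4 = precision_at_k(ranking, binary, k=4)
print(p_at_4)
```

A marginally relevant document (score 1) and a perfectly relevant one (score 4) contribute equally once the scores are binarized.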

Graded Relevance Scale

The graded relevance scale allows for multiple levels of relevance, ranging from 0 to 4:

0 → Not relevant
1 → Marginally relevant
2 → Fairly relevant
3 → Highly relevant
4 → Perfectly relevant

This scale is useful for metrics such as DCG, NDCG, and ERR, which reward search systems more for retrieving highly relevant documents at higher ranks.
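To make the rank-sensitivity concrete, here is a minimal sketch of DCG and NDCG over graded judgments. It assumes the exponential-gain formulation, (2^grade - 1) / log2(rank + 1); a linear-gain variant is also common, and this page does not state which one a given tool uses. The sketch further assumes that every judged document appears in the ranked list, so the ideal ordering can be derived from the same grades.

```python
import math


def dcg(grades, k=None):
    """Discounted cumulative gain with exponential gain (an assumed variant)."""
    grades = grades[:k] if k is not None else grades
    return sum(
        (2 ** g - 1) / math.log2(rank + 1)
        for rank, g in enumerate(grades, start=1)
    )


def ndcg(grades, k=None):
    """DCG divided by the DCG of the ideal (descending-grade) ordering."""
    ideal_dcg = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# The same four documents (grades on the 0-4 scale) in two different orders.
good_order = [4, 3, 1, 0]  # most relevant documents ranked first
poor_order = [0, 1, 3, 4]  # most relevant documents ranked last

print(ndcg(good_order), ndcg(poor_order))
```

Both rankings contain the same documents, so a binary metric over the full list could not tell them apart. NDCG scores the first ordering higher because the highly relevant documents sit at the top.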

Detailed Relevance Scale

The detailed relevance scale provides the most fine-grained judgment, ranging from 0 to 9:

0 → Not relevant
1 → Slightly relevant
2 → Somewhat relevant
3 → Moderately relevant
4 → Fairly relevant
5 → Generally relevant
6 → Highly relevant
7 → Very relevant
8 → Extremely relevant
9 → Perfectly relevant

This scale is suited for domains where subtle differences in relevance significantly matter, such as legal or medical search. It works with the same graded metrics as the graded scale (DCG, NDCG, ERR) but provides finer discrimination between results.
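One way to see how the scale's upper bound enters a graded metric is Expected Reciprocal Rank (ERR), sketched below in its commonly cited form, where a grade g maps to a satisfaction probability of (2^g - 1) / 2^max_grade. The function name `err` and the `max_grade` parameter are hypothetical. Whether a given tool computes ERR exactly this way is an assumption.

```python
def err(grades, max_grade):
    """Expected Reciprocal Rank for a ranked list of grades on a 0..max_grade scale."""
    score = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, g in enumerate(grades, start=1):
        if not 0 <= g <= max_grade:
            raise ValueError(f"grade {g} is outside the 0..{max_grade} scale")
        # Probability that a document with this grade satisfies the user.
        p_satisfied = (2 ** g - 1) / (2 ** max_grade)
        score += p_reach * p_satisfied / rank
        p_reach *= 1 - p_satisfied
    return score


# Detailed scale (0-9): "Extremely relevant" (8) and "Very relevant" (7)
# are distinct grades, so swapping them changes the score.
a = err([8, 7, 2], max_grade=9)
b = err([7, 8, 2], max_grade=9)
print(a > b)
```

The `max_grade` argument has to match the scale the judgments were made on (4 for graded, 9 for detailed). Otherwise the satisfaction probabilities are miscalibrated.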

Choosing a Scale

Scale Type	Granularity	Example Metric(s)	Best For
Binary	Low	Precision, Recall, F1	Simpler tasks, classification
Graded	Medium	DCG, NDCG, ERR	Web search, user experience
Detailed	High	DCG, NDCG, ERR, custom metrics	Domain-specific, fine-tuned tasks, ML training

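Choosing the right relevance scale depends on your goals. Binary judgments are the quickest to make, graded judgments offer a middle ground between effort and detail, and detailed judgments support the most fine-grained evaluation. If you're unsure, graded is a reasonable starting point. Make sure the metrics you choose match the level of detail in your judgments: a binary metric applied to graded or detailed scores discards everything beyond relevant versus not relevant.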
