Evaluation Scales
When evaluating the effectiveness of a search system, it's important to assess how relevant the retrieved documents are to a given query. Relevance judgments are expressed using specific evaluation scales. These scales define the granularity and semantics of what is considered "relevant."
Below are the relevance scales available for evaluation runs.
Binary Relevance Scale
The binary relevance scale is the simplest form of relevance measurement. Each document is judged as either:
- 0 → Not relevant
- 1 → Relevant
This scale is widely used in metrics like Precision, Recall, and F-Score, where the system is rewarded only for retrieving relevant documents, regardless of the degree of relevance.
When binary metrics are used with graded or detailed scales, any score greater than 0 is treated as relevant (normalized to 1).
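The normalization rule above can be sketched in a few lines. This is a minimal illustration, not the evaluation engine's actual code: the function names `binarize` and `precision_at_k` and the sample judgments are hypothetical, and precision is divided by `k` even when fewer than `k` results exist, which is one common convention.

```python
def binarize(judgments):
    """Map any grade above 0 to 1 (relevant); a grade of 0 stays 0."""
    return [1 if grade > 0 else 0 for grade in judgments]


def precision_at_k(judgments, k):
    """Fraction of the top-k results that are relevant after binarization."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = binarize(judgments[:k])
    return sum(top_k) / k


# Hypothetical graded judgments (0-4) for a ranked result list.
graded = [3, 0, 1, 4, 0]
print(binarize(graded))           # [1, 0, 1, 1, 0]
print(precision_at_k(graded, 5))  # 0.6
```

Note that the grades 1, 3, and 4 all collapse to the same value, so a binary metric cannot tell a marginally relevant result from a perfectly relevant one.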
Graded Relevance Scale
The graded relevance scale allows for multiple levels of relevance, ranging from 0 to 4:
- 0 → Not relevant
- 1 → Marginally relevant
- 2 → Fairly relevant
- 3 → Highly relevant
- 4 → Perfectly relevant
This scale is useful for metrics such as DCG, NDCG, and ERR, which reward search systems more for retrieving highly relevant documents at higher ranks.
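To show how graded judgments feed a rank-aware metric, here is a minimal sketch of DCG and NDCG. It assumes the exponential gain `2^grade - 1` and a `log2(rank + 1)` discount, which is one widely used formulation; the product may use a different variant. The ideal ordering is computed only from the grades passed in, and the sample lists are hypothetical.

```python
import math


def dcg(grades, k=None):
    """Discounted cumulative gain with exponential gain (2^grade - 1)."""
    grades = grades[:k] if k is not None else grades
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades, start=1))


def ndcg(grades, k=None):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0


# The same four documents, ranked best-first and worst-first.
good_order = [4, 3, 1, 0]
bad_order = [0, 1, 3, 4]
print(ndcg(good_order))                    # 1.0
print(ndcg(bad_order) < ndcg(good_order))  # True
```

Both lists contain the same documents, so a binary metric over the full list would score them identically; the graded metric penalizes the ordering that buries the highly relevant results.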
Detailed Relevance Scale
The detailed relevance scale provides the most fine-grained judgment, ranging from 0 to 9:
- 0 → Not relevant
- 1 → Slightly relevant
- 2 → Somewhat relevant
- 3 → Moderately relevant
- 4 → Fairly relevant
- 5 → Generally relevant
- 6 → Highly relevant
- 7 → Very relevant
- 8 → Extremely relevant
- 9 → Perfectly relevant
This scale is suited for domains where subtle differences in relevance significantly matter, such as legal or medical search. It works with the same graded metrics as the graded scale (DCG, NDCG, ERR) but provides finer discrimination between results.
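One practical consequence of sharing metrics across scales is that some of them depend on the scale's maximum grade. The sketch below uses the standard cascade-model definition of ERR, where the probability that a result satisfies the user is `(2^grade - 1) / 2^max_grade`. The function name `err` and its `max_grade` parameter are illustrative assumptions, not the product's API.

```python
def err(grades, max_grade):
    """Expected reciprocal rank under the cascade user model."""
    p_reach = 1.0  # probability the user gets as far as the current rank
    score = 0.0
    for rank, g in enumerate(grades, start=1):
        p_satisfied = (2 ** g - 1) / (2 ** max_grade)
        score += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return score


# The same numbers mean different things on different scales:
# a 4 is "perfectly relevant" on the graded scale (max 4)
# but only "fairly relevant" on the detailed scale (max 9).
grades = [4, 2, 0]
print(err(grades, max_grade=4) > err(grades, max_grade=9))  # True
```

Under this formulation, judgments collected on the detailed scale should be scored with a maximum of 9; reusing a maximum of 4 would push the satisfaction probability above 1 for grades of 5 or higher.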
Choosing a Scale
| Scale Type | Granularity | Example Metric(s) | Best For |
|---|---|---|---|
| Binary | Low | Precision, Recall, F1 | Simpler tasks, classification |
| Graded | Medium | DCG, NDCG, ERR | Web search, user experience |
| Detailed | High | DCG, NDCG, ERR, custom metrics | Domain-specific, fine-tuned tasks, ML training |
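Choosing the right relevance scale depends on your goals. Binary judgments are the quickest to collect, graded judgments offer a middle ground, and detailed judgments support the finest-grained evaluation. If you're unsure, graded is a reasonable starting point. Make sure your metrics match the level of detail in your judgments: a binary metric treats every score above 0 as 1, so any extra granularity in graded or detailed judgments is lost when it is applied.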