Evaluation Metrics

Evaluating the performance of search systems involves measuring how well the retrieved documents satisfy a user's query or information need. The following are metrics available for an evaluation run.

Binary Relevance Metrics

Binary relevance metrics can be used with results graded with binary relevance, as well as graded and detailed relevance, where scores > 1 are normalized to 1.

Precision

Precision measures the fraction of retrieved documents that are relevant. At a cutoff depth $k$ :

\text{Precision@}k = \frac{|\{\text{relevant documents in top } k\}|}{k}

A Precision@10 of 0.7 means that 7 out of the top 10 results are relevant. Precision does not consider the order of relevant results within the top $k$ .

Recall

Recall measures the fraction of all relevant documents that appear in the retrieved results. At a cutoff depth $k$ :

\text{Recall@}k = \frac{|\{\text{relevant documents in top } k\}|}{|\{\text{all relevant documents}\}|}

A Recall@10 of 0.5 means that half of all relevant documents for the query appear in the top 10 results. Recall does not consider the order of relevant results within the top $k$ .

F-Score

The F-Score balances precision and recall using a harmonic mean. It's parameterized by $β$ , where $β > 1$ favors recall and $β < 1$ favors precision.

F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}

For the common $F_1$ Score $(β = 1)$ :

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Mean Average Precision (MAP)

MAP computes the mean of average precision scores across multiple queries.

\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)

Where $AP(q)$ is the average precision for query $q$ , defined as:

\text{AP}(q) = \frac{1}{R} \sum_{k=1}^N P(k) \cdot \text{rel}(k)

where

$R$ is the total number of relevant documents.
$P(k)$ is precision at cutoff $k$ .
$rel(k)$ is 1 if the document at rank $k$ is relevant, else 0.

Best@k

Best $@k$ evaluates whether the top- $k$ includes the single best item:

\text{Best@}k = \begin{cases} 1 & \text{if the best item is in top-}k \\ 0 & \text{otherwise} \end{cases}

This is especially useful when a single item carries outsized importance (e.g. the best answer in a Q&A system).

Mean Reciprocal Rank (MRR)

MRR measures how quickly a relevant result appears in the ranked list, averaged across multiple queries. For each query, the reciprocal rank is the inverse of the position of the first relevant result:

\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}

where

$Q$ is the set of queries.
$\text{rank}_q$ is the position of the first relevant result for query $q$ .

An MRR of 1.0 means that every query has a relevant result at rank 1. MRR is well suited for scenarios where only the first relevant result matters, such as navigational queries or question answering.

Graded Relevance Metrics

Cumulative Gain (CG)

Cumulative Gain is the sum of the graded relevance values (the gain) of all results in a ranked list. It doesn't take into account the rank of a result in a ranked list.

\text{CG@}k = \sum_{i=1}^k rel_{i}

where

$k$ is the depth into the ranked list to consider results
$rel_i$ is the graded relevance of the document at rank $i$ .

Since CG doesn't take into account the rank of results, its value as a relevancy metric is limited.

Discounted Cumulative Gain (DCG)

Discounted Cumulative Gain (DCG) builds on top of CG, and measures the gain of the results whilst also considering the rank of a result in the ranked list. A common formula for DCG is:

\text{DCG@}k = \sum_{i=1}^k \frac{2^{rel_i} - 1}{\log_2(i + 1)}

where

$k$ is the depth into the ranked list to consider results
$rel_i$ is the graded relevance of the document at rank $i$ .

The core idea behind DCG is that highly relevant documents are more useful when appearing earlier in the ranking, and their value should decrease logarithmically as their position in the list increases; the numerator rewards higher relevance exponentially, while the denominator discounts the value based on rank/position. The total DCG is the sum of these discounted gains across all ranks in the list.

Info

DCG scores are relative to the query so cannot be used to compare ranking quality across queries.

Normalized Discounted Cumulative Gain (NDCG)

Normalized Discounted Cumulative Gain (NDCG) extends DCG by providing a way to compare ranking quality across queries of varying difficulty or relevance distributions. It does this by normalizing the DCG score against the ideal DCG (IDCG)- the maximum possible DCG that could be achieved if all relevant documents were perfectly ranked.

\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}

where

$\text{IDCG@}k$ is the $\text{DCG@}k$ of the ideal ranking of documents for the given query.

Normalization enables fair comparisons between search results for different queries and is especially useful in large-scale evaluations involving diverse query sets.

Tip

NDCG is a good default choice when the relevance distribution is unknown.

Expected Reciprocal Rank (ERR)

Expected Reciprocal Rank (ERR) accounts for user satisfaction by modeling the probability that a user finds a relevant result and stops browsing the ranked list. It assumes that each result at rank $i$ has a probability $R(i)$ of satisfying the user, based on its relevance grade (typically mapped to a value between 0 and 1). ERR computes the expected reciprocal of the rank at which the user stops.

\text{ERR@}k = \sum_{i=1}^k \frac{1}{i} P(\text{user stops at } i)

where

P(\text{user stops at } i) = \left( \prod_{j=1}^{i-1}(1 - R(j)) \right) R(i)

$R(i)$ is a probability derived from the relevance grade, e.g. $R(i) = \frac{2^{rel_i} - 1}{2^{\text{max rel}}}$

The product term represents the probability that the user was not satisfied by any of the previous results. The final ERR score is the sum of these contributions over the ranked list.

Similarity Metrics

Rank-Biased Overlap (RBO)

Rank-Biased Overlap (RBO) measures similarity between two ranked lists while giving more weight to higher-ranked items. Unlike traditional measures, RBO accounts for incomplete lists and allows for a customizable weighting parameter, adjusting how much emphasis is placed on top-ranked elements.

\text{RBO}(S, T, p) = (1 - p) \sum_{d=1}^\infty p^{d-1} \cdot A_d .

where

$S$ and $T$ are two infinite rankings
$A_d$ is the agreement at depth $d$ .
$p ∈ [0, 1]$ controls top-weightedness. When $p$ is $0$ , only the top-ranked item is considered. Larger values of $p$ place more emphasis on deeper ranks.

RBO is particularly useful in evaluating search rankings, where the relevance of results diminish as the user navigates further down the list. It provides a robust, intuitive measure that handles partial overlaps.

Extrapolation

$\text{RBO}_\text{EXT}$ is the extrapolated version of RBO that approximates infinite summation with finite depth; it is very typical for search systems to only make available the top $k$ results, but have many more that match for a given query. $\text{RBO}_\text{EXT}$ extrapolates a score from the available ranked lists, assuming that the agreement seen at depth $k$ continues indefinitely.

\text{RBO}_{\text{EXT}}(S, T, p, k) = \frac{X_k}{k} \cdot p^k + \frac{1 - p}{p} \sum_{d=1}^{k} \frac{X_d}{d} \cdot p^d

where

$S$ and $T$ are two infinite rankings
$p ∈ [0, 1]$ controls top-weightedness. When $p$ is $0$ , only the top-ranked item is considered. Larger values of $p$ place more emphasis on deeper ranks.
$k$ is the depth of available results.
$\frac{X_k}{k}$ is $A_r$ , the agreement at ranks $r > k$
$\frac{X_d}{d}$ is $A_d$ , the agreement at depth $d$ .

$\text{RBO}_\text{EXT}$ is the implementation used in Releval. It is particularly useful for comparing the similarity of search results

over time from a single search system, whether the content is static or dynamically changing
whilst implementing performance improvements to a search algorithm, without wanting to affect the results e.g. a change in indexing implementation that improves latency at query time.
whilst performing a migration from one search engine to another, without wanting to affect the results e.g. migrating from Solr to Elasticsearch.

Binary Relevance Metrics​

Precision​

Recall​

F-Score​

Mean Average Precision (MAP)​

Best@k​

Mean Reciprocal Rank (MRR)​

Graded Relevance Metrics​

Cumulative Gain (CG)​

Discounted Cumulative Gain (DCG)​

Normalized Discounted Cumulative Gain (NDCG)​

Expected Reciprocal Rank (ERR)​

Similarity Metrics​

Rank-Biased Overlap (RBO)​

Extrapolation​

Binary Relevance Metrics

Precision

Recall

F-Score

Mean Average Precision (MAP)

Best@k

Mean Reciprocal Rank (MRR)

Graded Relevance Metrics

Cumulative Gain (CG)

Discounted Cumulative Gain (DCG)

Normalized Discounted Cumulative Gain (NDCG)

Expected Reciprocal Rank (ERR)

Similarity Metrics

Rank-Biased Overlap (RBO)

Extrapolation