Skip to main content

Evaluation Metrics

Evaluating the performance of search systems involves measuring how well the retrieved documents satisfy a user's query or information need. The following are metrics available for an evaluation run.

Binary Relevance Metrics

Binary relevance metrics can be used with results graded with binary relevance, as well as graded and detailed relevance, where scores > 1 are normalized to 1.

Precision

Precision measures the fraction of retrieved documents that are relevant. At a cutoff depth kk:

Precision@k={relevant documents in top k}k\text{Precision@}k = \frac{|\{\text{relevant documents in top } k\}|}{k}

A Precision@10 of 0.7 means that 7 out of the top 10 results are relevant. Precision does not consider the order of relevant results within the top kk.

Recall

Recall measures the fraction of all relevant documents that appear in the retrieved results. At a cutoff depth kk:

Recall@k={relevant documents in top k}{all relevant documents}\text{Recall@}k = \frac{|\{\text{relevant documents in top } k\}|}{|\{\text{all relevant documents}\}|}

A Recall@10 of 0.5 means that half of all relevant documents for the query appear in the top 10 results. Recall does not consider the order of relevant results within the top kk.

F-Score

The F-Score balances precision and recall using a harmonic mean. It's parameterized by ββ, where β>1β > 1 favors recall and β<1β < 1 favors precision.

Fβ=(1+β2)PrecisionRecallβ2Precision+RecallF_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}

For the common F1F_1 Score (β=1)(β = 1):

F1=2PrecisionRecallPrecision+RecallF_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Mean Average Precision (MAP)

MAP computes the mean of average precision scores across multiple queries.

MAP=1QqQAP(q)\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)

Where AP(q)AP(q) is the average precision for query qq, defined as:

AP(q)=1Rk=1NP(k)rel(k)\text{AP}(q) = \frac{1}{R} \sum_{k=1}^N P(k) \cdot \text{rel}(k)

where

  • RR is the total number of relevant documents.
  • P(k)P(k) is precision at cutoff kk.
  • rel(k)rel(k) is 1 if the document at rank kk is relevant, else 0.

Best@k

Best@k@k evaluates whether the top-kk includes the single best item:

Best@k={1if the best item is in top-k0otherwise\text{Best@}k = \begin{cases} 1 & \text{if the best item is in top-}k \\ 0 & \text{otherwise} \end{cases}

This is especially useful when a single item carries outsized importance (e.g. the best answer in a Q&A system).

Mean Reciprocal Rank (MRR)

MRR measures how quickly a relevant result appears in the ranked list, averaged across multiple queries. For each query, the reciprocal rank is the inverse of the position of the first relevant result:

MRR=1QqQ1rankq\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}

where

  • QQ is the set of queries.
  • rankq\text{rank}_q is the position of the first relevant result for query qq.

An MRR of 1.0 means that every query has a relevant result at rank 1. MRR is well suited for scenarios where only the first relevant result matters, such as navigational queries or question answering.

Graded Relevance Metrics

Cumulative Gain (CG)

Cumulative Gain is the sum of the graded relevance values (the gain) of all results in a ranked list. It doesn't take into account the rank of a result in a ranked list.

CG@k=i=1kreli\text{CG@}k = \sum_{i=1}^k rel_{i}

where

  • kk is the depth into the ranked list to consider results
  • relirel_i is the graded relevance of the document at rank ii.

Since CG doesn't take into account the rank of results, its value as a relevancy metric is limited.

Discounted Cumulative Gain (DCG)

Discounted Cumulative Gain (DCG) builds on top of CG, and measures the gain of the results whilst also considering the rank of a result in the ranked list. A common formula for DCG is:

DCG@k=i=1k2reli1log2(i+1)\text{DCG@}k = \sum_{i=1}^k \frac{2^{rel_i} - 1}{\log_2(i + 1)}

where

  • kk is the depth into the ranked list to consider results
  • relirel_i is the graded relevance of the document at rank ii.

The core idea behind DCG is that highly relevant documents are more useful when appearing earlier in the ranking, and their value should decrease logarithmically as their position in the list increases; the numerator rewards higher relevance exponentially, while the denominator discounts the value based on rank/position. The total DCG is the sum of these discounted gains across all ranks in the list.

Info

DCG scores are relative to the query so cannot be used to compare ranking quality across queries.

Normalized Discounted Cumulative Gain (NDCG)

Normalized Discounted Cumulative Gain (NDCG) extends DCG by providing a way to compare ranking quality across queries of varying difficulty or relevance distributions. It does this by normalizing the DCG score against the ideal DCG (IDCG)- the maximum possible DCG that could be achieved if all relevant documents were perfectly ranked.

NDCG@k=DCG@kIDCG@k\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}

where

  • IDCG@k\text{IDCG@}k is the DCG@k\text{DCG@}k of the ideal ranking of documents for the given query.

Normalization enables fair comparisons between search results for different queries and is especially useful in large-scale evaluations involving diverse query sets.

Tip

NDCG is a good default choice when the relevance distribution is unknown.

Expected Reciprocal Rank (ERR)

Expected Reciprocal Rank (ERR) accounts for user satisfaction by modeling the probability that a user finds a relevant result and stops browsing the ranked list. It assumes that each result at rank ii has a probability R(i)R(i) of satisfying the user, based on its relevance grade (typically mapped to a value between 0 and 1). ERR computes the expected reciprocal of the rank at which the user stops.

ERR@k=i=1k1iP(user stops at i)\text{ERR@}k = \sum_{i=1}^k \frac{1}{i} P(\text{user stops at } i)

where

P(user stops at i)=(j=1i1(1R(j)))R(i)P(\text{user stops at } i) = \left( \prod_{j=1}^{i-1}(1 - R(j)) \right) R(i)
  • R(i)R(i) is a probability derived from the relevance grade, e.g. R(i)=2reli12max relR(i) = \frac{2^{rel_i} - 1}{2^{\text{max rel}}}

The product term represents the probability that the user was not satisfied by any of the previous results. The final ERR score is the sum of these contributions over the ranked list.

Similarity Metrics

Rank-Biased Overlap (RBO)

Rank-Biased Overlap (RBO) measures similarity between two ranked lists while giving more weight to higher-ranked items. Unlike traditional measures, RBO accounts for incomplete lists and allows for a customizable weighting parameter, adjusting how much emphasis is placed on top-ranked elements.

RBO(S,T,p)=(1p)d=1pd1Ad.\text{RBO}(S, T, p) = (1 - p) \sum_{d=1}^\infty p^{d-1} \cdot A_d .

where

  • SS and TT are two infinite rankings
  • AdA_d is the agreement at depth dd.
  • p[0,1]p ∈ [0, 1] controls top-weightedness. When pp is 00, only the top-ranked item is considered. Larger values of pp place more emphasis on deeper ranks.

RBO is particularly useful in evaluating search rankings, where the relevance of results diminish as the user navigates further down the list. It provides a robust, intuitive measure that handles partial overlaps.

Extrapolation

RBOEXT\text{RBO}_\text{EXT} is the extrapolated version of RBO that approximates infinite summation with finite depth; it is very typical for search systems to only make available the top kk results, but have many more that match for a given query. RBOEXT\text{RBO}_\text{EXT} extrapolates a score from the available ranked lists, assuming that the agreement seen at depth kk continues indefinitely.

RBOEXT(S,T,p,k)=Xkkpk+1ppd=1kXddpd\text{RBO}_{\text{EXT}}(S, T, p, k) = \frac{X_k}{k} \cdot p^k + \frac{1 - p}{p} \sum_{d=1}^{k} \frac{X_d}{d} \cdot p^d

where

  • SS and TT are two infinite rankings
  • p[0,1]p ∈ [0, 1] controls top-weightedness. When pp is 00, only the top-ranked item is considered. Larger values of pp place more emphasis on deeper ranks.
  • kk is the depth of available results.
  • Xkk\frac{X_k}{k} is ArA_r, the agreement at ranks r>kr > k
  • Xdd\frac{X_d}{d} is AdA_d, the agreement at depth dd.

RBOEXT\text{RBO}_\text{EXT} is the implementation used in Releval. It is particularly useful for comparing the similarity of search results

  • over time from a single search system, whether the content is static or dynamically changing
  • whilst implementing performance improvements to a search algorithm, without wanting to affect the results e.g. a change in indexing implementation that improves latency at query time.
  • whilst performing a migration from one search engine to another, without wanting to affect the results e.g. migrating from Solr to Elasticsearch.