Evaluation Metrics
Evaluating the performance of search systems involves measuring how well the retrieved documents satisfy a user's query or information need. The following are metrics available for an evaluation run.
Binary Relevance Metrics
Binary relevance metrics can be used with results graded with binary relevance, as well as graded and detailed relevance, where scores > 1 are normalized to 1.
Precision
Precision measures the fraction of retrieved documents that are relevant. At a cutoff depth :
A Precision@10 of 0.7 means that 7 out of the top 10 results are relevant. Precision does not consider the order of relevant results within the top .
Recall
Recall measures the fraction of all relevant documents that appear in the retrieved results. At a cutoff depth :
A Recall@10 of 0.5 means that half of all relevant documents for the query appear in the top 10 results. Recall does not consider the order of relevant results within the top .
F-Score
The F-Score balances precision and recall using a harmonic mean. It's parameterized by , where favors recall and favors precision.
For the common Score :
Mean Average Precision (MAP)
MAP computes the mean of average precision scores across multiple queries.
Where is the average precision for query , defined as:
where
- is the total number of relevant documents.
- is precision at cutoff .
- is 1 if the document at rank is relevant, else 0.
Best@k
Best evaluates whether the top- includes the single best item:
This is especially useful when a single item carries outsized importance (e.g. the best answer in a Q&A system).
Mean Reciprocal Rank (MRR)
MRR measures how quickly a relevant result appears in the ranked list, averaged across multiple queries. For each query, the reciprocal rank is the inverse of the position of the first relevant result:
where
- is the set of queries.
- is the position of the first relevant result for query .
An MRR of 1.0 means that every query has a relevant result at rank 1. MRR is well suited for scenarios where only the first relevant result matters, such as navigational queries or question answering.
Graded Relevance Metrics
Cumulative Gain (CG)
Cumulative Gain is the sum of the graded relevance values (the gain) of all results in a ranked list. It doesn't take into account the rank of a result in a ranked list.
where
- is the depth into the ranked list to consider results
- is the graded relevance of the document at rank .
Since CG doesn't take into account the rank of results, its value as a relevancy metric is limited.
Discounted Cumulative Gain (DCG)
Discounted Cumulative Gain (DCG) builds on top of CG, and measures the gain of the results whilst also considering the rank of a result in the ranked list. A common formula for DCG is:
where
- is the depth into the ranked list to consider results
- is the graded relevance of the document at rank .
The core idea behind DCG is that highly relevant documents are more useful when appearing earlier in the ranking, and their value should decrease logarithmically as their position in the list increases; the numerator rewards higher relevance exponentially, while the denominator discounts the value based on rank/position. The total DCG is the sum of these discounted gains across all ranks in the list.
DCG scores are relative to the query so cannot be used to compare ranking quality across queries.
Normalized Discounted Cumulative Gain (NDCG)
Normalized Discounted Cumulative Gain (NDCG) extends DCG by providing a way to compare ranking quality across queries of varying difficulty or relevance distributions. It does this by normalizing the DCG score against the ideal DCG (IDCG)- the maximum possible DCG that could be achieved if all relevant documents were perfectly ranked.
where
- is the of the ideal ranking of documents for the given query.
Normalization enables fair comparisons between search results for different queries and is especially useful in large-scale evaluations involving diverse query sets.
NDCG is a good default choice when the relevance distribution is unknown.
Expected Reciprocal Rank (ERR)
Expected Reciprocal Rank (ERR) accounts for user satisfaction by modeling the probability that a user finds a relevant result and stops browsing the ranked list. It assumes that each result at rank has a probability of satisfying the user, based on its relevance grade (typically mapped to a value between 0 and 1). ERR computes the expected reciprocal of the rank at which the user stops.
where
- is a probability derived from the relevance grade, e.g.
The product term represents the probability that the user was not satisfied by any of the previous results. The final ERR score is the sum of these contributions over the ranked list.
Similarity Metrics
Rank-Biased Overlap (RBO)
Rank-Biased Overlap (RBO) measures similarity between two ranked lists while giving more weight to higher-ranked items. Unlike traditional measures, RBO accounts for incomplete lists and allows for a customizable weighting parameter, adjusting how much emphasis is placed on top-ranked elements.
where
- and are two infinite rankings
- is the agreement at depth .
- controls top-weightedness. When is , only the top-ranked item is considered. Larger values of place more emphasis on deeper ranks.
RBO is particularly useful in evaluating search rankings, where the relevance of results diminish as the user navigates further down the list. It provides a robust, intuitive measure that handles partial overlaps.
Extrapolation
is the extrapolated version of RBO that approximates infinite summation with finite depth; it is very typical for search systems to only make available the top results, but have many more that match for a given query. extrapolates a score from the available ranked lists, assuming that the agreement seen at depth continues indefinitely.
where
- and are two infinite rankings
- controls top-weightedness. When is , only the top-ranked item is considered. Larger values of place more emphasis on deeper ranks.
- is the depth of available results.
- is , the agreement at ranks
- is , the agreement at depth .
is the implementation used in Releval. It is particularly useful for comparing the similarity of search results
- over time from a single search system, whether the content is static or dynamically changing
- whilst implementing performance improvements to a search algorithm, without wanting to affect the results e.g. a change in indexing implementation that improves latency at query time.
- whilst performing a migration from one search engine to another, without wanting to affect the results e.g. migrating from Solr to Elasticsearch.