
Judging Results

After an evaluation run completes, the next step is to judge the relevance of the returned candidates. Judgments are the foundation of relevance metrics — without them, metrics cannot be calculated.

What is a Judgment?

A judgment is a relevance rating assigned by a human evaluator to a specific candidate returned for a query. For example, if the query is "running shoes" and one of the returned candidates is "Nike Air Zoom Pegasus", a judge might rate it as "Highly relevant" (grade 3 on a graded scale).

Judging in the UI

The Evaluate page presents unrated candidates one at a time, alongside the query and the candidate's mapped fields. Pick a grade, click Submit Rating, and the next unrated candidate loads automatically.

Number-key shortcuts: press a key (e.g. 3) to highlight a grade, then press the same key again to submit. This is useful when working through a long backlog.

Tip

Use Next Unrated to quickly navigate to the next candidate that hasn't been judged yet.

Judging via the API

Submit judgments for one or more queries:

curl -X POST "https://${RELEVAL_HOST}/api/v1/judgments" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  --data @- <<EOF
{
  "queries": [
    {
      "query_id": "${QUERY_ID}",
      "judgments": [
        { "candidate_id": "product-123", "grade": 4 },
        { "candidate_id": "product-456", "grade": 2 },
        { "candidate_id": "product-789", "grade": 0 }
      ]
    }
  ]
}
EOF

The grade value must be within the range of the run's scale:

| Scale    | Valid Grades |
|----------|--------------|
| Binary   | 0–1          |
| Graded   | 0–4          |
| Detailed | 0–9          |
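
Because a grade outside the run's scale is invalid, it can help to check grades on the client before posting them. The following is a minimal sketch under stated assumptions, not part of the product: the ranges in `SCALE_RANGES` are copied from the table above, while the function name `build_judgment_payload`, the lowercase scale keys, and the query ID are hypothetical. The payload shape follows the curl example.

```python
from typing import Dict, Tuple

# Inclusive grade ranges per scale, copied from the table above.
# The lowercase keys are an assumption made for this sketch.
SCALE_RANGES: Dict[str, Tuple[int, int]] = {
    "binary": (0, 1),
    "graded": (0, 4),
    "detailed": (0, 9),
}


def build_judgment_payload(query_id: str, grades: Dict[str, int], scale: str) -> dict:
    """Validate grades against the scale, then build the request body."""
    low, high = SCALE_RANGES[scale]
    for candidate_id, grade in grades.items():
        if not low <= grade <= high:
            raise ValueError(
                f"grade {grade} for {candidate_id!r} is outside {low}-{high} ({scale} scale)"
            )
    judgments = [{"candidate_id": cid, "grade": g} for cid, g in grades.items()]
    return {"queries": [{"query_id": query_id, "judgments": judgments}]}


payload = build_judgment_payload(
    "query-1",  # hypothetical query ID
    {"product-123": 4, "product-456": 2, "product-789": 0},
    scale="graded",
)
```

The resulting `payload` can be serialized with `json.dumps` and sent as the request body. The same grades would raise a `ValueError` for a Binary run, since 4 and 2 fall outside 0–1.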

How Metrics Are Calculated

Metrics are recomputed automatically whenever judgments are submitted, both at the query level and aggregated to the run level. Each metric is scored against the candidate ranking using the grades you've supplied. The UI reflects updated values as soon as judgments are saved.
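
To make "scored against the candidate ranking" concrete, here is a sketch of one widely used rank-aware metric, nDCG. This page does not name the metrics the product computes, so treat the choice of nDCG, the `2^grade - 1` gain, and the mean as the run-level aggregate as assumptions for illustration only.

```python
import math
from typing import Sequence


def dcg(grades_in_rank_order: Sequence[float]) -> float:
    """Discounted cumulative gain using the common (2^grade - 1) gain."""
    return sum(
        (2 ** grade - 1) / math.log2(rank + 1)
        for rank, grade in enumerate(grades_in_rank_order, start=1)
    )


def ndcg(grades_in_rank_order: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg(sorted(grades_in_rank_order, reverse=True))
    return dcg(grades_in_rank_order) / ideal if ideal > 0 else 0.0


# Grades for one query's candidates, in the order the run ranked them.
query_score = ndcg([4, 2, 0])    # best candidate ranked first
swapped_score = ndcg([0, 2, 4])  # best candidate ranked last

# One simple run-level aggregate (an assumption): the mean over queries.
run_score = (query_score + swapped_score) / 2
```

The same three grades produce a lower score when the best candidate is ranked last, which is why a metric of this kind changes whenever a new judgment is saved for a ranked candidate.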

Multiple Judges

Multiple users can judge the same candidates. When more than one user has judged a given candidate, their grades are averaged to produce a consensus score, which helps reduce individual bias. Updating your own judgment changes only your rating; other judges' ratings are unaffected.
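
The consensus behavior can be sketched as follows. This assumes a plain arithmetic mean, which is the simplest reading of "averaged"; the function name `consensus_grades` and the judge names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def consensus_grades(ratings: Iterable[Tuple[str, str, int]]) -> Dict[str, float]:
    """Average the grades per candidate across judges.

    Each rating is (judge, candidate_id, grade). A judge's later rating for
    the same candidate replaces their earlier one, so an update changes only
    that judge's contribution.
    """
    latest: Dict[Tuple[str, str], int] = {}
    for judge, candidate_id, grade in ratings:
        latest[(judge, candidate_id)] = grade

    per_candidate: Dict[str, List[int]] = defaultdict(list)
    for (_, candidate_id), grade in latest.items():
        per_candidate[candidate_id].append(grade)
    return {cid: sum(g) / len(g) for cid, g in per_candidate.items()}


ratings = [
    ("alice", "product-123", 4),
    ("bob", "product-123", 3),
    ("alice", "product-456", 2),
    ("bob", "product-123", 2),  # bob revises his own rating
]
scores = consensus_grades(ratings)
```

Here bob's revision replaces only his earlier grade for `product-123`; alice's grade of 4 still counts toward the average.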

Best Practices

Judge Consistently

Establish clear guidelines for what each grade level means in your context before starting. For example, for an e-commerce search:

| Grade (Graded Scale) | Label               | Meaning                                      |
|----------------------|---------------------|----------------------------------------------|
| 0                    | Not relevant        | Product has no relation to the query         |
| 1                    | Marginally relevant | Loosely related but not what the user wants  |
| 2                    | Fairly relevant     | Related product but not an ideal match       |
| 3                    | Highly relevant     | Good match for the query intent              |
| 4                    | Perfectly relevant  | Exactly what the user is looking for         |

Prioritize Top Results

Focus judgment effort on the top-ranked results first. Most metrics weight higher-ranked results more heavily, so judging the top 10–20 candidates per query gives you meaningful metrics without needing to judge every result.
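
One way to build a judging backlog that follows this advice is to list only the top-ranked candidates that have no judgment yet. This is a minimal sketch: the function name `unjudged_top_k` and the sample IDs are hypothetical, and it assumes you already have each query's ranking and the set of judged candidate IDs on hand.

```python
from typing import List, Sequence, Set


def unjudged_top_k(ranking: Sequence[str], judged: Set[str], k: int = 10) -> List[str]:
    """Return the candidates within the top k ranks that have no judgment yet."""
    return [cid for cid in ranking[:k] if cid not in judged]


ranking = ["product-123", "product-456", "product-789", "product-321"]
judged = {"product-456"}

# Only ranks 1-3 are considered; product-321 at rank 4 is left for later.
to_judge = unjudged_top_k(ranking, judged, k=3)
```

The result preserves rank order, so the highest-ranked unjudged candidate comes first in the backlog.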

Use Appropriate Scales

  • Binary for quick pass/fail assessments or when relevance is clear-cut
  • Graded for most evaluation scenarios — it balances granularity with judgment speed
  • Detailed when fine distinctions matter (e.g., comparing very similar ranking models)