Judging Results
After an evaluation run completes, the next step is to judge the relevance of the returned candidates. Judgments are the foundation of relevance metrics — without them, metrics cannot be calculated.
What is a Judgment?
A judgment is a relevance rating assigned by a human evaluator to a specific candidate returned for a query. For example, if the query is "running shoes" and one of the returned candidates is "Nike Air Zoom Pegasus", a judge might rate it as "Highly relevant" (grade 3 on a graded scale).
Judging in the UI
The Evaluate page presents unrated candidates one at a time, alongside the query and the candidate's mapped fields. Pick a grade, click Submit Rating, and the next unrated candidate loads automatically.
Number-key shortcuts: press a key (e.g. 3) to highlight a grade, then press the same key again to submit. Useful when judging a long backlog.
Use Next Unrated to quickly navigate to the next candidate that hasn't been judged yet.
Judging via the API
Submit judgments for one or more queries:
```shell
curl -X POST "https://${RELEVAL_HOST}/api/v1/judgments" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  --data @- <<EOF
{
  "queries": [
    {
      "query_id": "${QUERY_ID}",
      "judgments": [
        { "candidate_id": "product-123", "grade": 4 },
        { "candidate_id": "product-456", "grade": 2 },
        { "candidate_id": "product-789", "grade": 0 }
      ]
    }
  ]
}
EOF
```
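The same request can be assembled from a script. The following is a minimal Python sketch using only the standard library; it mirrors the endpoint, headers, and payload shape of the curl call. The helper name `build_judgment_request` and the host, token, and query ID values are placeholders for illustration, not part of the product.

```python
import json
import urllib.request


def build_judgment_request(host, token, query_id, grades):
    """Build (but do not send) a POST request carrying one query's judgments.

    `grades` maps candidate_id -> grade, e.g. {"product-123": 4}.
    """
    payload = {
        "queries": [
            {
                "query_id": query_id,
                "judgments": [
                    {"candidate_id": cid, "grade": grade}
                    for cid, grade in grades.items()
                ],
            }
        ]
    }
    return urllib.request.Request(
        url=f"https://{host}/api/v1/judgments",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_judgment_request(
    host="releval.example.com",  # placeholder host
    token="example-token",       # placeholder token
    query_id="query-1",          # placeholder query ID
    grades={"product-123": 4, "product-456": 2, "product-789": 0},
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the payload easy to inspect or log before anything reaches the server.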
The grade value must be within the range of the run's scale:
| Scale | Valid Grades |
|---|---|
| Binary | 0–1 |
| Graded | 0–4 |
| Detailed | 0–9 |
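Checking grades on the client side before submitting can catch out-of-range values early. The sketch below encodes the ranges from the table; the lowercase scale keys and the `is_valid_grade` helper are hypothetical, and it assumes grades are whole numbers, as in the API example.

```python
# Highest valid grade per scale, taken from the table above.
# The lowercase keys are illustrative; the API may name scales differently.
MAX_GRADE = {"binary": 1, "graded": 4, "detailed": 9}


def is_valid_grade(scale, grade):
    """Return True if `grade` is an integer within the scale's range."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(grade, bool) or not isinstance(grade, int):
        return False
    return 0 <= grade <= MAX_GRADE[scale.lower()]
```

For example, `is_valid_grade("Graded", 4)` returns `True`, while `is_valid_grade("Binary", 2)` returns `False`.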
How Metrics Are Calculated
Metrics are recomputed automatically whenever judgments are submitted, both at the query level and aggregated to the run level. Each metric is scored against the candidate ranking using the grades you've supplied. The UI reflects updated values as soon as judgments are saved.
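This page does not list which metrics are computed or give their formulas. Purely as an illustration of how a rank-aware metric can be scored from grades, the sketch below computes nDCG@k, a widely used graded-relevance metric, using the common gain `2^grade - 1` and a `log2(rank + 1)` discount. Whether the product uses this metric or this exact formulation is an assumption.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades, in rank order."""
    return sum(
        (2 ** grade - 1) / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcg_at_k(grades, k):
    """DCG normalized by the DCG of the best possible ordering of `grades`.

    Simplification: the ideal ordering is built only from the grades passed
    in, so judged candidates outside this list are not considered.
    """
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0
```

A ranking whose grades are already in descending order, such as `[4, 2, 0]`, scores 1.0; placing the best candidate last lowers the score.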
Multiple Judges
Multiple users can judge the same candidates. When more than one user has judged a given candidate, their grades are averaged to produce a consensus score, which helps reduce individual bias. Updating your own judgment changes only your rating; other judges' ratings are unaffected.
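The averaging behaviour can be sketched as follows. This assumes "averaged" means a plain arithmetic mean and that each judge contributes only their most recent grade per candidate; the `(judge, candidate_id, grade)` tuple layout and the `consensus_grades` helper are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def consensus_grades(judgments):
    """Average grades per candidate across judges.

    `judgments` is an iterable of (judge, candidate_id, grade) tuples in
    submission order.
    """
    latest = {}
    for judge, candidate_id, grade in judgments:
        # A later rating replaces only that judge's earlier rating.
        latest[(judge, candidate_id)] = grade

    by_candidate = defaultdict(list)
    for (_judge, candidate_id), grade in latest.items():
        by_candidate[candidate_id].append(grade)

    return {cid: mean(grades) for cid, grades in by_candidate.items()}


scores = consensus_grades([
    ("alice", "product-123", 4),
    ("bob", "product-123", 2),
    ("alice", "product-456", 1),
    ("alice", "product-123", 3),  # alice revises her rating; bob's is untouched
])
```

Here `product-123` ends up with a consensus of 2.5 (alice's revised 3 and bob's 2), and `product-456` keeps its single grade of 1.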
Best Practices
Judge Consistently
Establish clear guidelines for what each grade level means in your context before starting. For example, for an e-commerce search:
| Grade (Graded Scale) | Meaning |
|---|---|
| 0 — Not relevant | Product has no relation to the query |
| 1 — Marginally relevant | Loosely related but not what the user wants |
| 2 — Fairly relevant | Related product but not an ideal match |
| 3 — Highly relevant | Good match for the query intent |
| 4 — Perfectly relevant | Exactly what the user is looking for |
Prioritize Top Results
Focus judgment effort on the top-ranked results first. Most metrics weight higher-ranked results more heavily, so judging the top 10–20 candidates per query gives you meaningful metrics without needing to judge every result.
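One way to apply this when preparing a judging backlog is to queue only the unjudged candidates within the top k of each query's ranking. The helper below is a hypothetical sketch; it assumes you already have the ranked candidate IDs and the set of IDs that have been judged.

```python
def next_to_judge(ranked_candidate_ids, judged_ids, k=10):
    """Return the unjudged candidates among the top k, in rank order."""
    return [cid for cid in ranked_candidate_ids[:k] if cid not in judged_ids]
```

For example, `next_to_judge(["a", "b", "c", "d"], {"b"}, k=3)` returns `["a", "c"]`: candidate `d` falls outside the top 3 and `b` is already judged.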
Use Appropriate Scales
- Binary for quick pass/fail assessments or when relevance is clear-cut
- Graded for most evaluation scenarios — it balances granularity with judgment speed
- Detailed when fine distinctions matter (e.g., comparing very similar ranking models)