Running an AI judging run
Once you've created and tested an AI judge, you can use it to judge any completed evaluation run. An AI judging run is the unit of work: one judge applied to one evaluation run, executed asynchronously.
Prerequisites
The target evaluation run must satisfy two conditions:
- The run must be completed. AI judging cannot start on a run that is still pending, queued, running, or failed — wait for the run to finish first.
- The run must not be locked. A locked run rejects all judgment changes, including those produced by an AI judging run.
Starting a judging run
In the UI
- Open the Evaluation Run detail page.
- Click Run AI Judge in the sidebar.
- Choose the AI Judge to use from the dropdown.
- Click Start.
The run is queued immediately, and a progress indicator tracks candidates as they are graded.
Using the API
curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/ai-judge" \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-d "{\"ai_judge_id\": \"${AI_JUDGE_ID}\"}"
The response contains the newly created judging run, including its identifier and an initial status of queued.
Multiple AI judging runs can target the same evaluation run — for example, to compare grades from two different models. Later runs overwrite earlier judgments for the same candidate; if you need to keep both sets, export the first set's judgments before starting the second.
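For tooling that does not shell out to curl, the same request can be sketched in Python with only the standard library. The endpoint path and the ai_judge_id field are taken from the curl example; the function names, the timeout, and the assumption that the response body is JSON are illustrative and not part of the Releval API.

```python
import json
import urllib.request


def build_start_request(host: str, run_id: str, ai_judge_id: str, token: str) -> urllib.request.Request:
    """Build the POST that starts an AI judging run (mirrors the curl call)."""
    url = f"https://{host}/api/v1/evaluations/runs/{run_id}/ai-judge"
    body = json.dumps({"ai_judge_id": ai_judge_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def start_judging_run(host: str, run_id: str, ai_judge_id: str, token: str) -> dict:
    """Send the request and return the decoded response body.

    Assumption: the response body is a JSON object describing the new run.
    """
    request = build_start_request(host, run_id, ai_judge_id, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the URL and payload easy to inspect or log before anything goes over the network.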
Run lifecycle
| Status | Meaning |
|---|---|
| Queued | The run has been created and is waiting to start. |
| Running | The judge is being applied to candidates. |
| Completed | Every candidate has either been judged or recorded as failed. |
| Failed | The run aborted. The error message on the run describes the cause. |
| Cancelled | The run was stopped by a user via the API or UI. |
Monitoring progress
Each AI judging run reports overall progress, how many candidates have been judged successfully, how many failed, and the running input/output token totals across all LLM calls so far. The UI updates these live as the run progresses. To monitor from your own tooling, list the AI judging runs for an evaluation run and re-fetch on a backoff until the status settles:
curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/ai-judging-runs" \
-H "Authorization: Bearer ${TOKEN}"
Stop polling once the status is no longer queued or running.
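A minimal sketch of that polling loop with exponential backoff follows. Because the shape of the list response is not described here, the status lookup is left to a caller-supplied fetch_status function (hypothetical); the delay values are arbitrary defaults, and comparing lowercase status strings is an assumption.

```python
import time
from typing import Callable

# Statuses that mean the run has not settled yet (per the lifecycle table).
ACTIVE_STATUSES = {"queued", "running"}


def wait_until_settled(
    fetch_status: Callable[[], str],
    initial_delay: float = 2.0,
    max_delay: float = 60.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run leaves queued/running and return the final status.

    fetch_status is a hypothetical callable that calls the list endpoint and
    extracts the status of the judging run you care about.
    """
    delay = initial_delay
    while True:
        status = fetch_status().lower()
        if status not in ACTIVE_STATUSES:
            return status
        sleep(delay)
        # Grow the wait between polls, capped at max_delay.
        delay = min(delay * factor, max_delay)
```

Passing sleep as a parameter makes the loop testable without real waiting. A production version would likely also want an overall timeout.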
Cancelling a run
A queued or running AI judging run can be cancelled at any time:
curl -X POST "https://${RELEVAL_HOST}/api/v1/ai-judging-runs/${JUDGING_RUN_ID}/cancel" \
-H "Authorization: Bearer ${TOKEN}"
Cancellation marks the run as cancelled and stops further LLM calls. Calls that were already in flight may still complete and write their judgments, leaving the run in a "partially judged" state — usable, but not exhaustive. Start a new judging run to fill in the gaps if needed.
Cancellation is the right tool when:
- You realise you picked the wrong judge or model.
- The run is consuming more tokens than expected.
- The provider is rate-limiting heavily and you want to retry later with lower concurrency.
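The token-overrun case can be automated from a monitoring script. The sketch below pairs a budget check with the cancel request shown above. The cancel endpoint is taken from the curl example; the budget threshold, the function names, and the way token totals reach the script are assumptions.

```python
import urllib.request


def over_token_budget(input_tokens: int, output_tokens: int, budget: int) -> bool:
    """Return True once the combined token total exceeds the budget."""
    return input_tokens + output_tokens > budget


def build_cancel_request(host: str, judging_run_id: str, token: str) -> urllib.request.Request:
    """Build the body-less POST that cancels a judging run (mirrors the curl call)."""
    url = f"https://{host}/api/v1/ai-judging-runs/{judging_run_id}/cancel"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def cancel_if_over_budget(host, judging_run_id, token, input_tokens, output_tokens, budget) -> bool:
    """Cancel the run when it is over budget; return whether a cancel was sent."""
    if not over_token_budget(input_tokens, output_tokens, budget):
        return False
    request = build_cancel_request(host, judging_run_id, token)
    with urllib.request.urlopen(request, timeout=30):
        return True
```

Because in-flight calls may still complete after cancellation, the final token total can end up slightly above the budget.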
Handling failures
Two kinds of failure can occur:
- Per-candidate failure — the provider returned a response Releval could not parse, or it returned a grade outside the evaluation scale's range. The candidate is counted as failed, no judgment is written, and the rest of the run continues.
- Run failure — an unrecoverable error such as authentication failure or repeated provider outage. The run moves to a failed status with an error message describing the cause.
Common causes of per-candidate failures:
- The model returned JSON or Markdown instead of <candidate> blocks. Refine the prompt template.
- The model returned a grade outside the scale (e.g. 5 on a 0–4 graded scale). Make the scale range explicit in the prompt.
- The model returned grades for fewer candidates than were sent. Lower the batch size until it reliably handles every candidate in a batch.
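When iterating on a prompt template offline, it can help to apply the same kinds of checks to raw model output yourself. The sketch below illustrates the range check and the missing-candidate check described above. It is not Releval's parser: the function names, the integer-only grade format, and the candidate ID representation are assumptions.

```python
from typing import Iterable, List, Optional


def validate_grade(raw: str, scale_min: int, scale_max: int) -> Optional[int]:
    """Return the grade as an int, or None if it is unparseable or out of range."""
    try:
        grade = int(raw.strip())
    except ValueError:
        return None
    if scale_min <= grade <= scale_max:
        return grade
    return None


def missing_candidates(sent_ids: Iterable[str], returned_ids: Iterable[str]) -> List[str]:
    """List the candidate IDs that were sent in a batch but came back ungraded."""
    returned = set(returned_ids)
    return [cid for cid in sent_ids if cid not in returned]
```

For example, validate_grade("5", 0, 4) returns None, matching the out-of-range case on a 0–4 scale, and a non-empty result from missing_candidates suggests the batch size is too large for the model.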
For run failures, fix the underlying issue (rotate the API key, raise the provider quota, adjust the endpoint URL) and start a new judging run on the same evaluation run.
Reviewing AI-generated judgments
AI judgments appear alongside human ones in the judging UI, with the AI judge's name shown as the judge of record. Each judgment carries the <reasoning> text the model returned, which is invaluable for spot-checking quality.
A reasonable workflow once a run completes:
- Sort to bring up the candidates the AI marked highly relevant and confirm a few by hand. Hallucinated relevance is the most damaging error.
- Sort to bring up the candidates marked not relevant that sit at the top of the result list. Genuinely wrong results should appear here; if none do, the prompt may be too lenient.
- Adjust the prompt template if needed and re-run on a small evaluation to verify the fix before re-running on full corpora.
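If you review judgments outside the UI, the first two steps of this workflow can be sketched as a small selection helper. It assumes the judgments have been exported as a list of dictionaries with the hypothetical keys candidate, rank, grade, and reasoning; Releval's actual export format may differ.

```python
from typing import Dict, List, Tuple

Judgment = Dict[str, object]


def spot_check_sets(
    judgments: List[Judgment],
    min_grade: int,
    max_grade: int,
    top_k: int = 3,
    sample_size: int = 5,
) -> Tuple[List[Judgment], List[Judgment]]:
    """Pick two review sets, each ordered by result rank.

    First set: candidates given the highest grade (check for hallucinated relevance).
    Second set: candidates given the lowest grade within the top_k ranks
    (check whether the judge catches genuinely wrong results).
    """
    by_rank = sorted(judgments, key=lambda j: j["rank"])
    graded_high = [j for j in by_rank if j["grade"] == max_grade]
    rejected_at_top = [
        j for j in by_rank if j["grade"] == min_grade and j["rank"] <= top_k
    ]
    return graded_high[:sample_size], rejected_at_top[:sample_size]
```

Reading the reasoning field of each selected judgment is usually the quickest way to tell whether a surprising grade reflects a prompt problem or a real property of the result.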