AI Judges

AI Judges automate relevance assessment by using a large language model (LLM) to grade search results in place of, or alongside, human judges. An AI judge is a saved configuration: an AI Provider it talks to, a model, and a prompt template. Once configured, you can run it against any completed evaluation run to populate that run's judgments automatically.

AI judging is most useful when:

A query set is too large for human judges to grade exhaustively.
You need fast turnaround between iterations on ranking changes.
You want a consistent baseline rater you can re-run as your endpoints evolve.
You're seeding a new judgment list before refining it manually.

AI judging is not a replacement for thoughtful human judging on smaller, high-value query sets. Use AI judges to widen coverage, then sample and verify their work.

How AI judging works

The judge is applied to each unjudged (query, candidate) pair on the run, producing a grade and (optionally) reasoning per candidate. Progress is reported live in the UI. When the run finishes, evaluation metrics are recomputed against the new judgments.
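As a rough mental model, the pass described above can be sketched in Python. The `Pair` type, the `judge_fn` callable, the stub judge, and the `mean_grade` metric are all hypothetical stand-ins chosen for illustration; they are not the product's internals or its API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Pair:
    """One (query, candidate) pair on an evaluation run (hypothetical shape)."""
    query: str
    candidate: str
    grade: Optional[int] = None       # None means "not yet judged"
    reasoning: Optional[str] = None   # reasoning is optional


def apply_judge(pairs: List[Pair],
                judge_fn: Callable[[str, str], Tuple[int, str]]) -> int:
    """Grade only the unjudged pairs and return how many were graded."""
    graded = 0
    for pair in pairs:
        if pair.grade is not None:
            continue  # already judged, so the judge skips it
        pair.grade, pair.reasoning = judge_fn(pair.query, pair.candidate)
        graded += 1
    return graded


def mean_grade(pairs: List[Pair]) -> float:
    """Toy stand-in for recomputing evaluation metrics after judging."""
    judged = [p.grade for p in pairs if p.grade is not None]
    return sum(judged) / len(judged) if judged else 0.0


def stub_judge(query: str, candidate: str) -> Tuple[int, str]:
    """Stand-in for the LLM call: grade 1 if the query appears in the candidate."""
    hit = query.lower() in candidate.lower()
    return (1 if hit else 0), ("query term found" if hit else "query term missing")


pairs = [
    Pair("laptop", "Gaming laptop 15 inch", grade=1),  # judged earlier
    Pair("laptop", "Laptop sleeve"),
    Pair("laptop", "Desk lamp"),
]
graded = apply_judge(pairs, stub_judge)
print(graded, mean_grade(pairs))
```

Only the two unjudged pairs reach the judge in this sketch; the pair that already carries a grade keeps it.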

Visibility and ownership

Each AI judge is either personal (it has an owner) or shared (it has none):

Personal judges — owned by the member who created them. Visible only to that member.
Shared judges — created by an Admin with no owner set. Visible to all members.

Admins see every judge regardless of owner. Credentials live on the linked AI Provider, not on the judge, and are stored encrypted and never returned through the API after creation.
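The visibility rules above reduce to a small predicate. The sketch below is an illustration only: the `Judge` class, the `owner_id` field, and the `can_see` function are hypothetical names, not part of the product's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judge:
    name: str
    owner_id: Optional[str]  # None models a shared judge (no owner set)


def can_see(judge: Judge, member_id: str, is_admin: bool) -> bool:
    """Return True if the member should see the judge under the rules above."""
    if is_admin:
        return True                     # admins see every judge
    if judge.owner_id is None:
        return True                     # shared judges are visible to all members
    return judge.owner_id == member_id  # personal judges: owner only


personal = Judge("My draft judge", owner_id="member-1")
shared = Judge("Team baseline", owner_id=None)
print(can_see(personal, "member-2", is_admin=False))
print(can_see(shared, "member-2", is_admin=False))
```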

Creating an AI judge

Creating a judge takes two steps: create (or pick) the AI Provider, then create the judge that references it.

In the UI

The walkthrough below creates an Ollama-backed judge. The flow is the same for any provider — the credentials and required fields are documented per provider type on the Providers page.

Step by step:

Navigate to AI Judges and click Create AI Judge.
Enter a Name and optional Description.
Choose an existing AI Provider, or create one inline.
Enter the Model the provider should invoke (e.g. gpt-4o, claude-sonnet-4-20250514, or an Azure deployment name).
Supply a Prompt Template.
Adjust Max Concurrency (1–50, default 5) and Batch Size (1–20, default 5).
Toggle Include Images to send candidate images to a vision-capable model.
Click Create, then click Test to verify connectivity through to the model.
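If you script judge creation, it can help to check the two numeric settings against the ranges listed in the steps above before submitting anything. This is a minimal sketch: `validate_judge_settings` is a hypothetical helper, the batch-size default of 5 is taken from the tuning guidance on this page, and whether the server enforces the same limits is an assumption.

```python
from typing import List


def validate_judge_settings(max_concurrency: int = 5, batch_size: int = 5) -> List[str]:
    """Return a list of problems; an empty list means both values are in range."""
    problems = []
    if not 1 <= max_concurrency <= 50:
        problems.append(
            f"max_concurrency must be between 1 and 50, got {max_concurrency}"
        )
    if not 1 <= batch_size <= 20:
        problems.append(f"batch_size must be between 1 and 20, got {batch_size}")
    return problems


print(validate_judge_settings())                        # defaults are in range
print(validate_judge_settings(max_concurrency=80, batch_size=0))
```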

Using the API

curl -X POST "https://${RELEVAL_HOST}/api/v1/ai-judges" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "name": "GPT-4o (graded)",
    "description": "Default graded relevance judge",
    "ai_provider_id": "01JRVX6GG4XMSM55XBC3E518DY",
    "model": "gpt-4o",
    "prompt_template": "Rate the relevance of {{candidates.[0].title}} for the query \"{{query}}\" on a 0-4 scale.",
    "max_concurrency": 5,
    "batch_size": 5
  }'

The judge inherits its credentials and endpoint from the linked provider. To use different credentials, create another provider and point a new judge at it.
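For scripted setups, the same request can be built with Python's standard library. The sketch below mirrors the curl example field for field. The fallback host `releval.example.com` and the placeholder token are assumptions for illustration, and the response format is not documented here, so the block builds the request without sending it.

```python
import json
import os
import urllib.request


def build_create_judge_request(host: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    return urllib.request.Request(
        url=f"https://{host}/api/v1/ai-judges",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


payload = {
    "name": "GPT-4o (graded)",
    "description": "Default graded relevance judge",
    "ai_provider_id": "01JRVX6GG4XMSM55XBC3E518DY",
    "model": "gpt-4o",
    "prompt_template": (
        "Rate the relevance of {{candidates.[0].title}} "
        'for the query "{{query}}" on a 0-4 scale.'
    ),
    "max_concurrency": 5,
    "batch_size": 5,
}

request = build_create_judge_request(
    os.environ.get("RELEVAL_HOST", "releval.example.com"),  # placeholder host
    os.environ.get("TOKEN", "example-token"),               # placeholder token
    payload,
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the payload as a dictionary and serializing it with `json.dumps` avoids the manual quote escaping that the inline curl body needs.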

Info

Provider credentials are encrypted at rest, never returned through the API, and rotated via the provider — not the judge. See Rotating credentials.

Tuning concurrency and batching

Two settings control throughput and cost:

Setting	What it does	Trade-off
Max concurrency	Number of LLM calls in flight at once.	Higher = faster, but more likely to hit provider rate limits.
Batch size	Number of candidates included in one prompt.	Higher = fewer requests and cheaper per-candidate, but a single bad parse loses the whole batch.

For most workflows, the defaults (five concurrent calls, five candidates per batch) are a good starting point. Increase the batch size for long candidate lists (graded scales with deep cut-offs), and decrease it if the model frequently returns malformed responses.
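To get a feel for how the two settings interact, the arithmetic can be sketched as follows. `plan_requests` is a hypothetical helper, and it assumes a simplified model in which one query's candidates are split into equal batches and the resulting requests run in full waves of `max_concurrency`; real scheduling may differ.

```python
import math
from typing import Tuple


def plan_requests(candidates: int, batch_size: int = 5,
                  max_concurrency: int = 5) -> Tuple[int, int]:
    """Return (llm_requests, sequential_waves) for one list of candidates."""
    requests = math.ceil(candidates / batch_size)   # one prompt per batch
    waves = math.ceil(requests / max_concurrency)   # batches in flight together
    return requests, waves


print(plan_requests(100))                 # defaults: 5 per batch, 5 in flight
print(plan_requests(100, batch_size=20))  # larger batches, fewer requests
```

Under these assumptions, 100 candidates at the defaults need 20 requests in 4 waves, while a batch size of 20 needs 5 requests in a single wave. The saving comes at the cost noted in the table: one malformed response then affects 20 candidates instead of 5.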

Costs

Each AI judging run records the input and output token counts consumed across all of its LLM calls, so you can trace spend back to specific runs. Image judging multiplies token usage substantially: a vision model evaluating ten image-bearing candidates per query can cost 5–10x as much as the same run on text alone.
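Because input and output tokens are recorded separately, turning a run's counts into a spend estimate is simple arithmetic. In the sketch below, `estimate_cost` is a hypothetical helper, and both the token counts and the per-million-token prices are made-up placeholders; substitute your provider's actual rates.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Estimate spend for one judging run from its recorded token counts."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Placeholder numbers for illustration only, not real usage or pricing.
cost = estimate_cost(
    input_tokens=1_200_000,
    output_tokens=150_000,
    input_price_per_million=2.50,
    output_price_per_million=10.00,
)
print(f"Estimated cost: ${cost:.2f}")
```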

Testing connectivity

Two test actions are available, at different levels:

Test the provider — confirms that the credentials, endpoint, and a model of your choice are reachable. See Testing a provider.
Test the judge — issues a probe prompt through the judge's configured model. If image judging is enabled, an additional multimodal probe verifies the model accepts image input.

Neither test says anything about the judge's quality on real candidates — for that, run it on a small evaluation run first.

Next steps

Providers — credentials and configuration for each supported provider type.
Prompt templates — the variables, default template, and required response format.
Running an AI judging run — starting, monitoring, and cancelling judging runs.
