Running Evaluations

An Evaluation Run is a single execution of an evaluation. Each run sends every query in the query set to the search endpoint, collects the results, and prepares them for judgment.

Creating an Evaluation Run

In the UI

1. Navigate to Evaluations and select an evaluation.
2. Click Create Run.
3. Enter a Name for the run.
4. Select the Scale for judging relevance (see Scales for details).
5. Select the Metrics you want calculated (see Metrics for details).
6. Click Create.

Using the API

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "name": "Run 1 - Baseline",
    "scale": "graded",
    "metrics": ["NDCG@10", "MAP", "ERR@10", "MRR@10"]
  }'

Scale Options

Three scales are available. Each defines the granularity of relevance grades, so pick the one that matches how nuanced your judgments need to be. See Scales for the full reference.

Scale	Range	Use when
`binary`		Pass/fail relevance is enough — was this result relevant or not?
`graded`		Most evaluation work; distinguishes marginal / fair / highly / perfectly relevant.
`detailed`		Fine-grained ranking work where small differences in relevance matter.

Metrics

Metrics are specified by name, optionally followed by a cutoff depth written as @k. For example, NDCG@10 computes NDCG over the top 10 results, while MAP with no suffix applies no cutoff.

See Metrics for formulas and worked examples.
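
The name@k convention can be checked before a request is sent. The following is a minimal sketch in POSIX shell, not part of Releval: parse_metric is a hypothetical helper, and the rule that k must be a positive integer is an assumption rather than something the API is documented to enforce.

```shell
# parse_metric SPEC
# Hypothetical helper: splits a metric spec such as "NDCG@10" into its name
# and cutoff. Prints "<name> <k>", using "-" when no cutoff was given.
# Returns non-zero for an empty name or a cutoff that is not a positive integer.
parse_metric() {
  spec=$1
  case "$spec" in
    *@)  return 1 ;;                              # "NDCG@" has no cutoff value
    *@*) name=${spec%%@*}; k=${spec#*@} ;;        # split at the first "@"
    *)   name=$spec; k= ;;                        # no cutoff requested
  esac
  [ -n "$name" ] || return 1
  case "$k" in
    '')           printf '%s -\n' "$name" ;;
    *[!0-9]*|0*)  return 1 ;;                     # non-digits, zero, leading zero
    *)            printf '%s %s\n' "$name" "$k" ;;
  esac
}
```

Calling parse_metric 'NDCG@10' prints the name and the cutoff separately, and a malformed spec such as 'ERR@' yields a non-zero exit status that a script can test before building the metrics array.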

Overriding the template for a single run

By default, a new run inherits the query template attached to the evaluation. To compare ranking variants without forking the template, supply overrides on the create-run request — any of the request body, query string, content type, and headers can be replaced for that run only. A typical use is keeping the same endpoint and query set, but tweaking one knob between runs:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "name": "Re-rank weight 0.8",
    "scale": "graded",
    "metrics": ["NDCG@10", "MAP"],
    "body": "{ \"query\": { \"function_score\": { \"query\": { \"match\": { \"title\": \"{{query}}\" } }, \"weight\": 0.8 } } }"
  }'

Overrides are preserved on the run itself, so the comparison between runs can flag exactly which parts of the configuration changed.
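
Because the body override is a JSON document embedded as a string inside the JSON request, hand-escaping the inner quotes is easy to get wrong. The sketch below shows one way to generate the payload in POSIX shell. Both json_escape and build_override_payload are hypothetical helpers, the scale and metrics are fixed to the values used in the example above, and the escaping covers only backslashes and double quotes, so it assumes a single-line body with no control characters.

```shell
# json_escape STRING
# Hypothetical helper: escapes backslashes and double quotes so STRING can sit
# inside a JSON string literal. Newlines and control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_override_payload NAME BODY
# Hypothetical helper: prints a create-run payload whose "body" field carries
# BODY as an escaped string. Scale and metrics mirror the example above.
build_override_payload() {
  printf '{"name":"%s","scale":"graded","metrics":["NDCG@10","MAP"],"body":"%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage (not executed here): pass the generated payload to the create-run call.
#   curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
#     -H 'Content-Type: application/json' \
#     -H "Authorization: Bearer ${TOKEN}" \
#     -d "$(build_override_payload 'Re-rank weight 0.8' "$QUERY_BODY")"
```

Keeping the query body in its own variable or file means each run variant differs only in that one input, which makes the tweak between runs easier to review.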

Starting a Run

After creating a run, start it to begin query execution.

In the UI

Open the evaluation, locate the pending run, and choose Start from its actions menu. The status updates live as the run progresses through Queued → Running → Completed.

Using the API

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/start" \
  -H "Authorization: Bearer ${TOKEN}"

Run Status

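A run is always in exactly one of the following statuses. A successful run moves from Pending through Queued and Running to Completed, and may later become Locked; Cancelled and Failed end execution early.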

Status	Description
Pending	Run has been created but not started
Queued	Start has been requested; the run will begin shortly
Running	Queries are being executed against the endpoint
Completed	All queries have been executed and can be judged
Locked	Judgments are frozen; the run is read-only. See Locking a run.
Cancelled	The run was stopped before completing. See Cancelling runs.
Failed	An error occurred during execution
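
A script that waits for a run can group the statuses in the table into three phases. The sketch below is illustrative only: run_phase and wait_for_run are hypothetical helpers, the status strings are assumed to be spelled exactly as in the table, and because this page does not document an endpoint for reading a run's status, the command that fetches it is passed in by the caller.

```shell
# run_phase STATUS
# Hypothetical helper: maps a status from the table above to a phase.
#   waiting  = execution has not finished yet
#   finished = queries were executed (Completed, or Completed and then Locked)
#   stopped  = execution ended without completing
run_phase() {
  case "$1" in
    Pending|Queued|Running) echo waiting ;;
    Completed|Locked)       echo finished ;;
    Cancelled|Failed)       echo stopped ;;
    *)                      echo unknown; return 1 ;;
  esac
}

# wait_for_run FETCH_CMD [INTERVAL_SECONDS]
# Calls FETCH_CMD (any command that prints the current status) until the run
# leaves the waiting phase, then prints the final status.
# Exit status: 0 if finished, 1 if stopped, 2 on fetch error or unknown status.
wait_for_run() {
  fetch=$1
  interval=${2:-5}
  while :; do
    status=$($fetch) || return 2
    phase=$(run_phase "$status") || return 2
    if [ "$phase" != waiting ]; then
      echo "$status"
      if [ "$phase" = finished ]; then return 0; else return 1; fi
    fi
    sleep "$interval"
  done
}
```

Injecting the fetch command keeps the loop independent of how the status is obtained, so the same function works with whatever call your deployment exposes.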

Real-Time Progress

The UI displays progress updates live while a run is executing, so you can watch a long run advance without refreshing.

Viewing Results

Once a run completes, browse the queries table to see what came back from each query, and expand any row to inspect the candidates in detail.

In the UI

Select the completed run to see the list of queries and their results. Each query shows:

- The executed URL and request body
- The candidates returned by the search endpoint
- The position of each candidate in the results

Using the API

List queries in a run:

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/queries?page=1&page_size=20" \
  -H "Authorization: Bearer ${TOKEN}"

List results for a specific query:

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/queries/${QUERY_ID}/results?page=1&page_size=20" \
  -H "Authorization: Bearer ${TOKEN}"

Locking a run

A locked run is read-only; its judgments cannot be added, changed, or deleted, and it will not accept new AI judging runs. Locking is how Releval preserves a "result of record" for a configuration once you've moved on to iterating.

Automatic locking

Creating a new run in an evaluation automatically locks the most recent completed run in that evaluation. The intent is that the previous run is the baseline you just decided to iterate past, so its judgments should stop changing under your feet. If you need to keep editing the prior run's judgments, do it before kicking off the new run.

Locking manually

You can lock any completed run yourself — useful when you want to freeze a milestone result without starting another run yet:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/lock" \
  -H "Authorization: Bearer ${TOKEN}"

Only completed runs can be locked. Locked runs remain in metric trends and comparisons (see Comparing runs over time); locking prevents judgment churn, it does not hide the run.

Comparing runs over time

Two views let you see how relevance is moving across the runs in an evaluation: a per-evaluation trend across all runs, and a head-to-head between two runs.

Metric trends across an evaluation

The metric-trends endpoint returns each completed or locked run's metric values in chronological order, alongside a flag indicating whether the run's configuration changed from the previous one and which parts changed if so. This is what powers the "metric over time" chart in the UI, with a marker on the runs where the configuration changed.

curl "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/metrics/trends" \
  -H "Authorization: Bearer ${TOKEN}"

Head-to-head comparison

The comparison endpoint returns run-level metric deltas (absolute and relative) plus a per-query breakdown classifying each query as improved, regressed, or unchanged against the baseline. Both runs must belong to the same evaluation and be completed or locked.

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${CANDIDATE_RUN_ID}/compare/${BASELINE_RUN_ID}?page=1&page_size=50" \
  -H "Authorization: Bearer ${TOKEN}"

Use this to find the specific queries a configuration change helped or hurt — the natural follow-up is to open those queries' judgments and inspect what actually changed in the result list.
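
To make the delta and classification terms concrete, here is a small arithmetic sketch in shell and awk. It assumes the usual definitions, absolute delta = candidate - baseline and relative delta = absolute delta / baseline; Releval's exact formulas and any tolerance it applies are not stated on this page. metric_delta, classify_query, and the epsilon default are all hypothetical.

```shell
# metric_delta BASELINE CANDIDATE
# Prints "<absolute> <relative>" to four decimals. The relative delta is
# reported as "n/a" when the baseline is zero, where the ratio is undefined.
metric_delta() {
  awk -v b="$1" -v c="$2" 'BEGIN {
    d = c - b
    if (b == 0) printf "%.4f n/a\n", d
    else        printf "%.4f %.4f\n", d, d / b
  }'
}

# classify_query BASELINE CANDIDATE [EPSILON]
# Labels a per-query score change against the baseline. EPSILON is an assumed
# tolerance so tiny floating-point differences count as "unchanged".
classify_query() {
  awk -v b="$1" -v c="$2" -v eps="${3:-0.0001}" 'BEGIN {
    d = c - b
    if (d > eps)       print "improved"
    else if (d < -eps) print "regressed"
    else               print "unchanged"
  }'
}
```

For example, a baseline NDCG@10 of 0.50 and a candidate value of 0.55 give an absolute delta of 0.0500 and a relative delta of 0.1000, that is, a 10% improvement.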

Cloning a Run

You can clone an existing run to re-execute the same queries. This is useful when you've made changes to the search endpoint and want to compare results:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/clone" \
  -H "Authorization: Bearer ${TOKEN}"

This creates a new run with the same configuration and starts it.