Running Evaluations

An Evaluation Run is a single execution of an evaluation. Each run sends every query in the query set to the search endpoint, collects the results, and prepares them for judgment.

Creating an Evaluation Run

In the UI

  1. Navigate to Evaluations and select an evaluation
  2. Click Create Run
  3. Enter a Name for the run
  4. Select the Scale for judging relevance — see Scales for details
  5. Select the Metrics you want calculated — see Metrics for details
  6. Click Create

Using the API

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"name": "Run 1 - Baseline",
"scale": "graded",
"metrics": ["NDCG@10", "MAP", "ERR@10", "MRR@10"]
}'

Scale Options

Three scales are available — each defines the granularity of relevance grades. Pick the scale that matches how nuanced your judgments need to be. See Scales for the full reference.

Scale     Range  Use when
binary           Pass/fail relevance is enough: was this result relevant or not?
graded           Most evaluation work; distinguishes marginal / fair / highly / perfectly relevant.
detailed         Fine-grained ranking work where small differences in relevance matter.

Metrics

Metrics are specified by name, optionally with a cutoff depth using @k — for example, NDCG@10 computes NDCG over the top 10 results.

See Metrics for formulas and worked examples.
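
As a small illustration of the name@k convention, the shell sketch below splits a metric spec into its name and cutoff, which can help sanity-check a metrics list before sending it. The function name parse_metric is hypothetical and not part of Releval, and the server's own validation rules are not described on this page.

```shell
# parse_metric SPEC
# Hypothetical client-side helper (not part of Releval): splits "NAME@k"
# into "NAME k". Prints "NAME -" when no cutoff is given. Returns 1 when the
# name is empty or the cutoff is not a positive integer without leading zeros.
parse_metric() {
  local spec="$1" name k
  case "$spec" in
    *@*)
      name="${spec%@*}"    # everything before the last "@"
      k="${spec##*@}"      # everything after the last "@"
      [ -n "$name" ] || return 1
      case "$k" in
        ''|*[!0-9]*|0*) return 1 ;;
      esac
      printf '%s %s\n' "$name" "$k"
      ;;
    *)
      printf '%s -\n' "$spec"
      ;;
  esac
}
```

For example, parse_metric "NDCG@10" prints "NDCG 10", and parse_metric "MAP" prints "MAP -".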

Overriding the template for a single run

By default, a new run inherits the query template attached to the evaluation. To compare ranking variants without forking the template, supply overrides on the create-run request — any of the request body, query string, content type, and headers can be replaced for that run only. A typical use is keeping the same endpoint and query set, but tweaking one knob between runs:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"name": "Re-rank weight 0.8",
"scale": "graded",
"metrics": ["NDCG@10", "MAP"],
"body": "{ \"query\": { \"function_score\": { \"query\": { \"match\": { \"title\": \"{{query}}\" } }, \"weight\": 0.8 } } }"
}'

Overrides are preserved on the run itself, so the comparison between runs can flag exactly which parts of the configuration changed.
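
Hand-escaping a JSON query inside a JSON string, as in the body field of the request above, is easy to get wrong. The following is a minimal sketch of one way to generate the payload in shell, assuming the template is a single line with no control characters. The helper names json_escape and override_payload are illustrative and not part of Releval.

```shell
# json_escape TEXT
# Escapes backslashes and double quotes so TEXT can sit inside a JSON string.
# Assumption: TEXT is a single line with no control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# override_payload NAME BODY_TEMPLATE
# Prints a create-run request body with the query template passed as "body".
# The scale and metrics are fixed here to mirror the request shown above.
override_payload() {
  printf '{"name": "%s", "scale": "graded", "metrics": ["NDCG@10", "MAP"], "body": "%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage (not executed here):
# template='{ "query": { "function_score": { "query": { "match": { "title": "{{query}}" } }, "weight": 0.8 } } }'
# curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
#   -H 'Content-Type: application/json' \
#   -H "Authorization: Bearer ${TOKEN}" \
#   -d "$(override_payload 'Re-rank weight 0.8' "$template")"
```

Keeping the template in a variable makes it straightforward to change a single value, such as the weight, between runs without touching the escaping.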

Starting a Run

After creating a run, start it to begin query execution.

In the UI

Open the evaluation, locate the pending run, and choose Start from its actions menu. The status updates live as the run progresses through Queued → Running → Completed.

Using the API

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/start" \
-H "Authorization: Bearer ${TOKEN}"

Run Status

Each run progresses through the following statuses:

Status     Description
Pending    Run has been created but not started
Queued     Start has been requested; the run will begin shortly
Running    Queries are being executed against the endpoint
Completed  All queries have been executed and can be judged
Locked     Judgments are frozen; the run is read-only. See Locking a run.
Cancelled  The run was stopped before completing. See Cancelling runs.
Failed     An error occurred during execution
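
A script that starts a run usually has to wait for it to settle before reading results. The sketch below shows one way to poll, under two assumptions: the status strings match the names listed above (compared case-insensitively), and you supply your own command for fetching the current status, since this page does not document a status endpoint. The names is_settled and wait_for_run are illustrative.

```shell
# is_settled STATUS
# Succeeds when the run will not change status again without user action.
is_settled() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    completed|locked|cancelled|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_run FETCH_CMD [INTERVAL_SECONDS] [MAX_POLLS]
# FETCH_CMD is a caller-supplied command or function that prints the run's
# current status (hypothetical; how to fetch it is not covered on this page).
# Prints the final status. Returns 0 for Completed or Locked, 1 for
# Cancelled or Failed, and 2 if the run has not settled after MAX_POLLS polls.
wait_for_run() {
  local fetch="$1" interval="${2:-5}" max="${3:-120}" status i=0
  while [ "$i" -lt "$max" ]; do
    status="$("$fetch")"
    if is_settled "$status"; then
      printf '%s\n' "$status"
      case "$(printf '%s' "$status" | tr '[:upper:]' '[:lower:]')" in
        completed|locked) return 0 ;;
        *) return 1 ;;
      esac
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "run did not settle after $max polls" >&2
  return 2
}
```

Passing the fetch step in as a command keeps the loop independent of how the status is obtained, and the poll cap prevents a stuck run from blocking a pipeline indefinitely.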

Real-Time Progress

The UI displays progress updates live while a run is executing, so you can watch a long run advance without refreshing.

Viewing Results

Once a run completes, browse the queries table to see what came back from each query, and expand any row to inspect the candidates in detail.

In the UI

Select the completed run to see the list of queries and their results. Each query shows:

  • The executed URL and request body
  • The candidates returned by the search endpoint
  • The position of each candidate in the results

Using the API

List queries in a run:

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/queries?page=1&page_size=20" \
-H "Authorization: Bearer ${TOKEN}"

List results for a specific query:

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/queries/${QUERY_ID}/results?page=1&page_size=20" \
-H "Authorization: Bearer ${TOKEN}"

Locking a run

A locked run is read-only; its judgments cannot be added, changed, or deleted, and it will not accept new AI judging runs. Locking is how Releval preserves a "result of record" for a configuration once you've moved on to iterating.

Automatic locking

Creating a new run in an evaluation automatically locks the most recent completed run in that evaluation. The intent is that the previous run is the baseline you just decided to iterate past, so its judgments should stop changing under your feet. If you need to keep editing the prior run's judgments, do it before creating the new run.

Locking manually

You can lock any completed run yourself — useful when you want to freeze a milestone result without starting another run yet:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/lock" \
-H "Authorization: Bearer ${TOKEN}"

Only completed runs can be locked. Locked runs remain in metric trends and comparisons (see below) — locking is about preventing judgment churn, not hiding the run.

Comparing runs over time

Two views let you see how relevance is moving across the runs in an evaluation: a per-evaluation trend across all runs, and a head-to-head between two runs.

Metric trends

The metric-trends endpoint returns each completed or locked run's metric values in chronological order, alongside a flag indicating whether the run's configuration changed from the previous one and which parts changed if so. This is what powers the "metric over time" chart in the UI, with a marker on the runs where the configuration changed.

curl "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/metrics/trends" \
-H "Authorization: Bearer ${TOKEN}"

Head-to-head comparison

The comparison endpoint returns run-level metric deltas (absolute and relative) plus a per-query breakdown classifying each query as improved, regressed, or unchanged against the baseline. Both runs must belong to the same evaluation and be completed or locked.

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${CANDIDATE_RUN_ID}/compare/${BASELINE_RUN_ID}?page=1&page_size=50" \
-H "Authorization: Bearer ${TOKEN}"

Use this to find the specific queries a configuration change helped or hurt — the natural follow-up is to open those queries' judgments and inspect what actually changed in the result list.
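
To make the two delta types concrete, here is a small worked sketch. It assumes the relative delta is the absolute delta divided by the baseline value and that higher metric values are better. The endpoint's exact definitions, and any tolerance it applies before calling a query unchanged, are not given on this page. The function names are illustrative.

```shell
# metric_delta BASELINE CANDIDATE
# Prints "<absolute> <relative>" to four decimals.
# Assumption: relative = (candidate - baseline) / baseline; "n/a" if baseline is 0.
metric_delta() {
  awk -v b="$1" -v c="$2" 'BEGIN {
    d = c - b
    if (b == 0) printf "%.4f n/a\n", d
    else        printf "%.4f %.4f\n", d, d / b
  }'
}

# classify_delta DELTA [TOLERANCE]
# Labels a per-query delta, treating |DELTA| <= TOLERANCE as unchanged.
# TOLERANCE defaults to 0 and is a hypothetical knob, not a Releval setting.
classify_delta() {
  awk -v d="$1" -v eps="${2:-0}" 'BEGIN {
    if (d > eps)       print "improved"
    else if (d < -eps) print "regressed"
    else               print "unchanged"
  }'
}
```

For example, metric_delta 0.50 0.55 prints "0.0500 0.1000": an absolute gain of 0.05, which is a 10% relative gain over the baseline.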

Cloning a Run

You can clone an existing run to re-execute the same queries. This is useful when you've made changes to the search endpoint and want to compare results:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/clone" \
-H "Authorization: Bearer ${TOKEN}"

This creates a new run with the same configuration and starts it.