Prompt templates

Each AI judge has a required Handlebars prompt template that defines what the model is asked, and the shape of the response it must return. The UI pre-fills a sensible default when you create a judge, which you can take as-is or adapt to your domain.

Available variables

The Handlebars context exposes the following variables:

Variable	Type	Description
`query`	string \| object	The query value from the query set. A string for plain text query sets, or the JSON object for structured query sets.
`candidates`	array	Candidates being judged in this prompt. Length is 1 unless `batch_size > 1`.
`candidates[].id`	string	The candidate's ID. Must be referenced in the response so grades can be matched back to candidates.
`candidates[].title`	string	The candidate's title (from candidates mapping).
`candidates[].fields`	object	Additional mapped fields (price, brand, description, etc.). May be absent.
`candidates[].image`	string	The candidate's image. Only present when `include_images: true` and the candidate has a mapped image.
`scale_min`	number	The lowest grade the evaluation scale accepts.
`scale_max`	number	The highest grade the scale accepts.
`scale_labels`	array	Each entry has `grade` (number) and `label` (human-readable name like "Highly Relevant").
`evaluation_name`	string	The parent evaluation's name. Useful for context-setting prompts.
`evaluation_description`	string	The parent evaluation's description, if set.

Standard Handlebars syntax is supported, including {{#each}}, {{#if}}, {{else}}, and the helpers Releval registers for query templates.

When to customise the default

The pre-filled template works well for most domains. Adapt it when you need to:

Inject domain-specific guidelines (e.g. "treat refurbished items as Marginally Relevant").
Constrain the model to a stricter rubric.
Localise the prompt (the default is English-only).
Emphasise particular fields (e.g. brand, category) that appear in candidates[].fields.

Required response format

Whatever your template asks, the model's response must contain one <candidate> block per candidate sent in the prompt:

<candidate id="abc123">
<reasoning>
The query asked for "running shoes" and this candidate is a hiking boot.
The product description does not mention running.
</reasoning>
<grade>
1
</grade>
</candidate>

Three requirements apply. If any fails, the affected candidates are not judged:

Every candidate ID sent in the prompt must appear in exactly one <candidate> block.
Each <grade> must be an integer.
Each grade must be within the evaluation scale's range (e.g. 0–4 for graded).

The <reasoning> block is optional but strongly recommended — it is stored on the judgment and shown in the UI alongside the grade.

Info

The response must contain <candidate id="...">…</candidate> blocks with <grade> and <reasoning> children. Don't ask the model to return JSON or Markdown tables.

A worked custom template

Here's a custom template that adds an e-commerce rubric and explicitly asks the model to penalise out-of-stock items:

You are evaluating product search relevance for an online grocery store.

The shopper searched for: **{{query}}**

Rate each candidate on this scale:
{{#each scale_labels}}
- {{this.grade}} ({{this.label}})
{{/each}}

Apply these rules:
- Exact category and intent match → top of scale.
- Acceptable substitute (e.g. shopper searched "milk", saw "oat milk") → middle.
- Same product type, wrong attribute (e.g. "whole milk" → "skim milk") → middle-low.
- Out of stock → drop one grade.
- Off-topic → 0.

Candidates to judge:
{{#each candidates}}
- ID `{{this.id}}`: {{this.title}}{{#if this.fields.price}} (£{{this.fields.price}}){{/if}}{{#if this.fields.in_stock}}{{else}} [OUT OF STOCK]{{/if}}
{{/each}}

For each candidate, return:

{{#each candidates}}
<candidate id="{{this.id}}">
<reasoning>One sentence on why.</reasoning>
<grade>{{../scale_min}}-{{../scale_max}}</grade>
</candidate>
{{/each}}

Iterating on a template

The prompt template lives on the judge, so changes apply to all future runs that use it. To test a template change without burning credits on a full run:

Clone the judge (or create a sibling using a free local provider such as ollama).
Run it on a small evaluation run (10–20 queries).
Inspect the per-candidate reasoning in the UI.
Refine the template, repeat.
Once stable, point your full evaluations at the production judge.

When you're ready, see Running an AI judging run to start judging.

Available variables​

When to customise the default​

Required response format​

A worked custom template​

Iterating on a template​

Available variables

When to customise the default

Required response format

A worked custom template

Iterating on a template