Skip to main content

Prompt templates

Each AI judge has a required Handlebars prompt template that defines what the model is asked, and the shape of the response it must return. The UI pre-fills a sensible default when you create a judge, which you can take as-is or adapt to your domain.

Available variables

The Handlebars context exposes the following variables:

VariableTypeDescription
querystring | objectThe query value from the query set. A string for plain text query sets, or the JSON object for structured query sets.
candidatesarrayCandidates being judged in this prompt. Length is 1 unless batch_size > 1.
candidates[].idstringThe candidate's ID. Must be referenced in the response so grades can be matched back to candidates.
candidates[].titlestringThe candidate's title (from candidates mapping).
candidates[].fieldsobjectAdditional mapped fields (price, brand, description, etc.). May be absent.
candidates[].imagestringThe candidate's image. Only present when include_images: true and the candidate has a mapped image.
scale_minnumberThe lowest grade the evaluation scale accepts.
scale_maxnumberThe highest grade the scale accepts.
scale_labelsarrayEach entry has grade (number) and label (human-readable name like "Highly Relevant").
evaluation_namestringThe parent evaluation's name. Useful for context-setting prompts.
evaluation_descriptionstringThe parent evaluation's description, if set.

Standard Handlebars syntax is supported, including {{#each}}, {{#if}}, {{else}}, and the helpers Releval registers for query templates.

When to customise the default

The pre-filled template works well for most domains. Adapt it when you need to:

  • Inject domain-specific guidelines (e.g. "treat refurbished items as Marginally Relevant").
  • Constrain the model to a stricter rubric.
  • Localise the prompt (the default is English-only).
  • Emphasise particular fields (e.g. brand, category) that appear in candidates[].fields.

Required response format

Whatever your template asks, the model's response must contain one <candidate> block per candidate sent in the prompt:

<candidate id="abc123">
<reasoning>
The query asked for "running shoes" and this candidate is a hiking boot.
The product description does not mention running.
</reasoning>
<grade>
1
</grade>
</candidate>

Three requirements apply. If any fails, the affected candidates are not judged:

  1. Every candidate ID sent in the prompt must appear in exactly one <candidate> block.
  2. Each <grade> must be an integer.
  3. Each grade must be within the evaluation scale's range (e.g. 0–4 for graded).

The <reasoning> block is optional but strongly recommended — it is stored on the judgment and shown in the UI alongside the grade.

Info

The response must contain <candidate id="...">…</candidate> blocks with <grade> and <reasoning> children. Don't ask the model to return JSON or Markdown tables.

A worked custom template

Here's a custom template that adds an e-commerce rubric and explicitly asks the model to penalise out-of-stock items:

You are evaluating product search relevance for an online grocery store.

The shopper searched for: **{{query}}**

Rate each candidate on this scale:
{{#each scale_labels}}
- {{this.grade}} ({{this.label}})
{{/each}}

Apply these rules:
- Exact category and intent match top of scale.
- Acceptable substitute (e.g. shopper searched "milk", saw "oat milk") middle.
- Same product type, wrong attribute (e.g. "whole milk" "skim milk") middle-low.
- Out of stock drop one grade.
- Off-topic 0.

Candidates to judge:
{{#each candidates}}
- ID `{{this.id}}`: {{this.title}}{{#if this.fields.price}} (£{{this.fields.price}}){{/if}}{{#if this.fields.in_stock}}{{else}} [OUT OF STOCK]{{/if}}
{{/each}}

For each candidate, return:

{{#each candidates}}
<candidate id="{{this.id}}">
<reasoning>One sentence on why.</reasoning>
<grade>{{../scale_min}}-{{../scale_max}}</grade>
</candidate>
{{/each}}

Iterating on a template

The prompt template lives on the judge, so changes apply to all future runs that use it. To test a template change without burning credits on a full run:

  1. Clone the judge (or create a sibling using a free local provider such as ollama).
  2. Run it on a small evaluation run (10–20 queries).
  3. Inspect the per-candidate reasoning in the UI.
  4. Refine the template, repeat.
  5. Once stable, point your full evaluations at the production judge.

When you're ready, see Running an AI judging run to start judging.