
Introduction

Releval is a search relevance evaluation platform. Connect your search system, run real queries against it, score the results, and watch the metrics that matter — NDCG, MAP, MRR, ERR, Precision, Recall — move with every change you ship.

Whether you're tuning e-commerce search, retrieval for an LLM agent, or an in-app search box, Releval gives you a measurement loop you can actually trust: consistent benchmarks, real judgments (human or AI), and the head-to-head comparisons that turn ranking opinions into ranking decisions.

Why measure search relevance?

Search relevance is one of the few product surfaces where intuition reliably misleads. A ranker tweak that helps your three favourite queries can quietly tank the long tail. A new analyzer that fixes one synonym class can break two others. A new query category — like the one users started searching for last quarter — gets no signal at all unless someone is looking. Without measurement, every change is a guess, and every regression is discovered the slow way: by users.

A measurement loop fixes that. You define the queries you care about, judge what "good" looks like for each, and let metrics tell you whether your latest change moved the needle or knocked things over. The loop pays for itself the first time it catches a regression before it ships.

What Releval does

Releval gives you the pieces of that loop and stitches them together:

  • Search Endpoints model where your queries run — Elasticsearch, OpenSearch, Solr, Vespa, any HTTP search API, or a rendered search results page. The same query set can point at production, staging, or any A/B variant by swapping endpoints.
  • Query Sets are the queries you want to score, curated from real user logs, trending topics, long-tail searches, or known problem queries.
  • Query Templates turn each query into the request shape your endpoint expects — parameterised, reusable, and editable in a Monaco-powered editor with Handlebars variables, conditionals, and iteration.
  • Evaluation Runs execute the query set against an endpoint with a template, capture every response, and prepare the candidates for judgment.
  • Judgments rate each result on a binary, graded, or detailed scale. Judge by hand in a keyboard-shortcut-driven UI, by AI (OpenAI, Anthropic, Bedrock, Azure OpenAI, Ollama, or any OpenAI-compatible endpoint, including vision models for multimodal evaluation), or both. Judgments are carried forward automatically to later runs that return the same query/result pair.
  • Metrics are computed at run, query, and candidate level: NDCG, MAP, MRR, ERR, Precision, Recall, and F-Score, each at any depth (@k). Compare runs side by side with per-query deltas, and quantify rank-list churn between configurations with similarity metrics like Rank-Biased Overlap (RBO) and Jaccard.
  • User Behaviour Insights ingests real user queries, clicks, and impressions over REST or gRPC and stores them in ClickHouse. Use the in-app SQL workspace to derive click models and implicit relevance judgments straight from production traffic.
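
To make the metric names above concrete, here is a minimal Python sketch of three of them, using common textbook definitions. It is not Releval's implementation: the function names are hypothetical, NDCG is shown in its exponential-gain form with a log2 discount, and the ideal ordering is taken from the grades of the returned list only. Other variants (linear gain, ideal ordering over all judged documents) give different numbers.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k graded results.

    grades: relevance grades in ranked order (e.g. 0-3), top result first.
    """
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(grades, threshold=1):
    """1 / rank of the first result graded at or above threshold, else 0.

    Averaging this value over a query set gives MRR.
    """
    for rank, g in enumerate(grades, start=1):
        if g >= threshold:
            return 1.0 / rank
    return 0.0

def jaccard(ids_a, ids_b):
    """Set overlap between two result lists, ignoring rank order."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Example: one query, four results graded on a 0-3 scale.
grades = [3, 2, 0, 1]
score = ndcg_at_k(grades, k=4)          # slightly below 1.0: the last two results are swapped
first_hit = reciprocal_rank([0, 0, 2])  # first relevant result is at rank 3
churn = jaccard(["a", "b", "c"], ["b", "c", "d"])
```

Jaccard ignores position, which is why a rank-aware measure such as RBO is usually paired with it when comparing two configurations.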

Everything in the UI is also exposed over REST, gRPC, and an MCP server. Pair that with App Clients for API-key authentication and you can drive evaluations from CI to gate ranking changes on relevance regressions, the same way unit tests gate code.
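
As a rough illustration of the CI gating idea, the sketch below compares per-query scores from a baseline run and a candidate run and decides whether the change passes. It assumes the scores have already been fetched into plain dictionaries. The function name `relevance_gate`, the `max_drop` tolerance, and the data shape are all hypothetical, and no Releval API call is shown.

```python
def relevance_gate(baseline, candidate, max_drop=0.01):
    """Decide whether a candidate run passes a relevance gate.

    baseline, candidate: dicts mapping query text to a metric value
    (for example NDCG@10) for that query. Hypothetical data shape.
    Returns (passed, mean_delta, regressed_queries).
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("baseline and candidate share no queries")

    # Per-query delta: positive means the candidate improved that query.
    deltas = {q: candidate[q] - baseline[q] for q in shared}
    mean_delta = sum(deltas.values()) / len(deltas)
    regressed = [q for q in shared if deltas[q] < 0]

    # Pass only if the average score did not fall by more than max_drop.
    return mean_delta >= -max_drop, mean_delta, regressed

# Example with made-up scores for two queries.
baseline_scores = {"red shoes": 0.80, "usb c cable": 0.60}
candidate_scores = {"red shoes": 0.70, "usb c cable": 0.65}
passed, mean_delta, regressed = relevance_gate(baseline_scores, candidate_scores)
```

In a CI job, a failing gate would typically print the regressed queries and exit with a non-zero status so the pipeline stops, much as a failing unit test would. Gating on the mean alone can hide a large drop on a few queries, so a stricter gate might also cap the worst per-query delta.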

Who uses Releval?

  • Search and ML engineers tuning rankers, query DSLs, or learning-to-rank models who need to know whether a change is actually a win.
  • Product managers tracking relevance KPIs across releases and reporting on them.
  • Data scientists building click models from behaviour and using them as implicit ground truth for model iteration.
  • QA and release teams gating launches on a relevance baseline rather than a smoke-test sample.

Where to start