Candidates Mapping
In order for Releval to work with search results, they need to be transformed into a JSON structure that Releval understands. Releval supports
for this purpose.
Structure
Search endpoints must output a structure that adheres to the following definition, shown below as a TypeScript interface and as a JSON Schema:
- TypeScript
- JSON Schema
interface SearchResult {
total: number,
candidates: {
id: string,
title: string,
image?: string,
fields?: {
[key: string]: unknown
}
}[]
}
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "SearchResult",
"type": "object",
"required": ["total", "candidates"],
"properties": {
"total": {
"type": "integer",
"description": "Total number of candidates at the search endpoint that match the query"
},
"candidates": {
"type": "array",
"description": "Top N matching candidates for the query",
"items": {
"type": "object",
"required": ["id", "title"],
"properties": {
"id": { "type": "string" },
"title": { "type": "string" },
"image": {
"type": "string",
"format": "uri",
"description": "Public URL for an image"
},
"fields": {
"type": "object",
"description": "Additional fields for the candidate",
"additionalProperties": true
}
}
}
}
}
}
Concretely, the Search endpoint output must
- Be JSON
- have a
"total"integer field, which represents the total number of candidates at the search endpoint that match the given query - have a
"candidates"array field of candidates, which are the topNmatching candidates for the query.
Each candidate JSON object must have
- an
"id"string field - a
"title"string field
and can optionally have
- an
"image"string field that is a public URL for an image - a
"fields"JSON object field that can contain other fields for the candidate
Examples
Here is a minimal example of a search endpoint output of the top 10 candidates for a query, where the search endpoint has 52 candidates matching the query:
{
"total": 52,
"candidates": [
{ "id": "1", "title": "Candidate 1" },
{ "id": "2", "title": "Candidate 2" },
{ "id": "3", "title": "Candidate 3" },
{ "id": "4", "title": "Candidate 4" },
{ "id": "5", "title": "Candidate 5" },
{ "id": "6", "title": "Candidate 6" },
{ "id": "7", "title": "Candidate 7" },
{ "id": "8", "title": "Candidate 8" },
{ "id": "9", "title": "Candidate 9" },
{ "id": "10", "title": "Candidate 10" }
]
}
Here is a richer example where each candidate also has an image and fields:
{
"total": 190004,
"candidates": [
{ "id": "1", "title": "Candidate 1", "image": "https://example.com/1.jpg", "fields": { "location": "New York", "interests": ["photography", "graphic design"] }},
{ "id": "2", "title": "Candidate 2", "image": "https://example.com/2.jpg", "fields": { "location": "San Francisco", "interests": ["surfing", "cycling"] }},
{ "id": "3", "title": "Candidate 3", "image": "https://example.com/3.jpg", "fields": { "location": "Chicago", "interests": ["cooking", "piano"] }},
{ "id": "4", "title": "Candidate 4", "image": "https://example.com/4.jpg", "fields": { "location": "Austin", "interests": ["hiking", "stand-up comedy"] }},
{ "id": "5", "title": "Candidate 5", "image": "https://example.com/5.jpg", "fields": { "location": "Seattle", "interests": ["coffee roasting", "indie games"] }},
{ "id": "6", "title": "Candidate 6", "image": "https://example.com/6.jpg", "fields": { "location": "Boston", "interests": ["history documentaries", "chess"] }},
{ "id": "7", "title": "Candidate 7", "image": "https://example.com/7.jpg", "fields": { "location": "Los Angeles", "interests": ["film editing", "street art"] }},
{ "id": "8", "title": "Candidate 8", "image": "https://example.com/8.jpg", "fields": { "location": "Denver", "interests": ["snowboarding", "rock climbing"] }},
{ "id": "9", "title": "Candidate 9", "image": "https://example.com/9.jpg", "fields": { "location": "Portland", "interests": ["brewery tours", "bookbinding"] }},
{ "id": "10", "title": "Candidate 10", "image": "https://example.com/10.jpg", "fields": { "location": "Miami", "interests": ["salsa dancing", "marine biology"] }}
]
}
JMESPath
A JMESPath expression can be specified to transform a JSON response from a search endpoint into the required JSON structure. The JMESPath expression parser is fully compliant with the JMESPath specification.
A JMESPath expression is used for endpoint types:
Built-in expressions
Releval comes with built-in default expressions for the supported endpoint types.
It is recommended to copy the built-in default expression for the endpoint type and then change it to
use better fields for "title" and "image".
Elasticsearch
A default JMESPath expression to work with all versions of Elasticsearch:
{
total: (hits.total.value || hits.total),
candidates: hits.hits[].{
id: _id,
title: _id,
image: null,
fields: _source
}
}
Total is mapped to the hits total, whether using
- an older version of Elasticsearch that does not include
"relation" rest_total_hits_as_int- the default
totalobject that does includesrelation
Candidates are mapped using
- the
_idfield for the candidate id - the
idfield for the candidate title. You should change this to use a better field. nullfor the candidate image. You should change this to use a better field.- the
_sourcefield for the candidate fields.
OpenSearch
A default JMESPath expression to work with all versions of OpenSearch:
{
total: (hits.total.value || hits.total),
candidates: hits.hits[].{
id: _id,
title: _id,
image: null,
fields: _source
}
}
Total is mapped to the hits total, whether using
- an older version of OpenSearch that does not include
"relation" rest_total_hits_as_int- the default
totalobject that does includesrelation
Candidates are mapped using
- the
_idfield for the candidate id - the
idfield for the candidate title. You should change this to use a better field. nullfor the candidate image. You should change this to use a better field.- the
_sourcefield for the candidate fields.
Solr
A default JMESPath expression to work with Solr:
{
total: response.numFound,
candidates: response.docs[].{
id: id,
title: id,
image: null,
fields: @
}
}
Vespa
A default JMESPath expression to work with Vespa:
{
total: root.fields.totalCount,
candidates: root.children[].{
id: id,
title: id,
image: null,
fields: fields
}
}
Examples
Whilst testing a search endpoint, it can be useful to emit the verbatim JSON response from the search endpoint. This can be achieved with the following candidates mapping:
@
A minimum example for an endpoint that returns an array of numeric ids like [1,2,3,4,5,6] is:
{
total: length(@),
candidates: [].{
id: to_string(@),
title: to_string(@)
}
}
which outputs
{
"total": 6,
"candidates": [
{"id":"1","title":"1"},
{"id":"2","title":"2"},
{"id":"3","title":"3"},
{"id":"4","title":"4"},
{"id":"5","title":"5"},
{"id":"6","title":"6"}
]
}
JavaScript
A JavaScript script can be specified to transform any HTTP response from a search endpoint into the required JSON structure. The script is executed in a browser context that has access to the Document Object Model (DOM) as well as any scripts loaded as part of the response. For example, if the search page loads jQuery as part of the response, your JavaScript script will have access to jQuery.
A JavaScript script is used for endpoint types:
The return value of the JavaScript script must be a JSON string that conforms to the required
JSON structure. The simplest way to do this is to build a JavaScript object then use
JSON.stringify() to
convert the object to a string.
The JavaScript script can be any valid script, with full support for ECMAScript 2022 (ECMA-262 13th Edition). Typically, it will be a function or arrow function that contains all the necessary logic inside its body to build a response, and return that response from the function.
Built-in script
Releval comes with a default JavaScript script to help get you started:
() => {
const content = document.documentElement.outerHTML;
// TODO: parse content and return a JSON string e.g.
// return JSON.stringify({
// total: 1,
// candidates: [
// {
// id: 'id',
// title: 'title',
// image: 'image url',
// fields: {
// additionalField: 'optional'
// }
// }
// ]
// });
return content;
}
Since the response for search pages can have any structure, the built-in script returns the document HTML by default, which must
be modified to create the JSON structure required from the search page. A good way to iterate on a script is using
the console in Chrome's DevTools with the search page loaded. Then, when you have a suitable script working in
Chrome DevTools, copy it to the candidates mapping section of your page endpoint, and test it there.
Working with the shadow DOM
If the search page uses the shadow DOM to isolate web components, you may need to first replace the content of each element containing shadow DOM with the HTML of the shadow DOM before your able to extract values from the page. A node with a shadow DOM looks as follows in the Chrome DevTools Elements tab:
<div class="ng-star-inserted">
<div class="ng-star-inserted">
#shadow-root (open) <!-- <== this is shadow DOM for the element -->
</div>
</div>
A function to replace shadow DOMs with their HTML is:
() => {
// from https://docs.apify.com/academy/node-js/scraping-shadow-doms
// license under Apache 2.0: https://github.com/apify/apify-docs/blob/master/LICENSE
const getShadowDomHtml = (shadowRoot) => {
let shadowHTML = '';
for (const elem of shadowRoot.childNodes) {
shadowHTML += elem.nodeValue || elem.outerHTML;
}
return shadowHTML;
};
const replaceShadowDomsWithHtml = (rootElement) => {
for (const elem of rootElement.querySelectorAll('*')) {
if (elem.shadowRoot) {
replaceShadowDomsWithHtml(elem.shadowRoot);
elem.innerHTML += getShadowDomHtml(elem.shadowRoot);
}
}
};
replaceShadowDomsWithHtml(document.body);
// now work with the document and return a value.
}
Examples
This example uses the Big W search page results, with a search for makeup:
https://www.bigw.com.au/search?text=makeup
() => {
const tiles = document.querySelectorAll('article[data-testid="product-tile"]');
const candidates = Array.from(tiles).map(tile => {
const link = tile.querySelector('a[data-product-code]');
const id = link?.dataset.productCode ?? '';
const title = tile.getAttribute('aria-label') ?? '';
const image = tile.querySelector('.ProductImage img')?.src ?? null;
const price = tile.querySelector(
'.PriceSection [class*="VisuallyHidden"]'
)?.textContent ?? null;
const href = link?.getAttribute('href');
return {
id,
title,
image,
fields: {
price,
url: href ? new URL(href, window.location.origin).toString() : null
}
};
});
return JSON.stringify({
total: candidates.length,
candidates
});
}
The Big W product grid is rendered as <article data-testid="product-tile"> elements inside a
<ul data-testid="product-grid">. Each tile carries the product code as a data-product-code
attribute on its anchor, the title as the tile's aria-label, and the image as a normal
<img> inside .ProductImage. The price is captured into fields for downstream use.
The Big W search page does not expose a total result count as a stable, easily-targetable
element, so this script reports total as the number of tiles rendered on the page. If a
selector for the result count is available on the page you are scraping, prefer that — it
allows Releval to report the corpus-wide hit count rather than just the rendered page.