Candidates Mapping

For Releval to work with search results, they must be transformed into a JSON structure that Releval understands. Releval supports two ways of writing this mapping:

JMESPath
JavaScript

Structure

Search endpoints must output a structure that adheres to the following definition, shown below as a TypeScript interface and as a JSON Schema:

TypeScript

interface SearchResult {
    total: number,
    candidates: {
        id: string,
        title: string,
        image?: string,
        fields?: {
            [key: string]: unknown
        }
    }[]
}

JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "SearchResult",
  "type": "object",
  "required": ["total", "candidates"],
  "properties": {
    "total": {
      "type": "integer",
      "description": "Total number of candidates at the search endpoint that match the query"
    },
    "candidates": {
      "type": "array",
      "description": "Top N matching candidates for the query",
      "items": {
        "type": "object",
        "required": ["id", "title"],
        "properties": {
          "id": { "type": "string" },
          "title": { "type": "string" },
          "image": {
            "type": "string",
            "format": "uri",
            "description": "Public URL for an image"
          },
          "fields": {
            "type": "object",
            "description": "Additional fields for the candidate",
            "additionalProperties": true
          }
        }
      }
    }
  }
}

Concretely, the search endpoint output must

be JSON
have a "total" integer field, which represents the total number of candidates at the search endpoint that match the given query
have a "candidates" array field of candidates, which are the top N matching candidates for the query.

Each candidate JSON object must have

an "id" string field
a "title" string field

and can optionally have

an "image" string field that is a public URL for an image
a "fields" JSON object field that can contain other fields for the candidate

Examples

Here is a minimal example of a search endpoint output of the top 10 candidates for a query, where the search endpoint has 52 candidates matching the query:

{
  "total": 52,
  "candidates": [
    { "id": "1", "title": "Candidate 1" },
    { "id": "2", "title": "Candidate 2" },
    { "id": "3", "title": "Candidate 3" },
    { "id": "4", "title": "Candidate 4" },
    { "id": "5", "title": "Candidate 5" },
    { "id": "6", "title": "Candidate 6" },
    { "id": "7", "title": "Candidate 7" },
    { "id": "8", "title": "Candidate 8" },
    { "id": "9", "title": "Candidate 9" },
    { "id": "10", "title": "Candidate 10" }
  ]
}

Here is a richer example where each candidate also has an image and fields:

{
  "total": 190004,
  "candidates": [
    { "id": "1", "title": "Candidate 1", "image": "https://example.com/1.jpg", "fields": { "location": "New York", "interests": ["photography", "graphic design"] }},
    { "id": "2", "title": "Candidate 2", "image": "https://example.com/2.jpg", "fields": { "location": "San Francisco", "interests": ["surfing", "cycling"] }},
    { "id": "3", "title": "Candidate 3", "image": "https://example.com/3.jpg", "fields": { "location": "Chicago", "interests": ["cooking", "piano"] }},
    { "id": "4", "title": "Candidate 4", "image": "https://example.com/4.jpg", "fields": { "location": "Austin", "interests": ["hiking", "stand-up comedy"] }},
    { "id": "5", "title": "Candidate 5", "image": "https://example.com/5.jpg", "fields": { "location": "Seattle", "interests": ["coffee roasting", "indie games"] }},
    { "id": "6", "title": "Candidate 6", "image": "https://example.com/6.jpg", "fields": { "location": "Boston", "interests": ["history documentaries", "chess"] }},
    { "id": "7", "title": "Candidate 7", "image": "https://example.com/7.jpg", "fields": { "location": "Los Angeles", "interests": ["film editing", "street art"] }},
    { "id": "8", "title": "Candidate 8", "image": "https://example.com/8.jpg", "fields": { "location": "Denver", "interests": ["snowboarding", "rock climbing"] }},
    { "id": "9", "title": "Candidate 9", "image": "https://example.com/9.jpg", "fields": { "location": "Portland", "interests": ["brewery tours", "bookbinding"] }},
    { "id": "10", "title": "Candidate 10", "image": "https://example.com/10.jpg", "fields": { "location": "Miami", "interests": ["salsa dancing", "marine biology"] }}
  ]
}

JMESPath

A JMESPath expression can be specified to transform a JSON response from a search endpoint into the required JSON structure. The JMESPath expression parser is fully compliant with the JMESPath specification.

A JMESPath expression is used for endpoint types:

Built-in expressions

Releval comes with built-in default expressions for the supported endpoint types.

Tip

Copy the built-in default expression for your endpoint type, then change it to use more descriptive fields for "title" and "image".

Elasticsearch

A default JMESPath expression to work with all versions of Elasticsearch:

{
  total: (hits.total.value || hits.total),
  candidates: hits.hits[].{ 
    id: _id, 
    title: _id, 
    image: null, 
    fields: _source 
  }
}

Total is mapped to the hits total, whether using

an older version of Elasticsearch that does not include "relation"
the rest_total_hits_as_int request parameter, which returns the total as an integer
the default total object that does include "relation"

Candidates are mapped using

the _id field for the candidate id
the _id field for the candidate title. You should change this to use a better field.
null for the candidate image. You should change this to use a better field.
the _source field for the candidate fields.
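To make the mapping concrete, here is a plain JavaScript reading of what the built-in expression does. It is a sketch for illustration only: mapElasticsearchResponse is a hypothetical name, the sample responses are made up, and Releval itself evaluates the JMESPath expression rather than code like this.

```javascript
// Hypothetical helper: a plain JavaScript reading of the built-in expression.
function mapElasticsearchResponse(response) {
  const total = response.hits.total;
  return {
    // `??` is used instead of `||` because JavaScript treats 0 as falsy,
    // whereas JMESPath treats the number 0 as a true-like value.
    total: total?.value ?? total,
    candidates: response.hits.hits.map((hit) => ({
      id: hit._id,
      title: hit._id, // placeholder, as in the built-in expression
      image: null,
      fields: hit._source,
    })),
  };
}

// Total reported as an object that includes "relation".
const withRelation = mapElasticsearchResponse({
  hits: { total: { value: 0, relation: 'eq' }, hits: [] },
});

// Total reported as a plain integer.
const asInteger = mapElasticsearchResponse({
  hits: { total: 52, hits: [{ _id: 'a1', _source: { name: 'Widget' } }] },
});
```

Both response shapes end up with a numeric total, which is the point of the `hits.total.value || hits.total` fallback in the expression.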

OpenSearch

A default JMESPath expression to work with all versions of OpenSearch:

{
  total: (hits.total.value || hits.total),
  candidates: hits.hits[].{ 
    id: _id, 
    title: _id, 
    image: null, 
    fields: _source 
  }
}

Total is mapped to the hits total, whether using

an older version of OpenSearch that does not include "relation"
the rest_total_hits_as_int request parameter, which returns the total as an integer
the default total object that does include "relation"

Candidates are mapped using

the _id field for the candidate id
the _id field for the candidate title. You should change this to use a better field.
null for the candidate image. You should change this to use a better field.
the _source field for the candidate fields.

Solr

A default JMESPath expression to work with Solr:

{
  total: response.numFound,
  candidates: response.docs[].{
    id: id,
    title: id,
    image: null,
    fields: @
  }
}

Vespa

A default JMESPath expression to work with Vespa:

{
  total: root.fields.totalCount,
  candidates: root.children[].{
    id: id,
    title: id,
    image: null,
    fields: fields
  }
}

Examples

Whilst testing a search endpoint, it can be useful to emit the verbatim JSON response from the search endpoint. This can be achieved with the following candidates mapping:

A minimal example for an endpoint that returns an array of numeric ids like [1,2,3,4,5,6] is:

{ 
  total: length(@), 
  candidates: [].{ 
    id: to_string(@), 
    title: to_string(@)
  } 
}

which outputs

{
  "total": 6,
  "candidates": [
    {"id":"1","title":"1"},
    {"id":"2","title":"2"},
    {"id":"3","title":"3"},
    {"id":"4","title":"4"},
    {"id":"5","title":"5"},
    {"id":"6","title":"6"}
  ]
}

JavaScript

A JavaScript script can be specified to transform any HTTP response from a search endpoint into the required JSON structure. The script is executed in a browser context that has access to the Document Object Model (DOM) as well as any scripts loaded as part of the response. For example, if the search page loads jQuery as part of the response, your JavaScript script will have access to jQuery.

A JavaScript script is used for endpoint types:

Search Page

The return value of the JavaScript script must be a JSON string that conforms to the required JSON structure. The simplest way to do this is to build a JavaScript object then use JSON.stringify() to convert the object to a string.

The JavaScript script can be any valid script, with full support for ECMAScript 2022 (ECMA-262 13th Edition). Typically, it will be a function or arrow function that contains all the necessary logic inside its body to build a response, and return that response from the function.
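As a rough sketch of the "build an object, then stringify it" approach, the function below turns already-extracted rows into the JSON string. The name toJson and the shape of the rows are assumptions made for illustration; a real script would obtain the rows from the page.

```javascript
// Hypothetical helper: `rows` stands in for values extracted from a page.
const toJson = (rows, total) =>
  JSON.stringify({
    total,
    candidates: rows.map((row) => ({
      id: String(row.id), // ids must be strings in the required structure
      title: row.title,
      image: row.image, // left out of the output when undefined
    })),
  });

const json = toJson([{ id: 7, title: 'Candidate 7' }], 1);
```

JSON.stringify() omits object properties whose value is undefined, so optional fields such as image can simply be left undefined when the page has no value for them.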

Built-in script

Releval comes with a default JavaScript script to help get you started:

() => {
  const content = document.documentElement.outerHTML;

  // TODO: parse content and return a JSON string e.g.
  // return JSON.stringify({ 
  //   total: 1,
  //   candidates: [
  //     { 
  //       id: 'id',
  //       title: 'title',
  //       image: 'image url',
  //       fields: {
  //         additionalField: 'optional'
  //       } 
  //     }
  //   ] 
  // });

  return content;
}

Since the response for search pages can have any structure, the built-in script returns the document HTML by default, which must be modified to create the JSON structure required from the search page. A good way to iterate on a script is using the console in Chrome's DevTools with the search page loaded. Then, when you have a suitable script working in Chrome DevTools, copy it to the candidates mapping section of your page endpoint, and test it there.

Working with the shadow DOM

If the search page uses the shadow DOM to isolate web components, you may need to first replace the content of each element containing a shadow DOM with the HTML of that shadow DOM before you're able to extract values from the page. A node with a shadow DOM looks as follows in the Chrome DevTools Elements tab:

<div class="ng-star-inserted">
  <div class="ng-star-inserted">
    #shadow-root (open) <!-- <== this is shadow DOM for the element -->
  </div>
</div>

A function to replace shadow DOMs with their HTML is:

() => {
    // from https://docs.apify.com/academy/node-js/scraping-shadow-doms
    // licensed under Apache 2.0: https://github.com/apify/apify-docs/blob/master/LICENSE
    const getShadowDomHtml = (shadowRoot) => {
        let shadowHTML = '';
        for (const elem of shadowRoot.childNodes) {
            shadowHTML += elem.nodeValue || elem.outerHTML;
        }
        return shadowHTML;
    };
    
    const replaceShadowDomsWithHtml = (rootElement) => {
        for (const elem of rootElement.querySelectorAll('*')) {
            if (elem.shadowRoot) {
                replaceShadowDomsWithHtml(elem.shadowRoot);
                elem.innerHTML += getShadowDomHtml(elem.shadowRoot);
            }
        }
    };
    
    replaceShadowDomsWithHtml(document.body);
    
    // now work with the document and return a value.
}

Examples

This example uses the Big W search page results, with a search for makeup:

https://www.bigw.com.au/search?text=makeup

() => {
  const tiles = document.querySelectorAll('article[data-testid="product-tile"]');

  const candidates = Array.from(tiles).map(tile => {
    const link = tile.querySelector('a[data-product-code]');
    const id = link?.dataset.productCode ?? '';
    const title = tile.getAttribute('aria-label') ?? '';
    const image = tile.querySelector('.ProductImage img')?.src ?? null;
    const price = tile.querySelector(
      '.PriceSection [class*="VisuallyHidden"]'
    )?.textContent ?? null;
    const href = link?.getAttribute('href');

    return {
      id,
      title,
      image,
      fields: {
        price,
        url: href ? new URL(href, window.location.origin).toString() : null
      }
    };
  });

  return JSON.stringify({
    total: candidates.length,
    candidates
  });
}

The Big W product grid is rendered as <article data-testid="product-tile"> elements inside a <ul data-testid="product-grid">. Each tile carries the product code as a data-product-code attribute on its anchor, the title as the tile's aria-label, and the image as a normal <img> inside .ProductImage. The price is captured into fields for downstream use.

The Big W search page does not expose a total result count as a stable, easily targetable element, so this script reports total as the number of tiles rendered on the page. If the page you are scraping does have a selector for the result count, prefer that, since it lets Releval report the total number of matching candidates rather than only those rendered on the page.
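Where a page does show a result count, one possible approach is to read the text of that element and parse the number out of it, falling back to the number of rendered tiles. The sketch below assumes the count appears as text such as "Showing 1,234 results"; parseTotal and the selector in the comment are hypothetical and not taken from any real page.

```javascript
// Hypothetical helper: extracts the first integer from a piece of text,
// returning `fallback` when the text is missing or contains no digits.
function parseTotal(text, fallback) {
  // Match a run of digits, allowing thousands separators such as "1,234".
  const match = /\d[\d,]*/.exec(text ?? '');
  if (!match) {
    return fallback;
  }
  return Number.parseInt(match[0].replace(/,/g, ''), 10);
}

// Inside a candidates mapping script it might be used like this
// (the selector is an assumption, adjust it to the page being scraped):
//
//   const countText = document.querySelector('[data-testid="result-count"]')?.textContent;
//   const total = parseTotal(countText, candidates.length);
```

Passing the tile count as the fallback keeps the script returning a valid integer total even if the count element is later removed or renamed.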
