Structuring content for websites using a logical and well-organized information architecture makes it easier for crawlers and search engines to index relevant content for specific queries.
That’s because search is at its best when you organize online content into a hierarchy of web pages and structure each page into small bits of attributes, like title, description, sections, and paragraphs.
Luckily, a large majority of information on the web can be structured in this way. However, some content does not break down so easily, such as document search or purely textual content like blogs, technical documentation, and online news journals.
Document-based search may seem easy at first – it’s just matching the text of the content to the text of the query. But there are several pitfalls to web page structures that you need to be aware of and avoid. This article explains those pitfalls and proposes an index and web page structure suited to large document search.
While the suggestions in this article may apply to any website that offers an organized collection of textual content (blog, newsfeed), we discuss only large texts within the context of technical documentation, using our Laravel technical documentation implementation.
Before diving into the details of those pitfalls, let’s pull out the essentials.
Break up each page into small chunks, and save each chunk as a separate record. Incorporate the hierarchy of the website into each record.
Algolia does not use statistics or technology like NLP or TF-IDF to understand or interpret the text. Instead, it focuses on the characters of the text itself, creating relevance by textually matching the query with the content, and then applying a ranking formula, typo tolerance, proximity, and other smart ways to read the text and order the results.
Show only parts of the text in the search results, not a large portion. Otherwise, it’s too much to read or scan. Additionally, only allow two instances of the same page to show up in the results, to make room for other pages that match.
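A minimal sketch of that display rule, assuming the hits arrive already ranked and carry the `link` attribute used by the records shown later in this article. The function names and the 10-word snippet length are illustrative choices, not Algolia settings.

```python
from collections import defaultdict


def limit_hits_per_page(hits, max_per_page=2):
    """Keep at most `max_per_page` hits per page, preserving rank order."""
    seen = defaultdict(int)
    kept = []
    for hit in hits:
        # The page is the part of the link before the anchor.
        page = hit["link"].split("#", 1)[0]
        if seen[page] < max_per_page:
            seen[page] += 1
            kept.append(hit)
    return kept


def snippet(text, max_words=10):
    """Return only the first `max_words` words of a long text."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " ..."


hits = [
    {"link": "validation"},
    {"link": "validation#introduction"},
    {"link": "validation#custom-error-messages"},
    {"link": "cache#configuration"},
]
visible = limit_hits_per_page(hits)
```

Here the third `validation` hit is dropped, which leaves room for the `cache` page.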
That’s the general strategy for structuring large texts. Now for the implementation details and some pitfalls to avoid.
Take a look at DocSearch, the easiest and fastest way to add search to your documentation. It’s free!
Developer documentation often means lengthy pages filled with a lot of content. Most people try to index the complete page as one entry in their search engine. They later discover many edge cases and try to fix them through relevance tuning, but this quickly becomes an endless story because the issue comes from the indexing itself:
For example, the query "composer upgrade" will match the QuickStart page because the menu contains "Upgrade Guide" and the first paragraph contains the word "composer". This is not the kind of match that provides a good user experience.
Developers don’t like to change web pages too often, and they like to have long pages containing a lot of information. If such a page is indexed as one document, it will almost systematically trigger relevance issues. This is why we do not recommend using a standard web crawler, but rather a scraper that has access to the original content (most of the time available in Markdown).
For example, querying "cache incrementing value" will match the Query Builder page because it contains a paragraph with the word "cache" and another paragraph with the words "incrementing" and "value". This is a false positive because it is not relevant: the more text you have on a page, the more irrelevant results you will get.
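The effect can be reproduced with a toy matcher. This is a sketch only: the two paragraphs are invented stand-ins for the Query Builder page, and `matches_all` is a naive all-words check, not how any real engine matches.

```python
import re


def word_set(text):
    """Lowercase a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all(query, text):
    """True when every query word appears somewhere in the text."""
    return word_set(query) <= word_set(text)


# Hypothetical paragraphs standing in for one long documentation page.
page_chunks = [
    "Configure the cache before running queries.",
    "Incrementing a column value is done with the increment method.",
]
query = "cache incrementing value"

# Indexed as one record, the page matches even though no passage is about the query.
whole_page_matches = matches_all(query, " ".join(page_chunks))

# Indexed chunk by chunk, neither paragraph matches on its own.
chunk_matches = [matches_all(query, chunk) for chunk in page_chunks]
```

The whole-page record matches while each chunk, taken alone, does not. That is the false positive described above.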
In order to deliver the best user experience, it is key to open the page at the exact position of the match. This is very difficult if you only index one document per page. That’s why so many documentation searches just open the page at the top, leaving the user to scroll or use the browser’s search to jump to the right section. This is not always easy and is a waste of time.
Indexing the titles of your documentation page will probably answer common queries but this is not enough. The underlying paragraphs contain most of the words your users will search for. To obtain a great level of relevance, it’s important to index the whole content, body text included.
In this example, the text is required to correctly answer the "rememberForever" or "cache driver" queries.
With most search engines, relevance is the trickiest part of the configuration because it is often defined by a unique and complex formula that mixes a lot of information almost impossible to manage. Engineers often adjust the formula or add some bonus/malus scoring to improve the results on one specific query. Since they don’t have any non-regression tools, they cannot measure the real impact for all queries. The consequences can be significant.
In order to keep the ranking under control, it is key to split the ranking formula into several pieces that you understand and can tune independently. In practice, we are able to split the ranking formula with a Tie-Breaking algorithm.
Let’s imagine your ranking formula is split into two parts: textual relevance and use-case/business relevance.
You can then first apply the textual relevance and, only if two hits have the same value, move on to the use-case/business relevance (importance). This is the best way to ensure your end users always see relevant hits first (from a text point of view, matching their query words exactly) and then, when the textual relevance is equal, break the tie using the business relevance.
Since you’re not mixing the text and business relevance together (but applying them one after another), you can modify the business relevance without affecting how the text relevance works.
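In code, tie-breaking amounts to sorting on a tuple: the second element is only consulted when the first is equal. The sketch below assumes a single hypothetical `text_score` number standing in for the textual relevance (higher is better) and an `importance` value where smaller is better.

```python
# Hypothetical hits: `text_score` is a stand-in for the textual relevance.
hits = [
    {"name": "paragraph", "text_score": 2, "importance": 6},
    {"name": "page title", "text_score": 2, "importance": 0},
    {"name": "partial match", "text_score": 1, "importance": 0},
]

# Sort by textual relevance first; use business relevance only to break ties.
ranked = sorted(hits, key=lambda h: (-h["text_score"], h["importance"]))
ranked_names = [h["name"] for h in ranked]
```

The partial match stays last despite its good `importance`, because business relevance never overrides textual relevance. It only orders hits that are textually tied.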
In order to solve all those pitfalls, we split the page into many smaller chunks, indexed as separate records, by using the HTML structure of the page (H1, H2, H3, H4, P).
See the Validation page of Laravel’s documentation:
The first record generated will be the Validation page title. It will be transformed into the following JSON object. The “link” attribute only contains the last part of the URL, the first part being easily rebuilt with the tag:
{
  "h1": "Validation",
  "link": "validation",
  "importance": 0,
  "_tags": [
    "5.1"
  ],
  "objectID": "master-validation-13148717f8faa9037f37d28971dfc219"
}
Then, the first section of the page (the Introduction) will be turned into the following record. The link now contains an anchor, and the record keeps the title of the page:
{
  "h1": "Validation",
  "h2": "Introduction",
  "link": "validation#introduction",
  "importance": 1,
  "_tags": [
    "5.1"
  ],
  "objectID": "master-validation#introduction-eeafb566c2af34e739e2685efdb45524"
}
A paragraph of this page under an H3 section would be translated into the following record:
{
  "h1": "Validation",
  "h2": "Validation Quickstart",
  "h3": "Defining the Routes",
  "link": "validation#validation-quickstart",
  "content": "First, let's assume we have the following routes defined in our `app/Http/routes.php` file:",
  "importance": 6,
  "_tags": [
    "5.1"
  ],
  "objectID": "5.1-validation#validation-quickstart-380c9827712413dbe75b5db515cd3e59"
}
This approach fixes pitfalls #1 and #2. We have solved the problem by indexing each chunk of text as an independent record while keeping the title hierarchy in each record.
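The splitting step can be sketched as follows. This is an illustration under stated assumptions, not the scraper actually used: it reads Markdown rather than HTML, expects headings and paragraphs to be separated by blank lines, links each record to the anchor of its nearest heading, and leaves out the `importance` and `objectID` attributes. `build_records` and `slugify` are hypothetical helper names.

```python
import re

HEADING = re.compile(r"^(#{1,4})\s+(.*)$")


def slugify(title):
    """Turn a heading into an anchor: 'Validation Quickstart' -> 'validation-quickstart'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_records(markdown, page, version):
    """Split one Markdown page into one record per heading and per paragraph."""
    records = []
    hierarchy = {}  # heading level (1..4) -> current title at that level
    link = page
    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = block.strip()
        if not block:
            continue
        match = HEADING.match(block)
        if match:
            level, title = len(match.group(1)), match.group(2).strip()
            # A new heading invalidates every deeper heading seen so far.
            hierarchy = {k: v for k, v in hierarchy.items() if k < level}
            hierarchy[level] = title
            link = page if level == 1 else f"{page}#{slugify(title)}"
            record = {}
        else:
            record = {"content": " ".join(block.split())}
        # Every record carries the full title hierarchy above it.
        record.update({f"h{k}": v for k, v in sorted(hierarchy.items())})
        record["link"] = link
        record["_tags"] = [version]
        records.append(record)
    return records


page = """# Validation

## Validation Quickstart

### Defining the Routes

First, let's assume we have the following routes defined."""

records = build_records(page, "validation", "5.1")
```

Each paragraph ends up as its own small record that still knows which page and section it belongs to, so a match can open the page at the right anchor.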
Algolia is designed natively to use a Tie-Breaking algorithm, so that everyone can understand and tune the ranking. Pitfall #3 can now be easily resolved by applying the settings we recommend for a documentation search implementation.
Matching hits will now be sorted against six ranking criteria: the first five are related to text relevance and the last one is the custom business relevance.
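As a rough sketch, settings along those lines could look like the dictionary below. The setting names are Algolia index settings as commonly documented, but the exact values are an assumption reconstructed from the rest of this article (the attribute names come from the records above), so check them against the current Algolia documentation before use.

```python
# Assumed settings, reconstructed from the criteria described in this article.
settings = {
    # Attributes to search, most important first; "unordered" ignores the
    # position of the match inside the attribute.
    "searchableAttributes": [
        "unordered(h1)",
        "unordered(h2)",
        "unordered(h3)",
        "unordered(h4)",
        "unordered(content)",
    ],
    # Tie-breaking order: five textual criteria, then the business criterion.
    "ranking": ["words", "typo", "proximity", "attribute", "exact", "custom"],
    # Business relevance: smaller importance values rank first.
    "customRanking": ["asc(importance)"],
    # Retry with all query words optional when the strict query finds nothing.
    "removeWordsIfNoResults": "allOptional",
}
```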
First, we sort by the number of query words found in the records. We have decided to process the query with all words as mandatory (AND between query terms). If there are not enough matching words, we run the query again with all words as optional (OR between query terms). This process is configured with a single index setting and allows you to get the best of both worlds: AND reduces the number of false positives, while OR returns results even if the query is too narrow.
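The AND-then-OR behavior can be sketched in plain Python. This is a simplification: here the fallback fires only when the strict query returns nothing, matching is exact word equality, and the sample records are invented.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, records):
    """Require all query words first; fall back to optional words if nothing matches."""
    query_words = set(tokenize(query))
    scored = [(len(query_words & set(tokenize(r))), r) for r in records]

    # AND pass: every query word must be present.
    strict = [r for count, r in scored if count == len(query_words)]
    if strict:
        return strict

    # OR pass: any word may match; records matching more words come first.
    loose = [(count, r) for count, r in scored if count > 0]
    loose.sort(key=lambda pair: -pair[0])
    return [r for _, r in loose]


records = [
    "Cache configuration is stored in config/cache.php",
    "Validation rules",
    "Configuring the cache driver and validation",
]
strict_hits = search("cache configuration", records)
fallback_hits = search("cache validation rules", records)
```

The first query is satisfied by the AND pass alone. The second has no record containing all three words, so the OR pass returns partial matches, ordered by how many words they contain.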
If two records match with the same number of search terms, we use the number of typos as the differentiator (so we have exact matches first, then matches with 1 typo, then matches with 2 typos, …).
For example if the query is “validator”, the record that contains “validate” will match with some typos but will be retrieved after the record containing “validator”.
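To make the idea of counting typos concrete, the sketch below uses plain Levenshtein edit distance as a stand-in. That is an assumption for illustration: the article does not say how the engine counts typos, and a real engine also limits how many typos it tolerates.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


query = "validator"
candidates = ["validate", "validator"]

# Fewer typos first: the exact word comes before the approximate one.
ranked = sorted(candidates, key=lambda word: edit_distance(query, word))
```

With this measure, "validator" is at distance 0 from the query and "validate" at distance 2, so the exact word is ranked first.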
When two records are identical for the words and typos ranking criteria, we move to the next criterion, which compares the proximity of the query terms in the record. It basically counts the number of words between them until a limit is reached (after a certain point they are considered “too far”).
For example, the "cache configuration" query will have a proximity of 1 when it matches the sentence "The cache configuration is ..." and a proximity of 2 when it matches the sentence "... in the config/cache.php configuration file". We sort this value in increasing order, as we prefer records that contain the query terms close together.
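The two proximity values in this example can be reproduced with a small sketch for two-word queries. The tokenizer (splitting on anything that is not a letter or digit) and the `cap` value are assumptions made for illustration.

```python
import re


def proximity(term_a, term_b, text, cap=8):
    """Smallest distance, in word positions, between two terms; None if one is absent.

    `cap` is a hypothetical limit beyond which terms count as "too far".
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, token in enumerate(tokens) if token == term_a]
    positions_b = [i for i, token in enumerate(tokens) if token == term_b]
    if not positions_a or not positions_b:
        return None
    closest = min(abs(i - j) for i in positions_a for j in positions_b)
    return min(closest, cap)


adjacent = proximity("cache", "configuration", "The cache configuration is ...")
separated = proximity("cache", "configuration",
                      "... in the config/cache.php configuration file")
```

In the second sentence, "config/cache.php" splits into three words, so "php" sits between "cache" and "configuration" and the distance grows to 2.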
If two records are identical for the first three ranking criteria, we use the name of the matched attribute to determine which hit needs to be retrieved first. In the index settings, just list the attributes you want to search in order of importance.
That means that a match inside h1 ranks better than one in h2, which ranks better than one in h3, and so on. You may also notice an “unordered” flag on each attribute. It means that the position of the match inside the attribute is not considered in the ranking. That’s why the query "cache" will match with the same attribute score for a record that contains "[Cache Configuration]" or "[Obtaining a cache instance]" in the same attribute.
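A minimal sketch of the attribute criterion, assuming the attribute order h1, h2, h3, h4, content used by the records in this article. `attribute_rank` is a hypothetical helper. It returns the index of the first attribute containing the word and ignores where the word sits inside that attribute, which is what the “unordered” flag expresses.

```python
import re

# Assumed attribute order, most important first.
ATTRIBUTES = ["h1", "h2", "h3", "h4", "content"]


def attribute_rank(query_word, record):
    """Index of the most important attribute containing the word, or None."""
    for rank, name in enumerate(ATTRIBUTES):
        tokens = re.findall(r"[a-z0-9]+", record.get(name, "").lower())
        if query_word.lower() in tokens:
            return rank
    return None


first_word = attribute_rank("cache", {"h1": "Cache", "h2": "Cache Configuration"})
at_start = attribute_rank("cache", {"h1": "Helpers", "h2": "Cache Configuration"})
in_middle = attribute_rank("cache", {"h1": "Helpers", "h2": "Obtaining a cache instance"})
```

The last two records get the same rank because both match in h2, whether "cache" is the first word or the third.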
If two records are identical for the first four criteria, we use the number of query terms that match exactly in the record to determine which hit needs to be retrieved first. Because we’re returning results after each keystroke, the last query term will mostly match as a prefix (it will match the beginning of words). This criterion is used to rank an exact match before a prefix match.
For example the query “valid” will retrieve the records containing “valid” before the ones containing “validation”.
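The exact criterion can be sketched by counting how many query words appear as whole words, as opposed to mere prefixes. The helper names and the sample records are invented for illustration.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_as_prefix(query_word, text):
    """True when some word of the text starts with the query word."""
    return any(token.startswith(query_word) for token in tokenize(text))


def exact_count(query, text):
    """Number of query words present in the text as whole words."""
    tokens = set(tokenize(text))
    return sum(1 for word in tokenize(query) if word in tokens)


records = ["Validation rules", "A valid email address"]
query = "valid"

# Both records match "valid" as a prefix; the exact match is ranked first.
candidates = [r for r in records if matches_as_prefix(query, r)]
ranked = sorted(candidates, key=lambda r: -exact_count(query, r))
```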
There is still one important thing missing: your use-case/business criterion. If all previous criteria are identical for two records, we use the custom ranking which is defined by the user.
For example, searching for "Validation" will match the two following records using the most important “h1” attribute. That results in a tie on all previous criteria, but we want to retrieve the page title first because the other record is a paragraph. This is where the "importance" attribute, added to each record, comes into play.
{
  "h1": "Validation",
  "link": "validation",
  "importance": 0,
  "_tags": [
    "5.1"
  ],
  "objectID": "master-validation-13148717f8faa9037f37d28971dfc219"
}

{
  "h1": "Validation",
  "h2": "Working With Error Messages",
  "h3": "Custom Error Messages",
  "link": "validation#custom-error-messages",
  "content": "If needed, you may use custom error messages for validation instead of the defaults. There are several ways to specify custom messages. First, you may pass the custom messages as the third argument to the `Validator::make` method:",
  "importance": 6,
  "_tags": [
    "5.1"
  ],
  "objectID": "5.1-validation#custom-error-messages-380c9827712413dbe75b5db515cd3e59"
}
The “importance” value is an integer that goes from 0 (page title) to 7 (text section under h4), which we use in the custom ranking in ascending order (the smaller, the better).
The complete scale of importance is the following:
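One scale consistent with the records above (0 for the page title, 1 for an h2, 6 for a paragraph under an h3, 7 for text under an h4) can be sketched as below. The intermediate values are an inference from those four data points, not something the records confirm, and `importance` is a hypothetical helper name.

```python
def importance(heading_level, is_text):
    """Importance of a record; smaller is better.

    heading_level: level (1 to 4) of the record's deepest heading.
    is_text: True for a paragraph under that heading, False for the heading itself.
    Assumed scale: headings h1..h4 map to 0..3, text under h1..h4 maps to 4..7.
    """
    if not 1 <= heading_level <= 4:
        raise ValueError("heading_level must be between 1 and 4")
    return (heading_level - 1) + (4 if is_text else 0)


scale = {
    "h1": importance(1, False),
    "h2": importance(2, False),
    "text under h3": importance(3, True),
    "text under h4": importance(4, True),
}
```

With this scale, any heading outranks any paragraph when the textual criteria are tied, and shallower headings outrank deeper ones.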
We have successfully applied this recipe to several technical documentation sites, such as Laravel and Bootstrap, among many others. The way results are displayed differs, but we use exactly the same approach and the same API.
One of our missions is to help all developers navigate technical documentation. If you are working on an open source project, we’d be happy to help you: here’s how to get started with DocSearch. We will provide you with a free Algolia account and with any support needed to make your implementation a best-in-class reference. Drop us a note!