A search engine needs to “process” the language in a search bar before it can execute a query. The process could be as simple as comparing the query exactly as written to the content in the index. But classic keyword search is more advanced than that, because it involves tokenizing and normalizing the query into smaller pieces, that is, words and keywords. This process can be easy (where the words are separated by spaces) or more complex (as in languages such as Chinese and Japanese, which do not put spaces between words, so the machine needs to recognize where each word begins and ends).
Once the query is broken down into smaller pieces, the search engine can correct misspellings and typos, apply synonyms, reduce the words further into roots, manage multiple languages, and more – all of which enable the user to type a more “natural” query.
On average, people type single-word or short-phrase queries to describe the items they are searching for. That is, they use keywords, not whole sentences or questions. (Though that’s changing, due to voice technology and the success of Google’s question and answer results.)
This kind of keyword search, both the simple and more advanced versions of it, has been around since the beginning of search. The more natural it is, the more advanced the techniques become. Search engines need to structure incoming queries before they can look up results in the search index. This pre-processing technology falls into what we call Natural Language Processing, or NLP, which is an umbrella term for any technology that enables computers to understand human language, whether written or spoken.
Natural language processing (“NLP”) takes text and transforms it into pieces that are easier for computers to use. Some common NLP tasks are removing stop words, segmenting words, or splitting compound words. NLP can also identify parts of speech, or important entities within text.
— Dustin Coates, Product and GTM Manager at Algolia
We’ve written quite a lot about natural language processing (NLP) here at Algolia. We’ve defined NLP, compared NLP vs NLU, and described some popular NLP/NLU applications. Additionally, our engineers have explained how our engine processes language and handles multilingual search. In this article, we’ll look at how NLP drives keyword search, which is an essential piece of our hybrid search solution that also includes AI/ML-based vector embeddings and hashing.
To understand the nexus between keywords and NLP, it’s important to start off by diving deep into keyword search.
At its most basic, a keyword search engine compares the text of a query to the text of each record in a search index. Every record that matches (whether exact or similar) is returned by the search engine. Matching, as suggested, can be simple or advanced.
We use keywords to describe clothing, movies, toys, cars, and other objects. Most keyword search engines rely on structured data, where the objects in the index are clearly described with single words or simple phrases.
For example, a flower can be structured using tags, or “keys”, to form key-value pairs. The values (a large, red, summer, flower, with four petals) can be paired with their keys (size, color, season, type of object, and number of petals). The flower can also sell at a “price” of “4.99”.
We can represent this structure of keys and values as follows:
{ "name": "Meadow Beauty", "size": "large", "color": "red", "season": "summer", "type of object": "flower", "number of petals": "4", "price": "4.99", "description": "Coming from the Rhexia family, the Meadow Beauty is a wildflower." }
All these key-value pairs make up a record that can be stored in a search index, so that a query like “red flower” will return all flower records where “type = flower” and “color = red”.
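As a rough illustration of this kind of matching, here is a minimal Python sketch. The sample records and the `keyword_search` function are hypothetical, and the sketch assumes the simplest possible rule: every query word must equal one of a record's attribute values. A real engine does far more than this.

```python
from typing import Dict, List

# Hypothetical sample records, modelled on the flower example above.
RECORDS: List[Dict[str, str]] = [
    {"name": "Meadow Beauty", "color": "red", "type of object": "flower"},
    {"name": "Bluebell", "color": "blue", "type of object": "flower"},
    {"name": "Fire Truck", "color": "red", "type of object": "toy"},
]


def keyword_search(query: str, records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the records in which every query word equals some attribute value."""
    words = query.lower().split()
    results = []
    for record in records:
        values = {value.lower() for value in record.values()}
        if all(word in values for word in words):
            results.append(record)
    return results


matches = keyword_search("red flower", RECORDS)
print([record["name"] for record in matches])  # ['Meadow Beauty']
```

Only the record that is both red and a flower comes back; the blue flower and the red toy each fail one of the two words.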
Additionally, a partial query like “red” can find flowers with “color = reddish-green”, because “red” is in “reddish”.
Many keyword search engines use manually-defined synonyms. Thus, a “blue” query can return “azure” flowers, if you explicitly tell the engine that “blue” and “azure” are synonyms.
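A manually defined synonym list can be sketched in a few lines of Python. The `SYNONYMS` groups and the `expand` and `words_match` helpers below are hypothetical names chosen for illustration, not part of any particular engine's API.

```python
from typing import List, Set

# Hypothetical synonym groups that an editor has declared by hand.
SYNONYMS: List[Set[str]] = [{"blue", "azure"}, {"4", "four"}]


def expand(word: str) -> Set[str]:
    """Return the word together with every synonym declared for it."""
    word = word.lower()
    for group in SYNONYMS:
        if word in group:
            return set(group)
    return {word}


def words_match(query_word: str, record_word: str) -> bool:
    """True if the record word is the query word or one of its synonyms."""
    return record_word.lower() in expand(query_word)


print(words_match("blue", "Azure"))  # True
print(words_match("blue", "red"))    # False
```

Nothing matches unless someone has listed it, which is exactly the limitation of manual synonyms.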
Other techniques correct misspellings and typos. The query “4 pedels” contains a typo; a typo-tolerant engine will still return flowers with the correctly spelled “petals”. The engine can also treat “4” as a synonym for “four”, and it can match the plural “petals” to the singular “petal”, because both share the root “petal”.
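One common way to measure how far a typed word is from an indexed word is the Levenshtein edit distance. The sketch below uses it as a stand-in for typo tolerance; the threshold of two edits in `is_typo_match` is an assumption made for this example, not a documented engine setting.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def is_typo_match(query_word: str, indexed_word: str, max_typos: int = 2) -> bool:
    """Accept the indexed word if it is within max_typos edits of the query word."""
    return edit_distance(query_word.lower(), indexed_word.lower()) <= max_typos


print(edit_distance("pedels", "petals"))  # 2
print(is_typo_match("pedels", "petals"))  # True
```

“pedels” needs two substitutions (d to t, e to a) to become “petals”, so it passes with a two-typo budget and would fail with a budget of one.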
Partial searches, like “4 pe”, can match “four petals” because most keyword search engines allow for prefix searching, which enables the important as-you-type feature, where a user can see search results or query suggestions while still typing.
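A simple way to picture prefix searching: treat every query word except the last as complete, and let the last one match as a prefix. The function name below is hypothetical, and the sketch leaves out synonyms, so it matches “four pe” but would need the “4” to “four” synonym to match “4 pe”.

```python
def matches_as_you_type(query: str, text: str) -> bool:
    """Earlier query words must match whole words; the last may be a prefix."""
    query_words = query.lower().split()
    text_words = text.lower().split()
    if not query_words:
        return False
    *complete, last = query_words
    return (
        all(word in text_words for word in complete)
        and any(word.startswith(last) for word in text_words)
    )


print(matches_as_you_type("four pe", "four petals"))  # True
print(matches_as_you_type("fo petals", "four petals"))  # False: "fo" is not the last word
```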
One more example (among many others): transliteration. Transliteration maps the letters or sounds of one language to the letters or sounds of another language. For example, transliteration enables a user to type in Latin letters (e.g. a, b, c etc.) to search for Russian Cyrillic characters, or type in Japanese Hiragana to search for Katakana.
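The Hiragana-to-Katakana case is easy to sketch, because in Unicode the two syllabaries are laid out in parallel blocks: each Hiragana character from U+3041 to U+3096 has its Katakana counterpart exactly 0x60 code points higher. The function name is hypothetical, and Latin-to-Cyrillic transliteration would need a real mapping table instead of a fixed offset.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KATAKANA_OFFSET = 0x60  # distance between the parallel Unicode blocks


def hiragana_to_katakana(text: str) -> str:
    """Shift Hiragana characters into the Katakana block; leave the rest alone."""
    return "".join(
        chr(ord(ch) + KATAKANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


print(hiragana_to_katakana("すし"))  # スシ
```

Applying the same conversion to both the query and the indexed text lets a query typed in one script find records written in the other.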
A keyword search engine uses these language-processing techniques to create great relevance and ranking – the twin goals of a great search solution.
Relevance makes sure that all records that match a query are found:
Relevance relies on an intelligent matching that takes into account typo tolerance, partial word matching, spatial distance between matching words, the number of attributes that match, synonyms and Rules, natural language characteristics like stop words and plurals, geolocation, and many other intuitive aspects of what people would expect from search, especially in the Google era.
Ranking puts the records in order:
Ranking is how you order the records that a search returns so that the most accurate results appear earliest (on the first couple of pages), while less accurate results appear later.
To accomplish the best relevance and ranking, engineers need to design the best algorithm and data structure that will enable the best textual comparisons.
For relevance, there are many data structures and search algorithms available to choose from. However, keyword search is best served by using an inverted index and applying character-based comparisons. This means the search engine pre-processes the data so that each sequence of characters (letters, numbers, and most characters on your keyboard) points to the one or more records that contain it. For example, when an engine compares “aardvark” to a search index, it will execute an inverted lookup, attempting to find all records that contain the prefixes a-aa-aar-aard-aardv-aardva-aardvar-aardvark.
This inverted index can be adapted to allow for typos and other keyword search techniques.
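To make the idea concrete, here is a minimal sketch of a prefix-based inverted index in Python. It assumes the simplest design, where every prefix of every word becomes a key pointing to the set of record IDs that contain it. The function names and sample records are hypothetical, and a production engine would use a far more compact structure.

```python
from collections import defaultdict
from typing import Dict, Set


def build_prefix_index(records: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every prefix of every word to the IDs of the records containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            for end in range(1, len(word) + 1):
                index[word[:end]].add(record_id)
    return index


def lookup(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the IDs of records that contain a word starting with the query."""
    return index.get(query.lower(), set())


records = {1: "Aardvark facts", 2: "Aardwolf habitat", 3: "Zebra"}
index = build_prefix_index(records)
print(lookup(index, "aard"))   # {1, 2}
print(lookup(index, "aardv"))  # {1}
```

The work happens once, at indexing time; each keystroke at query time is then a single dictionary lookup.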
Once the records are found, the final task is for the engine to rank the results, ensuring that the best matches show up at the top of the list. Again, there are different techniques, for example, statistical ranking based on the frequency of the words matched. The one we chose relies on a tie-breaking algorithm, which ranks records by applying a top-down tie-breaking, or testing, strategy similar to an elimination game.
A good example is to look at the first and second tests: typos and geolocation. A query for “theater” in the US will return records with either “theater” or “theatre”, where the UK spelling is treated as a typo. Since the tie-breaking algorithm privileges the best matches, the records that match exactly (no typos = “theater”) will be selected as the best matches, and the records with the UK spelling will be sent to the bottom of the list.
Next, the best records (with “theater”) will be subjected to the second test, geolocation: theaters located near the user (similar coordinates) will be selected as the best. The tie-breaking algorithm then continues to apply the next six tests (exact matches go before partial matches, distance between words in a two-word query, etc.) until all records are ranked.
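The first two tests can be sketched as a sort on a tuple of criteria: Python compares tuples element by element, so a later criterion is consulted only when the earlier ones tie. The `Hit` class, its fields, and the sample data are hypothetical. The sketch covers only typos and distance, and it is an illustration of the tie-breaking idea rather than the engine's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    name: str
    typos: int          # first test: fewer typos is better
    distance_km: float  # second test: nearer to the user is better


def rank(hits: List[Hit]) -> List[Hit]:
    """Order hits by typo count, breaking ties by distance from the user."""
    return sorted(hits, key=lambda hit: (hit.typos, hit.distance_km))


hits = [
    Hit("Grand Theatre", typos=1, distance_km=0.5),
    Hit("Main Street Theater", typos=0, distance_km=12.0),
    Hit("Corner Theater", typos=0, distance_km=2.0),
]
print([hit.name for hit in rank(hits)])
# ['Corner Theater', 'Main Street Theater', 'Grand Theatre']
```

The nearest venue of all still ranks last, because distance only breaks ties among records that did equally well on the typo test.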
A keyword-based relevance and ranking algorithm is built on natural language processing (NLP). NLP manages the complexity of language: for example, singular vs plural terms, verb inflections (present vs past tense, present participle, etc.), agglutinative or compound languages, and so forth. We’ll look at some NLP techniques below, but first we’ll define two NLP basics: tokenization and normalization.
Tokenization breaks a larger text into smaller pieces. It may break a document into paragraphs, paragraphs into sentences, and sentences into “tokens.” Tokenization can be very difficult. For example, even something as simple as identifying a sentence in a paragraph is tricky. Is “Michael J. Fox” two separate sentences because it contains a period followed by a word with a capital letter?
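The “Michael J. Fox” problem is easy to reproduce. The sketch below uses a deliberately naive rule (split after a period, question mark, or exclamation mark when the next word is capitalized) to show how it goes wrong, followed by a simple word tokenizer. Both function names are hypothetical, and neither is meant as a production tokenizer.

```python
import re
from typing import List


def naive_sentences(text: str) -> List[str]:
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def word_tokens(text: str) -> List[str]:
    """Lowercase the text and extract runs of letters, digits, and underscores."""
    return re.findall(r"\w+", text.lower())


print(naive_sentences("Michael J. Fox starred in the film. It was a hit."))
# ['Michael J.', 'Fox starred in the film.', 'It was a hit.']
print(word_tokens("Red flower, four petals"))
# ['red', 'flower', 'four', 'petals']
```

The naive rule finds three sentences where a reader sees two, which is why real tokenizers carry lists of abbreviations, initials, and other exceptions.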
A well-known low-level step in NLP is normalization. The goal of this step is to standardize every query, so that matching depends on the letters themselves rather than on the way they were typed. So instead of treating uppercase “Michael” differently from lowercase “michael”, we normalize both to “michael”. We do similar normalizations with accents or special characters.
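A minimal version of this step can be written with Python's standard `unicodedata` module: decompose each character, drop the combining accent marks, and fold the case. The `normalize` function name is hypothetical, and real engines typically apply additional, language-specific rules on top of this.

```python
import unicodedata


def normalize(text: str) -> str:
    """Strip accents and fold case so that spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


print(normalize("Michael"))  # michael
print(normalize("CAFÉ"))     # cafe
```

The same function has to be applied to both the indexed records and the incoming query, otherwise the two sides will not line up.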
Here’s a short list of some standard NLP techniques.
Keyword search technology, laced with a more AI-driven technology, including NLU (natural language understanding) and vector-based semantic search, can take search to a new level.
Here are only a few examples:
Here’s a sample list of semantic-based NLP techniques:
Human language is filled with ambiguities that make it difficult to write software that accurately determines the intended meaning of text or voice data. Homonyms, homophones, sarcasm, idioms, metaphors, grammar and usage exceptions, variations in sentence structure — these are just a few of the irregularities of human language that took humans years to learn, but that programmers must teach natural language-driven applications to recognize and understand accurately from the start, if those applications are going to be useful.
To address the most complex aspects of language, NLP has changed with the times. Central to this change is artificial intelligence, in particular machine learning techniques like vector embeddings and large language models (LLMs). In the area of translation and natural language understanding (NLU), machine learning has vastly simplified and improved the search process. Vector spaces have reduced the need to manually create synonyms. In this article, we focused on the purposes and how-to of keyword search, and on certain essential NLP techniques. NLP continues to evolve, to empower the query-level functionality of keyword search – which will remain the go-to method to handle the simple queries that we perform on a daily basis.