10 Must-know Terms and Components for Search Engine Development

Article by Limarc Ambalina | June 19, 2019

Almost all online traffic is dictated by search engines, and your search engine is often the first interaction a user has with your company. The user experience of search engines is quite simple: type what you’re looking for, hit enter, and choose from a list of results. However, many factors and intricate parts go into making a search engine find and rank results from thousands or even millions of pages in a matter of seconds.

For those looking to learn more about search engines or search relevance, below are 10 must-know search engine terms and components.

 

1. Search Relevance

What is search relevance?

You may hear this term a lot in machine learning, but search relevance is simply a measure of how closely the results of a search engine query match what the user was looking for. In other words, it refers to the relevancy of the fetched search results. If the search engine results pages match the user’s query accurately, the search engine has good search relevancy. If the results are completely off and not what the user was looking for, the search engine has poor search relevancy.

 

2. Recall

What is recall in search engines?

When you type a query into a search engine, the algorithm’s job is to return all relevant items matching your query. Recall is the number of relevant items that were actually returned divided by the total number of relevant items that exist. For example, imagine you type “coffee grinder” into a search engine on an ecommerce site. If the search engine returns 9 coffee grinders, but the site actually has 10 coffee grinders for sale, the search results have 90% recall.

The easiest way to achieve 100% recall would be to simply have the algorithm return everything. That way, you’re guaranteed to return 100% of the relevant items. However, this would result in low precision and an overall poor search engine.
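As a rough illustration of the definition above, the following Python sketch computes recall for the coffee grinder example and for the "return everything" strategy. The catalog contents and item names (grinder-1, mug-1, and so on) are made up for the example and are not from any real site.

```python
def recall(returned: set, relevant: set) -> float:
    """Share of all existing relevant items that the search actually returned."""
    if not relevant:
        raise ValueError("recall is undefined when no relevant items exist")
    return len(returned & relevant) / len(relevant)


# Hypothetical catalog: 10 coffee grinders plus 90 unrelated products.
grinders = {f"grinder-{i}" for i in range(1, 11)}
catalog = grinders | {f"mug-{i}" for i in range(1, 91)}

# The search from the example finds 9 of the 10 grinders.
returned = {f"grinder-{i}" for i in range(1, 10)}
example_recall = recall(returned, grinders)
print(f"{example_recall:.0%}")  # 90%

# Returning the whole catalog reaches 100% recall, but 90 of the
# 100 results are irrelevant, which is why this is a poor strategy.
everything_recall = recall(catalog, grinders)
print(f"{everything_recall:.0%}")  # 100%
```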

 

3. Precision

What is precision in search engines?

Whereas recall is the number of returned relevant items divided by the number of existing relevant items, precision is the number of returned relevant items divided by the total number of items (relevant or not) returned.

Going back to the coffee grinder example, let’s say you search for coffee grinders and the search engine returns 20 items. However, of those 20 items, only 9 were actually coffee grinders and the remaining 11 had nothing to do with coffee grinders. Here, the precision is 9 over 20, which is 45%.
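The same calculation can be sketched in a few lines of Python. The function name and argument names are illustrative only.

```python
def precision(relevant_returned: int, total_returned: int) -> float:
    """Share of the returned items that are actually relevant."""
    if total_returned == 0:
        raise ValueError("precision is undefined when nothing was returned")
    return relevant_returned / total_returned


# Coffee grinder example: 20 results, of which 9 are coffee grinders.
coffee_precision = precision(relevant_returned=9, total_returned=20)
print(f"{coffee_precision:.0%}")  # 45%
```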

 

4. The F-measure

What is the F-measure in search engines?


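The F-measure, also known as the F-score or F1 score, is one way to judge or score search relevance. While there are multiple variants of this, one common and simple formula for the F-measure is two divided by the sum of one over precision and one over recall: F1 = 2 / (1/precision + 1/recall). This is the harmonic mean of precision and recall, so a search engine only scores well if both values are high.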
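To make the formula concrete, here is a minimal Python sketch that plugs in the numbers from the coffee grinder examples (45% precision, 90% recall). Returning 0 when either input is 0 is a common convention assumed here, not something the formula itself dictates.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2 / (1/P + 1/R)."""
    if precision == 0 or recall == 0:
        # Assumed convention: avoid dividing by zero and report the worst score.
        return 0.0
    return 2 / (1 / precision + 1 / recall)


coffee_f1 = f1_score(precision=0.45, recall=0.90)
print(round(coffee_f1, 2))  # 0.6
```

Note that the result (0.6) sits closer to the weaker of the two inputs than a plain average (0.675) would, which is the point of using a harmonic mean.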

 

5. Part of Speech (POS) Tagging

What is part-of-speech tagging in search engines?

One important element of search engines is the ability to properly analyze text. Part-of-speech tagging, also known as POS tagging, is one factor that goes into text analysis. Search algorithms need to properly analyze the words in a query to understand what the user is looking for. POS tagging is the process of labeling different words or terms in a search query based on their function. The machine must be trained to recognize the different parts of speech (e.g., nouns and adjectives) and identify the main subject of the query and its modifiers.

[Figure: POS tagging of the example query “red water bottle with straw”]

In the above example, there are a variety of ways an untrained machine could mishandle the query. For instance, the user doesn’t want a bottle containing red water; they want a water bottle that is red. After training your search engine with enough accurate data, the machine will be able to identify content words and function words correctly. Good search engines can respond to queries written in a variety of ways, grammatically accurate or not, for example: “water bottle with straw red” or “water bottle red with straw”.
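The idea can be sketched with a toy tagger in Python, using the query “red water bottle with straw”. Everything here is an assumption made for illustration: the hand-written lexicon, the tag names, and the rule that nouns before the first preposition form the main subject. A real search engine would rely on a tagger trained on labeled data rather than a lookup table.

```python
# Hypothetical hand-written lexicon standing in for a trained POS tagger.
TOY_LEXICON = {
    "red": "ADJ",
    "water": "NOUN",
    "bottle": "NOUN",
    "with": "ADP",  # adposition (preposition)
    "straw": "NOUN",
}


def tag_query(query: str) -> list[tuple[str, str]]:
    """Label each token with its part of speech, or UNK if it is not in the lexicon."""
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in query.lower().split()]


def split_subject_and_modifiers(tagged: list[tuple[str, str]]) -> tuple[list[str], list[str]]:
    """Toy rule: nouns before the first preposition are the main subject;
    adjectives and anything after the preposition are modifiers."""
    subject, modifiers = [], []
    seen_preposition = False
    for token, tag in tagged:
        if tag == "ADP":
            seen_preposition = True
        elif tag == "NOUN" and not seen_preposition:
            subject.append(token)
        else:
            modifiers.append(token)
    return subject, modifiers


tagged = tag_query("red water bottle with straw")
subject, modifiers = split_subject_and_modifiers(tagged)
print(subject)    # ['water', 'bottle']
print(modifiers)  # ['red', 'straw']
```

The same toy rule also handles the less grammatical variant “water bottle red with straw”, but it would fail on many real queries, which is why large amounts of accurately labeled training data matter.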

Training your search engine to accurately identify different words and their functions requires a large amount of data. This data must be accurately labeled by trained human annotators. There are many reasons why human intervention is necessary for search engine training, but perhaps the most important is the subjective nature of language when it comes to search queries.

The quality of your search engine is heavily reliant on the accuracy of the data you feed it. Building a search engine with erroneous data is like trying to teach with an outdated textbook: your students may have learned something, but what they have learned won’t match reality. Likewise, if a search engine is trained with improperly tagged data, the results it fetches won’t match what the user was looking for.

 

6. Term Weighting

What is term weighting in search engines?

Term weighting is the process of assigning each term or word a numerical value of importance which is then reflected in the fetched results of the search query. For instance, nouns in a search query would be given a heavier weight than adjectives, or words preceding “with” would be given a heavier weight than the words following it.

In the previous query example, “red water bottle with straw,” the most important part of the query is “water bottle”. In the case where there aren’t red water bottles, you don’t want the search engine to return results of red straws.

While that example was simple, term weighting can become quite complicated. If there are no red water bottles with a straw, should the search engine prioritize results of red water bottles without straws or non-red water bottles with straws? This balancing act of what should take priority is the key to making a strong search engine.
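A small Python sketch can show this balancing act. The weights and product titles below are invented for the example; choosing real weights is exactly the tuning problem described above.

```python
# Hypothetical weights: the main subject ("water bottle") outweighs its modifiers.
QUERY_WEIGHTS = {"water": 3.0, "bottle": 3.0, "red": 1.0, "straw": 1.0}


def weighted_score(title: str, weights: dict[str, float]) -> float:
    """Sum the weights of the query terms that occur in the product title."""
    tokens = set(title.lower().split())
    return sum(weight for term, weight in weights.items() if term in tokens)


products = [
    "Red straw",
    "Blue water bottle",
    "Red water bottle",
    "Blue water bottle with straw",
]
scores = {title: weighted_score(title, QUERY_WEIGHTS) for title in products}
for title, score in scores.items():
    print(f"{score:4.1f}  {title}")
```

With these weights, “Red straw” scores 2.0 and falls below every water bottle, as intended. “Red water bottle” and “Blue water bottle with straw” tie at 7.0, so deciding which should rank first means deliberately giving “red” or “straw” the larger weight.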

 

7. Term Frequency

What is term frequency in search engines?

The ranking of search results relies on a variety of elements, one of the most obvious being term frequency. Simply put, term frequency is the number of times a term in the search query appears in the fetched document or page. For example, if you search “best romance film”, the page with the most occurrences of those three individual words has the highest term frequency (see table below).

Web Page | Term Count                        | Term Frequency | Rank
A        | Best – 20, Romance – 10, Film – 0 | 30             | 1st
B        | Best – 5, Romance – 13, Film – 10 | 28             | 2nd

However, as illustrated in the table above, ranking results on term frequency alone can lead to issues. Page A (ranked first) could be about the best romance books, poems, shows, but not necessarily romance films. Page B, which is likely more relevant, is ranked second. This could be due to a variety of reasons including differences in total word count, but one way to weed out irrelevant pages is to apply inverse document frequency.
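As a minimal sketch of term-frequency scoring, the Python below counts query terms in a made-up sample sentence and then ranks Pages A and B using the counts from the table above. The tokenizer (lowercase letters and digits only) and the sample sentence are assumptions for illustration.

```python
import re
from collections import Counter


def term_counts(query: str, document: str) -> dict[str, int]:
    """Count how often each query term occurs in the document (case-insensitive)."""
    words = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    return {term: words[term] for term in query.lower().split()}


def term_frequency_score(counts: dict[str, int]) -> int:
    """Naive score: total occurrences of all query terms."""
    return sum(counts.values())


sample_page = "The best romance film of the year is also the best date-night film."
counts = term_counts("best romance film", sample_page)
print(counts)                        # {'best': 2, 'romance': 1, 'film': 2}
print(term_frequency_score(counts))  # 5

# Term counts copied from the table for Pages A and B.
table_counts = {
    "A": {"best": 20, "romance": 10, "film": 0},
    "B": {"best": 5, "romance": 13, "film": 10},
}
ranking = sorted(
    table_counts,
    key=lambda page: term_frequency_score(table_counts[page]),
    reverse=True,
)
print(ranking)  # ['A', 'B']
```

Page A wins with 30 against 28 even though it never mentions “film”, which is the weakness of ranking on term frequency alone.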

 

8. Inverse Document Frequency

What is inverse document frequency in search engines?

While term frequency is a good way to judge how related a page or document is to the query, some words in the query should carry less weight. If you search for “best romance film”, the word “best” should carry less weight because there are millions of pages that include it.

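One way to offset the scoring of these less important words is by using inverse document frequency. Whereas term frequency counts how many times a term appears on a page, inverse document frequency reduces the weight of terms that appear in many documents. In the simplified version used here, the number of times a word appears on a page is divided by the total number of documents/pages the term appears in. (In practice, inverse document frequency is usually computed as the logarithm of the total number of documents divided by the number of documents containing the term, and is then multiplied by the term frequency.)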

In the table below, since the words “best” and “romance” appear in both pages, they are given a lower weight than the word “film” which only appears in one page. Although Page A has a term frequency score of 30, it has an inverse document frequency score of 15, compared with 19 for Page B, which now ranks first.

Web Page | Term Count                        | Best       | Romance    | Film      | Inverse Document Frequency | Rank
A        | Best – 20, Romance – 10, Film – 0 | 20/2 = 10  | 10/2 = 5   | 0         | 15                         | 2nd
B        | Best – 5, Romance – 13, Film – 10 | 5/2 = 2.5  | 13/2 = 6.5 | 10/1 = 10 | 19                         | 1st

Keep in mind that the above example is simplified, as most search engine results will include hundreds to thousands of pages, rather than just two.
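The table's arithmetic can be reproduced with a short Python sketch. The function simplified_score follows the article's simplified calculation (term count divided by the number of pages containing the term). The second function, log_tfidf_score, is a common textbook variant (term frequency times the logarithm of total documents over documents containing the term) and is included only for comparison; it is not the calculation used in the table.

```python
import math

# Term counts copied from the table for Pages A and B.
PAGE_TERM_COUNTS = {
    "A": {"best": 20, "romance": 10, "film": 0},
    "B": {"best": 5, "romance": 13, "film": 10},
}


def document_frequency(term: str, pages: dict) -> int:
    """Number of pages in which the term appears at least once."""
    return sum(1 for counts in pages.values() if counts.get(term, 0) > 0)


def simplified_score(page: str, pages: dict) -> float:
    """Article's simplified weighting: term count / number of pages containing the term."""
    total = 0.0
    for term, count in pages[page].items():
        df = document_frequency(term, pages)
        if df:
            total += count / df
    return total


def log_tfidf_score(page: str, pages: dict) -> float:
    """Textbook-style variant: term count * log(total pages / pages containing the term)."""
    total = 0.0
    for term, count in pages[page].items():
        df = document_frequency(term, pages)
        if df:
            total += count * math.log(len(pages) / df)
    return total


scores = {page: simplified_score(page, PAGE_TERM_COUNTS) for page in PAGE_TERM_COUNTS}
print(scores)  # {'A': 15.0, 'B': 19.0}

log_scores = {page: log_tfidf_score(page, PAGE_TERM_COUNTS) for page in PAGE_TERM_COUNTS}
print({page: round(score, 2) for page, score in log_scores.items()})
```

Both variants rank Page B above Page A. With only two pages, the logarithmic variant gives “best” and “romance” a weight of zero because they appear on every page, so only “film” contributes to the score.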

Inverse document frequency can help even the playing field, but it still isn’t enough to give the most accurate results. For example, if there were a 10,000-word article about the best films, that article might be highly ranked based on the sheer volume of the terms “best” and “film” alone, despite the fact that a 600-word listicle about the best romance films would be more relevant. Therefore, additional scoring elements should be implemented to provide more accurate results.

 

9. Keyword Proximity

What is keyword proximity in search engines?

Keyword proximity is another scoring element used in search engines. Search queries often include a combination of words and as the name suggests, keyword proximity is the distance between each of the individual keywords that occur in the fetched pages. Using the “best romance film” query example, a page that has the sentence “A Walk to Remember is the best romance film,” would be given additional ranking points over a page that says “Nicholas Sparks has signed a film deal for his best-selling romance novel.”
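One simple way to measure proximity, sketched below in Python, is the length of the shortest run of words that contains every query term. This is only one possible measure, and the tokenizer is an assumption: it splits on anything that is not a letter or digit, so “best-selling” counts as the two words “best” and “selling”.

```python
import re
from typing import Optional


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def smallest_window(query: str, text: str) -> Optional[int]:
    """Length in tokens of the shortest span containing every query term,
    or None if some term never appears."""
    terms = set(tokenize(query))
    last_seen: dict[str, int] = {}
    best = None
    for position, token in enumerate(tokenize(text)):
        if token in terms:
            last_seen[token] = position
            if len(last_seen) == len(terms):
                # Shortest window ending here starts at the oldest "last seen" term.
                span = position - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


query = "best romance film"
close = smallest_window(query, "A Walk to Remember is the best romance film")
far = smallest_window(
    query, "Nicholas Sparks has signed a film deal for his best-selling romance novel."
)
print(close)  # 3
print(far)    # 7
```

A smaller window means the keywords sit closer together, so the first sentence would earn more proximity points than the second.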

 

10. Stop Words

What are stop words in search engines?

Stop words are the words in a search query that aren’t taken into account at all when ranking search results. Certain words in a search query are so common that they’ll appear on almost every document or page. For example, when searching for “the best romance film”, there would be little point in giving “the” any weight in the query since “the” appears many times on almost every English web page and document in existence. Implementing stop words is another way to weed out irrelevant search results.
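In code, stop word handling is often just a filter applied to the query before scoring. The Python sketch below uses a tiny, made-up stop word list; real systems use longer, language-specific lists.

```python
# Illustrative stop word list; not taken from any particular library.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def remove_stop_words(query: str) -> list[str]:
    """Drop stop words so they carry no weight when ranking results."""
    return [token for token in query.lower().split() if token not in STOP_WORDS]


filtered = remove_stop_words("the best romance film")
print(filtered)  # ['best', 'romance', 'film']
```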

 

Search engine development is one of the most important applications of machine learning. If you’re looking for more reading on search engines, be sure to check out Search Relevance 101.

For engineers and developers looking for AI training data or search evaluation services, Lionbridge AI has a crowd of 500,000 multilingual experts ready to get to work creating your training data or evaluating your search queries in one of our 300 supported languages.

 

Multilingual Search Engine Evaluation Services

Lionbridge provides professional search engine evaluation services in over 300 languages.

Some of our most popular languages include:

  • Chinese search engine evaluation
  • Italian search engine evaluation
  • Dutch search engine evaluation
  • Japanese search engine evaluation
  • French search engine evaluation
  • Portuguese search engine evaluation
  • German search engine evaluation
  • Spanish search engine evaluation
The Author
Limarc Ambalina

Limarc writes content for Lionbridge’s website as part of the marketing team. Born and raised in Canada, Limarc’s love of Japanese pop culture brought him to Japan in 2016 and living in Japan has been his dream come true. Apart from Lionbridge content, you can catch Limarc online writing about anime, video games, and other nerd culture.
