Five New Optimizely Certifications are Here! Validate your expertise and advance your career with our latest certification exams. Click here to find out more

Viktor Sahlström
Aug 22, 2016
  3136
(0 votes)

Episerver Find stemming

One of the things that makes search challenging (and really interesting) is language handling. Often, spoken languages differ from what a programmer is familiar with, since the rules are more comparable to a never-ending set of exceptions than actual rules. To properly analyze a string of text, the system must understand all of these exceptions.

A great feature of Episerver Find is stemming. Stemming reduces an inflected word to its root form (a.k.a. stem), for example, "fishing", "fished", and "fisher" have a root word of "fish." If the root word is determined, it can be used to return the full set of related items, thus improving retrievability and relevancy of search results.

Episerver Find uses snowball stemmers shipped as the default stemmer with Elastic search. This stemmer handles the general rules quite well but does not handle all special cases. In many languages, this works very well (English, for example). But depending on the complexity of the language and the maturity of the stemmer, this is not always enough. Swedish is a case where the default stemmer is not always perfect in execution. Often, the default stemmer creates a conflict that, in turn, causes unexpected search hits.

As an example, consider the Swedish words “bananen” (the banana) and “banans” (the race tracks). Using normal Swedish stemming rules, they would both be stemmed down to “banan.” In this case, any search also stemmed down to “banan” would give both results even though half of them are not relevant.
 
To fix this, a list of exceptions has been added to the Find stemming. We have started with Swedish and will look at additional languages going forward. The new algorithm recognizes that “bananen” and “banans” are different words even though their stem is the same. Hence, it creates unique tokens from them so the search engine can distinguish them at query time. This is a great improvement to search relevancy in many cases. One thing that remains to be solved is the case of “banan” (banana) and “banan” (the race track). In this form, the words are spelled exactly the same and cannot be distinguished without looking at the context. For these cases, search results are returned for both words.

To keep the list updated we would love users and partners to let us know if they find searches that results in weird results. 

Aug 22, 2016

Comments

Please login to comment.
Latest blogs
Optimizely Configured Commerce and Spire CMS - Figuring out Handlers

I recently entered the world of Optimizely Configured Commerce and Spire CMS. Intriguing, interesting and challenging at the same time, especially...

Ritu Madan | Mar 12, 2025

Another console app for calling the Optimizely CMS REST API

Introducing a Spectre.Console.Cli app for exploring an Optimizely SaaS CMS instance and to source code control definitions.

Johan Kronberg | Mar 11, 2025 |

Extending UrlResolver to Generate Lowercase Links in Optimizely CMS 12

When working with Optimizely CMS 12, URL consistency is crucial for SEO and usability. By default, Optimizely does not enforce lowercase URLs, whic...

Santiago Morla | Mar 7, 2025 |

Optimizing Experiences with Optimizely: Custom Audience Criteria for Mobile Visitors

In today’s mobile-first world, delivering personalized experiences to visitors using mobile devices is crucial for maximizing engagement and...

Nenad Nicevski | Mar 5, 2025 |

Unable to view Optimizely Forms submissions when some values are too long

I discovered a form where the form submissions could not be viewed in the Optimizely UI, only downloaded. Learn how to fix the issue.

Tomas Hensrud Gulla | Mar 4, 2025 |

CMS 12 DXP Migrations - Time Zones

When it comes to migrating a project from CMS 11 and .NET Framework on the DXP to CMS 12 and .NET Core one thing you need to be aware of is the...

Scott Reed | Mar 4, 2025