About the stemming feature

The stemming feature broadens search results to include word roots and word derivations.

Stemming is enabled in an Endeca data store by default.

The default stemming configuration is recorded in the en_word_forms_collection configuration file, which lists all word forms used by the stemming dictionaries in an Endeca data store. The file is created among the Oracle Endeca Server data files when the data store is provisioned, and is typically not modified. However, you can overwrite the default English word forms with those from a stemming file for another language, as explained below.

Stemming is intended to allow words with a common root form (such as the singular and plural forms of nouns) to be considered interchangeable in search operations. For example, search results for the word shirt will include the derivation shirts, while a search for shirts will also include its word root shirt.
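The interchangeability described above can be illustrated with a minimal Python sketch. It is a conceptual model only, not the Endeca Server implementation: the WORD_FORMS groups, the expand and search functions, and the sample records are all hypothetical, and the groups do not reflect the contents or format of the en_word_forms_collection file.

```python
# Hypothetical word-form groups, for illustration only. They do not
# reproduce the contents or format of en_word_forms_collection.
WORD_FORMS = [
    {"shirt", "shirts"},
    {"automobile", "automobiles"},
]


def expand(term):
    """Return the term together with every word form in its group."""
    term = term.lower()
    for group in WORD_FORMS:
        if term in group:
            return set(group)
    # A term with no stemming entry matches only itself.
    return {term}


def search(term, records):
    """Return the records that contain any word form equivalent to the term."""
    forms = expand(term)
    return [r for r in records if forms & set(r.lower().split())]


records = ["Blue cotton shirt", "Pack of three shirts", "Leather belt"]
print(search("shirt", records))
print(search("shirts", records))
```

In this sketch, both searches return the same two records, because shirt and shirts belong to the same word-form group.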

Stemming equivalences are defined among single words. For example, stemming is used to produce an equivalence between the words automobile and automobiles (because the first word is the stem form of the second), but not to define an equivalence between the words vehicle and automobile (this type of concept-level mapping is done via the thesaurus feature).

Stemming equivalences are strictly two-way (that is, all-to-all). For example, if there is a stemming entry for the word truck, then searches for truck will always return matches for both the singular form (truck) and its plural form (trucks), and searches for trucks will also return matches for truck. In contrast, the thesaurus feature supports one-way mappings in addition to two-way mappings.
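The difference between a strictly two-way equivalence and a one-way mapping can be sketched as follows. This is an assumption-laden illustration rather than a description of how Endeca Server stores either kind of entry: STEM_GROUPS, ONE_WAY, and both functions are hypothetical names, and the direction chosen for the one-way entry is arbitrary.

```python
# Hypothetical data, for illustration only.
# Stemming: every form in a group is equivalent to every other form.
STEM_GROUPS = [{"truck", "trucks"}]

# One-way thesaurus-style entry: the key expands to its targets,
# but the targets do not expand back to the key.
ONE_WAY = {"automobile": {"vehicle"}}


def stem_expand(term):
    """Two-way (all-to-all) expansion over a stemming group."""
    for group in STEM_GROUPS:
        if term in group:
            return set(group)
    return {term}


def thesaurus_expand(term):
    """One-way expansion: only the mapped-from term is broadened."""
    return {term} | ONE_WAY.get(term, set())


print(stem_expand("truck") == stem_expand("trucks"))
print(thesaurus_expand("automobile"))
print(thesaurus_expand("vehicle"))
```

Here stem_expand gives the same result whichever form is searched, while thesaurus_expand broadens automobile to include vehicle but leaves a search for vehicle unchanged.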

Note: The stemming implementation does not include decompounding. Decompounding is the ability to decompose a compound word (such as kindergarten) into its single word components (kinder and garten) and then find occurrences based on the smaller words.
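To make clear what is not provided, the following sketch shows what decompounding would involve. It is a generic illustration written for this topic, not an Endeca feature or API: the decompound function and the two-word vocabulary are hypothetical.

```python
def decompound(word, vocabulary):
    """Return one split of word into vocabulary components, or None."""
    if word in vocabulary:
        return [word]
    # Try each prefix; if it is a known word, try to split the remainder.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in vocabulary:
            rest = decompound(tail, vocabulary)
            if rest is not None:
                return [head] + rest
    return None


# Hypothetical vocabulary of single-word components.
vocabulary = {"kinder", "garten"}
print(decompound("kindergarten", vocabulary))  # ['kinder', 'garten']
```

Because stemming operates on single words only, a search for garten does not match kindergarten through the stemming feature.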

Supported languages for stemming

The default language for the stemming feature is English. Stemming files are also available for the following languages:
  • Dutch
  • French
  • German
  • Italian
  • Portuguese
  • Spanish

The stemming files for these languages are shipped with Integrator. To change the stemming language, use Integrator to overwrite the default stemming file with the file for the language you want. For details, see the Oracle Endeca Information Discovery Integrator Components Guide.

Note: An Endeca data store supports only one stemming language at a time.