Overview of using internationalized data

Oracle Endeca Server support for the Unicode Standard version 4.0 allows an Endeca data domain to process and serve data in many of the world’s languages.

At data ingest time (or later, via a Configuration Web Service operation), you can specify that a given standard attribute will use internationalized data when it is provided in a native encoding. At query time, you can specify the language to be used for the record search or value search.

The section makes the following assumptions:

For more information about the Unicode Standard and character encoding, see http://unicode.org.

Overview of supported language features

The following is a high-level list of which features are supported for international languages:
Feature | Language support
Auto-correction spelling | Language-specific automatic spelling correction is available for all supported languages (i.e., spelling dictionaries are available for all supported languages).
Stemming | Language-specific stemming is available for all supported languages.
Did You Mean (DYM) suggestions | Language-specific DYM is available for all supported languages.
Snippeting | Available for all supported languages.
Thesaurus | One language-agnostic thesaurus is available for use with queries in any of the supported languages (i.e., language-specific thesauruses are not supported).
Search characters | Available only for the unknown language identifier.
Stop words | Available only for the unknown language identifier.
Language auto-detection | Auto-detection of languages at ingest or query time is not supported. The user must explicitly specify the language for the PDR or the query.
Language collation | Language-specific collation (sorting) is not available for the supported languages.

Diacritic folding

Diacritic folding is the default behavior for all supported languages (including "unknown") during record searches. This feature automatically maps ISO-Latin1 international characters to their ASCII equivalents in record search queries. In effect, it ignores character accents so that search queries containing international characters match Anglicized result text. For example, a query for "café" will match "cafe" in records. Note that you cannot disable diacritic folding.
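
The general idea of diacritic folding can be illustrated with a short Python sketch. This is not the Endeca Server implementation, which is not described here; it is a minimal approximation that assumes folding can be modeled as Unicode decomposition followed by removal of combining marks. The function names fold_diacritics and matches are hypothetical and exist only for this illustration.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Approximate diacritic folding (illustrative only, not Endeca's code)."""
    # NFD decomposition splits a character such as "é" into the base
    # letter "e" followed by a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, record_text: str) -> bool:
    """Hypothetical matcher: substring test on folded, case-folded text."""
    folded_query = fold_diacritics(query).casefold()
    folded_record = fold_diacritics(record_text).casefold()
    return folded_query in folded_record


if __name__ == "__main__":
    print(fold_diacritics("café"))       # accent removed from the query
    print(matches("café", "Cafe Roma"))  # accented query, unaccented record
```

Characters that have no Unicode decomposition (for example "ø" or "ß") pass through this sketch unchanged, so a full mapping from ISO-Latin1 to ASCII would need an explicit substitution table in addition to this step.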