Oracle® TimesTen In-Memory Database Operations Guide 11g Release 2 (11.2.2) E21633-12 |
|
|
View PDF |
The following sections describe TimesTen globalization support features:
TimesTen globalization support includes the following features:
Character set support
You must choose a database character set when you create a database. See "Choosing a database character set" in this book and the "Supported character sets" section in the Oracle TimesTen In-Memory Database Reference for all supported character sets. You can also choose a connection character set for a session. See "Connection character set".
Length semantics
You can specify byte semantics or character semantics for defining the storage measurement of character data types. See "Length semantics and data storage".
Linguistic sorts and indexes. You can sort data based on linguistic rules. See "Linguistic sorts". You can use linguistic indexes to improve performance of linguistic sorts. See "Using linguistic indexes".
SQL string and character functions
TimesTen provides SQL functions that return information about character strings. TimesTen also provides SQL functions that return a character from an encoded value. See "SQL string and character functions".
Note:
This release of TimesTen does not support session language and territory.TimesTen uses the database character set to define the encoding of data stored in character data types, such as CHAR
and VARCHAR2
.
Use the DatabaseCharacterSet
data store attribute to specify the database character set during database creation. You cannot alter the database character set after database creation, and there is no default value for DatabaseCharacterSet
. See "Supported character sets" in the Oracle TimesTen In-Memory Database Reference for a list of supported character sets.
Consider the following questions when you choose a character set for a database:
What languages does the database need to support now and in the future?
Is the character set available on the operating system?
What character sets are used on clients?
How well does the application handle the character set?
What are the performance implications of the character set?
If you are using TimesTen Application-Tier Database Cache (TimesTen Cache) to cache Oracle database tables, or if you loading Oracle database data into a TimesTen table, you must create the database with the same database character set as the Oracle database.
This section includes the following topics:
Choosing a database character set determines what languages can be represented in the database.
A group of characters, such as alphabetic characters, ideographs, symbols, punctuation marks, and control characters, can be encoded as a character set. An encoded character set assigns unique numeric codes to each character in the character repertoire. The numeric codes are called code points or encoded values.
Character sets can be single-byte or multibyte. Single-byte 7-bit encoding schemes can define up to 128 characters and normally support just one language. Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese because these languages use thousands of characters. These encoding schemes use either a fixed number or a variable number of bytes to represent each character. Unicode is a universal encoded character set that enables information from any language to be stored using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.
The database character set is independent of the operating system. On an English operating system, you can create and run a database with a Japanese character set. However, when an application in the client operating system accesses the database, the client operating system must be able to support the database character set with appropriate fonts and input methods. For example, you cannot insert or retrieve Japanese data on the English Windows operating system without first installing a Japanese font and input method. Another way to insert and retrieve Japanese data is to use a Japanese operating system remotely to access the database server.
If all client applications use the same character set, then that character set is usually the best choice for the database character set. When client applications use different character sets, the database character set should be a superset of all the application character sets. This ensures that every character is represented when converting from an application character set to the database character set.
For best performance, choose a character set that avoids character set conversion and uses the most efficient encoding for the languages desired. Single-byte character sets result in better performance than multibyte character sets and are more efficient in terms of space requirements. However, single-byte character sets limit how many languages you can support.
In single-byte character sets, the number of bytes and the number of characters in a string are the same. In multibyte character sets, a character or code point consists of one or more bytes. Calculating the number of characters based on byte lengths can be difficult in a variable-width character set. Calculating column lengths in bytes is called byte semantics, while measuring column lengths in characters is called character semantics.
Character length and byte length semantics are supported to resolve potential ambiguity regarding column length and storage size. Multibyte encoding character sets are supported, such as UTF-8 or AL32UTF8
. Multibyte encodings require varying amounts of storage per character depending on the character. For example, a UTF-8 character may require from 1 to 4 bytes. If, for example, a column is defined as CHAR (10)
, all 10 characters fit in this column regardless of character set encoding. However, for UTF-8
character set encoding, up to 40 bytes are required.
Character semantics is useful for defining the storage requirements for multibyte strings of varying widths. For example, in a Unicode database (AL32UTF8
), suppose that you need to define a VARCHAR2
column that can store up to five Chinese characters together with five English characters. Using byte semantics, this column requires 15 bytes for the Chinese characters, where each are three bytes long, and 5 bytes for the English characters, where each are one byte long, for a total of 20 bytes. Using character semantics, the column requires 10 characters.
The expressions in the following list use byte semantics. Note the BYTE
qualifier in the CHAR
and VARCHAR2
expressions.
CHAR (5 BYTE)
VARCHAR2(20 BYTE)
The expressions in the following list use character semantics. Note the CHAR
qualifier in the VARCHAR2
expression.
VARCHAR2(20 CHAR)
SUBSTR(
string
, 1, 20)
By default, the CHAR
and VARCHAR2
character data types are specified in bytes, not characters. Therefore, the specification CHAR(20)
in a table definition allows 20 bytes for storing character data.
The NLS_LENGTH_SEMANTICS
general connection attribute determines whether a new column of character data type uses byte or character semantics. It enables you to create CHAR
and VARCHAR2
columns using either byte-length or character-length semantics without having to add the explicit qualifier. NCHAR
and NVARCHAR2
columns are always character-based. Existing columns are not affected.
The default value for NLS_LENGTH_SEMANTICS
is BYTE
. Specifying the BYTE
or CHAR
qualifier in a data type expression overrides the NLS_LENGTH_SEMANTICS
value.
The database character set determines the encoding of CHAR
and VARCHAR2
character data types. The connection character set is used to describe the encoding of the incoming and outgoing application data, so that TimesTen can perform the necessary character set conversion between the application and the database. For example, a non-Unicode application can communicate with a Unicode (AL32UTF8
) database.
The ConnectionCharacterSet
general connection attribute sets the character encoding for the connection, which can be different than the database character set. The connection uses the connection character set for information that passes through the connection, such as parameters, SQL query text, results and error messages. Choose a connection character set that matches the application environment or the character set of your data source.
Best performance results when the connection character set and the database character set are the same because no conversion occurs. When the connection character set and the database character set are different, data conversion is performed in the ODBC layer. Characters that cannot be converted to the target character set are changed to replacement characters.
The default connection character set is US7ASCII
. This setting applies to both direct and client connections.
Different languages have different sorting rules. Text is conventionally sorted inside a database according to the binary codes used to encode the characters. Typically, this does not produce a sort order that is linguistically meaningful. A linguistic sort handles the complex sorting requirements of different languages and cultures. It enables text in character data types, such as CHAR
, VARCHAR2
, NCHAR
, and NVARCHAR2
, to be sorted according to specific linguistic conventions.
A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order. TimesTen offers two kinds of linguistic sorts: monolingual and multilingual.
This section includes the following topics:
TimesTen compares character strings in two steps for monolingual sorts. The first step compares the major value of the entire string from a table of major values. Usually, letters with the same appearance have the same major value. The second step compares the minor value from a table of minor values. The major and minor values are defined by TimesTen. TimesTen defines letters with accent and case differences as having the same major value but different minor values.
Monolingual linguistic sorting is available only for single-byte and Unicode database character sets. If a monolingual linguistic sort is specified when the database character set is non-Unicode multibyte, then the default sort order is the binary sort order of the database character set.
For a list of supported sorts, see "NLS_SORT "in the Oracle TimesTen In-Memory Database Reference.
TimesTen provides multilingual linguistic sorts so that you can sort data for multiple languages in one sort. Multilingual linguistic sort is based on the ISO/OEC 14651 - International String Ordering and the Unicode Collation algorithm standards. This framework enables the database to handle languages that have complex sorting rules, such as those in Asian languages, as well as providing linguistic support for databases with multilingual data.
In addition, multilingual sorts can handle canonical equivalence and supplementary characters. Canonical equivalence is a basic equivalence between characters or sequences of characters. For example, ç is equivalent to the combination of c and ,.
For example, TimesTen supports a monolingual French sort (FRENCH
), but you can specify a multilingual French sort (FRENCH_M
). _M represents the ISO 14651 standard for multilingual sorting. The sorting order is based on the GENERIC_M
sorting order and can sort accents from right to left. TimesTen recommends using a multilingual linguistic sort if the tables contain multilingual data. If the tables contain only French, then a monolingual French sort may have better performance because it uses less memory. It uses less memory because fewer characters are defined in a monolingual French sort than in a multilingual French sort. There is a trade-off between the scope and the performance of a sort.
For a list of supported multilingual sorts, see "NLS_SORT" in the Oracle TimesTen In-Memory Database Reference.
Operations inside a database are sensitive to the case and the accents of the characters. Sometimes you might need to perform case-insensitive or accent-insensitive comparisons.
To specify a case-insensitive or accent-insensitive sort:
Append _CI
to a TimesTen sort name for a case-insensitive sort. For example:
BINARY_CI
: accent-sensitive and case-insensitive binary sort
GENERIC_M_CI
: accent-sensitive and case-insensitive GENERIC_M
sort
Append _AI
to a TimesTen sort name for an accent-insensitive and case-insensitive sort. For example:
BINARY_AI
: accent-insensitive and case-insensitive binary sort
FRENCH_M_AI
: accent-insensitive and case-insensitive FRENCH_M
sort
The NLS_SORT
data store connection attribute indicates which collating sequence to use for linguistic comparisons. The NLS_SORT
value affects the SQL string comparison operators and the ORDER BY
clause.
You can use the ALTER SESSION
statement to change the value of NLS_SORT
:
ALTER SESSION SET NLS_SORT=SWEDISH; SELECT product_name FROM product ORDER BY product_name; PRODUCT NAME ------------ aerial Antenne Lcd ächzen Ähre
You can also override the NLS_SORT
setting by using the NLSSORT
SQL function to perform a linguistic sort:
SELECT * FROM test ORDER BY NLSSORT(name,'NLS_SORT=SPANISH');
Note:
For materialized views and cache groups, TimesTen recommends that you explicitly specify the collating sequence using theNLSSORT()
SQL function rather than using this attribute in the connection string or DSN definition.For more details, see "NLS_SORT" in the Oracle TimesTen In-Memory Database Reference. For more extensive examples of using NLSSORT
, see "NLSSORT" in the Oracle TimesTen In-Memory Database SQL Reference.
You can create a linguistic index to achieve better performance during linguistic comparisons. A linguistic index requires storage for the sort key values.
To create a linguistic index, use a statement similar to the following:
CREATE INDEX german_index ON employees (NLSSORT(employee_id, 'NLS_SORT=GERMAN'));
The optimizer chooses the appropriate index based on the values for NLSSORT
and NLS_SORT
.
You must create multiple linguistic indexes if you want more than one linguistic sort on a column. For example, if you want both GERMAN
and GERMAN_CI
sorts against the same column, create two linguistic indexes.
For more information, see "CREATE INDEX" in the Oracle TimesTen In-Memory Database SQL Reference.
There are many SQL functions that operate on character strings, which are detailed in the following sections in the Oracle TimesTen In-Memory Database SQL Reference:
"Conversion functions" (describing some functions that operate on character strings).
"General comparison functions" (describing some functions that operate on character strings but not on LOB data types).
The globalization support attributes are summarized in the following table:
Parameter | Description |
---|---|
DatabaseCharacterSet |
Indicates the character encoding used by a database. |
ConnectionCharacterSet |
Determines the character encoding for the connection, which may be different from the database character set. |
NLS_SORT |
Indicates the collating sequence to use for linguistic comparisons. |
NLS_LENGTH_SEMANTICS |
Sets the default length semantics. |
NLS_NCHAR_CONV_EXCP |
Determines whether an error is reported when there is data loss during an implicit or explicit data type conversion between NCHAR /NVARCHAR2 data and CHAR /VARCHAR2 data. |
DatabaseCharacterSet
must be set during database creation. There is no default. See "Choosing a database character set".
The rest of the attributes are set during connection to a database. For more information about ConnectionCharacterSet
, see "Connection character set".
You can use the ALTER SESSION
statement to change the following attributes during a session:
NLS_SORT
NLS_LENGTH_SEMANTICS
NLS_NCHAR_CONV_EXCP
For more information, see "ALTER SESSION" in the Oracle TimesTen In-Memory Database SQL Reference and "Connection Attributes" in the Oracle TimesTen In-Memory Database Reference.
TIMESTEN8
is a restricted database character set that specifies behavior from TimesTen releases before 7.0. It is supported for backward compatibility only.
TIMESTEN8
has the following restrictions:
There is no support for character set conversion of any kind. This includes:
Conversions between the application and the database. If DatabaseCharacterSet
is TIMESTEN8
, then ConnectionCharacterSet
must also be TIMESTEN8
.
Conversions between CHAR
/VARCHAR2
data and NCHAR
/NVARCHAR2
data.
Sorting for CHAR
and VARCHAR2
data types is limited to binary ordering. NLS_SORT=BINARY
is the only acceptable collating sequence allowed.
CHAR
semantics are ignored. Characters are single-byte.
UPPER
and LOWER
functions support ASCII
characters only. Results for non-ASCII characters are undefined. TimesTen does not return an error.
TIMESTEN8
is not supported in TimesTen Cache.
There is no support for any LOB data types in TIMESTEN8
.
During database creation, customers should select the database character set matching the actual encoding of the data being stored in CHAR
and VARCHAR2
columns whenever possible. Select TIMESTEN8
only when backwards compatibility to existing TimesTen data is required.
Note:
For details on potential issues when migrating a TimesTen database that uses theTIMESTEN8
character set to a database with a different character set, see "Migration, Backup, and Restoration" in the Oracle TimesTen In-Memory Database Installation Guide.Globalization support may cause issues during migration. For complete details, see "Migration, Backup, and Restoration" in the Oracle TimesTen In-Memory Database Installation Guide and the description of "ttMigrate" in the Oracle TimesTen In-Memory Database Reference.