Oracle ConText Option Application Developer's Guide, Rel. 2.3
This chapter describes the approach used by ConText linguistics to provide advanced analysis of English-language text.
The following topics are covered in this chapter:
ConText linguistics is used to analyze the content of English-language documents. You use ConText linguistics to create different views of the contents of documents that allow the user to quickly review the essential content of documents and determine their relevance.
Because these services are separate and distinct from text and theme indexing, you can incorporate linguistic analysis and functionality in a text application, independent of the text/theme indexing process.
ConText linguistics can generate the following forms of linguistic output for documents:
You obtain linguistic output by submitting a linguistic request using the CTX_LING PL/SQL package. Linguistic requests can only be processed by ConText servers running with the Linguistic personality.
The requirements for using ConText linguistics are:
Note: The setup requirements of having text in a column and having a policy for the column apply to ConText indexes (text/theme) as well as ConText linguistics. The procedures for storing text and creating policies are not discussed in this manual. For more information about storing text in columns and creating policies for the columns, see Oracle ConText Option Administrator's Guide.
To process requests for linguistic output (themes and Gists), a ConText server with the Linguistic personality must be running. A ConText server with the Linguistic personality can also have other personalities in its personality mask.
Starting up ConText servers is the task of the ConText administrator, through the CTXSYS Oracle user.
See Also:
For more information about the Linguistic personality and about specifying personality masks for ConText servers, see Oracle ConText Option Administrator's Guide.
The Services Queue is used for managing ConText linguistic requests. Such a request is cached in memory until the requestor submits the request, at which time the request is added to the Services Queue. If more than one request is cached in memory when the user submits the requests, ConText stores all of the requests as a single batch job.
If a ConText server has the Linguistic personality, the server monitors the Services Queue for requests and processes the next request in the queue.
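The caching, batching, and pickup behavior described above can be modeled in a few lines. The following Python sketch is illustrative only: it is not ConText code, and the class names, the handle numbering, and the single-letter personality codes ("L", "D", "Q") are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count


class ServicesQueue:
    """Toy stand-in for the Services Queue: a FIFO of batch jobs."""

    def __init__(self):
        self._jobs = deque()
        self._handles = count(1)  # hypothetical handle numbering

    def enqueue(self, batch):
        handle = next(self._handles)
        self._jobs.append((handle, batch))
        return handle

    def next_job(self):
        return self._jobs.popleft() if self._jobs else None


@dataclass
class Session:
    """Client side: requests are cached in memory until submit() is called."""
    queue: ServicesQueue
    pending: list = field(default_factory=list)

    def request(self, kind, doc_key):
        # Nothing reaches the queue yet; the request is only cached.
        self.pending.append((kind, doc_key))

    def submit(self):
        # All cached requests are enqueued together as one batch job.
        if not self.pending:
            return None
        handle = self.queue.enqueue(list(self.pending))
        self.pending.clear()
        return handle


class Server:
    def __init__(self, personalities):
        self.personalities = set(personalities)

    def poll(self, queue):
        # Only a server whose mask includes the Linguistic personality
        # picks up linguistic jobs.
        if "L" not in self.personalities:
            return None
        return queue.next_job()


queue = ServicesQueue()
session = Session(queue)
session.request("themes", "doc-1")
session.request("gist", "doc-1")
handle = session.submit()          # both requests become one batch job

picked_by_ddl = Server({"D"}).poll(queue)              # no Linguistic personality
picked_by_linguistic = Server({"L", "Q"}).poll(queue)  # picks up the batch
```

In this model the second server receives both cached requests as a single job, while the first server never sees it.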
The ConText administration tool can be used to perform all administration functions on the Services Queue, such as cleaning up entries. In addition, the CTX_SVC PL/SQL package can be used to perform ConText administration from the command line.
You can generate linguistic output in batch during the text indexing process or generate it as needed. Because the generation of linguistic output is independent of the text-indexing process, ConText places no restrictions on when you can create themes and Gists.
See Also:
For more information about generating linguistic output at indexing time versus generating linguistic output on demand, see "Combining Theme/Text Queries with Linguistic Output" in Chapter 8.
Linguistic and queue management functions are invoked by using PL/SQL procedures called or executed within the programming language in which the application is developed. If the application is developed in PL/SQL, these procedures may be invoked directly as PL/SQL execute statements. If the application is developed in another language, such as C, the PL/SQL procedures for linguistic and queue management functions are accessed through the Oracle Call Interface (OCI).
ConText provides the following PL/SQL packages for generating linguistic output and managing the Services Queue, respectively:
The stored procedures in CTX_LING are used to request linguistic output and submit the requests to the Services Queue. CTX_LING also provides procedures for specifying user settings for generating linguistic output and enabling logging of parse information generated during the processing of a request.
The model for submitting requests and querying the linguistic output is similar to the two-step query model (CONTAINS procedure) provided within the ConText framework for content-based text retrieval.
For example, to generate themes for a document, you first create a table to store the results of the theme generation, then call the CTX_LING.REQUEST_THEMES procedure followed by the CTX_LING.SUBMIT function. ConText stores the results in the theme table. To view the results, issue a SELECT statement to select the themes from the output table.
See Also:
For more information about the procedures in the CTX_LING package, see "CTX_LING: Linguistics" in Chapter 10.
The stored procedures in CTX_SVC are used to monitor the Services Queue for the status of specific requests. CTX_SVC can be used to check the status of pending requests, and to display errors encountered. You can also cancel the request if it has not been picked up for processing by a ConText server or clear the request if the request encountered an error.
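The rules above (cancel only before pickup, clear only after an error) amount to a small state machine. The Python sketch below models those rules for illustration. It does not call CTX_SVC, and the status names and the RequestTracker class are assumptions rather than ConText identifiers.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"    # queued, not yet picked up by a server
    RUNNING = "running"    # being processed by a server
    SUCCESS = "success"
    ERROR = "error"


class RequestTracker:
    """Hypothetical in-memory view of request states in the queue."""

    def __init__(self):
        self._requests = {}  # handle -> (Status, error message or None)

    def add(self, handle):
        self._requests[handle] = (Status.PENDING, None)

    def set_status(self, handle, status, error=None):
        self._requests[handle] = (status, error)

    def status(self, handle):
        return self._requests[handle][0]

    def error(self, handle):
        return self._requests[handle][1]

    def cancel(self, handle):
        # Allowed only while no server has picked the request up.
        if self.status(handle) is not Status.PENDING:
            raise ValueError("only pending requests can be cancelled")
        del self._requests[handle]

    def clear(self, handle):
        # Allowed only for requests that ended in an error.
        if self.status(handle) is not Status.ERROR:
            raise ValueError("only errored requests can be cleared")
        del self._requests[handle]
```

An application would typically check the status first, display any error text, and then decide whether to cancel or clear the request.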
See Also:
For more information about procedures in the CTX_SVC package, see "CTX_SVC: Services Queue Administration" in Chapter 10.
The linguistic core is made up of the following components:
The lexicon is a static knowledge base that provides word and phrase information for the parsing engine. The lexicon recognizes over one million English words and phrases and defines hundreds of lexical characteristics for each word.
Note: The lexicon is specific to the English language, but it recognizes the difference between American and British usage and spelling.
Linguistic information about words in the lexicon is divided into the following types:
The knowledge catalog is a language-independent organization of industries, fields of study, special terms and jargon, and abstract concepts. It creates a classification scheme that defines ConText's semantic view of the world.
ConText uses the knowledge catalog to generate linguistic output, to classify documents by theme during theme indexing, and to normalize theme queries.
See Also:
For more information about the knowledge catalog, see "Understanding Theme Queries" in Chapter 7.
The parsing engine identifies paragraph, sentence, and token (word) boundaries, as well as phrases and clauses. It then passes the tokens to the lexicon where grammar and theme flags are attached and linguistic analysis begins.
Once the lexicon identifies the grammatical function of each word in a sentence, using the word's placement in the sentence and its relationship to the surrounding words, the parsing engine determines the thematic function of the word in the sentence.
As the parsing engine encounters successively larger text blocks (sentences, paragraphs, and the whole document), it expands the analysis to add new information about the text to its knowledge base.
If case-conversion is enabled, the parsing engine converts all the text to lowercase and processes the text through the case-sensitivity routines to determine proper capitalization.
Note: Case conversion does not affect the original text of the documents being processed; only the output of the parsing engine is stored in mixed-case.
ConText linguistics has the following requirements and restrictions for text input:
Each word and sentence should be clearly identified using standard conventions such as blank spaces and recognized punctuation. Complete sentences produce the best results, but are not required. ConText can process incomplete sentences as well as text in headers and lists.
To successfully process text, ConText requires documents to be separated into paragraphs. The method by which paragraph delimiters are recognized depends on whether the text is formatted.
In formatted text, the filters used to extract the text must provide paragraph delimiters that can be recognized by ConText.
The internal filters provided by ConText automatically recognize the paragraph delimiters used in the document format for the filter. Similarly, any external filters used for filtering text must recognize the paragraph delimiters used in the document format for the filter.
With plain (ASCII) text, paragraph delimiters are determined on a per-document basis. ConText samples the first 8 kilobytes of text in a document to identify the most common method used to mark the beginning and end of paragraphs in the sample. That method is then applied to the rest of the document.
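The sampling approach can be illustrated with a short Python sketch. The actual heuristics ConText applies are not described in this chapter, so the two candidate conventions used here (blank lines between paragraphs versus an indented first line) and the function names are assumptions chosen for the example.

```python
import re

SAMPLE_BYTES = 8 * 1024  # the first 8 kilobytes of the document


def detect_paragraph_style(text):
    """Guess the paragraph convention from a sample of the document."""
    sample = text.encode("ascii", errors="replace")[:SAMPLE_BYTES].decode("ascii")
    # Count the two candidate conventions within the sample only.
    blank_line_breaks = len(re.findall(r"\n[ \t]*\n", sample))
    indented_starts = len(re.findall(r"\n[ \t]+\S", sample))
    return "blank_line" if blank_line_breaks >= indented_starts else "indent"


def split_paragraphs(text):
    """Apply the convention found in the sample to the whole document."""
    if detect_paragraph_style(text) == "blank_line":
        parts = re.split(r"\n[ \t]*\n+", text)
    else:
        # Split at each newline that is followed by an indented line.
        parts = re.split(r"\n(?=[ \t]+\S)", text)
    # Collapse internal line breaks so each paragraph is one string.
    return [" ".join(p.split()) for p in parts if p.strip()]
```

The key point is that detection looks only at the sample, while splitting is applied to the entire text. A document whose convention changes after the first 8 kilobytes would therefore be split by the convention found in the sample.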
ConText linguistics can process documents up to a maximum size of 5 megabytes for a single document.
Note: If a ConText linguistics request is submitted for a document larger than 5 megabytes, ConText returns an error and does not generate output for the document.
ConText can analyze written material of all styles and subject matter. This includes technical manuals, literature of all types, newspapers and magazines, encyclopedias, and electronic-mail messages.
ConText linguistics is not well-suited for processing transcriptions of unstructured, spoken words, such as colloquial dialogue or casual conversation. ConText linguistics also does not work well with non-natural languages such as computer programming languages.
ConText linguistics depends on text that is properly capitalized, which helps indicate the beginning of sentences and identifies proper nouns. ConText linguistics can also process text that is not in mixed-case, which is especially useful for all-uppercase or all-lowercase text that may exist in legacy systems.
ConText processes mixed-case text by first reducing the text to all lowercase, then analyzing each word to determine if the word should be capitalized or not.
This internal case-conversion takes place only if the appropriate setting has been enabled in the setting configuration for the session.
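As a rough illustration of the lowercase-then-recapitalize idea, the Python sketch below lowers the text and then restores capitals at sentence starts and for known proper nouns. It is a simplified model, not the ConText case-sensitivity routines. The PROPER_FORMS table is a hypothetical stand-in for lexicon knowledge, and the sentence-boundary rule (a period, exclamation mark, or question mark) is an assumption.

```python
import re

# Hypothetical stand-in for what a lexicon might know about proper nouns.
PROPER_FORMS = {"oracle": "Oracle", "unix": "UNIX", "london": "London"}


def restore_case(text):
    """Return a recapitalized copy; the input string is left unchanged."""
    lowered = text.lower()
    result = []
    start_of_sentence = True
    # Tokens are runs of letters/digits, or single other characters.
    for token in re.findall(r"[a-z0-9]+|[^a-z0-9]", lowered):
        if token[0].isalnum():
            if token in PROPER_FORMS:
                token = PROPER_FORMS[token]
            elif start_of_sentence:
                token = token.capitalize()
            start_of_sentence = False
        elif token in ".!?":
            start_of_sentence = True
        result.append(token)
    return "".join(result)


legacy = "THE SERVER RUNS ON UNIX. ORACLE SHIPS IT FROM LONDON! is it fast?"
print(restore_case(legacy))
# The server runs on UNIX. Oracle ships it from London! Is it fast?
```

Because the function builds a new string, the original text is untouched, which matches the behavior described for case conversion.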
ConText linguistics produces the following output:
Theme indexes are created as a prerequisite for issuing theme queries. Given a theme policy, you can create a theme index for all documents in an entire text column using CTX_DDL.CREATE_INDEX.
See Also:
For more information about creating theme indexes, see "Understanding Theme Queries" in Chapter 5.
You can generate a list of themes, or main concepts, of a document on a per-document basis. Because themes present a profile of the main subjects of a document, a list of themes provides a snapshot of what the document is about. You can generate up to 16 themes for each document, using the CTX_LING.REQUEST_THEMES procedure. When you generate the themes for a document, each theme is assigned a relative weight.
Note: ConText linguistics produces only document-level themes; paragraph-level themes cannot be produced.
See Also:
For more information about generating themes, see "Generating Themes and Gists" in Chapter 8.
Each document theme is assigned a weight that measures the strength of the theme relative to the other themes in the document.
The cumulative weight of a theme also reflects the overall thematic content of the document. As such, theme weights can be used to compare a document theme to other themes within the same document or to other documents with the same theme.
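Both uses of theme weights can be sketched in Python. The example assumes an application has already read rows of (document key, theme, weight) from a theme output table. The row layout, the keys, and the weight values shown are made up for illustration and are not ConText output.

```python
from collections import defaultdict

# Made-up sample rows: (document key, theme, weight).
rows = [
    ("doc-1", "operating systems", 42),
    ("doc-1", "UNIX", 30),
    ("doc-1", "DOS", 18),
    ("doc-2", "UNIX", 12),
    ("doc-2", "networking", 55),
]


def themes_by_document(rows):
    """Rank each document's themes from strongest to weakest."""
    grouped = defaultdict(list)
    for doc, theme, weight in rows:
        grouped[doc].append((theme, weight))
    return {
        doc: sorted(themes, key=lambda item: item[1], reverse=True)
        for doc, themes in grouped.items()
    }


def documents_for_theme(rows, theme):
    """Rank the documents that share a theme by that theme's weight."""
    matches = [(doc, weight) for doc, t, weight in rows if t == theme]
    return sorted(matches, key=lambda item: item[1], reverse=True)
```

With the sample rows, `themes_by_document(rows)["doc-1"]` lists "operating systems" first, and `documents_for_theme(rows, "UNIX")` ranks doc-1 ahead of doc-2.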
The themes produced by ConText linguistics are essentially document classifications. Each theme provides information that can be used to classify the document into a semantic world view (classification structure) defined by the user. For this reason, ConText linguistics always normalizes the terms and phrases in the theme output to their noun and plural forms, if applicable.
In addition, the theme output is not always a direct result of the actual terms and phrases found in a document. Often the output reflects ConText's understanding of how themes are related.
For example, if a document provides a detailed discussion of MS-DOS and UNIX, ConText returns DOS and UNIX as themes for the document; however, ConText might also return operating systems as a theme, indicating that a relationship exists between DOS and UNIX. The document could be classified under DOS, UNIX, operating systems, or any combination of the three.
A theme summary for a document provides a short summary of the document from a specific point-of-view. You can generate two types of theme summaries:
A paragraph-level theme summary consists of the paragraph or paragraphs that best represent a single document theme. A sentence-level theme summary consists of the sentence or sentences that best represent a single document theme.
To create either paragraph-level or sentence-level theme summaries, use CTX_LING.REQUEST_GIST.
Because it provides a concise, focused summary for a particular theme in a document, a theme summary can be used to compare documents with similar themes.
You can control the size of sentence-level and paragraph-level theme summaries with linguistic settings.
Note: The settings for theme summaries can only be modified by creating custom setting configurations in the GUI administration tool.
See Also:
For more information about how to generate theme summaries, see "Generating Themes and Gists" in Chapter 8. For more information on specifying linguistic settings, see "Specifying Linguistic Settings" in Chapter 8. For a complete list of ConText's predefined labels, see the specification for CTX_LING.SET_SETTINGS_LABEL in Chapter 10.
A Gist for a document provides a summary that reflects all of the themes in the document. You can generate two types of Gists:
A paragraph-level Gist consists of the document paragraphs that best represent the themes in a document as a whole. A sentence-level Gist is the sentence or sentences that best represent the themes in a document as a whole.
To generate either a paragraph-level or sentence-level Gist, use CTX_LING.REQUEST_GIST.
Because a Gist is generally longer than a theme summary, it serves better as a document reading tool than a document selection tool. For example, it can be used to quickly scan a document and to extract the most meaningful thematic information.
You can specify settings to control the size of the Gist.
Note: The settings for Gists can only be modified by creating custom setting configurations in the GUI administration tool.
See Also:
For more information about how to generate Gists, see "Generating Themes and Gists" in Chapter 8. For more information on specifying linguistic settings, see "Specifying Linguistic Settings" in Chapter 8. For a complete list of ConText's predefined labels, see the specification for CTX_LING.SET_SETTINGS_LABEL in Chapter 10.
You can perform linguistic processing of documents to generate themes and Gists only when a ConText server with the Linguistic personality is running. The type of processing is determined by the following options:
There is a default configuration, but you can also set these options by specifying a label with the CTX_LING.SET_SETTINGS_LABEL procedure. A label is a predefined configuration of settings.
See Also:
For more information on how to specify linguistic settings, see "Specifying Linguistic Settings" in Chapter 8. For a complete list of ConText's predefined labels, see the specification for CTX_LING.SET_SETTINGS_LABEL in Chapter 10.
Copyright © 1997 Oracle Corporation. All Rights Reserved.