Oracle Text Mining
The Oracle Life Science Text Mining Demo was developed to demonstrate and encourage use of the Oracle database and applications in text mining of life science literature. The demo can be used with other text sources as detailed in section II. The demo uses a number of Oracle components, including Oracle Text and Oracle Data Mining (ODM), as well as non-Oracle products such as the Apache web server, the Perl programming language and the Cytoscape visualization software. The demo was tested with the Oracle 10gR2 database on Windows XP and Linux. Compatibility issues with other operating systems are not known. The demo uses a number of Oracle Text and Oracle Data Mining features found only in Oracle database 10gR2.
The demo package installation instructions lead a user through the process of preparing and loading MEDLINE records into the database. A sample MEDLINE data file is provided. In addition, the demo can be used with custom datasets as described below.
MEDLINE is the National Library of Medicine's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences. MEDLINE contains bibliographic citations and author abstracts from more than 4,800 biomedical journals published in the United States and 70 other countries. The database contains over 12 million citations dating back to the mid-1960's. Coverage is worldwide, but most records are from English-language sources or have English abstracts.
The BioOracle MEDLINE Text Mining demo is designed for MEDLINE documents. The provided sample dataset includes 4318 documents containing the terms 'AR' and 'cancer or neoplasia'. 'AR' is a common gene symbol of the androgen receptor gene (LocusLink ID 367) with a large role in cancer research. It is also frequently used for a number of other genes and non-gene acronyms. The demo has been tested with a database of over 140,000 MEDLINE documents.
MeSH is NLM's controlled vocabulary used for indexing articles for MEDLINE/PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts. MeSH terms associated with individual MEDLINE documents can be utilized in the search query and as a basis of NMF feature extraction and clustering.
The document data source can be selected or changed on the Data Source Selector page, accessible directly at
http://<SERVER_NAME>/cgi-bin/TextMining/TM.pl or by clicking the Change Data Source button on the demo search page
http://<SERVER_NAME>/cgi-bin/TextMining/search.pl.
The SRC.pm module contains a list of all available demo sources in the format:
$SRC::tables{"short_name"} = "full description";
For example, the MEDLINE dataset is registered as:
$SRC::tables{"MEDLINE"} = "MEDLINE abstracts";
Multiple MEDLINE tables can be used. While the full description can be any descriptive string of text about the specific contents of the dataset, the short_name has to be unique for each dataset and should not contain spaces or other characters not allowed in file names, e.g.,
$SRC::tables{"MEDLINE_1"} = "MEDLINE abstracts about cancer biology";
$SRC::tables{"MEDLINE_2"} = "MEDLINE abstracts on clinical trials";
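The naming rule above can be checked mechanically. The following is a minimal sketch, not part of the demo: the `%tables` hash is a hypothetical stand-in for `%SRC::tables`, the helper name `invalid_short_names` is invented for illustration, and the set of "safe" characters (letters, digits, underscore) is an assumption. Note that Perl hash keys are unique by construction, so a repeated short_name would silently overwrite the earlier entry rather than raise an error.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the %SRC::tables hash defined in SRC.pm.
my %tables = (
    "MEDLINE_1" => "MEDLINE abstracts about cancer biology",
    "MEDLINE_2" => "MEDLINE abstracts on clinical trials",
    "my notes"  => "Contains a space, so it should be flagged",
);

# Return the short names that are unsafe as part of a file name.
# Assumption: only letters, digits and underscores are considered safe.
sub invalid_short_names {
    my ($tbl) = @_;
    my @bad = grep { $_ !~ /^[A-Za-z0-9_]+$/ } sort keys %$tbl;
    return @bad;
}

my @bad = invalid_short_names(\%tables);
print "Invalid short name: '$_'\n" for @bad;
```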
The OTU.pm module contains information about the Oracle database connection, as well as specifics about the document dataset table, text index and text sections for use in reporting and mining operations. The contents of this file are changed each time a new dataset is selected. Information specific to a particular dataset is archived in a file named SRC.<short_name>.pm, where <short_name> must match the table key in the SRC.pm file, such as "MEDLINE".
Database Connection Parameters include the username, password, database name or SID, the server host and port. The demo setup scripts create a user named 'text' with a password 'miner' and assume the database name is 'ora10gR2', the server is 'localhost' with port 1521. This can be modified in the following lines of code:
$OTU::user = "text";
$OTU::passwd = "miner";
$OTU::dbname = "ora10gR2"; # SID
$OTU::dbserver = "localhost";
$OTU::port = "1521";
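As an illustration of how these five settings fit together, the sketch below assembles a data source name in the `host=...;sid=...;port=...` form accepted by the DBD::Oracle driver. Whether the demo itself connects through DBI is an assumption, and the `%conn` hash and `oracle_dsn` helper are hypothetical names used only here.

```perl
use strict;
use warnings;

# Hypothetical copies of the connection settings from OTU.pm.
my %conn = (
    user     => "text",
    passwd   => "miner",
    dbname   => "ora10gR2",    # SID
    dbserver => "localhost",
    port     => "1521",
);

# Build a DBD::Oracle-style data source name from the settings.
sub oracle_dsn {
    my ($c) = @_;
    return "dbi:Oracle:host=$c->{dbserver};sid=$c->{dbname};port=$c->{port}";
}

my $dsn = oracle_dsn(\%conn);
print "$dsn\n";

# With DBI and DBD::Oracle installed, a connection could then be opened as:
#   my $dbh = DBI->connect($dsn, $conn{user}, $conn{passwd}, { RaiseError => 1 });
```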
Table and Text Index details include 'column type', 'dataset name', 'table name', 'text index name', 'document key column name' and 'indexed text column name'. These can be modified in the following line of code:
# Usage: push(@tbl_info, column_type, dataset_name, table_name, text_index_name, document_key_column_name, indexed_text_column_name);
push(@tbl_info,"xml","MEDLINE","medtab","medtab_idx","pmid","text");
- column_type can be either xml or clob
- dataset_name is a short dataset name
- table_name is the name of the database table with the indexed text
- text_index_name is the name of the text index
- document_key_column_name is the name of the document ID table column
- indexed_text_column_name is the name of the text data table column
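Because `@tbl_info` is a flat list, the six values are identified only by position. A minimal sketch of reading them back into named fields follows; `parse_tbl_info` and the `@fields` list are hypothetical helpers written for this example, not code from the demo.

```perl
use strict;
use warnings;

my @tbl_info;
push(@tbl_info, "xml", "MEDLINE", "medtab", "medtab_idx", "pmid", "text");

# Field names in the order given by the usage comment.
my @fields = qw(column_type dataset_name table_name
                text_index_name document_key_column_name
                indexed_text_column_name);

# Split the flat list into one hash per dataset; reject incomplete entries.
sub parse_tbl_info {
    my @flat = @_;
    die "tbl_info length must be a multiple of " . scalar(@fields) . "\n"
        if @flat % @fields;
    my @datasets;
    while (my @chunk = splice(@flat, 0, scalar @fields)) {
        my %d;
        @d{@fields} = @chunk;
        die "column_type must be xml or clob\n"
            unless $d{column_type} =~ /^(?:xml|clob)$/;
        push @datasets, \%d;
    }
    return @datasets;
}

my @datasets = parse_tbl_info(@tbl_info);
printf "%s: table %s, index %s\n",
    @{$_}{qw(dataset_name table_name text_index_name)} for @datasets;
```

Positional lists are compact but easy to misorder; naming the fields at the point of use makes a swapped table and index name much easier to spot.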
Document sections, if present and identified during the text indexing stage, are used during searching, document viewing and text mining operations. The MEDLINE demo contains individual sections within XML tag sets, such as article title, abstract text, MeSH terms, authors and affiliations, etc. Any of those tags can be selected for the three types of operations.
Searching can be done on the whole document (within every section if sections are used) or just within a selected list of sections. The following lines need to remain unchanged for all datasets and data types (XML or CLOB):
%OTU::search_sections = ();
# default for section-less search (do not change)
$OTU::search_sections{"all"} = "Whole Record";
Subsequently, individual document search sections can be listed to allow more targeted searches. The syntax is:
$OTU::search_sections{"section_tag_name"} = "Description";
For the MEDLINE demo, the following lines comprise a list of targeted search sections:
$OTU::search_sections{"AbstractText"} = "Abstract Text";
$OTU::search_sections{"ArticleTitle"} = "Article Title";
$OTU::search_sections{"NameOfSubstance"} = "Name of Substance";
$OTU::search_sections{"DescriptorName"} = "MeSH Term";
$OTU::search_sections{"QualifierName"} = "MeSH Qualifier";
A listing of the documents returned by the search will include the first 120 characters of the document text unless a specific section is selected for this purpose. The MEDLINE demo search results section is the article title:
$OTU::res_sections = "ArticleTitle";
Finally, a group of text mining sections is used by the Oracle Text clustering, classification and document group summary algorithms. These sections do not affect the Oracle Data Mining NMF Feature Extraction algorithms (see section V below). If no sections are present or the whole document should be used, there is no need to set this variable and it can be removed. The MEDLINE demo uses the article title and abstract text:
@OTU::mining_sections = ("/MedlineCitation/Article/ArticleTitle",
"/MedlineCitation/Article/Abstract/AbstractText");
The primary use of thesauri in the demo is automatic expansion of query phrases into longer lists of synonyms or hierarchically related phrases. The thesauri are user generated. This demo provides a number of example thesauri, including the NCI Thesaurus, MeSH terminology tree, GO terminology and a transcription factor gene thesaurus. All the included thesauri are hierarchical and most can be viewed in the StretchViewer applet launched from the main search page. The contents of the data files for the applet viewer are independent of the Oracle thesauri stored in the database, but have an identical structure.
The THES.pm module contains the list of thesauri browsable in the StretchViewer applet via the buttons on the main search page. For example, the MeSH thesaurus StretchViewer launcher button is defined and made available through the following syntax:
# MeSH
$T7_btn = qq{<button type=button ID='form' NAME='B7' style='width:150'
onclick='window.open("/applet/launch_MeSH_stretchviewer.html", "MeSH",
"menubar=no, location=no, status=no, toolbar=no, resizable=yes")'>
<span id='input'>MeSH</span>
</button>};
push(@THES::thesauri, $T7_btn);
The following thesauri are included:
Usage of thesauri for searching and information extraction is further described in the Text Search and Ontology Co-occurrence sections.
The Oracle Knowledge Base is used to assign weighted "Themes" to documents during the text index creation. Theme indexing allows for searching for documents "ABOUT" a specific topic. Documents matching such a query will contain words matching the description of the topic. The description is represented as a hierarchical tree of concepts. For example, searching for documents about "dogs" would return documents matching terms like: dog, canine, great dane, german shepherd, etc. A search for "terriers" would not match documents about bulldogs (unless such a document also describes terriers), but will identify documents with reference to all terrier types. The Knowledge Base may be modified through incorporation of user thesauri. See the Oracle Text Application Developer's Guide for more information.
The Knowledge Base Themes that have been identified in the indexed documents can be queried on the Knowledge Base Theme Browser page, launched by clicking the Query Knowledge Base button on the main search page.
The primary search form requests four pieces of information to perform the search. The Query field is used to enter the specific text query as detailed below. Search Fields are used to target the search at a specific list of existing document sections as defined in the OTU.pm file. The default choice is to search the whole document or all sections. Checking the Whole Record checkbox overrides any individual section selections. Search Type specifies whether the search is a traditional Key Word search or a Theme search based on Knowledge Base concepts. Finally, the limit on the number of highest-scoring matches reported is set. This limit affects both the number of results displayed and the list of documents passed on to all the text mining functions.
Alternatively, a random list of a specified number of documents may be fetched for subsequent use by the text mining methods, i.e., the randomly selected documents can be both clustered and classified. The documents can even be used to create new classification model categories or expand any existing category training sets (such as the NULL_MODEL).
The query phrase entered in the 'Query' text field is embedded in the Oracle Text SQL CONTAINS statement. For a Key Word query in the Whole Record, the syntax is:
SELECT score(1), doc_id_column
FROM table_name
WHERE CONTAINS(text_column, 'query phrase', 1) > 0
ORDER BY score(1) DESC;
Selection of one or more targeted search sections adds the WITHIN section statement:
SELECT score(1), doc_id_column
FROM table_name
WHERE CONTAINS(text_column, '(query phrase) WITHIN section_name', 1) > 0
ORDER BY score(1) DESC;
Selecting a Theme search embeds the query phrase in the ABOUT statement:
SELECT score(1), doc_id_column
FROM table_name
WHERE CONTAINS(text_column, 'ABOUT(query phrase)', 1) > 0
ORDER BY score(1) DESC;
It is not possible to perform a Theme search within a specific section. Searching for multi-theme documents is possible, but works in the opposite way from what might be expected. A theme query ABOUT(chemotherapy AND brain) will report all documents about chemotherapy plus all documents about brains. The good news is that documents with both themes will score higher than those with only one theme. Therefore, the top matches reported will likely satisfy the intent of the search.
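The three statement forms above differ only in the text-query argument passed to CONTAINS. The sketch below shows one way that argument could be composed from the form choices; it is an illustration under assumptions, not the demo's actual code. The helper names `contains_expr` and `search_sql` are hypothetical, and table and column names are interpolated directly only to keep the example short.

```perl
use strict;
use warnings;

# Build the text-query argument of CONTAINS from the form choices.
# $type is "keyword" or "theme"; @sections lists section tags, and is
# empty or contains "all" for a Whole Record search.
sub contains_expr {
    my ($query, $type, @sections) = @_;
    return "ABOUT($query)" if $type eq "theme";    # themes ignore sections
    return $query if !@sections || grep { $_ eq "all" } @sections;
    return join(" OR ", map { "($query) WITHIN $_" } @sections);
}

# Assemble the full statement around the text-query argument.
sub search_sql {
    my ($table, $key_col, $text_col, $expr) = @_;
    $expr =~ s/'/''/g;    # escape single quotes for the SQL string literal
    return "SELECT score(1), $key_col FROM $table "
         . "WHERE CONTAINS($text_col, '$expr', 1) > 0 "
         . "ORDER BY score(1) DESC";
}

my $expr = contains_expr("growth factor AND development", "keyword",
                         "ArticleTitle", "AbstractText");
my $sql  = search_sql("medtab", "pmid", "text", $expr);
print "$sql\n";
```

Joining the per-section clauses with OR matches the multi-section behaviour described above, and the "all" check mirrors the rule that Whole Record overrides any individual section selections.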
Query syntax MEDLINE examples:
1. Combine a phrase and a word stem (select the Whole Record search field).
The query transcription factor AND $bind will result in an SQL statement:
SELECT score(1), PMID
FROM MEDTAB
WHERE CONTAINS(text, 'transcription factor AND $bind', 1)>0
ORDER BY score(1) DESC;
Internally, the query $bind will be expanded into binds OR bound OR bind OR binding, matching documents with the phrase transcription factor and any one of the forms of bind.
2. Theme search (select the Theme search type).
The query development will result in an SQL statement:
SELECT score(1), PMID
FROM MEDTAB
WHERE CONTAINS(text, 'ABOUT(development)', 1)>0
ORDER BY score(1) DESC;
3. Combine two phrases and two fields (select Abstract Text and Article Title search fields).
The query growth factor AND development will result in an SQL statement:
SELECT score(1), PMID
FROM MEDTAB
WHERE CONTAINS(text, '(growth factor AND development) WITHIN ArticleTitle OR
(growth factor AND development) WITHIN AbstractText', 1)>0
ORDER BY score(1) DESC;
Oracle Text thesauri are used to automatically expand the query phrases into longer lists of synonyms or hierarchically related phrases. Although the demo will correctly process all forms of Oracle Text thesauri functions (NT, BT, SYN, etc.), only the NT and BT syntax will be correctly processed for query expansion details and matching term highlighting.
Query syntax examples:
1. Expand the NCI thesaurus (NCIT) phrase 'Fibroblast Growth Factor Gene Family' into all direct children.
The query: NT(Fibroblast Growth Factor Gene Family,1,NCIT) will be expanded as follows:
Expanding FIBROBLAST GROWTH FACTOR GENE FAMILY into 13 terms: FGF1, FGF10, FGF2, FGF20, FGF21, FGF3, FGF4, FGF5, FGF6, FGF7, FGF8, FGF9, FIBROBLAST GROWTH FACTOR GENE FAMILY
2. Expand the NCI thesaurus (NCIT) phrase 'FGF2' into broader parent terms.
The query: BT(FGF2,3,NCIT) will be expanded as follows:
Expanding FGF2 into 6 terms: FGF2, FIBROBLAST GROWTH FACTOR GENE FAMILY, GENE, GROWTH FACTOR, PROTEIN, PROTEIN ORGANIZED BY FUNCTION
3. Expand the NCI thesaurus (NCIT) phrase 'Developmental Process' into all narrower terms up to 6 levels deep.
The query NT(Developmental Process,6,NCIT) will be expanded as follows:
Expanding Developmental Process into 39 terms: ANGIOGENESIS, B-CELL DEVELOPMENT, CELL DIFFERENTIATION, CELL FATE CONTROL, CELL LINEAGE, CELL ONTOGENY, CONCEPTION, DEVELOPMENTAL PROCESS, DORSAL-VENTRAL PATTERN FORMATION, EMBRYOGENESIS, EMBRYONIC INDUCTION, ERYTHROPOIESIS, EYE DEVELOPMENT, FERTILIZATION, GAMETOGENESIS, GRANULOPOIESIS, HEMATOPOIESIS, HISTOGENESIS, KERATINOCYTE DIFFERENTIATION, LEUKOPOIESIS, LIMB DEVELOPMENT, LUNG DEVELOPMENT, MEGAKARYOPOIESIS, MORPHOGENESIS, MYELOPOIESIS, MYOGENESIS, NERVOUS SYSTEM DEVELOPMENT, NEURAL DEVELOPMENT, OOGENESIS, ORGANOGENESIS, PATTERN FORMATION, SKELETAL DEVELOPMENT, SPERMATOGENESIS, SPLEEN DEVELOPMENT, STEM CELL DEVELOPMENT, T-CELL DEVELOPMENT, THROMBOPOIESIS, THYMOCYTE DEVELOPMENT, THYMOCYTE SELECTION
4. Combine two thesauri-expanded phrases.
The query NT(Fibroblast Growth Factor Gene Family,1,NCIT) AND NT(Developmental Process,6,NCIT) will identify documents containing at least one FGF gene family term or phrase from example 1, and at least one developmental process phrase from example 3. An example of a match is PMID 15221377, linking gene Fgf8 with embryogenesis.
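The operator syntax used in these examples follows the pattern operator(term, levels, thesaurus name). A small sketch of composing such queries programmatically is given here; `thes_op` is a hypothetical helper, restricted to NT and BT because those are the two operators the demo fully supports for expansion details and highlighting.

```perl
use strict;
use warnings;

# Compose one Oracle Text hierarchy operator: NT (narrower) or BT (broader).
sub thes_op {
    my ($op, $term, $levels, $thes) = @_;
    die "only NT and BT are handled here\n" unless $op =~ /^(?:NT|BT)$/;
    return "$op($term,$levels,$thes)";
}

# Example 4: both expansions must match in the same document.
my $query = join(" AND ",
    thes_op("NT", "Fibroblast Growth Factor Gene Family", 1, "NCIT"),
    thes_op("NT", "Developmental Process", 6, "NCIT"),
);
print "$query\n";
```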
The search results table reports documents matching the query. Documents are ordered by relevance score, with the document ID and a relevant text sample (snippet) reported. An additional document section may be displayed here if specified in the OTU.pm configuration file. For MEDLINE, the document title is displayed above the snippet. The snippet may itself include the title text if a match is present there.
| PMID | Score | Contents |
|---|---|---|
| [1] | 21 | Entry routes of malignant lymphoma into the brain and eyes in a mouse model. ...Entry routes of malignant lymphoma into the brain and eyes in a mouse model. ...lymphoma to the eye and brain. After i.p. inoculation... |
| [2] | 21 | Quantifying mRNA in postmortem human brain: influence of gender, age at death, postmortem interval, brain pH, agonal state and inter-lobe mRNA variance. ...postmortem human brain: influence of gender, age at death, postmortem interval, brain pH, agonal state...postmortem human brain is often... |
The document ID is embedded in a button used to display the full contents of the document. The Document View page contains additional information, including the Knowledge Base Themes and their weights, hotlinked to the Knowledge Base Query window. Individual query phrases are highlighted in the returned titles and in the full record when fetched.
The document Themes method extracts the most significant topics from a document based on the information contained in the Oracle Knowledge Base. The default knowledge base can be enhanced in a specific area of interest with the aid of thesauri and ontologies.
Document group summary is designed to generate a list of shared themes and extract the most representative sections of text from a related group of documents. The relationships between individual documents may be as trivial as matching a specific search query, or more significant, such as belonging to a specific document cluster or scoring high against a specific classification model category.
Two methods are used. Document Themes represents the most important topics extracted from all available text in the documents. Any themes shared by a large number of the documents in the group should have a high weight.
The document Gist method produces a content summary by extracting the most significant and representative text from a large document. In the case of a document group, the gist represents the most important individual sentences or paragraphs extracted from all available text in the documents.
Feature extraction is a method for generating a manageable and informative numerical representation of a group of documents. This representation is stored in a database table and can be used for text and data mining analyses with the tools in Oracle Data Mining. The ODM feature extraction method used is Non-negative Matrix Factorization (NMF).
In this demo, either document tokens or themes are used as document characteristics. The MEDLINE demo also allows the use of MeSH terms. For tokens, the term frequencies are used as values in the attribute data matrix. For themes, the document attribute data matrix consists of theme weights, while for MeSH terms it is a binary matrix. The NMF algorithm is applied to this matrix, generating a new set of representative features for each document.
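To make the two matrix types concrete, the sketch below builds a term-frequency row for token attributes and a binary row for MeSH attributes from toy data. It covers only the construction of the attribute matrix; the NMF step itself is performed by Oracle Data Mining and is not reproduced here. The documents, the simple `\w+` tokenizer and the helper names are all hypothetical.

```perl
use strict;
use warnings;

# Toy documents (hypothetical); real input would come from the text index.
my %docs = (
    d1 => "androgen receptor binds androgen",
    d2 => "receptor signaling in cancer",
);

# Token attributes: the matrix value is the term frequency in the document.
sub term_frequencies {
    my ($text) = @_;
    my %tf;
    $tf{lc $_}++ for $text =~ /\w+/g;
    return \%tf;
}

# MeSH attributes: the matrix value is 1 if the term is assigned to the document.
sub binary_row {
    my (@terms) = @_;
    my %row;
    $row{$_} = 1 for @terms;
    return \%row;
}

my %matrix;
$matrix{$_} = term_frequencies($docs{$_}) for keys %docs;

# Collect the union of attribute names to form the matrix columns.
my %vocab;
for my $row (values %matrix) { $vocab{$_} = 1 for keys %$row; }
my @cols = sort keys %vocab;

# Print one row per document; attributes absent from a document count as 0.
print join("\t", "doc", @cols), "\n";
for my $doc (sort keys %matrix) {
    print join("\t", $doc, map { $matrix{$doc}{$_} || 0 } @cols), "\n";
}
```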
Two NMF-based applications are provided in the demo.
| Feature Extraction for full Data Mining |
|---|
| Extract optimized document-representing features. Data Source:Tokens Themes MeSH terms |
Document clustering is used to obtain some idea of topic diversity within a group of documents for which no other differentiator is known. In practice (not available in the current demo version), documents from an interesting cluster can be used as training documents for a category in document classification (see SVM). This demo offers different methods to cluster documents from a search result (see individual descriptions).
| Document Clustering |
|---|
| No. Clusters: Max Features per Doc: Features: Tokens Themes |
| Max Splits/Node: Tree Depth: Min Cluster Quality (0-1): Features: Tokens Themes |
| No. Features: No. Clusters: Data Source: Tokens Themes MeSH terms |
Cluster hierarchy can be viewed and navigated in the StretchViewer applet. In order to clarify the contextual meaning behind each cluster, the themes and gist of each cluster can be displayed by clicking on the appropriate cluster number button. In addition, each clustering method provides other specific cluster information as described below.
Oracle Text utilizes the ODM k-Means algorithm for document clustering. All documents returned by the query will be clustered based on the complete text found in the article unless specific data-mining sections were specified in the configuration file. For MEDLINE, the title and abstract are used. The user must specify the desired number of clusters and the maximum number of distinct terms per document which will be used as clustering features. The features can be based on a combination of text tokens and document themes.
The results page includes the cluster hierarchy (StretchViewer applet) with each document ID linked to an expanded document view, where cluster definition terms are displayed and highlighted to show their occurrence in the text. A table of information about the size, quality and term-based definition of each cluster is displayed below the hierarchy tree. The individual definition terms are sorted in order of relevance to the cluster. Cluster Gist can be generated for each individual cluster, providing a summary of the most informative text from all documents in the cluster.
1. Document attributes used for clustering can be tokens or themes. Here, only themes were used. The cluster hierarchy is shown in the StretchViewer applet. Red nodes represent intermediate clusters, blue nodes the final document clusters, and green nodes individual documents with their quality-in-cluster scores.
2. The document nodes can be double-clicked to launch a new window with the full document view. This expanded view includes cluster information and highlights the overlap between the cluster definition terms and the document text.
3. A table of full cluster descriptions includes the cluster name (the highest ranking term in the cluster definition), cluster quality score and size (number of documents assigned to the cluster) and the full list of cluster attributes in order of relevance. The gist button fetches the themes and gist for the cluster.
4. The top 20 themes and a 3-sentence gist, based on the text of all cluster documents, are fetched by pressing the "Gist" button for a selected cluster.
5. Cluster definition terms can be compared head-to-head in two heatmap views under the cluster results table. Columns represent clusters, rows the individual cluster definition terms. The left map compares the terms based on their relevance to the cluster definition, with the most relevant terms appearing first. The right map sorts the terms alphabetically. The scale goes from a maximum value for yellow to a minimum value for blue.
Cluster Drilldown allows for more detailed examination of documents from a selected cluster. The possible operations include:
TEXTK is an alternative Oracle Text hierarchical document clustering method. The user specifies the desired depth and breadth of hierarchy, as well as a desired minimum similarity metric controlling when documents are too similar to split into further clusters (a value of 0.2 will produce fewer and bigger clusters, while that of 0.8 may produce many more levels and clusters, but assign fewer documents into each cluster). The results format is the same as for the k-Means clustering method described above.
Documents can be clustered based on how well they are represented by a set of features derived with the Oracle Data Mining Non-negative Matrix Factorization method. As described in section V, either document tokens or themes are used as document characteristics. The MEDLINE demo also allows the use of MeSH terms. For tokens, the term frequencies are used as values in the attribute data matrix. For themes, the document attribute data matrix consists of theme weights, while for MeSH terms it is a binary matrix. The NMF algorithm is applied to this matrix, generating a new set of representative features for each document.
The user specifies the number of features to generate. Each document is then scored against each feature, generating a feature vs. document matrix. Documents are clustered with the ODM k-Means clustering algorithm, generating a specified number of hierarchical clusters. It is actually possible to generate a full hierarchy tree with one document per cluster if the number of requested clusters is equal to the number of documents.
Results are displayed in the form of a table/heatmap, ordering the documents by cluster and cluster score (see figure below). The heatmap palette uses blue for minimum and yellow for maximum. The cluster number button extracts the top 20 themes and a 3-sentence gist for the documents in the cluster. Each document can be viewed via the document ID link. Documents with the highest cluster score are displayed first. The right side of the results table shows the actual NMF features for each document. The heatmap cell color indicates the strength of the representation of the document by a particular feature. The numerical value can be seen in the browser window status bar.
Another possible insight into the contextual significance of the clusters is based on a review of the most commonly shared attributes within and between documents and clusters. A simple table shows the top ten attributes (tokens, themes or MeSH terms) for each cluster with the count and fraction of individual documents containing that attribute.
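The per-cluster table described above amounts to counting, for each attribute, the number of documents in the cluster that contain it, and dividing by the cluster size. A minimal sketch under that reading follows; the input structure, the sample attributes and the `top_attributes` helper are hypothetical, and ties are broken alphabetically as an arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical input: cluster number => { document ID => [attributes] }.
my %clusters = (
    1 => { 101 => ["Humans", "Aged", "Male"],
           102 => ["Humans", "Aged"] },
    2 => { 201 => ["Humans", "Clinical Trials"] },
);

# For one cluster, return [attribute, count, fraction] rows, most shared first.
sub top_attributes {
    my ($docs, $limit) = @_;
    my $n = keys %$docs;    # number of documents in the cluster
    my %count;
    for my $attrs (values %$docs) {
        my %seen;
        $count{$_}++ for grep { !$seen{$_}++ } @$attrs;    # count once per document
    }
    my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    splice(@ranked, $limit) if @ranked > $limit;
    return map { [$_, $count{$_}, $count{$_} / $n] } @ranked;
}

for my $c (sort { $a <=> $b } keys %clusters) {
    printf "cluster %d: %-16s %d (%.2f)\n", $c, @$_
        for top_attributes($clusters{$c}, 10);
}
```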
Next, the top 100 attributes representative of the clusters are shown. For each attribute, a heatmap cell indicates its representation in cluster documents (as a fraction of documents containing that attribute). Once again, yellow indicates maximum, blue minimum. Finally, the importance of each attribute for the NMF features is presented in a second heatmap. The attributes are sorted by cumulative representation within all documents (top 100 terms displayed). The 9-document MEDLINE example below, based on MeSH term attributes, shows that all clusters refer to Alzheimer's Disease and humans. All clusters have some age references, while only clusters 1 and 3 contain references to gender. Only cluster 2 contains references to clinical trials.
Oracle Text utilizes the ODM SVM algorithm for document classification. Classification is a supervised method, requiring classification categories and training documents.
In this demo, documents returned by a search can be used as individual category's training documents or as subjects of classification based on the current set of categories.
There are five SVM operation buttons.