Oracle Data Mining Concepts
10g Release 1 (10.1)

Part Number B10698-01

2 Data for Oracle Data Mining

This chapter describes data requirements and how data must be prepared before you can begin mining it using either of the Oracle Data Mining (ODM) interfaces. The data preparation required depends on the type of model that you plan to build and the characteristics of the data. For example, data that takes on only a small number of values may not require binning.

The following topics are addressed:

  - ODM data, cases, and attributes
  - ODM data requirements
  - Prepared and unprepared data

2.1 ODM Data, Cases, and Attributes

Data used by ODM consists of tables stored in an Oracle database. The rows of a data table are referred to as cases, records, or examples. The columns of the data tables are referred to as attributes (also known as fields); each attribute in a record holds an item of information. The attribute names are constant from record to record; the values in the attributes can vary from record to record. For example, each record may have an attribute labeled "annual income". The value in the annual income attribute can vary from one record to another.

ODM distinguishes two types of attributes: categorical and numerical. Categorical attributes are those whose values belong to a small number of discrete categories or classes; there is no implicit order associated with them. If there are only two possible values, for example yes and no, or male and female, the attribute is said to be binary. If there are more than two possible values, for example, small, medium, large, extra large, the attribute is said to be multiclass.

Numerical attributes are those that take on continuous values, for example, annual income or age. Annual income or age could theoretically be any value from zero to infinity, though of course in practice each usually occupies a more realistic range.

Numerical attributes can be treated as categorical: Annual income, for example, could be divided into three categories: low, medium, high.
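For example, the following SQL sketch shows one way a numerical annual income could be mapped to three categories; the customers table, its columns, and the cut-off values are hypothetical and used only for illustration:

  SELECT cust_id,
         CASE
           WHEN annual_income < 30000  THEN 'low'      -- illustrative boundary
           WHEN annual_income < 100000 THEN 'medium'   -- illustrative boundary
           ELSE 'high'
         END AS income_level                           -- derived categorical attribute
  FROM   customers;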

Certain ODM algorithms also support unstructured attributes. Currently, only one unstructured attribute type, Text, is supported, and at most one attribute of type Text is allowed in ODM data.

2.2 ODM Data Requirements

ODM has requirements on several aspects of input data: data table format, column data type, and attribute type.

2.2.1 ODM Data Table Format

ODM data can be in one of two formats:

  - Single-record case (nontransactional)
  - Multi-record case (transactional)

The Java interface for ODM provides a transformation utility reversePivot() that converts multiple data sources in single-record case format into one table in multi-record case format. Reverse pivoting can be used to represent data sets whose attributes would exceed the 1,000-column limit on Oracle tables by combining multiple tables that have a common key.

2.2.1.1 Single-Record Case Data

In single-record case (nontransactional) format, each case is stored as one row in a table. Single-record case data is not required to have a key column to uniquely identify each record; however, a key is needed to associate cases with the resulting scores for supervised learning.

Note that certain algorithms in the ODM Java interface automatically and internally (that is, in a way invisible to the user) convert all single-record case data to multi-record case data prior to model building. For these algorithms, supplying data that is already in multi-record case format might improve performance compared with supplying it in single-record case format.

2.2.1.2 Multi-Record Case Data in the Java Interface

Oracle tables support at most 1,000 columns. This means that a case can have at most 1,000 attributes. Data that has more than 1,000 attributes is said to be wide. Certain classes of problems, especially problems in Bioinformatics, are associated with wide data.

The Java interface requires that wide data be in multi-record case format.

In multi-record case data format, each case is stored as multiple records (rows) in a table with columns sequence ID, attribute name, and value (these are user-defined names). This format is also referred to as transactional.

The sequence ID is an INTEGER or NUMBER that associates the records that make up a single case in a multi-record case table; the attribute name is a string containing the name of the attribute; and the value is a number representing the value of the attribute. Note that the values in the value column must be of type NUMBER; non-numeric data must be converted to numeric data, usually by binning or explosion.
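As an illustration only (not the reversePivot() utility itself), the following SQL sketch presents a hypothetical single-record case table customers(cust_id, age, annual_income) in multi-record case format:

  CREATE OR REPLACE VIEW customers_txn AS
  SELECT cust_id AS sequence_id,
         'AGE'   AS attribute_name,
         age     AS value
  FROM   customers
  UNION ALL
  SELECT cust_id,
         'ANNUAL_INCOME',
         annual_income
  FROM   customers;    -- one row per case per attribute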

2.2.1.3 Wide Data in DBMS_DATA_MINING

In the domains of bioinformatics, text mining, and other specialized areas, the data is wide and shallow: relatively few cases, but one thousand or more mining attributes.

Wide data can be represented in a multi-record case format, where attribute/value pairs are grouped into collections (nested tables) associated with a given case ID. Each row in the multi-record collection represents an attribute name (and its corresponding value in another column in the nested table).

DBMS_DATA_MINING includes fixed collection types for defining columns.

It is most efficient to represent multi-record case data as a view.

2.2.1.3.1 Fixed Collection Types

The fixed collection types DM_Nested_Numericals and DM_Nested_Categoricals are used to define columns that represent collections of numerical attributes and categorical attributes, respectively.

You can intersperse columns of types DM_Nested_Numericals and DM_Nested_Categoricals with scalar columns that represent individual attributes in a table or view.

For a given case identifier, attribute names must be unique across all the collections and individual columns. The two fixed collection types enforce this requirement. The two collection types are based on the assumption that mining attributes of the same type (numerical versus categorical) are generally grouped together, just as a fact table contains values that logically correspond to the same entity.
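For example, assuming a hypothetical table gene_expr(case_id, gene, expr_value) that stores one attribute/value pair per row, a view with a DM_Nested_Numericals collection column might be sketched as follows:

  CREATE OR REPLACE VIEW gene_expr_v AS
  SELECT t.case_id,
         CAST(MULTISET(SELECT g.gene, g.expr_value     -- attribute name, value pairs
                       FROM   gene_expr g
                       WHERE  g.case_id = t.case_id)
              AS DM_Nested_Numericals) AS gene_attributes
  FROM   (SELECT DISTINCT case_id FROM gene_expr) t;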

2.2.1.3.2 Views for Multi-Record Case Format

For maximum efficiency, you should represent multi-record case data using object views, and use the view as input to BUILD and APPLY operations. Using views for multi-record case data has two main advantages:

Figure 2-1 Single-Record Case and Multi-Record Case Data Format


2.2.2 Column Data Types Supported by ODM

ODM does not support all the data types that Oracle supports. ODM attributes must have one of the data types listed in Table 2-1.

The supported attribute data types have a default attribute type (categorical or numerical); Table 2-1 lists the default attribute type for each of these data types.

2.2.2.1 Unstructured Data in ODM

Some ODM algorithms (Support Vector Machine, Non-Negative Matrix Factorization, Association, and the implementation of k-means Clustering in DBMS_DATA_MINING) permit one column to be unstructured of type Text. For information about text mining, see Chapter 8.

2.2.2.2 Dates in ODM

ODM does not support the DATE data type. You must convert items of type DATE to either VARCHAR2 or NUMBER, depending on the meaning of the item.

If, for example, the date serves as a timestamp indicating when a transaction occurred, converting the date to VARCHAR2 makes it a categorical attribute with unique values, one per record. Such columns are known as "identifiers" and are not useful in model building. However, if the date values are coarse and there are significantly fewer distinct values than records, this mapping may be acceptable.

One way to convert a date to a number is as follows: select a starting date and subtract the starting date from each date value. This produces a NUMBER column, which can be treated as a numerical attribute and then binned as necessary.
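A minimal sketch of this approach, assuming a hypothetical purchases table with a purchase_date column of type DATE:

  SELECT purchase_id,
         (purchase_date - DATE '1998-01-01') AS days_since_start  -- date arithmetic yields a NUMBER of days
  FROM   purchases;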

2.2.3 Attribute Type for Oracle Data Mining

Oracle Data Mining handles categorical and numerical attributes; it imputes the attribute type and, for the Java interface, the data type of the attribute as described in Table 2-1.

Table 2-1 Interpretation of Oracle Database Data Types by ODM
Oracle Type        Default ODM Attribute Type   Default Java Data Type (Java interface only)
-----------------  ---------------------------  ---------------------------------------------
VARCHAR2           categorical                  String
CHAR (length > 1)  categorical                  String
NUMBER             numerical                    Float
NUMBER (0 scale)   numerical                    Integer
CLOB               Text                         Unstructured
LOB                Text                         Unstructured
BLOB               Text                         Unstructured
BFILE              Text                         Unstructured
XMLTYPE            Text                         Unstructured
URITYPE            Text                         Unstructured

In situations where you have numbers that are treated as categorical data, you must typecast such attribute values using the TO_CHAR() operator and populate them into a VARCHAR2 or CHAR column representing the mining attribute.

In situations where you have numeric attribute values stored in a CHAR or VARCHAR2 column, you must typecast those attribute values using the TO_NUMBER() operator and store them in a NUMBER column.

If persisting these transformed values in another table is not a viable option, you can create a view with these conversions in place, and provide the view name to represent the training data input for model building.
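A minimal sketch of such a view, assuming a hypothetical customers_raw table in which postal_code is stored as NUMBER (but should be mined as categorical) and age is stored as VARCHAR2 (but should be mined as numerical):

  CREATE OR REPLACE VIEW customers_prep AS
  SELECT cust_id,
         TO_CHAR(postal_code) AS postal_code,   -- number treated as a categorical attribute
         TO_NUMBER(age)       AS age            -- character column holding numeric values
  FROM   customers_raw;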

Values of a categorical attribute do not have any meaningful order; values of a numerical attribute do. This does not mean that the values of a categorical attribute cannot be ordered, but rather that the order is not used by the application. For example, since U.S. postal codes are numbers, they can be ordered; however, their order is not necessarily meaningful to the application, and they can therefore be considered categorical.

2.2.3.1 Target Attribute

Classification and regression algorithms require a target attribute. A DBMS_DATA_MINING predictive model can predict only a single target attribute. The target attribute for all classification algorithms can be numerical or categorical. SVM regression supports only numerical target attributes.

2.2.4 Data Storage Issues

If there are a few hundred mining attributes and your application requires the attributes to be represented as columns in the same row of the table, data storage must be carefully designed. For a table with many columns, the key question to consider is the (average) row length, not the number of columns. Having more than 255 columns in a table built with a smaller block size typically results in intra-block chaining. Oracle stores multiple row pieces in the same block, and the overhead to maintain the column information is minimal as long as all row pieces fit in a single data block. If the rows do not fit in a single data block, consider using a larger database block size (or multiple block sizes in the same database). For more details, consult Oracle Database Concepts and Oracle Database Performance Tuning Guide.
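As an illustration only, a wide table could be placed in a tablespace that uses a larger block size; the tablespace and data file names are hypothetical, and the corresponding buffer cache (DB_16K_CACHE_SIZE) must already be configured:

  CREATE TABLESPACE odm_wide_ts
    DATAFILE 'odm_wide01.dbf' SIZE 500M
    BLOCKSIZE 16K;   -- larger block size so that all row pieces fit in one block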

2.2.5 Missing Values in ODM

Data tables often contain missing values.

2.2.5.1 Missing Values and Null Values in ODM

The following algorithms assume that null values indicate missing values (and not indicators of sparse data): NB, ABN, AI, k-Means (Java interface), and O-Cluster.

2.2.5.2 Missing Values Handling

ODM is robust in handling missing values and does not require users to treat missing values in any special way. ODM ignores missing values and uses the non-missing data in a case.

In some situations you must be careful; for example, in transactional format you must distinguish between a "0" that has an assigned meaning and an empty cell.


Note:

Do not confuse missing values with sparse data.


2.2.6 Sparse Data in Oracle Data Mining

Data is said to be sparse if only a small fraction (no more than 20%, often 3% or less) of the attributes are non-zero or non-null for any given case. Sparse data occurs, for example, in market basket problems. In a grocery store, there might be 10,000 products in the store, and the average size of a basket (the collection of distinct items that a customer purchases in a typical transaction) is 50 or fewer products. In this example, a transaction (case or record) has at most 50 out of 10,000 attributes that are not null. This implies that the fraction of non-zero attributes in the table (or the density) is 50/10,000, or 0.5%. This density is typical for market basket and text processing problems.

Association models are designed to process sparse data; indeed, if the data is not sparse, the algorithm may require a large amount of temporary space and may not be able to build a model.

Different algorithms make different assumptions about what indicates sparse data as follows:

2.2.7 Outliers and Oracle Data Mining

An outlier is a value that is far outside the normal range in a data set, typically a value that is several standard deviations from the mean. The presence of outliers can have a significant impact on ODM models.

Outliers affect ODM during data preprocessing, whether it is performed by the user or automatically during model build.

Outliers affect the different algorithms as follows:

2.3 Prepared and Unprepared Data

Data is said to be prepared or unprepared, depending on whether certain data transformations required by a data mining algorithm were performed by the user.

For the Java interface, data can be either unprepared (the default) or prepared; data for DBMS_DATA_MINING must be prepared.

2.3.1 Data Preparation for the ODM Java Interface

The ODM Java interface assumes data is unprepared and automatically performs the transformations necessary to prepare the data. This means different things for different algorithms. For most ODM algorithms, prepared data is binned data; unbinned data is said to be unprepared. See Section 2.3.3 for information about binning in the Java interface.

For the SVM and NMF algorithms, prepared data is normalized data. See Section 2.3.4 for information about normalization.

The user can specify the data's status (prepared or unprepared) in the DataPreparationStatus setting for each attribute. For example, if the user has already binned the data for an attribute, the status for that attribute should be set to prepared so that ODM does not bin the data for that attribute again. If the user wants ODM to do the binning for all attributes, the status should be set to unprepared for all attributes.

Support Vector Machine models require especially careful data preparation. For more information, see Section 3.1.6.1.

2.3.2 Data Preparation for DBMS_DATA_MINING

The PL/SQL interface assumes that all data is prepared. The user must perform any required data preparation.

2.3.3 Binning (Discretization) in Data Mining

Some ODM algorithms may benefit from binning (discretizing) both numeric and categorical data; these include the Naive Bayes, Adaptive Bayes Network, Clustering, Attribute Importance, and Association Rules algorithms.

Binning means grouping related values together, thus reducing the number of distinct values for an attribute. Having fewer distinct values typically leads to a more compact model that builds faster, but it can also lead to some loss in accuracy.
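As an illustration of one simple binning approach (equi-width binning using the built-in WIDTH_BUCKET function; the customers table and the range 0 to 200,000 are hypothetical):

  SELECT cust_id,
         WIDTH_BUCKET(annual_income, 0, 200000, 10) AS income_bin  -- 10 equal-width bins over 0..200,000
  FROM   customers;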

2.3.3.1 Methods for Computing Bin Boundaries

ODM utilities provide three methods for computing bin boundaries from the data:

Figure 2-2 Winsorizing

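As a generic illustration of winsorizing (not the ODM binning utilities themselves), the following sketch caps a hypothetical annual_income column at the 5th and 95th percentiles before further processing:

  SELECT cust_id,
         LEAST(GREATEST(annual_income, p05), p95) AS income_winsorized   -- cap extreme tail values
  FROM  (SELECT cust_id,
                annual_income,
                PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY annual_income) OVER () AS p05,
                PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY annual_income) OVER () AS p95
         FROM   customers);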

2.3.4 Normalization in Oracle Data Mining

Normalizing converts individual attribute values in such a way that all attribute values lie in the same range. Normally, values are converted to be in the range 0.0 to 1.0 or the range -1 to +1. Normalization ensures that attributes do not receive artificial weighting caused by differences in the ranges that they span.

Support Vector Machine (SVM) and Non-Negative Matrix Factorization (NMF) may benefit from normalization.
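A minimal sketch of min-max normalization to the range 0.0 to 1.0, assuming a hypothetical customers table:

  SELECT cust_id,
         (annual_income - MIN(annual_income) OVER ()) /
         NULLIF(MAX(annual_income) OVER () - MIN(annual_income) OVER (), 0) AS income_norm  -- NULLIF avoids division by zero
  FROM   customers;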