If the distribution of target values is heavily skewed, you may need to create a build data set with an artificially balanced distribution. For example, in fraud detection or in modeling response to a marketing campaign, the positive target value may occur 1% of the time or less. Most data mining algorithms need a larger share of positive examples than such data naturally provides in order to learn the factors that differentiate positive from negative target values. You must therefore sample the source data so that the build data contains an artificially large proportion of positive values along with negative values. The model is then built with a well-defined profile of individuals with positive target values.
Note that even if the build data is a stratified sample, the test data must have the natural distribution. To achieve this, first split the data into Build and Test subsets using the Split transformation, and then create a stratified sample of the Build subset only, for input to the build process.
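The ordering matters: the split happens before any balancing, so the test subset keeps the natural rate of positive cases. The following is a minimal Python sketch of that first step, outside Oracle Data Miner. The function names, the 40% test fraction, and the generated data are assumptions made for illustration; they do not describe how the Split transformation is implemented.

```python
import random

def split_build_test(rows, test_fraction=0.4, seed=0):
    """Randomly split rows into (build, test) without looking at the target.

    Because the target is ignored, both subsets keep roughly the natural
    distribution of target values.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def positive_rate(rows):
    """Fraction of rows whose TARGET is 1."""
    return sum(r["TARGET"] for r in rows) / len(rows)

# Hypothetical source data: 10,000 cases, 1% positive.
rows = [{"ID": i, "TARGET": 1 if i % 100 == 0 else 0} for i in range(10_000)]

build, test = split_build_test(rows)
print(len(build), len(test))
print(positive_rate(test))  # close to the natural 1% rate
```

Only the build subset would then be resampled toward a balanced distribution; the test subset is left untouched.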
NOTE: The Stratified Sample wizard does not do over-sampling.
Follow these steps to build a stratified sample to use as the build data set:
To find the number of positive and negative cases in the build data set, you can always use Tools | SQL Worksheet. For example, if the data set is named MYBUILDDATA and the target is named TARGET, the following query returns the number of cases where TARGET=1:
SELECT COUNT(*) FROM MYBUILDDATA WHERE TARGET=1;
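Run the same query with the negative target value (for example, TARGET=0) to count the negative cases. Make a note of the numbers of positive and negative cases.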
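Both counts can also be obtained in one pass by grouping on the target. The sketch below runs such a query against an in-memory SQLite table from Python so that it is runnable on its own. SQLite and the generated rows are assumptions for illustration only; the table and column names are the ones used in the example above.

```python
import sqlite3

# In-memory stand-in for the build data set (hypothetical contents:
# 100 positive cases and 9,900 negative cases).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MYBUILDDATA (ID INTEGER, TARGET INTEGER)")
conn.executemany(
    "INSERT INTO MYBUILDDATA VALUES (?, ?)",
    [(i, 1 if i < 100 else 0) for i in range(10_000)],
)

# One row per distinct target value, with the number of cases for each.
query = "SELECT TARGET, COUNT(*) FROM MYBUILDDATA GROUP BY TARGET"
counts = dict(conn.execute(query).fetchall())

print("positive cases:", counts[1])  # prints: positive cases: 100
print("negative cases:", counts[0])  # prints: negative cases: 9900
```

The GROUP BY query uses only standard SQL, so the same statement should be usable in SQL Worksheet against the real table.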
The goal is to create a sample with approximately equal numbers of positive and negative values for the target attribute. Click the radio button next to Sample Size and enter a value equal to twice the number of positive target values in the build data set. If you have 100 positive values, you want to create a sample with 100 positive and 100 negative values; that is, a Sample Size of 200. Click Next.
Once the table is created, you can use the appropriate version of Show Summary to display a histogram of the target attribute. Because of the sampling method, the numbers of positive and negative cases in the sample may not be exactly equal.
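The sample-size arithmetic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the wizard's algorithm: the function name `balanced_sample` is hypothetical, and the sketch uses simple random under-sampling of the negative cases, which yields exactly equal counts, whereas the wizard's totals are only approximately equal. Like the wizard, it does no over-sampling: every positive case appears at most once.

```python
import random

def balanced_sample(rows, target="TARGET", seed=0):
    """Return (sample_size, sample) with equal positive and negative counts.

    All positive cases are kept; negatives are randomly under-sampled to
    match. No case is duplicated, so nothing is over-sampled.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[target] == 1]
    negatives = [r for r in rows if r[target] != 1]

    # Sample Size = twice the number of positive cases in the build data.
    sample_size = 2 * len(positives)

    n_neg = min(len(positives), len(negatives))
    sample = positives + rng.sample(negatives, n_neg)
    rng.shuffle(sample)
    return sample_size, sample

# Hypothetical build data: 100 positive cases out of 10,000.
rows = [{"ID": i, "TARGET": 1 if i < 100 else 0} for i in range(10_000)]

sample_size, sample = balanced_sample(rows)
n_pos = sum(r["TARGET"] for r in sample)
print(sample_size, n_pos, len(sample) - n_pos)  # prints: 200 100 100
```

With 100 positive cases, the computed Sample Size is 200, matching the value you would enter in the wizard.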
For an example of creating a stratified sample and using it to build a model, see the Oracle Data Mining Tutorial.
Copyright © 2005, Oracle. All rights reserved.