How to Use Stratified Sampling

If the distribution of target values is heavily skewed, you may need to create a build data set with an artificially balanced distribution. For example, in fraud detection or marketing campaign response, the positive target value may occur 1% of the time or less. Most data mining algorithms need a larger share of positive examples than that to learn the factors that differentiate positive from negative target values. You therefore sample from the source data so that the build data contains an artificially large proportion of positive values along with negative values, which gives the model a well-defined profile of individuals with positive target values.

Note that even if the build data is a stratified sample, the test data must have the natural distribution. To ensure this, first split the data into Build and Test subsets using the Split transformation, then create a stratified sample of the Build data set only, for input to the build process.
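
The order of operations (split first, then balance only the build side) can be illustrated outside the wizard. The following Python sketch is a hypothetical illustration and not what the Split or Stratified Sample transformations actually execute. The function names, the 60/40 split ratio, the 0/1 target coding, and the synthetic rows are all assumptions made for the example.

```python
import random

def split_build_test(rows, build_fraction=0.6, seed=0):
    """Randomly split rows into build and test subsets (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * build_fraction)
    return shuffled[:cut], shuffled[cut:]

def balance_by_undersampling(rows, target_key="TARGET", seed=0):
    """Keep every positive row and an equal-sized random subset of negatives.

    Assumes a 0/1 target in which positives are the minority class.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[target_key] == 1]
    negatives = [r for r in rows if r[target_key] == 0]
    kept = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + kept

def positive_rate(rows, target_key="TARGET"):
    """Fraction of rows whose target is 1 (0.0 for an empty list)."""
    return sum(r[target_key] for r in rows) / len(rows) if rows else 0.0

# Synthetic source data: 1000 rows, 5% positive (an assumption for the demo).
source = [{"ID": i, "TARGET": 1 if i < 50 else 0} for i in range(1000)]

build, test = split_build_test(source)
balanced_build = balance_by_undersampling(build)

print("test positive rate:", positive_rate(test))             # near the natural rate
print("balanced build rate:", positive_rate(balanced_build))  # about one half
```

The test subset is never passed to the balancing step, so it keeps roughly the natural 5% positive rate, while the balanced build subset is about half positive.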

NOTE: The Stratified Sample wizard does not do over-sampling.

Follow these steps to build a stratified sample to use as the build data set:

  1. Use Data | Transform | Split to create build and test data sets. Suppose that the build data set is named MYBUILDDATA.
  2. Determine the distribution of values for the target in the build data set. You can use Show Summary Single-Record or Show Summary Multi-Record to do this, but only if the sample size used to create the histogram of the target equals the number of records in MYBUILDDATA.

    You can always use Tools | SQL Worksheet to find this information. For example, if the data set is named MYBUILDDATA and the target is named TARGET, the following query returns the number of cases where TARGET=1:

    SELECT COUNT(*) FROM MYBUILDDATA WHERE TARGET=1;

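    Run the same query with the negative target value (for example, TARGET=0) to count the negative cases. Make a note of the numbers of positive and negative cases.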

  3. Use Data | Transform | Stratified Sample to create the stratified sample. Select the build data set that you just created using Split. In Step 3 of the wizard, select the target attribute from the Attribute pulldown list.

    The goal is to create a sample with approximately equal numbers of positive and negative values for the target attribute. Click the radio button next to Sample Size and enter a value equal to twice the number of positive target values in the build data set. If you have 100 positive values, you want to create a sample with 100 positive and 100 negative values; that is, a Sample Size of 200. Click Next.

  4. In Step 4 of the Stratified Sample wizard, note the original distribution percentages for use during the model build phase; for example, the values in the Sample Distribution column might be 9% positive cases (1) and 91% negative cases (0). Then edit the values in the Sample Distribution column so that they are both 50.0 (that is, the values are equal and add up to 100.000). Click Next and then Finish to create the table. This new table is the one that you will use to build the model.
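
The arithmetic behind steps 2 through 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stratified_sample_plan` is hypothetical, the target is assumed to be binary with 1 as the positive value, and the 90/910 counts are made up to reproduce the 9%/91% example above.

```python
from collections import Counter

def stratified_sample_plan(targets, positive=1):
    """Derive the wizard inputs for a balanced sample of a binary target.

    Returns the original distribution (percent per value), the Sample Size
    (twice the number of positives), and the rows wanted per target value
    when every value is given an equal share of the sample.
    """
    counts = Counter(targets)
    total = len(targets)
    original_pct = {value: 100.0 * n / total for value, n in counts.items()}
    sample_size = 2 * counts[positive]
    share_pct = 100.0 / len(counts)  # 50.0 for a binary target
    rows_per_value = {value: round(sample_size * share_pct / 100.0)
                      for value in counts}
    return original_pct, sample_size, rows_per_value

# Hypothetical build data: 90 positive and 910 negative cases.
targets = [1] * 90 + [0] * 910
original_pct, sample_size, rows_per_value = stratified_sample_plan(targets)

print(original_pct)    # percentages to note for the model build phase
print(sample_size)     # value to enter as Sample Size
print(rows_per_value)  # rows expected for each target value
```

With these counts, the plan records 9% and 91% as the original distribution, a Sample Size of 180, and 90 rows for each target value.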

Once the table is created, you can use the appropriate version of Show Summary to display a histogram of the target attribute. Note that, because of the sampling method, the counts of positive and negative cases may not be exactly equal.

For an example of creating a stratified sample and using it to build a model, see the Oracle Data Mining Tutorial.