Oracle® Warehouse Builder User's Guide
10g Release 2 (10.2.0.2)

Part Number B28223-05

20 Ensuring Data Quality

Oracle Warehouse Builder provides a set of features that enable you to ensure the quality of data that is moved from source systems to your data warehouse. Data profiling is a feature that enables you to discover inconsistencies and anomalies in your source data and then correct them. Before you transform data, you can define data rules and apply data profiling and data auditing techniques.

This chapter contains the following topics:

Steps to Perform Data Profiling

The section "How to Perform Data Profiling" describes the steps in the data profiling process. This section lists the steps and provides references to the sections that describe how to perform each step.

Use the following steps to perform data profiling:

  1. Load the metadata

    See "Import or Select the Metadata".

  2. Create a data profile

    See "Using Data Profiles".

  3. Profile the data

    See "Profiling the Data".

  4. View the profiling results

    See "Viewing the Results".

  5. Derive data rules based on the results of data profiling

    You can create data rules based on the data profiling results. For more information, see "Deriving Data Rules".

  6. Generate corrections

    See "Correcting Schemas and Cleansing Data".

  7. Define and edit data rules manually

    See "Using Data Rules".

  8. Generate, deploy and execute

    See "Generate, Deploy, and Execute".

Using Data Profiles

Data profiling is a process that enables you to analyze the content and structure of your data to identify inconsistencies, anomalies, and redundancies. To begin the process of data profiling, you must first create a data profile using the Design Center. You can then profile the objects contained in the data profile and create correction tables and mappings.

This section contains the following topics:

Creating Data Profiles

Use the following steps to create a data profile:

  1. From the Warehouse Builder Project Explorer, expand the project node in which you want to create the data profile.

  2. Right-click Data Profiles and select New.

    The Create Data Profile Wizard opens and displays the Welcome page. Click Next to proceed. The wizard guides you through the following steps:


Naming the Data Profile

Use the Name field to specify a name for your data profile. The name must be unique within a project. The Name tab of the Edit Data Profile dialog enables you to rename the data profile. You can select the name and enter the new name.

Use the Description field to provide an optional description for the data profile. On the Name tab, you can modify the description by selecting the current description and typing the new description.

Selecting Objects

Use the Select Objects page to specify the objects that you want to profile. The Select Objects tab of the Edit Data Profile dialog enables you to modify object selections that you made.

The Available section displays a list of objects available for profiling. Select the objects you want to profile and use the shuttle buttons to move them to the Selected list. You can select multiple objects by holding down the Ctrl key and selecting the objects. The Selected list displays the objects selected for profiling. You can modify this list using the shuttle buttons.

The objects available for profiling include tables, external tables, views, materialized views, dimensions, and cubes. The objects are grouped by module. When you select a dimensional object in the Available section, Warehouse Builder displays a warning informing you that the relational objects that are bound to these dimensional objects will also be added to the profile. Click Yes to proceed.

Note:

You cannot profile a source table that contains complex data types if the source module is located on a different database instance from the data profile.

Reviewing the Summary

The Summary page displays the objects that you have selected to profile. Review the information on this page. Click Back to make changes or click Finish to create the data profile.

The data profile is created and added to the Data Profiles node in the navigation tree. If this is the first data profile you have created in the current project, the Connection Information dialog for the selected control center is displayed. Enter the control center manager connection information and click OK. The Data Profile Editor opens.

Editing Data Profiles

Once you create a data profile, you can use the Data Profile Editor to modify its definition. You can also add data objects to an existing data profile. To add objects, you can use either the menu bar options or the Select Objects tab of the Edit Data Profile dialog. For information about using the menu bar options, see "Adding Data Objects to a Data Profile". The following instructions describe how to access the Edit Data Profile dialog.

To edit a data profile:

  1. In the Project Explorer, right-click the data profile and select Open Editor.

    The Data Profile Editor is displayed.

  2. From the Edit menu, select Properties.

    The Edit Data Profile dialog is displayed.

  3. Use the following tabs on this dialog to modify the definition of the data profile:

Data Locations Tab

The Data Locations tab specifies the location that is used as the data profile workspace. This tab contains two sections: Available Locations and Selected Locations.

The Available Locations section displays all the Oracle locations in the current repository. The Selected Locations section displays the locations that are associated with the data profile. This section can contain more than one location. The location to which the data profile is deployed is the location for which the New Configuration Default option is selected.

To modify the data location associated with a data profile, use the shuttle arrow to move the new location to the Selected Locations section and select the New Configuration Default option for this location. When you change the location for a data profile, all objects that were created in the data profile workspace schema as a result of profiling this data profile are deleted.

You can also create a new location and associate it with the data profile. Click New below the Selected Locations section to display the Edit Database Location dialog. Specify the details of the new location and click OK. The new location is added to the Selected Locations section.

You cannot change the location to which a data profile is deployed by configuring the data profile. Note that this behavior differs from that of other objects. For example, consider Oracle modules. The Selected Locations section on the Data Locations tab of the Edit Module dialog contains a list of possible deployment locations. The objects in the module are deployed to the location that is set using the Location configuration parameter of the Oracle module.

Adding Data Objects to a Data Profile

To add data objects to a data profile, use the following steps:

  1. Right-click the data profile in the Project Explorer and select Open Editor.

    The Data Profile Editor is displayed.

  2. From the Profile menu, select Add Objects.

    The Add Profile Tables dialog is displayed. Use this dialog to add data objects to the data profile.

Add Profile Tables Dialog

The Add Profile Tables dialog contains two sections: Available and Selected. The Available section displays the data objects that are available for profiling. To add an object displayed in this list to the data profile, select the object and move it to the Selected list using the shuttle arrows. Select multiple objects by holding down the Ctrl key and selecting the objects.

Click OK. The selected data objects are added to the data profile. You can see the objects on the Profile Objects tab of the Object Tree.

Using the Data Profile Editor

The Data Profile Editor provides a single access point for managing and viewing data profile information as well as correcting metadata and data. It combines the functionality of a data profiler, a target schema generator, and a data correction generator. As a data profiler, it enables you to perform attribute analysis and structural analysis of selected objects. As a target schema generator, it enables you to generate a target schema based on the profile analysis and source table rules. Finally, as a data correction generator, it enables you to generate mappings and transformations to provide data correction.

Figure 20-1 displays the Data Profile Editor.

Figure 20-1 Data Profile Editor


The Data Profile Editor consists of the following:

Menu Bar

The menu bar provides access to the Data Profile Editor commands through the following menus:

Profile The Profile menu contains the following menu options:

  • Close: Closes the Data Profile Editor.

  • Save All: Saves the changes made in the Data Profile Editor.

  • Export: Exports the data displayed in the Profile Results Canvas and the Data Drill Panel. Warehouse Builder can store the exported data in .csv or .html files. The number of files used to store the exported data depends on the amount of profiling data. For example, if you specify the name of the export file as prf_result.html, the data is stored in files that begin with this name and continue with prf_result2.html, prf_result3.html, and so on.

  • Add: Adds objects to the data profile.

  • Profile: Profiles the objects contained in the data profile.

  • Derive Data Rule: Derives a data rule using the result of data profiling. This option is enabled only when you select a cell that contains a hyperlink in the Profile Results Canvas.

  • Remove Data Rule: Deletes the selected data rule.

  • Create Correction: Creates correction objects and mappings based on the results of data profiling.

  • Print: Prints the contents of the Data Profile Editor window.

Edit The Edit menu contains the following menu items:

  • Delete: Deletes the object selected in the Data Profile Editor.

  • Synchronize: Synchronizes the data profile metadata with the Warehouse Builder repository.

  • Properties: Displays the properties of the data profile.

Window Use the Window menu to display or hide the various panels of the Data Profile Editor. All the options act as toggle switches. For example, to hide the Data Rule panel, select Data Rule Panel from the Window menu. You can then display the Data Rule panel by selecting Data Rule Panel from the Window menu.

Help Use the Help menu options to access the online Help for the product. To display the context-sensitive Help for the current window, use the Topic option. This menu also contains options to access the different sections of the Oracle Technology Network.

Toolbars

The Data Profile Editor allows you the flexibility of using either the menu or toolbars to achieve your goals. The toolbar provides icons for commonly used commands that are also part of the menu bar. You can perform functions such as adding objects to a data profile, profiling data, deriving data rules, and creating corrections using the toolbar icons.

Figure 20-2 displays the various toolbars in the Data Profile Editor. When you first open the Data Profile Editor, all the toolbars are displayed in a single row. You can change their positions by holding down the left mouse button on the gray dots to the left of the toolbar, dragging, and dropping in the desired location.

Figure 20-2 Data Profile Editor Toolbars


Object Tree

The object tree can be used to navigate through the objects included in the data profile. Figure 20-3 displays the object tree that contains two tabs: Profile Objects and Corrected Modules.

Figure 20-3 Profile Objects


The Profile Objects tab contains a navigation tree that includes all of the objects selected for profiling. The tree expands down to the attribute level because you can change the profiling properties of each attribute and ensure that each attribute has the correct profiling settings. When you select an object in the Profile Objects tab, the profile details for that object are displayed in the Profile Results Canvas. You can also open the Data Object Editor for a specific object by double-clicking the object.

The Corrected Modules tab lists the objects created as a result of performing data correction actions. This tab contains objects only if you have performed data correction actions. Data correction is the process of creating corrected source data objects based on the results of the data profiling. For more information about creating corrected schemas, see "Correcting Schemas and Cleansing Data".

Property Inspector

The Property Inspector enables you to define the configuration parameters for the data profiling operation. Many types of data profiling rely on the configuration parameters to set the limits and the assumptions of analysis.

For example, if the parameter Domain Discovery Max Distinct Values Count is set to 25 and more than 25 distinct values are found in a column, no domains are considered for that column. Each type of analysis can also be turned off for performance purposes. If you run profiling without changing these properties, the default values are used.

You also use the Property Inspector to specify the types of data profiling that you want to perform. If you do not want to perform a particular type of profiling, turn it off using the corresponding configuration parameter. By default, Warehouse Builder performs the following types of profiling: aggregation, data type discovery, pattern discovery, domain discovery, unique key discovery, functional dependency discovery, and row relationship discovery. Note that you cannot turn off aggregation profiling.

For example, suppose your data profile contains two tables called COUNTRIES and JOBS, and for the COUNTRIES table you want to perform aggregation profiling, data type profiling, domain discovery, pattern discovery, and data rule profiling. Use the Property Inspector to select the following configuration options: Enable Data Type Discovery, Enable Domain Discovery, Enable Pattern Discovery, and Enable Data Rule Profiling for Table.

For details about the configuration parameters and their settings, see "Configuration Parameters for Data Profiles".

Monitor Panel

The Monitor panel displays details about currently running and past profiling events. The details about each profiling event include the profile event name, profile job ID, status, timestamp, and the repository owner who executed the profiling. Figure 20-4 displays the Monitor panel.

Figure 20-4 Monitoring Profile Status


You can view more details about a job by double-clicking it. The details include any problems that were encountered and how much time each profiling job took.

Profile Results Canvas

The Profile Results Canvas is where the results of the profiling can be viewed. Figure 20-5 displays the Profile Results Canvas panel. You can sort the results by clicking the heading of the column.

Figure 20-5 Profile Results


Click the following tabs for detailed profile analysis of different aspects of the object:

Click the hyperlinks to drill into the data. When you drill into data, the data results appear in the Data Drill panel. Depending on the tab you have selected, the available format sub-tabs may change. The sub-tabs that are available for all analysis tabs are as follows: Tabular, which provides the results in a grid or table format; Graphical, which provides an array of charts and graphs. You can also derive rules from the Profile Results Canvas.

Click any column header to toggle between ascending and descending sort order. The Graphical sub-tab displays a graphical representation of the different measures.

Data Drill Panel

Use the Data Drill panel to view questionable data from a selected object in the Profile Results canvas. Data drill-down is available for all tabs in the profile results canvas. Figure 20-6 displays the Data Drill panel for a profiled object.

Figure 20-6 Data Drill Panel


The first sentence on this panel provides information about the attribute to which the displayed drill details belong. When you drill into data, the results appear on the left side of the Data Drill panel. Use the Distinct Values list to filter the data displayed on the left side of the panel. You can further drill into this data to view the entire records. Click any row on the left side of the panel to display the selected records on the right side of the Data Drill panel.

Note that the left side displays aggregated row counts. In the example shown, there are 2 rows with a salary of 2400, and although the entire table contains 109 rows, the left side shows only 57 rows, one for each distinct value found.
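Conceptually, the left side of the Data Drill panel is a simple group-by over the selected attribute. The following Python sketch illustrates the idea; the rows, column names, and selected value are invented for illustration and are not taken from the product.

from collections import Counter

# Illustrative sample rows; in the product these come from the profiling workspace.
rows = [
    {"employee_id": 101, "salary": 2400},
    {"employee_id": 102, "salary": 2400},
    {"employee_id": 103, "salary": 24000},
]

# Left side of the panel: each distinct value with its aggregated row count.
value_counts = Counter(r["salary"] for r in rows)
for value, count in sorted(value_counts.items()):
    print(value, count)                      # salary 2400 appears in 2 rows

# Right side of the panel: the full records behind one selected distinct value.
selected_value = 2400
print([r for r in rows if r["salary"] == selected_value])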

Data Rule Panel

On the Data Rule panel, you can view the data rules that have been created as a result of the profiling. Figure 20-7 displays the Data Rule panel.

Figure 20-7 Data Rules Panel


The left side of this panel displays the data rules created for the object that is selected in the Profile Objects tab of the object tree as a result of the data profiling. When you include a table in the data profile, any data rules or check constraints on the table are automatically added. The data rules are added to the Derived_Data_Rules node under the Data Rules node of the project that contains the data profile.

You can add rules to this panel in the following ways:

  • Some of the tabs on the Profile Results Canvas contain a Derive Data Rule button. This button is enabled when you select a hyperlink on the tab. Click Derive Data Rule to derive a data rule for the selected result.

    If the data rule has already been derived, the Remove Rule button is enabled.

  • Click Apply Rule in the Data Rules panel. The Apply Data Rule wizard is displayed. For more information about this wizard, see "Deriving Data Rules". Use this wizard to select an existing data rule and apply it to the table selected in the object tree.

You can disable any of the data rules in the Data Rule panel by unchecking the check box to the left of the data rule. To delete a data rule, right-click the gray cell to the left of the data rule and select Delete.

The details displayed for an applied data rule are as follows:

Name: The name of the applied data rule.

Rule: Click this field to display the name of the module that contains the data rule and the name of the data rule. Click the Ellipsis button on this field to launch the Edit Data Rule dialog that enables you to edit the data rule.

Rule Type: The type of data rule. This is not editable.

Description: The description of the applied data rule.

Configuring Data Profiles

You can configure a data profile by setting its configuration parameters in the Property Inspector of the Data Profile Editor. For more information about the properties that you can configure, see "Configuration Parameters for Data Profiles".

You can set configuration parameters for a data profile at any of the following levels:

  • For all objects in the data profile

    To set configuration parameters for all objects contained in the data profile, select the data profile in the Profile Objects tab of the Object Tree. In the Property Inspector, set the configuration parameters to the required values. These parameters are set for all the objects in the data profile.

  • For a single object in the data profile

    To set configuration parameters for a single object within a data profile, select the object in the Profile Objects tab of the Object Tree. In the Property Inspector, set the configuration parameters.

  • For an attribute in an object

    To set configuration parameters for an attribute within an object, in the Profile Objects tab of the Object Tree, expand the object node to display the attributes it contains. For example, you can expand a table node to display its columns. Select the attribute for which you want to specify configuration parameters. In the Property Inspector, set the configuration parameters.

Configuration Parameters for Data Profiles

The configuration parameters that you can set for data profiles are categorized as follows:

Load Configuration

This category contains the following parameters:

  • Enable Data Type Discovery: Set this parameter to true to enable data type discovery for the selected table.

  • Enable Common Format Discovery: Set this parameter to true to enable common format discovery for the selected table.

  • Copy Data into Workspace: Set this parameter to true to enable copying of data from the source to the profile workspace.

  • Random Sample Rate: This value represents the percent of total rows that will be randomly selected during loading.

  • Sample Set Filter: This represents the WHERE clause that will be applied on the source when loading data into the profile workspace. Click the Ellipsis button on this field to launch the Expression Builder. Use the Expression Builder to define your WHERE clause.

  • Null Value Representation: This value will be considered as the null value during profiling. You must enclose the value in single quotation marks. The default value is null, which is considered as a database null.

Aggregation Configuration

This category consists of one parameter called Not Null Recommendation Percentage. If the percentage of null values in a column is less than this threshold percent, then that column will be discovered as a possible Not Null column.

Pattern Discovery Configuration

This category contains the following parameters:

  • Enable Pattern Discovery: Set this to true to enable pattern discovery.

  • Maximum Number of Patterns: This represents the maximum number of patterns that the profiler reports for an attribute. For example, if you set this parameter to 10, the profiler reports the 10 most frequent patterns for the attribute.

Domain Discovery Configuration

This category contains the following parameters:

  • Enable Domain Discovery: Set this to true to enable domain discovery.

  • Domain Discovery Max Distinct Values Count: The maximum number of distinct values in a column in order for that column to be discovered as possibly being defined by a domain. Domain discovery of a column occurs if the number of distinct values in that column is at or below the Max Distinct Values Count property, and, the number of distinct values as a percentage of total rows is at or below the Max Distinct Values Percent property.

  • Domain Discovery Max Distinct Values Percent: The maximum number of distinct values in a column, expressed as a percentage of the total number of rows in the table, in order for that column to be discovered as possibly being defined by a domain. Domain Discovery of a column occurs if the number of distinct values in that column is at or below the Max Distinct Values Count property, and, the number of distinct values as a percentage of total rows is at or below the Max Distinct Values Percent property.

  • Domain Value Compliance Min Rows Count: The minimum number of rows for the given distinct value in order for that distinct value to be considered as compliant with the domain. Domain Value Compliance for a value occurs if the number of rows with that value is at or above the Min Rows Count property, and, the number of rows with that value as a percentage of total rows is at or above the Min Rows Percent property.

  • Domain Value Compliance Min Rows Percent: The minimum number of rows, expressed as a percentage of the total number of rows, for the given distinct value in order for that distinct value to be considered as compliant with the domain. Domain Value Compliance for a value occurs if the number of rows with that value is at or above the Min Rows Count property, and, the number of rows with that value as a percentage of total rows is at or above the Min Rows Percent property.
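The interaction of these four thresholds can be pictured in a short sketch. The following Python code is only an illustration of the two checks described above, assuming the column values are available as a list; the function name and the default values are invented for this small example and do not reflect the product's internal implementation.

from collections import Counter

def discover_domain(values, max_distinct_count=25, max_distinct_percent=50.0,
                    min_rows_count=2, min_rows_percent=20.0):
    # Phase 1: decide whether the column is a domain candidate at all.
    total = len(values)
    counts = Counter(values)
    distinct = len(counts)
    if distinct > max_distinct_count:
        return None                                   # too many distinct values
    if 100.0 * distinct / total > max_distinct_percent:
        return None                                   # too many relative to the row count
    # Phase 2: keep only the distinct values frequent enough to be compliant.
    return sorted(value for value, n in counts.items()
                  if n >= min_rows_count and 100.0 * n / total >= min_rows_percent)

print(discover_domain(["M", "F", "M", "M", "F", "X"]))   # ['F', 'M'] -- 'X' is too rare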

Relationship Attribute Count Configuration

This category contains one parameter called Maximum Attribute Count. This is the maximum number of attributes for unique key, foreign key, and functional dependency profiling.

Unique Key Discovery Configuration

This category contains the following parameters:

  • Enable Unique Key Discovery: Set this parameter to true to enable unique key discovery.

  • Minimum Uniqueness Percentage: This is the minimum percentage of rows that need to satisfy a unique key relationship.
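As a rough illustration of how a uniqueness percentage could be computed for a candidate key, the sketch below counts the rows whose key value occurs exactly once; the function is illustrative and is not the product's algorithm.

from collections import Counter

def uniqueness_percent(key_values):
    counts = Counter(key_values)
    unique_rows = sum(n for n in counts.values() if n == 1)   # rows whose value occurs once
    return 100.0 * unique_rows / len(key_values)

# A column (or column combination) whose figure is at or above the
# Minimum Uniqueness Percentage threshold is reported as a unique key candidate.
print(uniqueness_percent(["555-0101", "555-0102", "555-0102", "555-0103"]))   # 50.0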

Functional Dependency Discovery Configuration

This category contains the following parameters:

  • Enable Functional Dependency Discovery: Set this parameter to true to enable functional dependency discovery.

  • Minimum Functional Dependency Percentage: This is the minimum percentage of rows that need to satisfy a functional dependency relationship.
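The sketch below shows one way a functional dependency compliance percentage could be computed: for each determinant value, the most frequent dependent value is treated as the agreed value, and rows carrying any other dependent value count as defects. This is an illustration only, not the product's algorithm, and the sample rows are invented.

from collections import Counter, defaultdict

def fd_compliance_percent(rows, determinant, dependent):
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[determinant]][row[dependent]] += 1
    # Rows carrying the most frequent dependent value per determinant are compliant.
    compliant = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return 100.0 * compliant / len(rows)

rows = [
    {"department_id": 10, "commission_pct": None},
    {"department_id": 10, "commission_pct": None},
    {"department_id": 80, "commission_pct": 0.2},
    {"department_id": 80, "commission_pct": 0.3},
]
print(fd_compliance_percent(rows, "department_id", "commission_pct"))   # 75.0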

Row Relationship Discovery Configuration

This category contains the following parameters:

  • Enable Relationship Discovery: Set this parameter to true to enable foreign key discovery.

  • Minimum Relationship Percentage: This is the minimum percentage of rows that need to satisfy a foreign key relationship.

Redundant Column Discovery Configuration

This category contains the following parameters:

  • Enable Redundant Columns Discovery: Set this parameter to true to enable redundant column discovery with respect to a foreign key-unique key pair.

  • Minimum Redundancy Percentage: This is the minimum percentage of rows that must be redundant for a column to be reported as redundant.

Data Rule Profiling Configuration

This category contains one parameter called Enable Data Rule Profiling for Table. Set this parameter to true to enable data rule profiling for the selected table. This setting is only applicable for a table, not for an individual attribute.

Profiling the Data

After you have created a data profile in the navigation tree, you can open it in the Data Profile Editor and profile the data. The objects you selected when creating the profile are displayed in the object tree of the Data Profile Editor. You can add objects to the profile by selecting Profile and then Add.

To profile the data:

  1. Expand the Data Profiles node on the navigation tree, right-click a data profile, and select Open Editor.

    The Data Profile Editor opens the selected data profile.

  2. From the Profile menu, select Profile.

    If this is the first time you are profiling data, the Data Profile Setup dialog is displayed. Enter the details of the profiling workspace in this dialog. For more information, see "Data Profile Setup Dialog".

    Warehouse Builder begins preparing metadata for profiling. The progress window appears and displays the name of the object that Warehouse Builder is creating in order to profile the data. After the metadata preparation is complete, the Profiling Initiated dialog is displayed informing you that the profiling job has started. Once the profiling job starts, the data profiling is asynchronous and you can continue working in Warehouse Builder or even close the client. Your profiling process will continue to run until complete.

  3. View the status of the profiling job in the Monitor Panel of the Data Profile Editor.

    You can continue to monitor the progress of your profiling job in the Monitor panel. After the profiling job is complete, the status displays as complete.

  4. After the profiling is complete, the Retrieve Profile Results dialog is displayed and you are prompted to refresh the results.

    You can use this option if you have previously profiled data in the same data profile. It enables you to control when the new profiling results become visible in the Data Profile Editor.

Note:

Data profiling results are overwritten on subsequent profiling executions.

Data Profile Setup Dialog

The Data Profile Setup dialog contains details about the data profiling workspace. The profiling workspace is a schema that Warehouse Builder uses to store the results of the profiling. You set the profiling workspace once for each repository.

When the Data Profile Setup dialog is first displayed, it contains two fields.

SYSDBA Name: The name of a database user who has SYSDBA privileges.

SYSDBA Password: The password of the user specified in the SYSDBA Name field.

This information is required by Warehouse Builder to create the data profiling workspace schema. If you click OK, Warehouse Builder uses default names and creates a database schema to store data related to the data profiling activity. Additionally, a location that is associated with this schema is created and registered. For example, if the name of the repository owner is rep_own, Warehouse Builder creates a schema called rep_own_prf and a location called rep_own_prf_location.

To use your own name for the data profiling schema and location, click Show Details. The following fields are displayed in the Data Profile Setup dialog.

Name: The name of the schema that will be used as the profiling workspace.

Password: The password for the schema.

Confirm Password: Confirm the password for the schema.

Use the Default and Temporary drop-down lists to select the default tablespace and the temporary tablespace for the data profiling schema. Click OK to create the schema and the location associated with this schema. For example, if you specify the schema name as profile_workspace, a location called profile_workspace_location is created and associated with this schema.

Tip:

It is highly recommended that you use a dedicated tablespace for the profiling workspace and a separate database file for this tablespace.

Warehouse Builder uses the location created in this step as the default location for all data profiles. You can modify the default data profiling location either globally or for a particular data profile. To change the default profile location at a global level, select Preferences from the Tools menu of the Design Center. Select the new location in the Default Profile Location preference under the Data Profiling node. To change the data profile location for a data profile, see "Data Locations Tab".

Viewing the Results

After the profile operation is complete, you can open the data profile in the Data Profile Editor to view and analyze the results.

To view the profile results:

  1. Select the data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor opens and displays the data profile.

  2. If you have previous data profiling results displayed in the Data Profile Editor, refresh the view when prompted so that the latest results are shown.

    The results of the profiling are displayed in the Profile Results Canvas.

  3. Minimize the Data Rule and Monitor panels by clicking on the arrow symbol in the upper left corner of the panel.

    This maximizes your screen space.

  4. Select objects in the Profile Objects tab of the object tree to focus the results on a specific object.

    The results of the selected object are displayed in the Profile Results Canvas. You can switch between objects. The tab that you had selected for the previous object remains selected.

The following results are available in the Profile Results Canvas:

Data Profile

The Data Profile tab contains general information about the data profile. Use this tab to store any notes or information about the data profile.

Profile Object

The Profile Object tab, shown in Figure 20-8, contains two sub-tabs: Object Data and Object Notes. The Object Data tab lists the data records in the object you have selected in the Profile Objects tab of the object tree. The number of rows that were used in the sample is listed. You can use the buttons along the top of the tab to execute a query, get more data, or add a WHERE clause.

Figure 20-8 Profile Object Tab


Aggregation

The Aggregation tab, shown in Figure 20-9, displays all the essential measures for each column, such as minimum and maximum values, number of distinct values, and null values. Some measures are only available for specific data types. Those include the average, median, and standard deviation measures. Information can be viewed from either the Tabular or Graphical sub-tabs. Table 20-1 describes the various measurement results available in the Aggregation tab.

Table 20-1 Aggregation Results

Minimum: The minimum value with respect to the inherent database ordering of a specific type.

Maximum: The maximum value with respect to the inherent database ordering of a specific type.

# Distinct: The total number of distinct values for a specific attribute.

% Distinct: The percentage of distinct values over the entire row set.

Not Null: Yes or No.

Recommended NOT NULL: From analyzing the column values, data profiling determines that this column should not allow null values.

# Null: The total number of null values for a specific attribute.

% Null: The percentage of null values over the entire row set.

Six Sigma: For each column, the number of null values (defects) compared to the total number of rows in the table (opportunities).

Average: The average value for a specific attribute for the entire row set.

Median: The median value for a specific attribute for the entire row set.

Std Dev: The standard deviation for a specific attribute.
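The Six Sigma measure relates defects (here, null values) to opportunities (rows). One common convention is to express this as defects per million opportunities and convert it to a sigma level with a 1.5-sigma shift; the Python sketch below uses that conventional formula, shown only as an assumption about how such a value can be computed, not as the product's exact calculation.

from statistics import NormalDist

def sigma_level(defects, opportunities, shift=1.5):
    dpmo = 1_000_000 * defects / opportunities        # defects per million opportunities
    if dpmo <= 0:
        return float("inf")                           # no defects observed
    return NormalDist().inv_cdf(1 - dpmo / 1_000_000) + shift

# Example: a column with 5 null values in 1,000 rows.
print(round(sigma_level(5, 1000), 2))                 # 4.08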


A hyperlinked value in the aggregation results grid indicates that you can click the value to drill down into the data. This enables you to analyze the data in the sample that produced this result. For example, if you scroll to the SALARY column, shown in Figure 20-9, and click the value in the Maximum cell showing 24000, the Data Drill Panel on the bottom changes to show you all the distinct values in this column with counts on the left. On the right, the Data Drill can zoom into the value you select from the distinct values and display the full record where these values are found.

Figure 20-9 Aggregation Tabular Results


The graphical analysis, as shown in Figure 20-10, displays the results in a graphical format. You can use the graphical toolbar to change the display. You can also use the Column and Property drop-down menus to change the displayed data object.

Figure 20-10 Aggregation Graphical Results


Data Type

The Data Type tab, shown in Figure 20-11, provides profiling results about data types. This includes metrics such as length for character data types and the precision and scale for numeric data types. For each data type that is discovered, the data type is compared to the dominant data type found in the entire attribute and the percentage of rows that comply with the dominant measure is listed.

Figure 20-11 Data Type Results


Table 20-2 describes the various measurement results available in the Data Type tab.

Table 20-2 Data Type Results

Columns: Name of the column.

Documented Data Type: Data type of the column in the source object.

Dominant Data Type: From analyzing the column values, data profiling determines that this is the dominant (most frequent) data type.

% Dominant Data Type: Percentage of total number of rows where the column value has the dominant data type.

Documented Length: Length of the data type in the source object.

Minimum Length: Minimum length of the data stored in the column.

Maximum Length: Maximum length of the data stored in the column.

Dominant Length: From analyzing the column values, data profiling determines that this is the dominant (most frequent) length.

% Dominant Length: Percentage of total number of rows where the column value has the dominant length.

Documented Precision: Precision of the data type in the source object.

Minimum Precision: Minimum precision for the column in the source object.

Maximum Precision: Maximum precision for the column in the source object.

Dominant Precision: From analyzing the column values, data profiling determines that this is the dominant (most frequent) precision.

% Dominant Precision: Percentage of total number of rows where the column value has the dominant precision.

Documented Scale: Scale specified for the data type in the source object.

Minimum Scale: Minimum scale of the data type in the source object.

Maximum Scale: Maximum scale of the data type in the source object.

Dominant Scale: From analyzing the column values, data profiling determines that this is the dominant (most frequent) scale.

% Dominant Scale: Percentage of total number of rows where the column value has the dominant scale.


One example of data type profiling is finding that a column defined as VARCHAR actually stores only numeric values. Changing the data type of the column to NUMBER would make storage and processing more efficient.
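As an illustration of that scenario, the sketch below measures how many values in a character column parse as numbers; if the share is very high, NUMBER is a plausible dominant data type. The function and sample values are illustrative, not the profiler's implementation.

def numeric_percent(values):
    non_null = [v for v in values if v is not None]
    numeric = 0
    for v in non_null:
        try:
            float(v)                       # value parses as a number
            numeric += 1
        except ValueError:
            pass
    return 100.0 * numeric / len(non_null)

# A VARCHAR column whose values are almost all numeric is a candidate
# for a NUMBER column in the corrected schema.
print(numeric_percent(["10", "42", "7.5", "n/a"]))     # 75.0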

Domain

The Domain tab, shown in Figure 20-12, displays results about the possible set of values that exist in a certain attribute. Information can be viewed from either the Tabular or Graphical sub-tabs.

Figure 20-12 Domain Discovery Results


Table 20-3 describes the various measurement results available in the Domain tab.

Table 20-3 Domain Results

Found Domain: The discovered domain values.

% Compliant: The percentage of all column values that are compliant with the discovered domain values.

Six Sigma: The Six Sigma value for the domain results.


The process of discovering a domain on a column involves two phases. First, the distinct values in the column are used to determine whether that column might be defined by a domain. Typically, there are few distinct values in a domain. Then, if a potential domain is identified, the count of distinct values is used to determine whether that distinct value is compliant with the domain. The properties that control the threshold for both phases of domain discovery can be set in the Property Inspector.

If you find a result that you want to know more about, drill down and use the Data Drill panel to view details about the cause of the result.

For example, a domain of four values was found for the column REGION_ID: 3,2,4, and 1. If you want to see which records contributed to this finding, select the REGION_ID row and view the details in the Data Drill panel.

Pattern

The Pattern tab displays information discovered about patterns within the attribute. Pattern discovery is the profiler's attempt at generating regular expressions for data it discovered for a specific attribute. Note that non-English characters are not supported in the pattern discovery process.
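The following sketch shows one simple way character patterns can be generated and a dominant pattern identified. The encoding used here (A for letters, 9 for digits) is an illustrative convention and not necessarily the notation the profiler produces.

import re
from collections import Counter

def char_pattern(value):
    pattern = re.sub(r"[A-Za-z]", "A", value)      # letters become A
    return re.sub(r"[0-9]", "9", pattern)          # digits become 9

def dominant_pattern(values):
    counts = Counter(char_pattern(v) for v in values)
    pattern, n = counts.most_common(1)[0]
    return pattern, 100.0 * n / len(values)        # the pattern and its % compliance

print(dominant_pattern(["94404", "21122", "NY 10", "90210"]))
# ('99999', 75.0)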

Table 20-4 describes the various measurement results available in the Pattern tab.

Table 20-4 Pattern Results

Dominant Character Pattern: The most frequently discovered character pattern or consensus pattern.

% Compliant: The percentage of rows whose data pattern agrees with the dominant character pattern.

Dominant Word Pattern: The most frequently discovered word character pattern or consensus pattern.

% Compliant: The percentage of rows whose data pattern agrees with the dominant word pattern.

Common Format: Name, Address, Date, Boolean, Social Security Number, E-mail, URL. This is the profiler's attempt to add semantic understanding to the data that it sees. Based on patterns and some other techniques, it tries to determine which domain bucket a certain attribute's data belongs to.

% Compliant: The percentage of rows whose data pattern agrees with the consensus common format pattern.


Unique Key

The Unique Key tab, shown in Figure 20-13, provides information about the existing unique keys that were documented in the data dictionary, and possible unique keys or key combinations that were detected by the data profiling operation. The uniqueness % is shown for each. The unique keys that have No in the Documented? column are the ones that are discovered by data profiling. For example, the unique key UK_1 was discovered as a result of data profiling, whereas COUNTRY_C_ID_PK exists in the data dictionary.

Figure 20-13 Unique Key Results


For example, a phone number is unique in 98% of the records. It can be a unique key candidate, and you can then use Warehouse Builder to cleanse the noncompliant records for you. You can also use the drill-down feature to view the cause of the duplicate phone numbers in the Data Drill panel. Table 20-5 describes the various measurement results available in the Unique Key tab.

Table 20-5 Unique Key Results

Unique Key: The discovered unique key.

Documented?: Yes or No. Yes indicates that a unique key on the column exists in the data dictionary. No indicates that the unique key was discovered as a result of data profiling.

Discovered?: From analyzing the column values, data profiling determines whether a unique key should be created on the columns in the Local Attribute(s) column.

Local Attribute(s): The name of the column in the table that was profiled.

# Unique: The number of rows, in the source object, in which the attribute represented by Local Attribute is unique.

% Unique: The percentage of rows, in the source object, for which the attribute represented by Local Attribute is unique.

Six Sigma: For each column, the number of null values (defects) compared to the total number of rows in the table (opportunities).


Functional Dependency

The Functional Dependency tab, shown in Figure 20-14, displays information about the attribute or attributes that seem to depend on or determine other attributes. Information can be viewed from either the Tabular or Graphical sub-tabs. You can use the Show drop-down list to change the focus of the report. Note that unique keys defined in the database are not discovered as functional dependencies during data profiling.

Figure 20-14 Functional Dependency


Table 20-6 describes the various measurement results available in the Functional Dependency tab.

Table 20-6 Functional Dependency Results

Determinant: The name of the attribute that is found to determine the attribute listed in the Dependent column.

Dependent: The name of the attribute found to be determined by the value of another attribute.

# Defects: The number of values in the Determinant attribute that were not determined by the Dependent attribute.

% Compliant: The percentage of values that met the discovered dependency.

Six Sigma: The Six Sigma value.

Type: The suggested action for the discovered dependency.


For example, if you select Only 100% dependencies from the Show drop-down list, the information shown is limited to absolute dependencies. If you have an attribute that is always dependent on another attribute, as shown in Figure 20-14, Warehouse Builder can suggest that it be a candidate for a reference table. Suggestions are shown in the Type column. Moving the attribute into a separate reference table normalizes the schema.

The Functional Dependency tab also has a Graphical sub-tab so that you can view the information graphically. You can select a dependency and property from the drop-down lists to view graphical data.

For example, in Figure 20-15, you select the dependency where DEPARTMENT_ID seems to determine COMMISSION_PCT (DEPARTMENT_ID->COMMISSION_PCT). In the majority of cases, COMMISSION_PCT is null. Warehouse Builder therefore determines that most DEPARTMENT_ID values determine COMMISSION_PCT to be null. By switching the Property to Non-Compliant, you can view the exceptions to this discovery. Figure 20-15 shows that for the DEPARTMENT_ID value of 80, the COMMISSION_PCT values are not null. This makes sense after you discover that the department with DEPARTMENT_ID 80 is the Sales department.

Figure 20-15 Graphical Functional Dependency


Referential

The Referential tab displays information about foreign keys that were documented in the data dictionary, as well as relationships discovered during profiling. For each relationship, you can see the level of compliance. Information can be viewed from both the Tabular and Graphical subtabs. In addition, two other subtabs are available only in the Referential tab: Joins and Redundant Columns. Table 20-7 describes the various measurement results available in the Referential tab.

Table 20-7 Referential Results

Relationship: The name of the relationship.

Type: The type of relationship.

Documented?: Yes or No. Yes indicates that a foreign key on the column exists in the data dictionary. No indicates that the foreign key was discovered as a result of data profiling.

Discovered?: From analyzing the column values, data profiling determines whether a foreign key should be created on the column represented by Local Attribute(s).

Local Attribute(s): The name of the attribute in the source object.

Remote Key: The name of the key in the referenced object to which the local attribute refers.

Remote Attribute(s): The name of the attributes in the referenced object.

Remote Relation: The name of the object that the source object references.

Remote Module: The name of the module that contains the referenced object.

Cardinality Range: The range of the cardinality between two attributes. For example, the EMP table contains five rows of employee data: two employees each in departments 10 and 20, and one employee in department 30. The DEPT table contains three rows of department data, with DEPTNO values 10, 20, and 30. Data profiling finds a row relationship between the EMP and DEPT tables with a cardinality range of 1-2:1-1, because in EMP the number of rows for each distinct value ranges from 1 (DEPTNO 30) to 2 (DEPTNO 10 and 20), while in DEPT there is exactly one row for each distinct value (10, 20, and 30).

# Orphans: The number of orphan rows in the source object.

% Compliant: The percentage of values that met the discovered relationship.

Six Sigma: For each column, the number of null values (defects) compared to the total number of rows in the table (opportunities).


For example, you are analyzing two tables for referential relationships: the Employees table and the Departments table. Using the Referential data profiling results shown in Figure 20-16, you discover that the DEPARTMENT_ID column in the Employees table is related to the DEPARTMENT_ID column in the Departments table 98% of the time. You can then click the hyperlinked Yes in the Discovered? column to view the rows that did not comply with the discovered foreign key relationship.

Figure 20-16 Referential Results


You can also select the Graphical sub-tab to view the information graphically. This view is effective for seeing noncompliant records such as orphans. To use the Graphical sub-tab, make a selection from the Reference and Property drop-down lists. Figure 20-17 shows a graphical representation of records in the Employees table.

Figure 20-17 Graphical Referential Results


The Joins subtab displays a join analysis on the reference selected in the Reference drop down list. The results show the relative size and exact counts of the three possible outcomes for referential relationships: joins, orphans, and childless objects.

For example, both the EMPLOYEES and DEPARTMENTS tables contain a DEPARTMENT_ID column. There is a one-to-many relationship between the DEPARTMENT_ID column in the DEPARTMENTS table and the DEPARTMENT_ID column in the EMPLOYEES table. Joins represent the values that are present in both tables. Orphans represent values that are present only in the EMPLOYEES table and not in the DEPARTMENTS table. Finally, childless values are present in the DEPARTMENTS table and not in the EMPLOYEES table. You can drill into values on the diagram to view more details in the Data Drill panel, as shown in Figure 20-18.

Figure 20-18 Join Results


The Redundant Columns subtab displays information about columns in the child table that are also contained in the primary table. Redundant column results are only available when perfectly unique columns are found during profiling.

For example, consider two tables, EMP and DEPT, shown in Table 20-8 and Table 20-9, that have the following foreign key relationship: EMP.DEPTNO (uk) = DEPT.DEPTNO (fk).

Table 20-8 EMP Table

Employee Number   Dept. No   Location
100               1          CA
200               2          NY
300               3          MN


Table 20-9 DEPT Table

Dept No   Location   Zip
1         CA         94404
3         MN         21122
3         MN         21122
1         CA         94404


In this example, the Location column in the EMP table is a redundant column because you can get the same information from the join.
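Using the rows from Table 20-8 and Table 20-9, the check behind this result can be sketched as follows. The code simply verifies that, for every row that joins, the Location value in EMP matches the Location value the join would supply; the handling of rows that do not join is simplified for illustration, and the structures are not the product's internal representation.

# Location per DEPTNO, collapsed from the DEPT rows in Table 20-9.
dept_location = {1: "CA", 3: "MN"}

# (Employee Number, Dept. No, Location) rows from Table 20-8.
emp_rows = [(100, 1, "CA"), (200, 2, "NY"), (300, 3, "MN")]

# EMP.Location is redundant if every joining row repeats the value from DEPT.
redundant = all(dept_location[deptno] == location
                for _, deptno, location in emp_rows
                if deptno in dept_location)
print(redundant)                                        # True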

Data Rule

The Data Rule tab displays the data rules that are defined as a result of data profiling for the table selected in the object tree. The details for each data rule include the following:

  • Rule Name: Represents the name of the data rule.

  • Rule Type: Provides a brief description about the type of data rule.

  • Origin: Represents the origin of the data rule. For example, a value of Derived indicates that the data rule is derived.

  • % Compliant: Percent of rows that comply with the data rule.

  • # Defects: Number of rows that do not comply with the data rule.

The data rules on this tab reflect the active data rules in the Data Rule panel. You do not directly create data rules on this tab.

Deriving Data Rules

At this stage in the data profiling process, you can begin to tune the data profiling results by deriving data rules that can be used to clean up your data. A data rule is an expression that determines the set of legal data that can be stored within a data object. Data rules also determine the set of legal relationships that can exist between data objects. Although you can create data rules and apply them manually to your data profile, derived data rules allow you to move quickly and seamlessly between data profiling and data correction. For more information about how to create data rules, see "Using Data Rules".

For example, you have a table called Employees with the following columns: Employee_Number, Gender, Employee_Name. The profiling result shows that 90% of the values in the Employee_Number column are unique, making it an excellent candidate for a unique key. The results also show that 85% of the values in the Gender column are either 'M' or 'F', making it also a good candidate for a domain. You can then derive these rules directly from the Profile Results Canvas.
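A simple way to picture this decision is to compare the profiling figures against acceptance thresholds. In the sketch below, the dictionary layout and the 80% thresholds are invented for illustration; in the product you review the results interactively and choose which cells to derive.

# Illustrative profiling figures for the Employees example above.
profile = {
    "EMPLOYEE_NUMBER": {"percent_unique": 90.0},
    "GENDER": {"domain": ["M", "F"], "percent_compliant": 85.0},
}

UNIQUE_THRESHOLD = 80.0        # invented acceptance thresholds
DOMAIN_THRESHOLD = 80.0

candidates = []
if profile["EMPLOYEE_NUMBER"]["percent_unique"] >= UNIQUE_THRESHOLD:
    candidates.append("Unique Key rule on EMPLOYEE_NUMBER")
if profile["GENDER"]["percent_compliant"] >= DOMAIN_THRESHOLD:
    candidates.append("Domain List rule on GENDER: ('M', 'F')")

print(candidates)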

To derive a data rule:

  1. Select a data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor is displayed with the profiling results.

  2. Review the profiling results and determine which findings you want derived into data rules.

    The types of results that warrant data rules vary. Some results commonly derived into data rules include a detected domain, a functional dependency between two attributes, or a unique key.

  3. Select the tab that displays the results from which you want to derive a data rule.

    Figure 20-19 Tab Selected for Deriving Data Rule


    The Domain tab is selected in Figure 20-19. Data profiling has detected a domain with 100% compliance for the Region_ID attribute.

  4. Select the cell that contains the results you want derived into a data rule and then from the Profile menu select Derive Data Rule. Or click the Derive Data Rule button.

    The Derive Data Rule Wizard opens and displays the Welcome page.

  5. Click Next.

    The Name and Description page is displayed.

  6. The Name field displays a default name for the data rule. You can either accept the default name or enter a new name.

  7. Click Next.

    The Define Rule page is displayed.

  8. Provide details about the data rule parameters.

    The Type field that represents the type of data rule is populated based on the tab from which you derived the data rule. You cannot edit the type of data rule.

    Additional fields in the lower portion of this page define the parameters for the data rule. Some of these fields are populated with values based on the result of data profiling. The number and type of fields depends on the type of data rule.

  9. Click Next.

    The Summary page is displayed. Review the options you set in the wizard using this page. Click Back if you want to change any of the selected values.

  10. Click Finish.

    The data rule is created and it appears in the Data Rule panel of the Data Profile Editor. The derived data rule is also added to the Derived_Data_Rules node under Data Rules node in the Project Explorer. You can reuse this data rule by attaching it to other data objects.

Correcting Schemas and Cleansing Data

After deriving the data rules, you can use Warehouse Builder to automate the process of correcting source data based on the data profiling results. As part of the correction process, Warehouse Builder creates the following:

These objects are defined in Warehouse Builder. To implement them in your target schema, you must deploy the correction tables and correction mappings. Before you deploy a correction mapping, ensure that you do the following:

Creating Corrections

To create corrections, from the Data Profile Editor, select Profile and then Create Correction. The Create Correction Wizard opens and displays the Welcome page. Click Next. The wizard guides you through the following pages:

Select Target Module

Use the Select Target Module page to specify the target module in which the corrected objects are stored. You can create a new target module or you can select an existing module.

To use an existing module to store the corrected objects, choose Select an Existing Module. Select Create a New Target Module to create a new module. This launches the Create Module Wizard.

Select Objects

Use the Select Objects page to select the objects you want to correct.

The Filter drop-down list enables you to filter the objects that are available for selection. The default selection is All Objects. You can display only particular types of data objects such as tables or views.

The Available section lists all the objects available for correction. Each type of object is represented by a node. Within this node, the objects are grouped by module. Select the objects that you want to correct and move them to the Selected section.

Select Data Rules and Data Types

Use this page to select the data rules that should be applied to the selected objects. The objects selected for correction are on the left side of the page and are organized into a tree by modules. The right panel contains two tabs: Data Rules and Data Types.

Data Rules

The Data Rules tab displays the available data rules for the object selected in the object tree. Specify the data rules that you want to apply to that corrected object by selecting the check box to the left of the data rule. Warehouse Builder uses these data rules to create constraints on the tables during the schema generation.

The Bindings section contains details about the table column to which the rule is bound. Click a rule name to display the bindings for the rule.

The method used to apply the data rule to the correction table depends on the type of data rule you are implementing. Warehouse Builder uses the following methods of object schema correction:

  • Creating constraints

    Creates a constraint reflecting the data rule on the correction table. If a constraint cannot be created, a validation message is displayed on the Data Rules Validation page.

    Constraints are created for the following types of rules: Custom, Domain List, Domain Pattern List, Domain Range, Common Format, No Nulls, and Unique Key.

  • Changing the data type

    Changes the data type of the column to NUMBER or DATE according to the results of profiling. The data type is changed for data rules of type Is Number and Is Date.

  • Creating a lookup table

    Creates a lookup table and adds the appropriate foreign key or unique key constraints to the corrected table and the lookup table. Warehouse Builder creates a lookup table for a Functional Dependency rule.

  • Name and Address parse

    Adds additional name and address attributes to the correction table. The name and address attributes correspond to a selection of the output values of the Name and Address operator. In the map that is created to cleanse data, a Name and Address operator is used to perform name and address cleansing.

    A name and address parse is performed for a data rule of type Name and Address.
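
The following Oracle SQL sketch illustrates the kind of schema changes these correction methods correspond to. The table and column names (EMP_CORRECTED, GENDER, DEPT_LOOKUP, and so on) are hypothetical examples, not names generated by the wizard; the actual DDL is produced for you by the Create Correction Wizard.

  -- A Domain List rule on a GENDER column might surface as a check constraint
  -- on the corrected table.
  ALTER TABLE emp_corrected
    ADD CONSTRAINT emp_gender_chk CHECK (gender IN ('M', 'F'));

  -- A Functional Dependency rule might surface as a lookup table plus a
  -- foreign key from the corrected table to the lookup table.
  CREATE TABLE dept_lookup (
    department_id   NUMBER PRIMARY KEY,
    department_name VARCHAR2(30)
  );

  ALTER TABLE emp_corrected
    ADD CONSTRAINT emp_dept_fk FOREIGN KEY (department_id)
    REFERENCES dept_lookup (department_id);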

Data Types

The Data Types tab displays the columns that are selected for correction. The proposed correction can be a change to the data type or precision, or a change from a fixed-length to a variable-length data type. The Documented Data Type column displays the existing column definition, and the New Data Type column displays the proposed correction to the column definition.

To correct a column definition, select the check box to the left of the column name.

Data Rules Validation

This page displays the validation warnings and errors that result for the data rules selected for correction. If there are any errors, you must first correct the errors and then proceed with the correction process.

Verify and Accept Corrected Tables

Use this page to verify and accept the corrected tables. This page lists the tables that have been modified with the applied rules and the data type changes. Select the tables you want to create in the corrected schema by selecting Create to the left of the object.

Figure 20-20 displays the Verify and Accept Corrected Tables page.

Figure 20-20 Verify and Accept Corrected Tables


The lower part of the page displays the details of the object selected on the top part of the page. It contains the following tabs: Columns, Constraints, and Data Rules.

The Columns tab provides details about the columns in the corrected table. You can select columns for creation, deletion, or modification. To specify that a column should be created, select Create to the left of the column name. You can modify the data type details of a column. However, you cannot modify a column name.

The Constraints tab displays the constraints for the corrected table. You can add constraints by clicking Add Constraint. Click Delete to delete the selected constraint.

The Data Rules tab displays the rule bindings for the corrected table. You can add, delete, or modify the rule bindings listed on this tab. These data rules are used to derive the data correction actions in the next step.

Choose Data Correction Actions

Use the Choose Data Correction Actions page to choose the action to perform to correct the source data. This page contains two sections: Select a Corrected Table and Choose Data Correction Actions. The Select a Corrected Table section lists the objects that you selected for corrections. Select a table in this section to display the affiliated data rules in the Choose Data Correction Actions section.

Choose Data Correction Actions

For each data rule, select an action from the drop-down list in the Action column. The settings you choose here determine how to handle data values that are not accepted due to data rule enforcement. Choose one of the following actions:

Ignore: The data rule is ignored and, therefore, no values are rejected based on this data rule.

Report: The data rule is run after the data has been loaded, for reporting purposes only. It is like the Ignore option, except that a report is created that contains the values that did not adhere to the data rule. This action can be used for some rule types only.

Cleanse: The values rejected by this data rule are moved to an error table where cleansing strategies are applied. When you select this option, you must specify a cleansing strategy. See the following section for details about specifying cleansing strategies.

Cleansing Strategy

Use the Cleansing Strategy drop-down list to specify a cleansing strategy. This option is enabled only if you choose Cleanse in the Action column. The cleansing strategy depends on the type of data rule and the rule configuration. Error tables are used to store the records that do not conform to the data rule.

The options you can select for Cleansing Strategy are as follows:

  • Remove

    Does not populate the target table with error records. This option is available for all data rule types.

  • Custom

    Creates a function in the target module that contains a header but no implementation details. You must add the implementation details to this function. A custom cleansing strategy is available for all data rule types except Unique Key, Referential, and Functional Dependency.

  • Set to Min

    Sets the attribute value of the error record to the minimum value defined in the data rule. This option is available only for Domain Range rules that have a minimum defined.

  • Set to Max

    Sets the attribute value of the error record to the maximum value defined in the data rule. This option is available for Domain Range rules that have a maximum defined.

  • Similarity

    Uses a similarity algorithm based on permitted domain values to find a value that is similar to the error record. If no similar value is found, the original value is used. This option is available for Domain List rules with character data type only.

  • Soundex

    Uses a Soundex algorithm based on permitted domain values to find a value that is similar to the error record. If no Soundex match is found, the original value is used. This option is available for Domain List rules with character data type only (see the example after this list).

  • Merge

    Uses the Match-Merge algorithm to merge duplicate records into a single row. You can use this option for Unique Key data rules only.

  • Set to Mode

    Uses the mode value to correct the error records if a mode value exists for the functional dependency partition that fails. This option is used for Functional Dependency data rules only.
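
As a minimal illustration of the idea behind the Similarity and Soundex strategies (and not the cleansing code that Warehouse Builder generates), the Oracle SOUNDEX function maps values that sound alike to the same code, which is how a misspelled value can be matched to a permitted domain value:

  -- 'Californa' (misspelled) and 'California' both return the Soundex code C416,
  -- so a Soundex-based strategy could replace the error value with the domain value.
  SELECT SOUNDEX('Californa')  AS error_value_code,
         SOUNDEX('California') AS domain_value_code
  FROM   dual;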

Summary

The Summary page provides a summary of the selected actions. When you are done assigning actions to the data rules, click Finish.

The correction schema is created and added to the Project Explorer. The correction objects and mappings are displayed under the module that you specify as the target module on the Select Target Module page of the Create Correction Wizard.

To create the correction tables in your target schemas, deploy the correction tables. To cleanse data, deploy and execute the correction mappings. The correction mappings have names that are prefixed with M_. For example, if your correction table is called EMPLOYEES, the correction mapping is called M_EMPLOYEES.

Viewing Correction Tables and Mappings

You can review the correction tables in the Data Object Editor to see the data rules and constraints created as part of the design of your table.

To view the correction mappings:

  1. Double-click the table or mapping to open the object in its respective editor.

  2. After the mapping is open, select View and then Auto Layout to view the entire mapping.

    Figure 20-21 displays a correction mapping generated by the Create Correction Wizard.

    Figure 20-21 Generated Correction Mapping


  3. Select the submapping ATTR_VALUE_1 and click the Visit Child Graph icon from the toolbar to view the submapping.

    The submapping is displayed as shown in Figure 20-22.

    Figure 20-22 Correction SubMapping


    The submapping is the element in the mapping that performs the actual correction and cleansing you specified in the Create Correction Wizard. In the middle of this submapping is the DOMAINSIMILARITY transformation, which was generated as a function by the Create Correction Wizard.

Using Data Rules

In addition to deriving data rules based on the results of data profiling, you can define your own data rules. You can bind a data rule to multiple tables within the project in which the data rule is defined. An object can contain any number of data rules.

You use the Design Center to create and edit data rules.

Once you create a data rule in Warehouse Builder, you can use it in any of the following scenarios.

Using Data Rules in Data Profiling

When you are using data profiling to analyze tables, you can use data rules to analyze how well data complies with a given rule and to collect statistics. From the results, you can derive a new data rule. For example, if data profiling determines that the majority of records in a column have a value of red, white, or blue, a new data rule can be derived that defines the color domain (red, white, and blue). This rule can then be reused to profile other tables, or reused in cleansing and auditing.

Using Data Rules in Data Cleansing and Schema Correction

Data rules can be used in two ways to cleanse data and correct schemas. The first way is to convert a source schema into a new target schema in which the structure of the new tables strictly adheres to the data rules. The new tables then have the correct data types, enforced constraints, and normalized schemas. The second way is in a correction mapping that validates the data in a source table against the data rules to determine which records comply and which do not. The analyzed data set is then corrected (for example, orphan records are removed and domain value inaccuracies are corrected) and the cleansed data set is loaded into the corrected target schema.

Using Data Rules in Data Auditing

Data rules are also used in data auditing. Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule, and they report defective data into auditing and error tables. In that sense they are like data-rule-based correction mappings, which also offer a report-only option for data that does not comply with the data rules. For more information about data auditors, see "About Data Auditors".

Types of Data Rules

Data rules in Warehouse Builder can be categorized as described in this section.

Domain List

A domain list rule defines a list of values that an attribute is allowed to have. For example, the Gender attribute can have 'M' or 'F'.

Domain Pattern List

A domain pattern list rule defines a list of patterns that an attribute is allowed to conform to. The patterns are defined in the Oracle Database regular expression syntax. An example pattern for a telephone number is as follows:

(^[[:space:]]*[0-9]{3}[[:punct:][:space:]]?[0-9]{4}[[:space:]]*$)
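
Before using a pattern in a rule, you can test it against sample data with the Oracle REGEXP_LIKE function. The table and column names below (CUSTOMERS, PHONE) are hypothetical; only the pattern comes from the example above:

  -- Returns the values that would violate the domain pattern rule.
  SELECT phone
  FROM   customers
  WHERE  NOT REGEXP_LIKE(phone,
           '^[[:space:]]*[0-9]{3}[[:punct:][:space:]]?[0-9]{4}[[:space:]]*$');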

Domain Range

A domain range rule defines a range of values that an attribute is allowed to have. For example, the value of the salary attribute can be between 100 and 10000.

Common Format

A common format rule defines a known common format that an attribute is allowed to conform to. This rule type has several subtypes: Telephone Number, IP Address, SSN, URL, and E-mail Address.

No Nulls

A no nulls rule specifies that the attribute cannot have null values. For example, the department_id attribute for an employee in the Employees table cannot be null.

Functional Dependency

A functional dependency rule defines a dependency between attributes, indicating that the data in the data object may be normalized.

Unique Key

A unique key data rule defines whether an attribute or a group of attributes is unique in the given data object. For example, the name of a department should be unique.

Referential

A referential data rule defines the type of relationship (1:x) that a value must have with another value. For example, the department_id attribute of the Departments table should have a 1:n relationship with the department_id attribute of the Employees table.

Name and Address

A name and address data rule uses the Warehouse Builder Name and Address support to evaluate a group of attributes as a name or address.

Custom

A custom data rule applies a SQL expression that you specify to its input parameters. For example, you can create a custom rule called VALID_DATE with two input parameters, START_DATE and END_DATE. A valid expression for this rule is: "THIS"."END_DATE" > "THIS"."START_DATE".
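
Before binding such a rule, you can preview the rows that would violate it with an ad hoc query in which the audited table stands in for "THIS". The table name EVENTS below is hypothetical:

  -- Rows where the custom expression "THIS"."END_DATE" > "THIS"."START_DATE"
  -- does not evaluate to TRUE are the potential rule violations.
  SELECT *
  FROM   events this
  WHERE  NOT (this.end_date > this.start_date)
     OR  this.end_date IS NULL
     OR  this.start_date IS NULL;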

Creating Data Rule Folders

Each data rule belongs to a data rule folder, which is a container object that groups related data rules. To create a data rule, you must first create a data rule folder.

To create a data rule folder, in the navigation tree, right-click Data Rules and select New. The Create Data Rule Folder dialog is displayed.

Create Data Rule Folder

The Name field represents the name of the data rule folder. Enter a name for the data rule folder. To rename a data rule folder, select the name and enter the new name.

In the Description field, enter an optional description for the data rule folder.

You can choose to create a data rule immediately after you create a data rule folder. Select the Proceed to Data Rule wizard option to launch the Create Data Rule Wizard that helps you create a data rule.

Click OK to close the Create Data Rule Folder dialog.

Creating Data Rules

A data rule is an expression that determines the legal data within a data object or legal relationships between data objects. You can provide a definition for a data rule or derive a data rule based on the results of data profiling. For more information about data rules, see "About Data Rules".

To create a data rule, right-click the data rule folder in which you want to create the data rule, and select New. The Welcome page of the Create Data Rule Wizard is displayed. Click Next. The wizard guides you through the following steps:

Naming the Data Rule

The Name field represents the name of the data rule. Use this field to enter a name for the data rule. If you are deriving a data rule, a default name is assigned to the data rule. You can accept the default name or enter a different name.

On the Name tab of the Edit Data Rule dialog, use the Name field to rename the data rule. Select the name and enter the new name. Renaming a data rule has no effect on the data rule bindings. The binding names are not changed and the data rule remains in effect for the table.

Use the Description field to specify an optional description for the data rule.

Defining the Rule

Use the Define Rule page or the Define Rule tab to provide details about the data rule. The top portion of the page displays the Type drop-down list that represents the type of data rule. When you are deriving a data rule, Warehouse Builder automatically populates the Type field and you cannot edit this value. When you are creating a data rule, expand the Type field to view the types of data rules and select the type you want to create. When you edit a data rule, the Type field is disabled as you cannot change the type of data rule once it is created. For more information about types of data rules, see "Types of Data Rules".

The bottom portion of this page specifies additional details about the data rule. The number and names of fields displayed here depend on the type of data rule you create. For example, if you select Custom, use the Attributes section to define the attributes required for the rule. Use the Ellipsis button on the Expression field to define a custom expression involving the attributes you defined in the Attributes section. If you select Domain Range as the type of data rule, the bottom portion of the page provides fields to specify the data type of the range, the minimum value, and the maximum value. When you are deriving a data rule, some of these fields are populated based on the profiling results from which you are deriving the rule. You can edit these values.

Summary Page

The Summary page displays the settings that you selected on the wizard pages. Review these settings. Click Back to modify any selections you made. Click Finish to create the data rule.

Data rules that are derived from the Data Profile Editor belong to the data rule folder called Derived_Data_Rules in the project that contains the data profile. Data rules that you create are part of the data rule folder under which you create the data rule.

Editing Data Rules

After you create a data rule, you can edit its definition. You can rename the data rule and edit its description. You cannot change the type of data rule. However, you can change the other parameters specified for the data rule. For example, for a Domain Range type of data rule, you can edit the data type of the range, the minimum range value, and the maximum range value.

To edit a data rule, in the Project Explorer, right-click the data rule and select Open Editor. You can also double-click the name of the data rule. The Edit Data Rule dialog is displayed. This dialog contains the following tabs:

Applying Data Rules

Applying a data rule to an object binds the definition of the data rule to the object. For example, binding a rule to the table Dept ensures that the rule is implemented for the specified attribute in the table. You apply a data rule using the Data Object Editor. You can also apply a derived data rule from the Data Rule panel of the Data Profile Editor.

The Apply Data Rule Wizard enables you to apply a data rule to an object. Open the Data Object Editor for the object to which you want to apply the data rule. Navigate to the Data Rules tab. If any data rules are bound to the data object, these are displayed on this tab. To apply a new data rule to this object, click Apply Rule. The Welcome page of the Apply Data Rule Wizard is displayed. Click Next. The wizard guides you through the following pages:

Select Rule

Use the Select Rule page to select the data rule that you want to apply to the object. This page lists all available data rules. The data rules are grouped under the nodes BUILT_IN, DERIVED_DATA_RULES, and other data rule folders that you create.

The BUILT_IN node contains the default data rules that are defined in the repository. These include rules such as foreign key, unique key, not null. The DERIVED_DATA_RULES node lists all the data rules that were derived as a result of data profiling.

Name and Description

The Name and Description page enables you to specify a name and an optional description for the applied data rule.

The Name field contains the name of the data rule that you selected on the Select Rule page. You can accept this name or enter a new name.

Use the Description field to enter an optional description for the applied data rule.

Bind Rule Parameters

Use the Bind Rule Parameters page to bind the data rule to a column in your data object. The Binding column lists all columns in the data object to which the data rule is being applied. Use the drop-down list on this column to select the column to which the rule is bound.

Summary

The Summary page displays the options that you chose on the wizard pages. Click Back to go to the previous pages and change some options. Click Finish to apply the data rule. This adds information about the data rule bindings to the object metadata.

Tuning the Data Profiling Process

Data profiling is a highly processor- and I/O-intensive process, and the execution time for profiling ranges from a few minutes to a few days. You can achieve the best possible data profiling performance by ensuring that the following conditions are satisfied:

Tuning Warehouse Builder for Better Data Profiling Performance

You can configure a data profile to optimize data profiling results. Use the configuration parameters to configure a data profile. For more information about configuration parameters, see "Configuration Parameters for Data Profiles".

Use the following guidelines to make your data profiling process faster.

  • Perform only the types of analysis that you require

    If you know that certain types of analysis are not required for the objects that you are profiling, use the configuration parameters to turn off these types of data profiling.

  • Analyze a smaller amount of data

    Use the WHERE Clause and Sample Rate configuration parameters to limit the amount of data that is profiled.

If the source data for profiling is stored in an Oracle Database, it is recommended that the source schema be located on the same database instance as the profile workspace. You can do this by installing the Warehouse Builder repository into the same Oracle Database instance as the source schema. This avoids using a database link to move data from the source to the profiling workspace.

Tuning the Oracle Database for Better Data Profiling Performance

To ensure good data profiling performance, the machine that runs the Oracle Database must have certain hardware capabilities. In addition to this, you must optimize the Oracle Database instance on which you are performing data profiling.

For efficient data profiling, the following considerations are applicable:

Multiple Processors

The machine that runs the Oracle Database needs multiple processors. Data profiling has been designed and tuned to take maximum advantage of the parallelism provided by the Oracle Database. When profiling large tables (more than ten million rows), it is highly recommended that you use a machine with multiple processors.

Warehouse Builder uses hints in the queries required to perform data profiling and picks up the degree of parallelism from the initialization parameter file of the Oracle Database. The default initialization parameter file contains parameters that take advantage of parallelism.
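
If the parallelism settings have been changed from their defaults, you may want to review them. The following statements are only an illustrative sketch; the values shown are assumptions and the appropriate settings depend on your hardware:

  -- Illustrative values only: size the parallel execution server pool and
  -- multiblock reads to match the machine running the profiling workload.
  ALTER SYSTEM SET parallel_max_servers = 16;
  ALTER SYSTEM SET parallel_threads_per_cpu = 2;
  ALTER SYSTEM SET db_file_multiblock_read_count = 16;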

Memory

It is important that you ensure a high memory hit ratio during data profiling. You can do this by allocating a larger System Global Area (SGA). It is recommended that the size of the System Global Area be configured to be no less than 500 MB. If possible, configure it to 2 GB or 3 GB.

For advanced database users, it is recommended that you observe the buffer cache hit ratio and the library cache hit ratio. Aim for a buffer cache hit ratio higher than 95% and a library cache hit ratio higher than 99%.
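
The following queries are a common way to observe these ratios in an Oracle Database; they are a monitoring sketch and are not something that Warehouse Builder runs for you:

  -- Buffer cache hit ratio: 1 - physical reads / (db block gets + consistent gets)
  SELECT 1 - (phy.value / (dbg.value + cns.value)) AS buffer_cache_hit_ratio
  FROM   v$sysstat phy, v$sysstat dbg, v$sysstat cns
  WHERE  phy.name = 'physical reads'
  AND    dbg.name = 'db block gets'
  AND    cns.name = 'consistent gets';

  -- Library cache hit ratio
  SELECT SUM(pinhits) / SUM(pins) AS library_cache_hit_ratio
  FROM   v$librarycache;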

I/O System

The capabilities of the I/O system have a direct impact on data profiling performance. Data profiling frequently performs full table scans and massive joins. Because today's CPUs can easily out-drive the I/O system, you must carefully design and configure the I/O subsystem. Keep in mind the following considerations that aid better I/O performance.

  • You need a large number of disk spindles to support uninterrupted CPU and I/O cooperation. If you have only a few disks, the I/O system is not geared towards a high degree of parallel processing. It is recommended to have a minimum of two disks for each CPU.

  • Configure the disks. It is recommended that you create logical stripe volumes on the existing disks, each striping across all available disks. Use the following formula to calculate the stripe width.

    MAX(1,DB_FILE_MULTIBLOCK_READ_COUNT/number_of_disks)*DB_BLOCK_SIZE

    Here, DB_FILE_MULTIBLOCK_READ_COUNT and DB_BLOCK_SIZE are parameters that you set in your database initialization parameter file. You can also use a stripe width that is a multiple of the value returned by the formula (a worked example follows this list).

    To create and maintain logical volumes, you need volume management software such as Veritas Volume Manager or Sun Storage Manager. If you are using Oracle Database 10g and do not have volume management software, you can use the Automatic Storage Management (ASM) feature of the Oracle Database to spread the workload across disks.

  • Create different stripe volumes for different tablespaces. It is possible that some of the tablespaces occupy the same set of disks.

    For data profiling, the USERS and the TEMP tablespaces are normally used at the same time. So you can consider placing these tablespaces on separate disks to reduce interference.
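
As a worked example of the stripe width formula, assume illustrative values of DB_FILE_MULTIBLOCK_READ_COUNT = 16, eight disks, and DB_BLOCK_SIZE = 8192 bytes:

  MAX(1, 16/8) * 8192 = 2 * 8192 = 16384 bytes (16 KB)

A stripe width of 16 KB, or a multiple of it such as 32 KB or 64 KB, would therefore be appropriate for this configuration.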

Using Data Auditors

Data auditors are Warehouse Builder objects that you can use to continuously monitor your source schema to ensure that data adheres to the defined data rules. You can monitor an object only if you have defined data rules for the object. You can create data auditors for tables, views, materialized views, and external tables.

Data auditors can be deployed and executed as standalone processes, but they are typically run to monitor data quality in an operational environment like a data warehouse or ERP system and, therefore, can be added to a process flow and scheduled.

Note:

You cannot import metadata for data auditors in Merge mode. For more information about import mode options, see "Import Option".

Creating Data Auditors

Data auditors enable you to monitor the quality of data in an operational environment. For more information about data auditors, see "About Data Auditors".

Use the Create Data Auditor Wizard to create data auditors. Data auditors are part of an Oracle module in a project.

Use the following steps to create a data auditor:

  1. Expand the Oracle module in which you want to create the data auditor.

  2. Right-click Data Auditors and select New.

    The Create Data Auditor Wizard is displayed and it guides you through the following steps:

Naming the Data Auditor

The Name field represents the name of the data auditor. The name of the data auditor should be unique within the Oracle module to which it belongs. Use the Name tab of the Edit Data Auditor dialog to rename a data auditor. You select the name and then enter the new name.

The Description field represents the optional description that you can provide for the data auditor.

Selecting Objects

Use the Select Objects page or the Select Objects tab to select the data objects that you want to audit. The Available section displays the list of objects available for audit. This list contains only objects that have data rules bound to them. The Selected section displays the objects that are selected for auditing. Use the shuttle buttons to move objects from the Available section to the Selected section. On the Select Objects tab, the Selected section lists the objects currently selected for auditing; you can add more objects or remove existing objects using the shuttle buttons.

Choosing Actions

Use the Choose Action page or the Choose Action tab to select the action to be taken for records that do not comply with the data rules that are bound to the selected objects. Provide information in the following sections of this page: Error Threshold Mode and Data Rules.

Error Threshold Mode

Error threshold mode is the mode used to determine the compliance of data to data rules in the objects. You can select the following options:

  • Percent: The data auditor will set the audit result based on the percentage of records that do not comply with the data rule. This percentage is specified in the rule's defect threshold.

  • Six Sigma: The data auditor will set the audit result based on the Six Sigma values for the data rules. If the calculated Six Sigma value for any rule is less than the specified Sigma Threshold value, the data auditor sets AUDIT_RESULT to 2.

Data Rules

This section contains the following details:

  • Data Rule: The name of the table suffixed with the name of the bound data rule.

  • Rule Type: The type of data rule.

  • Action: The action to be performed if data in the source object does not comply with the data rule. Select Report to ensure that the data rule is audited. Select Ignore to cause the data rule to be ignored.

  • Defect Threshold: The percentage of records that must comply with the data rule to ensure successful auditing. Specify a value between 1 and 100. This value is ignored if you set the Error Threshold Mode to Six Sigma.

  • Sigma Threshold: The required success rate. Specify a number between 0 and 7. If you set the value to 7, no failures are allowed. This value is ignored if you set the Error Threshold Mode to Percent.
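
As a worked example of these thresholds, assume a rule is audited against 10,000 rows and 200 of them violate the rule, so 98 percent of the records comply. In Percent mode, a Defect Threshold of 99 would cause the audit to fail, while a threshold of 95 would not. In Six Sigma mode, the same result corresponds to 20,000 defects per million opportunities, which under the conventional 1.5-sigma shift is roughly a 3.6 sigma level; a Sigma Threshold of 4 would then fail and a threshold of 3 would pass. The sigma conversion shown here uses the standard industry convention and is included only for orientation; the exact value reported is calculated by the data auditor.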

Summary Page

The Summary page lists the options that you selected in the previous pages of the wizard. To modify any of the selected options, click Back. To proceed with the creation of the data auditor, click Finish. A data auditor is created using the settings displayed on the Summary page.

Editing Data Auditors

After you create a data auditor, you can edit it and modify any of its properties. To edit a data auditor, in the Design Center, right-click the data auditor and select Open Editor. The Edit Data Auditor dialog is displayed. Use the following tabs to modify the properties of the data auditor:

Reconciling Objects

If the definitions of the objects included in the data auditor were modified after the data auditor was created, the objects in the data auditor will not be identical with the actual objects. The table on this tab lists the audit objects and the source objects that were included for audit. To reconcile the definition of an audit object with the source object, select the check box on the row that corresponds to the changed object and click Reconcile. This reconciles the audit object definition with the source object.

Auditing Objects Using Data Auditors

After you create a data auditor, you can use it to monitor the data in your data objects. This ensures that the data rule violations for the objects are detected. When you run a data auditor, any records that violate the data rules defined on the data objects are written to the error tables.

There are two ways of using data auditors:

Data Auditor Execution Results

After you run a data auditor, the Job Details dialog displays the details of the execution. The Job Details dialog contains two tabs: Input Parameters and Execution Results. Figure 20-23 displays the Execution Results tab of the Job Details dialog. Note that the Job Details dialog is displayed only when you set the deployment preference Show Monitor to true. For more information about deployment preferences, see "Deployment Preferences".

Figure 20-23 Data Auditor Execution Results


The Input Parameters tab contains the values of input parameters used to run the data auditor. The Execution Results tab displays the results of running the data auditor. This tab contains two sections: Row Activity and Output Parameters.

The Row Activity section contains details about the inserts into the error table for each step. In Figure 20-23, the data rule called E_NOT_NULL inserted one record into the error table. Note that when more than one data rule is specified, a multi-table insert may be used in the data auditor. In this case, the row counts may not be accurate.

The Output Parameters section contains the following three parameters:

  • AUDIT_RESULT: Indicates the result of running the data auditor. The possible values for this parameter are as follows:

    0: No data rule violations occurred.

    1: At least one data rule violation occurred, but no data rule failed to meet the minimum quality threshold as defined in the data auditor. For more information about setting the threshold, see the section on Data Rules in "Choosing Actions".

    2: At least one data rule failed to meet the minimum quality threshold.

  • EO_<data_rule_name>: Represents the calculated error quality for the specified data rule. 0 indicates all errors and 100 indicates no errors.

  • SO_<data_rule_name>: Represents the Six Sigma quality calculated for the specified data rule.
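
For example, if a data auditor runs against 1,000 rows and 15 of them violate a rule named E_NOT_NULL, then EO_E_NOT_NULL would be 98.5 (98.5 percent of the rows are free of that error), SO_E_NOT_NULL would contain the corresponding Six Sigma value, and AUDIT_RESULT would be 1 or 2 depending on whether that result meets the threshold defined for the rule in the data auditor.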

Manually Running Data Auditors

You run a data auditor to check if the data in the data object adheres to the data rules defined for the object. You can run data auditors from the Design Center or the Control Center Manager. To run a data auditor from the Design Center, right-click the data auditor and select Start. In the Control Center Manager, select the data auditor, and from the File menu, select Start. The results are displayed in the Job Details panel described in "Data Auditor Execution Results".

Automatically Running Data Auditors

You can automate the process of running a data auditor using the following steps:

  1. Create a process flow that contains a Data Auditor Monitor activity.

  2. Schedule this process flow to run at a predefined time.

    For more information about scheduling objects, see "Process for Defining and Using Schedules".

Figure 20-24 displays a process flow that contains a Data Auditor Monitor activity. In this process flow, LOAD_EMP_MAP is a mapping that loads data into the EMP table. If the data load is successful, the data auditor EMP_DATA_AUDIT is run. The data auditor monitors the data in the EMP table based on the data rules defined for the table.

Figure 20-24 Data Auditor Monitor Activity in a Process Flow


Configuring Data Auditors

During the configuration phase, you assign physical deployment properties to the data auditor you created by setting the configuration parameters. The Configuration Properties dialog enables you to configure the physical properties of the data auditor.

To configure a data auditor:

  1. From the Project Explorer, expand the Databases node and then the Oracle node.

  2. Right-click the name of the data auditor you want to configure and select Configure.

    The Configuration Properties dialog is displayed.

  3. Configure the parameters listed in the following sections.

Run Time Parameters

Default Purge Group: This parameter is used when executing the package. Each audit record in the runtime schema is assigned to the purge group specified.

Bulk size: The number of rows to be fetched as a batch while processing cursors.

Analyze table sample percentage: The percentage of rows to be sampled when the target tables are analyzed. You analyze target tables to gather statistics that you can use to improve performance while loading data into the target tables.

Commit frequency: The number of rows processed before a commit is issued.

Maximum number of errors: The maximum number of errors allowed before Warehouse Builder terminates the execution of this step.

Default Operating Mode: The operating mode used. The options you can select are Row based, Row based (target only), Set based, Set based fail over to row based, Set based fail over to row based (target only).

Default Audit Level: Use this parameter to indicate the audit level used when executing the package. When the package is run, the amount of audit information captured in the runtime schema depends on the value set for this parameter.

The options you can select are as follows:

ERROR DETAILS: At runtime, error information and statistical auditing information is recorded.

COMPLETE: All auditing information is recorded at runtime. This generates a huge amount of diagnostic data which may quickly fill the allocated tablespace.

NONE: No auditing information is recorded at runtime.

STATISTICS: At runtime, statistical auditing information is recorded.

Data Auditor

This category uses the same name as the data auditor and contains the following generic data auditor configuration parameters.

Generation comments: Specify additional comments for the generated code.

Threshold Mode: Specify the mode that should be used to measure failure thresholds. The options are PERCENTAGE and SIX SIGMA.

Language: The language used to define the generated code. The options are PL/SQL (default) and UNDEFINED. Ensure that PL/SQL (default) is selected.

Deployable: Select this option to indicate that you want to deploy this data auditor. Warehouse Builder generates code only if the data auditor is marked as deployable.

Referred Calendar: Specify the schedule to associate with the data auditor. The schedule defines when the data auditor will run.

Code Generation Options

ANSI SQL syntax: Select this option to use ANSI SQL code in the generated code. If this option is not selected, Warehouse Builder generates Oracle SQL syntax.

Commit control: Specifies how commit is performed. The options available for this parameter are: Automatic, Automatic Correlated, and Manual. Ensure that this parameter is set to Automatic.

Enable Parallel DML: Select this option to enable parallel DML at runtime.

Analyze table statistics: Select this option to generate the statement used to collect statistics for the data auditor.

Optimized Code: Select this option to indicate to Warehouse Builder that it should generate optimized code.

Generation Mode: Select the mode in which Warehouse Builder should generate optimized code. The options you can select are as follows: All Operating Modes, Row based, Row based (target only), Set based, Set based fail over to row based, and Set based fail over to row based (target only).

Use Target Load Ordering: Select this option to generate code for target load ordering.

Error Trigger: Specify the name of the error trigger procedure.

Bulk Processing code: Select this option to generate bulk processing code.