Oracle® Warehouse Builder User's Guide
10g Release 2 (10.2.0.2)

Part Number B28223-05

20 Ensuring Data Quality

Oracle Warehouse Builder provides a set of features that enable you to ensure the quality of data that is moved from source systems to your data warehouse. Data profiling is a feature that enables you to discover inconsistencies and anomalies in your source data and then correct them. Before you transform data, you can define data rules and apply data profiling and data auditing techniques.

This chapter contains the following topics:

Steps to Perform Data Profiling

The section "How to Perform Data Profiling" describes the steps in the data profiling process. This section lists the steps and provides references to the sections that describe how to perform each step.

Use the following steps to perform data profiling:

  1. Load the metadata

    See "Import or Select the Metadata".

  2. Create a data profile

    See "Using Data Profiles".

  3. Profile the data

    See "Profiling the Data".

  4. View the profiling results

    See "Viewing the Results".

  5. Derive data rules based on the results of data profiling

    You can create data rules based on the data profiling results. For more information, see "Deriving Data Rules".

  6. Generate corrections

    See "Correcting Schemas and Cleansing Data".

  7. Define and edit data rules manually

    See "Using Data Rules".

  8. Generate, deploy and execute

    See "Generate, Deploy, and Execute".

Using Data Profiles

Data profiling is a process that enables you to analyze the content and structure of your data to identify inconsistencies, anomalies, and redundancies. To begin the process of data profiling, you must first create a data profile using the Design Center. You can then profile the objects contained in the data profile and create correction tables and mappings.

This section contains the following topics:

Creating Data Profiles

Use the following steps to create a data profile:

  1. From the Warehouse Builder Project Explorer, expand the project node in which you want to create the data profile.

  2. Right-click Data Profiles and select New.

    The Create Data Profile Wizard opens and displays the Welcome page. Click Next to proceed. The wizard guides you through the following steps:


Naming the Data Profile

Use the Name field to specify a name for your data profile. The name must be unique within a project. The Name tab of the Edit Data Profile dialog enables you to rename the data profile. You can select the name and enter the new name.

Use the Description field to provide an optional description for the data profile. On the Name tab, you can modify the description by selecting the current description and typing the new description.

Selecting Objects

Use the Select Objects page to specify the objects that you want to profile. The Select Objects tab of the Edit Data Profile dialog enables you to modify object selections that you made.

The Available section displays a list of objects available for profiling. Select the objects you want to profile and use the shuttle buttons to move them to the Selected list. You can select multiple objects by holding down the Ctrl key and selecting the objects. The Selected list displays the objects selected for profiling. You can modify this list using the shuttle buttons.

The objects available for profiling include tables, external tables, views, materialized views, dimensions, and cubes. The objects are grouped by module. When you select a dimensional object in the Available section, Warehouse Builder displays a warning informing you that the relational objects that are bound to these dimensional objects will also be added to the profile. Click Yes to proceed.

Note:

You cannot profile a source table that contains complex data types if the source module is located on a different database instance from the data profile.

Reviewing the Summary

The Summary page displays the objects that you have selected to profile. Review the information on this page. Click Back to make changes or click Finish to create the data profile.

The data profile is created and added to the Data Profiles node in the navigation tree. If this is the first data profile you have created in the current project, the Connection Information dialog for the selected control center is displayed. Enter the control center manager connection information and click OK. The Data Profile Editor opens.

Editing Data Profiles

Once you create a data profile, you can use the Data Profile Editor to modify its definition. You can also add data objects to an existing data profile. To add objects, you can use either the menu bar options or the Select Objects tab of the Edit Data Profile dialog. For information about using the menu bar options, see "Adding Data Objects to a Data Profile". The following instructions describe how to access the Edit Data Profile dialog.

To edit a data profile:

  1. In the Project Explorer, right-click the data profile and select Open Editor.

    The Data Profile Editor is displayed.

  2. From the Edit menu, select Properties.

    The Edit Data Profile dialog is displayed.

  3. Use the following tabs on this dialog to modify the definition of the data profile:

Data Locations Tab

The Data Locations tab specifies the location that is used as the data profile workspace. This tab contains two sections: Available Locations and Selected Locations.

The Available Locations section displays all the Oracle locations in the current repository. The Selected Locations section displays the locations that are associated with the data profile. This section can contain more than one location. The location to which the data profile is deployed is the location for which the New Configuration Default option is selected.

To modify the data location associated with a data profile, use the shuttle arrow to move the new location to the Selected Locations section and select the New Configuration Default option for this location. When you change the location for a data profile, all objects that were created in the data profile workspace schema as a result of profiling this data profile are deleted.

You can also create a new location and associate it with the data profile. Click New below the Selected Locations section to display the Edit Database Location dialog. Specify the details of the new location and click OK. The new location is added to the Selected Locations section.

You cannot change the location to which a data profile is deployed by configuring the data profile. Note that this behavior differs from that of other objects. For example, consider Oracle modules. The Selected Locations section on the Data Locations tab of the Edit Module dialog contains a list of possible deployment locations. The objects in the module are deployed to the location that is set using the Location configuration parameter of the Oracle module.

Adding Data Objects to a Data Profile

To add data objects to a data profile, use the following steps:

  1. Right-click the data profile in the Project Explorer and select Open Editor.

    The Data Profile Editor is displayed.

  2. From the Profile menu, select Add Objects.

    The Add Profile Tables dialog is displayed. Use this dialog to add data objects to the data profile.

Add Profile Tables Dialog

The Add Profile Tables dialog contains two sections: Available and Selected. The Available section displays the data objects that are available for profiling. To add an object displayed in this list to the data profile, select the object and move it to the Selected list using the shuttle arrows. Select multiple objects by holding down the Ctrl key and selecting the objects.

Click OK. The selected data objects are added to the data profile. You can see the objects on the Profile Objects tab of the Object Tree.

Using the Data Profile Editor

The Data Profile Editor provides a single access point for managing and viewing data profile information as well as correcting metadata and data. It combines the functionality of a data profiler, a target schema generator, and a data correction generator. As a data profiler, it enables you to perform attribute analysis and structural analysis of selected objects. As a target schema generator, it enables you to generate a target schema based on the profile analysis and source table rules. Finally, as a data correction generator, it enables you to generate mappings and transformations to provide data correction.

Figure 20-1 displays the Data Profile Editor.

Figure 20-1 Data Profile Editor


The Data Profile Editor consists of the following:

Menu Bar

The menu bar provides access to the Data Profile Editor commands through the following menus:

Profile The Profile menu contains the following menu options:

  • Close: Closes the Data Profile Editor.

  • Save All: Saves the changes made in the Data Profile Editor.

  • Export: Exports the data displayed in the Profile Results Canvas and the Data Drill Panel. Warehouse Builder can store the exported data in .csv or .html files. The number of files used to store the exported data depends on the amount of profiling data. For example, if you specify the name of the export file as prf_result.html, the data is stored in files that begin with this name and continue with prf_result2.html, prf_result3.html, and so on.

  • Add: Adds objects to the data profile.

  • Profile: Profiles the objects contained in the data profile.

  • Derive Data Rule: Derives a data rule using the result of data profiling. This option is enabled only when you select a cell that contains a hyperlink in the Profile Results Canvas.

  • Remove Data Rule: Deletes the selected data rule.

  • Create Correction: Creates correction objects and mappings based on the results of data profiling.

  • Print: Prints the contents of the Data Profile Editor window.

Edit The Edit menu contains the following menu items:

  • Delete: Deletes the object selected in the Data Profile Editor.

  • Synchronize: Synchronizes the data profile metadata with the Warehouse Builder repository.

  • Properties: Displays the properties of the data profile.

Window Use the Window menu to display or hide the various panels of the Data Profile Editor. All the options act as toggle switches. For example, to hide the Data Rule panel, select Data Rule Panel from the Window menu. You can then display the Data Rule panel by selecting Data Rule Panel from the Window menu.

Help Use the Help menu options to access the online Help for the product. To display the context-sensitive Help for the current window, use the Topic option. This menu also contains options to access the different sections of the Oracle Technology Network.

Toolbars

The Data Profile Editor allows you the flexibility of using either the menu or toolbars to achieve your goals. The toolbar provides icons for commonly used commands that are also part of the menu bar. You can perform functions such as adding objects to a data profile, profiling data, deriving data rules, and creating corrections using the toolbar icons.

Figure 20-2 displays the various toolbars in the Data Profile Editor. When you first open the Data Profile Editor, all the toolbars are displayed in a single row. You can change their positions by holding down the left mouse button on the gray dots to the left of the toolbar, dragging, and dropping in the desired location.

Figure 20-2 Data Profile Editor Toolbars


Object Tree

The object tree can be used to navigate through the objects included in the data profile. Figure 20-3 displays the object tree that contains two tabs: Profile Objects and Corrected Modules.

Figure 20-3 Profile Objects


The Profile Objects tab contains a navigation tree that includes all of the objects selected for profiling. The tree expands down to the attribute level because you can change the profiling properties of each attribute and ensure that each attribute has the correct profiling settings. When you select an object in the Profile Objects tab, the profile details for that object are displayed in the Profile Results Canvas. You can also open the Data Object Editor for a specific object by double-clicking the object.

The Corrected Modules tab lists the objects created as a result of performing data correction actions. This tab contains objects only if you have performed data correction actions. Data correction is the process of creating corrected source data objects based on the results of the data profiling. For more information about creating corrected schemas, see "Correcting Schemas and Cleansing Data".

Property Inspector

The Property Inspector enables you to define the configuration parameters for the data profiling operation. Many types of data profiling rely on the configuration parameters to set the limits and the assumptions of analysis.

For example, if the parameter Domain Discovery Max Distinct Values Count is set to 25 and more than 25 distinct values are found in a column, no domains are considered for that column. Each type of analysis can also be turned off for performance purposes. If you run profiling without changing these properties, the default values are used.

You also use the Property Inspector to specify the types of data profiling that you want to perform. If you do not want to perform a particular type of profiling, turn it off using the corresponding configuration parameter. By default, Warehouse Builder performs the following types of profiling: aggregation, data type discovery, pattern discovery, domain discovery, unique key discovery, functional dependency discovery, and row relationship discovery. Note that you cannot turn off aggregation profiling.

For example, suppose your data profile contains two tables called COUNTRIES and JOBS, and for the COUNTRIES table you want to perform aggregation profiling, data type profiling, domain discovery, pattern discovery, and data rule profiling. Use the Property Inspector to select the following configuration options: Enable Data Type Discovery, Enable Domain Discovery, Enable Pattern Discovery, and Enable Data Rule Profiling for Table.

For details about the configuration parameters and their settings, see "Configuration Parameters for Data Profiles".

Monitor Panel

The Monitor panel displays details about currently running and past profiling events. The details about each profiling event include the profile event name, profile job ID, status, timestamp, and the repository owner who executed the profiling. Figure 20-4 displays the Monitor panel.

Figure 20-4 Monitoring Profile Status


You can view more details about a job by double-clicking it. The details include any problems that were encountered and how much time each profiling job took.

Profile Results Canvas

The Profile Results Canvas is where the results of the profiling can be viewed. Figure 20-5 displays the Profile Results Canvas panel. You can sort the results by clicking the heading of the column.

Figure 20-5 Profile Results


Click the following tabs for detailed profile analysis of different aspects of the object:

Click the hyperlinks to drill into the data. When you drill into data, the data results appear in the Data Drill panel. Depending on the tab you have selected, the available format sub-tabs may change. The sub-tabs that are available for all analysis tabs are as follows: Tabular, which provides the results in a grid or table format; Graphical, which provides an array of charts and graphs. You can also derive rules from the Profile Results Canvas.

Click any column header to toggle between ascending and descending sort order. The Graphical sub-tab displays a graphical representation of the different measures.

Data Drill Panel

Use the Data Drill panel to view questionable data from a selected object in the Profile Results canvas. Data drill-down is available for all tabs in the profile results canvas. Figure 20-6 displays the Data Drill panel for a profiled object.

Figure 20-6 Data Drill Panel


The first sentence on this panel provides information about the attribute to which the displayed drill details belong. When you drill into data, the results appear on the left side of the Data Drill panel. Use the Distinct Values list to filter the data displayed on the left side of the panel. You can further drill into this data to view the entire records. Click any row on the left side of the panel to display the selected records on the right side of the Data Drill panel.

Note that the left side displays aggregated row counts. In the example shown, there are 2 rows with a salary of 2400, and although the entire table contains 109 rows, the left side shows only 57 rows, one for each distinct value found.
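Conceptually, the left side of the Data Drill panel is a simple group-by over the selected attribute. The following Python sketch illustrates the idea; the rows, column names, and selected value are invented for illustration and are not taken from the product.

from collections import Counter

# Illustrative sample rows; in the product these come from the profiling workspace.
rows = [
    {"employee_id": 101, "salary": 2400},
    {"employee_id": 102, "salary": 2400},
    {"employee_id": 103, "salary": 24000},
]

# Left side of the panel: each distinct value with its aggregated row count.
value_counts = Counter(r["salary"] for r in rows)
for value, count in sorted(value_counts.items()):
    print(value, count)                      # salary 2400 appears in 2 rows

# Right side of the panel: the full records behind one selected distinct value.
selected_value = 2400
print([r for r in rows if r["salary"] == selected_value])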

Data Rule Panel

On the Data Rule panel, you can view the data rules that have been created as a result of the profiling. Figure 20-7 displays the Data Rule panel.

Figure 20-7 Data Rules Panel


The left side of this panel displays the data rules created for the object that is selected in the Profile Objects tab of the object tree as a result of the data profiling. When you include a table in the data profile, any data rules or check constraints on the table are automatically added. The data rules are added to the Derived_Data_Rules node under the Data Rules node of the project that contains the data profile.

You can add rules to this panel in the following ways:

  • Some of the tabs on the Profile Results Canvas contain a Derive Data Rule button. This button is enabled when you select a hyperlink on the tab. Click Derive Data Rule to derive a data rule for the selected result.

    If the data rule has already been derived, the Remove Rule button is enabled.

  • Click Apply Rule in the Data Rules panel. The Apply Data Rule wizard is displayed. For more information about this wizard, see "Deriving Data Rules". Use this wizard to select an existing data rule and apply it to the table selected in the object tree.

You can disable any of the data rules in the Data Rule panel by unchecking the check box to the left of the data rule. To delete a data rule, right-click the gray cell to the left of the data rule and select Delete.

The details displayed for an applied data rule are as follows:

Name: The name of the applied data rule.

Rule: Click this field to display the name of the module that contains the data rule and the name of the data rule. Click the Ellipsis button on this field to launch the Edit Data Rule dialog that enables you to edit the data rule.

Rule Type: The type of data rule. This is not editable.

Description: The description of the applied data rule.

Configuring Data Profiles

You can configure a data profile by setting its configuration parameters in the Property Inspector of the Data Profile Editor. For more information about the properties that you can configure, see "Configuration Parameters for Data Profiles".

You can set configuration parameters for a data profile at any of the following levels:

  • For all objects in the data profile

    To set configuration parameters for all objects contained in the data profile, select the data profile in the Profile Objects tab of the Object Tree. In the Property Inspector, set the configuration parameters to the required values. These parameters are set for all the objects in the data profile.

  • For a single object in the data profile

    To set configuration parameters for a single object within a data profile, select the object in the Profile Objects tab of the Object Tree. In the Property Inspector, set the configuration parameters.

  • For an attribute in an object

    To set configuration parameters for an attribute within an object, in the Profile Objects tab of the Object Tree, expand the object node to display the attributes it contains. For example, you can expand a table node to display its columns. Select the attribute for which you want to specify configuration parameters. In the Property Inspector, set the configuration parameters.

Configuration Parameters for Data Profiles

The configuration parameters that you can set for data profiles are categorized as follows:

Load Configuration

This category contains the following parameters:

  • Enable Data Type Discovery: Set this parameter to true to enable data type discovery for the selected table.

  • Enable Common Format Discovery: Set this parameter to true to enable common format discovery for the selected table.

  • Copy Data into Workspace: Set this parameter to true to enable copying of data from the source to the profile workspace.

  • Random Sample Rate: This value represents the percent of total rows that will be randomly selected during loading.

  • Sample Set Filter: This represents the WHERE clause that will be applied on the source when loading data into the profile workspace. Click the Ellipsis button on this field to launch the Expression Builder. Use the Expression Builder to define your WHERE clause.

  • Null Value Representation: This value will be considered as the null value during profiling. You must enclose the value in single quotation marks. The default value is null, which is considered as a database null.

Aggregation Configuration

This category consists of one parameter called Not Null Recommendation Percentage. If the percentage of null values in a column is less than this threshold percent, then that column will be discovered as a possible Not Null column.

Pattern Discovery Configuration

This category contains the following parameters:

  • Enable Pattern Discovery: Set this to true to enable pattern discovery.

  • Maximum Number of Patterns: This represents the maximum number of patterns that the profiler reports for an attribute. For example, if you set this parameter to 10, the profiler reports the 10 most frequent patterns for the attribute.

Domain Discovery Configuration

This category contains the following parameters:

  • Enable Domain Discovery: Set this to true to enable domain discovery.

  • Domain Discovery Max Distinct Values Count: The maximum number of distinct values in a column in order for that column to be discovered as possibly being defined by a domain. Domain discovery of a column occurs if the number of distinct values in that column is at or below the Max Distinct Values Count property, and, the number of distinct values as a percentage of total rows is at or below the Max Distinct Values Percent property.

  • Domain Discovery Max Distinct Values Percent: The maximum number of distinct values in a column, expressed as a percentage of the total number of rows in the table, in order for that column to be discovered as possibly being defined by a domain. Domain Discovery of a column occurs if the number of distinct values in that column is at or below the Max Distinct Values Count property, and, the number of distinct values as a percentage of total rows is at or below the Max Distinct Values Percent property.

  • Domain Value Compliance Min Rows Count: The minimum number of rows for the given distinct value in order for that distinct value to be considered as compliant with the domain. Domain Value Compliance for a value occurs if the number of rows with that value is at or above the Min Rows Count property, and, the number of rows with that value as a percentage of total rows is at or above the Min Rows Percent property.

  • Domain Value Compliance Min Rows Percent: The minimum number of rows, expressed as a percentage of the total number of rows, for the given distinct value in order for that distinct value to be considered as compliant with the domain. Domain Value Compliance for a value occurs if the number of rows with that value is at or above the Min Rows Count property, and, the number of rows with that value as a percentage of total rows is at or above the Min Rows Percent property.
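The interaction of these four thresholds can be pictured in a short sketch. The following Python code is only an illustration of the two checks described above, assuming the column values are available as a list; the function name and the default values are invented for this small example and do not reflect the product's internal implementation.

from collections import Counter

def discover_domain(values, max_distinct_count=25, max_distinct_percent=50.0,
                    min_rows_count=2, min_rows_percent=20.0):
    # Phase 1: decide whether the column is a domain candidate at all.
    total = len(values)
    counts = Counter(values)
    distinct = len(counts)
    if distinct > max_distinct_count:
        return None                                   # too many distinct values
    if 100.0 * distinct / total > max_distinct_percent:
        return None                                   # too many relative to the row count
    # Phase 2: keep only the distinct values frequent enough to be compliant.
    return sorted(value for value, n in counts.items()
                  if n >= min_rows_count and 100.0 * n / total >= min_rows_percent)

print(discover_domain(["M", "F", "M", "M", "F", "X"]))   # ['F', 'M'] -- 'X' is too rare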

Relationship Attribute Count Configuration

This category contains one parameter called Maximum Attribute Count. This is the maximum number of attributes for unique key, foreign key, and functional dependency profiling.

Unique Key Discovery Configuration

This category contains the following parameters:

  • Enable Unique Key Discovery: Set this parameter to true to enable unique key discovery.

  • Minimum Uniqueness Percentage: This is the minimum percentage of rows that need to satisfy a unique key relationship.
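As a rough illustration of how a uniqueness percentage could be computed for a candidate key, the sketch below counts the rows whose key value occurs exactly once; the function is illustrative and is not the product's algorithm.

from collections import Counter

def uniqueness_percent(key_values):
    counts = Counter(key_values)
    unique_rows = sum(n for n in counts.values() if n == 1)   # rows whose value occurs once
    return 100.0 * unique_rows / len(key_values)

# A column (or column combination) whose figure is at or above the
# Minimum Uniqueness Percentage threshold is reported as a unique key candidate.
print(uniqueness_percent(["555-0101", "555-0102", "555-0102", "555-0103"]))   # 50.0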

Functional Dependency Discovery Configuration

This category contains the following parameters:

  • Enable Functional Dependency Discovery: Set this parameter to true to enable functional dependency discovery.

  • Minimum Functional Dependency Percentage: This is the minimum percentage of rows that need to satisfy a functional dependency relationship.
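The sketch below shows one way a functional dependency compliance percentage could be computed: for each determinant value, the most frequent dependent value is treated as the agreed value, and rows carrying any other dependent value count as defects. This is an illustration only, not the product's algorithm, and the sample rows are invented.

from collections import Counter, defaultdict

def fd_compliance_percent(rows, determinant, dependent):
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[determinant]][row[dependent]] += 1
    # Rows carrying the most frequent dependent value per determinant are compliant.
    compliant = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return 100.0 * compliant / len(rows)

rows = [
    {"department_id": 10, "commission_pct": None},
    {"department_id": 10, "commission_pct": None},
    {"department_id": 80, "commission_pct": 0.2},
    {"department_id": 80, "commission_pct": 0.3},
]
print(fd_compliance_percent(rows, "department_id", "commission_pct"))   # 75.0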

Row Relationship Discovery Configuration

This category contains the following parameters:

  • Enable Relationship Discovery: Set this parameter to true to enable foreign key discovery.

  • Minimum Relationship Percentage: This is the minimum percentage of rows that need to satisfy a foreign key relationship.

Redundant Column Discovery Configuration

This category contains the following parameters:

  • Enable Redundant Columns Discovery: Set this parameter to true to enable redundant column discovery with respect to a foreign key-unique key pair.

  • Minimum Redundancy Percentage: This is the minimum percentage of rows that must be redundant for a column to be reported as redundant.

Data Rule Profiling Configuration

This category contains one parameter called Enable Data Rule Profiling for Table. Set this parameter to true to enable data rule profiling for the selected table. This setting is only applicable for a table, not for an individual attribute.

Profiling the Data

After you have created a data profile in the navigation tree, you can open it in the Data Profile Editor and profile the data. The objects you selected when creating the profile are displayed in the object tree of the Data Profile Editor. You can add objects to the profile by selecting Profile and then Add.

To profile the data:

  1. Expand the Data Profiles node on the navigation tree, right-click a data profile, and select Open Editor.

    The Data Profile Editor opens the selected data profile.

  2. From the Profile menu, select Profile.

    If this is the first time you are profiling data, the Data Profile Setup dialog is displayed. Enter the details of the profiling workspace in this dialog. For more information, see "Data Profile Setup Dialog".

    Warehouse Builder begins preparing metadata for profiling. The progress window appears and displays the name of the object that Warehouse Builder is creating in order to profile the data. After the metadata preparation is complete, the Profiling Initiated dialog is displayed informing you that the profiling job has started. Once the profiling job starts, the data profiling is asynchronous and you can continue working in Warehouse Builder or even close the client. Your profiling process will continue to run until complete.

  3. View the status of the profiling job in the Monitor Panel of the Data Profile Editor.

    You can continue to monitor the progress of your profiling job in the Monitor panel. After the profiling job is complete, the status displays as complete.

  4. After the profiling is complete, the Retrieve Profile Results dialog is displayed and you are prompted to refresh the results.

    You can use this option if you have previously profiled data in the same data profile. It enables you to control when the new profiling results become visible in the Data Profile Editor.

Note:

Data profiling results are overwritten on subsequent profiling executions.

Data Profile Setup Dialog

The Data Profile Setup dialog contains details about the data profiling workspace. The profiling workspace is a schema that Warehouse Builder uses to store the results of the profiling. You set the profiling workspace once for each repository.

When the Data Profile Setup dialog is first displayed, it contains two fields.

SYSDBA Name: The name of a database user who has SYSDBA privileges.

SYSDBA Password: The password of the user specified in the SYSDBA Name field.

This information is required by Warehouse Builder to create the data profiling workspace schema. If you click OK, Warehouse Builder uses default names and creates a database schema to store data related to the data profiling activity. Additionally, a location that is associated with this schema is created and registered. For example, if the name of the repository owner is rep_own, Warehouse Builder creates a schema called rep_own_prf and a location called rep_own_prf_location.

To use your own name for the data profiling schema and location, click Show Details. The following fields are displayed in the Data Profile Setup dialog.

Name: The name of the schema that will be used as the profiling workspace.

Password: The password for the schema.

Confirm Password: Confirm the password for the schema.

Use the Default and Temporary drop-down lists to select the default tablespace and the temporary tablespace for the data profiling schema. Click OK to create the schema and the location associated with this schema. For example, if you specify the schema name as profile_workspace, a location called profile_workspace_location is created and associated with this schema.

Tip:

It is highly recommended that you use a dedicated tablespace for the profiling workspace and a separate database file for this tablespace.

Warehouse Builder uses the location created in this step as the default location for all data profiles. You can modify the default data profiling location either globally or for a particular data profile. To change the default profile location at a global level, select Preferences from the Tools menu of the Design Center. Select the new location in the Default Profile Location preference under the Data Profiling node. To change the data profile location for a data profile, see "Data Locations Tab".

Viewing the Results

After the profile operation is complete, you can open the data profile in the Data Profile Editor to view and analyze the results.

To view the profile results:

  1. Select the data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor opens and displays the data profile.

  2. If you have previous data profiling results displayed in the Data Profile Editor, refresh the view when prompted so that the latest results are shown.

    The results of the profiling are displayed in the Profile Results Canvas.

  3. Minimize the Data Rule and Monitor panels by clicking on the arrow symbol in the upper left corner of the panel.

    This maximizes your screen space.

  4. Select objects in the Profile Objects tab of the object tree to focus the results on a specific object.

    The results of the selected object are displayed in the Profile Results Canvas. You can switch between objects. The tab that you had selected for the previous object remains selected.

The following results are available in the Profile Results Canvas:

Data Profile

The Data Profile tab contains general information about the data profile. Use this tab to store any notes or information about the data profile.

Profile Object

The Profile Object tab, shown in Figure 20-8, contains two sub-tabs: Object Data and Object Notes. The Object Data tab lists the data records in the object you have selected in the Profile Objects tab of the object tree. The number of rows that were used in the sample is listed. You can use the buttons along the top of the tab to execute a query, get more data, or add a WHERE clause.

Figure 20-8 Profile Object Tab


Aggregation

The Aggregation tab, shown in Figure 20-9, displays all the essential measures for each column, such as minimum and maximum values, number of distinct values, and null values. Some measures are only available for specific data types. Those include the average, median, and standard deviation measures. Information can be viewed from either the Tabular or Graphical sub-tabs. Table 20-1 describes the various measurement results available in the Aggregation tab.

Table 20-1 Aggregation Results

Minimum: The minimum value with respect to the inherent database ordering of a specific type.

Maximum: The maximum value with respect to the inherent database ordering of a specific type.

# Distinct: The total number of distinct values for a specific attribute.

% Distinct: The percentage of distinct values over the entire row set.

Not Null: Yes or No.

Recommended NOT NULL: From analyzing the column values, data profiling determines that this column should not allow null values.

# Null: The total number of null values for a specific attribute.

% Null: The percentage of null values over the entire row set.

Six Sigma: For each column, the number of null values (defects) compared to the total number of rows in the table (opportunities).

Average: The average value for a specific attribute for the entire row set.

Median: The median value for a specific attribute for the entire row set.

Std Dev: The standard deviation for a specific attribute.
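The Six Sigma measure relates defects (here, null values) to opportunities (rows). One common convention is to express this as defects per million opportunities and convert it to a sigma level with a 1.5-sigma shift; the Python sketch below uses that conventional formula, shown only as an assumption about how such a value can be computed, not as the product's exact calculation.

from statistics import NormalDist

def sigma_level(defects, opportunities, shift=1.5):
    dpmo = 1_000_000 * defects / opportunities        # defects per million opportunities
    if dpmo <= 0:
        return float("inf")                           # no defects observed
    return NormalDist().inv_cdf(1 - dpmo / 1_000_000) + shift

# Example: a column with 5 null values in 1,000 rows.
print(round(sigma_level(5, 1000), 2))                 # 4.08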


A hyperlinked value in the aggregation results grid indicates that you can click the value to drill down into the data. This enables you to analyze the data in the sample that produced this result. For example, if you scroll to the SALARY column, shown in Figure 20-9, and click the value in the Maximum cell showing 24000, the Data Drill Panel on the bottom changes to show you all the distinct values in this column with counts on the left. On the right, the Data Drill can zoom into the value you select from the distinct values and display the full record where these values are found.

Figure 20-9 Aggregation Tabular Results


The graphical analysis, as shown in Figure 20-10, displays the results in a graphical format. You can use the graphical toolbar to change the display. You can also use the Column and Property drop-down menus to change the displayed data object.

Figure 20-10 Aggregation Graphical Results


Data Type

The Data Type tab, shown in Figure 20-11, provides profiling results about data types. This includes metrics such as length for character data types and the precision and scale for numeric data types. For each data type that is discovered, the data type is compared to the dominant data type found in the entire attribute and the percentage of rows that comply with the dominant measure is listed.

Figure 20-11 Data Type Results


Table 20-2 describes the various measurement results available in the Data Type tab.

Table 20-2 Data Type Results

Columns: Name of the column.

Documented Data Type: Data type of the column in the source object.

Dominant Data Type: From analyzing the column values, data profiling determines that this is the dominant (most frequent) data type.

% Dominant Data Type: Percentage of total number of rows where the column value has the dominant data type.

Documented Length: Length of the data type in the source object.

Minimum Length: Minimum length of the data stored in the column.

Maximum Length: Maximum length of the data stored in the column.

Dominant Length: From analyzing the column values, data profiling determines that this is the dominant (most frequent) length.

% Dominant Length: Percentage of total number of rows where the column value has the dominant length.

Documented Precision: Precision of the data type in the source object.

Minimum Precision: Minimum precision for the column in the source object.

Maximum Precision: Maximum precision for the column in the source object.

Dominant Precision: From analyzing the column values, data profiling determines that this is the dominant (most frequent) precision.

% Dominant Precision: Percentage of total number of rows where the column value has the dominant precision.

Documented Scale: Scale specified for the data type in the source object.

Minimum Scale: Minimum scale of the data type in the source object.

Maximum Scale: Maximum scale of the data type in the source object.

Dominant Scale: From analyzing the column values, data profiling determines that this is the dominant (most frequent) scale.

% Dominant Scale: Percentage of total number of rows where the column value has the dominant scale.


One example of data type profiling is finding that a column defined as VARCHAR actually stores only numeric values. Changing the data type of the column to NUMBER would make storage and processing more efficient.
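As an illustration of that scenario, the sketch below measures how many values in a character column parse as numbers; if the share is very high, NUMBER is a plausible dominant data type. The function and sample values are illustrative, not the profiler's implementation.

def numeric_percent(values):
    non_null = [v for v in values if v is not None]
    numeric = 0
    for v in non_null:
        try:
            float(v)                       # value parses as a number
            numeric += 1
        except ValueError:
            pass
    return 100.0 * numeric / len(non_null)

# A VARCHAR column whose values are almost all numeric is a candidate
# for a NUMBER column in the corrected schema.
print(numeric_percent(["10", "42", "7.5", "n/a"]))     # 75.0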

Domain

The Domain tab, shown in Figure 20-12, displays results about the possible set of values that exist in a certain attribute. Information can be viewed from either the Tabular or Graphical sub-tabs.

Figure 20-12 Domain Discovery Results


Table 20-3 describes the various measurement results available in the Domain tab.

Table 20-3 Domain Results

Found Domain: The discovered domain values.

% Compliant: The percentage of all column values that are compliant with the discovered domain values.

Six Sigma: The Six Sigma value for the domain results.


The process of discovering a domain on a column involves two phases. First, the distinct values in the column are used to determine whether that column might be defined by a domain. Typically, there are few distinct values in a domain. Then, if a potential domain is identified, the count of distinct values is used to determine whether that distinct value is compliant with the domain. The properties that control the threshold for both phases of domain discovery can be set in the Property Inspector.

If you find a result that you want to know more about, drill down and use the Data Drill panel to view details about the cause of the result.

For example, a domain of four values was found for the column REGION_ID: 3,2,4, and 1. If you want to see which records contributed to this finding, select the REGION_ID row and view the details in the Data Drill panel.

Pattern

The Pattern tab displays information discovered about patterns within the attribute. Pattern discovery is the profiler's attempt at generating regular expressions for data it discovered for a specific attribute. Note that non-English characters are not supported in the pattern discovery process.
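The following sketch shows one simple way character patterns can be generated and a dominant pattern identified. The encoding used here (A for letters, 9 for digits) is an illustrative convention and not necessarily the notation the profiler produces.

import re
from collections import Counter

def char_pattern(value):
    pattern = re.sub(r"[A-Za-z]", "A", value)      # letters become A
    return re.sub(r"[0-9]", "9", pattern)          # digits become 9

def dominant_pattern(values):
    counts = Counter(char_pattern(v) for v in values)
    pattern, n = counts.most_common(1)[0]
    return pattern, 100.0 * n / len(values)        # the pattern and its % compliance

print(dominant_pattern(["94404", "21122", "NY 10", "90210"]))
# ('99999', 75.0)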

Table 20-4 describes the various measurement results available in the Pattern tab.

Table 20-4 Pattern Results

Dominant Character Pattern: The most frequently discovered character pattern or consensus pattern.

% Compliant: The percentage of rows whose data pattern agrees with the dominant character pattern.

Dominant Word Pattern: The most frequently discovered word character pattern or consensus pattern.

% Compliant: The percentage of rows whose data pattern agrees with the dominant word pattern.

Common Format: Name, Address, Date, Boolean, Social Security Number, E-mail, URL. This is the profiler's attempt to add semantic understanding to the data that it sees. Based on patterns and some other techniques, it tries to determine which domain bucket a certain attribute's data belongs to.

% Compliant: The percentage of rows whose data pattern agrees with the consensus common format pattern.


Unique Key

The Unique Key tab, shown in Figure 20-13, provides information about the existing unique keys that were documented in the data dictionary, and possible unique keys or key combinations that were detected by the data profiling operation. The uniqueness % is shown for each. The unique keys that have No in the Documented? column are the ones that are discovered by data profiling. For example, the unique key UK_1 was discovered as a result of data profiling, whereas COUNTRY_C_ID_PK exists in the data dictionary.

Figure 20-13 Unique Key Results


For example, a phone number is unique in 98% of the records. It can be a unique key candidate, and you can then use Warehouse Builder to cleanse the noncompliant records for you. You can also use the drill-down feature to view the cause of the duplicate phone numbers in the Data Drill panel. Table 20-5 describes the various measurement results available in the Unique Key tab.

Table 20-5 Unique Key Results

Unique Key: The discovered unique key.

Documented?: Yes or No. Yes indicates that a unique key on the column exists in the data dictionary. No indicates that the unique key was discovered as a result of data profiling.

Discovered?: From analyzing the column values, data profiling determines whether a unique key should be created on the columns in the Local Attribute(s) column.

Local Attribute(s): The name of the column in the table that was profiled.

# Unique: The number of rows, in the source object, in which the attribute represented by Local Attribute is unique.

% Unique: The percentage of rows, in the source object, for which the attribute represented by Local Attribute is unique.

Six Sigma: For each column, the number of null values (defects) compared to the total number of rows in the table (opportunities).


Functional Dependency

The Functional Dependency tab, shown in Figure 20-14, displays information about the attribute or attributes that seem to depend on or determine other attributes. Information can be viewed from either the Tabular or Graphical sub-tabs. You can use the Show drop-down list to change the focus of the report. Note that unique keys defined in the database are not discovered as functional dependencies during data profiling.

Figure 20-14 Functional Dependency


Table 20-6 describes the various measurement results available in the Functional Dependency tab.

Table 20-6 Functional Dependency Results

Determinant: The name of the attribute that is found to determine the attribute listed in the Dependent column.

Dependent: The name of the attribute found to be determined by the value of another attribute.

# Defects: The number of values in the Determinant attribute that were not determined by the Dependent attribute.

% Compliant: The percentage of values that met the discovered dependency.

Six Sigma: The Six Sigma value.

Type: The suggested action for the discovered dependency.


For example, if you select Only 100% dependencies from the Show drop-down list, the information shown is limited to absolute dependencies. If you have an attribute that is always dependent on another attribute, as shown in Figure 20-14, Warehouse Builder can suggest that it be a candidate for a reference table. Suggestions are shown in the Type column. Moving the attribute into a separate reference table normalizes the schema.

The Functional Dependency tab also has a Graphical sub-tab so that you can view the information graphically. You can select a dependency and property from the drop-down lists to view graphical data.

For example, in Figure 20-15, you select the dependency where DEPARTMENT_ID seems to determine COMMISSION_PCT (DEPARTMENT_ID->COMMISSION_PCT). In the majority of cases, COMMISSION_PCT is null. Warehouse Builder therefore determines that most DEPARTMENT_ID values determine COMMISSION_PCT to be null. By switching the Property to Non-Compliant, you can view the exceptions to this discovery. Figure 20-15 shows that for the DEPARTMENT_ID value of 80, the COMMISSION_PCT values are not null. This makes sense after you discover that the department with DEPARTMENT_ID 80 is the Sales department.

Figure 20-15 Graphical Functional Dependency


Referential

The Referential tab displays information about foreign keys that were documented in the data dictionary, as well as relationships discovered during profiling. For each relationship, you can see the level of compliance. Information can be viewed from both the Tabular and Graphical subtabs. In addition, two other subtabs are available only in the Referential tab: Joins and Redundant Columns. Table 20-7 describes the various measurement results available in the Referential tab.

Table 20-7 Referential Results

Relationship: The name of the relationship.

Type: The type of relationship.

Documented?: Yes or No. Yes indicates that a foreign key on the column exists in the data dictionary. No indicates that the foreign key was discovered as a result of data profiling.

Discovered?: From analyzing the column values, data profiling determines whether a foreign key should be created on the column represented by Local Attribute(s).

Local Attribute(s): The name of the attribute in the source object.

Remote Key: The name of the key in the referenced object to which the local attribute refers.

Remote Attribute(s): The name of the attributes in the referenced object.

Remote Relation: The name of the object that the source object references.

Remote Module: The name of the module that contains the referenced object.

Cardinality Range: The range of the cardinality between two attributes. For example, the EMP table contains five rows of employee data: two employees each in departments 10 and 20, and one employee in department 30. The DEPT table contains three rows of department data, with DEPTNO values 10, 20, and 30. Data profiling finds a row relationship between the EMP and DEPT tables with a cardinality range of 1-2:1-1, because in EMP the number of rows for each distinct value ranges from 1 (DEPTNO 30) to 2 (DEPTNO 10 and 20), while in DEPT there is exactly one row for each distinct value (10, 20, and 30).

# Orphans: The number of orphan rows in the source object.

% Compliant: The percentage of values that met the discovered relationship.

Six Sigma: For each column, the number of null values (defects) compared to the total number of rows in the table (opportunities).


For example, you are analyzing two tables for referential relationships: the Employees table and the Departments table. Using the Referential data profiling results shown in Figure 20-16, you discover that the DEPARTMENT_ID column in the Employees table is related to the DEPARTMENT_ID column in the Departments table 98% of the time. You can then click the hyperlinked Yes in the Discovered? column to view the rows that did not comply with the discovered foreign key relationship.

Figure 20-16 Referential Results


You can also select the Graphical sub-tab to view the information graphically. This view is effective for seeing noncompliant records such as orphans. To use the Graphical sub-tab, make a selection from the Reference and Property drop-down lists. Figure 20-17 shows a graphical representation of records in the Employees table.

Figure 20-17 Graphical Referential Results


The Joins subtab displays a join analysis on the reference selected in the Reference drop down list. The results show the relative size and exact counts of the three possible outcomes for referential relationships: joins, orphans, and childless objects.

For example, both the EMPLOYEES and DEPARTMENTS tables contain a DEPARTMENT_ID column. There is a one-to-many relationship between the DEPARTMENT_ID column in the DEPARTMENTS table and the DEPARTMENT_ID column in the EMPLOYEES table. Joins represent the values that are present in both tables. Orphans represent values that are present only in the EMPLOYEES table and not in the DEPARTMENTS table. Finally, childless values are present in the DEPARTMENTS table and not in the EMPLOYEES table. You can drill into values on the diagram to view more details in the Data Drill panel, as shown in Figure 20-18.

Figure 20-18 Join Results


The Redundant Columns subtab displays information about columns in the child table that are also contained in the primary table. Redundant column results are only available when perfectly unique columns are found during profiling.

For example, consider two tables, EMP and DEPT, shown in Table 20-8 and Table 20-9, that have the following foreign key relationship: EMP.DEPTNO (uk) = DEPT.DEPTNO (fk).

Table 20-8 EMP Table

Employee Number   Dept. No   Location
100               1          CA
200               2          NY
300               3          MN


Table 20-9 DEPT Table

Dept No   Location   Zip
1         CA         94404
3         MN         21122
3         MN         21122
1         CA         94404


In this example, the Location column in the EMP table is a redundant column because you can get the same information from the join.
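Using the rows from Table 20-8 and Table 20-9, the check behind this result can be sketched as follows. The code simply verifies that, for every row that joins, the Location value in EMP matches the Location value the join would supply; the handling of rows that do not join is simplified for illustration, and the structures are not the product's internal representation.

# Location per DEPTNO, collapsed from the DEPT rows in Table 20-9.
dept_location = {1: "CA", 3: "MN"}

# (Employee Number, Dept. No, Location) rows from Table 20-8.
emp_rows = [(100, 1, "CA"), (200, 2, "NY"), (300, 3, "MN")]

# EMP.Location is redundant if every joining row repeats the value from DEPT.
redundant = all(dept_location[deptno] == location
                for _, deptno, location in emp_rows
                if deptno in dept_location)
print(redundant)                                        # True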

Data Rule

The Data Rule tab displays the data rules that are defined as a result of data profiling for the table selected in the object tree. The details for each data rule include the following:

  • Rule Name: Represents the name of the data rule.

  • Rule Type: Provides a brief description about the type of data rule.

  • Origin: Represents the origin of the data rule. For example, a value of Derived indicates that the data rule is derived.

  • % Compliant: Percent of rows that comply with the data rule.

  • # Defects: Number of rows that do not comply with the data rule.

The data rules on this tab reflect the active data rules in the Data Rule panel. You do not directly create data rules on this tab.

Deriving Data Rules

At this stage in the data profiling process, you can begin to tune the data profiling results by deriving data rules that can be used to clean up your data. A data rule is an expression that determines the set of legal data that can be stored within a data object. Data rules also determine the set of legal relationships that can exist between data objects. Although you can create data rules and apply them manually to your data profile, derived data rules allow you to move quickly and seamlessly between data profiling and data correction. For more information about how to create data rules, see "Using Data Rules".

For example, you have a table called Employees with the following columns: Employee_Number, Gender, Employee_Name. The profiling result shows that 90% of the values in the Employee_Number column are unique, making it an excellent candidate for a unique key. The results also show that 85% of the values in the Gender column are either 'M' or 'F', making it also a good candidate for a domain. You can then derive these rules directly from the Profile Results Canvas.
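A simple way to picture this decision is to compare the profiling figures against acceptance thresholds. In the sketch below, the dictionary layout and the 80% thresholds are invented for illustration; in the product you review the results interactively and choose which cells to derive.

# Illustrative profiling figures for the Employees example above.
profile = {
    "EMPLOYEE_NUMBER": {"percent_unique": 90.0},
    "GENDER": {"domain": ["M", "F"], "percent_compliant": 85.0},
}

UNIQUE_THRESHOLD = 80.0        # invented acceptance thresholds
DOMAIN_THRESHOLD = 80.0

candidates = []
if profile["EMPLOYEE_NUMBER"]["percent_unique"] >= UNIQUE_THRESHOLD:
    candidates.append("Unique Key rule on EMPLOYEE_NUMBER")
if profile["GENDER"]["percent_compliant"] >= DOMAIN_THRESHOLD:
    candidates.append("Domain List rule on GENDER: ('M', 'F')")

print(candidates)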

To derive a data rule:

  1. Select a data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor is displayed with the profiling results.

  2. Review the profiling results and determine which findings you want derived into data rules.

    The types of results that warrant data rules vary. Some results commonly derived into data rules include a detected domain, a functional dependency between two attributes, or a unique key.

  3. Select the tab that displays the results from which you want to derive a data rule.

    Figure 20-19 Tab Selected for Deriving Data Rule


    The Domain tab is selected in Figure 20-19. Data profiling has detected a domain with 100% compliance for the Region_ID attribute.

  4. Select the cell that contains the results you want derived into a data rule and then from the Profile menu select Derive Data Rule. Or click the Derive Data Rule button.

    The Derive Data Rule Wizard opens and displays the Welcome page.

  5. Click Next.

    The Name and Description page is displayed.

  6. The Name field displays a default name for the data rule. You can either accept the default name or enter a new name.

  7. Click Next.

    The Define Rule page is displayed.

  8. Provide details about the data rule parameters.

    The Type field that represents the type of data rule is populated based on the tab from which you derived the data rule. You cannot edit the type of data rule.

    Additional fields in the lower portion of this page define the parameters for the data rule. Some of these fields are populated with values based on the result of data profiling. The number and type of fields depends on the type of data rule.

  9. Click Next.

    The Summary page is displayed. Review the options you set in the wizard using this page. Click Back if you want to change any of the selected values.

  10. Click Finish.

    The data rule is created and it appears in the Data Rule panel of the Data Profile Editor. The derived data rule is also added to the Derived_Data_Rules node under Data Rules node in the Project Explorer. You can reuse this data rule by attaching it to other data objects.

Correcting Schemas and Cleansing Data

After deriving the data rules, you can use Warehouse Builder to automate the process of correcting source data based on the data profiling results. As part of the correction process, Warehouse Builder creates the following:

These objects are defined in Warehouse Builder. To implement them in your target schema, you must deploy the correction tables and correction mappings. Before you deploy a correction mapping, ensure that you do the following:

Creating Corrections

To create corrections, from the Data Profile Editor, select Profile and then Create Correction. The Create Correction Wizard opens and displays the Welcome page. Click Next. The wizard guides you through the following pages:

Select Target Module

Use the Select Target Module page to specify the target module in which the corrected objects are stored. You can create a new target module or you can select an existing module.

To use an existing module to store the corrected objects, choose Select an Existing Module. Select Create a New Target Module to create a new module. This launches the Create Module Wizard.

Select Objects

Use the Select Objects page to select the objects you want to correct.

The Filter drop-down list enables you to filter the objects that are available for selection. The default selection is All Objects. You can display only particular types of data objects such as tables or views.

The Available section lists all the objects available for correction. Each type of object is represented by a node. Within this node, the objects are grouped by module. Select the objects that you want to correct and move them to the Selected section.

Select Data Rules and Data Types

Use this page to select the data rules that should be applied to the selected objects. The objects selected for correction are on the left side of the page and are organized into a tree by modules. The right panel contains two tabs: Data Rules and Data Types.

Data Rules

The Data Rules tab displays the available data rules for the object selected in the object tree. Specify the data rules that you want to apply to that corrected object by selecting the check box to the left of the data rule. Warehouse Builder uses these data rules to create constraints on the tables during the schema generation.

The Bindings section contains details about the table column to which the rule is bound. Click a rule name to display the bindings for the rule.

The method used to apply the data rule to the correction table depends on the type of data rule you are implementing. Warehouse Builder uses the following methods of object schema correction:

  • Creating constraints

    Creates a constraint reflecting the data rule on the correction table. If a constraint cannot be created, a validation message is displayed on the Data Rules Validation page.

    Constraints are created for the following types of rules: Custom, Domain List, Domain Pattern List, Domain Range, Common Format, No Nulls, and Unique Key.

  • Changing the data type

    Changes the data type of the column to NUMBER or DATE according to the results of profiling. The data type is changed for data rules of type Is Number and Is Date.

  • Creating a lookup table

    Creates a lookup table and adds the appropriate foreign key or unique key constraints to the corrected table and the lookup table. Warehouse Builder creates a lookup table for a Functional Dependency rule.

  • Name and Address parse

    Adds additional name and address attributes to the correction table. The name and address attributes correspond to a selection of the output values of the Name and Address operator. In the map that is created to cleanse data, a Name and Address operator is used to perform name and address cleansing.

    A name and address parse is performed for a data rule of type Name and Address.
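
The following Oracle SQL sketch illustrates the kind of schema changes these correction methods correspond to. The table and column names (EMP_CORRECTED, GENDER, DEPT_LOOKUP, and so on) are hypothetical examples, not names generated by the wizard; the actual DDL is produced for you by the Create Correction Wizard.

  -- A Domain List rule on a GENDER column might surface as a check constraint
  -- on the corrected table.
  ALTER TABLE emp_corrected
    ADD CONSTRAINT emp_gender_chk CHECK (gender IN ('M', 'F'));

  -- A Functional Dependency rule might surface as a lookup table plus a
  -- foreign key from the corrected table to the lookup table.
  CREATE TABLE dept_lookup (
    department_id   NUMBER PRIMARY KEY,
    department_name VARCHAR2(30)
  );

  ALTER TABLE emp_corrected
    ADD CONSTRAINT emp_dept_fk FOREIGN KEY (department_id)
    REFERENCES dept_lookup (department_id);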

Data Types

The Data Types tab displays the columns that are selected for correction. The proposed correction can be a change to the data type or precision, or a change from a fixed-length to a variable-length data type. The Documented Data Type column displays the existing column definition, and the New Data Type column displays the proposed correction to the column definition.

To correct a column definition, select the check box to the left of the column name.

Data Rules Validation

This page displays the validation warnings and errors that result for the data rules selected for correction. If there are any errors, you must first correct the errors and then proceed with the correction process.

Verify and Accept Corrected Tables

Use this page to verify and accept the corrected tables. This page lists the tables that have been modified with the applied rules and the data type changes. Select the tables you want to create in the corrected schema by selecting Create to the left of the object.

Figure 20-20 displays the Verify and Accept Corrected Tables page.

Figure 20-20 Verify and Accept Corrected Tables


The lower part of the page displays the details of the object selected on the top part of the page. It contains the following tabs: Columns, Constraints, and Data Rules.

The Columns tab provides details about the columns in the corrected table. You can select columns for creation, deletion, or modification. To specify that a column should be created, select Create to the left of the column name. You can modify the data type details of a column. However, you cannot modify a column name.

The Constraints tab displays the constraints for the corrected table. You can add constraints by clicking Add Constraint. Click Delete to delete the selected constraint.

The Data Rules tab displays the rule bindings for the corrected table. You can add, delete, or modify the rule bindings listed on this tab. These data rules are used to derive the data correction actions in the next step.

Choose Data Correction Actions

Use the Choose Data Correction Actions page to choose the action to perform to correct the source data. This page contains two sections: Select a Corrected Table and Choose Data Correction Actions. The Select a Corrected Table section lists the objects that you selected for corrections. Select a table in this section to display the affiliated data rules in the Choose Data Correction Actions section.

Choose Data Correction Actions

For each data rule, select an action from the drop-down list in the Action column. The settings you choose here determine how to handle data values that are not accepted due to data rule enforcement. Choose one of the following actions:

Ignore: The data rule is ignored and, therefore, no values are rejected based on this data rule.

Report: The data rule is run after the data has been loaded, for reporting purposes only. It is like the Ignore option, except that a report is created that contains the values that did not adhere to the data rule. This action can be used for some rule types only.

Cleanse: The values rejected by this data rule are moved to an error table where cleansing strategies are applied. When you select this option, you must specify a cleansing strategy. See the following section for details about specifying cleansing strategies.

Cleansing Strategy

Use the Cleansing Strategy drop-down list to specify a cleansing strategy. This option is enabled only if you choose Cleanse in the Action column. The cleansing strategy depends on the type of data rule and the rule configuration. Error tables are used to store the records that do not conform to the data rule.

The options you can select for Cleansing Strategy are as follows:

  • Remove

    Does not populate the target table with error records. This option is available for all data rule types.

  • Custom

    Creates a function in the target module that contains a header but no implementation details. You must add the implementation details to this function. A custom cleansing strategy is available for all data rule types except Unique Key, Referential, and Functional Dependency.

  • Set to Min

    Sets the attribute value of the error record to the minimum value defined in the data rule. This option is available only for Domain Range rules that have a minimum defined.

  • Set to Max

    Sets the attribute value of the error record to the maximum value defined in the data rule. This option is available for Domain Range rules that have a maximum defined.

  • Similarity

    Uses a similarity algorithm based on permitted domain values to find a value that is similar to the error record. If no similar value is found, the original value is used. This option is available for Domain List rules with character data type only.

  • Soundex

    Uses a Soundex algorithm based on permitted domain values to find a value that is similar to the error record. If no Soundex match is found, the original value is used. This option is available for Domain List rules with character data type only (see the example after this list).

  • Merge

    Uses the Match-Merge algorithm to merge duplicate records into a single row. You can use this option for Unique Key data rules only.

  • Set to Mode

    Uses the mode value to correct the error records if a mode value exists for the functional dependency partition that fails. This option is used for Functional Dependency data rules only.
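
As a minimal illustration of the idea behind the Similarity and Soundex strategies (and not the cleansing code that Warehouse Builder generates), the Oracle SOUNDEX function maps values that sound alike to the same code, which is how a misspelled value can be matched to a permitted domain value:

  -- 'Californa' (misspelled) and 'California' both return the Soundex code C416,
  -- so a Soundex-based strategy could replace the error value with the domain value.
  SELECT SOUNDEX('Californa')  AS error_value_code,
         SOUNDEX('California') AS domain_value_code
  FROM   dual;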

Summary

The Summary page provides a summary of the selected actions. When you are done assigning actions to the data rules, click Finish.

The correction schema is created and added to the Project Explorer. The correction objects and mappings are displayed under the module that you specify as the target module on the Select Target Module page of the Create Correction Wizard.

To create the correction tables in your target schemas, deploy the correction tables. To cleanse data, deploy and execute the correction mappings. The correction mappings have names that are prefixed with M_. For example, if your correction table is called EMPLOYEES, the correction mapping is called M_EMPLOYEES.

Viewing Correction Tables and Mappings

You can review the correction tables in the Data Object Editor to see the data rules and constraints created as part of the design of your table.

To view the correction mappings:

  1. Double-click the table or mapping to open the object in its respective editor.

  2. After the mapping is open, select View and then Auto Layout to view the entire mapping.

    Figure 20-21 displays a correction mapping generated by the Create Correction Wizard.

    Figure 20-21 Generated Correction Mapping


  3. Select the submapping ATTR_VALUE_1 and click the Visit Child Graph icon from the toolbar to view the submapping.

    The submapping is displayed as shown in Figure 20-22.

    Figure 20-22 Correction SubMapping


    The submapping is the element in the mapping that performs the actual correction and cleansing you specified in the Create Correction Wizard. In the middle of this submapping is the DOMAINSIMILARITY transformation, which was generated as a function by the Create Correction Wizard.

Using Data Rules

In addition to deriving data rules based on the results of data profiling, you can define your own data rules. You can bind a data rule to multiple tables within the project in which the data rule is defined. An object can contain any number of data rules.

You use the Design Center to create and edit data rules.

Once you create a data rule in Warehouse Builder, you can use it in any of the following scenarios.

Using Data Rules in Data Profiling

When you are using data profiling to analyze tables, you can use data rules to analyze how well data complies with a given rule and to collect statistics. From the results, you can derive a new data rule. For example, if data profiling determines that the majority of records in a column have a value of red, white, or blue, a new data rule can be derived that defines the color domain (red, white, and blue). This rule can then be reused to profile other tables, or reused in cleansing and auditing.

Using Data Rules in Data Cleansing and Schema Correction

Data rules can be used in two ways to cleanse data and correct schemas. The first way is to convert a source schema into a new target schema in which the structure of the new tables strictly adheres to the data rules. The new tables then have the correct data types, enforced constraints, and normalized schemas. The second way is in a correction mapping that validates the data in a source table against the data rules to determine which records comply and which do not. The analyzed data set is then corrected (for example, orphan records are removed and domain value inaccuracies are corrected) and the cleansed data set is loaded into the corrected target schema.

Using Data Rules in Data Auditing

Data rules are also used in data auditing. Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule, and they report defective data into auditing and error tables. In that sense they are like data-rule-based correction mappings, which also offer a report-only option for data that does not comply with the data rules. For more information about data auditors, see "About Data Auditors".

Types of Data Rules

Data rules in Warehouse Builder can be categorized as described in this section.

Domain List

A domain list rule defines a list of values that an attribute is allowed to have. For example, the Gender attribute can have 'M' or 'F'.

Domain Pattern List

A domain pattern list rule defines a list of patterns that an attribute is allowed to conform to. The patterns are defined in the Oracle Database regular expression syntax. An example pattern for a telephone number is as follows:

(^[[:space:]]*[0-9]{3}[[:punct:][:space:]]?[0-9]{4}[[:space:]]*$)
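
Before using a pattern in a rule, you can test it against sample data with the Oracle REGEXP_LIKE function. The table and column names below (CUSTOMERS, PHONE) are hypothetical; only the pattern comes from the example above:

  -- Returns the values that would violate the domain pattern rule.
  SELECT phone
  FROM   customers
  WHERE  NOT REGEXP_LIKE(phone,
           '^[[:space:]]*[0-9]{3}[[:punct:][:space:]]?[0-9]{4}[[:space:]]*$');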

Domain Range

A domain range rule defines a range of values that an attribute is allowed to have. For example, the value of the salary attribute can be between 100 and 10000.

Common Format

A common format rule defines a known common format that an attribute is allowed to conform to. This rule type has several subtypes: Telephone Number, IP Address, SSN, URL, and E-mail Address.

No Nulls

A no nulls rule specifies that the attribute cannot have null values. For example, the department_id attribute for an employee in the Employees table cannot be null.

Functional Dependency

A functional dependency rule defines a dependency between attributes, indicating that the data in the data object may be normalized.

Unique Key

A unique key data rule defines whether an attribute or a group of attributes is unique in the given data object. For example, the name of a department should be unique.

Referential

A referential data rule defines the type of relationship (1:x) that a value must have with another value. For example, the department_id attribute of the Departments table should have a 1:n relationship with the department_id attribute of the Employees table.

Name and Address

A name and address data rule uses the Warehouse Builder Name and Address support to evaluate a group of attributes as a name or address.

Custom

A custom data rule applies a SQL expression that you specify to its input parameters. For example, you can create a custom rule called VALID_DATE with two input parameters, START_DATE and END_DATE. A valid expression for this rule is: "THIS"."END_DATE" > "THIS"."START_DATE".
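
Before binding such a rule, you can preview the rows that would violate it with an ad hoc query in which the audited table stands in for "THIS". The table name EVENTS below is hypothetical:

  -- Rows where the custom expression "THIS"."END_DATE" > "THIS"."START_DATE"
  -- does not evaluate to TRUE are the potential rule violations.
  SELECT *
  FROM   events this
  WHERE  NOT (this.end_date > this.start_date)
     OR  this.end_date IS NULL
     OR  this.start_date IS NULL;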

Creating Data Rule Folders

Each data rule belongs to a data rule folder, which is a container object that groups related data rules. To create a data rule, you must first create a data rule folder.

To create a data rule folder, in the navigation tree, right-click Data Rules and select New. The Create Data Rule Folder dialog is displayed.

Create Data Rule Folder

The Name field represents the name of the data rule folder. Enter a name for the data rule folder. To rename a data rule folder, select the name and enter the new name.

In the Description field, enter an optional description for the data rule folder.

You can choose to create a data rule immediately after you create a data rule folder. Select the Proceed to Data Rule wizard option to launch the Create Data Rule Wizard that helps you create a data rule.

Click OK to close the Create Data Rule Folder dialog.

Creating Data Rules

A data rule is an expression that determines the legal data within a data object or legal relationships between data objects. You can provide a definition for a data rule or derive a data rule based on the results of data profiling. For more information about data rules, see "About Data Rules".

To create a data rule, right-click the data rule folder in which you want to create the data rule, and select New. The Welcome page of the Create Data Rule Wizard is displayed. Click Next. The wizard guides you through the following steps:

Naming the Data Rule

The Name field represents the name of the data rule. Use this field to enter a name for the data rule. If you are deriving a data rule, a default name is assigned to the data rule. You can accept the default name or enter a different name.

On the Name tab of the Edit Data Rule dialog, use the Name field to rename the data rule. Select the name and enter the new name. Renaming a data rule has no effect on the data rule bindings. The binding names are not changed and the data rule remains in effect for the table.

Use the Description field to specify an optional description for the data rule.

Defining the Rule

Use the Define Rule page or the Define Rule tab to provide details about the data rule. The top portion of the page displays the Type drop-down list that represents the type of data rule. When you are deriving a data rule, Warehouse Builder automatically populates the Type field and you cannot edit this value. When you are creating a data rule, expand the Type field to view the types of data rules and select the type you want to create. When you edit a data rule, the Type field is disabled as you cannot change the type of data rule once it is created. For more information about types of data rules, see "Types of Data Rules".

The bottom portion of this page specifies additional details about the data rule. The number and names of fields displayed here depend on the type of data rule you create. For example, if you select Custom, use the Attributes section to define the attributes required for the rule. Use the Ellipsis button on the Expression field to define a custom expression involving the attributes you defined in the Attributes section. If you select Domain Range as the type of data rule, the bottom portion of the page provides fields to specify the data type of the range, the minimum value, and the maximum value. When you are deriving a data rule, some of these fields are populated based on the profiling results from which you are deriving the rule. You can edit these values.

Summary Page

The Summary page displays the settings that you selected on the wizard pages. Review these settings. Click Back to modify any selections you made. Click Finish to create the data rule.

Data rules that are derived from the Data Profile Editor belong to the data rule folder called Derived_Data_Rules in the project that contains the data profile. Data rules that you create are part of the data rule folder under which you create the data rule.

Editing Data Rules

After you create a data rule, you can edit its definition. You can rename the data rule and edit its description. You cannot change the type of data rule. However, you can change the other parameters specified for the data rule. For example, for a Domain Range type of data rule, you can edit the data type of the range, the minimum range value, and the maximum range value.

To edit a data rule, in the Project Explorer, right-click the data rule and select Open Editor. You can also double-click the name of the data rule. The Edit Data Rule dialog is displayed. This dialog contains the following tabs:

Applying Data Rules

Applying a data rule to an object binds the definition of the data rule to the object. For example, binding a rule to the table Dept ensures that the rule is implemented for the specified attribute in the table. You apply a data rule using the Data Object Editor. You can also apply a derived data rule from the Data Rule panel of the Data Profile Editor.

The Apply Data Rule Wizard enables you to apply a data rule to an object. Open the Data Object Editor for the object to which you want to apply the data rule. Navigate to the Data Rules tab. If any data rules are bound to the data object, these are displayed on this tab. To apply a new data rule to this object, click Apply Rule. The Welcome page of the Apply Data Rule Wizard is displayed. Click Next. The wizard guides you through the following pages:

Select Rule

Use the Select Rule page to select the data rule that you want to apply to the object. This page lists all available data rules. The data rules are grouped under the nodes BUILT_IN, DERIVED_DATA_RULES, and other data rule folders that you create.

The BUILT_IN node contains the default data rules that are defined in the repository. These include rules such as foreign key, unique key, not null. The DERIVED_DATA_RULES node lists all the data rules that were derived as a result of data profiling.

Name and Description

The Name and Description page enables you to specify a name and an optional description for the applied data rule.

The Name field contains the name of the data rule that you selected on the Select Rule page. You can accept this name or enter a new name.

Use the Description field to enter an optional description for the applied data rule.

Bind Rule Parameters

Use the Bind Rule Parameters page to bind the data rule to a column in your data object. The Binding column lists all columns in the data object to which the data rule is being applied. Use the drop-down list on this column to select the column to which the rule is bound.

Summary

The Summary page displays the options that you chose on the wizard pages. Click Back to go to the previous pages and change some options. Click Finish to apply the data rule. This adds information about the data rule bindings to the object metadata.

Tuning the Data Profiling Process

Data profiling is a highly processor- and I/O-intensive process, and the execution time for profiling ranges from a few minutes to a few days. You can achieve the best possible data profiling performance by ensuring that the following conditions are satisfied:

Tuning Warehouse Builder for Better Data Profiling Performance

You can configure a data profile to optimize data profiling results. Use the configuration parameters to configure a data profile. For more information about configuration parameters, see "Configuration Parameters for Data Profiles".

Use the following guidelines to make your data profiling process faster.

  • Perform only the types of analysis that you require

    If you know that certain types of analysis are not required for the objects that you are profiling, use the configuration parameters to turn off these types of data profiling.

  • Analyze a smaller amount of data

    Use the WHERE Clause and Sample Rate configuration parameters to limit the amount of data that is profiled.

If the source data for profiling is stored in an Oracle Database, it is recommended that the source schema be located on the same database instance as the profile workspace. You can do this by installing the Warehouse Builder repository into the same Oracle Database instance as the source schema. This avoids using a database link to move data from the source to the profiling workspace.

Tuning the Oracle Database for Better Data Profiling Performance

To ensure good data profiling performance, the machine that runs the Oracle Database must have certain hardware capabilities. In addition to this, you must optimize the Oracle Database instance on which you are performing data profiling.

For efficient data profiling, the following considerations are applicable:

Multiple Processors

The machine that runs the Oracle Database needs multiple processors. Data profiling has been designed and tuned to take maximum advantage of the parallelism provided by the Oracle Database. When profiling large tables (more than ten million rows), it is highly recommended that you use a machine with multiple processors.

Warehouse Builder uses hints in the queries required to perform data profiling and picks up the degree of parallelism from the initialization parameter file of the Oracle Database. The default initialization parameter file contains parameters that take advantage of parallelism.
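
If the parallelism settings have been changed from their defaults, you may want to review them. The following statements are only an illustrative sketch; the values shown are assumptions and the appropriate settings depend on your hardware:

  -- Illustrative values only: size the parallel execution server pool and
  -- multiblock reads to match the machine running the profiling workload.
  ALTER SYSTEM SET parallel_max_servers = 16;
  ALTER SYSTEM SET parallel_threads_per_cpu = 2;
  ALTER SYSTEM SET db_file_multiblock_read_count = 16;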

Memory

It is important that you ensure a high memory hit ratio during data profiling. You can do this by allocating a larger System Global Area (SGA). It is recommended that the size of the System Global Area be configured to be no less than 500 MB. If possible, configure it to 2 GB or 3 GB.

For advanced database users, it is recommended that you observe the buffer cache hit ratio and the library cache hit ratio. Aim for a buffer cache hit ratio higher than 95% and a library cache hit ratio higher than 99%.
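
The following queries are a common way to observe these ratios in an Oracle Database; they are a monitoring sketch and are not something that Warehouse Builder runs for you:

  -- Buffer cache hit ratio: 1 - physical reads / (db block gets + consistent gets)
  SELECT 1 - (phy.value / (dbg.value + cns.value)) AS buffer_cache_hit_ratio
  FROM   v$sysstat phy, v$sysstat dbg, v$sysstat cns
  WHERE  phy.name = 'physical reads'
  AND    dbg.name = 'db block gets'
  AND    cns.name = 'consistent gets';

  -- Library cache hit ratio
  SELECT SUM(pinhits) / SUM(pins) AS library_cache_hit_ratio
  FROM   v$librarycache;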

I/O System

The capabilities of the I/O system have a direct impact on data profiling performance. Data profiling frequently performs full table scans and massive joins. Because today's CPUs can easily out-drive the I/O system, you must carefully design and configure the I/O subsystem. Keep in mind the following considerations that aid better I/O performance.

  • You need a large number of disk spindles to support uninterrupted CPU and I/O cooperation. If you have only a few disks, the I/O system is not geared towards a high degree of parallel processing. It is recommended to have a minimum of two disks for each CPU.

  • Configure the disks. It is recommended that you create logical stripe volumes on the existing disks, each striping across all available disks. Use the following formula to calculate the stripe width.

    MAX(1,DB_FILE_MULTIBLOCK_READ_COUNT/number_of_disks)*DB_BLOCK_SIZE

    Here, DB_FILE_MULTIBLOCK_READ_COUNT and DB_BLOCK_SIZE are parameters that you set in your database initialization parameter file. You can also use a stripe width that is a multiple of the value returned by the formula (a worked example follows this list).

    To create and maintain logical volumes, you need volume management software such as Veritas Volume Manager or Sun Storage Manager. If you are using Oracle Database 10g and do not have volume management software, you can use the Automatic Storage Management (ASM) feature of the Oracle Database to spread the workload across disks.

  • Create different stripe volumes for different tablespaces. It is possible that some of the tablespaces occupy the same set of disks.

    For data profiling, the USERS and the TEMP tablespaces are normally used at the same time. So you can consider placing these tablespaces on separate disks to reduce interference.
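
As a worked example of the stripe width formula, assume illustrative values of DB_FILE_MULTIBLOCK_READ_COUNT = 16, eight disks, and DB_BLOCK_SIZE = 8192 bytes:

  MAX(1, 16/8) * 8192 = 2 * 8192 = 16384 bytes (16 KB)

A stripe width of 16 KB, or a multiple of it such as 32 KB or 64 KB, would therefore be appropriate for this configuration.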

Using Data Auditors

Data auditors are Warehouse Builder objects that you can use to continuously monitor your source schema to ensure that data adheres to the defined data rules. You can monitor an object only if you have defined data rules for the object. You can create data auditors for tables, views, materialized views, and external tables.

Data auditors can be deployed and executed as standalone processes, but they are typically run to monitor data quality in an operational environment like a data warehouse or ERP system and, therefore, can be added to a process flow and scheduled.

Note:

You cannot import metadata for data auditors in Merge mode. For more information about import mode options, see "Import Option".

Creating Data Auditors

Data auditors enable you to monitor the quality of data in an operational environment. For more information about data auditors, see "About Data Auditors".

Use the Create Data Auditor Wizard to create data auditors. Data auditors are part of an Oracle module in a project.

Use the following steps to create a data auditor:

  1. Expand the Oracle module in which you want to create the data auditor.

  2. Right-click Data Auditors and select New.

    The Create Data Auditor Wizard is displayed and it guides you through the following steps:

Naming the Data Auditor

The Name field represents the name of the data auditor. The name of the data auditor should be unique within the Oracle module to which it belongs. Use the Name tab of the Edit Data Auditor dialog to rename a data auditor. You select the name and then enter the new name.

The Description field represents the optional description that you can provide for the data auditor.

Selecting Objects

Use the Select Objects page or the Select Objects tab to select the data objects that you want to audit. The Available section displays the list of objects available for audit. This list contains only objects that have data rules bound to them. The Selected section displays the objects that are selected for auditing. Use the shuttle buttons to move objects from the Available section to the Selected section. On the Select Objects tab, the Selected section lists the objects currently selected for auditing; you can add more objects or remove existing objects using the shuttle buttons.

Choosing Actions

Use the Choose Action page or the Choose Action tab to select the action to be taken for records that do not comply with the data rules that are bound to the selected objects. Provide information in the following sections of this page: Error Threshold Mode and Data Rules.

Error Threshold Mode

Error threshold mode is the mode used to determine the compliance of data to data rules in the objects. You can select the following options:

  • Percent: The data auditor will set the audit result based on the percentage of records that do not comply with the data rule. This percentage is specified in the rule's defect threshold.

  • Six Sigma: The data auditor will set the audit result based on the Six Sigma values for the data rules. If the calculated Six Sigma value for any rule is less than the specified Sigma Threshold value, the data auditor sets AUDIT_RESULT to 2.

Data Rules

This section contains the following details:

  • Data Rule: The name of the table suffixed with the name of the bound data rule.

  • Rule Type: The type of data rule.

  • Action: The action to be performed if data in the source object does not comply with the data rule. Select Report to ensure that the data rule is audited. Select Ignore to cause the data rule to be ignored.

  • Defect Threshold: The percentage of records that must comply with the data rule to ensure successful auditing. Specify a value between 1 and 100. This value is ignored if you set the Error Threshold Mode to Six Sigma.

  • Sigma Threshold: The required success rate. Specify a number between 0 and 7. If you set the value to 7, no failures are allowed. This value is ignored if you set the Error Threshold Mode to Percent.
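
As a worked example of these thresholds, assume a rule is audited against 10,000 rows and 200 of them violate the rule, so 98 percent of the records comply. In Percent mode, a Defect Threshold of 99 would cause the audit to fail, while a threshold of 95 would not. In Six Sigma mode, the same result corresponds to 20,000 defects per million opportunities, which under the conventional 1.5-sigma shift is roughly a 3.6 sigma level; a Sigma Threshold of 4 would then fail and a threshold of 3 would pass. The sigma conversion shown here uses the standard industry convention and is included only for orientation; the exact value reported is calculated by the data auditor.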

Summary Page

The Summary page lists the options that you selected in the previous pages of the wizard. To modify any of the selected options, click Back. To proceed with the creation of the data auditor, click Finish. A data auditor is created using the settings displayed on the Summary page.

Editing Data Auditors

After you create a data auditor, you can edit it and modify any of its properties. To edit a data auditor, in the Design Center, right-click the data auditor and select Open Editor. The Edit Data Auditor dialog is displayed. Use the following tabs to modify the properties of the data auditor:

Reconciling Objects

If the definitions of the objects included in the data auditor were modified after the data auditor was created, the objects in the data auditor will not be identical with the actual objects. The table on this tab lists the audit objects and the source objects that were included for audit. To reconcile the definition of an audit object with the source object, select the check box on the row that corresponds to the changed object and click Reconcile. This reconciles the audit object definition with the source object.

Auditing Objects Using Data Auditors

After you create a data auditor, you can use it to monitor the data in your data objects. This ensures that the data rule violations for the objects are detected. When you run a data auditor, any records that violate the data rules defined on the data objects are written to the error tables.

There are two ways of using data auditors:

Data Auditor Execution Results

After you run a data auditor, the Job Details dialog displays the details of the execution. The Job Details dialog contains two tabs: Input Parameters and Execution Results. Figure 20-23 displays the Execution Results tab of the Job Details dialog. Note that the Job Details dialog is displayed only when you set the deployment preference Show Monitor to true. For more information about deployment preferences, see "Deployment Preferences".

Figure 20-23 Data Auditor Execution Results


The Input Parameters tab contains the values of input parameters used to run the data auditor. The Execution Results tab displays the results of running the data auditor. This tab contains two sections: Row Activity and Output Parameters.

The Row Activity section contains details about the inserts into the error table for each step. In Figure 20-23, the data rule called E_NOT_NULL inserted one record into the error table. Note that when more than one data rule is specified, a multi-table insert may be used in the data auditor. In this case, the row counts may not be accurate.

The Output Parameters section contains the following three parameters:

  • AUDIT_RESULT: Indicates the result of running the data auditor. The possible values for this parameter are as follows:

    0: No data rule violations occurred.

    1: At least one data rule violation occurred, but no data rule failed to meet the minimum quality threshold as defined in the data auditor. For more information about setting the threshold, see the section on Data Rules in "Choosing Actions".

    2: At least one data rule failed to meet the minimum quality threshold.

  • EO_<data_rule_name>: Represents the calculated error quality for the specified data rule. 0 indicates all errors and 100 indicates no errors.

  • SO_<data_rule_name>: Represents the Six Sigma quality calculated for the specified data rule.
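
For example, if a data auditor runs against 1,000 rows and 15 of them violate a rule named E_NOT_NULL, then EO_E_NOT_NULL would be 98.5 (98.5 percent of the rows are free of that error), SO_E_NOT_NULL would contain the corresponding Six Sigma value, and AUDIT_RESULT would be 1 or 2 depending on whether that result meets the threshold defined for the rule in the data auditor.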

Manually Running Data Auditors

You run a data auditor to check if the data in the data object adheres to the data rules defined for the object. You can run data auditors from the Design Center or the Control Center Manager. To run a data auditor from the Design Center, right-click the data auditor and select Start. In the Control Center Manager, select the data auditor, and from the File menu, select Start. The results are displayed in the Job Details panel described in "Data Auditor Execution Results".

Automatically Running Data Auditors

You can automate the process of running a data auditor using the following steps:

  1. Create a process flow that contains a Data Auditor Monitor activity.

  2. Schedule this process flow to run at a predefined time.

    For more information about scheduling objects, see "Process for Defining and Using Schedules".

Figure 20-24 displays a process flow that contains a Data Auditor Monitor activity. In this process flow, LOAD_EMP_MAP is a mapping that loads data into the EMP table. If the data load is successful, the data auditor EMP_DATA_AUDIT is run. The data auditor monitors the data in the EMP table based on the data rules defined for the table.

Figure 20-24 Data Auditor Monitor Activity in a Process Flow


Configuring Data Auditors

During the configuration phase, you assign physical deployment properties to the data auditor you created by setting the configuration parameters. The Configuration Properties dialog enables you to configure the physical properties of the data auditor.

To configure a data auditor:

  1. From the Project Explorer, expand the Databases node and then the Oracle node.

  2. Right-click the name of the data auditor you want to configure and select Configure.

    The Configuration Properties dialog is displayed.

  3. Configure the parameters listed in the following sections.

Run Time Parameters

Default Purge Group: This parameter is used when executing the package. Each audit record in the runtime schema is assigned to the purge group specified.

Bulk size: The number of rows to be fetched as a batch while processing cursors.

Analyze table sample percentage: The percentage of rows to be sampled when the target tables are analyzed. You analyze target tables to gather statistics that you can use to improve performance while loading data into the target tables.

Commit frequency: The number of rows processed before a commit is issued.

Maximum number of errors: The maximum number of errors allowed before Warehouse Builder terminates the execution of this step.

Default Operating Mode: The operating mode used. The options you can select are Row based, Row based (target only), Set based, Set based fail over to row based, Set based fail over to row based (target only).

Default Audit Level: Use this parameter to indicate the audit level used when executing the package. When the package is run, the amount of audit information captured in the runtime schema depends on the value set for this parameter.

The options you can select are as follows:

ERROR DETAILS: At runtime, error information and statistical auditing information is recorded.

COMPLETE: All auditing information is recorded at runtime. This generates a huge amount of diagnostic data which may quickly fill the allocated tablespace.

NONE: No auditing information is recorded at runtime.

STATISTICS: At runtime, statistical auditing information is recorded.

Data Auditor

This category uses the same name as the data auditor and contains the following generic data auditor configuration parameters.

Generation comments: Specify additional comments for the generated code.

Threshold Mode: Specify the mode that should be used to measure failure thresholds. The options are PERCENTAGE and SIX SIGMA.

Language: The language used to define the generated code. The options are PL/SQL (default) and UNDEFINED. Ensure that PL/SQL (default) is selected.

Deployable: Select this option to indicate that you want to deploy this data auditor. Warehouse Builder generates code only if the data auditor is marked as deployable.

Referred Calendar: Specify the schedule to associate with the data auditor. The schedule defines when the data auditor will run.

Code Generation Options

ANSI SQL syntax: Select this option to use ANSI SQL code in the generated code. If this option is not selected, Warehouse Builder generates Oracle SQL syntax.

Commit control: Specifies how commit is performed. The options available for this parameter are: Automatic, Automatic Correlated, and Manual. Ensure that this parameter is set to Automatic.

Enable Parallel DML: Select this option to enable parallel DML at runtime.

Analyze table statistics: Select this option to generate the statement used to collect statistics for the data auditor.

Optimized Code: Select this option to indicate to Warehouse Builder that it should generate optimized code.

Generation Mode: Select the mode in which Warehouse Builder should generate optimized code. The options you can select are as follows: All Operating Modes, Row based, Row based (target only), Set based, Set based fail over to row based, and Set based fail over to row based (target only).

Use Target Load Ordering: Select this option to generate code for target load ordering.

Error Trigger: Specify the name of the error trigger procedure.

Bulk Processing code: Select this option to generate bulk processing code.