23 Understanding Data Quality Management

Today, more than ever, organizations realize the importance of data quality. By ensuring that quality data is stored in your data warehouse or business intelligence application, you also ensure the quality of information for dependent applications and analytics.

Oracle Warehouse Builder offers a set of features that assist you in creating data systems that provide high quality information to your business users. You can implement a quality process that assesses, designs, transforms, and monitors quality. Within these phases, you use specific Warehouse Builder functionality to deliver higher quality information.

This chapter contains the following topics:

  • About the Data Quality Management Process

  • About Data Profiling

  • About the Data Profile Editor

  • About Six Sigma

  • About Data Correction and Augmentation

  • About Data Rules

  • About Quality Monitoring

  • Performing Data Profiling

About the Data Quality Management Process

Quality data is crucial to decision-making and planning. The aim of building a data warehouse is to have an integrated, single source of data that can be used to make business decisions. Since the data is usually sourced from a number of disparate systems, it is important to ensure that the data is standardized and cleansed before loading into the data warehouse.

Warehouse Builder provides functionality that enables you to effectively manage data quality by assessing, transforming, and monitoring your data. Using Warehouse Builder for data management provides the following benefits:

  • Provides an end-to-end data quality solution.

  • Enables you to include data quality and data profiling as an integral part of your data integration process.

  • Stores metadata regarding the quality of your data alongside your data definitions.

  • Automatically generates the mappings that you can use to correct data. These mappings are based on the business rules that you choose to apply to your data and decisions you make on how to correct data.

Phases in the Data Quality Lifecycle

Ensuring data quality involves the following phases:

  • Quality Assessment

  • Quality Design

  • Quality Transformation

  • Quality Monitoring

Figure 23-1 shows the phases involved in providing high quality information to your business users.

Figure 23-1 Phases Involved in Providing Quality Information


Quality Assessment

In the quality assessment phase, you determine the quality of the source data. The first step in this phase is to import the source data, which could be stored in different sources, into Warehouse Builder. You can import metadata and data from both Oracle and non-Oracle sources.

After you load the source data, you use data profiling to assess its quality. Data profiling is the process of uncovering data anomalies, inconsistencies, and redundancies by analyzing the content, structure, and relationships within the data. The analysis and data discovery techniques form the basis for data monitoring.

Quality Design

The quality design phase consists of designing your quality processes. You can specify the legal data within a data object or legal relationships between data objects using data rules.

You can also correct and augment your data using data quality operators.

As part of the quality design phase, you also design the transformations that ensure data quality. These transformations could be mappings that are generated by Warehouse Builder as a result of data profiling or mappings you create.


Quality Transformation

The quality transformation phase consists of running the correction mappings you designed to correct the source data.

Quality Monitoring

Quality monitoring is the process of examining warehouse data over time and alerting you when the data violates business rules set for the data.

About Data Profiling

Data profiling is the first step for any organization to improve information quality and provide better decisions. It is a robust data analysis method available in Warehouse Builder that you can use to discover and measure defects in your data before you start working with it. Because of its integration with the ETL features in Warehouse Builder and other data quality features, such as data rules and built-in cleansing algorithms, you can also generate data cleansing mappings and schema correction scripts. This enables you to automatically correct any inconsistencies, redundancies, and inaccuracies in both the data and metadata.

Data profiling enables you to discover many important things about your data. Some common findings include the following:

  • A domain of valid product codes

  • A range of product discounts

  • Columns that hold the pattern of an e-mail address

  • A one-to-many relationship between columns

  • Anomalies and outliers within columns

  • Relations between tables even if they are not documented in the database

To begin the process of data profiling, you must first create a data profile using the Design Center. You can then profile the objects contained in the data profile and then create correction tables and mappings.

Benefits of Data Profiling

Using the data profiling functionality in Warehouse Builder enables you to:

  • Profile data from any source or combination of sources that Warehouse Builder can access.

  • Explore data profiling results in tabular or graphical format.

  • Drill down into the actual data related to any profiling result.

  • Derive data rules, either manually or automatically, based on the data profiling results.

  • Attach any data rule to a target object and select an action to perform if the rule fails.

  • Create a data auditor from a data rule to continue monitoring the quality of data being loaded into an object.

  • Derive quality indices such as six-sigma valuations.

  • Profile or test any data rules that you want to verify before putting them in place.

Types of Data Profiling

Following the selection of data objects, determine the aspects of your data that you want to profile and analyze.

Data profiling offers three main types of analysis: attribute analysis, functional dependency, and referential analysis.

In addition to these types of analysis, you can create custom profiling processes using data rules, which allow you to validate custom rules against the actual data and obtain a score of their accuracy.

Figure 23-2 displays a representation of the types of data profiling and how you can perform each type.

Figure 23-2 Data Profiling Overview


Attribute Analysis

Attribute analysis seeks to discover both general and detailed information about the structure and content of data stored within a given column or attribute. Attribute analysis looks for information about patterns, domains, data types, and unique values.

Attribute analysis consists of pattern analysis, domain analysis, data type analysis, and unique key analysis.

Pattern Analysis 

Pattern analysis attempts to discover patterns and common types of records by analyzing the string of data stored in the attribute. It identifies the percentages of your data that comply with a certain regular expression format pattern found in the attribute. Using these pattern results, you can create data rules and constraints to help clean up current data problems. Some commonly identified patterns include dates, e-mail addresses, phone numbers, and social security numbers.

Table 23-1 shows a sample attribute, Job Code, that could be used for pattern analysis.

Table 23-1 Sample Columns Used for Pattern Analysis

Job ID   Job Code
7        337-A-55
9        740-B-74
10       732-C-04
20       43-D-4


Table 23-2 shows the possible results from pattern analysis, where D represents a digit and X represents a character. After looking at the results and knowing that it is company policy for all job codes to be in the format DDD-X-DD, you can derive a data rule that requires all values in this attribute to conform to this pattern.

Table 23-2 Pattern Analysis Results

Job Code   % Occurred
DDD-X-DD   75%
DD-X-D     25%
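
The following is a minimal sketch, outside of Warehouse Builder, of how pattern compliance such as that shown in Table 23-2 could be computed; the sample values and the regular expression for the DDD-X-DD pattern are illustrative assumptions, not part of the product.

    import re

    # Hypothetical sample of Job Code values (as in Table 23-1).
    job_codes = ["337-A-55", "740-B-74", "732-C-04", "43-D-4"]

    # Regular expression corresponding to the company-policy pattern DDD-X-DD.
    pattern = re.compile(r"^\d{3}-[A-Z]-\d{2}$")

    compliant = sum(1 for code in job_codes if pattern.match(code))
    print(f"DDD-X-DD compliance: {compliant / len(job_codes):.0%}")  # prints 75%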


Domain Analysis 

Domain analysis identifies a domain, or set of commonly used values, within the attribute by capturing the most frequently occurring values. For example, the Status column in the Customers table is profiled and the results reveal that 90% of the values are among the following: "MARRIED", "SINGLE", "DIVORCED". Further analysis and drilling down into the data reveal that the other 10% consists mostly of misspelled versions of these words, with a few exceptions. The profiling configuration determines when a set of values qualifies as a domain, so review the configuration before accepting domain values. You can then let Warehouse Builder derive a rule that requires the data stored in this attribute to be one of the three values that qualified as a domain.
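
As a rough illustration of the idea behind domain discovery (not the Warehouse Builder implementation), the following sketch counts distinct values in a column and treats values above an assumed frequency threshold as the domain; the sample data and the 20% threshold are hypothetical.

    from collections import Counter

    # Hypothetical Status values; the threshold is an assumption for illustration.
    status_values = ["MARRIED", "SINGLE", "MARRIED", "DIVORCED", "SINGEL",
                     "MARRIED", "SINGLE", "DIVORCED", "MARRIED", "SINGLE"]
    DOMAIN_THRESHOLD = 0.20  # a value must cover at least 20% of the rows

    counts = Counter(status_values)
    total = len(status_values)
    domain = {value for value, count in counts.items()
              if count / total >= DOMAIN_THRESHOLD}
    compliance = sum(counts[v] for v in domain) / total
    print(sorted(domain), f"{compliance:.0%} compliant")  # 90% compliant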

Datatype Analysis 

Data type analysis enables you to discover information about the data types found in the attribute. This type of analysis reveals metrics such as minimum and maximum character length values, as well as scale and precision ranges. In some cases, a database column is of data type VARCHAR2 although the values in the column are all numbers; in that case, you may want to ensure that you load only numbers. Using data type analysis, you can have Warehouse Builder derive a rule that requires all data stored within an attribute to be of the same data type.
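
The following is a minimal sketch of the kind of check data type analysis performs, in this case detecting that a character column actually stores only numeric values; the sample column values are hypothetical.

    def numeric_fraction(values):
        """Return the fraction of non-null values that can be parsed as numbers."""
        non_null = [v for v in values if v is not None]
        numeric = 0
        for v in non_null:
            try:
                float(v)
                numeric += 1
            except ValueError:
                pass
        return numeric / len(non_null) if non_null else 0.0

    # Hypothetical VARCHAR2 column that actually holds only numbers.
    print(f"{numeric_fraction(['100', '250', '42', '7.5']):.0%} numeric")  # 100% numeric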

Unique Key Analysis 

Unique key analysis provides information to assist you in determining whether or not an attribute is a unique key. It does this by looking at the percentages of distinct values that occur in the attribute. You might determine that attributes with a minimum of 70% distinct values should be flagged for unique key analysis. For example, using unique key analysis you could discover that 95% of the values in the EMP_ID column are unique. Further analysis of the other 5% reveals that most of these values are either duplicates or nulls. You could then derive a rule that requires that all entries into the EMP_ID column be unique and not null.
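
As a simple illustration of the distinct-value measure used in unique key analysis, the sketch below computes the percentage of distinct values in a column; the sample values and the 70% flagging threshold are assumptions taken from the discussion above.

    def distinct_percentage(values):
        """Percentage of rows whose value is distinct within the column."""
        return 100.0 * len(set(values)) / len(values) if values else 0.0

    # Hypothetical EMP_ID values containing one duplicate and one null.
    emp_ids = [101, 102, 103, 104, 104, None]
    pct = distinct_percentage(emp_ids)
    print(f"{pct:.0f}% distinct -> flag for unique key analysis: {pct >= 70}")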

Functional Dependency

Functional dependency analysis reveals information about column relationships. This enables you to search for things such as one attribute determining another attribute within an object.

Table 23-3 shows the contents of the Employees table in which the attribute Dept. Location is dependent on the attribute Dept. Number. Note that the attribute Dept. Number is not dependent on the attribute Dept. Location.

Table 23-3 Employees Table

ID   Name      Salary   Dept Number   Dept Location
10   Alison    1000     10            SF
20   Rochnik   1000     11            London
30   Meijer    300      12            LA
40   John      500      13            London
50   George    200      13            London
60   Paul      600      13            London
70   Ringo     100      13            London
80   Yoko      600      13            London
90   Jones     1200     10            SF
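
A minimal sketch of the functional dependency check illustrated by Table 23-3 follows; the row data is taken from the table above, but the code itself is only an illustration of the concept, not Warehouse Builder's algorithm.

    from collections import defaultdict

    # (Dept Number, Dept Location) pairs from Table 23-3.
    rows = [(10, "SF"), (11, "London"), (12, "LA"), (13, "London"),
            (13, "London"), (13, "London"), (13, "London"), (13, "London"),
            (10, "SF")]

    locations_per_dept = defaultdict(set)
    for dept, location in rows:
        locations_per_dept[dept].add(location)

    # Dept Number -> Dept Location holds if every department maps to one location.
    holds = all(len(locs) == 1 for locs in locations_per_dept.values())
    print("Dept Location is functionally dependent on Dept Number:", holds)  # True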


Referential Analysis

Referential analysis attempts to detect aspects of your data objects that refer to other objects. The purpose behind this type of analysis is to provide insight into how the object you are profiling is related or connected to other objects. Because you are comparing two objects in this type of analysis, one is often referred to as the parent object and the other as the child object. Some of the common things detected include orphans, childless objects, redundant objects, and joins. Orphans are values that are found in the child object, but not found in the parent object. Childless objects are values that are found in the parent object, but not found in the child object. Redundant attributes are values that exist in both the parent and child objects.

Table 23-4 and Table 23-5 show the contents of two tables that are candidates for referential analysis. Table 23-4 is the child object and Table 23-5 is the parent object.

Table 23-4 Employees Table (Child)

ID   Name      Dept. Number   City
10   Alison    17             NY
20   Rochnik   23             SF
30   Meijer    23             SF
40   Jones     15             SD


Table 23-5 Department Table (Parent)

Dept. Number   Location
17             NY
18             London
20             SF
23             SF
55             HK


Referential analysis of these two objects would reveal that Dept. Number 15 from the Employees table is an orphan and Dept. Numbers 18, 20, and 55 from the Department table are childless. It would also reveal a join on the Dept. Number column.

Based on these results, you could derive referential rules that determine the cardinality between the two tables.
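
The set operations below give a minimal sketch of how orphans, childless values, and joins can be identified for the example in Table 23-4 and Table 23-5; the key values are taken from those tables, and the code is illustrative rather than Warehouse Builder's implementation.

    # Dept. Number values from the child (Employees) and parent (Department) tables.
    child_depts = {17, 23, 15}
    parent_depts = {17, 18, 20, 23, 55}

    orphans = child_depts - parent_depts    # in the child but not the parent
    childless = parent_depts - child_depts  # in the parent but not the child
    joins = child_depts & parent_depts      # present in both tables

    print("Orphans:", sorted(orphans))      # [15]
    print("Childless:", sorted(childless))  # [18, 20, 55]
    print("Joins:", sorted(joins))          # [17, 23]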

Data Rule Profiling

In addition to attribute analysis, functional dependency, and referential analysis, Warehouse Builder offers data rule profiling. Data rule profiling enables you to create rules to search for profile parameters within or between objects.

This is very powerful because it enables you to validate rules that apparently exist and are defined by the business users. By creating a data rule and then profiling with this rule, you can verify whether the data actually complies with the rule, and whether the rule needs amending or the data needs cleansing.

For example, you could create a rule that Income = Salary + Bonus for the Employee table shown in Table 23-6. You can then catch errors such as the one for employee Alison.

Table 23-6 Sample Employee Table

ID   Name      Salary   Bonus   Income
10   Alison    1000     50      1075 X
20   Rochnik   1000     75      1075
30   Meijer    300      35      335
40   Jones     1200     500     1700
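
A minimal sketch of checking such a rule against the rows of Table 23-6 is shown below; the row data comes from the table, while the code is only an illustration of the rule Income = Salary + Bonus, not of how Warehouse Builder evaluates data rules.

    # Rows from Table 23-6; the rule under test is Income = Salary + Bonus.
    employees = [
        {"id": 10, "name": "Alison",  "salary": 1000, "bonus": 50,  "income": 1075},
        {"id": 20, "name": "Rochnik", "salary": 1000, "bonus": 75,  "income": 1075},
        {"id": 30, "name": "Meijer",  "salary": 300,  "bonus": 35,  "income": 335},
        {"id": 40, "name": "Jones",   "salary": 1200, "bonus": 500, "income": 1700},
    ]

    violations = [e["name"] for e in employees
                  if e["income"] != e["salary"] + e["bonus"]]
    print("Rule violations:", violations)  # ['Alison']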


About the Data Profile Editor

The Data Profile Editor provides a single access point for managing and viewing data profile information as well as correcting metadata and data. It combines the functionality of a data profiler, a target schema generator, and a data correction generator. As a data profiler, it enables you to perform attribute analysis and structural analysis of selected objects. As a target schema generator, it enables you to generate a target schema based on the profile analysis and source table rules. Finally, as a data correction generator, it enables you to generate mappings and transformations to provide data correction.

Figure 23-3 displays the Data Profile Editor.

Figure 23-3 Data Profile Editor


The Data Profile Editor consists of the following:

  • Menu Bar

  • Toolbars

  • Object Tree

  • Property Inspector

  • Monitors Panel

  • Profile Results Canvas

  • Data Drill Panel

  • Data Rule Panel

Refer to the online help for detailed information about the contents of each panel.

About Six Sigma

Warehouse Builder provides Six Sigma results embedded within the other data profiling results to provide a standardized approach to data quality.

What is Six Sigma?

Six Sigma is a methodology that attempts to standardize the concept of quality in business processes. It achieves this goal by statistically analyzing the performance of business processes. The goal of Six Sigma is to improve the performance of these processes by identifying the defects, understanding them, and eliminating the variables that cause these defects.

Six Sigma metrics provide a quantitative measure of the number of defects for each 1,000,000 opportunities. The term "opportunities" can be interpreted as the number of records. The perfect score is 6.0, which is achieved when there are only 3.4 defects for each 1,000,000 opportunities. The score is calculated using the following formulas:

  • Defects Per Million Opportunities (DPMO) = (Total Defects / Total Opportunities) * 1,000,000

  • Defects (%) = (Total Defects / Total Opportunities)* 100%

  • Yield (%) = 100 - %Defects

  • Process Sigma = NORMSINV(1-((Total Defects) / (Total Opportunities))) + 1.5

    where NORMSINV is the inverse of the standard normal cumulative distribution.
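
A minimal sketch of the Process Sigma formula follows; it uses the inverse standard normal function from scipy in place of NORMSINV and illustrates the formula above rather than Warehouse Builder's internal calculation.

    from scipy.stats import norm  # norm.ppf is the inverse standard normal CDF (NORMSINV)

    def process_sigma(total_defects, total_opportunities):
        """Process Sigma = NORMSINV(1 - defect rate) + 1.5."""
        defect_rate = total_defects / total_opportunities
        return norm.ppf(1 - defect_rate) + 1.5

    # 3.4 defects per 1,000,000 opportunities corresponds to a score of about 6.0.
    print(round(process_sigma(3.4, 1_000_000), 1))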

Six Sigma Metrics for Data Profiling

Six Sigma metrics are also provided for data profiling in Warehouse Builder. When you perform data profiling, the number of defects and anomalies discovered are shown as Six Sigma metrics. For example, if data profiling finds that a table has a row relationship with a second table, the number of records in the first table that do not adhere to this row-relationship can be described using the Six Sigma metric.

Six Sigma metrics are calculated for the following measures in the Data Profile Editor:

  • Aggregation: For each column, the number of null values (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented data type (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented length (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented scale (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented precision (defects) to the total number of rows in the table (opportunities).

  • Patterns: For each column, the number of values that do not comply with the common format (defects) to the total number of rows in the table (opportunities).

  • Domains: For each column, the number of values that do not comply with the documented domain (defects) to the total number of rows in the table (opportunities).

  • Referential: For each relationship, the number of values that do not comply with the documented foreign key (defects) to the total number of rows in the table (opportunities).

  • Referential: For each column, the number of values that are redundant (defects) to the total number of rows in the table (opportunities).

  • Unique Key: For each unique key, the number of values that do not comply with the documented unique key (defects) to the total number of rows in the table (opportunities).

  • Unique Key: For each foreign key, the number of rows that are childless (defects) to the total number of rows in the table (opportunities).

  • Data Rule: For each data rule applied to the data profile, the number of rows that fail the data rule to the number of rows in the table.

About Data Correction and Augmentation

Warehouse Builder enables you to automatically create corrected data objects and correction mappings based on the results of data profiling. On top of these automated corrections that make use of the underlying Warehouse Builder architecture for data quality, you can create your own data quality mappings to correct and cleanse source data.

To perform data correction and augmentation, use one of the following methods:

  • Automatic data correction based on data profiling results

  • Data correction and augmentation using operators

Automatic Data Correction Based on Data Profiling Results

When you perform data profiling, Warehouse Builder generates corrections for the objects that you profiled. You can then decide to create corrected objects based on results of data profiling. The corrections are in the form of data rules that can be bound to the corrected object.

Types of Corrections for Source Data

You can perform the following types of corrections on source data objects:

  • Schema correction

    Schema correction creates scripts that you can use to create a corrected set of source data objects with data rules applied to them. The corrected data objects adhere to the data rules derived from the results of data profiling.

    The correction tables have names that are prefixed with TMP__. For example, when you profile the EMPLOYEES table, the correction table will be called TMP__EMPLOYEES.

  • Data Correction

    Data correction is the process of creating correction mappings to remove anomalies and inconsistencies in the source data before loading it into the corrected data objects. Correction mappings enforce the data rules defined on the data objects. While moving data from the "dirty" profile source tables into the corrected tables, these mappings correct records that do not comply with the data rules.

    The name of the correction mapping is the object name prefixed with M_. For example, the correction mapping for the EMPLOYEE table is called M_EMPLOYEE.

About Performing Data Correction

To perform data correction on source data, you must specify the following:

  • Data correction actions

  • Cleansing strategies for data correction

Data Correction Actions

Based on the data profiling results, Warehouse Builder derives a set of data rules that you can use to cleanse the source data. You can automatically generate corrections based on these data rules by performing data correction actions.

For each data rule derived as a result of data profiling, you must choose a correction action that specifies how data values that are not accepted due to data rule enforcement should be handled. The correction actions you can choose are:

  • Ignore: The data rule is ignored and, therefore, no values are rejected based on this data rule.

  • Report: The data rule is run only after the data has been loaded for reporting purposes. It is similar to the Ignore option, except that a report is created that contains the values that do not adhere to the data rules. This action can be used for some rule types only.

  • Cleanse: The values rejected by this data rule are moved to an error table where cleansing strategies are applied. When you select this option, you must specify a cleansing strategy.

    See "Cleansing Strategy" for details about specifying cleansing strategies.

Cleansing Strategies for Data Correction

When you decide to automatically generate corrected objects based on data profiling results, you must specify how inconsistent data from the source should be cleansed before being stored in the corrected object. To do this, you specify a cleansing strategy for each data rule that is applied to the correction object. Error tables are used to store the records that do not conform to the data rule.

The cleansing strategy you use depends on the type of data rule and the rule configuration. Table 23-7 describes the cleansing strategies and lists the types of data rules for which each strategy is applicable.

Table 23-7 Cleansing Strategies for Data Correction

Remove: Does not populate the target table with error records. Applies to: All.

Custom: Creates a function in the target table that contains a header, but no implementation details. You must add the implementation details to this function. Applies to: Domain List, Domain Pattern List, Domain Range, Common Format, No Nulls, Name and Address, Custom.

Set to Min: Sets the attribute value of the error record to the minimum value defined in the data rule. Applies to: Domain Range rules that have a minimum value defined.

Set to Max: Sets the attribute value of the error record to the maximum value defined in the data rule. Applies to: Domain Range rules that have a maximum defined.

Similarity: Uses a similarity algorithm based on permitted domain values to find a value that is similar to the error record. If no similar value is found, the original value is used. Applies to: Domain List rules with character data type.

Soundex: Uses a soundex algorithm based on permitted domain values to find a value that is similar to the error record. If no soundex value is found, the original value is used. Applies to: Domain List rules with character data type.

Merge: Uses the Match-Merge algorithm to merge duplicate records into a single row. Applies to: Unique Key.

Set to Mode: Uses the mode value to correct the error records if a mode value exists for the functional dependency partition that fails. Applies to: Functional Dependency.
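
As an informal illustration of the Set to Min and Set to Max strategies (hypothetical code, not the Warehouse Builder implementation), an out-of-range value can be clamped to the limits defined in a Domain Range rule:

    def cleanse_domain_range(value, min_value=None, max_value=None):
        """Clamp an out-of-range value, as in the Set to Min / Set to Max strategies."""
        if min_value is not None and value < min_value:
            return min_value
        if max_value is not None and value > max_value:
            return max_value
        return value

    # Hypothetical Domain Range rule: Salary must be between 100 and 10000.
    print(cleanse_domain_range(50, min_value=100, max_value=10000))     # 100
    print(cleanse_domain_range(25000, min_value=100, max_value=10000))  # 10000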


Data Correction and Augmentation Using Operators

Warehouse Builder provides functionality that enables you to design data correction and augmentation processes for the source data. While transforming the source data, you can use the following operators to ensure data quality:

  • Match-Merge Operator

  • Name and Address Operator

See Also:

Chapter 24, "Data Quality Operators" for information about these operators and their usage

About Data Rules

Data rules are definitions for valid data values and relationships that can be created in Warehouse Builder. They determine legal data within a table or legal relationships between tables. Data rules help ensure data quality. They can be applied to tables, views, dimensions, cubes, materialized views, and external tables. Data rules are used in many situations including data profiling, data and schema cleansing, and data auditing.

The metadata for a data rule is stored in the workspace. To use a data rule, you apply the data rule to a data object. For example, you create a data rule called gender_rule that specifies that valid values are 'M' and 'F'. You can apply this data rule to the emp_gender column of the Employees table. Applying the data rule ensures that the values stored for the emp_gender column are either 'M' or 'F'. You can view the details of the data rule bindings on the Data Rule tab of the Data Object Editor for the Employees table.

There are two ways to create a data rule. A data rule can be derived from the results of data profiling, or it can be created using the Data Rule Wizard. For more information about data rules, see "Using Data Rules".

Types of Data Rules

Table 23-8 describes the types of data rules.

Table 23-8 Types of Data Rules

Domain List: Defines a list of values that an attribute is allowed to have. Example: The Gender attribute can have 'M' or 'F'.

Domain Pattern List: Defines a list of patterns that an attribute is allowed to conform to. The patterns are defined in the Oracle Database regular expression syntax. Example: A pattern for a telephone number is (^[[:space:]]*[0-9]{3}[[:punct:]|[:space:]]?[0-9]{4}[[:space:]]*$).

Domain Range: Defines a range of values that an attribute is allowed to have. Example: The value of the Salary attribute can be between 100 and 10000.

Common Format: Defines a known common format that an attribute is allowed to conform to. This rule type has many subtypes: Telephone Number, IP Address, SSN, URL, and E-mail Address. Each subtype has predefined formats listed, and you can add more formats to the list. Example: An e-mail address should be in the following format: ^(mailto:[-_a-z0-9.]+@[-_a-z0-9.]+$).

No Nulls: Specifies that the attribute cannot have null values. Example: The department_id attribute for an employee in the Employees table cannot be null.

Functional Dependency: Defines that the data in the data object may be normalized. Example: The Dept_name attribute is dependent on the Dept_no attribute.

Unique Key: Defines whether an attribute or group of attributes is unique in the given data object. Example: The name of a department must be unique.

Referential: Defines the type of relationship (1:x) a value must have with another value. Example: The department_id attribute of the Departments table must have a 1:n relationship with the department_id attribute of the Employees table.

Name and Address: Uses the Name and Address support to evaluate a group of attributes as a name or an address.

Custom: Applies an SQL expression that you specify to its input parameters. Example: A custom rule called VALID_DATE has two input parameters, START_DATE and END_DATE. A valid expression for this rule is defined as follows: "THIS"."END_DATE" > "THIS"."START_DATE".


Implementation of Data Rules

Warehouse Builder uses different methods of applying data rules to a correction object. The method used depends on the type of data rule you are implementing.

Table 23-9 describes the methods used for object schema correction. It also lists the data rule types for which each correction is used.

Table 23-9 Data Rules Implementation for Schema Correction

Create Constraints: Creates a constraint reflecting the data rule on the correction table. If a constraint cannot be created, a validation message is displayed on the Data Rules Validation page of the Apply Data Rule Wizard. Used for: Custom, Domain List, Domain Pattern List, Domain Range, Common Format, No Nulls, Unique Key.

Change the data type: Changes the data type of the column to NUMBER or DATE according to the results of profiling. The data type is changed for data rules of type Is Number and Is Name.

Create a lookup table: Creates a lookup table and adds the appropriate foreign key or unique key constraints to the corrected table and the lookup table. Used for: Functional Dependency.

Name and Address Parse: Adds additional name and address attributes to the correction table. The name and address attributes correspond to a selection of the output values of the Name and Address operator. In the map that is created to cleanse data, a Name and Address operator is used to perform name and address cleansing. Used for: Name and Address.


About Quality Monitoring

Quality monitoring builds on your initial data profiling and data quality initiatives. It enables you to monitor the quality of your data over time. You can define the business rules to which your data should adhere.

To monitor data using Warehouse Builder you must create data auditors. Data auditors ensure that your data complies with the business rules you defined.

About Data Auditors

Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule by auditing and marking how many errors are occurring against the audited data. To monitor and audit a data object, you must first define data rules that specify valid data for that data object.

Data auditors are an important tool in ensuring that data quality levels are up to the standards set by the users of the system. They also help to determine spikes in bad data, allowing events to be tied to these spikes.

Data auditors can be deployed and executed ad-hoc, but they are typically run to monitor the quality of the data in an operational environment like a data warehouse or ERP system and, therefore, can be added to a process flow and scheduled.

Data Auditor Thresholds 

Data auditors have thresholds that allow you to create logic based on the number of non-compliant records: when too many records fail, the process flow can be diverted into an error or notification stream. You can specify a threshold value for each data rule that the data auditor audits. This value is used to determine whether the data in the data object is within the limits that you defined. Based on this threshold, the process can choose actions.

For example, you create a data auditor to audit the data in the Employees table. This table contains two data rules, emp_email_unique_rule and emp_sal_min_rule. You specify a threshold value of 80% for both rules. This means that if less than 80% of the data in the Employees table complies with the data rules, the auditing for the table fails.
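
The sketch below illustrates the threshold logic in this example; the function and values are hypothetical and are meant only to show the concept, not how Warehouse Builder implements data auditors.

    def audit_passes(compliant_rows, total_rows, threshold_pct=80):
        """Return True when the share of compliant rows meets the audit threshold."""
        compliance = 100.0 * compliant_rows / total_rows
        return compliance >= threshold_pct

    # Hypothetical audit of the Employees table against one data rule.
    print(audit_passes(compliant_rows=780, total_rows=1000))  # False: only 78% compliant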

Tip:

"Specifying Data Correction Actions" for information about specifying threshold value

Audit Results for Data Auditors 

In addition to setting thresholds for non-compliant records, you can also capture audit results and store them for analysis purposes. When executed, the data auditor sets several output values. One of these values is the audit result. Audit results provide information about the extent of data rule violations that occurred while running the data auditor.

Performing Data Profiling

Data profiling is, by definition, a resource-intensive process. It analyzes data and columns and performs many iterations to detect defects and anomalies in your data, so it warrants forethought and planning in order to be as effective as possible.

Before you begin profiling data, first reduce the data set by doing a random sampling. Next identify the data objects that you want to target. Instead of profiling everything, choose objects that are deemed crucial. You should not select an entire source system for profiling at the same time. Not only is it a waste of resources, but it is also often unnecessary. Select areas of your data where quality is essential and has the largest fiscal impact.

For example, you have a data source that contains five tables: CUSTOMERS, REGIONS, ORDERS, PRODUCTS, and PROMOTIONS. You decide that the two most important tables with respect to data quality are CUSTOMERS and ORDERS. The CUSTOMERS table is known to contain many duplicate and erroneous entries that cost your company money on wasted marketing efforts. The ORDERS table is known to contain data about orders in an incorrect format. In this case, you would select only these two tables for data profiling.

Data Profiling Restrictions

  • You can only profile data in the default configuration.

  • The profiling workspace location should be an Oracle 10g database or higher.

  • You cannot profile a source table that contains complex data types if the source module and the data profile are located on different database instances.

  • You cannot profile a table that contains more than 165 columns.

Steps to Perform Data Profiling

After you have chosen the objects you want to profile, use the following steps to guide you through the profiling process:

  1. Import or Select the Metadata

  2. Create a Data Profile

  3. Profile the Data

  4. View Profile Results

  5. Derive Data Rules

  6. Generate Corrections

  7. Define and Edit Data Rules Manually

  8. Generate, Deploy, and Execute

The data profiling process ends at step 4. Steps 5 to 7 are optional and can be performed if you want to perform data correction after data profiling. Step 8 is required when you perform data correction along with data profiling.

Import or Select the Metadata

Data profiling requires the profiled objects to be present in the project in which you are performing data profiling. Ensure that these objects are either imported into this project or created in it. Also ensure that the data is loaded into the objects. Having the data loaded is essential to data profiling.

Also, because data profiling uses mappings to run the profiling, you must ensure that all locations that you are using are registered. Data profiling attempts to register your locations. If, for some reason, data profiling cannot register your locations, you must explicitly register the locations before you begin profiling.

Create a Data Profile

After your system is set up, you can create a data profile using the Design Center. A data profile is a metadata object in the workspace. It includes the set of data objects you want profiled, the settings controlling the profiling operations, the results returned after you profile the data, and correction information (if you decide to use these corrections).

To create a data profile:

  1. From the Project Explorer, expand the project node in which you want to create a data profile.

  2. Right-click Data Profiles and select New.

    The Welcome page of the Create Data Profile Wizard is displayed.

  3. On the Name and Description page, enter a name and an optional description for the data profile. Click Next.

  4. On the Select Objects page, select the objects you want to include in the data profile and use the arrows to move them to the Selected list. Click Next.

    To select multiple objects, hold down the Ctrl key while selecting objects. You can include tables, views, materialized views, external tables, dimensions, and cubes in your data profile.

  5. (Optional) If you selected tables, views, or materialized views that contain attribute sets, the Choose Attribute Set dialog box is displayed. The list at the bottom of this dialog box displays the attribute sets defined on the data object.

    • To profile only the attributes defined in the attribute set, select the attribute set from the list.

    • To profile all columns in the data object, select <all columns> from the list.

  6. (Optional) If you selected dimensional objects on the Select Objects page, a warning is displayed informing you that the relational objects bound to these dimensional objects will also be added to the profile. Click Yes to proceed.

  7. On the Summary page, review the choices you made on the previous wizard pages. Click Back to change any selected values. Click Finish to create the data profile.

    The Warehouse Builder note dialog is displayed. Click OK to display the Data Profile Editor for the newly created data profile.

    The new data profile is added to the Data Profiles node in the navigation tree.

Using Attribute Sets to Profile a Subset of Columns from a Data Object

You can use an attribute set to restrict a data profiling operation to a subset of columns from a table, view or materialized view. Reasons to use an attribute set include:

  • You can decrease profiling time by excluding columns for which you do not need profiling results.

  • Data profiling can only profile up to 165 columns from a table, view or materialized view at a time. You can use an attribute set to select a set of 165 or fewer columns to profile from the object.

Data profiling, using attribute sets, consists of the following high-level steps:

  1. Defining Attribute Sets

  2. Profiling the Columns in an Attribute Set

Defining Attribute Sets

Use the following steps to define an attribute set in a table, view, or materialized view.

  1. In the Project Explorer, double-click the table, view, or materialized view.

    The Data Object Editor for the selected object is opened.

  2. In the Details panel, select the Attribute Sets tab.

  3. In the Attribute Sets section, click a blank area in the Name column and enter the name of the attribute set to create.

  4. Close the Data Object Editor.

  5. Double-click the data object in which you created an attribute set in step 3.

    The Data Object Editor for the selected object is displayed.

  6. On the Attributes Sets tab, select the name of the attribute set created in step 3.

    The Attributes of the selected attribute set section displays the attributes in the data object.

  7. Select Include for all the attributes that you want included in the attribute set.

  8. Save your changes and close the Data Object Editor.

Profiling the Columns in an Attribute Set

Use the following steps to profile columns contained in the attribute set.

  1. In the Project Explorer, right-click the Data Profiles node and select New.

    The Welcome page of the Create Data Profile Wizard is displayed.

  2. On the Welcome page, click Next.

  3. On the Name and Description page, enter a name and an optional description for the data profile. Click Next.

  4. On the Select Objects page, select the data object that you want to profile and use the shuttle arrows to move it to the Selected list.

  5. When the selected data object contains attribute sets, the Choose Attribute Set dialog box is displayed.

  6. Select the attribute set that you want to profile and click OK.

    The Select Objects page is displayed.

  7. On the Select Objects page, click Next.

  8. On the Summary page, review the options you chose on the previous wizard pages and click Finish.

    The data profile is created and added to the Navigator tree.

Profile the Data

Data profiling is achieved by performing deep scans of the selected objects. This can be a time-consuming process, depending on the number of objects and the type of profiling you are running. However, profiling runs as an asynchronous job, and the client can be closed during this process. You can see the job running in the job monitor, and Warehouse Builder prompts you when the job is complete.

Steps to Profile Data

After you have created a data profile, you can open it in the Data Profile Editor to profile the data or review profile results from a previous run. The objects you selected when creating the profile are displayed in the object tree of the Data Profile Editor. You can add objects to the profile by selecting Profile and then Add.

To profile the data:

  1. Expand the Data Profiles node in the Project Explorer, right-click a data profile, and select Open Editor.

    The Data Profile Editor opens the selected data profile.

  2. To specify the types of analysis to be performed on the data, configure the data profile.

    See "Configuring Data Profiles".

  3. From the Profile menu, select Profile.

    • (Optional) If this is the first time you are profiling data, the Data Profile Setup dialog box is displayed. Enter the details of the profiling workspace. For more information about the information to be entered, click Help.

    Warehouse Builder begins preparing metadata for profiling. The progress window containing the name of the object being created to profile the data is displayed. After the metadata preparation is complete, the Profiling Initiated dialog box is displayed informing you that the profiling job has started.

  4. On the Profiling Initiated dialog box, click OK.

    Once the profiling job starts, the data profiling is asynchronous and you can continue working or even close the client. Your profiling process will continue to run until it is completed.

  5. View the status of the profiling job in the Monitor Panel of the Data Profile Editor.

    You can continue to monitor the progress of your profiling job in the Monitor panel. After the profiling job is complete, the status displays as complete.

  6. After the profiling is complete, the Retrieve Profile Results dialog box is displayed and you are prompted to refresh the results. Click OK to retrieve the data profiling results and display them in the Data Profile Editor.

    You can use this option if you have previously profiled data in the same data profile. It allows you to control when the new profiling results become visible in the Data Profile Editor.

Note:

Data profiling results are overwritten on subsequent profiling executions.

Configuring Data Profiles

You can, and should, configure a data profile before running it if there are specific types of analysis you do, or do not, want to run.

You can configure a data profile at one of the following levels:

  • The entire profile (all the objects it contains)

  • An individual object in the data profile

    For example, the data profile contains three tables. If you want to perform certain specific types of analysis on one table, configure this table only.

  • An attribute within an object

    For example, if you know that you only have one problematic column in a table and you already know that most of the records should conform to values within a certain domain, then you can focus your profiling resources on domain discovery and analysis. By narrowing down the type of profiling necessary, you use less resources and obtain the results faster.

Steps to Configure Data Profiles
  1. In the Project Explorer, right-click the data profile and select Open Editor.

    The Data Profile Editor for the data profile is displayed.

  2. On the Property Inspector panel, set configuration properties at the required level.

    To configure the entire data profile:

    1. On the Profile Objects tab, select the data profile.

    2. On the Property Inspector panel, set the configuration properties for the data profile.

    To configure a particular object in the data profile:

    1. On the Profile Objects tab, expand the node that represents the data profile.

    2. Select the data object.

    3. On the Property Inspector panel, set the configuration properties for the object.

    To configure an attribute within an object in a data profile:

    1. On the Profile Objects tab, expand the node that represents the data profile.

    2. Expand the data object that contains the attribute and select the required attribute.

    3. On the Property Inspector panel, set the configuration properties for the attribute.

    Note:

    "Reference for Setting Data Profiling Configuration Parameters" for details about configuration parameters you can set for data profiles

    .

View Profile Results

After the profile operation is complete, you can open the data profile in the Data Profile Editor to view and analyze the results. The profiling results contain a variety of analytical and statistical information about the data profiled. You can immediately drill down into anomalies and view the data that caused them. You can then determine what data must be corrected.

To view the profile results:

  1. Select the data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor opens and displays the data profile.

  2. If you have previous data profiling results displayed in the Data Profile Editor, refresh the view when prompted so that the latest results are shown.

    The results of the profiling are displayed in the Profile Results Canvas.

  3. Minimize the Data Rule and Monitor panels by clicking on the arrow symbol in the upper left corner of the panel.

    This maximizes your screen space.

  4. Select objects in the Profile Objects tab of the object tree to focus the results on a specific object.

    The profiling results for the selected object are displayed using the following tabs of the Profile Results Canvas.

You can switch between various objects in the data profile. The tab that you had selected for the previous object remains selected.

Data Profile

The Data Profile tab contains general information about the data profile. Use this tab to store any notes or information about the data profile.

Profile Object

The Profile Object tab contains two sub-tabs: Object Data and Object Notes. The Object Data tab lists the data records in the object you have selected in the Profile Objects tab of the object tree. The number of rows that were used in the sample is listed. You can use the buttons along the top of the tab to execute a query, get more data, or add a WHERE clause.

Aggregation

The Aggregation tab displays all the essential measures for each column, such as minimum and maximum values, number of distinct values, and null values. Some measures are available only for specific data types. These include the average, median, and standard deviation measures. Information can be viewed from either the Tabular or Graphical sub-tabs.

Table 23-10 describes the various measurement results available in the Aggregation tab.

Table 23-10 Aggregation Results

Minimum: The minimum value with respect to the inherent database ordering of a specific type

Maximum: The maximum value with respect to the inherent database ordering of a specific type

# Distinct: The total number of distinct values for a specific attribute

% Distinct: The percentage of distinct values over the entire row set number

Not Null: Yes or No

Recommended NOT NULL: From analyzing the column values, data profiling determines that this column should not allow null values

# Nulls: The total number of null values for a specific attribute

% Nulls: The percentage of null values over the entire row set number

Six Sigma: For each column, number of null values (defects) to the total number of rows in the table (opportunities)

Average: The average value for a specific attribute for the entire row set

Median: The median value for a specific attribute for the entire row set

Std Dev: The standard deviation for a specific attribute


A hyperlinked value in the aggregation results grid indicates that you can click the value to drill down into the data. This allows you to analyze the data in the sample that produced this result.

For example, if you scroll to the SALARY column, shown in Figure 23-4, and click the value in the Maximum cell showing 24000, the Data Drill Panel on the bottom changes to show you all the distinct values in this column with counts on the left. On the right, the Data Drill can zoom into the value you select from the distinct values and display the full record where these values are found.

Figure 23-4 Aggregation Tabular Results


The graphical analysis displays the results in a graphical format. You can use the graphical toolbar to change the display. You can also use the Column and Property menus to change the displayed data object.

Data Type

The Data Type tab provides profiling results about data types. This includes metrics such as length for character data types and the precision and scale for numeric data types. For each data type that is discovered, the data type is compared to the dominant data type found in the entire attribute and the percentage of rows that comply with the dominant measure is listed.

One example of data type profiling would be finding a column that is defined as VARCHAR but actually stores only numeric values. Changing the data type of the column to NUMBER would make storage and processing more efficient.

Table 23-11 describes the various measurement results available in the Data Type tab.

Table 23-11 Data Type Results

Columns: Name of the column

Documented Data Type: Data type of the column in the source object

Dominant Data Type: From analyzing the column values, data profiling determines that this is the dominant (most frequent) data type

% Dominant Data Type: Percentage of total number of rows where column value has the dominant data type

Documented Length: Length of the data type in the source object

Minimum Length: Minimum length of the data stored in the column

Maximum Length: Maximum length of the data stored in the column

Dominant Length: From analyzing the column values, data profiling determines that this is the dominant (most frequent) length

% Dominant Length: Percentage of total number of rows where column value has the dominant length

Documented Precision: Precision of the data type in the source object

Minimum Precision: Minimum precision for the column in the source object

Maximum Precision: Maximum precision for the column in the source object

Dominant Precision: From analyzing the column values, data profiling determines that this is the dominant (most frequent) precision

% Dominant Precision: Percentage of total number of rows where column value has the dominant precision

Documented Scale: Scale specified for the data type in the source object

Minimum Scale: Minimum scale of the data type in the source object

Maximum Scale: Maximum scale of the data type in the source object

Dominant Scale: From analyzing the column values, data profiling determines that this is the dominant (most frequent) scale

% Dominant Scale: Percentage of total number of rows where column value has the dominant scale


Domain

The Domain tab displays results about the possible set of values that exist in a certain attribute. Information can be viewed from either the Tabular or Graphical subtabs.

Figure 23-5 displays the Domain tab of the Data Profile Editor.

Figure 23-5 Domain Discovery Results


The process of discovering a domain on a column involves two phases. First, the distinct values in the column are used to determine whether the column might be defined by a domain. Typically, there are few distinct values in a domain. Then, if a potential domain is identified, the counts of the distinct values are used to determine whether each distinct value is compliant with the domain. The properties that control the threshold for both phases of domain discovery can be set in the Property Inspector.

If you find a result that you want to know more about, drill down and use the Data Drill panel to view details about the cause of the result.

For example, a domain of four values was found for the column REGION_ID: 3,2,4, and 1. If you want to see which records contributed to this finding, select the REGION_ID row and view the details in the Data Drill panel.

Table 23-12 describes the various measurement results available in the Domain tab.

Table 23-12 Domain Results

Found Domain: The discovered domain values

% Compliant: The percentage of all column values that are compliant with the discovered domain values

Six Sigma: The Six Sigma value for the domain results


Pattern

The Pattern tab displays information discovered about patterns within the attribute. Pattern discovery is the profiler's attempt at generating regular expressions for data it discovered for a specific attribute. Note that non-English characters are not supported in the pattern discovery process.

Table 23-13 describes the various measurement results available in the Pattern tab.

Table 23-13 Pattern Results

Dominant Character Pattern: The most frequently discovered character pattern, or consensus pattern.

% Compliant: The percentage of rows whose data pattern agrees with the dominant character pattern.

Dominant Word Pattern: The most frequently discovered word pattern, or consensus pattern.

% Compliant: The percentage of rows whose data pattern agrees with the dominant word pattern.

Common Format: Name, Address, Date, Boolean, Social Security Number, E-mail, URL. This is the profiler's attempt to add semantic understanding to the data that it sees. Based on patterns and some other techniques, it tries to determine which domain bucket a certain attribute's data belongs to.

% Compliant: The percentage of rows whose data pattern agrees with the consensus common format pattern.


Unique Key

The Unique Key tab provides information about the existing unique keys that were documented in the data dictionary, and possible unique keys or key combinations that were detected by the data profiling operation. The uniqueness % is shown for each. The unique keys that have No in the Documented? column are the ones that are discovered by data profiling.

For example, a phone number is unique in 98% of the records. It can be a unique key candidate, and you can then cleanse the noncompliant records. You can also use the drill-down feature to view the cause of the duplicate phone numbers in the Data Drill panel. Table 23-14 describes the various measurement results available in the Unique Key tab.

Table 23-14 Unique Key Results

Unique Key: The discovered unique key.

Documented?: Yes or No. Yes indicates that a unique key on the column exists in the data dictionary. No indicates that the unique key was discovered as a result of data profiling.

Discovered?: From analyzing the column values, data profiling determines whether a unique key should be created on the columns in the Local Attribute(s) column.

Local Attribute(s): The name of the column in the table that was profiled.

# Unique: The number of rows, in the source object, in which the attribute represented by Local Attribute is unique.

% Unique: The percentage of rows, in the source object, for which the attribute represented by Local Attribute is unique.

Six Sigma: For each column, number of null values (defects) to the total number of rows in the table (opportunities).


Functional Dependency

The Functional Dependency tab displays information about the attribute or attributes that seem to depend on or determine other attributes. Information can be viewed from either the Tabular or Graphical sub-tabs. You can use the Show list to change the focus of the report. Note that unique keys defined in the database are not discovered as functional dependencies during data profiling.

Table 23-15 describes the various measurement results available in the Functional Dependency tab.

Table 23-15 Functional Dependency Results

Determinant: The name of the attribute that is found to determine the attribute listed in the Dependent column

Dependent: The name of the attribute found to be determined by the value of another attribute

# Defects: The number of values in the Determinant attribute that were not determined by the Dependent attribute

% Compliant: The percentage of values that are compliant with the discovered dependency

Six Sigma: The Six Sigma value

Type: The suggested action for the discovered dependency


For example, if you select Only 100% dependencies from the Show list, the information shown is limited to absolute dependencies. An attribute that is always dependent on another attribute is a candidate for a reference table. Suggestions are shown in the Type column. Moving the attribute into a separate reference table normalizes the schema.

The Functional Dependency tab also has a Graphical sub-tab so that you can view information graphically. You can select a dependency and property from the lists to view graphical data.

For example, in Figure 23-6, you select the dependency where DEPARTMENT_ID seems to determine COMMISSION_PCT (DEPARTMENT_ID->COMMISSION_PCT). In a majority of cases, COMMISSION_PCT is null. Warehouse Builder therefore determines that most DEPARTMENT_ID values determine COMMISSION_PCT to be null. By switching the Property to Non-Compliant, you can view the exceptions to this discovery. Figure 23-6 shows that for the DEPARTMENT_ID value of 80, the COMMISSION_PCT values are not null. This makes sense after you discover that the department with DEPARTMENT_ID 80 is the Sales department.

Figure 23-6 Graphical Functional Dependency

Description of Figure 23-6 follows
Description of "Figure 23-6 Graphical Functional Dependency"

Referential

The Referential tab displays information about foreign keys that were documented in the data dictionary, as well as relationships discovered during profiling. For each relationship, you can see the level of compliance. Information can be viewed from both the Tabular and Graphical subtabs. In addition, two other sub-tabs are available only in the Referential tab: Joins and Redundant Columns.

Table 23-16 describes the various measurement results available in the Referential tab.

Table 23-16 Referential Results

Relationship: The name of the relationship.

Type: The type of relationship.

Documented?: Yes or No. Yes indicates that a foreign key on the column exists in the data dictionary. No indicates that the foreign key was discovered as a result of data profiling.

Discovered?: Indicates whether, based on an analysis of the column values, data profiling determines that a foreign key should be created on the column represented by Local Attribute(s).

Local Attribute(s): The name of the attribute in the source object.

Remote Key: The name of the key in the referenced object to which the local attribute refers.

Remote Attribute(s): The names of the attributes in the referenced object.

Remote Relation: The name of the object that the source object references.

Remote Module: The name of the module that contains the referenced object.

Cardinality Range: The range of the cardinality between two attributes.

For example, suppose the EMP table contains five rows of employee data: two employees each in departments 10 and 20, and one employee in department 30. The DEPT table contains three rows of department data, with deptno values 10, 20, and 30.

Data profiling finds a row relationship between the EMP and DEPT tables with a cardinality range of 1-2:1-1. In EMP, the number of rows per distinct deptno value ranges from 1 (deptno 30) to 2 (deptno 10 and 20). In DEPT, there is only one row for each distinct value (10, 20, and 30).

# Orphans: The number of orphan rows in the source object.

% Compliant: The percentage of values that are compliant with the discovered relationship.

Six Sigma: For each column, the Six Sigma value calculated from the number of null values (defects) relative to the total number of rows in the table (opportunities).


For example, suppose you are analyzing two tables, Employees and Departments, for referential relationships. Using the Referential data profiling results shown in Figure 23-7, you discover that the DEPARTMENT_ID column in the Employees table is related to the DEPARTMENT_ID column in the Departments table 98% of the time. You can then click the hyperlinked Yes in the Discovered? column to view the rows that did not comply with the discovered foreign key relationship.

Figure 23-7 Referential Results

Description of Figure 23-7 follows
Description of "Figure 23-7 Referential Results"

You can also select the Graphical subtab to view information graphically. This view is effective for viewing noncompliant records, such as orphans. To use the Graphical sub-tab, make a selection from the Reference and Property lists.

The Joins sub-tab displays a join analysis on the reference selected in the Reference list. The results show the relative size and exact counts of the three possible outcomes for referential relationships: joins, orphans, and childless objects.

For example, both the EMPLOYEES and DEPARTMENTS tables contain a DEPARTMENT_ID column, and there is a one-to-many relationship between the DEPARTMENT_ID column in the DEPARTMENTS table and the DEPARTMENT_ID column in the EMPLOYEES table. Joins represent values that are present in both tables. Orphans represent values that are present only in the EMPLOYEES table and not in the DEPARTMENTS table. Childless values are present in the DEPARTMENTS table but not in the EMPLOYEES table. You can drill into values on the diagram to view more details in the Data Drill panel.
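The three outcomes can also be approximated outside of Warehouse Builder. The following sketch assumes HR-style EMPLOYEES and DEPARTMENTS tables and counts joined, orphan, and childless DEPARTMENT_ID values; the queries that the profiler actually generates may differ.

    -- Joined, orphan, and childless DEPARTMENT_ID values
    SELECT SUM(CASE WHEN e.DEPARTMENT_ID IS NOT NULL
                     AND d.DEPARTMENT_ID IS NOT NULL THEN 1 ELSE 0 END) AS joined,
           SUM(CASE WHEN d.DEPARTMENT_ID IS NULL THEN 1 ELSE 0 END) AS orphans,
           SUM(CASE WHEN e.DEPARTMENT_ID IS NULL THEN 1 ELSE 0 END) AS childless
    FROM (SELECT DISTINCT DEPARTMENT_ID FROM EMPLOYEES
          WHERE DEPARTMENT_ID IS NOT NULL) e
    FULL OUTER JOIN
         (SELECT DISTINCT DEPARTMENT_ID FROM DEPARTMENTS
          WHERE DEPARTMENT_ID IS NOT NULL) d
      ON e.DEPARTMENT_ID = d.DEPARTMENT_ID;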

Figure 23-8 displays the Joins subtab of the Referential tab.

The Redundant Columns sub-tab displays information about columns in the child table that are also contained in the primary table. Redundant column results are only available when perfectly unique columns are found during profiling.

For example, consider two tables EMP and DEPT, shown in Table 23-17 and Table 23-18, having the following foreign key relationship: EMP.DEPTNO (uk) = DEPT.DEPTNO (fk).

Table 23-17 EMP Table

Employee Number    Dept. No    Location
100                1           CA
200                2           NY
300                3           MN


Table 23-18 DEPT Table

Dept No    Location    Zip
1          CA          94404
3          MN          21122
3          MN          21122
1          CA          94404
In this example, the Location column in the EMP table is a redundant column because you can get the same information from the join.
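As a hedged illustration (the column names EMPLOYEE_NUMBER, DEPTNO, and LOCATION are assumed), the following query shows that the location can be obtained through the join, which is what makes the column in EMP redundant:

    -- LOCATION is recoverable from DEPT, so storing it in EMP is redundant
    SELECT DISTINCT e.EMPLOYEE_NUMBER, e.DEPTNO, d.LOCATION
    FROM EMP e
    JOIN DEPT d ON e.DEPTNO = d.DEPTNO;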

Data Rule

The Data Rule tab displays the data rules that are defined as a result of data profiling for the table selected in the object tree. The details for each data rule include the following:

  • Rule Name: Represents the name of the data rule.

  • Rule Type: Provides a brief description about the type of data rule.

  • Origin: Represents the origin of the data rule. For example, a value of Derived indicates that the data rule was derived from data profiling results.

  • % Compliant: Percent of rows that comply with the data rule.

  • # Defects: Number of rows that do not comply with the data rule.

The data rules on this tab reflect the active data rules in the Data Rule panel. You do not directly create data rules on this tab.

Derive Data Rules

Based on the results of data profiling, you can derive data rules that can be used to clean up your data. A data rule is an expression that determines the set of legal data that can be stored within a data object. Use data rules to ensure that only values compliant with the data rules are allowed within a data object. Data rules will form the basis for correcting or removing data if you decide to cleanse the data. You can also use data rules to report on non-compliant data.

Although you can create data rules and apply them manually to your data profile, derived data rules allow you to move quickly and seamlessly between data profiling and data correction.

For example, suppose you have a table called Employees with the columns Employee_Number, Gender, and Employee_Name. The profiling result shows that 90% of the values in the Employee_Number column are unique, making it a prime candidate for the unique key. The results also show that 85% of the values in the Gender column are either 'M' or 'F', making it a good candidate for a domain. You can then derive these rules directly from the Profile Results Canvas.

Steps to Derive Data Rules

  1. Select a data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor is displayed with the profiling results.

  2. Review the profiling results and determine which findings you want derived into data rules.

    The types of results that warrant data rules vary. Some results commonly derived into data rules include a detected domain, a functional dependency between two attributes, or a unique key.

  3. Select the tab that displays the results from which you want to derive a data rule.

    For example, to create a data rule that enforces a unique key rule for the EMPLOYEE_NUMBER column, navigate to the Unique Key tab.

  4. Select the cell that contains the results you want derived into a data rule.

  5. From the Profile menu select Derive Data Rule. Or click the Derive Data Rule button.

    For example, to create a Unique Key rule on the EMPLOYEE_NUMBER column, select this column and click Derive Data Rule.

    The Derive Data Rule Wizard is displayed.

  6. On the Welcome page, click Next.

  7. On the Name and Description page, the Name field displays a default name for the data rule. To specify a new name, select the name, enter the new name, and click Next.

  8. On the Define Rule page, provide details about the data rule parameters and click Next.

    The Type field that represents the type of data rule is populated based on the tab from which you derived the data rule. You cannot edit the type of data rule.

    Additional fields in the lower portion of this page define the parameters for the data rule. Some of these fields are populated with values based on the result of data profiling. The number and type of fields depends on the type of data rule.

  9. On the Summary page, review the options you set in the wizard. Click Back to change any of the selected values, or click Finish to create the data rule.

    The data rule is created and it appears in the Data Rule panel of the Data Profile Editor. The derived data rule is also added to the Derived_Data_Rules node under the Data Rules node in the Project Explorer. You can reuse this data rule by attaching it to other data objects.

Generate Corrections

After you have derived data rules from the profiling results, you can automate the process of correcting source data based on those results. You can create both schema and mapping corrections. The schema correction creates scripts that you can use to create a corrected set of source data objects with the derived data rules applied. The mapping correction creates new correction mappings that take data from the source objects and load it into the new, corrected objects.

Steps to Automate Data Correction

  1. Generate correction objects based on data profiling results.

    See "Steps to Create Corrections".

    You can view the corrected objects as described in "Viewing the Correction Tables and Mappings".

  2. Deploy the generated corrections to create the corrected objects in the target schema.

    See "Steps to Deploy Correction Objects".

Steps to Create Corrections

Use the Data Profile Editor to create corrections based on the profiling results.

To create corrections:

  1. If the Data Profile is not already open, open it by right-clicking the data profile in the Project Explorer and selecting Open Editor.

  2. From the Profile menu, select Create Correction.

    The Create Correction Wizard is displayed.

  3. On the Select Target Module page, specify the target module that will contain the corrections and click Next. You can create a new module or use an existing module.

    • To store the corrections in an existing target module, choose Select an existing module and select the module from the Available list.

    • To store the corrections in a new target module, select Create a new target module. The Create Module Wizard guides you through the steps of creating a new target module.

    You can remove correction objects created as a result of previous corrections by selecting Remove previous correction objects.

  4. On the Select Objects page, select the objects that you want to correct by moving them to the Selected list and click Next.

    The Filter list enables you to filter the objects that are available for selection. The default selection is All Objects. You can display only particular types of data objects such as tables or views.

  5. On the Select Data Rules and Data Types page, perform schema correction by selecting the corrections that you want to implement for each object. Click Next.

    See "Selecting Data Rules and Data Types for Corrected Objects".

  6. (Optional) On the Data Rules Validation page, review the validation errors, if any, and correct them before proceeding.

    If correction objects from a previous data correction action exist for the objects selected for correction, this page displays a message. Click Next to remove previously created correction objects.

  7. On the Verify and Accept Corrected Tables page, select the objects that you want to correct and click Next.

    Se e "Selecting the Objects to be Corrected".

  8. On the Choose Data Correction Actions page, specify the correction actions to be performed for objects and click Next.

    See "Choosing Data Correction and Cleansing Actions".

  9. On the Summary page, click Finish to create the correction objects.

The correction schema is created and added to the Project Explorer. The correction objects and mappings are displayed under the module that you specify as the target module on the Select Target Module page of the Create Correction Wizard. Depending on the data correction actions specified, correction objects may include transformations used to cleanse and correct data.

Selecting Data Rules and Data Types for Corrected Objects

Use this page to select the data rules that should be applied to the selected objects. The objects selected for correction are on the left side of the page and are organized into a tree by modules. The right panel contains two tabs: Data Rules and Data Types.

Data Rules The Data Rules tab displays the available data rules for the object selected in the object tree. Specify the data rules that you want to apply to that corrected object by selecting the check box to the left of the data rule. Warehouse Builder uses these data rules to create constraints on the tables during the schema generation.

The Bindings section contains details about the table column to which the rule is bound. Click a rule name to display the bindings for the rule.

The method used to apply the data rule to the correction table depends on the type of data rule you are implementing.

Data Types The Data Types tab displays the columns that are selected for correction. The change could be a modification of the data type or precision, or a change from a fixed-length to a variable-length data type. The Documented Data Type column displays the existing column definition and the New Data Type column displays the proposed correction to the column definition.

To correct a column definition, select the check box to the left of the column name.

Selecting the Objects to be Corrected

Use the Verify and Accept Corrected Tables page to confirm the objects that you want to correct. This page lists the tables that have been modified with the applied rules and the data type changes.

Perform the following actions on this page:

  1. To create a data object in the corrected schema, select Create to the left of the object.

    The lower part of the page displays the details of the selected object using tabs. The Columns tab displays the details of columns that will be created in the corrected data object. The Constraints tab displays details of constraints that will be created on the corrected data object. The Data Rules tab displays details of data rules that will be created on the corrected data object.

  2. On the Columns tab, perform the following actions:

    • To create a displayed column in the corrected data object, select Create to the left of the column name.

    • To remove a displayed column from the corrected object, deselect Create to the left of the column name.

    • To modify a displayed column in the corrected object, edit the Data Type, Length, Precision, Seconds Precision, and Scale attributes. However, you cannot modify a column name.

  3. On the Constraints tab, perform the following actions:

    • To add additional constraints, click Add Constraint.

    • To remove a constraint from the corrected data object, select the constraint and click Delete.

  4. On the Data Rules tab, perform the following actions:

    • To apply a data rule that was derived from data profiling results to the corrected object, select the check box to the left of the data rule.

      Ensure that the Bindings column contains the column to which the data rule should be applied.

    • To apply a new data rule to the corrected object, click Apply Rule. The Apply Data Rule wizard guides you through the process of applying a data rule.

Choosing Data Correction and Cleansing Actions

Use the Choose Data Correction Actions page to select the actions to perform to correct the source data. This page contains two sections: Select a Corrected Table and Choose Data Correction Actions. The Select a Corrected Table section lists the objects that you selected for corrections. Select a table in this section to display the affiliated data rules in the Choose Data Correction Actions section.

Choose Data Correction Actions For each data rule, select an action from the menu in the Action column. The settings you select here determine how to handle data values that are not accepted due to data rule enforcement. Select one of the following actions:

  • Ignore

  • Report

  • Cleanse

See Also:

"Data Correction Actions" for a description of each correction action.

Cleansing Strategy Use the Cleansing Strategy list to specify a cleansing strategy. This option is enabled only if you select Cleanse in the Action column. The cleansing strategy depends on the type of data rule and the rule configuration. Error tables are used to store the records that do not conform to the data rule.

Steps to Deploy Correction Objects

To deploy the correction mappings created as part of the data correction process:

  1. Grant the SELECT privilege on the source tables to PUBLIC.

    For example, if your correction mapping contains the EMPLOYEES table from the HR schema, you can successfully deploy the mapping only if the SELECT privilege on HR.EMPLOYEES is granted to PUBLIC, as shown in the example following these steps.

  2. To create the correction objects in your target schema, deploy the correction tables created as a result of data profiling.

  3. To cleanse data, deploy and execute the correction mappings.
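For example, for the HR.EMPLOYEES table mentioned in step 1, the grant could be issued in SQL*Plus as follows, connected as the HR schema owner or another suitably privileged user:

    GRANT SELECT ON HR.EMPLOYEES TO PUBLIC;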

Viewing the Correction Tables and Mappings

You can review the correction tables in the Data Object Editor to see the data rules and constraints created as part of the design of your table.

To view the correction mappings:

  1. Double-click the table or mapping to open the object in its respective editor.

  2. After the mapping is open, select View and then Auto Layout to view the entire mapping.

    Figure 23-9 displays a correction map generated by the Create Correction wizard.

    Figure 23-9 Generated Correction Mapping

    Description of Figure 23-9 follows
    Description of "Figure 23-9 Generated Correction Mapping"

  3. Select the submapping ATTR_VALUE_1 and click the Visit Child Graph icon from the toolbar to view the submapping.

    Figure 23-10 displays the submapping that is displayed.

    Figure 23-10 Correction Submapping

    Description of Figure 23-10 follows
    Description of "Figure 23-10 Correction Submapping"

    The submapping is the element in the mapping that performs the actual correction and cleansing you specified in the Create Correction Wizard. In the middle of this submapping is the DOMAINSIMILARITY transformation that was generated as a function by the Create Correction Wizard.

Define and Edit Data Rules Manually

Data rules can be derived or manually created. Before and after you have created the corrections, you can define additional data rules manually.

For more information about defining and editing data rules manually, see "Creating Data Rules".

Generate, Deploy, and Execute

Finally, you can generate, deploy, and execute the correction mappings and data rules. After you run the correction mappings with the data rules, your data is corrected. The derived data rules remain attached to the objects in the corrected schema for optional use in data monitors.

Editing Data Profiles

Once you create a data profile, you can use the Data Profile Editor to modify its definition. You can also add data objects to an existing data profile. To add objects, you can use either the menu bar options or the Select Objects tab of the Edit Data Profile dialog box.

To edit a data profile:

  1. In the Project Explorer, right-click the data profile and select Open Editor.

    The Data Profile Editor is displayed.

  2. From the Edit menu, select Properties.

    The Edit Data Profile dialog box is displayed.

  3. Edit any of the following properties of the data profile and click OK.

    • To modify the name or description of the data profile, on the Name tab, select the name or description and enter the new value.

    • To add or remove objects, on the Select Objects tab, use the arrows to add objects to or remove objects from the data profile.

    • To change the location that is used as the data profiling staging area, use the Data Locations tab.

      Use the arrows to move the new location to the Selected Locations section. Ensure that you select New Configuration Default to set this location as the default profiling location for the data profile.

      Note:

      If you modify the profiling location after you have performed data profiling, the previous profiling results are lost.

Adding Data Objects to a Data Profile

To add data objects to a data profile, complete the following steps:

  1. Right-click the data profile in the Project Explorer and select Open Editor.

    The Data Profile Editor is displayed.

  2. From the Profile menu, select Add Objects.

    The Add Profile Tables dialog box is displayed.

  3. On the Add Profile Tables dialog box, select the objects that you want to add to the data profile and use the arrows to move them to the Selected section.

    You can select multiple objects by holding down the Ctrl key and selecting objects.

Reference for Setting Data Profiling Configuration Parameters

The configuration parameters that you can set for data profiles are categorized as follows:

Load Configuration

This category contains the following parameters:

  • Enable Data Type Discovery: Set this parameter to true to enable data type discovery for the selected table.

  • Enable Common Format Discovery: Set this parameter to true to enable common format discovery for the selected table.

  • Copy Data into Workspace: Set this parameter to true to enable copying of data from the source to the profile workspace.

  • Random Sample Rate: This value represents the percent of total rows that will be randomly selected during loading.

  • Sample Set Filter: This represents the WHERE clause that will be applied on the source when loading data into the profile workspace. Click the Ellipsis button on this field to display the Expression Builder. Use the Expression Builder to define the WHERE clause.

  • Null Value Representation: The value that is treated as a null value during profiling. You must enclose the value in single quotation marks. The default value is null, which is treated as a database null.
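For example, the following hypothetical settings restrict profiling of an ORDERS-style source table to recent rows, sample 10% of those rows, and treat the literal string 'N/A' as null; the table and column names are illustrative only.

    Sample Set Filter:           ORDER_DATE >= DATE '2007-01-01'
    Random Sample Rate:          10
    Null Value Representation:   'N/A'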

Aggregation Configuration

This category consists of one parameter called Not Null Recommendation Percentage. If the percentage of null values in a column is less than this threshold percent, then that column will be discovered as a possible Not Null column.

Pattern Discovery Configuration

This category contains the following parameters:

  • Enable Pattern Discovery: Set this to true to enable pattern discovery.

  • Maximum Number of Patterns: The maximum number of patterns that the profiler reports for an attribute. For example, if you set this parameter to 10, the profiler reports the top 10 patterns for the attribute.

Domain Discovery Configuration

This category contains the following parameters:

  • Enable Domain Discovery: Set this to true to enable domain discovery.

  • Domain Discovery Max Distinct Values Count: The maximum number of distinct values in a column in order for that column to be discovered as possibly being defined by a domain. Domain discovery of a column occurs if the number of distinct values in that column is at or below the Max Distinct Values Count property, and the number of distinct values as a percentage of total rows is at or below the Max Distinct Values Percent property.

  • Domain Discovery Max Distinct Values Percent: The maximum number of distinct values in a column, expressed as a percentage of the total number of rows in the table, in order for that column to be discovered as possibly being defined by a domain. Domain Discovery of a column occurs if the number of distinct values in that column is at or below the Max Distinct Values Count property, and the number of distinct values as a percentage of total rows is at or below the Max Distinct Values Percent property.

  • Domain Value Compliance Min Rows Count: The minimum number of rows for the given distinct value in order for that distinct value to be considered as compliant with the domain. Domain Value Compliance for a value occurs if the number of rows with that value is at or above the Min Rows Count property, and the number of rows with that value as a percentage of total rows is at or above the Min Rows Percent property.

  • Domain Value Compliance Min Rows Percent: The minimum number of rows, expressed as a percentage of the total number of rows, for the given distinct value in order for that distinct value to be considered as compliant with the domain. Domain Value Compliance for a value occurs if the number of rows with that value is at or above the Min Rows Count property, and the number of rows with that value as a percentage of total rows is at or above the Min Rows Percent property.
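As a hedged illustration of how the two discovery thresholds interact, the following query computes the distinct value count and the distinct values as a percentage of total rows for a hypothetical STATUS column in an ORDERS table; the column is a domain candidate only if both results fall at or below the configured maximums.

    -- Inputs to the domain discovery thresholds for a candidate column
    SELECT COUNT(DISTINCT STATUS) AS distinct_count,
           ROUND(100 * COUNT(DISTINCT STATUS) / COUNT(*), 2) AS distinct_pct
    FROM ORDERS;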

Relationship Attribute Count Configuration

This category contains one parameter called Maximum Attribute Count. This is the maximum number of attributes for unique key, foreign key, and functional dependency profiling.

Unique Key Discovery Configuration

This category contains the following parameters:

  • Enable Unique Key Discovery: Set this parameter to true to enable unique key discovery.

  • Minimum Uniqueness Percentage: This is the minimum percentage of rows that need to satisfy a unique key relationship.

Functional Dependency Discovery Configuration

This category contains the following parameters:

  • Enable Functional Dependency Discovery: Set this parameter to true to enable functional dependency discovery.

  • Minimum Functional Dependency Percentage: This is the minimum percentage of rows that need to satisfy a functional dependency relationship.

Row Relationship Discovery Configuration

This category contains the following parameters:

  • Enable Relationship Discovery: Set this parameter to true to enable foreign key discovery.

  • Minimum Relationship Percentage: This is the minimum percentage of rows that need to satisfy a foreign key relationship.

Redundant Column Discovery Configuration

This category contains the following parameters:

  • Enable Redundant Columns Discovery: Set this parameter to true to enable redundant column discovery with respect to a foreign key-unique key pair.

  • Minimum Redundancy Percentage: This is the minimum percentage of rows that are redundant.

Data Rule Profiling Configuration

This category contains one parameter called Enable Data Rule Profiling for Table. Set this parameter to true to enable data rule profiling for the selected table. This setting is only applicable for a table, and not for an individual attribute.

Tuning the Data Profiling Process

Data profiling is a processor- and I/O-intensive process, and the execution time for profiling ranges from a few minutes to a few days. You can achieve the best possible data profiling performance by ensuring that the following conditions are satisfied:

  • Your database is set up correctly for data profiling.

  • The appropriate data profiling configuration parameters are used when you perform data profiling.

Tuning the Data Profile for Better Data Profiling Performance

You can configure a data profile to optimize data profiling results. Use the configuration parameters to configure a data profile.

Use the following guidelines to make your data profiling process faster:

  • Perform only the types of analysis that you require

    If you know that certain types of analysis are not required for the objects that you are profiling, use the configuration parameters to turn off these types of data profiling.

  • Analyze a smaller amount of data

    Use the Sample Set Filter (WHERE clause) and Random Sample Rate configuration parameters to limit the amount of data that is profiled.

If the source data for profiling is stored in an Oracle Database, it is recommended that the source schema be located on the same database instance as the profile workspace. You can do this by installing the workspace into the same Oracle instance as the source schema location. This avoids using a database link to move data from source to profiling workspace.

Tuning the Oracle Database for Better Data Profiling Performance

To ensure good data profiling performance, the computer that runs the Oracle Database must have certain hardware capabilities. In addition to this, you must optimize the Oracle Database instance on which you are performing data profiling.

For efficient data profiling, the following considerations are applicable:

  • Multiple Processors

  • Memory

  • I/O System

Multiple Processors

The computer that runs the Oracle Database needs multiple processors. Data profiling has been designed and tuned to take maximum advantage of the parallelism provided by the Oracle Database. When profiling large tables (more than 10 million rows), a multiprocessor computer is highly recommended.

Hints are used in the queries required to perform data profiling, and the degree of parallelism is picked up from the initialization parameter file of the Oracle Database. The default initialization parameter file contains parameters that take advantage of parallelism.

Memory

It is important to ensure a high memory hit ratio during data profiling. You can do this by assigning a larger System Global Area (SGA). It is recommended that the size of the SGA be at least 500 MB. If possible, configure it to 2 GB or 3 GB.

For advanced database users, it is recommended that you observe the buffer cache hit ratio and the library cache hit ratio. Aim for a buffer cache hit ratio higher than 95% and a library cache hit ratio higher than 99%.
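The following queries against the standard V$SYSSTAT and V$LIBRARYCACHE dynamic performance views are one common way to observe these ratios; they are offered as a hedged example rather than the only method.

    -- Buffer cache hit ratio (percent)
    SELECT ROUND(100 * (1 - phy.value / (db.value + con.value)), 2) AS buffer_cache_hit_pct
    FROM v$sysstat phy, v$sysstat db, v$sysstat con
    WHERE phy.name = 'physical reads'
      AND db.name  = 'db block gets'
      AND con.name = 'consistent gets';

    -- Library cache hit ratio (percent)
    SELECT ROUND(100 * SUM(pins - reloads) / SUM(pins), 2) AS library_cache_hit_pct
    FROM v$librarycache;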

I/O System

The capabilities of the I/O system have a direct impact on the data profiling performance. Data profiling processing frequently performs full table scans and massive joins. Since today's CPUs can easily out-drive the I/O system, you must carefully design and configure the I/O subsystem. Keep in mind the following considerations that aid better I/O performance:

  • You need a large number of disk spindles to support uninterrupted CPU and I/O cooperation. If you have only a few disks, the I/O system is not geared towards a high degree of parallel processing. It is recommended to have a minimum of two disks for each CPU.

  • Configure the disks. It is recommended that you create logical stripe volumes on the existing disks, each striping across all available disks. Use the following formula to calculate the stripe width:

    MAX(1,DB_FILE_MULTIBLOCK_READ_COUNT/number_of_disks) X DB_BLOCK_SIZE

    Here, DB_FILE_MULTIBLOCK_READ_COUNT and DB_BLOCK_SIZE are parameters that you set in your database initialization parameter file. You can also use a stripe width that is a multiple of the value returned by the formula. A worked example follows this list.

    To create and maintain logical volumes, you need volume management software such as Veritas Volume Manager or Sun Storage Manager. If you are using Oracle Database 10g or a later version and you do not have any volume management software, you can use the Automatic Storage Management feature of the Oracle Database to spread the workload across disks.

  • Create different stripe volumes for different tablespaces. It is possible that some of the tablespaces occupy the same set of disks.

    For data profiling, the USERS and the TEMP tablespaces are normally used at the same time. So you can consider placing these tablespaces on separate disks to reduce interference.
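As a worked example of the stripe width formula above, assume that DB_FILE_MULTIBLOCK_READ_COUNT is 16, DB_BLOCK_SIZE is 8 KB, and there are 8 disks; these values are illustrative only.

    MAX(1, 16/8) X 8 KB = 2 X 8 KB = 16 KB

In this case a stripe width of 16 KB, or a multiple of it such as 32 KB or 64 KB, would be appropriate.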

Using Data Rules

In addition to deriving data rules based on the results of data profiling, you can define your own data rules. You can bind a data rule to multiple tables within the project in which the data rule is defined. An object can contain any number of data rules.

Use the Design Center to create and edit data rules. Once you create a data rule, you can use it in any of the following scenarios.

Using Data Rules in Data Profiling

When you are using data profiling to analyze tables, you can use data rules to analyze how well data complies with a given rule and to collect statistics. From the results, you can derive a new data rule. For example, if data profiling determines that the majority of records have a value of red, white, or blue for a particular column, a new data rule can be derived that defines the color domain (red, white, and blue). This rule can then be reused to profile other tables, or reused in cleansing and auditing.
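As a hedged sketch (not the code that Warehouse Builder generates), the following query counts the rows of a hypothetical PRODUCTS table whose COLOR value falls outside the discovered domain:

    -- Rows that violate the discovered color domain; null values are not counted
    SELECT COUNT(*) AS defects
    FROM PRODUCTS
    WHERE COLOR NOT IN ('red', 'white', 'blue');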

Using Data Rules in Data Cleansing and Schema Correction

Data rules can be used in two ways to cleanse data and correct schemas. The first way is to convert a source schema into a new target schema where the structure of the new tables strictly adheres to the data rules. The new tables would then have the right data types, constraints would be enforced, and schemas would be normalized. The second way data rules are used is in a correction mapping that validates the data in a source table against the data rules, to determine which records comply and which do not. The analyzed data set is then corrected (for example, orphan records are removed, domain value inaccuracies are corrected, and so on) and the cleansed data set is loaded into the corrected target schema.

Using Data Rules in Data Auditing

Data rules are also used in data auditing. Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule, and they report defective data into auditing and error tables. In that sense they are like data-rule-based correction mappings, which also offer a report-only option for data that does not comply with the data rules.

Creating Data Rule Folders

Each data rule belongs to a data rule folder, which is a container object that groups related data rules. To create a data rule, you must first create a data rule folder.

To create a data rule folder, in the navigation tree, right-click Data Rules and select New. The Create Data Rule Folder dialog box is displayed.

Creating Data Rules

The Data Rules folder in the Project Explorer contains the data rules. Every data rule must belong to a data rule folder. The subfolder DERIVED_DATA_RULES contains the data rules derived as a result of data profiling. You can create additional data rule folders to contain any data rules that you create.

To create a data rule:

  1. Right-click the Data Rule folder in which the data rule should be created and select New.

    The Welcome page of the Create Data Rule Wizard is displayed.

  2. On the Welcome page, click Next.

  3. On the Name and Description page, specify a name and an optional description for the data rule. Click Next.

  4. On the Define Rule page, specify the type of data rule to create. Also specify any additional information required to create the data rule. Click Next.

    See "Defining the Data Rule" for information about defining the data rule.

    For example, when you create a Domain Range rule, you must specify the values that represent the valid domain values.

  5. On the Summary page, review the selections you made in the wizard. Click Back to modify any selected values. Click Finish to create the data rule.

    The data rule is added to the data rule folder under which you created the data rule.

Defining the Data Rule

Use the Define Rule page or the Define Rule tab to provide details about the data rule. The top section of the page displays the Type list that represents the type of data rule. When you are creating a data rule, expand the Type field to view the types of data rules and select the type you want to create. When you edit a data rule, the Type field is disabled as you cannot change the type of data rule once it is created. For more information about types of data rules, see "Types of Data Rules".

Note:

When you are deriving a data rule, the Type field is automatically populated and you cannot edit this value.

The bottom section of this page specifies additional details about the data rule. The number and names of the fields displayed in this section depend on the type of data rule you create.

For example, if you select Custom as the Type, use the Attributes section to define the attributes required for the rule. Use the Ellipsis button on the Expression field to define a custom expression involving the attributes you defined in the Attributes section.

If you select Domain Range as the type of data rule, the bottom section of the page provides fields to specify the data type of the range, the minimum value, and the maximum value. When you are deriving a data rule, some of these fields are populated based on the profiling results from which you are deriving the rule. You can edit these values.

Editing Data Rules

After you create a data rule, you can edit its definition. You can rename the data rule and edit its description. You cannot change the type of data rule. However, you can change the other parameters specified for the data rule. For example, for a Domain Range type of data rule, you can edit the data type of the range, the minimum range value, and the maximum range value.

To edit a data rule:

  1. In the Project Explorer, right-click the data rule and select Open Editor.

    The Edit Data Rule dialog box is displayed.

  2. On the Name tab, you can perform the following tasks:

    • To rename a data rule, select the name and enter the new name.

    • To edit the description for the data rule, select the description and enter the new description.

  3. On the Define tab, edit the properties of the data rule.

    Note:

    You cannot change the type of data rule. You can only modify the properties related to that type of data rule such as the domain bounds, domain list, number of attributes in a unique key, and so on.

Applying Data Rules to Objects

Applying a data rule to an object binds the definition of the data rule to the object. For example, binding a rule to the table Dept ensures that the rule is implemented for the specified attribute in the table. You apply a data rule using the Data Object Editor. You can also apply a derived data rule from the Data Rule panel of the Data Profile Editor.

The Apply Data Rule Wizard enables you to apply a data rule to a data object. You can apply precreated data rules or any data rule you created to data objects. The types of data objects to which you can apply data rules are tables, views, materialized views, and external tables.

To apply a data rule to a data object:

  1. In the Project Explorer, right-click the object to which you want to apply a data rule and select Open Editor.

    The Data Object Editor for the data object is displayed.

  2. Go to the Data Rules tab.

    Any data rules already bound to the data object are displayed on this tab.

  3. Click Apply Rule.

    The Apply Data Rule Wizard is displayed.

  4. On the Welcome page, click Next.

  5. On the Select Rule page, select the data rule that you want to apply to the data object and click Next.

    Data rules are grouped under the nodes BUILT_IN, DERIVED_DATA_RULES, and any other data rule folders that you create.

    The BUILT_IN node contains the default data rules defined in the workspace. These include rules such as foreign key, unique key, and not null rules.

    The DERIVED_DATA_RULES node lists all the data rules that were derived as a result of data profiling.

  6. On the Name and Description page, enter a name and an optional description for the applied data rule. Click Next.

  7. On the Bind Rule Parameters page, use the Binding list to select the column in the data object to which the data rule must be applied. Click Next.

  8. On the Summary page, review the selections you made on the previous wizard pages. Click Back to modify selected values. Click Finish to apply the data rule.

    The data rule is bound to the data object and is listed on the Data Rules tab.

Monitoring Data Quality Using Data Auditors

Data auditors are objects that you can use to continuously monitor your source schema to ensure that the data adheres to the defined data rules. You can monitor an object only if you have defined data rules for the object. You can create data auditors for tables, views, materialized views, and external tables.

To monitor data quality, perform the following steps:

  1. Create a data auditor containing the data objects that you want to monitor.

    See "Creating Data Auditors"

  2. Run the data auditor to identify the records that do not comply with the data rules defined on the data objects. You can either run data auditors manually or schedule them to run at specified times.

    See "Auditing Data Objects Using Data Auditors" for information about running data auditors.

Note:

You cannot import metadata for data auditors in Merge mode. For more information about import mode options, refer to the Oracle Warehouse Builder Installation and Administration Guide.

Creating Data Auditors

Use the Create Data Auditor Wizard to create data auditors. Data auditors are part of an Oracle module in a project.

To create a data auditor:

  1. Expand the Oracle module in which you want to create the data auditor.

  2. Right-click Data Auditors and select New.

    The Create Data Auditor Wizard is displayed.

  3. On the Name and Description page, enter a name and an optional description for the data auditor. Click Next.

  4. On the Select Objects page, select the data objects that you want to audit and click Next.

    The Available section lists the objects available for auditing. This contains only objects that have data rules bound to them. The Selected section contains the objects that are selected for auditing. Use the shuttle buttons to move objects to the Selected section. You can select multiple objects by holding down the Ctrl key while selecting objects.

  5. On the Choose Actions page, specify the action to be taken for records that do not comply with data rules bound to the selected objects and click Next.

    See "Specifying Data Correction Actions".

  6. On the Summary page, review the selections you made. Click Back to modify any selected values or click Finish to create the data auditor.

The new data auditor is added to the Data Auditors node. At this stage, only the metadata for the data auditor is stored in your workspace. To use this data auditor to monitor the quality of data in your data objects, you must run the data auditor.

Specifying Data Correction Actions

Use the Choose Actions page of the Create Data Auditor Wizard or the Choose Action tab of the Edit Data Auditor dialog box to specify data correction actions. This page contains two sections: Error threshold mode and Data Rules.

Error threshold mode

Error threshold mode is used to determine the compliance of data to data rules in the objects. Select one of the following options:

  • Percent: The data auditor will set the audit result based on the percentage of records that do not comply with the data rule. This percentage is specified in the rule's Defect Threshold value.

  • Six Sigma: The data auditor will set the audit result based on the Six Sigma values for the data rules. If the calculated Six Sigma value for any rule is less than the specified Sigma Threshold value, then the data auditor will set AUDIT_RESULT to 2.

Data Rules

The Data Rules section lists the data rules applied to the objects selected on the Select Object page. For each rule, specify the following:

  • Action: The action to be performed if data in the source object does not comply with the data rule. Select Report to ensure that the data rule is audited. Select Ignore if you want the data rule to be ignored.

  • Defect Threshold: The percent of records that should comply with the data rules to ensure successful auditing. Specify a value between 1 and 100. This value is ignored if you select Six Sigma in the Error threshold mode section.

  • Sigma Threshold: The required success rate. Specify a number between 0 and 7. If you set the value to 7, no failures are allowed. This value is ignored if you select Percent in the Error threshold mode section.

Editing Data Auditors

After you create a data auditor, you can edit it and modify any of its properties.

To edit a data auditor:

  1. In the Project Explorer, right-click the data auditor and select Open Editor.

    The Edit Data Auditor dialog box is displayed.

  2. On the Name tab, enter a new name or description for the data auditor.

  3. On the Select Objects tab, add or remove objects that will be audited as part of the data auditor using the arrows.

  4. On the Choose Actions tab, edit the data correction actions you specified.

    See "Specifying Data Correction Actions".

  5. On the Reconcile Objects tab, select the check box to the left of an object to reconcile its definition with the latest repository definition. Click Reconcile.

  6. Click OK to close the Edit Data Auditor dialog box.

Auditing Data Objects Using Data Auditors

After you create a data auditor, you can use it to monitor the data in your data objects. This ensures that the data rule violations for the objects are detected. When you run a data auditor, any records that violate the data rules defined on the data objects are written to the error tables.

There are two ways of using data auditors:

  • Manually Running Data Auditors

  • Automatically Running Data Auditors

Manually Running Data Auditors

To check if the data in the data object adheres to the data rules defined for the object, you must run the data auditor. You can run data auditors from the Design Center or the Control Center Manager. To run a data auditor from the Design Center, right-click the data auditor and select Start. In the Control Center Manager, select the data auditor, and from the File menu, select Start. The results are displayed in the Job Details window as described in "Data Auditor Execution Results".

Automatically Running Data Auditors

You can automate the process of running a data auditor using the following steps:

  1. Create a process flow that contains a Data Auditor Monitor activity.

  2. Schedule this process flow to run at a predefined time.

    For more information about scheduling objects, see "Process for Defining and Using Schedules".

Figure 23-11 displays a process flow that contains a Data Auditor Monitor activity. In this process flow, LOAD_EMP_MAP is a mapping that loads data into the EMP table. If the data load is successful, the data auditor EMP_DATA_AUDIT is run. The data auditor monitors the data in the EMP table based on the data rules defined for the table.

Figure 23-11 Data Auditor Monitor Activity in a Process Flow

Description of Figure 23-11 follows
Description of "Figure 23-11 Data Auditor Monitor Activity in a Process Flow"

Data Auditor Execution Results

After you run a data auditor, the Job Details window displays the details of the execution. The Job Details window contains two tabs: Input Parameters and Execution Results. Note that the Job Details window is displayed only when you set the deployment preference Show Monitor to true. For more information about deployment preferences, see "Deployment Preferences".

Figure 23-12 displays the Execution Results tab of the Job Details window.

Figure 23-12 Data Auditor Execution Results

Description of Figure 23-12 follows
Description of "Figure 23-12 Data Auditor Execution Results"

The Input Parameters tab contains the values of input parameters used to run the data auditor. The Execution Results tab displays the results of running the data auditor. This tab contains two sections: Row Activity and Output Parameters.

The Row Activity section contains details about the inserts into the error table for each step. Note that when more than one data rule is specified, a multitable insert may be used in the data auditor. In this case, the row counts may not be accurate.

In Figure 23-12, the data rule called E_NOT_NULL inserted one record into the error table.

The Output Parameters section contains the following three parameters:

  • AUDIT_RESULT: Indicates the result of running the data auditor. The possible values for this parameter are as follows:

    0: No data rule violations occurred.

    1: At least one data rule violation occurred, but no data rule failed to meet the minimum quality threshold as defined in the data auditor.

    2: At least one data rule failed to meet the minimum quality threshold.

    For more information about setting the threshold, see the step on choosing actions in "Creating Data Auditors".

  • EO_<data_rule_name>: Represents the calculated error quality for the specified data rule. 0 indicates all errors and 100 indicates no errors.

  • SO_<data_rule_name>: Represents the Six Sigma quality calculated for the specified data rule.

Configuring Data Auditors

During the configuration phase, you assign physical deployment properties to the data auditor you created by setting the configuration parameters. The Configuration Properties dialog box enables you to configure the physical properties of the data auditor.

To configure a data auditor:

  1. From the Project Explorer, expand the Databases node and then the Oracle node.

  2. Right-click the name of the data auditor you want to configure and select Configure.

    The Configuration Properties dialog box is displayed.

  3. Based on your requirement, configure the parameters listed in Run Time Parameters, Data Auditor Parameters, and Code Generation Options.

Run Time Parameters

Default Purge Group: This parameter is used when executing the package. Each audit record in the runtime schema is assigned to the purge group specified.

Bulk size: The number of rows to be fetched as a batch while processing cursors.

Analyze table sample percentage: The percentage of rows to be sampled when the target tables are analyzed. You analyze target tables to gather statistics that you can use to improve performance while loading data into the target tables.

Commit frequency: The number of rows processed before a commit is issued.

Maximum number of errors: The maximum number of errors allowed before the execution of this step is terminated.

Default Operating Mode: The operating mode used. The options you can select are Row based, Row based (target only), Set based, Set based fail over to row based, Set based fail over to row based (target only).

Default Audit Level: Use this parameter to indicate the audit level used when executing the package. When the package is run, the amount of audit information captured in the runtime schema depends on the value set for this parameter.

The options you can select are as follows:

ERROR DETAILS: At runtime, error information and statistical auditing information is recorded.

COMPLETE: All auditing information is recorded at runtime. This generates a huge amount of diagnostic data which may quickly fill the allocated tablespace.

NONE: No auditing information is recorded at runtime.

STATISTICS: At runtime, statistical auditing information is recorded.

Data Auditor Parameters

This category uses the same name as the data auditor and contains the following generic data auditor configuration parameters:

Generation comments: Specify additional comments for the generated code.

Threshold Mode: Specify the mode that should be used to measure failure thresholds. The options are PERCENTAGE and SIX SIGMA.

Language: The language used to define the generated code. The options are PL/SQL (default) and UNDEFINED. Ensure that PL/SQL (default) is selected.

Deployable: Select this option to indicate that you want to deploy this data auditor. Warehouse Builder generates code only if the data auditor is marked as deployable.

Referred Calendar: Specify the schedule to associate with the data auditor. The schedule defines when the data auditor will run.

Code Generation Options

ANSI SQL syntax: Select this option to use ANSI SQL code in the generated code. If this option is not selected, Oracle SQL syntax is generated.

Commit control: Specifies how commit is performed. The options available for this parameter are: Automatic, Automatic Correlated, and Manual. Ensure that this parameter is set to Automatic.

Enable Parallel DML: Select this option to enable parallel DML at runtime.

Analyze table statistics: Select this option to generate the statement used to collect statistics for the data auditor. If the target table is not in the same schema as the mapping and you want to analyze the table, you must grant ANALYZE ANY to the schema that owns the mapping.

Optimized Code: Select this option to indicate that optimized code should be generated.

Generation Mode: Select the mode in which optimized code should be generated. The options you can select are: All Operating Modes, Row based, Row based (target only), Set based, Set based fail over to row based, and Set based fail over to row based (target only).

Use Target Load Ordering: Select this option to generate code for target load ordering.

Error Trigger: Specify the name of the error trigger procedure.

Bulk Processing code: Select this option to generate bulk processing code.

Viewing Data Auditor Error Tables

When you run a data auditor, either manually or as part of the process flow, records that do not comply with the data rules defined on the objects contained in the data auditor are written to error tables. Each object contained in the data auditor has a corresponding error table that stores non-compliant records for that object.

You can view all non-compliant records that are written to error tables by using the Repository Browser.

To view error tables created as a result of data auditor execution:

  1. Grant privileges on the error tables as described in "Granting Privileges on Error Tables".

  2. Use the Repository Browser to view the error tables.

Granting Privileges on Error Tables

Before you view data stored in error tables using the Repository Browser, you must grant privileges on the error tables to the OWBSYS user. This enables the Repository Browser to access error table data.

To grant privileges on error tables:

  1. In SQL*Plus, log in to the schema containing the error tables.

    The error table for an object is stored in the same schema as the object.

  2. Run the SQL script OWB_ORACLE_HOME\owb\rtp\sql\grant_error_table_privileges.sql.

  3. When prompted, enter the name of the error table for which you want to grant privileges.

    If you did not specify a name for the error table of an object using the Error Table Name property, Warehouse Builder provides a default name. For objects that use error tables, the default error table name is the object name suffixed by "_ERR". An example session is shown after these steps.

  4. Repeat steps 2 and 3 for each error table to which you want to grant privileges.
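As an illustration only, a session for an EMPLOYEES table whose error table uses the default name EMPLOYEES_ERR might look like the following; the exact prompts and the privileges granted are determined by the script itself.

    SQL> CONNECT hr
    SQL> @OWB_ORACLE_HOME\owb\rtp\sql\grant_error_table_privileges.sql

When the script prompts for the error table name, you would enter EMPLOYEES_ERR.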

Setting Data Watch and Repair for Oracle Master Data Management (MDM)

Data Watch and Repair (DWR) is a profiling and correction solution designed to assist data governance in Oracle's Master Data Management (MDM) solutions. MDM applications need to provide a single consolidated view of data. To do this, they need to first clean up a system's master data before they can share it with multiple connected entities.

Warehouse Builder provides data profiling and data correction functionality that enables MDM applications to cleanse and consolidate data. You can use DWR for the following MDM applications:

  • Customer Data Hub (CDH)

  • Product Information Management (PIM)

  • Universal Customer Master (UCM)

Overview of Data Watch and Repair (DWR) for MDM

Data Watch and Repair (DWR) enables you to analyze, cleanse, and consolidate data stored in MDM databases using the following:

  • Data profiling

    Data profiling is a data analysis method that enables you to detect and measure defects in your source data.

    For more information about data profiling, see "About Data Profiling".

  • Data rules

    Data rules help ensure data quality by determining the legal data and relationships in the source data. You can import MDM-specific data rules, define your own data rules, or derive data rules based on the data profiling results.

    For more information about data rules, see "About Data Rules".

  • Data Correction

    Data correction enables you to correct any inconsistencies, redundancies, and inaccuracies in both the data and metadata. You can automatically create correction mappings to cleanse source data.

    For more information about data correction, see "About Data Correction and Augmentation".

DWR enables you to regularly measure compliance with crucial business rules. As you discover inconsistencies in the data, you can define and apply new data rules to ensure data quality.

Predefined Data Rules for MDM

Warehouse Builder provides a set of data rules that are commonly used in MDM applications. These include the following customer data rules that can be used in both Customer Data Hub (CDH) and Universal Customer Master (UCM) applications:

  • Attribute Completeness

  • Contact Completeness

  • Data Type

  • Data Domain

  • Restricted Values

  • Unique Key Discovery

  • Full Name Standardization

  • Common Pattern

  • Name Capitalization

  • Extended Phone Numbers

  • International Phone Numbers

  • No Access List by Name Only

  • No Access List by Name or SSN

  • No Email List

For more details about these data rules, refer to the Oracle Data Watch and Repair for MDM User's Guide.

Prerequisites for Performing Data Watch and Repair (DWR)

To use Data Watch and Repair (DWR), you need the following software:

  • Oracle Database 11g Release 1 (11.1) or later

  • One or more of the following Master Data Management (MDM) applications: Customer Data Hub (CDH), Product Information Management (PIM), or Universal Customer Master (UCM).

For MDM applications that run on an Oracle Database, you can use DWR directly. For MDM applications that do not run on an Oracle Database, you must first set up a gateway to the third-party database, as illustrated below.
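
As a rough illustration of the non-Oracle case: once the gateway is configured, the MDM schema is typically reached from the Oracle side through a database link. The link name, credentials, TNS alias, and table name below are assumptions for this sketch, not values prescribed by Warehouse Builder.

    -- Illustrative only: a database link that connects through an
    -- already-configured gateway to the third-party MDM database
    CREATE DATABASE LINK mdm_source_link
      CONNECT TO "mdm_user" IDENTIFIED BY "mdm_password"
      USING 'gateway_tns_alias';

    -- Quick connectivity check through the link
    SELECT COUNT(*) FROM "CUSTOMER_TABLE"@mdm_source_link;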

Steps to Perform Data Watch and Repair (DWR) Using Warehouse Builder

  1. Create a location corresponding to the Master Data Management (MDM) application database.

    Use the Oracle node under the Databases node in the Connection Explorer. Specify the details of the MDM database such as the user name, password, host name, port, service name, and database version.

  2. In the Project Explorer, expand the Applications node to display the nodes for the MDM applications.

    The CDH node represents the Customer Data Hub application, the PIM node represents the Product Information Management application, and the UCM node represents the Universal Customer Master application.

  3. Right-click the node corresponding to the type of MDM application for which you want to perform DWR and select Create CMI Module.

    Use the Create Module Wizard to create a module that stores your MDM metadata definitions. Ensure that you select the location you created in step 1 while creating the module.

  4. Import metadata from your MDM application into the module created in step 3. Right-click the module and select Import.

    The Metadata Import Wizard is displayed. This wizard enables you to import MDM metadata. For more information, see "Using the Import Metadata Wizard".

  5. Import data rules specific to MDM as described in "Importing MDM Data Rules".

  6. Apply data rules to the MDM application tables as described in "Applying Data Rules to Objects".

    Applying data rules to tables enables you to determine if your table data complies with the business rules specified in the data rules. You can apply data rules you imported in step 5 or other data rules that you created.

    For more information about creating data rules, see "Creating Data Rules".

  7. Create a data profile that contains all tables from the MDM application that you want to profile.

    For more information about creating data profiles, see "Create a Data Profile".

  8. Perform data profiling on the MDM application objects as described in "Profile the Data".

  9. View the data profiling results as described in "View Profile Results".

  10. (Optional) Derive data rules based on data profiling results as described in "Derive Data Rules".

    Data rules derived from data profiling results are automatically applied to the table.

  11. Create correction mappings as described in "Steps to Create Corrections".

  12. Correct data and metadata using the correction mappings generated by Warehouse Builder as described in "Steps to Deploy Correction Objects".

  13. Write the corrected data, stored in the correction objects created in step 12, to the MDM application as described in "Writing Corrected Data and Metadata to the MDM Application".

Importing MDM Data Rules

Data rules required for Customer Data Hub (CDH) and Universal Customer Master (UCM) applications are provided in the OWB_ORACLE_HOME/misc/dwr/customer_data_rules.mdl file. To import these data rules, from the Design menu, select Import, then Warehouse Builder Metadata. In the Metadata Import dialog box, select customer_data_rules.mdl and click OK. For more information about using the Metadata Import dialog box, click Help.

The imported data rules are listed in the Global Explorer, under the MDM Customer Data Rules node of the Public Data Rules node.

Writing Corrected Data and Metadata to the MDM Application

The cleansed and corrected data is contained in the correction objects created as a result of data profiling.

To be more efficient, you can write back only the rows that need correction. To do this, modify the generated correction mapping: delete the branch that passes compliant rows through unchanged (the branch containing the minus filter and minus set operators), and retain only the branch that processes the corrected rows.

Use the following steps to write corrected data to the source MDM application:

  1. Create a mapping using the Mapping Editor.

  2. Drag and drop the corrected table onto the Mapping Editor. This represents the source table.

  3. For UCM, drag and drop the interface table that corresponds to the base table with which you are working.

    Use the MDM application tools and documentation to determine the base table for a particular interface table.

  4. Map the columns from the corrected table to the interface table.

  5. Deploy and execute the mapping to write corrected data to the source MDM application.

  6. Update the base table with the changes made to the interface table. You can use Siebel Enterprise Integration Manager (EIM) to do this; EIM can be run from the command line or from a graphical user interface (GUI).

    For more details about using EIM, refer to the Siebel Enterprise Integration Manager Administration Guide.