23 Understanding Data Quality Management

Today, more than ever, organizations realize the importance of data quality. By ensuring that quality data is stored in your data warehouse or business intelligence application, you also ensure the quality of information for dependent applications and analytics.

Oracle Warehouse Builder offers a set of features that assist you in creating data systems that provide high quality information to your business users. You can implement a quality process that assesses, designs, transforms, and monitors quality. Within these phases, you use specific Warehouse Builder functionality to deliver higher quality information.

This chapter contains the following topics:

  • About the Data Quality Management Process

  • About Data Profiling

  • About the Data Profile Editor

  • About Six Sigma

  • About Data Correction and Augmentation

  • About Data Rules

  • About Quality Monitoring

  • Performing Data Profiling

About the Data Quality Management Process

Quality data is crucial to decision-making and planning. The aim of building a data warehouse is to have an integrated, single source of data that can be used to make business decisions. Since the data is usually sourced from a number of disparate systems, it is important to ensure that the data is standardized and cleansed before loading into the data warehouse.

Warehouse Builder provides functionality that enables you to effectively manage data quality by assessing, transforming, and monitoring your data. Using Warehouse Builder for data management provides the following benefits:

  • Provides an end-to-end data quality solution.

  • Enables you to include data quality and data profiling as an integral part of your data integration process.

  • Stores metadata regarding the quality of your data alongside your data definitions.

  • Automatically generates the mappings that you can use to correct data. These mappings are based on the business rules that you choose to apply to your data and decisions you make on how to correct data.

Phases in the Data Quality Lifecycle

Ensuring data quality involves the following phases:

  • Quality Assessment

  • Quality Design

  • Quality Transformation

  • Quality Monitoring

Figure 23-1 shows the phases involved in providing high quality information to your business users.

Figure 23-1 Phases Involved in Providing Quality Information


Quality Assessment

In the quality assessment phase, you determine the quality of the source data. The first step in this phase is to import the source data, which could be stored in different sources, into Warehouse Builder. You can import metadata and data from both Oracle and non-Oracle sources.

After you load the source data, you use data profiling to assess its quality. Data profiling is the process of uncovering data anomalies, inconsistencies, and redundancies by analyzing the content, structure, and relationships within the data. The analysis and data discovery techniques form the basis for data monitoring.

Quality Design

The quality design phase consists of designing your quality processes. You can specify the legal data within a data object or legal relationships between data objects using data rules.

You can also correct and augment your data using data quality operators.

As part of the quality design phase, you also design the transformations that ensure data quality. These transformations could be mappings that are generated by Warehouse Builder as a result of data profiling or mappings you create.


Quality Transformation

The quality transformation phase consists of running the correction mappings you designed to correct the source data.

Quality Monitoring

Quality monitoring is the process of examining warehouse data over time and alerting you when the data violates business rules set for the data.

About Data Profiling

Data profiling is the first step for any organization to improve information quality and provide better decisions. It is a robust data analysis method available in Warehouse Builder that you can use to discover and measure defects in your data before you start working with it. Because of its integration with the ETL features in Warehouse Builder and other data quality features, such as data rules and built-in cleansing algorithms, you can also generate data cleansing mappings and schema correction scripts. This enables you to automatically correct any inconsistencies, redundancies, and inaccuracies in both the data and metadata.

Data profiling enables you to discover many important things about your data. Some common findings include the following:

  • A domain of valid product codes

  • A range of product discounts

  • Columns that hold the pattern of an e-mail address

  • A one-to-many relationship between columns

  • Anomalies and outliers within columns

  • Relations between tables even if they are not documented in the database

To begin the process of data profiling, you must first create a data profile using the Design Center. You can then profile the objects contained in the data profile and then create correction tables and mappings.

Benefits of Data Profiling

Using the data profiling functionality in Warehouse Builder enables you to:

  • Profile data from any source or combination of sources that Warehouse Builder can access.

  • Explore data profiling results in tabular or graphical format.

  • Drill down into the actual data related to any profiling result.

  • Derive data rules, either manually or automatically, based on the data profiling results.

  • Attach any data rule to a target object and select an action to perform if the rule fails.

  • Create a data auditor from a data rule to continue monitoring the quality of data being loaded into an object.

  • Derive quality indices such as six-sigma valuations.

  • Profile or test any data rules that you want to verify before putting them in place.

Types of Data Profiling

Following the selection of data objects, determine the aspects of your data that you want to profile and analyze.

Data profiling offers three main types of analysis: attribute analysis, functional dependency, and referential analysis.

In addition to these types of analysis, you can create custom profiling processes using data rules, which allow you to validate custom rules against the actual data and obtain a score of their accuracy.

Figure 23-2 displays a representation of the types of data profiling and how you can perform each type.

Figure 23-2 Data Profiling Overview


Attribute Analysis

Attribute analysis seeks to discover both general and detailed information about the structure and content of data stored within a given column or attribute. Attribute analysis looks for information about patterns, domains, data types, and unique values.

Attribute analysis consists of pattern analysis, domain analysis, data type analysis, and unique key analysis.

Pattern Analysis 

Pattern analysis attempts to discover patterns and common types of records by analyzing the string of data stored in the attribute. It identifies the percentages of your data that comply with a certain regular expression format pattern found in the attribute. Using these pattern results, you can create data rules and constraints to help clean up current data problems. Some commonly identified patterns include dates, e-mail addresses, phone numbers, and social security numbers.

Table 23-1 shows a sample attribute, Job Code, that could be used for pattern analysis.

Table 23-1 Sample Columns Used for Pattern Analysis

Job ID   Job Code
7        337-A-55
9        740-B-74
10       732-C-04
20       43-D-4


Table 23-2 shows the possible results from pattern analysis, where D represents a digit and X represents a character. After looking at the results and knowing that it is company policy for all job codes to be in the format DDD-X-DD, you can derive a data rule that requires all values in this attribute to conform to this pattern.

Table 23-2 Pattern Analysis Results

Job Code   % Occurred
DDD-X-DD   75%
DD-X-D     25%
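
The following is a minimal sketch, outside of Warehouse Builder, of how pattern compliance such as that shown in Table 23-2 could be computed; the sample values and the regular expression for the DDD-X-DD pattern are illustrative assumptions, not part of the product.

    import re

    # Hypothetical sample of Job Code values (as in Table 23-1).
    job_codes = ["337-A-55", "740-B-74", "732-C-04", "43-D-4"]

    # Regular expression corresponding to the company-policy pattern DDD-X-DD.
    pattern = re.compile(r"^\d{3}-[A-Z]-\d{2}$")

    compliant = sum(1 for code in job_codes if pattern.match(code))
    print(f"DDD-X-DD compliance: {compliant / len(job_codes):.0%}")  # prints 75%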


Domain Analysis 

Domain analysis identifies a domain, or set of commonly used values, within the attribute by capturing the most frequently occurring values. For example, the Status column in the Customers table is profiled and the results reveal that 90% of the values are among the following: "MARRIED", "SINGLE", "DIVORCED". Further analysis and drilling down into the data reveal that the other 10% consists mostly of misspelled versions of these words, with a few exceptions. The profiling configuration determines when a set of values qualifies as a domain, so review the configuration before accepting domain values. You can then let Warehouse Builder derive a rule that requires the data stored in this attribute to be one of the three values that qualified as a domain.
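
As a rough illustration of the idea behind domain discovery (not the Warehouse Builder implementation), the following sketch counts distinct values in a column and treats values above an assumed frequency threshold as the domain; the sample data and the 20% threshold are hypothetical.

    from collections import Counter

    # Hypothetical Status values; the threshold is an assumption for illustration.
    status_values = ["MARRIED", "SINGLE", "MARRIED", "DIVORCED", "SINGEL",
                     "MARRIED", "SINGLE", "DIVORCED", "MARRIED", "SINGLE"]
    DOMAIN_THRESHOLD = 0.20  # a value must cover at least 20% of the rows

    counts = Counter(status_values)
    total = len(status_values)
    domain = {value for value, count in counts.items()
              if count / total >= DOMAIN_THRESHOLD}
    compliance = sum(counts[v] for v in domain) / total
    print(sorted(domain), f"{compliance:.0%} compliant")  # 90% compliant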

Datatype Analysis 

Data type analysis enables you to discover information about the data types found in the attribute. This type of analysis reveals metrics such as minimum and maximum character length values, as well as scale and precision ranges. In some cases, a database column is of data type VARCHAR2 although the values in the column are all numbers; in that case, you may want to ensure that you load only numbers. Using data type analysis, you can have Warehouse Builder derive a rule that requires all data stored within an attribute to be of the same data type.
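
The following is a minimal sketch of the kind of check data type analysis performs, in this case detecting that a character column actually stores only numeric values; the sample column values are hypothetical.

    def numeric_fraction(values):
        """Return the fraction of non-null values that can be parsed as numbers."""
        non_null = [v for v in values if v is not None]
        numeric = 0
        for v in non_null:
            try:
                float(v)
                numeric += 1
            except ValueError:
                pass
        return numeric / len(non_null) if non_null else 0.0

    # Hypothetical VARCHAR2 column that actually holds only numbers.
    print(f"{numeric_fraction(['100', '250', '42', '7.5']):.0%} numeric")  # 100% numeric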

Unique Key Analysis 

Unique key analysis provides information to assist you in determining whether or not an attribute is a unique key. It does this by looking at the percentages of distinct values that occur in the attribute. You might determine that attributes with a minimum of 70% distinct values should be flagged for unique key analysis. For example, using unique key analysis you could discover that 95% of the values in the EMP_ID column are unique. Further analysis of the other 5% reveals that most of these values are either duplicates or nulls. You could then derive a rule that requires that all entries into the EMP_ID column be unique and not null.
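
As a simple illustration of the distinct-value measure used in unique key analysis, the sketch below computes the percentage of distinct values in a column; the sample values and the 70% flagging threshold are assumptions taken from the discussion above.

    def distinct_percentage(values):
        """Percentage of rows whose value is distinct within the column."""
        return 100.0 * len(set(values)) / len(values) if values else 0.0

    # Hypothetical EMP_ID values containing one duplicate and one null.
    emp_ids = [101, 102, 103, 104, 104, None]
    pct = distinct_percentage(emp_ids)
    print(f"{pct:.0f}% distinct -> flag for unique key analysis: {pct >= 70}")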

Functional Dependency

Functional dependency analysis reveals information about column relationships. This enables you to search for things such as one attribute determining another attribute within an object.

Table 23-3 shows the contents of the Employees table in which the attribute Dept. Location is dependent on the attribute Dept. Number. Note that the attribute Dept. Number is not dependent on the attribute Dept. Location.

Table 23-3 Employees Table

ID   Name      Salary   Dept Number   Dept Location
10   Alison    1000     10            SF
20   Rochnik   1000     11            London
30   Meijer    300      12            LA
40   John      500      13            London
50   George    200      13            London
60   Paul      600      13            London
70   Ringo     100      13            London
80   Yoko      600      13            London
90   Jones     1200     10            SF
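
A minimal sketch of the functional dependency check illustrated by Table 23-3 follows; the row data is taken from the table above, but the code itself is only an illustration of the concept, not Warehouse Builder's algorithm.

    from collections import defaultdict

    # (Dept Number, Dept Location) pairs from Table 23-3.
    rows = [(10, "SF"), (11, "London"), (12, "LA"), (13, "London"),
            (13, "London"), (13, "London"), (13, "London"), (13, "London"),
            (10, "SF")]

    locations_per_dept = defaultdict(set)
    for dept, location in rows:
        locations_per_dept[dept].add(location)

    # Dept Number -> Dept Location holds if every department maps to one location.
    holds = all(len(locs) == 1 for locs in locations_per_dept.values())
    print("Dept Location is functionally dependent on Dept Number:", holds)  # True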


Referential Analysis

Referential analysis attempts to detect aspects of your data objects that refer to other objects. The purpose behind this type of analysis is to provide insight into how the object you are profiling is related or connected to other objects. Because you are comparing two objects in this type of analysis, one is often referred to as the parent object and the other as the child object. Some of the common things detected include orphans, childless objects, redundant objects, and joins. Orphans are values that are found in the child object, but not found in the parent object. Childless objects are values that are found in the parent object, but not found in the child object. Redundant attributes are values that exist in both the parent and child objects.

Table 23-4 and Table 23-5 show the contents of two tables that are candidates for referential analysis. Table 23-4 is the child object and Table 23-5 is the parent object.

Table 23-4 Employees Table (Child)

ID   Name      Dept. Number   City
10   Alison    17             NY
20   Rochnik   23             SF
30   Meijer    23             SF
40   Jones     15             SD


Table 23-5 Department Table (Parent)

Dept. Number   Location
17             NY
18             London
20             SF
23             SF
55             HK


Referential analysis of these two objects would reveal that Dept. Number 15 from the Employees table is an orphan and Dept. Numbers 18, 20, and 55 from the Department table are childless. It would also reveal a join on the Dept. Number column.

Based on these results, you could derive referential rules that determine the cardinality between the two tables.
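
The set operations below give a minimal sketch of how orphans, childless values, and joins can be identified for the example in Table 23-4 and Table 23-5; the key values are taken from those tables, and the code is illustrative rather than Warehouse Builder's implementation.

    # Dept. Number values from the child (Employees) and parent (Department) tables.
    child_depts = {17, 23, 15}
    parent_depts = {17, 18, 20, 23, 55}

    orphans = child_depts - parent_depts    # in the child but not the parent
    childless = parent_depts - child_depts  # in the parent but not the child
    joins = child_depts & parent_depts      # present in both tables

    print("Orphans:", sorted(orphans))      # [15]
    print("Childless:", sorted(childless))  # [18, 20, 55]
    print("Joins:", sorted(joins))          # [17, 23]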

Data Rule Profiling

In addition to attribute analysis, functional dependency, and referential analysis, Warehouse Builder offers data rule profiling. Data rule profiling enables you to create rules to search for profile parameters within or between objects.

This is very powerful because it enables you to validate rules that apparently exist and are defined by the business users. By creating a data rule and then profiling with this rule, you can verify whether the data actually complies with the rule, and whether the rule needs amending or the data needs cleansing.

For example, you could create a rule that Income = Salary + Bonus for the Employee table shown in Table 23-6. You can then catch errors such as the one for employee Alison.

Table 23-6 Sample Employee Table

ID   Name      Salary   Bonus   Income
10   Alison    1000     50      1075 X
20   Rochnik   1000     75      1075
30   Meijer    300      35      335
40   Jones     1200     500     1700
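
A minimal sketch of checking such a rule against the rows of Table 23-6 is shown below; the row data comes from the table, while the code is only an illustration of the rule Income = Salary + Bonus, not of how Warehouse Builder evaluates data rules.

    # Rows from Table 23-6; the rule under test is Income = Salary + Bonus.
    employees = [
        {"id": 10, "name": "Alison",  "salary": 1000, "bonus": 50,  "income": 1075},
        {"id": 20, "name": "Rochnik", "salary": 1000, "bonus": 75,  "income": 1075},
        {"id": 30, "name": "Meijer",  "salary": 300,  "bonus": 35,  "income": 335},
        {"id": 40, "name": "Jones",   "salary": 1200, "bonus": 500, "income": 1700},
    ]

    violations = [e["name"] for e in employees
                  if e["income"] != e["salary"] + e["bonus"]]
    print("Rule violations:", violations)  # ['Alison']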


About the Data Profile Editor

The Data Profile Editor provides a single access point for managing and viewing data profile information as well as correcting metadata and data. It combines the functionality of a data profiler, a target schema generator, and a data correction generator. As a data profiler, it enables you to perform attribute analysis and structural analysis of selected objects. As a target schema generator, it enables you to generate a target schema based on the profile analysis and source table rules. Finally, as a data correction generator, it enables you to generate mappings and transformations to provide data correction.

Figure 23-3 displays the Data Profile Editor.

Figure 23-3 Data Profile Editor


The Data Profile Editor consists of the following:

  • Menu Bar

  • Toolbars

  • Object Tree

  • Property Inspector

  • Monitors Panel

  • Profile Results Canvas

  • Data Drill Panel

  • Data Rule Panel

Refer to the online help for detailed information about the contents of each panel.

About Six Sigma

Warehouse Builder provides Six Sigma results embedded within the other data profiling results to provide a standardized approach to data quality.

What is Six Sigma?

Six Sigma is a methodology that attempts to standardize the concept of quality in business processes. It achieves this goal by statistically analyzing the performance of business processes. The goal of Six Sigma is to improve the performance of these processes by identifying the defects, understanding them, and eliminating the variables that cause these defects.

Six Sigma metrics provide a quantitative measure of the number of defects for each 1,000,000 opportunities. The term "opportunities" can be interpreted as the number of records. The perfect score is 6.0, which is achieved when there are only 3.4 defects for each 1,000,000 opportunities. The score is calculated using the following formulas:

  • Defects Per Million Opportunities (DPMO) = (Total Defects / Total Opportunities) * 1,000,000

  • Defects (%) = (Total Defects / Total Opportunities)* 100%

  • Yield (%) = 100 - %Defects

  • Process Sigma = NORMSINV(1-((Total Defects) / (Total Opportunities))) + 1.5

    where NORMSINV is the inverse of the standard normal cumulative distribution.
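
A minimal sketch of the Process Sigma formula follows; it uses the inverse standard normal function from scipy in place of NORMSINV and illustrates the formula above rather than Warehouse Builder's internal calculation.

    from scipy.stats import norm  # norm.ppf is the inverse standard normal CDF (NORMSINV)

    def process_sigma(total_defects, total_opportunities):
        """Process Sigma = NORMSINV(1 - defect rate) + 1.5."""
        defect_rate = total_defects / total_opportunities
        return norm.ppf(1 - defect_rate) + 1.5

    # 3.4 defects per 1,000,000 opportunities corresponds to a score of about 6.0.
    print(round(process_sigma(3.4, 1_000_000), 1))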

Six Sigma Metrics for Data Profiling

Six Sigma metrics are also provided for data profiling in Warehouse Builder. When you perform data profiling, the number of defects and anomalies discovered are shown as Six Sigma metrics. For example, if data profiling finds that a table has a row relationship with a second table, the number of records in the first table that do not adhere to this row-relationship can be described using the Six Sigma metric.

Six Sigma metrics are calculated for the following measures in the Data Profile Editor:

  • Aggregation: For each column, the number of null values (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented data type (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented length (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented scale (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented precision (defects) to the total number of rows in the table (opportunities).

  • Patterns: For each column, the number of values that do not comply with the common format (defects) to the total number of rows in the table (opportunities).

  • Domains: For each column, the number of values that do not comply with the documented domain (defects) to the total number of rows in the table (opportunities).

  • Referential: For each relationship, the number of values that do not comply with the documented foreign key (defects) to the total number of rows in the table (opportunities).

  • Referential: For each column, the number of values that are redundant (defects) to the total number of rows in the table (opportunities).

  • Unique Key: For each unique key, the number of values that do not comply with the documented unique key (defects) to the total number of rows in the table (opportunities).

  • Unique Key: For each foreign key, the number of rows that are childless (defects) to the total number of rows in the table (opportunities).

  • Data Rule: For each data rule applied to the data profile, the number of rows that fail the data rule to the number of rows in the table.

About Data Correction and Augmentation

Warehouse Builder enables you to automatically create corrected data objects and correction mappings based on the results of data profiling. On top of these automated corrections that make use of the underlying Warehouse Builder architecture for data quality, you can create your own data quality mappings to correct and cleanse source data.

To perform data correction and augmentation, use one of the following methods:

  • Automatic data correction based on data profiling results

  • Data correction and augmentation using operators

Automatic Data Correction Based on Data Profiling Results

When you perform data profiling, Warehouse Builder generates corrections for the objects that you profiled. You can then decide to create corrected objects based on results of data profiling. The corrections are in the form of data rules that can be bound to the corrected object.

Types of Corrections for Source Data

You can perform the following types of corrections on source data objects:

  • Schema correction

    Schema correction creates scripts that you can use to create a corrected set of source data objects with data rules applied to them. The corrected data objects adhere to the data rules derived from the results of data profiling.

    The correction tables have names that are prefixed with TMP__. For example, when you profile the EMPLOYEES table, the correction table will be called TMP__EMPLOYEES.

  • Data Correction

    Data correction is the process of creating correction mappings to remove anomalies and inconsistencies in the source data before loading it into the corrected data objects. Correction mappings enforce the data rules defined on the data objects. While moving data from the "dirty" profile source tables into the corrected tables, these mappings correct records that do not comply with the data rules.

    The name of the correction mapping is the object name prefixed with M_. For example, the correction mapping for the EMPLOYEE table is called M_EMPLOYEE.

About Performing Data Correction

To perform data correction on source data, you must specify the following:

  • Data correction actions

  • Cleansing strategies for data correction

Data Correction Actions

Based on the data profiling results, Warehouse Builder derives a set of data rules that you can use to cleanse the source data. You can automatically generate corrections based on these data rules by performing data correction actions.

For each data rule derived as a result of data profiling, you must choose a correction action that specifies how data values that are not accepted due to data rule enforcement should be handled. The correction actions you can choose are:

  • Ignore: The data rule is ignored and, therefore, no values are rejected based on this data rule.

  • Report: The data rule is run only after the data has been loaded for reporting purposes. It is similar to the Ignore option, except that a report is created that contains the values that do not adhere to the data rules. This action can be used for some rule types only.

  • Cleanse: The values rejected by this data rule are moved to an error table where cleansing strategies are applied. When you select this option, you must specify a cleansing strategy.

    See "Cleansing Strategy" for details about specifying cleansing strategies.

Cleansing Strategies for Data Correction

When you decide to automatically generate corrected objects based on data profiling results, you must specify how inconsistent data from the source should be cleansed before being stored in the corrected object. To do this, you specify a cleansing strategy for each data rule that is applied to the correction object. Error tables are used to store the records that do not conform to the data rule.

The cleansing strategy you use depends on the type of data rule and the rule configuration. Table 23-7 describes the cleansing strategies and lists the types of data rules for which each strategy is applicable.

Table 23-7 Cleansing Strategies for Data Correction

Remove: Does not populate the target table with error records. Applies to: All.

Custom: Creates a function in the target table that contains a header, but no implementation details. You must add the implementation details to this function. Applies to: Domain List, Domain Pattern List, Domain Range, Common Format, No Nulls, Name and Address, Custom.

Set to Min: Sets the attribute value of the error record to the minimum value defined in the data rule. Applies to: Domain Range rules that have a minimum value defined.

Set to Max: Sets the attribute value of the error record to the maximum value defined in the data rule. Applies to: Domain Range rules that have a maximum defined.

Similarity: Uses a similarity algorithm based on permitted domain values to find a value that is similar to the error record. If no similar value is found, the original value is used. Applies to: Domain List rules with character data type.

Soundex: Uses a soundex algorithm based on permitted domain values to find a value that is similar to the error record. If no soundex value is found, the original value is used. Applies to: Domain List rules with character data type.

Merge: Uses the Match-Merge algorithm to merge duplicate records into a single row. Applies to: Unique Key.

Set to Mode: Uses the mode value to correct the error records if a mode value exists for the functional dependency partition that fails. Applies to: Functional Dependency.
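
As an informal illustration of the Set to Min and Set to Max strategies (hypothetical code, not the Warehouse Builder implementation), an out-of-range value can be clamped to the limits defined in a Domain Range rule:

    def cleanse_domain_range(value, min_value=None, max_value=None):
        """Clamp an out-of-range value, as in the Set to Min / Set to Max strategies."""
        if min_value is not None and value < min_value:
            return min_value
        if max_value is not None and value > max_value:
            return max_value
        return value

    # Hypothetical Domain Range rule: Salary must be between 100 and 10000.
    print(cleanse_domain_range(50, min_value=100, max_value=10000))     # 100
    print(cleanse_domain_range(25000, min_value=100, max_value=10000))  # 10000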


Data Correction and Augmentation Using Operators

Warehouse Builder provides functionality that enables you to design data correction and augmentation processes for the source data. While transforming the source data, you can use the following operators to ensure data quality:

  • Match-Merge Operator

  • Name and Address Operator

See Also:

Chapter 24, "Data Quality Operators" for information about these operators and their usage

About Data Rules

Data rules are definitions for valid data values and relationships that can be created in Warehouse Builder. They determine legal data within a table or legal relationships between tables. Data rules help ensure data quality. They can be applied to tables, views, dimensions, cubes, materialized views, and external tables. Data rules are used in many situations including data profiling, data and schema cleansing, and data auditing.

The metadata for a data rule is stored in the workspace. To use a data rule, you apply the data rule to a data object. For example, you create a data rule called gender_rule that specifies that valid values are 'M' and 'F'. You can apply this data rule to the emp_gender column of the Employees table. Applying the data rule ensures that the values stored for the emp_gender column are either 'M' or 'F'. You can view the details of the data rule bindings on the Data Rule tab of the Data Object Editor for the Employees table.

There are two ways to create a data rule. A data rule can be derived from the results of data profiling, or it can be created using the Data Rule Wizard. For more information about data rules, see "Using Data Rules".

Types of Data Rules

Table 23-8 describes the types of data rules.

Table 23-8 Types of Data Rules

Domain List: Defines a list of values that an attribute is allowed to have. Example: The Gender attribute can have 'M' or 'F'.

Domain Pattern List: Defines a list of patterns that an attribute is allowed to conform to. The patterns are defined in the Oracle Database regular expression syntax. Example: A pattern for a telephone number is (^[[:space:]]*[0-9]{3}[[:punct:]|[:space:]]?[0-9]{4}[[:space:]]*$).

Domain Range: Defines a range of values that an attribute is allowed to have. Example: The value of the Salary attribute can be between 100 and 10000.

Common Format: Defines a known common format that an attribute is allowed to conform to. This rule type has many subtypes: Telephone Number, IP Address, SSN, URL, and E-mail Address. Each subtype has predefined formats listed, and you can add more formats to the list. Example: An e-mail address should be in the following format: ^(mailto:[-_a-z0-9.]+@[-_a-z0-9.]+$).

No Nulls: Specifies that the attribute cannot have null values. Example: The department_id attribute for an employee in the Employees table cannot be null.

Functional Dependency: Defines that the data in the data object may be normalized. Example: The Dept_name attribute is dependent on the Dept_no attribute.

Unique Key: Defines whether an attribute or group of attributes is unique in the given data object. Example: The name of a department must be unique.

Referential: Defines the type of relationship (1:x) a value must have with another value. Example: The department_id attribute of the Departments table must have a 1:n relationship with the department_id attribute of the Employees table.

Name and Address: Uses the Name and Address support to evaluate a group of attributes as a name or an address.

Custom: Applies an SQL expression that you specify to its input parameters. Example: A custom rule called VALID_DATE has two input parameters, START_DATE and END_DATE. A valid expression for this rule is defined as follows: "THIS"."END_DATE" > "THIS"."START_DATE".


Implementation of Data Rules

Warehouse Builder uses different methods of applying data rules to a correction object. The method used depends on the type of data rule you are implementing.

Table 23-9 describes the methods used for object schema correction. It also lists the data rule types for which each correction is used.

Table 23-9 Data Rules Implementation for Schema Correction

Create Constraints: Creates a constraint reflecting the data rule on the correction table. If a constraint cannot be created, a validation message is displayed on the Data Rules Validation page of the Apply Data Rule Wizard. Used for: Custom, Domain List, Domain Pattern List, Domain Range, Common Format, No Nulls, Unique Key.

Change the data type: Changes the data type of the column to NUMBER or DATE according to the results of profiling. The data type is changed for data rules of type Is Number and Is Name.

Create a lookup table: Creates a lookup table and adds the appropriate foreign key or unique key constraints to the corrected table and the lookup table. Used for: Functional Dependency.

Name and Address Parse: Adds additional name and address attributes to the correction table. The name and address attributes correspond to a selection of the output values of the Name and Address operator. In the map that is created to cleanse data, a Name and Address operator is used to perform name and address cleansing. Used for: Name and Address.


About Quality Monitoring

Quality monitoring builds on your initial data profiling and data quality initiatives. It enables you to monitor the quality of your data over time. You can define the business rules to which your data should adhere.

To monitor data using Warehouse Builder you must create data auditors. Data auditors ensure that your data complies with the business rules you defined.

About Data Auditors

Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule by auditing and marking how many errors are occurring against the audited data. To monitor and audit a data object, you must first define data rules that specify valid data for that data object.

Data auditors are an important tool in ensuring that data quality levels are up to the standards set by the users of the system. They also help to determine spikes in bad data, allowing events to be tied to these spikes.

Data auditors can be deployed and executed ad-hoc, but they are typically run to monitor the quality of the data in an operational environment like a data warehouse or ERP system and, therefore, can be added to a process flow and scheduled.

Data Auditor Thresholds 

Data auditors have thresholds that allow you to create logic based on the number of non-compliant records: when too many records fail, the process flow can be diverted into an error or notification stream. You can specify a threshold value for each data rule that the data auditor audits. This value is used to determine whether the data in the data object is within the limits that you defined. Based on this threshold, the process can choose actions.

For example, you create a data auditor to audit the data in the Employees table. This table contains two data rules, emp_email_unique_rule and emp_sal_min_rule. You specify a threshold value of 80% for both rules. This means that if less than 80% of the data in the Employees table complies with the data rules, the auditing for the table fails.
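
The sketch below illustrates the threshold logic in this example; the function and values are hypothetical and are meant only to show the concept, not how Warehouse Builder implements data auditors.

    def audit_passes(compliant_rows, total_rows, threshold_pct=80):
        """Return True when the share of compliant rows meets the audit threshold."""
        compliance = 100.0 * compliant_rows / total_rows
        return compliance >= threshold_pct

    # Hypothetical audit of the Employees table against one data rule.
    print(audit_passes(compliant_rows=780, total_rows=1000))  # False: only 78% compliant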

Tip:

"Specifying Data Correction Actions" for information about specifying threshold value

Audit Results for Data Auditors 

In addition to setting thresholds for non-compliant records, you can also capture audit results and store them for analysis purposes. When executed, the data auditor sets several output values. One of these values is the audit result. Audit results provide information about the extent of data rule violations that occurred while running the data auditor.

Performing Data Profiling

Data profiling is, by definition, a resource-intensive process. It analyzes data and columns and performs many iterations to detect defects and anomalies in your data, so it warrants forethought and planning in order to be as effective as possible.

Before you begin profiling data, first reduce the data set by doing a random sampling. Next identify the data objects that you want to target. Instead of profiling everything, choose objects that are deemed crucial. You should not select an entire source system for profiling at the same time. Not only is it a waste of resources, but it is also often unnecessary. Select areas of your data where quality is essential and has the largest fiscal impact.

For example, you have a data source that contains five tables: CUSTOMERS, REGIONS, ORDERS, PRODUCTS, and PROMOTIONS. You decide that the two most important tables with respect to data quality are CUSTOMERS and ORDERS. The CUSTOMERS table is known to contain many duplicate and erroneous entries that cost your company money on wasted marketing efforts. The ORDERS table is known to contain data about orders in an incorrect format. In this case, you would select only these two tables for data profiling.

Data Profiling Restrictions

  • You can only profile data in the default configuration.

  • The profiling workspace location should be an Oracle 10g database or higher.

  • You cannot profile a source table that contains complex data types if the source module and the data profile are located on different database instances.

  • You cannot profile a table that contains more than 165 columns.

Steps to Perform Data Profiling

After you have chosen the objects you want to profile, use the following steps to guide you through the profiling process:

  1. Import or Select the Metadata

  2. Create a Data Profile

  3. Profile the Data

  4. View Profile Results

  5. Derive Data Rules

  6. Generate Corrections

  7. Define and Edit Data Rules Manually

  8. Generate, Deploy, and Execute

The data profiling process ends at step 4. Steps 5 to 7 are optional and can be performed if you want to perform data correction after data profiling. Step 8 is required when you perform data correction along with data profiling.

Import or Select the Metadata

Data profiling requires the profiled objects to be present in the project in which you are performing data profiling. Ensure that these objects are either imported into this project or created in it. Also ensure that the data is loaded into the objects. Having the data loaded is essential to data profiling.

Also, because data profiling uses mappings to run the profiling, you must ensure that all locations that you are using are registered. Data profiling attempts to register your locations. If, for some reason, data profiling cannot register your locations, you must explicitly register the locations before you begin profiling.

Create a Data Profile

After your system is set up, you can create a data profile using the Design Center. A data profile is a metadata object in the workspace. It includes the set of data objects you want profiled, the settings controlling the profiling operations, the results returned after you profile the data, and correction information (if you decide to use these corrections).

To create a data profile:

  1. From the Project Explorer, expand the project node in which you want to create a data profile.

  2. Right-click Data Profiles and select New.

    The Welcome page of the Create Data Profile Wizard is displayed.

  3. On the Name and Description page, enter a name and an optional description for the data profile. Click Next.

  4. On the Select Objects page, select the objects you want to include in the data profile and use the arrows to move them to the Selected list. Click Next.

    To select multiple objects, hold down the Ctrl key while selecting objects. You can include tables, views, materialized views, external tables, dimensions, and cubes in your data profile.

  5. (Optional) If you selected tables, views, or materialized views that contain attribute sets, the Choose Attribute Set dialog box is displayed. The list at the bottom of this dialog box displays the attribute sets defined on the data object.

    • To profile only the attributes defined in the attribute set, select the attribute set from the list.

    • To profile all columns in the data object, select <all columns> from the list.

  6. (Optional) If you selected dimensional objects on the Select Objects page, a warning is displayed informing you that the relational objects bound to these dimensional objects will also be added to the profile. Click Yes to proceed.

  7. On the Summary page, review the choices you made on the previous wizard pages. Click Back to change any selected values. Click Finish to create the data profile.

    The Warehouse Builder note dialog is displayed. Click OK to display the Data Profile Editor for the newly created data profile.

    The new data profile is added to the Data Profiles node in the navigation tree.

Using Attribute Sets to Profile a Subset of Columns from a Data Object

You can use an attribute set to restrict a data profiling operation to a subset of columns from a table, view or materialized view. Reasons to use an attribute set include:

  • You can decrease profiling time by excluding columns for which you do not need profiling results.

  • Data profiling can only profile up to 165 columns from a table, view or materialized view at a time. You can use an attribute set to select a set of 165 or fewer columns to profile from the object.

Data profiling, using attribute sets, consists of the following high-level steps:

  1. Defining Attribute Sets

  2. Profiling the Columns in an Attribute Set

Defining Attribute Sets

Use the following steps to define an attribute set in a table, view, or materialized view.

  1. In the Project Explorer, double-click the table, view, or materialized view.

    The Data Object Editor for the selected object is opened.

  2. In the Details panel, select the Attribute Sets tab.

  3. In the Attribute Sets section, click a blank area in the Name column and enter the name of the attribute set to create.

  4. Close the Data Object Editor.

  5. Double-click the data object in which you created an attribute set in step 3.

    The Data Object Editor for the selected object is displayed.

  6. On the Attributes Sets tab, select the name of the attribute set created in step 3.

    The Attributes of the selected attribute set section displays the attributes in the data object.

  7. Select Include for all the attributes that you want included in the attribute set.

  8. Save your changes and close the Data Object Editor.

Profiling the Columns in an Attribute Set

Use the following steps to profile columns contained in the attribute set.

  1. In the Project Explorer, right-click the Data Profiles node and select New.

    The Welcome page of the Create Data Profile Wizard is displayed.

  2. On the Welcome page, click Next.

  3. On the Name and Description page, enter a name and an optional description for the data profile. Click Next.

  4. On the Select Objects page, select the data object that you want to profile and use the shuttle arrows to move it to the Selected list.

  5. When the selected data object contains attribute sets, the Choose Attribute Set dialog box is displayed.

  6. Select the attribute set that you want to profile and click OK.

    The Select Objects page is displayed.

  7. On the Select Objects page, click Next.

  8. On the Summary page, review the options you chose on the previous wizard pages and click Finish.

    The data profile is created and added to the Navigator tree.

Profile the Data

Data profiling is achieved by performing deep scans of the selected objects. This can be a time-consuming process, depending on the number of objects and the type of profiling you are running. However, profiling runs as an asynchronous job, and the client can be closed during this process. You can see the job running in the job monitor, and Warehouse Builder prompts you when the job is complete.

Steps to Profile Data

After you have created a data profile, you can open it in the Data Profile Editor to profile the data or review profile results from a previous run. The objects you selected when creating the profile are displayed in the object tree of the Data Profile Editor. You can add objects to the profile by selecting Profile and then Add.

To profile the data:

  1. Expand the Data Profiles node in the Project Explorer, right-click a data profile, and select Open Editor.

    The Data Profile Editor opens the selected data profile.

  2. To specify the types of analysis to be performed on the data, configure the data profile.

    See "Configuring Data Profiles".

  3. From the Profile menu, select Profile.

    • (Optional) If this is the first time you are profiling data, the Data Profile Setup dialog box is displayed. Enter the details of the profiling workspace. For more information about the information to be entered, click Help.

    Warehouse Builder begins preparing metadata for profiling. The progress window containing the name of the object being created to profile the data is displayed. After the metadata preparation is complete, the Profiling Initiated dialog box is displayed informing you that the profiling job has started.

  4. On the Profiling Initiated dialog box, click OK.

    Once the profiling job starts, the data profiling is asynchronous and you can continue working or even close the client. Your profiling process will continue to run until it is completed.

  5. View the status of the profiling job in the Monitor Panel of the Data Profile Editor.

    You can continue to monitor the progress of your profiling job in the Monitor panel. After the profiling job is complete, the status displays as complete.

  6. After the profiling is complete, the Retrieve Profile Results dialog box is displayed and you are prompted to refresh the results. Click OK to retrieve the data profiling results and display them in the Data Profile Editor.

    You can use this option if you have previously profiled data in the same data profile. It allows you to control when the new profiling results become visible in the Data Profile Editor.

Note:

Data profiling results are overwritten on subsequent profiling executions.

Configuring Data Profiles

You can, and should, configure a data profile before running it if there are specific types of analysis you do, or do not, want to run.

You can configure a data profile at one of the following levels:

  • The entire profile (all the objects it contains)

  • An individual object in the data profile

    For example, the data profile contains three tables. If you want to perform certain specific types of analysis on one table, configure this table only.

  • An attribute within an object

    For example, if you know that you only have one problematic column in a table and you already know that most of the records should conform to values within a certain domain, then you can focus your profiling resources on domain discovery and analysis. By narrowing down the type of profiling necessary, you use less resources and obtain the results faster.

Steps to Configure Data Profiles
  1. In the Project Explorer, right-click the data profile and select Open Editor.

    The Data Profile Editor for the data profile is displayed.

  2. On the Property Inspector panel, set configuration properties at the required level.

    To configure the entire data profile:

    1. On the Profile Objects tab, select the data profile.

    2. On the Property Inspector panel, set the configuration properties for the data profile.

    To configure a particular object in the data profile:

    1. On the Profile Objects tab, expand the node that represents the data profile.

    2. Select the data object.

    3. On the Property Inspector panel, set the configuration properties for the object.

    To configure an attribute within an object in a data profile:

    1. On the Profile Objects tab, expand the node that represents the data profile.

    2. Expand the data object that contains the attribute and select the required attribute.

    3. On the Property Inspector panel, set the configuration properties for the attribute.

    Note:

    "Reference for Setting Data Profiling Configuration Parameters" for details about configuration parameters you can set for data profiles

    .

View Profile Results

After the profile operation is complete, you can open the data profile in the Data Profile Editor to view and analyze the results. The profiling results contain a variety of analytical and statistical information about the data profiled. You can immediately drill down into anomalies and view the data that caused them. You can then determine what data must be corrected.

To view the profile results:

  1. Select the data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor opens and displays the data profile.

  2. If you have previous data profiling results displayed in the Data Profile Editor, refresh the view when prompted so that the latest results are shown.

    The results of the profiling are displayed in the Profile Results Canvas.

  3. Minimize the Data Rule and Monitor panels by clicking on the arrow symbol in the upper left corner of the panel.

    This maximizes your screen space.

  4. Select objects in the Profile Objects tab of the object tree to focus the results on a specific object.

    The profiling results for the selected object are displayed using the following tabs of the Profile Results Canvas.

You can switch between various objects in the data profile. The tab that you had selected for the previous object remains selected.

Data Profile

The Data Profile tab contains general information about the data profile. Use this tab to store any notes or information about the data profile.

Profile Object

The Profile Object tab contains two sub-tabs: Object Data and Object Notes. The Object Data tab lists the data records in the object you have selected in the Profile Objects tab of the object tree. The number of rows that were used in the sample is listed. You can use the buttons along the top of the tab to execute a query, get more data, or add a WHERE clause.

Aggregation

The Aggregation tab displays all the essential measures for each column, such as minimum and maximum values, number of distinct values, and null values. Some measures are available only for specific data types. These include the average, median, and standard deviation measures. Information can be viewed from either the Tabular or Graphical sub-tabs.

Table 23-10 describes the various measurement results available in the Aggregation tab.

Table 23-10 Aggregation Results

Minimum: The minimum value with respect to the inherent database ordering of a specific type

Maximum: The maximum value with respect to the inherent database ordering of a specific type

# Distinct: The total number of distinct values for a specific attribute

% Distinct: The percentage of distinct values over the entire row set number

Not Null: Yes or No

Recommended NOT NULL: From analyzing the column values, data profiling determines that this column should not allow null values

# Nulls: The total number of null values for a specific attribute

% Nulls: The percentage of null values over the entire row set number

Six Sigma: For each column, number of null values (defects) to the total number of rows in the table (opportunities)

Average: The average value for a specific attribute for the entire row set

Median: The median value for a specific attribute for the entire row set

Std Dev: The standard deviation for a specific attribute


A hyperlinked value in the aggregation results grid indicates that you can click the value to drill down into the data. This allows you to analyze the data in the sample that produced this result.

For example, if you scroll to the SALARY column, shown in Figure 23-4, and click the value in the Maximum cell showing 24000, the Data Drill Panel on the bottom changes to show you all the distinct values in this column with counts on the left. On the right, the Data Drill can zoom into the value you select from the distinct values and display the full record where these values are found.

Figure 23-4 Aggregation Tabular Results


The graphical analysis displays the results in a graphical format. You can use the graphical toolbar to change the display. You can also use the Column and Property menus to change the displayed data object.

Data Type

The Data Type tab provides profiling results about data types. This includes metrics such as length for character data types and the precision and scale for numeric data types. For each data type that is discovered, the data type is compared to the dominant data type found in the entire attribute and the percentage of rows that comply with the dominant measure is listed.

One example of data type profiling would be finding a column that is defined as VARCHAR but actually stores only numeric values. Changing the data type of the column to NUMBER would make storage and processing more efficient.

Table 23-11 describes the various measurement results available in the Data Type tab.

Table 23-11 Data Type Results

Columns: Name of the column

Documented Data Type: Data type of the column in the source object

Dominant Data Type: From analyzing the column values, data profiling determines that this is the dominant (most frequent) data type

% Dominant Data Type: Percentage of total number of rows where column value has the dominant data type

Documented Length: Length of the data type in the source object

Minimum Length: Minimum length of the data stored in the column

Maximum Length: Maximum length of the data stored in the column

Dominant Length: From analyzing the column values, data profiling determines that this is the dominant (most frequent) length

% Dominant Length: Percentage of total number of rows where column value has the dominant length

Documented Precision: Precision of the data type in the source object

Minimum Precision: Minimum precision for the column in the source object

Maximum Precision: Maximum precision for the column in the source object

Dominant Precision: From analyzing the column values, data profiling determines that this is the dominant (most frequent) precision

% Dominant Precision: Percentage of total number of rows where column value has the dominant precision

Documented Scale: Scale specified for the data type in the source object

Minimum Scale: Minimum scale of the data type in the source object

Maximum Scale: Maximum scale of the data type in the source object

Dominant Scale: From analyzing the column values, data profiling determines that this is the dominant (most frequent) scale

% Dominant Scale: Percentage of total number of rows where column value has the dominant scale


Domain

The Domain tab displays results about the possible set of values that exist in a certain attribute. Information can be viewed from either the Tabular or Graphical subtabs.

Figure 23-5 displays the Domain tab of the Data Profile Editor.

Figure 23-5 Domain Discovery Results


The process of discovering a domain on a column involves two phases. First, the distinct values in the column are used to determine whether the column might be defined by a domain. Typically, there are few distinct values in a domain. Then, if a potential domain is identified, the counts of the distinct values are used to determine whether each distinct value is compliant with the domain. The properties that control the threshold for both phases of domain discovery can be set in the Property Inspector.

If you find a result that you want to know more about, drill down and use the Data Drill panel to view details about the cause of the result.

For example, a domain of four values was found for the column REGION_ID: 3,2,4, and 1. If you want to see which records contributed to this finding, select the REGION_ID row and view the details in the Data Drill panel.

Table 23-12 describes the various measurement results available in the Domain tab.

Table 23-12 Domain Results

Found Domain: The discovered domain values

% Compliant: The percentage of all column values that are compliant with the discovered domain values

Six Sigma: The Six Sigma value for the domain results


Pattern

The Pattern tab displays information discovered about patterns within the attribute. Pattern discovery is the profiler's attempt at generating regular expressions for data it discovered for a specific attribute. Note that non-English characters are not supported in the pattern discovery process.

Table 23-13 describes the various measurement results available in the Pattern tab.

Table 23-13 Pattern Results

Dominant Character Pattern: The most frequently discovered character pattern, or consensus pattern.

% Compliant: The percentage of rows whose data pattern agrees with the dominant character pattern.

Dominant Word Pattern: The most frequently discovered word pattern, or consensus pattern.

% Compliant: The percentage of rows whose data pattern agrees with the dominant word pattern.

Common Format: Name, Address, Date, Boolean, Social Security Number, E-mail, URL. This is the profiler's attempt to add semantic understanding to the data that it sees. Based on patterns and some other techniques, it tries to determine which domain bucket a certain attribute's data belongs to.

% Compliant: The percentage of rows whose data pattern agrees with the consensus common format pattern.


Unique Key

The Unique Key tab provides information about the existing unique keys that were documented in the data dictionary, and possible unique keys or key combinations that were detected by the data profiling operation. The uniqueness % is shown for each. The unique keys that have No in the Documented? column are the ones that are discovered by data profiling.

For example, a phone number is unique in 98% of the records. It can be a unique key candidate, and you can then cleanse the noncompliant records. You can also use the drill-down feature to view the cause of the duplicate phone numbers in the Data Drill panel. Table 23-14 describes the various measurement results available in the Unique Key tab.

Table 23-14 Unique Key Results

Unique Key: The discovered unique key.

Documented?: Yes or No. Yes indicates that a unique key on the column exists in the data dictionary. No indicates that the unique key was discovered as a result of data profiling.

Discovered?: From analyzing the column values, data profiling determines whether a unique key should be created on the columns in the Local Attribute(s) column.

Local Attribute(s): The name of the column in the table that was profiled.

# Unique: The number of rows, in the source object, in which the attribute represented by Local Attribute is unique.

% Unique: The percentage of rows, in the source object, for which the attribute represented by Local Attribute is unique.

Six Sigma: For each column, number of null values (defects) to the total number of rows in the table (opportunities).


Functional Dependency

The Functional Dependency tab displays information about the attribute or attributes that seem to depend on or determine other attributes. Information can be viewed from either the Tabular or Graphical sub-tabs. You can use the Show list to change the focus of the report. Note that unique keys defined in the database are not discovered as functional dependencies during data profiling.

Table 23-15 describes the various measurement results available in the Functional Dependency tab.

Table 23-15 Functional Dependency Results

Determinant: The name of the attribute that is found to determine the attribute listed in the Dependent column

Dependent: The name of the attribute found to be determined by the value of another attribute

# Defects: The number of values in the Determinant attribute that were not determined by the Dependent attribute

% Compliant: The percentage of values that are compliant with the discovered dependency

Six Sigma: The Six Sigma value

Type: The suggested action for the discovered dependency


For example, if you select Only 100% dependencies from the Show list, the information shown is limited to absolute dependencies. An attribute that is always dependent on another attribute is a candidate for a reference table. Suggestions are shown in the Type column. Moving the attribute into a separate reference table normalizes the schema.

The Functional Dependency tab also has a Graphical sub-tab so that you can view information graphically. You can select a dependency and property from the lists to view graphical data.

For example, in Figure 23-6, you select the dependency where DEPARTMENT_ID seems to determine COMMISSION_PCT (DEPARTMENT_ID->COMMISSION_PCT). In a majority of cases, COMMISSION_PCT is null. Warehouse Builder therefore determines that most DEPARTMENT_ID values determine COMMISSION_PCT to be null. By switching the Property to Non-Compliant, you can view the exceptions to this discovery. Figure 23-6 shows that for the DEPARTMENT_ID value of 80, the COMMISSION_PCT values are not null. This makes sense after you discover that the department with DEPARTMENT_ID 80 is the Sales department.

Figure 23-6 Graphical Functional Dependency

Description of Figure 23-6 follows
Description of "Figure 23-6 Graphical Functional Dependency"

Referential

The Referential tab displays information about foreign keys that were documented in the data dictionary, as well as relationships discovered during profiling. For each relationship, you can see the level of compliance. Information can be viewed from both the Tabular and Graphical subtabs. In addition, two other sub-tabs are available only in the Referential tab: Joins and Redundant Columns.

Table 23-16 describes the various measurement results available in the Referential tab.

Table 23-16 Referential Results

Relationship: The name of the relationship.

Type: The type of relationship.

Documented?: Yes or No. Yes indicates that a foreign key on the column exists in the data dictionary. No indicates that the foreign key was discovered as a result of data profiling.

Discovered?: Indicates whether, based on an analysis of the column values, data profiling determines that a foreign key should be created on the column represented by Local Attribute(s).

Local Attribute(s): The name of the attribute in the source object.

Remote Key: The name of the key in the referenced object to which the local attribute refers.

Remote Attribute(s): The names of the attributes in the referenced object.

Remote Relation: The name of the object that the source object references.

Remote Module: The name of the module that contains the referenced object.

Cardinality Range: The range of the cardinality between two attributes.

For example, suppose the EMP table contains five rows of employee data: two employees each in departments 10 and 20, and one employee in department 30. The DEPT table contains three rows of department data, with deptno values 10, 20, and 30.

Data profiling finds a row relationship between the EMP and DEPT tables with a cardinality range of 1-2:1-1. In EMP, the number of rows per distinct deptno value ranges from 1 (deptno 30) to 2 (deptno 10 and 20). In DEPT, there is only one row for each distinct value (10, 20, and 30).

# Orphans: The number of orphan rows in the source object.

% Compliant: The percentage of values that are compliant with the discovered relationship.

Six Sigma: For each column, the Six Sigma value calculated from the number of null values (defects) relative to the total number of rows in the table (opportunities).


For example, suppose you are analyzing two tables, Employees and Departments, for referential relationships. Using the Referential data profiling results shown in Figure 23-7, you discover that the DEPARTMENT_ID column in the Employees table is related to the DEPARTMENT_ID column in the Departments table 98% of the time. You can then click the hyperlinked Yes in the Discovered? column to view the rows that did not comply with the discovered foreign key relationship.

Figure 23-7 Referential Results

Description of Figure 23-7 follows
Description of "Figure 23-7 Referential Results"

You can also select the Graphical subtab to view information graphically. This view is effective for viewing noncompliant records, such as orphans. To use the Graphical sub-tab, make a selection from the Reference and Property lists.

The Joins sub-tab displays a join analysis on the reference selected in the Reference list. The results show the relative size and exact counts of the three possible outcomes for referential relationships: joins, orphans, and childless objects.

For example, both the EMPLOYEES and DEPARTMENTS tables contain a DEPARTMENT_ID column, and there is a one-to-many relationship between the DEPARTMENT_ID column in the DEPARTMENTS table and the DEPARTMENT_ID column in the EMPLOYEES table. Joins represent values that are present in both tables. Orphans represent values that are present only in the EMPLOYEES table and not in the DEPARTMENTS table. Childless values are present in the DEPARTMENTS table but not in the EMPLOYEES table. You can drill into values on the diagram to view more details in the Data Drill panel.
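The three outcomes can also be approximated outside of Warehouse Builder. The following sketch assumes HR-style EMPLOYEES and DEPARTMENTS tables and counts joined, orphan, and childless DEPARTMENT_ID values; the queries that the profiler actually generates may differ.

    -- Joined, orphan, and childless DEPARTMENT_ID values
    SELECT SUM(CASE WHEN e.DEPARTMENT_ID IS NOT NULL
                     AND d.DEPARTMENT_ID IS NOT NULL THEN 1 ELSE 0 END) AS joined,
           SUM(CASE WHEN d.DEPARTMENT_ID IS NULL THEN 1 ELSE 0 END) AS orphans,
           SUM(CASE WHEN e.DEPARTMENT_ID IS NULL THEN 1 ELSE 0 END) AS childless
    FROM (SELECT DISTINCT DEPARTMENT_ID FROM EMPLOYEES
          WHERE DEPARTMENT_ID IS NOT NULL) e
    FULL OUTER JOIN
         (SELECT DISTINCT DEPARTMENT_ID FROM DEPARTMENTS
          WHERE DEPARTMENT_ID IS NOT NULL) d
      ON e.DEPARTMENT_ID = d.DEPARTMENT_ID;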

Figure 23-8 displays the Joins subtab of the Referential tab.

The Redundant Columns sub-tab displays information about columns in the child table that are also contained in the primary table. Redundant column results are only available when perfectly unique columns are found during profiling.

For example, consider two tables EMP and DEPT, shown in Table 23-17 and Table 23-18, having the following foreign key relationship: EMP.DEPTNO (uk) = DEPT.DEPTNO (fk).

Table 23-17 EMP Table

Employee Number    Dept. No    Location
100                1           CA
200                2           NY
300                3           MN


Table 23-18 DEPT Table

Dept No    Location    Zip
1          CA          94404
3          MN          21122
3          MN          21122
1          CA          94404
In this example, the Location column in the EMP table is a redundant column because you can get the same information from the join.
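As a hedged illustration (the column names EMPLOYEE_NUMBER, DEPTNO, and LOCATION are assumed), the following query shows that the location can be obtained through the join, which is what makes the column in EMP redundant:

    -- LOCATION is recoverable from DEPT, so storing it in EMP is redundant
    SELECT DISTINCT e.EMPLOYEE_NUMBER, e.DEPTNO, d.LOCATION
    FROM EMP e
    JOIN DEPT d ON e.DEPTNO = d.DEPTNO;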

Data Rule

The Data Rule tab displays the data rules that are defined as a result of data profiling for the table selected in the object tree. The details for each data rule include the following:

  • Rule Name: Represents the name of the data rule.

  • Rule Type: Provides a brief description about the type of data rule.

  • Origin: Represents the origin of the data rule. For example, a value of Derived indicates that the data rule was derived from data profiling results.

  • % Compliant: Percent of rows that comply with the data rule.

  • # Defects: Number of rows that do not comply with the data rule.

The data rules on this tab reflect the active data rules in the Data Rule panel. You do not directly create data rules on this tab.

Derive Data Rules

Based on the results of data profiling, you can derive data rules that can be used to clean up your data. A data rule is an expression that determines the set of legal data that can be stored within a data object. Use data rules to ensure that only values compliant with the data rules are allowed within a data object. Data rules will form the basis for correcting or removing data if you decide to cleanse the data. You can also use data rules to report on non-compliant data.

Although you can create data rules and apply them manually to your data profile, derived data rules allow you to move quickly and seamlessly between data profiling and data correction.

For example, suppose you have a table called Employees with the columns Employee_Number, Gender, and Employee_Name. The profiling result shows that 90% of the values in the Employee_Number column are unique, making it a prime candidate for the unique key. The results also show that 85% of the values in the Gender column are either 'M' or 'F', making it a good candidate for a domain. You can then derive these rules directly from the Profile Results Canvas.

Steps to Derive Data Rules

  1. Select a data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor is displayed with the profiling results.

  2. Review the profiling results and determine which findings you want derived into data rules.

    The types of results that warrant data rules vary. Some results commonly derived into data rules include a detected domain, a functional dependency between two attributes, or a unique key.

  3. Select the tab that displays the results from which you want to derive a data rule.

    For example, to create a data rule that enforces a unique key rule for the EMPLOYEE_NUMBER column, navigate to the Unique Key tab.

  4. Select the cell that contains the results you want derived into a data rule.

  5. From the Profile menu select Derive Data Rule. Or click the Derive Data Rule button.

    For example, to create a Unique Key rule on the EMPLOYEE_NUMBER column, select this column and click Derive Data Rule.

    The Derive Data Rule Wizard is displayed.

  6. On the Welcome page, click Next.

  7. On the Name and Description page, the Name field displays a default name for the data rule. To specify a new name, select the name, enter the new name, and click Next.

  8. On the Define Rule page, provide details about the data rule parameters and click Next.

    The Type field that represents the type of data rule is populated based on the tab from which you derived the data rule. You cannot edit the type of data rule.

    Additional fields in the lower portion of this page define the parameters for the data rule. Some of these fields are populated with values based on the result of data profiling. The number and type of fields depends on the type of data rule.

  9. On the Summary page, review the options you set in the wizard. Click Back to change any of the selected values, or click Finish to create the data rule.

    The data rule is created and it appears in the Data Rule panel of the Data Profile Editor. The derived data rule is also added to the Derived_Data_Rules node under the Data Rules node in the Project Explorer. You can reuse this data rule by attaching it to other data objects.

Generate Corrections

After you have derived data rules from the profiling results, you can automate the process of correcting source data based on those results. You can create both schema and mapping corrections. The schema correction creates scripts that you can use to create a corrected set of source data objects with the derived data rules applied. The mapping correction creates new correction mappings that take data from the source objects and load it into the new, corrected objects.

Steps to Automate Data Correction

  1. Generate correction objects based on data profiling results.

    See "Steps to Create Corrections".

    You can view the corrected objects as described in "Viewing the Correction Tables and Mappings".

  2. Deploy the generated corrections to create the corrected objects in the target schema.

    See "Steps to Deploy Correction Objects".

Steps to Create Corrections

Use the Data Profile Editor to create corrections based on the profiling results.

To create corrections:

  1. If the Data Profile is not already open, open it by right-clicking the data profile in the Project Explorer and selecting Open Editor.

  2. From the Profile menu, select Create Correction.

    The Create Correction Wizard is displayed.

  3. On the Select Target Module page, specify the target module that will contain the corrections and click Next. You can create a new module or use an existing module.

    • To store the corrections in an existing target module, choose Select an existing module and select the module from the Available list.

    • To store the corrections in a new target module, select Create a new target module. The Create Module Wizard guides you through the steps of creating a new target module.

    You can remove correction objects created as a result of previous corrections by selecting Remove previous correction objects.

  4. On the Select Objects page, select the objects that you want to correct by moving them to the Selected list and click Next.

    The Filter list enables you to filter the objects that are available for selection. The default selection is All Objects. You can display only particular types of data objects such as tables or views.

  5. On the Select Data Rules and Data Types page, perform schema correction by selecting the corrections that you want to implement for each object. Click Next.

    See "Selecting Data Rules and Data Types for Corrected Objects".

  6. (Optional) On the Data Rules Validation page, review the validation errors, if any, and correct them before proceeding.

    If correction objects from a previous data correction action exist for the objects selected for correction, this page displays a message. Click Next to remove previously created correction objects.

  7. On the Verify and Accept Corrected Tables page, select the objects that you want to correct and click Next.

    Se e "Selecting the Objects to be Corrected".

  8. On the Choose Data Correction Actions page, specify the correction actions to be performed for objects and click Next.

    See "Choosing Data Correction and Cleansing Actions".

  9. On the Summary page, click Finish to create the correction objects.

The correction schema is created and added to the Project Explorer. The correction objects and mappings are displayed under the module that you specify as the target module on the Select Target Module page of the Create Correction Wizard. Depending on the data correction actions specified, correction objects may include transformations used to cleanse and correct data.

Selecting Data Rules and Data Types for Corrected Objects

Use this page to select the data rules that should be applied to the selected objects. The objects selected for correction are on the left side of the page and are organized into a tree by modules. The right panel contains two tabs: Data Rules and Data Types.

Data Rules The Data Rules tab displays the available data rules for the object selected in the object tree. Specify the data rules that you want to apply to that corrected object by selecting the check box to the left of the data rule. Warehouse Builder uses these data rules to create constraints on the tables during the schema generation.

The Bindings section contains details about the table column to which the rule is bound. Click a rule name to display the bindings for the rule.

The method used to apply the data rule to the correction table depends on the type of data rule you are implementing.

Data Types The Data Types tab displays the columns that are selected for correction. The change could be a modification of the data type or precision, or a change from a fixed-length to a variable-length data type. The Documented Data Type column displays the existing column definition and the New Data Type column displays the proposed correction to the column definition.

To correct a column definition, select the check box to the left of the column name.

Selecting the Objects to be Corrected

Use the Verify and Accept Corrected Tables page to confirm the objects that you want to correct. This page lists the tables that have been modified with the applied rules and the data type changes.

Perform the following actions on this page:

  1. To create a data object in the corrected schema, select Create to the left of the object.

    The lower part of the page displays the details of the selected object using tabs. The Columns tab displays the details of columns that will be created in the corrected data object. The Constraints tab displays details of constraints that will be created on the corrected data object. The Data Rules tab displays details of data rules that will be created on the corrected data object.

  2. On the Columns tab, perform the following actions:

    • To create a displayed column in the corrected data object, select Create to the left of the column name.

    • To remove a displayed column from the corrected object, deselect Create to the left of the column name.

    • To modify a displayed column in the corrected object, edit the Data Type, Length, Precision, Seconds Precision, and Scale attributes. However, you cannot modify a column name.

  3. On the Constraints tab, perform the following actions:

    • To add additional constraints, click Add Constraint.

    • To remove a constraint from the corrected data object, select the constraint and click Delete.

  4. On the Data Rules tab, perform the following actions:

    • To apply a data rule that was derived from data profiling results to the corrected object, select the check box to the left of the data rule.

      Ensure that the Bindings column contains the column to which the data rule should be applied.

    • To apply a new data rule to the corrected object, click Apply Rule. The Apply Data Rule wizard guides you through the process of applying a data rule.

Choosing Data Correction and Cleansing Actions

Use the Choose Data Correction Actions page to select the actions to perform to correct the source data. This page contains two sections: Select a Corrected Table and Choose Data Correction Actions. The Select a Corrected Table section lists the objects that you selected for corrections. Select a table in this section to display the affiliated data rules in the Choose Data Correction Actions section.

Choose Data Correction Actions For each data rule, select an action from the menu in the Action column. The settings you select here determine how to handle data values that are not accepted due to data rule enforcement. Select one of the following actions:

  • Ignore

  • Report

  • Cleanse

See Also:

"Data Correction Actions" for a description of each correction action.

Cleansing Strategy Use the Cleansing Strategy list to specify a cleansing strategy. This option is enabled only if you select Cleanse in the Action column. The cleansing strategy depends on the type of data rule and the rule configuration. Error tables are used to store the records that do not conform to the data rule.

Steps to Deploy Correction Objects

To deploy the correction mappings created as part of the data correction process:

  1. Grant the SELECT privilege on the source tables to PUBLIC.

    For example, if your correction mapping contains the EMPLOYEES table from the HR schema, you can successfully deploy the mapping only if the SELECT privilege on HR.EMPLOYEES is granted to PUBLIC, as shown in the example following these steps.

  2. To create the correction objects in your target schema, deploy the correction tables created as a result of data profiling.

  3. To cleanse data, deploy and execute the correction mappings.
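For example, for the HR.EMPLOYEES table mentioned in step 1, the grant could be issued in SQL*Plus as follows, connected as the HR schema owner or another suitably privileged user:

    GRANT SELECT ON HR.EMPLOYEES TO PUBLIC;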

Viewing the Correction Tables and Mappings

You can review the correction tables in the Data Object Editor to see the data rules and constraints created as part of the design of your table.

To view the correction mappings:

  1. Double-click the table or mapping to open the object in its respective editor.

  2. After the mapping is open, select View and then Auto Layout to view the entire mapping.

    Figure 23-9 displays a correction map generated by the Create Correction wizard.

    Figure 23-9 Generated Correction Mapping

    Description of Figure 23-9 follows
    Description of "Figure 23-9 Generated Correction Mapping"

  3. Select the submapping ATTR_VALUE_1 and click the Visit Child Graph icon from the toolbar to view the submapping.

    Figure 23-10 displays the submapping that is displayed.

    Figure 23-10 Correction Submapping

    Description of Figure 23-10 follows
    Description of "Figure 23-10 Correction Submapping"

    The submapping is the element in the mapping that performs the actual correction and cleansing you specified in the Create Correction Wizard. In the middle of this submapping is the DOMAINSIMILARITY transformation that was generated as a function by the Create Correction Wizard.

Define and Edit Data Rules Manually

Data rules can be derived or manually created. Before and after you have created the corrections, you can define additional data rules manually.

For more information about defining and editing data rules manually, see "Creating Data Rules".

Generate, Deploy, and Execute

Finally, you can generate, deploy, and execute the correction mappings and data rules. After you run the correction mappings with the data rules, your data is corrected. The derived data rules remain attached to the objects in the corrected schema for optional use in data monitors.

Editing Data Profiles

Once you create a data profile, you can use the Data Profile Editor to modify its definition. You can also add data objects to an existing data profile. To add objects, you can use either the menu bar options or the Select Objects tab of the Edit Data Profile dialog box.

To edit a data profile:

  1. In the Project Explorer, right-click the data profile and select Open Editor.

    The Data Profile Editor is displayed.

  2. From the Edit menu, select Properties.

    The Edit Data Profile dialog box is displayed.

  3. Edit any of the following properties of the data profile and click OK.

    • To modify the name or description of the data profile, on the Name tab, select the name or description and enter the new value.

    • To add or remove objects, on the Select Objects tab, use the arrows to add objects to or remove objects from the data profile.

    • To change the location that is used as the data profiling staging area, use the Data Locations tab.

      Use the arrows to move the new location to the Selected Locations section. Ensure that you select New Configuration Default to set this location as the default profiling location for the data profile.

      Note:

      If you modify the profiling location after you have performed data profiling, the previous profiling results are lost.

Adding Data Objects to a Data Profile

To add data objects to a data profile, complete the following steps:

  1. Right-click the data profile in the Project Explorer and select Open Editor.

    The Data Profile Editor is displayed.

  2. From the Profile menu, select Add Objects.

    The Add Profile Tables dialog box is displayed.

  3. On the Add Profile Tables dialog box, select the objects that you want to add to the data profile and use the arrows to move them to the Selected section.

    You can select multiple objects by holding down the Ctrl key and selecting objects.

Reference for Setting Data Profiling Configuration Parameters

The configuration parameters that you can set for data profiles are categorized as follows:

Load Configuration

This category contains the following parameters:

  • Enable Data Type Discovery: Set this parameter to true to enable data type discovery for the selected table.

  • Enable Common Format Discovery: Set this parameter to true to enable common format discovery for the selected table.

  • Copy Data into Workspace: Set this parameter to true to enable copying of data from the source to the profile workspace.

  • Random Sample Rate: This value represents the percent of total rows that will be randomly selected during loading.

  • Sample Set Filter: This represents the WHERE clause that will be applied on the source when loading data into the profile workspace. Click the Ellipsis button on this field to display the Expression Builder. Use the Expression Builder to define the WHERE clause.

  • Null Value Representation: The value that is treated as a null value during profiling. You must enclose the value in single quotation marks. The default value is null, which is treated as a database null.
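For example, the following hypothetical settings restrict profiling of an ORDERS-style source table to recent rows, sample 10% of those rows, and treat the literal string 'N/A' as null; the table and column names are illustrative only.

    Sample Set Filter:           ORDER_DATE >= DATE '2007-01-01'
    Random Sample Rate:          10
    Null Value Representation:   'N/A'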

Aggregation Configuration

This category consists of one parameter called Not Null Recommendation Percentage. If the percentage of null values in a column is less than this threshold percent, then that column will be discovered as a possible Not Null column.

Pattern Discovery Configuration

This category contains the following parameters:

  • Enable Pattern Discovery: Set this to true to enable pattern discovery.

  • Maximum Number of Patterns: The maximum number of patterns that the profiler reports for an attribute. For example, if you set this parameter to 10, the profiler reports the top 10 patterns for the attribute.

Domain Discovery Configuration

This category contains the following parameters:

  • Enable Domain Discovery: Set this to true to enable domain discovery.

  • Domain Discovery Max Distinct Values Count: The maximum number of distinct values in a column in order for that column to be discovered as possibly being defined by a domain. Domain discovery of a column occurs if the number of distinct values in that column is at or below the Max Distinct Values Count property, and the number of distinct values as a percentage of total rows is at or below the Max Distinct Values Percent property.

  • Domain Discovery Max Distinct Values Percent: The maximum number of distinct values in a column, expressed as a percentage of the total number of rows in the table, in order for that column to be discovered as possibly being defined by a domain. Domain Discovery of a column occurs if the number of distinct values in that column is at or below the Max Distinct Values Count property, and the number of distinct values as a percentage of total rows is at or below the Max Distinct Values Percent property.

  • Domain Value Compliance Min Rows Count: The minimum number of rows for the given distinct value in order for that distinct value to be considered as compliant with the domain. Domain Value Compliance for a value occurs if the number of rows with that value is at or above the Min Rows Count property, and the number of rows with that value as a percentage of total rows is at or above the Min Rows Percent property.

  • Domain Value Compliance Min Rows Percent: The minimum number of rows, expressed as a percentage of the total number of rows, for the given distinct value in order for that distinct value to be considered as compliant with the domain. Domain Value Compliance for a value occurs if the number of rows with that value is at or above the Min Rows Count property, and the number of rows with that value as a percentage of total rows is at or above the Min Rows Percent property.
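As a hedged illustration of how the two discovery thresholds interact, the following query computes the distinct value count and the distinct values as a percentage of total rows for a hypothetical STATUS column in an ORDERS table; the column is a domain candidate only if both results fall at or below the configured maximums.

    -- Inputs to the domain discovery thresholds for a candidate column
    SELECT COUNT(DISTINCT STATUS) AS distinct_count,
           ROUND(100 * COUNT(DISTINCT STATUS) / COUNT(*), 2) AS distinct_pct
    FROM ORDERS;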

Relationship Attribute Count Configuration

This category contains one parameter called Maximum Attribute Count. This is the maximum number of attributes for unique key, foreign key, and functional dependency profiling.

Unique Key Discovery Configuration

This category contains the following parameters:

  • Enable Unique Key Discovery: Set this parameter to true to enable unique key discovery.

  • Minimum Uniqueness Percentage: This is the minimum percentage of rows that need to satisfy a unique key relationship.

Functional Dependency Discovery Configuration

This category contains the following parameters:

  • Enable Functional Dependency Discovery: Set this parameter to true to enable functional dependency discovery.

  • Minimum Functional Dependency Percentage: This is the minimum percentage of rows that need to satisfy a functional dependency relationship.

Row Relationship Discovery Configuration

This category contains the following parameters:

  • Enable Relationship Discovery: Set this parameter to true to enable foreign key discovery.

  • Minimum Relationship Percentage: This is the minimum percentage of rows that need to satisfy a foreign key relationship.

Redundant Column Discovery Configuration

This category contains the following parameters:

  • Enable Redundant Columns Discovery: Set this parameter to true to enable redundant column discovery with respect to a foreign key-unique key pair.

  • Minimum Redundancy Percentage: This is the minimum percentage of rows that are redundant.

Data Rule Profiling Configuration

This category contains one parameter called Enable Data Rule Profiling for Table. Set this parameter to true to enable data rule profiling for the selected table. This setting is only applicable for a table, and not for an individual attribute.

Tuning the Data Profiling Process

Data profiling is a processor- and I/O-intensive process, and the execution time for profiling ranges from a few minutes to a few days. You can achieve the best possible data profiling performance by ensuring that the following conditions are satisfied:

  • Your database is set up correctly for data profiling.

  • The appropriate data profiling configuration parameters are used when you perform data profiling.

Tuning the Data Profile for Better Data Profiling Performance

You can configure a data profile to optimize data profiling results. Use the configuration parameters to configure a data profile.

Use the following guidelines to make your data profiling process faster:

  • Perform only the types of analysis that you require

    If you know that certain types of analysis are not required for the objects that you are profiling, use the configuration parameters to turn off these types of data profiling.

  • Analyze a smaller amount of data

    Use the Sample Set Filter (WHERE clause) and Random Sample Rate configuration parameters to limit the amount of data that is profiled.

If the source data for profiling is stored in an Oracle Database, it is recommended that the source schema be located on the same database instance as the profile workspace. You can do this by installing the workspace into the same Oracle instance as the source schema location. This avoids using a database link to move data from source to profiling workspace.

Tuning the Oracle Database for Better Data Profiling Performance

To ensure good data profiling performance, the computer that runs the Oracle Database must have certain hardware capabilities. In addition to this, you must optimize the Oracle Database instance on which you are performing data profiling.

For efficient data profiling, the following considerations are applicable:

  • Multiple Processors

  • Memory

  • I/O System

Multiple Processors

The computer that runs the Oracle Database needs multiple processors. Data profiling has been designed and tuned to take maximum advantage of the parallelism provided by the Oracle Database. When profiling large tables (more than 10 million rows), a multiprocessor computer is highly recommended.

Hints are used in the queries required to perform data profiling, and the degree of parallelism is picked up from the initialization parameter file of the Oracle Database. The default initialization parameter file contains parameters that take advantage of parallelism.

Memory

It is important to ensure a high memory hit ratio during data profiling. You can do this by assigning a larger System Global Area (SGA). It is recommended that the size of the SGA be at least 500 MB. If possible, configure it to 2 GB or 3 GB.

For advanced database users, it is recommended that you observe the buffer cache hit ratio and the library cache hit ratio. Aim for a buffer cache hit ratio higher than 95% and a library cache hit ratio higher than 99%.
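The following queries against the standard V$SYSSTAT and V$LIBRARYCACHE dynamic performance views are one common way to observe these ratios; they are offered as a hedged example rather than the only method.

    -- Buffer cache hit ratio (percent)
    SELECT ROUND(100 * (1 - phy.value / (db.value + con.value)), 2) AS buffer_cache_hit_pct
    FROM v$sysstat phy, v$sysstat db, v$sysstat con
    WHERE phy.name = 'physical reads'
      AND db.name  = 'db block gets'
      AND con.name = 'consistent gets';

    -- Library cache hit ratio (percent)
    SELECT ROUND(100 * SUM(pins - reloads) / SUM(pins), 2) AS library_cache_hit_pct
    FROM v$librarycache;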

I/O System

The capabilities of the I/O system have a direct impact on the data profiling performance. Data profiling processing frequently performs full table scans and massive joins. Since today's CPUs can easily out-drive the I/O system, you must carefully design and configure the I/O subsystem. Keep in mind the following considerations that aid better I/O performance:

  • You need a large number of disk spindles to support uninterrupted CPU and I/O cooperation. If you have only a few disks, the I/O system is not geared towards a high degree of parallel processing. It is recommended to have a minimum of two disks for each CPU.

  • Configure the disks. It is recommended that you create logical stripe volumes on the existing disks, each striping across all available disks. Use the following formula to calculate the stripe width:

    MAX(1,DB_FILE_MULTIBLOCK_READ_COUNT/number_of_disks) X DB_BLOCK_SIZE

    Here, DB_FILE_MULTIBLOCK_READ_COUNT and DB_BLOCK_SIZE are parameters that you set in your database initialization parameter file. You can also use a stripe width that is a multiple of the value returned by the formula. A worked example follows this list.

    To create and maintain logical volumes, you need volume management software such as Veritas Volume Manager or Sun Storage Manager. If you are using Oracle Database 10g or a later version and you do not have any volume management software, you can use the Automatic Storage Management feature of the Oracle Database to spread the workload across disks.

  • Create different stripe volumes for different tablespaces. It is possible that some of the tablespaces occupy the same set of disks.

    For data profiling, the USERS and the TEMP tablespaces are normally used at the same time. So you can consider placing these tablespaces on separate disks to reduce interference.
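As a worked example of the stripe width formula above, assume that DB_FILE_MULTIBLOCK_READ_COUNT is 16, DB_BLOCK_SIZE is 8 KB, and there are 8 disks; these values are illustrative only.

    MAX(1, 16/8) X 8 KB = 2 X 8 KB = 16 KB

In this case a stripe width of 16 KB, or a multiple of it such as 32 KB or 64 KB, would be appropriate.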

Using Data Rules

In addition to deriving data rules based on the results of data profiling, you can define your own data rules. You can bind a data rule to multiple tables within the project in which the data rule is defined. An object can contain any number of data rules.

Use the Design Center to create and edit data rules. Once you create a data rule, you can use it in any of the following scenarios.

Using Data Rules in Data Profiling

When you are using data profiling to analyze tables, you can use data rules to analyze how well data complies with a given rule and to collect statistics. From the results, you can derive a new data rule. For example, if data profiling determines that the majority of records have a value of red, white, or blue for a particular column, a new data rule can be derived that defines the color domain (red, white, and blue). This rule can then be reused to profile other tables, or reused in cleansing and auditing.
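As a hedged sketch (not the code that Warehouse Builder generates), the following query counts the rows of a hypothetical PRODUCTS table whose COLOR value falls outside the discovered domain:

    -- Rows that violate the discovered color domain; null values are not counted
    SELECT COUNT(*) AS defects
    FROM PRODUCTS
    WHERE COLOR NOT IN ('red', 'white', 'blue');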

Using Data Rules in Data Cleansing and Schema Correction

Data rules can be used in two ways to cleanse data and correct schemas. The first way is to convert a source schema into a new target schema where the structure of the new tables strictly adheres to the data rules. The new tables would then have the right data types, constraints would be enforced, and schemas would be normalized. The second way data rules are used is in a correction mapping that validates the data in a source table against the data rules, to determine which records comply and which do not. The analyzed data set is then corrected (for example, orphan records are removed, domain value inaccuracies are corrected, and so on) and the cleansed data set is loaded into the corrected target schema.

Using Data Rules in Data Auditing

Data rules are also used in data auditing. Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule, and they report defective data into auditing and error tables. In that sense they are like data-rule-based correction mappings, which also offer a report-only option for data that does not comply with the data rules.

Creating Data Rule Folders

Each data rule belongs to a data rule folder, which is a container object that groups related data rules. To create a data rule, you must first create a data rule folder.

To create a data rule folder, in the navigation tree, right-click Data Rules and select New. The Create Data Rule Folder dialog box is displayed.

Creating Data Rules

The Data Rules folder in the Project Explorer contains the data rules. Every data rule must belong to a data rule folder. The subfolder DERIVED_DATA_RULES contains the data rules derived as a result of data profiling. You can create additional data rule folders to contain any data rules that you create.

To create a data rule:

  1. Right-click the Data Rule folder in which the data rule should be created and select New.

    The Welcome page of the Create Data Rule Wizard is displayed.

  2. On the Welcome page, click Next.

  3. On the Name and Description page, specify a name and an optional description for the data rule. Click Next.

  4. On the Define Rule page, specify the type of data rule to create. Also specify any additional information required to create the data rule. Click Next.

    See "Defining the Data Rule" for information about defining the data rule.

    For example, when you create a Domain Range rule, you must specify the values that represent the valid domain values.

  5. On the Summary page, review the selections you made in the wizard. Click Back to modify any selected values. Click Finish to create the data rule.

    The data rule is added to the data rule folder under which you created the data rule.

Defining the Data Rule

Use the Define Rule page or the Define Rule tab to provide details about the data rule. The top section of the page displays the Type list that represents the type of data rule. When you are creating a data rule, expand the Type field to view the types of data rules and select the type you want to create. When you edit a data rule, the Type field is disabled as you cannot change the type of data rule once it is created. For more information about types of data rules, see "Types of Data Rules".

Note:

When you are deriving a data rule, the Type field is automatically populated and you cannot edit this value.

The bottom section of this page specifies additional details about the data rule. The number and names of the fields displayed in this section depend on the type of data rule you create.

For example, if you select Custom as the Type, use the Attributes section to define the attributes required for the rule. Use the Ellipsis button on the Expression field to define a custom expression involving the attributes you defined in the Attributes section.

If you select Domain Range as the type of data rule, the bottom section of the page provides fields to specify the data type of the range, the minimum value, and the maximum value. When you are deriving a data rule, some of these fields are populated based on the profiling results from which you are deriving the rule. You can edit these values.

Editing Data Rules

After you create a data rule, you can edit its definition. You can rename the data rule and edit its description. You cannot change the type of data rule. However, you can change the other parameters specified for the data rule. For example, for a Domain Range type of data rule, you can edit the data type of the range, the minimum range value, and the maximum range value.

To edit a data rule:

  1. In the Project Explorer, right-click the data rule and select Open Editor.

    The Edit Data Rule dialog box is displayed.

  2. On the Name tab, you can perform the following tasks:

    • To rename a data rule, select the name and enter the new name.

    • To edit the description for the data rule, select the description and enter the new description.

  3. On the Define tab, edit the properties of the data rule.

    Note:

    You cannot change the type of data rule. You can only modify the properties related to that type of data rule such as the domain bounds, domain list, number of attributes in a unique key, and so on.

Applying Data Rules to Objects

Applying a data rule to an object binds the definition of the data rule to the object. For example, binding a rule to the table Dept ensures that the rule is implemented for the specified attribute in the table. You apply a data rule using the Data Object Editor. You can also apply a derived data rule from the Data Rule panel of the Data Profile Editor.

The Apply Data Rule Wizard enables you to apply a data rule to a data object. You can apply precreated data rules or any data rule you created to data objects. The types of data objects to which you can apply data rules are tables, views, materialized views, and external tables.

To apply a data rule to a data object:

  1. In the Project Explorer, right-click the object to which you want to apply a data rule and select Open Editor.

    The Data Object Editor for the data object is displayed.

  2. Go to the Data Rules tab.

    Any data rules already bound to the data object are displayed on this tab.

  3. Click Apply Rule.

    The Apply Data Rule Wizard is displayed.

  4. On the Welcome page, click Next.

  5. On the Select Rule page, select the data rule that you want to apply to the data object and click Next.

    Data rules are grouped under the nodes BUILT_IN, DERIVED_DATA_RULES, and any other data rule folders that you create.

    The BUILT_IN node contains the default data rules defined in the workspace. These include rules such as foreign key, unique key, and not null rules.

    The DERIVED_DATA_RULES node lists all the data rules that were derived as a result of data profiling.

  6. On the Name and Description page, enter a name and an optional description for the applied data rule. Click Next.

  7. On the Bind Rule Parameters page, use the Binding list to select the column in the data object to which the data rule must be applied. Click Next.

  8. On the Summary page, review the selections you made on the previous wizard pages. Click Back to modify selected values. Click Finish to apply the data rule.

    The data rule is bound to the data object and is listed on the Data Rules tab.

Monitoring Data Quality Using Data Auditors

Data auditors are objects that you can use to continuously monitor your source schema to ensure that the data adheres to the defined data rules. You can monitor an object only if you have defined data rules for the object. You can create data auditors for tables, views, materialized views, and external tables.

To monitor data quality, perform the following steps:

  1. Create a data auditor containing the data objects that you want to monitor.

    See "Creating Data Auditors"

  2. Run the data auditor to identify the records that do not comply with the data rules defined on the data objects. You can either run data auditors manually or schedule them to run at specified times.

    See "Auditing Data Objects Using Data Auditors" for information about running data auditors.

Note:

You cannot import metadata for data auditors in Merge mode. For more information about import mode options, refer to the Oracle Warehouse Builder Installation and Administration Guide.

Creating Data Auditors

Use the Create Data Auditor Wizard to create data auditors. Data auditors are part of an Oracle module in a project.

To create a data auditor:

  1. Expand the Oracle module in which you want to create the data auditor.

  2. Right-click Data Auditors and select New.

    The Create Data Auditor Wizard is displayed.

  3. On the Name and Description page, enter a name and an optional description for the data auditor. Click Next.

  4. On the Select Objects page, select the data objects that you want to audit and click Next.

    The Available section lists the objects available for auditing. This contains only objects that have data rules bound to them. The Selected section contains the objects that are selected for auditing. Use the shuttle buttons to move objects to the Selected section. You can select multiple objects by holding down the Ctrl key while selecting objects.

  5. On the Choose Actions page, specify the action to be taken for records that do not comply with data rules bound to the selected objects and click Next.

    See "Specifying Data Correction Actions".

  6. On the Summary page, review the selections you made. Click Back to modify any selected values or click Finish to create the data auditor.

The new data auditor is added to the Data Auditors node. At this stage, only the metadata for the data auditor is stored in your workspace. To use this data auditor to monitor the quality of data in your data objects, you must run the data auditor.

Specifying Data Correction Actions

Use the Choose Actions page of the Create Data Auditor Wizard or the Choose Action tab of the Edit Data Auditor dialog box to specify data correction actions. This page contains two sections: Error threshold mode and Data Rules.

Error threshold mode

Error threshold mode is used to determine the compliance of data to data rules in the objects. Select one of the following options:

  • Percent: The data auditor will set the audit result based on the percentage of records that do not comply with the data rule. This percentage is specified in the rule's Defect Threshold value.

  • Six Sigma: The data auditor will set the audit result based on the Six Sigma values for the data rules. If the calculated Six Sigma value for any rule is less than the specified Sigma Threshold value, then the data auditor will set AUDIT_RESULT to 2.

Data Rules

The Data Rules section lists the data rules applied to the objects selected on the Select Object page. For each rule, specify the following:

  • Action: The action to be performed if data in the source object does not comply with the data rule. Select Report to ensure that the data rule is audited. Select Ignore if you want the data rule to be ignored.

  • Defect Threshold: The percent of records that should comply with the data rules to ensure successful auditing. Specify a value between 1 and 100. This value is ignored if you select Six Sigma in the Error threshold mode section.

  • Sigma Threshold: The required success rate. Specify a number between 0 and 7. If you set the value to 7, no failures are allowed. This value is ignored if you select Percent in the Error threshold mode section.

Editing Data Auditors

After you create a data auditor, you can edit it and modify any of its properties.

To edit a data auditor:

  1. In the Project Explorer, right-click the data auditor and select Open Editor.

    The Edit Data Auditor dialog box is displayed.

  2. On the Name tab, enter a new name or description for the data auditor.

  3. On the Select Objects tab, add or remove objects that will be audited as part of the data auditor using the arrows.

  4. On the Choose Actions tab, edit the data correction actions you specified.

    See "Specifying Data Correction Actions".

  5. On the Reconcile Objects tab, select the check box to the left of an object to reconcile its definition with the latest repository definition. Click Reconcile.

  6. Click OK to close the Edit Data Auditor dialog box.

Auditing Data Objects Using Data Auditors

After you create a data auditor, you can use it to monitor the data in your data objects. This ensures that the data rule violations for the objects are detected. When you run a data auditor, any records that violate the data rules defined on the data objects are written to the error tables.

There are two ways of using data auditors:

  • Manually Running Data Auditors

  • Automatically Running Data Auditors

Manually Running Data Auditors

To check if the data in the data object adheres to the data rules defined for the object, you must run the data auditor. You can run data auditors from the Design Center or the Control Center Manager. To run a data auditor from the Design Center, right-click the data auditor and select Start. In the Control Center Manager, select the data auditor, and from the File menu, select Start. The results are displayed in the Job Details window as described in "Data Auditor Execution Results".

Automatically Running Data Auditors

You can automate the process of running a data auditor using the following steps:

  1. Create a process flow that contains a Data Auditor Monitor activity.

  2. Schedule this process flow to run at a predefined time.

    For more information about scheduling objects, see "Process for Defining and Using Schedules".

Figure 23-11 displays a process flow that contains a Data Auditor Monitor activity. In this process flow, LOAD_EMP_MAP is a mapping that loads data into the EMP table. If the data load is successful, the data auditor EMP_DATA_AUDIT is run. The data auditor monitors the data in the EMP table based on the data rules defined for the table.

Figure 23-11 Data Auditor Monitor Activity in a Process Flow

Description of Figure 23-11 follows
Description of "Figure 23-11 Data Auditor Monitor Activity in a Process Flow"

Data Auditor Execution Results

After you run a data auditor, the Job Details window displays the details of the execution. The Job Details window contains two tabs: Input Parameters and Execution Results. Note that the Job Details window is displayed only when you set the deployment preference Show Monitor to true. For more information about deployment preferences, see "Deployment Preferences".

Figure 23-12 displays the Execution Results tab of the Job Details window.

Figure 23-12 Data Auditor Execution Results

Description of Figure 23-12 follows
Description of "Figure 23-12 Data Auditor Execution Results"

The Input Parameters tab contains the values of input parameters used to run the data auditor. The Execution Results tab displays the results of running the data auditor. This tab contains two sections: Row Activity and Output Parameters.

The Row Activity section contains details about the inserts into the error table for each step. Note that when more than one data rule is specified, a multitable insert may be used in the data auditor. In this case, the row counts may not be accurate.

In Figure 23-12, the data rule called E_NOT_NULL inserted one record into the error table.

The Output Parameters section contains the following three parameters:

  • AUDIT_RESULT: Indicates the result of running the data auditor. The possible values for this parameter are as follows:

    0: No data rule violations occurred.

    1: At least one data rule violation occurred, but no data rule failed to meet the minimum quality threshold as defined in the data auditor.

    2: At least one data rule failed to meet the minimum quality threshold.

    For more information about setting the threshold, see the step on choosing actions in "Creating Data Auditors".

  • EO_<data_rule_name>: Represents the calculated error quality for the specified data rule. 0 indicates all errors and 100 indicates no errors.

  • SO_<data_rule_name>: Represents the Six Sigma quality calculated for the specified data rule.

Configuring Data Auditors

During the configuration phase, you assign physical deployment properties to the data auditor you created by setting the configuration parameters. The Configuration Properties dialog box enables you to configure the physical properties of the data auditor.

To configure a data auditor:

  1. From the Project Explorer, expand the Databases node and then the Oracle node.

  2. Right-click the name of the data auditor you want to configure and select Configure.

    The Configuration Properties dialog box is displayed.

  3. Based on your requirement, configure the parameters listed in Run Time Parameters, Data Auditor Parameters, and Code Generation Options.

Run Time Parameters

Default Purge Group: This parameter is used when executing the package. Each audit record in the runtime schema is assigned to the purge group specified.

Bulk size: The number of rows to be fetched as a batch while processing cursors.

Analyze table sample percentage: The percentage of rows to be sampled when the target tables are analyzed. You analyze target tables to gather statistics that you can use to improve performance while loading data into the target tables.

Commit frequency: The number of rows processed before a commit is issued.

Maximum number of errors: The maximum number of errors allowed before the execution of this step is terminated.

Default Operating Mode: The operating mode used. The options you can select are Row based, Row based (target only), Set based, Set based fail over to row based, Set based fail over to row based (target only).

Default Audit Level: Use this parameter to indicate the audit level used when executing the package. When the package is run, the amount of audit information captured in the runtime schema depends on the value set for this parameter.

The options you can select are as follows:

ERROR DETAILS: At runtime, error information and statistical auditing information is recorded.

COMPLETE: All auditing information is recorded at runtime. This generates a huge amount of diagnostic data which may quickly fill the allocated tablespace.

NONE: No auditing information is recorded at runtime.

STATISTICS: At runtime, statistical auditing information is recorded.

Data Auditor Parameters

This category uses the same name as the data auditor and contains the following generic data auditor configuration parameters:

Generation comments: Specify additional comments for the generated code.

Threshold Mode: Specify the mode that should be used to measure failure thresholds. The options are PERCENTAGE and SIX SIGMA.

Language: The language used to define the generated code. The options are PL/SQL (default) and UNDEFINED. Ensure that PL/SQL (default) is selected.

Deployable: Select this option to indicate that you want to deploy this data auditor. Warehouse Builder generates code only if the data auditor is marked as deployable.

Referred Calendar: Specify the schedule to associate with the data auditor. The schedule defines when the data auditor will run.

Code Generation Options

ANSI SQL syntax: Select this option to use ANSI SQL code in the generated code. If this option is not selected, Oracle SQL syntax is generated.

Commit control: Specifies how commit is performed. The options available for this parameter are: Automatic, Automatic Correlated, and Manual. Ensure that this parameter is set to Automatic.

Enable Parallel DML: Select this option to enable parallel DML at runtime.

Analyze table statistics: Select this option to generate the statement used to collect statistics for the data auditor. If the target table is not in the same schema as the mapping and you want to analyze the table, you must grant ANALYZE ANY to the schema that owns the mapping.

Optimized Code: Select this option to indicate that optimized code should be generated.

Generation Mode: Select the mode in which optimized code should be generated. The options you can select are: All Operating Modes, Row based, Row based (target only), Set based, Set based fail over to row based, and Set based fail over to row based (target only).

Use Target Load Ordering: Select this option to generate code for target load ordering.

Error Trigger: Specify the name of the error trigger procedure.

Bulk Processing code: Select this option to generate bulk processing code.

Viewing Data Auditor Error Tables

When you run a data auditor, either manually or as part of the process flow, records that do not comply with the data rules defined on the objects contained in the data auditor are written to error tables. Each object contained in the data auditor has a corresponding error table that stores non-compliant records for that object.

You can view all non-compliant records that are written to error tables by using the Repository Browser.

To view error tables created as a result of data auditor execution:

  1. Grant privileges on the error tables as described in "Granting Privileges on Error Tables".

  2. Use the Repository Browser to view the error tables.

Granting Privileges on Error Tables

Before you view data stored in error tables using the Repository Browser, you must grant privileges on the error tables to the OWBSYS user. This enables the Repository Browser to access error table data.

To grant privileges on error tables:

  1. In SQL*Plus, log in to the schema containing the error tables.

    The error table for an object is stored in the same schema as the object.

  2. Run the SQL script OWB_ORACLE_HOME\owb\rtp\sql\grant_error_table_privileges.sql.

  3. When prompted, enter the name of the error table for which you want to grant privileges.

    If you did not specify a name for the error table of an object using the Error Table Name property, Warehouse Builder provides a default name. For objects that use error tables, the default error table name is the object name suffixed by "_ERR". An example session is shown after these steps.

  4. Repeat steps 2 and 3 for each error table to which you want to grant privileges.
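As an illustration only, a session for an EMPLOYEES table whose error table uses the default name EMPLOYEES_ERR might look like the following; the exact prompts and the privileges granted are determined by the script itself.

    SQL> CONNECT hr
    SQL> @OWB_ORACLE_HOME\owb\rtp\sql\grant_error_table_privileges.sql

When the script prompts for the error table name, you would enter EMPLOYEES_ERR.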

Setting Data Watch and Repair for Oracle Master Data Management (MDM)

Data Watch and Repair (DWR) is a profiling and correction solution designed to assist data governance in Oracle's Master Data Management (MDM) solutions. MDM applications need to provide a single consolidated view of data. To do this, they need to first clean up a system's master data before they can share it with multiple connected entities.

Warehouse Builder provides data profiling and data correction functionality that enables MDM applications to cleanse and consolidate data. You can use DWR for the following MDM applications:

  • Customer Data Hub (CDH)

  • Product Information Management (PIM)

  • Universal Customer Master (UCM)

Overview of Data Watch and Repair (DWR) for MDM

Data Watch and Repair (DWR) enables you to analyze, cleanse, and consolidate data stored in MDM databases using the following:

  • Data profiling

    Data profiling is a data analysis method that enables you to detect and measure defects in your source data.

    For more information about data profiling, see "About Data Profiling".

  • Data rules

    Data rules help ensure data quality by determining the legal data and relationships in the source data. You can import MDM-specific data rules, define your own data rules, or derive data rules based on the data profiling results.

    For more information about data rules, see "About Data Rules".

  • Data Correction

    Data correction enables you to correct any inconsistencies, redundancies, and inaccuracies in both the data and metadata. You can automatically create correction mappings to cleanse source data.

    For more information about data correction, see "About Data Correction and Augmentation".

DWR enables you to regularly measure compliance with crucial business rules. As you discover inconsistencies in the data, you can define and apply new data rules to ensure data quality.

Predefined Data Rules for MDM

Warehouse Builder provides a set of data rules that are commonly used in MDM applications. These include the following customer data rules that can be used in both Customer Data Hub (CDH) and Universal Customer Master (UCM) applications:

  • Attribute Completeness

  • Contact Completeness

  • Data Type

  • Data Domain

  • Restricted Values

  • Unique Key Discovery

  • Full Name Standardization

  • Common Pattern

  • Name Capitalization

  • Extended Phone Numbers

  • International Phone Numbers

  • No Access List by Name Only

  • No Access List by Name or SSN

  • No Email List

For more details about these data rules, refer to the Oracle Data Watch and Repair for MDM User's Guide.

Prerequisites for Performing Data Watch and Repair (DWR)

To use Data Watch and Repair (DWR), you need the following software:

  • Oracle Database 11g Release 1 (11.1) or later

  • One or more of the following Master Data Management (MDM) applications: Customer Data Hub (CDH), Product Information Management (PIM), or Universal Customer Master (UCM).

For MDM applications that run on an Oracle Database, you can use DWR directly. For MDM applications that do not run on an Oracle Database, you must first set up a gateway to the third-party database, as illustrated below.
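
As a rough illustration of the non-Oracle case: once the gateway is configured, the MDM schema is typically reached from the Oracle side through a database link. The link name, credentials, TNS alias, and table name below are assumptions for this sketch, not values prescribed by Warehouse Builder.

    -- Illustrative only: a database link that connects through an
    -- already-configured gateway to the third-party MDM database
    CREATE DATABASE LINK mdm_source_link
      CONNECT TO "mdm_user" IDENTIFIED BY "mdm_password"
      USING 'gateway_tns_alias';

    -- Quick connectivity check through the link
    SELECT COUNT(*) FROM "CUSTOMER_TABLE"@mdm_source_link;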

Steps to Perform Data Watch and Repair (DWR) Using Warehouse Builder

  1. Create a location corresponding to the Master Data Management (MDM) application database.

    Use the Oracle node under the Databases node in the Connection Explorer. Specify the details of the MDM database such as the user name, password, host name, port, service name, and database version.

  2. In the Project Explorer, expand the Applications node to display the nodes for the MDM applications.

    The CDH node represents the Customer Data Hub application, the PIM node represents the Product Information Management application, and the UCM node represents the Universal Customer Master application.

  3. Right-click the node corresponding to the type of MDM application for which you want to perform DWR and select Create CMI Module.

    Use the Create Module Wizard to create a module that stores your MDM metadata definitions. Ensure that you select the location you created in step 1 while creating the module.

  4. Import metadata from your MDM application into the module created in step 3. Right-click the module and select Import.

    The Metadata Import Wizard is displayed. This wizard enables you to import MDM metadata. For more information, see "Using the Import Metadata Wizard".

  5. Import data rules specific to MDM as described in "Importing MDM Data Rules".

  6. Apply data rules to the MDM application tables as described in "Applying Data Rules to Objects".

    Applying data rules to tables enables you to determine if your table data complies with the business rules specified in the data rules. You can apply data rules you imported in step 5 or other data rules that you created.

    For more information about creating data rules, see "Creating Data Rules".

  7. Create a data profile that contains all tables from the MDM application that you want to profile.

    For more information about creating data profiles, see "Create a Data Profile".

  8. Perform data profiling on the MDM application objects as described in "Profile the Data".

  9. View the data profiling results as described in "View Profile Results".

  10. (Optional) Derive data rules based on data profiling results as described in "Derive Data Rules".

    Data rules derived from data profiling results are automatically applied to the table.

  11. Create correction mappings as described in "Steps to Create Corrections".

  12. Correct data and metadata using the correction mappings generated by Warehouse Builder as described in "Steps to Deploy Correction Objects".

  13. Write the corrected data, stored in the correction objects created in step 12, to the MDM application as described in "Writing Corrected Data and Metadata to the MDM Application".

Importing MDM Data Rules

Data rules required for Customer Data Hub (CDH) and Universal Customer Master (UCM) applications are provided in the OWB_ORACLE_HOME/misc/dwr/customer_data_rules.mdl file. To import these data rules, from the Design menu, select Import, then Warehouse Builder Metadata. In the Metadata Import dialog box, select customer_data_rules.mdl and click OK. For more information about using the Metadata Import dialog box, click Help.

The imported data rules are listed in the Global Explorer, under the MDM Customer Data Rules node of the Public Data Rules node.

Writing Corrected Data and Metadata to the MDM Application

The cleansed and corrected data is contained in the correction objects created as a result of data profiling.

To be more efficient, you can write back only the rows that need correction. To do this, modify the generated correction mapping: delete the branch that passes compliant rows through unchanged (the branch containing the minus filter and minus set operators), and retain only the branch that processes the corrected rows.

Use the following steps to write corrected data to the source MDM application:

  1. Create a mapping using the Mapping Editor.

  2. Drag and drop the corrected table onto the Mapping Editor. This represents the source table.

  3. For UCM, drag and drop the interface table that corresponds to the base table with which you are working.

    Use the MDM application tools and documentation to determine the base table for a particular interface table.

  4. Map the columns from the corrected table to the interface table.

  5. Deploy and execute the mapping to write corrected data to the source MDM application.

  6. Update the base table with the changes made to the interface table. You can use Siebel Enterprise Integration Manager (EIM) to do this; EIM can be run from the command line or from a graphical user interface (GUI).

    For more details about using EIM, refer to the Siebel Enterprise Integration Manager Administration Guide.