Oracle® Warehouse Builder User's Guide
10g Release 2 (10.2.0.2)

Part Number B28223-05
10 Understanding Data Quality Management

Today, more than ever, organizations realize the importance of data quality. By ensuring that quality data is stored in your data warehouse or business intelligence application, you also ensure the quality of information for dependent applications and analytics.

Oracle Warehouse Builder offers a set of features that assist you in creating data systems that provide high quality information to your business users. You can implement a quality process that assesses, designs, transforms, and monitors quality. Within these phases, you will use specific functionality from Warehouse Builder to create improved quality information.

This chapter contains the following topics:

About the Data Quality Management Process

Quality data is crucial to decision-making and planning. The aim of building a data warehouse is to have an integrated, single source of data that can be used to make business decisions. Since the data is usually sourced from a number of disparate systems, it is important to ensure that the data is standardized and cleansed before loading into the data warehouse.

Warehouse Builder provides functionality that enables you to effectively manage data quality by assessing, transforming, and monitoring your data. The benefits of using Warehouse Builder for data management are as follows:

Phases in the Data Quality Life Cycle

Ensuring data quality involves the following phases:

  • Quality Assessment

  • Quality Design

  • Quality Transformation

  • Quality Monitoring

Figure 10-1 shows the phases involved in providing high quality information to your business users.

Figure 10-1 Phases Involved in Providing Quality Information


Quality Assessment

In the quality assessment phase, you determine the quality of the source data. The first step in this phase is to load the source data, which could be stored in different sources, into Warehouse Builder. You can import metadata and data from both Oracle and non-Oracle sources.

After you load the source data, you use data profiling to assess its quality. Data profiling is the process of uncovering data anomalies, inconsistencies, and redundancies by analyzing the content, structure, and relationships within the data. The analysis and data discovery techniques form the basis for data monitoring. For more information about data profiling, see "About Data Profiling".

Quality Design

The quality design phase consists of designing your quality processes. You can specify the legal data within a data object or legal relationships between data objects using data rules. For more information about data rules, see "About Data Rules".

You also correct and augment your data. You can use data quality operators to correct and augment data. For more information, see "About Data Quality".

As part of the quality design phase, you also design the transformations that ensure data quality. These transformations could be mappings that are generated by Warehouse Builder as a result of data profiling or mappings you create.

Quality Transformation

The quality transformation phase consists of running the correction mappings that are used to correct the source data.

Quality Monitoring

Data monitoring is the process of examining your data over time and alerting you when the data violates any business rules that are set. For more information about data monitoring, see "About Quality Monitoring".

About Data Profiling

Data profiling is the first step for any organization to improve information quality and provide better decisions. It is a robust data analysis method available in Warehouse Builder that you can use to discover and measure defects in your data before you start working with it. Because of its integration with the ETL features in Warehouse Builder and other data quality features, such as data rules and built-in cleansing algorithms, you can also generate data cleansing mappings and schema corrections. This enables you to automatically correct any inconsistencies, redundancies, and inaccuracies in both the data and metadata.

Data profiling enables you to discover many important things about your data. Some common findings include the following:

This section contains the following topics:

Uses of Data Profiling

Using the data profiling functionality in Warehouse Builder enables you to:

  • Profile data from any source or combination of sources that Warehouse Builder can access.

  • Explore data profiling results in tabular or graphical format.

  • Drill down into the actual data related to any profiling result.

  • Derive data rules, either manually or automatically, based on the data profiling results.

  • Attach any data rule to a target object and select an action to perform if the rule fails.

  • Create a data auditor from a data rule to continue monitoring the quality of data being loaded into an object.

  • Derive quality indices such as six-sigma valuations.

  • Profile or test any data rules you want to verify before putting them in place.

Types of Data Profiling

Following the selection of data objects, determine the aspects of your data that you want to profile and analyze. As shown in Figure 10-2, data profiling offers three main types of analysis: attribute analysis, functional dependency, and referential analysis. You can also create custom profiling processes using data rules, allowing you to validate custom rules against the actual data and get a score of their accuracy.

Figure 10-2 Three Types of Data Profiling


Attribute Analysis

Attribute analysis seeks to discover both general and detailed information about the structure and content of data stored within a given column or attribute. Attribute analysis looks for information about patterns, domains, data types, and unique values.

Figure 10-3 Data Profiling in Warehouse Builder


Pattern analysis attempts to discover patterns and common types of records by analyzing the string of data stored in the attribute. It identifies the percentages of your data that comply with a certain regular expression format pattern found in the attribute. Using these pattern results, you can create data rules and constraints to help clean up current data problems. Some commonly identified patterns include dates, e-mail addresses, phone numbers, and social security numbers.

Table 10-1 shows a sample attribute, Job Code, that could be used for pattern analysis.

Table 10-1 Sample Columns Used for Pattern Analysis

Job ID    Job Code
7         337-A-55
9         740-B-74
10        732-C-04
20        43-D-4


Table 10-2 shows the possible results from pattern analysis, where D represents a digit and X represents a character. After looking at the results and knowing that it is company policy for all job codes to be in the format DDD-X-DD, you can derive a data rule that requires all values in this attribute to conform to this pattern.

Table 10-2 Pattern Analysis Results

Job Code    % Occurred
DDD-X-DD    75%
DD-X-D      25%
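
Warehouse Builder computes these pattern percentages for you during profiling; purely as an illustration of the idea, the following Python sketch derives the same D/X patterns and percentages from the hypothetical job codes shown in Table 10-1.

  import re
  from collections import Counter

  def to_pattern(value):
      # Replace letters with X first, then digits with D, so that the
      # substituted D placeholders are not themselves rewritten.
      return re.sub(r"\d", "D", re.sub(r"[A-Za-z]", "X", value))

  job_codes = ["337-A-55", "740-B-74", "732-C-04", "43-D-4"]
  counts = Counter(to_pattern(code) for code in job_codes)
  for pattern, count in counts.most_common():
      print(f"{pattern}: {count / len(job_codes):.0%}")
  # Prints DDD-X-DD: 75% and DD-X-D: 25%, matching Table 10-2.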


Domain analysis identifies a domain, or set of commonly used values, within the attribute by capturing the most frequently occurring values. For example, the Status column in the Customers table is profiled and the results reveal that 90% of the values are among the following: "MARRIED", "SINGLE", "DIVORCED". Further analysis and drilling down into the data reveal that the other 10% contains misspelled versions of these words, with a few exceptions. The profiling configuration determines when a set of values qualifies as a domain, so review the configuration before accepting domain values. You can then let Warehouse Builder derive a rule that requires the data stored in this attribute to be one of the three values that qualified as a domain.
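
The profiler decides when a set of values qualifies as a domain based on its configuration; the Python sketch below only illustrates the underlying idea, using an assumed rule that the most frequent values must cover at least 90% of the rows.

  from collections import Counter

  def discover_domain(values, max_domain_size=3, min_coverage=0.90):
      # Qualify the most frequent values as a domain if they cover enough rows.
      top = Counter(values).most_common(max_domain_size)
      coverage = sum(count for _, count in top) / len(values)
      return [value for value, _ in top] if coverage >= min_coverage else None

  statuses = ["MARRIED", "SINGLE", "DIVORCED", "MARRIED", "SINGEL",
              "MARRIED", "SINGLE", "DIVORCED", "MARRIED", "SINGLE"]
  print(discover_domain(statuses))  # ['MARRIED', 'SINGLE', 'DIVORCED']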

Data type analysis enables you to discover information about the data types found in the attribute. This type of analysis reveals metrics such as minimum and maximum character length values as well as scale and precision ranges. In some cases, the database column is of data type VARCHAR2, but the values in the column are all numbers; you may then want to ensure that only numbers are loaded. Using data type analysis, you can have Warehouse Builder derive a rule that requires all data stored within an attribute to be of the same data type.
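
As a simple illustration (not the Warehouse Builder implementation), the following Python sketch reports the character-length range of a column's values and whether they are all numeric, which is the kind of metric data type analysis produces for a VARCHAR2 column. The sample values are hypothetical.

  values = ["7369", "7499", "7521", "75A6"]  # hypothetical VARCHAR2 column contents

  lengths = [len(value) for value in values]
  all_numeric = all(value.isdigit() for value in values)
  print(f"min length: {min(lengths)}, max length: {max(lengths)}, "
        f"all numeric: {all_numeric}")
  # A derived rule could then require every value in the column to be numeric.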

Unique key analysis provides information to assist you in determining whether or not an attribute is a unique key. It does this by looking at the percentages of distinct values that occur in the attribute. You might determine that attributes with a minimum of 70% distinct values should be flagged for unique key analysis. For example, using unique key analysis you could discover that 95% of the values in the EMP_ID column are unique. Further analysis of the other 5% reveals that most of these values are either duplicates or nulls. You could then derive a rule that requires that all entries into the EMP_ID column be unique and not null.
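
Again as an illustration only, a distinct-value percentage like the one unique key analysis reports could be computed along the following lines; the 70% flagging threshold is the one mentioned above, and the sample EMP_ID values are hypothetical.

  def distinct_percentage(values):
      # Ratio of distinct non-null values to all non-null values.
      non_null = [value for value in values if value is not None]
      return len(set(non_null)) / len(non_null)

  emp_ids = [101, 102, 103, 104, 105, 105, None]
  ratio = distinct_percentage(emp_ids)
  if ratio >= 0.70:
      print(f"EMP_ID is a unique key candidate ({ratio:.0%} distinct)")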

Functional Dependency

Functional dependency analysis reveals information about column relationships. This enables you to search for things such as one attribute determining another attribute within an object.

Table 10-3 shows the contents of the Employees table in which the attribute Dept Location is dependent on the attribute Dept Number. Note that the attribute Dept Number is not dependent on the attribute Dept Location.

Table 10-3 Employees Table

ID    Name       Salary    Dept Number    Dept Location
10    Alison     1000      10             SF
20    Rochnik    1000      11             London
30    Meijer     300       12             LA
40    John       500       13             London
50    George     200       13             London
60    Paul       600       13             London
70    Ringo      100       13             London
80    Yoko       600       13             London
90    Jones      1200      10             SF
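
Outside Warehouse Builder, the dependency shown in Table 10-3 could be checked with a sketch like the following: Dept Number determines Dept Location, but not the reverse.

  from collections import defaultdict

  def determines(rows, lhs, rhs):
      # lhs -> rhs holds if every lhs value maps to exactly one rhs value.
      seen = defaultdict(set)
      for row in rows:
          seen[row[lhs]].add(row[rhs])
      return all(len(values) == 1 for values in seen.values())

  employees = [
      {"dept_number": 10, "dept_location": "SF"},
      {"dept_number": 11, "dept_location": "London"},
      {"dept_number": 12, "dept_location": "LA"},
      {"dept_number": 13, "dept_location": "London"},
      {"dept_number": 10, "dept_location": "SF"},
  ]
  print(determines(employees, "dept_number", "dept_location"))  # True
  print(determines(employees, "dept_location", "dept_number"))  # False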


Referential Analysis

Referential analysis attempts to detect aspects of your data objects that refer to other objects. The purpose behind this type of analysis is to provide insight into how the object you are profiling is related or connected to other objects. Because you are comparing two objects in this type of analysis, one is often referred to as the parent object and the other as the child object. Some of the common things detected include orphans, childless objects, redundant objects, and joins. Orphans are values that are found in the child object, but not found in the parent object. Childless objects are values that are found in the parent object, but not found in the child object. Redundant attributes are values that exist in both the parent and child objects.

Table 10-4 and Table 10-5 show the contents of two tables that are candidates for referential analysis. Table 10-4 is the child object and Table 10-5 is the parent object.

Table 10-4 Employees Table (Child)

ID    Name       Dept. Number    City
10    Alison     17              NY
20    Rochnik    23              SF
30    Meijer     23              SF
40    Jones      15              SD


Table 10-5 Department Table (Parent)

Dept. Number    Location
17              NY
18              London
20              SF
23              SF
55              HK


Referential analysis of these two objects would reveal that Dept. Number 15 from the Employees table is an orphan and Dept. Numbers 18, 20, and 55 from the Department table are childless. It would also reveal a join on the Dept. Number column.

Based on these results, you could derive referential rules that determine the cardinality between the two tables.
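
For illustration, the orphan and childless values described for Table 10-4 and Table 10-5 can be found with simple set differences, as in this Python sketch.

  employee_depts = {17, 23, 15}            # Dept. Number values in the child table
  department_depts = {17, 18, 20, 23, 55}  # Dept. Number values in the parent table

  orphans = employee_depts - department_depts    # in child, not in parent
  childless = department_depts - employee_depts  # in parent, not in child
  print(f"orphans: {sorted(orphans)}, childless: {sorted(childless)}")
  # Prints orphans: [15], childless: [18, 20, 55]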

Data Rule Profiling

In addition to attribute analysis, functional dependency, and referential analysis, Warehouse Builder offers data rule profiling. Data rule profiling enables you to create rules to search for profile parameters within or between objects.

This is very powerful because it enables you to validate rules that apparently exist and are defined by the business users. By creating a data rule and then profiling with it, you can verify whether the data actually complies with the rule, and whether the rule needs amending or the data needs cleansing.

For example, you could create a rule that Income = Salary + Bonus for the Employee table shown in Table 10-6. You can then catch errors such as the one for employee Alison.

Table 10-6 Sample Employee Table

ID    Name       Salary    Bonus    Income
10    Alison     1000      50       1075 X
20    Rochnik    1000      75       1075
30    Meijer     300       35       335
40    Jones      1200      500      1700
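
A sketch of the same check outside Warehouse Builder: profiling the data in Table 10-6 against the rule Income = Salary + Bonus flags the row for Alison.

  employees = [  # (name, salary, bonus, income)
      ("Alison", 1000, 50, 1075),
      ("Rochnik", 1000, 75, 1075),
      ("Meijer", 300, 35, 335),
      ("Jones", 1200, 500, 1700),
  ]
  violations = [name for name, salary, bonus, income in employees
                if income != salary + bonus]
  print(violations)  # ['Alison'] -- 1000 + 50 is 1050, not 1075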


How to Perform Data Profiling

Data profiling is, by definition, a resource-intensive process. It analyzes data and columns and performs many iterations to detect defects and anomalies in your data, so it warrants forethought and planning in order to be as effective as possible.

Before beginning data profiling, you should first identify the data objects that you want to target. Instead of profiling everything, choose objects that are deemed crucial. You should not select an entire source system for profiling at the same time. Not only is it a waste of resources, but it is also often unnecessary. Select areas of your data where quality is essential and has the largest fiscal impact.

For example, you have a data source that contains five tables: Customers, Regions, Orders, Products, and Promotions. You decide that the two most important tables with respect to data quality are Customers and Orders. The Customers table is known to contain many duplicate and erroneous entries that cost your company money on wasted marketing efforts. The Orders table is known to contain data about orders in an incorrect format. In this case, you would select only these two tables for data profiling.

After you have chosen the object you want to profile, use the following steps to guide you through the profiling process:

  1. Import or Select the Metadata

  2. Create a Data Profile

  3. Profile the Data

  4. View Profile Results and Derive Data Rules

  5. Generating Corrections

  6. Define and Edit Data Rules Manually

  7. Generate, Deploy, and Execute

The data profiling process ends at step 4. Steps 5 to 7 are optional; perform them if you want to correct the data after profiling it.

Import or Select the Metadata

Data profiling requires the profiled objects to be present in the project in which you are performing data profiling. Ensure that these objects are either imported into this project or created in it. Also ensure that the data is loaded into the objects. Having the data loaded is essential to data profiling.

Also, because data profiling uses mappings to run the profiling, you must ensure that all locations that you are using are registered. Data profiling attempts to register your locations; if it cannot register them for some reason, you must explicitly register the locations before you begin profiling.

Note:

You can only profile data in the default configuration.

Create a Data Profile

After your system is set up, you can create a data profile. A data profile is a metadata object in the Warehouse Builder repository that you create in the navigation tree. It contains the definitions and settings necessary for profiling objects: the set of data objects you want profiled, the settings controlling the profiling operations, the results returned after you profile the data, and correction information (if you decide to use these corrections). For more information about creating data profiles, see "Using Data Profiles".

Profile the Data

After you have created a data profile, you can open it in the Data Profile Editor to profile the data or review profile results from a previous run. Data profiling is achieved by performing deep scans of the selected objects. This can be a time-consuming process, depending on the number of objects and type of profiling you are running. However, profiling is run as an asynchronous job, and the client can be closed during this process. You will see the job running in the job monitor and Warehouse Builder prompts you when the job is complete.

The results are generated and can be viewed from the Data Profile Editor as soon as they are available. For more information about how to profile data, see "Profiling the Data".

You can, and should, configure the profile before running it if there are specific types of analysis you do, or do not, want to run. Configuration of the profile and its objects is possible at the following levels:

  • the entire profile (all the objects it contains)

  • an individual object (for example, a table)

  • a single column in a table

For example, if you know you only have one problematic column in a table and you already know that most of the records should conform to values within a certain domain, then you can focus your profiling resources on domain discovery and analysis. By narrowing down the type of profiling necessary, you use less resources and obtain the results faster. For more information about configuring data profiles, see "Property Inspector".

View Profile Results and Derive Data Rules

The profiling results contain a variety of analytical and statistical information about the data profiled. You can immediately drill down into anomalies and view the data that caused them. You can then determine what data must be corrected. For more information about viewing profiling results, see "Viewing the Results".

Based on your decisions, you can derive data rules. Data rules are used to ensure that only values compliant with the data rules are allowed within a data object. Data rules will form the basis for correcting or removing data if you decide to cleanse the data. You can also use data rules to report on non-compliant data. For more information about deriving data rules, see "Deriving Data Rules".

Generating Corrections

After you have derived data rules from the profiling results, you can create the schema and mapping corrections. The schema correction creates scripts that can be used to create a corrected set of source data objects with the derived data rules applied.

The mapping correction creates new correction mappings to take your data from the source objects and load them into new objects. For more information about creating schema and mapping corrections, see "Correcting Schemas and Cleansing Data".

Define and Edit Data Rules Manually

Data rules can be derived or manually created. Before and after you have created the corrections, you can define additional data rules manually. For more information about defining and editing data rules manually, see "Creating Data Rule Folders".

Generate, Deploy, and Execute

Finally, you can generate, deploy, and execute the correction mappings and data rules. After you run the correction mappings with the data rules, your data is corrected. The derived data rules remain attached to the objects in the corrected schema for optional use in data monitors.

About Six Sigma

Warehouse Builder provides Six Sigma results embedded within the other data profiling results to provide a standardized approach to data quality.

What is Six Sigma?

Six Sigma is a methodology that attempts to standardize the concept of quality in business processes. It achieves this goal by statistically analyzing the performance of business processes. The goal of Six Sigma is to improve the performance of these processes by identifying the defects, understanding them, and eliminating the variables that cause these defects.

Six Sigma metrics give a quantitative measure of the number of defects for each 1,000,000 opportunities. The term "opportunities" can be interpreted as the number of records. The perfect score is 6.0, which is achieved when there are only 3.4 defects for each 1,000,000 opportunities. The score is calculated using the following formulas:

  • Defects Per Million Opportunities (DPMO) = (Total Defects / Total Opportunities) * 1,000,000

  • Defects (%) = (Total Defects / Total Opportunities)* 100%

  • Yield (%) = 100 - %Defects

  • Process Sigma = NORMSINV(1-((Total Defects) / (Total Opportunities))) + 1.5

    where NORMSINV is the inverse of the standard normal cumulative distribution.
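
As an arithmetic check of these formulas (not part of Warehouse Builder), the following Python sketch computes the process sigma using the standard library's NormalDist as the NORMSINV equivalent; 3.4 defects per 1,000,000 opportunities yields the perfect score of 6.0.

  from statistics import NormalDist

  def process_sigma(total_defects, total_opportunities):
      defect_rate = total_defects / total_opportunities
      # NORMSINV equivalent: inverse of the standard normal cumulative distribution.
      return NormalDist().inv_cdf(1 - defect_rate) + 1.5

  print(round(process_sigma(3.4, 1_000_000), 2))     # 6.0
  print(round(process_sigma(66_807, 1_000_000), 2))  # about 3.0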

Six Sigma Metrics for Data Profiling

Six Sigma metrics are also provided for data profiling in Warehouse Builder. When you perform data profiling, the number of defects and anomalies discovered are shown as Six Sigma metrics. For example, if data profiling finds that a table has a row relationship with a second table, the number of records in the first table that do not adhere to this row-relationship can be described using the Six Sigma metric.

Six Sigma metrics are calculated for the following measures in the Data Profile Editor:

  • Aggregation: For each column, the number of null values (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented data type (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented length (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented scale (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented precision (defects) to the total number of rows in the table (opportunities).

  • Patterns: For each column, the number of values that do not comply with the common format (defects) to the total number of rows in the table (opportunities).

  • Domains: For each column, the number of values that do not comply with the documented domain (defects) to the total number of rows in the table (opportunities).

  • Referential: For each relationship, the number of values that do not comply with the documented foreign key (defects) to the total number of rows in the table (opportunities).

  • Referential: For each column, the number of values that are redundant (defects) to the total number of rows in the table (opportunities).

  • Unique Key: For each unique key, the number of values that do not comply with the documented unique key (defects) to the total number of rows in the table (opportunities).

  • Unique Key: For each foreign key, the number of rows that are childless (defects) to the total number of rows in the table (opportunities).

  • Data Rule: For each data rule applied to the data profile, the number of rows that fail the data rule to the number of rows in the table.

About Data Quality

Warehouse Builder enables you to automatically create correction mappings based on the results of data profiling. On top of these automated corrections that make use of the underlying Warehouse Builder architecture for data quality, you can create your own data quality mappings.

Warehouse Builder provides functionality that enables you to ensure data quality. During transformation of the source data, you can use the following operators to ensure data quality:

  • Match-Merge operator

  • Name and Address operator

About the Match-Merge Operator

Match-Merge is a data quality operator that identifies matching records and merges them into a single record. Master data management applications that work across various systems can use this operator to ensure that records are created and matched with a master record.

You define the business rules that the Match-Merge operator uses to identify records that refer to the same data. After the operator identifies matching records, it merges them into a single, consolidated record.

You have the following options for using a match-merge operator:

  • Define a new match-merge operator: Drag the Match-Merge operator from the Palette onto the mapping. The Mapping Editor launches the MATCHMERGE Wizard.

  • Edit an existing match-merge operator: Right-click the operator and select Open Details. The Mapping Editor opens the MATCHMERGE Editor.

About the Name and Address Operator

Warehouse Builder enables you to perform name and address cleansing on data using the Name and Address operator.

The Name and Address operator identifies and corrects errors and inconsistencies in name and address source data by comparing input data to the data libraries supplied by third-party name and address cleansing software vendors. You can purchase the data libraries directly from these vendors.

Note:

The Name and Address operator requires separate licensing and installation of third-party name and address cleansing software. Refer to the Oracle Warehouse Builder Installation and Administration Guide for more information.

You have the following options for using a Name and Address operator:

  • Define a new Name and Address operator: Drag the operator from the Toolbox onto the mapping. The Mapping Editor launches the Name and Address wizard.

  • Edit an existing Name and Address operator: Right-click the operator and select Open Details. The Mapping Editor opens the Name and Address Editor.

The errors and inconsistencies corrected by the Name and Address operator include variations in address formats, use of abbreviations, misspellings, outdated information, inconsistent data, and transposed names. The operator fixes these errors and inconsistencies by:

  • Parsing the name and address input data into individual elements.

  • Standardizing name and address data, using standardized versions of nicknames and business names and standard abbreviations of address components, as approved by the postal service of the appropriate country. Standardized versions of names and addresses facilitate matching and householding, and ultimately help you obtain a single view of your customer.

  • Correcting address information such as street names and city names. Filtering out incorrect or undeliverable addresses can lead to savings on marketing campaigns.

  • Augmenting names and addresses with additional data such as gender, ZIP+4, country code, apartment identification, or business and consumer identification. You can use this and other augmented address information, such as census geocoding, for marketing campaigns that are based on geographical location.

    Augmenting addresses with geographic information facilitates geography-specific marketing initiatives, such as marketing only to customers in large metropolitan areas (for example, within an n-mile radius of large cities) or only to customers served by a company's stores (within an x-mile radius of these stores). Oracle Spatial, an option with Oracle Database, and Oracle Locator, packaged with Oracle Database, are two products that you can use with this feature.

The Name and Address operator also enables you to generate postal reports for countries that support address correction and postal matching. Postal reports often qualify you for mailing discounts. For more information, see "About Postal Reporting".

Example: Correcting Address Information

This example follows a record through a mapping using the Name and Address operator. This mapping also uses a Splitter operator to demonstrate a highly recommended data quality error handling technique.

Example Input

In this example, the source data contains a Customer table with the row of data shown in Table 10-7.

Table 10-7 Sample Input to Name and Address Operator

Address Column    Address Component
Name              Joe Smith
Street Address    8500 Normandale Lake Suite 710
City              Bloomington
ZIP Code          55437


The data contains a nickname, a last name, and part of a mailing address, but it lacks the customer's full name, complete street address, and the state in which he lives. The data also lacks geographic information such as latitude and longitude, which can be used to calculate distances for truckload shipping.

Example Steps

This example uses a mapping with a Name and Address operator to cleanse name and address records, followed by a Splitter operator to load the records into separate targets depending on whether they were successfully parsed. This section explains the general steps required to design such a mapping.

To make the listed changes to the sample record:

  1. In the Mapping Editor, begin by adding the following operators to the canvas:

    • A CUSTOMERS table from which you extract the records. This is the data source. It contains the data in "Example Input".

    • A Name and Address operator. This action launches the Name and Address Wizard. Follow the steps of the wizard.

    • A Splitter operator. For information on using this operator, see "Splitter Operator".

    • Three target operators into which you load the successfully parsed records, the records with parsing errors, and the records whose addresses are parsed but not found in the postal matching software.

  2. Map the attributes from the CUSTOMERS table to the Name and Address operator ingroup. Map the attributes from the Name and Address operator outgroup to the Splitter operator ingroup.

    You are not required to use the Splitter operator, but it provides an important function in separating good records from problematic records.

  3. Define the split conditions for each of the outgroups in the Splitter operator and map the outgroups to the targets.

Figure 10-4 shows a mapping designed for this example. The data is mapped from the source table to the Name and Address operator, and then to the Splitter operator. The Splitter operator separates the successfully parsed records from those that have errors. The output from OUTGRP1 is mapped to the CUSTOMERS_GOOD target. The split condition for OUTGRP2 is set such that records whose Is Parsed flag is False are loaded to the NOT_PARSED target. That is, the split condition for OUTGRP2 is set as INGRP1.ISPARSED='F'. The records in the REMAINING_RECORDS group are successfully parsed, but their addresses are not found by the postal matching software. These records are loaded to the PARSED_NOT_FOUND target.

Figure 10-4 Name and Address Operator Used with a Splitter Operator in a Mapping


Example Output

If you run the mapping designed in this example, the Name and Address operator standardizes, corrects, and completes the address data from the source table. In this example, the target table contains the address data as shown in Table 10-8. Compare it with the input record from Table 10-7.

Table 10-8 Sample Output from Name and Address Operator

Address Column             Address Component
First Name Standardized    JOSEPH
Last Name                  SMITH
Primary Address            8500 NORMANDALE LAKE BLVD
Secondary Address          STE 710
City                       BLOOMINGTON
State                      MN
Postal Code                55437-3813
Latitude                   44.849194
Longitude                  -093.356352
Is Parsed                  True or False. Indicates whether a record can be separated into individual elements.
Is Good Name               True or False. Indicates whether the name was found in a postal database.
Is Good Address            True or False. Indicates whether the address was found in a postal database or was parsed successfully.
Is Found                   True or False. Indicates whether the address was found in a postal database.
Name Warning               True or False. Indicates whether problems occurred in parsing the name.
Street Warning             True or False. Indicates whether problems occurred in parsing the address.
City Warning               True or False. Indicates whether problems occurred in parsing the city name.


In this example, the following changes were made to the input data:

  • Joe Smith was split into separate columns for First_Name_Standardized and Last_Name.

  • Joe was standardized into JOSEPH and Suite was standardized into STE.

  • Normandale Lake was corrected to NORMANDALE LAKE BLVD.

  • The first portion of the postal code, 55437, was augmented with the ZIP+4 code to read 55437-3813.

  • Latitude and longitude locations were added.

  • The records were tested in various ways, and the good records were directed to a different target from the ones that had problems.

Handling Errors in Name and Address Data

Name and Address parsing, like any other type of parsing, depends on identification of keywords and patterns containing those keywords. Free-form name and address data is difficult to parse because the keyword set is large and never 100% complete. Keyword sets are built by analyzing millions of records, but each new data set is likely to contain some undefined keywords.

Because most free-form name and address records contain common patterns of numbers, single letters, and alphanumeric strings, parsing can often be performed based on just the alphanumeric patterns. However, alphanumeric patterns may be ambiguous or a particular pattern may not be found. Name and Address parsing errors set parsing status codes that you can use to control data mapping.

Since the criteria for quality vary among applications, numerous flags are available to help you determine the quality of a particular record. For countries with postal matching support, use the Is Good Group flag, because it verifies that an address is a valid entry in a postal database. Also use the Is Good Group flag for U.S. Coding Accuracy Support System (CASS) and Canadian Software Evaluation and Recognition Program (SERP) certified mailings.

Unless you specify postal reporting, an address does not have to be found in a postal database to be acceptable. For example, street intersection addresses or building names may not be in a postal database, but they may still be deliverable. If the Is Good Group flag indicates failure, additional error flags can help determine the parsing status.

The Is Parsed flag indicates success or failure of the parsing process. If Is Parsed indicates parsing success, you may still wish to check the parser warning flags, which indicate unusual data. You may want to check those records manually.

If Is Parsed indicates parsing failure, you must preserve the original data to prevent data loss.

Use the Splitter operator to map successful records to one target and failed records to another target.

About Postal Reporting

All address lists used to produce mailings for discounted automation postal rates must be matched by postal report-certified software. Certifications depend on the third-party vendors of name and address software and data. The certifications may include the following:

  • United States Postal Service: Coding Accuracy Support System (CASS)

  • Canada Post: Software Evaluation and Recognition Program (SERP)

  • Australia Post: Address Matching Approval System (AMAS)

United States Postal Service CASS Certification

The Coding Accuracy Support System (CASS) was developed by the United States Postal Service (USPS) in cooperation with the mailing industry. The system provides mailers a common platform to measure the quality of address-matching software, focusing on the accuracy of five-digit ZIP Codes, ZIP+4 Codes, delivery point codes, and carrier route codes applied to all mail. All address lists used to produce mailings for automation rates must be matched by CASS-certified software.

To meet USPS requirements, the mailer must submit a CASS report in its original form to the USPS.

Canada Post SERP Certification

Canada Post developed a testing program called Software Evaluation and Recognition Program (SERP), which evaluates software packages for their ability to validate, or validate and correct, mailing lists to Canada Post requirements. Postal programs that meet SERP requirements are listed on the Canada Post Web site.

Canadian postal customers who use Incentive Lettermail, Addressed Admail, and Publications Mail must meet the Address Accuracy Program requirements. Customers can obtain a Statement of Accuracy by comparing their databases to Canada Post's address data.

Australia Post AMAS Certification

The Address Matching Approval System (AMAS) was developed by Australia Post to improve the quality of addressing. It provides a standard by which to test and measure the ability of address-matching software to:

  • Correct and match addresses against the Postal Address File (PAF)

  • Append a unique Delivery Point Identifier (DPID) to each address record, which is a step toward barcoding mail.

AMAS allows companies to develop address matching software which:

  • Prepares addresses for barcode creation

  • Ensures quality addressing

  • Enables qualification for discounts on PreSort letter lodgements

PreSort Letter Service prices are conditional upon customers using AMAS Approved Software with Delivery Point Identifiers (DPIDs) being current against the latest version of the PAF.

A declaration that the mail was prepared appropriately must be made when using the Presort Lodgement Document, available from post offices.

About Data Rules

Data rules are definitions for valid data values and relationships that can be created in Warehouse Builder. They determine legal data within a table or legal relationships between tables. Data rules help ensure data quality. They can be applied to tables, views, dimensions, cubes, materialized views, and external tables. Data rules are used in many situations including data profiling, data and schema cleansing, and data auditing.

The metadata for a data rule is stored in the repository. To use a data rule, you apply the data rule to a data object. For example, you create a data rule called gender_rule that specifies that valid values are 'M' and 'F'. You can apply this data rule to the emp_gender column of the Employees table. Applying the data rule ensures that the values stored for the emp_gender column are either 'M' or 'F'. You can view the details of the data rule bindings on the Data Rule tab of the Data Object Editor for the Employees table.
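
In Warehouse Builder the rule is bound to the column through the Data Object Editor, but conceptually the check it enforces is as simple as the following sketch; the column values here are hypothetical.

  GENDER_RULE = {"M", "F"}  # valid values defined by the gender_rule data rule

  emp_gender = ["M", "F", "F", "Male", "M"]  # hypothetical column contents
  non_compliant = [value for value in emp_gender if value not in GENDER_RULE]
  print(non_compliant)  # ['Male'] -- the row that violates the rule binding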

There are two ways to create a data rule. A data rule can be derived from the results of data profiling, or it can be created using the Data Rule Wizard. For more information about data rules, see "Using Data Rules".

About Quality Monitoring

Quality monitoring builds on your initial data profiling and data quality initiatives. It enables you to monitor the quality of your data over time. You can define the business rules to which your data should adhere.

To monitor data using Warehouse Builder, you create data auditors, which ensure that your data complies with the business rules you defined using data rules.

About Data Auditors

Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule, and they can off-load defective data into auditing and error tables.

Data auditors have thresholds that allow you to create logic based on the fact that too many non-compliant records can divert the process flow into an error or notification stream. Based on this threshold, the process can choose actions. The audit result can also be captured and stored for analysis purposes.

Data auditors can be deployed and executed ad-hoc, but they are typically run to monitor the quality of the data in an operational environment like a data warehouse or ERP system and, therefore, can be added to a process flow and scheduled.

When executed, the data auditor sets several output values. One of these output values is called the audit result. If the audit result is 0, then there were no errors. If the audit result is 1, at least one error occurred. If the audit result is 2, then at least one data rule failed to meet the specified error threshold. Data auditors also set the actual measured values such as Error Percent and Six Sigma values. For more information about using data auditors, see "Using Data Auditors".
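
A process flow can branch on these output values. The following Python sketch is only a conceptual illustration of such branching logic; the actions taken for each result value are assumptions, not product behavior.

  def route_audit(audit_result, error_percent):
      # 0 = no errors, 1 = at least one error,
      # 2 = at least one data rule exceeded its error threshold.
      if audit_result == 0:
          return "continue load"
      if audit_result == 1:
          return f"continue load, notify steward ({error_percent:.1f}% errors)"
      return "divert to error stream"

  print(route_audit(2, 7.5))  # divert to error stream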

Data auditors are an important tool for ensuring that data quality levels meet the standards set by the users of the system. They also help detect spikes in bad data, allowing events to be tied to these spikes.

You can use the Metadata Loader (MDL) utilities to import or export metadata related to data auditors. For more information about using the MDL utilities, see Chapter 33, "Importing and Exporting with the Metadata Loader (MDL)".