Skip Headers

Oracle9i Warehouse Builder User's Guide
Release 2 (9.0.4)

Part Number B10657-01
Go To Table Of Contents
Contents
Go To Index
Index

Go to previous page Go to next page

19
Data Quality: Name and Address Cleansing

Parse table maintenance enables the parser in the Name and Address operator to recognize regional or data-specific patterns. This chapter contains information applicable to the Name and Address functionality provided by Warehouse Builder. If you have also purchased Oracle9i Warehouse Builder Name and Address, this functionality is available in the form of the Name and Address operator. For more information about the Name and Address operator, refer to Chapter 7, "Using Mapping Operators".

This chapter contains the following topics:

Overview

To use the Name and Address operator, you must install Oracle9i Warehouse Builder Name and Address (it is provided as part of the Oracle9i Server pack). You can also purchase the license separately from the Oracle Store.


Note:

To use the Name and Address operator, relocate it from the toolbox onto the mapping canvas. Oracle9i Warehouse Builder Name and Address contains libraries and the PL/SQL code packages referenced by this operator. You can download a quarterly update of the library data from OracleMetalink. Refer to the Oracle9i Warehouse Builder Installation and Configuration Guide for more information.


About the Name and Address Operator

The Name and Address operator identifies and corrects errors or inconsistencies in name and address source data. These errors and inconsistencies include variations in address formats, use of abbreviations, misspellings, outdated information, inconsistent data, or transposed names.

The operator fixes these errors and inconsistencies by:

The Name and Address operator accomplishes these tasks by comparing input data to data in the Oracle9i Warehouse Builder Name and Address data libraries.

Parse Table Maintenance

Parse table maintenance is a process of enhancing the parsing definitions to enable the parser to recognize regional or data-specific patterns. The user creates a custom table of words, phrases (or tokens), and patterns, and the table maintenance process combines the custom table with a standard definitions table.

Table Maintenance creates for the parser both a Word/Phrase and Pattern table. The Word/Phrase Table contains context-specific information about words and phrases. For example, the word 'Paul' is defined in the Word/Phrase Table as a beginning attribute of first name with a male gender. The Pattern Table contains information about sequences of words/phrases (patterns).

Table Maintenance Process

The table maintenance system accepts three input files and produces three output files, as shown in Figure 19-1.

Figure 19-1 Table Maintenance System Input and Output Files

Text description of tblmaint.gif follows.

Text description of the illustration tblmaint.gif

Begin table maintenance after having identified a sufficient number of parsing problems.

To perform table maintenance, follow these general steps:

  1. Add pattern additions and modifications to the client table.

  2. Perform table compilation.

  3. Restart the Name and Address server.

  4. Check the new parsing results.

This is usually an iterative process.

Name and Address Operator Input Roles and Output Components

This section describes the input roles and output components for the Name and Address operator.

Input Roles

Table 19-1 lists input roles and descriptions for the Name and Address Operator.

Table 19-1 Name and Address Operator Input Roles  
Input Role Description

None

This setting causes generation to fail.

First Name

First name. This can be a nickname or shortened version of the first name.

Middle Name

Middle name or initial. Use when there is only one middle name, or for the first of several middle names (such as "Herbert" in George Herbert Walker Bush).

Middle Name 2

Second middle name; for example, "Walker" in George Herbert Walker Bush.

Middle Name 3

Third middle name; for example, "Louise" in Ethel May Roberta Louise Mertz.

Last Name

Last name, or surname.

First Part Name

First part of the Person name, including:

  • Pre name

  • First name

  • Middle name(s)

Use when these components are contained in one source column.

Last Part Name

Last part of Person Name, including:

  • Last name

  • Post Name

Use when these components are all contained in one source column.

Pre Name

Information that precedes and qualifies the name; for example, Ms., Mr., or Dr.

Post Name

Generation or other information qualifying the name; for example, Jr. or Ph.D.

Person

Full person name, including:

  • First Part Name (consisting of Pre Name, First Name, and Middle Names)

  • Last Part Name (consisting of Last Name and Post Name)

Use when these components are all contained in one source column.

Person 2

Designates a second person if the input includes multiple personal contacts.

Person 3

Designates a third person if the input includes multiple personal contacts.

Firm Name

Name of the company or organization.

Primary Address

Box, route, or street address, including:

  • Street name

  • House number

  • City map grid direction; for example, SW or N

  • Street type; for example, avenue, street, road, etc.

This does not include the Unit Designator or the Unit Number.

Secondary Address

The second part of the street address, including:

  • Unit Designator

  • Unit Number

For example, in a secondary address of Suite 2100, the Unit Designator is `STE' (a standardization of "Suite") and the Unit Number is `2100'.

Address

Full address line, including:

  • Primary Address

  • Secondary Address

Use when these components share one column.

Address 2

Generic address line.

Neighborhood

Neighborhood or barrio, common in South and Latin American addresses.

City

Name of city.

State

Name of state or province.

Postal Code

Postal code, such as a ZIP code in the United States or a Postal Code in Canada.

Country Name

Full country name.

Country Code

The ISO 3166-1993 (E) two- or three-character country code. For example, `US' or `USA' for United States; `CA' or `CAN' for Canada.

Last Line

Last address line, including:

  • City

  • State or province

  • Postal code

Use when these components are all contained in one source column.

Line1, Line2, Line3, Line4, Line5, Line6, Line7, Line8, Line9, Line10

Non-discrete roles intended for use as free form name, business, personal, and address text. These generic selections can store any type of name and address data. They do not provide the parser with any information about the data content. Whenever possible, use the discrete input roles provided instead.

Output Components

Table 19-2 lists the Name and Address operator output components and their descriptions.

Table 19-2 Name and Address Operator Output Components  
Output Component Description

None

This setting causes generation to fail.

First Name

The first name found in the input name.

First Name Standardized

Standard version of first name; for example, Theodore for Ted or James for Jim.

Middle Name

Middle name or initial. Use this for a single middle name, or for the first of several middle names (such as "Herbert" in George Herbert Walker Bush).

Middle Name 2

Second middle name; for example, "Walker" in George Herbert Walker Bush.

Middle Name 3

Third middle name; for example, "Louise" in Ethel May Roberta Louise Mertz.

Middle Name Standardized

Standardized version of the middle name; for example, Theodore for Ted or James for Jim. Used when there is only one middle name, or for the first of several middle names (such as "Herbert" in George Herbert Walker Bush).

Middle Name 2 Standardized

Standardized version of the second middle name; for example, Theodore for Ted or James for Jim.

Middle Name 3 Standardized

Standardized version of the third middle name; for example, Theodore for Ted or James for Jim.

Last Name

Last name, or surname.

Pre Name

Title or salutation appearing before a name; for example, Ms. or Dr.

Post Name

Name suffix indicating generation; for example, Sr., Jr., or III.

Other Post Name

Name suffix indicating certification, academic degree, or affiliation; for example, Ph.D., M.D., or R.N.

Relationship

Information related to another person; for example, "Trustee for."

Gender

Probable gender:

  • M= Male

  • F= Female

  • N= Neutral (either Male or Female)

  • Blank=unknown.

Person

First name, middle name, and last name.

Person Count

Number of persons the record references. For example, a record with a Person name of "John and Jane Doe" has a Person Count of 2.

Firm Name

Name of the company or organization, including divisions.

Firm Count

Number of firms referenced in the record.

Address

Full address line, including:

  • Primary Address

  • Secondary Address

Miscellaneous Address

Miscellaneous address information, such as a telephone number or an e-mail address.

Neighborhood

Neighborhood or barrio, common in South and Latin American addresses.

City

Name of city. The US city names may be converted to United States Postal Service preferred names.

State

Name of state or province. For countries such as the United Kingdom, this column may contain a county name.

Postal Code

Full postal code with spaces and other non-alphanumeric characters removed.

Postal Code Formatted

Formatted version of postal code that includes spaces and other non-alphanumeric characters, such as dashes.

Country Name

The full country name.

Country Code

The ISO 3166-1993 (E) two-character country code, as defined by the International Organization for Standardization; for example, `US' for United States or `CA' for Canada.

Country Code 3

The ISO 3166-1993 (E) three-character country code, as defined by the International Organization for Standardization; for example, `USA' for United States, `FRA' for France, or `UKR' for Ukraine.

Last Line

Final address line, including:

  • City

  • State or province (if state contains a county instead of a state or province, it is not included)

  • Formatted postal code if address was fully assigned

Primary Address

Box, route, or street address, including:

  • Street name

  • House number

  • City map grid direction; for example, SW or N

  • Street type; for example, avenue, street, road, etc.

This does not include the Unit Designator or the Unit Number.

Secondary Address

The second part of the street address, including:

  • Unit Designator

  • Unit Number

For example, in a secondary address of Suite 2100, the Unit Designator is `STE' (a standardization of "Suite") and the Unit Number is `2100'.

Street Name

Name of street.

Street Number

Number that identifies the address, such as a house or building number, sometimes referred to as the primary range. For example, in 200 Oracle Parkway, the Street Number is 200.

Street Type

Street identifier; for example, ST, AVE, RD, DR, or HWY.

Pre Directional

Street directional indicator appearing before the street name; for example, in 100 N University Drive, the Pre Directional is `N'.

Post Directional

Street directional indicator appearing after the street name; for example, in 100 15th Ave. S., the Post Directional is `S'.

Unit Designator

Type of secondary address, such as APT or STE. For example, in a secondary address of Suite 2100, the Unit Designator is `STE' (a standardization of "Suite").

Name Designator

Personal name designation; for example, "ATTN" (to the attention of) or "C/O" (care of).

Unit Number

Number that identifies the secondary address, such as the apartment or suite number. For example, in a secondary address of Suite 2100, the Unit Number is `2100'.

Box Name

The name for a post office box address; for example, for "PO Box 95," the Box Name is `PO BOX'.

Box Number

The number for a post office box address; for example, for "PO Box 95," the Box Number is `95'.

Route Name

Route name for a rural route address. For an address of "Route 5 Box 10," the Route Name is `RTE' (a standardization of "Route").

Route Number

Route number for a rural route address. For an address of "Route 5 Box 10," the Route Number is `5'.

Building Name

Building name, such as "Cannon Bridge House." Building names are common in the United Kingdom.

Parse Status

Trillium parse status code.

Parse Status Description

Text description of the Trillium parse status.

Urbanization Name

Urban unit name used in Puerto Rico. This applies to United States (Puerto Rico) addresses only.

ZIP 5

The five-digit United States postal code. This applies to United States addresses only.

ZIP 4

The four-digit suffix that is added to the five-digit United States postal code to further specify location. This applies to United States addresses only.

Delivery Point

Two-digit postal delivery point, which is combined with a full nine-digit postal code and check digit to form a delivery point bar code. This applies to United States addresses only.

DPBC Check Digit

Check digit for forming a delivery point bar code. This applies to United States addresses only.

CART

Four-character USPS Carrier route. This applies to United States addresses only.

LACS Flag

Indicates whether the address requires a LACS conversion.

The Locatable Address Conversion System (LACS) provides new addresses when a 911 emergency system has been implemented. 911 address conversions typically involve changing rural-style addresses to city-style addresses, but on occasion they may involve renaming or renumbering existing city-style addresses.

  • T= Address needs 9-1-1 conversion (from old rural route to new street address) and should be submitted to a LACS vendor

  • F= Address does not require conversion.

This applies to United States addresses only.

FIPS County

The three-digit county code as defined by the Federal Information Processing Standard (FIPS). This applies to United States addresses only.

FIPS Code

The complete (state plus county) code assigned to the county by the Federal Information Processing Standard (FIPS). Because FIPS county codes are unique within a state, a complete FIPS Code includes the two-digit state code followed by the three-digit county code. This applies to United States addresses only.

Census ID

United States Census tract and block-group number. The first six digits are the tract number; the final digit is the block-group number within the tract. These codes are used for matching to demographic-coding databases. This applies to United States addresses only.

Metropolitan Statistical Area

Metropolitan Statistical Area (MSA) number. For example, `0000' indicates that the address does not lie within any MSA, and typically indicates a rural area. This applies to United States addresses only.

Minor Census District

Minor Census District. This applies to United States addresses only.

Latitude

Latitude in degrees north of the equator: positive for north of the equator; negative for south (always positive for North America).

Longitude

Longitude in degrees east of the Greenwich Meridian: positive for east of GM; negative for west (always negative for North America).

Address Type

Type of address:

  • F= Firm

  • G= General Delivery

  • H= High-rise apartment or office building

  • HD= High-rise default (the Name and Address operator can detect a finer level of postal code assignment if further input information were available)

  • B= Box

  • R= Rural Code

  • S= Street

  • M= Military

  • P= Post Office Box

Is Found

Indicates whether the address is listed in the postal matching database for the country indicated by the address:

  • T= The address was found in a postal matching database.

  • F= The address was not found in a postal matching database. This can mean that the address is not a legal address, and also that postal matching is not available for the country.

This flag is true only if all of the flags listed below are true. If postal matching is available, this flag is the best indicator of record quality.

Is Parsed

Indicates whether the name or address was parsed:

  • T= The name or address was parsed successfully. A name or address is considered to be successfully parsed even if some parsing warnings exist.

  • F= The name or address may not be parsed.

Check the status of parsing warning flags (such as Name Warning, City Warning, etc.).

Is Good Group

Indicates whether the name group, address group, or name and address group was processed successfully:

  • For name groups, this means the name has been successfully parsed.

  • For address groups, this means the address has been found in a postal matching database if one is available, or has been successfully parsed if no postal database is installed.

  • For name and address groups, both the name and the address have been successfully processed.

Is Good Address

Indicates whether the address was processed successfully:

  • T= Successfully processed: either the address was found in the postal matching database or, if no postal matching database is installed for the country indicated by the address, the address was successfully parsed.

  • F= Not successfully processed: if a postal matching database is installed for the country indicated by the address, this indicates that address was not found in the database. If no postal matching database is available for the country, this indicates that the address may not be parsed.

This flag is easier to use if you have a mix of records from both postal matched and non postal matched countries.

Is Good Name

Indicates whether the name was parsed successfully:

  • T= The name was parsed successfully. Note that a name is considered to be successfully parsed even if some parsing errors exist.

  • F= The name may not be parsed.

Is Address Verifiable

Indicates whether postal matching is available for the country of the address, but does not indicate the outcome of the matching operation:

  • T= Postal matching is available for the country address.

  • F= Matching is not available for the country of the address. This indicates that a postal matching database is installed for the country indicated by the address but matching is not available for the address. No postal matching database is installed for the country indicated by the address.

Name Warning

Indicates whether the parser had difficulty parsing a name:

  • T= The parser had difficulty parsing a name.

  • F= No difficulty parsing name.

Street Warning

Indicates whether the parser had difficulty parsing a street address:

  • T= The parser had difficulty parsing a street address.

  • F= No difficulty parsing street address.

City Warning

Indicates whether the parser had difficulty parsing a city:

  • T= The parser had difficulty parsing a city.

  • F= No difficulty parsing city.

City Found

Indicates whether the postal matcher found the city:

  • T= The postal matcher found the city.

  • F= The postal matcher did not find the city.

Street Name Found

Indicates whether the postal matcher found the street name:

  • T= The postal matcher found the street name.

  • F= The postal matcher did not find the street name.

Street Number Found

Indicates whether the postal matcher found the street number within a valid range of numbers for the named street:

  • T= The postal matcher found the street number.

  • F= The postal matcher did not find the street number within a valid range of numbers for the named street.

Street Components Found

Indicates whether the postal matcher found the street components, such as the Pre Directional or Post Directional:

  • T= The postal matcher found the street components.

  • F= The postal matcher did not find the street components.

Non-ambiguous Match Found

Indicates whether the postal matcher found a matching address in the postal database:

  • T= The postal matcher found a match between the input record and a single entry in the postal database.

  • F= The address is ambiguous. The postal matcher found that the address matched several postal database entries and could not make a selection. For example, if the input address is "100 4th Avenue," but the postal database contains "100 4th Ave N" and "100 4th Ave S," the input's missing directional causes the match to fail.

Address Corrected

Indicates whether the address was corrected in any way during matching. Standardization is not considered correction in this case.

  • T= Some component of the address was changed, aside from standardization. One of the other Corrected flags, indicating the component that was changed, must also be True.

  • F= No components of the address were changed, with the possible exception of standardization.

Postal Code Corrected

Indicates whether the postal code was corrected during matching. Correction can include the addition of a postal extension:

  • T= The postal code was corrected.

  • F= The postal code was not corrected.

City Corrected

Indicates whether the city name was corrected during matching. Postal code input is used to determine the city name preferred by the postal service.

  • T= The city name was corrected during matching.

  • F= The city name was not corrected.

Street Corrected

Indicates whether the street name was corrected during matching. Some correct street names may be changed to an alternate name preferred by the postal service.

  • T= The street name was corrected during matching.

  • F= The street name was not corrected.

Street Components Corrected

Indicates whether any of the street components, such as the Pre Directional or Post Directional, were corrected during matching:

  • T= Street components were corrected.

  • F= Street components were not corrected.

Parsing Country

Country parser that was used for the final parse of the record. Often, the Primary Country parser is used. However, if Warehouse Builder can determine the country of the input record, it will use that country parser if the parser is available. For many countries, a substitute parser is used; for example, the USA parser is used for the US Virgin Islands.

Installation Type

Type of Canadian postal installation:

  • STN= Station

  • RPO= Retail Postal Outlet

For example, for the address, "PO Box 7010, Scarborough ON M1S 3C6," the Installation Type is `STN'.

Installation Name

Name of Canadian postal installation. For example, for the address, "PO Box 7010, Scarborough ON M1S 3C6," the Installation Name is `AGINCOURT'.

Locality Name

Name of county or other localization, common in the United Kingdom.

Locality Code

Code for the county or other localization, common in the United Kingdom.

Delivery Office Code

This applies to Hong Kong addresses only.

Delivery Beat Code

This applies to Hong Kong addresses only.

Name and Address Operator Supported Countries

Table 19-3 lists supported countries for the Name and Address Operator.

Table 19-3 Countries Supported by the Name and Address Operator  
Name Postal Code Matching Available

Argentina

No

Australia

Yes

Belgium

No

Brazil

Yes

Canada

Yes

Chile

No

Colombia

No

Denmark

No

France

Yes

Germany

Yes

Hong Kong

Yes

India

No

Ireland

No

Italy

No

Mexico

Yes

Malaysia

No

Netherlands

Yes

New Zealand

No

Peru

No

Philippines

No

Portugal

Yes

Singapore

No

South Africa

No

Spain

Yes

Sweden

No

Switzerland

No

United Arab Emirates

No

United Kingdom

Great Britain Only

United States

Yes

Venezuela

No

Table Maintenance Compiler

The table maintenance compiler is a command line utility, TABMAINT, located in $ORACLE_HOME/nas/bin/platform, where platform is win32, solaris, hpux, or aix. Its usage is:

TABMAINT -parmfile parameterFileName

TABMAINT echoes all input from the standard definitions and user definitions files to standard output along with any errors. Redirect this output to a file to reduce compile time from several minutes to a few seconds. The typical convention is to name the output file TABERR (that is, TABMAINT -parmfile parameterFileName > TABERR). At the completion of the compilation, view the end of TABERR for any compiler errors resulting from syntax errors in the user definition file.

The parameter file specifies input and output file names. Each country requires a separate parameter file (for example, USWDPAT for USA) because the file path and country code prefix of STDDEF differ. The following is an example for a USA parameter file, named pftabmntUS.par:

STDDEF c:\owbServer\owb\NAS_DATA\NorthAm\US\USWDPAT
USERDEF c:\owbServer\owb\NAS_DATA\NorthAm\US\CLWDPAT
TABLEDEF c:\owbServer\owb\NAS_DATA\NorthAm\US\CLTABDEF
TABLEPAT c:\owbServer\owb\NAS_DATA\NorthAm\US\CLTABPAT

Your data installation will differ from this example, depending on the Oracle Home you selected when installing the Oracle9i Warehouse Builder Server Side components.

The following is a usage example of the parameter file shown above:

cd c:\owbServer\owb\bin\win32\
TABMAINT -parmfile pftabmntUS.par > TABERR

Table 19-4 defines the input and output files specified in the parameter file.

Table 19-4 Parameter File Definitions for TABMAINT  
Parameter Name Description Compiler Reserved Word

STDDEF

The input file maintained by Oracle9i Warehouse Builder Name and Address. This file includes all standard definitions for titles, locations, mailing routes and postal codes, as well as patterns and masks for other name and address components. Its usual name is xxWDPAT where xx is the two-character country code.

Do not modify this table. Oracle updates it periodically. Modify USERDEF only.

Yes

USERDEF

The optional client input table. This file is combined with the STDDEF file to create Word/Phrase and Pattern Tables. Its usual name is CLWDPAT. This file initially contains only a comment line.

USERDEF is the only table in the Table Maintenance process that you can customize.

Yes

TABERR

An optional file composed of both input tables, with appended line numbers and warnings of input errors. Use this file to identify errors in the input. TABERR is not a reserved word recognized by the compiler, but a convention to identify this file. You may give it any name and use redirection of standard output to produce it.

No

TABLEDEF

The encoded Word/Phrase Table output by Table Maintenance and used by the parser. Because this file is encoded, you cannot read this file. Its name is CLTABDEF.

Yes

TABLEPAT

The encoded Pattern Table output by Table Maintenance and used by the parser. Like the Word/Phrase Table, you annot read this file. Its name is CLTABPAT.

Yes

Identifying Parsing Problems

Although you can readily see parsing problems in the output of the Name and Address components, you must investigate various log files to find problems in the input. The following two parameters of nameAddr.properties control tracing/log output.

Table Maintenance

To perform table maintenance, you must enable detailed parse logging. If detailed parse logging is off, stop the Name and Address server, edit nameAddr.properties to enable detailed logging (EnableDetailLogging parameter), and restart the server.

To perform table maintenance:

  1. Execute your Name and Address parsing process and examine the parse detail log to identify parsing problems.

  2. Add new parsing pattern lines to the CLWDPAT file under the appropriate country folder.

    If possible, edit similar patterns copied from the standard definition table (having a file name ending with WDPAT). You can also copy patterns from word pattern tables of other countries.

  3. If you have not done so, create the parameter file.

  4. Execute the table maintenance utility:

    TABMAINT -parmfile paramFileName > TABERR
    
    
  5. If the Name and Address server is running, stop it using NAStop and restart it using NAStart.

  6. Execute the Name and Address parsing process and examine the parse detail log to determine whether the parser corrected the errors.

    If your results are incorrect, return to step one.

Common Table Entries

The most common parsing problems involve incorrect identification of personal names and businesses. First name definitions are very important in identifying personal names. Likewise, business patterns are important in identifying businesses. Without these specific definitions, the parser must rely only on combinations of ALPHA and 1ALPHA patterns to make a decision between a personal name and a business. For example, in the Brazil client pattern table, use the following patterns to identify a new first name:

'DEJOARA'               MODIFY NAME DEF ATT=FIRST,GENDER=F
'ATHOS'                 MODIFY NAME DEF ATT=FIRST,GENDER=M

Title attributes also help identify personal names. The parsed title appears in the PRENAME output component. The recode specifies how an input may be recoded for output. For example, you can add the following to a table:

'PADRE'                 MODIFY NAME BEG ATT=TITLE, GENDER=M
'PROFESSOR'             MODIFY NAME BEG ATT=TITLE, RECODE=PROF

Last name definitions are helpful in identifying last names having multiple words. In particular, their identification prevents the parser from interpreting last name parts as first names or middle initials. The common Brazilian name "Costa E Silva" would be parsed as a first name, middle initial, and last name without the following last name definitions.

'COSTA E SILVA'         INSERT NAME END ATT=LAST
'C E SILVA'             INSERT NAME END ATT=LAST

Business definitions may contain entire business names or just keywords specifying a business. Without business definitions, the parser may interpret any business name having a personal name pattern as a personal name. For example, "John Smith Company" would be parsed as a first, middle, and last name if company were not defined as a business pattern.

'COMPANY'               INS NAME BEG ATT=BUSINESS
                        DEF ATT=B-DESCRIPTIVE,CAT='HN21'

Generic definitions may be problematic, however. If you instruct the parser to interpret "Diesel" as a business keyword, it also interprets last names of "Diesel" as business names. If such problems occur, you can delete a business keyword:

'DIESEL'                DELETE

Syntax for USERDEF Entries

This section describes the syntax for USERDEF entries. Note the following about the syntax documentation for USERDEF entries:

INSERT or MODIFY

Abbreviations

Either enter the commands literally or use the following abbreviations:

Format

When the operation is INSERT or MODIFY, the input record format is:

SYMBOL OPERATION [MODIFIER] [LINE-TYPE] [POSITION] KEYWORD=VALUE 

Symbols

Example: INSERT

'MARY'       INSERT NAME BEG ATT=FIRST,GENDER=F

Example: MODIFY

'MARY'       MODIFY NAME BEG ATT=FIRST,GENDER=F

The operation MODIFY constitutes a DELETE and an INSERT. the original entry is deleted and the modified entry is inserted.

Patterns

Example: Name Pattern:

'FIRST 1ALPHA ALPHA'
       INSERT PATTERN NAME DEF
       RECODE='FIRST MIDDLE LAST'
       EXPORT='FIRST(1) MIDDLE(1) LAST(1)'

Example: Street Pattern:

'1ALPHA 1ALPHA 1ALPHA TYPE'
    INS PATTERN STREET DEF
    RECODE='STREET-NAME STREET-NAME STREET-NAME TYPE'

DELETE

Abbreviations

Either enter the command literally or abbreviate DELETE with DEL.

Deleting Symbols

When the operation is DELETE, the record format is SYMBOL DELETE

Example:

'MARY'      DELETE 

Deleting Synonyms

When the keyword is SYNONYM, you must enter the actual synonym:

SYMBOL DELETE SYNONYM=VALUE

Example:

'BV'       DELETE SYNONYM=BOULEVARD

Deleting Patterns

When the operation is DELETE PATTERN, you must enter the actual pattern followed by DELETE PATTERN.

Example:

'FIRST 1ALPHA ALPHA'
DELETE PATTERN

Conventions for USERDEF Entries

Case

You can add new entries to the USERDEF file in either upper or lower case letters with one exception: the symbol value for a mask type entry (that is, when the modifier is MASK) must be entered in lower case. The parser interprets all other characters for their literal case status.

Comments

Line Lengths

Use the following conventions for table entries that span multiple lines:

Multiple Definitions

Occasionally, an entry may contain multiple meanings. The first definition is entered in the standard way. Indent subsequent definitions under the initial operational value.

Example:

'CENTER'      INSERT NAME DEF ATT=BUSINESS
                      STREET END ATT=SEC-TYPE,RECODE=CTR
                      GEOG DEF RECODE=CENTER

Export Line for Name Patterns

Name patterns require export lines. Append a number enclosed in parentheses for each name element.

Example 1: RICHARD G SMITH AND MARY ROGERS

In this example, two individuals have separate last names.

'FIRST 1ALPHA ALPHA CONNECTOR FIRST ALPHA'
   INSERT PATTERN NAME DEF
   RECODE='FIRST MIDDLE LAST CONNECTOR FIRST LAST'
   EXPORT='FIRST(1) MIDDLE(1) LAST(1) CONNECTOR(2) FIRST(2) LAST(2)'

Example 2: RICHARD G AND MARY SMITH

In this example, two individuals share the same last name. 'LAST(12)' indicates that 'LAST' is shared by NAME1 and NAME2. Note that in numbering connectors, the connector number typically agrees with the name following the connector.

'FIRST 1ALPHA CONNECTOR FIRST ALPHA'
   INSERT PATTERN NAME DEF
   RECODE='FIRST MIDDLE CONNECTOR FIRST LAST'
   EXPORT='FIRST(1) MIDDLE(1) CONNECTOR(2) FIRST(2) LAST(12)' 

Quotation Marks

Table Maintenance interprets entries enclosed by single quotation marks as one entity. If you wish to include quotation marks within a SYMBOL or VALUE, use double quotation marks ("). The parser converts double quotation marks specified within a SYMBOL or within a VALUE in an input record to single quotation marks ('). If a recode string contains more than one word, type the entire string with single quotes.

Examples:

'O"BRIEN'           INSERT NAME END ATT=LAST

'AS TRUSTEE FOR'    INSERT SYNONYM='TRUSTEE FOR'

'MEBAR HARBER_'     INSERT GEOG DEF ATT=CITY-CHANGE,
RECODE='MEBAR HARBOR'

Masks

While a word or phrase is a data element, a mask is a description of the word or phrase. Masks use alpha, numeric, or special characters to represent letters, numbers, and, one-to-one, the special characters. A mask defines the characters of the data element using:

For example, you can use a mask to define any series of five numerals as a ZIP code instead of the user entering each of the 99,999 possible combinations in the USERDEF. The following is an example mask entry:

'nnnnn' INSERT MASK GEOG DEF ATT=ZIPCODE 

Masks may include special characters if they are part of the word representation. For example, a mask for a nine-digit ZIP code is entered:

'nnnnn-nnnn' INSERT MASK GEOG DEF ATT=ZIPCODE 

Mask Recodes

You can use masks to introduce and/or exclude literals and special characters in your recodes. For example, a mask for a telephone number is entered:

'nnn nnn-nnnn' INSERT MASK MISC DEF ATT=IGNORE,RECODE='(nnn)nnn-nnnn' 

It would recode the entry '508 663-9955' to '(508) 663-9955'. A mask for an account number could be recoded and customized in this way:

'aannnnnn' INSERT MASK MISC DEF ATT=IGNORE,RECODE='ACCTNOaannnnnn' 

It would recode the entry 'CL123456' to 'ACCTNOCL123456'.

Patterns

A pattern consists of attributes and intrinsic attributes that include any alpha, numeric, or special character representation of a data element. Table 19-5 contains the 14 intrinsic attributes that can be included in a pattern.

Table 19-5 Intrinsic Attributes  
Attribute Definition

1ALPHA

A single letter

1NUMERIC

A single numeral

ALPHA

Letters only

ALPHA - 1NUMERIC

Letters followed by a single numeral

ALPHA-1SPECIAL

Letters followed by a non-letter, non-number, or non-hyphen

ALPHA-NUMERIC

Letters followed by numerals

ALPHA-SPECIAL

Letters followed by special characters

HYPHEN

Hyphen

NUMERIC

Numerals only

NUMERIC-1ALPHA

Numerals followed by one letter

NUMERIC-1SPECIAL

Numerals followed by one special character

NUMERIC-ALPHA

Numerals followed by letters

NUMERIC-SPECIAL

Numerals followed by special characters

OTHER-SPECIAL

Any characters other than letters and numerals, such as hex or packed decimal

Most patterns are defined using recodes. Recoding instructions include information about changing the attributes of the words that fall into the given pattern. The following two examples are valid entries for USERDEF.

'FIRST 1ALPHA ALPHA' 
           INSERT PATTERN NAME DEF 
           RECODE='FIRST MIDDLE LAST'
           EXPORT='FIRST(1) MIDDLE(1) LAST(1)'

'HSNO ALPHA TYPE' 
           INSERT PATTERN STREET DEF
           RECODE='HSNO STREET-NAME TYPE'

Line Types

Classify lines into one of four types to facilitate the parser in identifying each word in each line. An entry can contain more than one line type identification. You can assign the following line types:

Name

Name of either a person or business. Names are usually the first one or two lines in an address record.

'BOOKSTORE' INSERT NAME DEF ATT=BUSINESS,CAT=S5942 

Street

All descriptions of streets and numeric addressing, including box numbers, rural routes, and apartment numbers. A street line is usually in the middle of a record, and may be one or more lines.

'LANE' INSERT STREET END ATT=TYPE,RECODE=LN 

Geography

The city, state, ZIP code or postal code, and country in the address. Geography lines are usually located at the end of an address record.

'MASSACHUSETTS' INSERT GEOG END ATT=STATE,RECODE=MA 

Miscellaneous

Information that does not fit into the other three line types, such as account name or a comment.

'HOLD MAIL' INSERT MISC DEF ATT=HOLD 

Positions

You can define a symbol in relation to its position within the address line. The three valid positions are:

Beginning

The first word in a line, any word that follows a title, or any words that appear before a first name (including the first name). For example, in the line

MR JOSEPH SMITH 

every word except 'Smith' is at the beginning of the line.

Default

When the physical location of the word in the line is irrelevant, use the Position Default in the definition. A default word may appear anywhere on the line, including the beginning and ending.

Ending

The last word and any further non-alphabetic characters are the ending of a line. For example, in the line

BRIARWOOD ESTATES APT 3 

the word Apt and the Apt# are the end of the line.

Logical vs. Physical Beginning and Ending Positions

An address line is not limited to one beginning and ending position. If a line beginning is identified by the Table Maintenance process, and a word with the attributes CONNECTOR (such as AND, &, OR), RELATIONSHIP (such as FBO, In Trust For), DBA (such as Doing Business As), or a comma is identified, then the line has two beginnings and two endings. Examples of beginning and ending line positions are displayed below:

JOHN  J  SMITH   AND   MARY   SMITH
BEG       END    CONN   BEG    END

JOHN J  SMITH   IN TRUST FOR  JOHN J  SMITH  JR
BEG      END    RELATIONSHIP   BEG           END

SMITH,   MARY  A
BEG      BEG   END

The physical beginning of the above lines is the first word on the line. There can be more than one logical beginning of a line. Multiple beginning attributes in a row continue the logical beginning of a line.

JOHN     JAMES    THOMAS  (all are beginning attributes of FIRST NAME)
BEG       BEG       BEG

Keywords

Table 19-6, "Keywords and Descriptions" shows the seven valid keywords and their meanings. Underlined letters show the appropriate abbreviation.

Table 19-6 Keywords and Descriptions  
Keyword Abbreviation Description

Attribute

Att

The meanings assigned by the parser to identify the token (word/symbol) entries.

Example: 'ROAD' INSERT STREET END ATT=TYPE

Category

Cat

A user-defined, free-form way to categorize name and address elements. Categories should be limited to six characters (based on assigning multiple categories throughout a record), with a maximum of 50 bytes per record for all categories. A category can be any grouping that is useful during parsing to identify name/address components, such as assigning the industry-wide SIC codes to business/company names.

A record using multiple categories requires the following syntax: 'BOY SCOUTS' INSERT NAME DEF ATT=BUSINESS,CAT='S8641|HN25'

Export

Exp

Enables you to define the pattern of a name written to a string for exporting. The export line indicates which person each attribute belongs to in a name pattern.

A typical export line for a name pattern is: 'FIRST 1ALPHA CONNECTOR 1ALPHA ALPHA' INSERT PATTERN NAME DEF RECODE='FIRST MIDDLE CONNECTOR FIRST LAST'EXPORT='FIRST(1) MIDDLE(1) CONNECTOR(2) FIRST(2) LAST(12)'

Gender

Gen

The code Gender allows you to assign a gender to a name component. M=male, F=female, and N=neuter (gender unknown). First names defined in the table require a gender code.

Example: 'MARY' INSERT NAME BEG ATT=FIRST,GENDER=F

Recode

Rec

For tokens (the word symbol), the value assigned to a Recode is the value the parser uses as output for this word. For patterns, the Recode value is the attribute value used to overwrite the attribute value in the original pattern string during parser output.

Example: 'ROAD' INSERT STREET END ATT=TYPE,RECODE=RD

Special

Spec

Specifies that external logic is needed to identify this name and/or address component, and names the file (subroutine) containing that logic.

Example: 'CASH-N-CARRY' 'ALPHA-SPECIAL'

Synonym

Syn

A shortcut for defining an entry with the same value as a prior entry. For example, 'PBOX' INSERT SYNONYM='PO BOX' identifies PBOX and PO BOX as synonyms for the parser. Any data element can be defined with synonyms.

Attributes

The following table shows the valid attributes, their abbreviations, and corresponsing definitions.

Table 19-7 Attributes, Abbreviations, and Descriptions  
Attribute Abbreviation Description

1ALPHA

Not applicable.

A single letter

1NUMERIC

Not applicable

A single numeral

ALPHA

Not applicable

Letters only

ALPHA-1NUMERIC

Not applicable

Letters followed by a single numeric

ALPHA-1SPECIAL

Not applicable

Letters followed by a non-letter, non-number, or non-hyphen

ALPHA-NUMERIC

Not applicable

Letters followed by numerals

ALPHA-SPECIAL

Not applicable

Letters followed by special characters

HYPHEN

Not applicable

A hyphen

NUMERIC

Not applicable

Numerals only

NUMERIC-1ALPHA

Not applicable

Numerals followed by one letter

NUMERIC-1SPECIAL

Not applicable

Numerals followed by one special character

NUMERIC-ALPHA

Not applicable

Numerals followed by letters

NUMERIC-SPECIAL

Not applicable

Numerals followed by special characters

OTHER-SPECIAL

Not applicable

Any characters other than letters and numerals, such as hex or packed decimal

APARTMENT

APT

Apt., condo, etc.

APARTMENT#

APT#

Usually the value after apartment word

APT-COMPLEX

A-CPLX

Apts., estates, etc.

ATTENTION

ATTN

Secondary name for business

B-DESCRIPTIVE

B-DES

Non-name/address word in business name

B-REDEFINE

B-RDEF

DBA, Doing Business As, etc., related to business name

BOX

Not applicable

Mailbox, etc.

BOX#

Not applicable

Usually the value after Box word

BUS-EDIT

BEDIT

Edits a word/phrase from the business name

Example:

BUS-EDIT will delete "INC" from the business name. The "*:" is a required prefix to the word to be edited.

`*:INC' INSERT NAME DEF ATT=BUS-EDIT

Optionally, you can add a category to qualify the case. For example, adding CAT='HP017021' to this attribute defines "017021" as the low/high first three digits of the zip code in the record containing "INC". A business name containing "INC" is deleted from the name when the first three digits of the corresponding zip code are between 017 and 021.

NOTE: BUS-EDIT requires that the parameter BUSINESS_NAME_EDIT must be set to "Y" in the Parser Parameter file.

BUSINESS

BUS

Name of a business.

BUSINESS?

BUS?

Possible name for a business.

C-TITLE

C-TTL

Compound title, such as Dr./Mrs.

CARE-OF

C-OF

C/O, In Care Of, etc.

CITY

CIT

CitY

CITY-CHANGE

CIT-CHG

Corrected spelling of city entry

COMPLEX-NAME

CPLX-NM

Apt or Business Complex name

CONCATENEE

CON

Part of geography/name/street phrase, such as "Von" or "O" in a name, `of the' in Avenue of the Americas in a street, or "Los" or "St." in geography line/city name.

CONNECTOR

CTR

And, or, with, etc. between names

COUNTRY

CNTRY

Country namE

COUNTY

CTY

Name of county

DEPT

DPT

Dept., Section, Group, etc.

DESCRIPTIVE

DES

Non-name/address word(s), such as Income Acct or General Fund

DIRECTION

DIR

Street direction such as North, Northwest, or NW

DRAWER

DRW

Drawer or abbreviations of drawer

DRAWER#

DRW#

Usually the value after Drawer word

FIRST

FST

First Name

FLOOR

FLR

Floor, basement, penthouse, etc.

FLOOR#

FLR#

Usually the value after Floor word

GENERATION

GENR

Jr., Sr., III

HIGHWAY

HWY

Highway, Expressway, etc.

HIGHWAY#

HWY

Usually the value after Highway word

HSN

Not applicable

House number

IGNORE

IGN

Does not affect name/address, such as "et al"

LAST

LST

Surname

L-TYPE

L-TYP

Leading street type, such as Calle A or Avenue A

LOT

Not applicable

Lot, perhaps for a house not yet built

LOT#

Not applicable

Usually the value after Lot word

MIDDLE

MID

Middle name

MILITARY-COMPLEX

M-CPLX

Base, ship, etc.

NEIGHBORHOOD

NEI

Geographic neighborhood name, usually not USA

NUM-QUAL

N-Q

NUMBER QUALIFIER - Handles "#", "no." etc. when they have no meaning. This is useful when a "#" or other insignificant character is present between apt and apt# - ex. apt # 10

NUMBER

NMB

Ten, fifteen; a house or street number expressed as a word

PBOX

PBOX

Post office b

PBOX#

PBOX#

Usually the value after PBOX word

P-TITLE

P-TTL

Plural title, such as Drs.

POLE

POL

Telephone pole used to identify address

POLE#

POL#

Usually the value after Pole word

POSTAL- CODE

P-CD

Canadian or European format

PROVINCE

PRV

Canadian province

REDEFINE

RDEF

Also Known As (AKA) for a name, Doing Business As (dba) for a business name, etc

RELATIONSHIP

REL

In Trust For, FBO, etc.

ROUTE

RTE

Rural route, etc

ROUTE#

RTE#

Usually the value after Route word

S-CONNECTOR

S-CTR

Street connector

S-DIRECTION

S-DIR

Single direction, such as North or South (vs. Direction which may be compound North West, etc.)

SEC-TYPE

S-TYP

Secondary street type, for streets with more than one, such as Hilltop Circle Lane

SITE

Not applicable

Site, block, etc. (similar to Lot)

SITE#

SIT#

Usually the value after Site word

STAT

ST

Either state or two-character abbreviation for US state

STREET-NAME

STR-NM

Name of the street

STREET-NAME-LONG

STR-NM-L

Allows street names to occupy the display and non-display parser fields when the street name length is greater than 25 characters - designed for use in countries where long street names are common

SUITE

STE

Suite, Rooms, etc.

SUITE#

STE#

Usually the value after the Suite word

TITLE

TTL

Mr., Mrs., Capt., MD, etc.

TYPE

TYP

Street type, such as St., Ave., or Boulevard

UNIT

UNT

Unit, building, etc

UNIT#

UNT#

Usually the value after Unit word

ZIPCODE

ZIP

US zip code, five or nine numerals

Special Routines

In some instances, the parser requires more informatin to correcly identify a pattern. To solve this problem, special routines for specific patterns can be coded during table maintenance. These routines are invoked when data fits the specific data requirement. If the special routine is unsuccessful, the normal action of recode and export occurs.

For example, consider the following:

Name line = 'JOHN NICOLI CHRIS NICOLI' 

Since 'CHRIS' is coded in the standard definition table (STDDEF) with a beginning attribute of FIRST and it is not in the beginning of the lin, the resulting pattern is 'FIRST ALPHA ALPHA ALPHA'. It should be noted that "beginning" refers to the physical beginning, after components that are valid for the beginning, or after a connector), .

Under most circumstances, data meeting this shape is recoded as 'FIRST MIDDLE MIDDLE LAST'. Because the word 'NICOLI' is repeated, you can derive that the connector has been omitted. By applying special routine HDTNP01 to the pattern 'FIRST ALPHA ALPHA ALPHA'; the routine looks for a repeating last name and if found, modifies the pattern as if it came in as 'FIRST LAST ALPHA LAST'. It then looks up that pattern for a valid recode. The pattern 'FIRST LAST ALPHA LAST' should be in the table with the appropriate recode. If the pattern is not in the table with an appropriate recode, a normal pattern failure takes place. If the special routine fails its specification, the action contained in the original entry is invoked.

This section lists the special routines you can enter into Table Maintenance to effect name and street pattern results. Note that square brackets, [ ], indicate optional data.

Name Pattern Routines

HDTNP01

Routine for name patterns in the form of:

'[TITLE] WORD WORD [TITLE] WORD WORD' 

This routine skips over the titles and checks if the second word and fourth word are equal. If they are, it changes them to LST_ATT (last attribute) and returns a positive pattern hit. If the second and fourth words are not equal, the routine makes no changes.

Example: 'JOHN THOMAS CHRIS THOMAS'

HDTNP02

This routine for name patterns is written as follows:

'[TITLE] WORD WORD CONNECTOR [TITLE] WORD WORD' 

This routine skips over the titles and checks if the second word and last word are equal. If they are, it changes them to LST_ATT (last attribute) and returns a positive pattern hit. If the second and last words are not equal, the routine acts as if it has not been invoked.

Example: 'JOHN THOMAS AND JOYCE THOMAS'

HDTNP03

Routine for name patterns in the form of:

'[TITLE] WORD WORD CONNECTOR [TITLE] WORD WORD WORD'

This routine skips over the titles and checks to determine if the second word and thelast word are equal. If they are equal, it changes them to LST_ATT (last attribute) and returns a positive pattern hit. If not, the routine bahves as if it was not invoked.

Example: 'JOHN THOMAS AND JOYCE ANN THOMAS'

HDTNP04

This is the routine for name patterns in any form. The first pair of FIRST or ALPHA attributes with the same original value is turned to LST_ATT (last attributes) and all others that match it. Although this routine could replace routines HDTNP01, HDTNP02, and HDTNP03, it should be used sparingly.

Example: 'JOHN ANTHONY THOMAS BILL THOMAS'

HDTNP05

This routine for name patterns checks if the token carrying an ALPHA with special characters should be a business word. It follows the pattern below:

[*]a-a-a[*]

where the words must be in the form of any number of alpha characters.

Example: 'CASH-N-CARRY'

HDTNP06

This name routine checks if the second ALPHA token has a beginning attribute of FIRST. It handles the pattern 'ALPHA ALPHA' recoding to 'LAST FIRST'.

Example: 'NICOLI JOHN'

HDTNP07

This name routine converts 0's and 1's for tokens 'O' and 'I' when no other meaning has already been assigned. It handles the pattern 'ALPHA 1ALPHA ALPHA-NUMERIC' when there is no parsed type on intrinsic ALPHA token types.

Example: 'JOHN C N1C0L1'

Street Pattern Routines

HDTSP01

This street routine determines if the ambiguous route value is RD and changes its type to a street type. If the next token (Route#) could be a street direction, the routine changes it as well.

Example 1: '10 MAIN RD 2'

Example 2: '10 MAIN RD E'

HDTSP02

This street routine determines if the right most direction has the value of 'NO' followed by a number of some kind. If so, it turns the 'NO' into an ignore. Also, if the pattern is 'HSNO ROUTE ROUTE#', and ROUTE is a rural route, this routine turns HSNO into BOX#.

Example 1: '10 MAIN ST NO 3'

Example 2: '35 RR 1'

HDTSP03

This street routine handles routes without a number. 'RFD BOX 25' generates a token with a value of '1'.

Example: 'RFD BOX 25'

HDTSP04

This street routine inspects the last APT-COMPLEX token for a value of 'COURT' and, if the next line does not have the line type of 'STREET', the routine changes the attribute of APT-COMPLEX to a type.

If the word entry in the word pattern table has the attribute APT-COMPLEX and the category code is 'HH32', it is handled the same as a value of 'COURT'.

In the first example, there is only one line containing street information. The special routine is invoked and 'COURT' is recoded to a street type. In the second example, there is an apartment line, plus an additional street line. The routine will not be invoked, leaving 'COURT' as an attribute of APT-COMPLEX.

Example 1

In this example, there is only one line containing street information. The special routine is invoked and 'COURT' is recoded to a street type.

10 RANDALL COURT'  = (S)TREET LINE
'LONDON, ENGLAND'  = (G)EOGRAPHY LINE


Example 2

In this example, there is an apartment line, plus an additional street line. The routine will not be invoked, leaving 'COURT' as an attribute of APT-COMPLEX.

10  RANDALL COURT        = (A)PARTMENT LINE
'100 KNIGHTSBRIDGE ROAD  = (S)TREET LINE
'LONDON, ENGLAND'        = (G)EOGRAPHY LINE


HDTSP06

This street routine looks for coordinate address construct and if the directional letter information is attached to the numeric information, it splits the numeric information from the directional letter information.

Example:

100N 200e


Go to previous page Go to next page
Oracle
Copyright © 1996, 2003 Oracle Corporation.

All Rights Reserved.
Go To Table Of Contents
Contents
Go To Index
Index