19
Data Quality: Name and Address Cleansing

Parse table maintenance enables the parser in the Name and Address operator to recognize regional or data-specific patterns. This chapter contains information applicable to the Name and Address functionality provided by Warehouse Builder. If you have also purchased Oracle9i Warehouse Builder Name and Address, this functionality is available in the form of the Name and Address operator. For more information about the Name and Address operator, refer to Chapter 7, "Using Mapping Operators".

This chapter contains the following topics:

Overview

To use the Name and Address operator, you must install Oracle9i Warehouse Builder Name and Address (it is provided as part of the Oracle9i Server pack). You can also purchase the license separately from the Oracle Store.

Note:

To use the Name and Address operator, relocate it from the toolbox onto the mapping canvas. Oracle9i Warehouse Builder Name and Address contains libraries and the PL/SQL code packages referenced by this operator. You can download a quarterly update of the library data from OracleMetalink. Refer to the Oracle9i Warehouse Builder Installation and Configuration Guide for more information.

About the Name and Address Operator

The Name and Address operator identifies and corrects errors or inconsistencies in name and address source data. These errors and inconsistencies include variations in address formats, use of abbreviations, misspellings, outdated information, inconsistent data, or transposed names.

The operator fixes these errors and inconsistencies by:

Separating name and address input data into individual elements.
Standardizing name and address data, such as full names instead of nicknames or abbreviations, as approved by the United States Postal Service.
Correcting address information such as street names and city names.
Augmenting names and addresses with additional data such as gender, ZIP+4, country code, apartment identification, or business and consumer identification.

The Name and Address operator accomplishes these tasks by comparing input data to data in the Oracle9i Warehouse Builder Name and Address data libraries.

Parse Table Maintenance

Parse table maintenance is a process of enhancing the parsing definitions to enable the parser to recognize regional or data-specific patterns. The user creates a custom table of words, phrases (or tokens), and patterns, and the table maintenance process combines the custom table with a standard definitions table.

Table Maintenance creates for the parser both a Word/Phrase and Pattern table. The Word/Phrase Table contains context-specific information about words and phrases. For example, the word 'Paul' is defined in the Word/Phrase Table as a beginning attribute of first name with a male gender. The Pattern Table contains information about sequences of words/phrases (patterns).

Table Maintenance Process

The table maintenance system accepts three input files and produces three output files, as shown in Figure 19-1.

Figure 19-1 Table Maintenance System Input and Output Files

Text description of tblmaint.gif follows.

Text description of the illustration tblmaint.gif

Begin table maintenance after having identified a sufficient number of parsing problems.

To perform table maintenance, follow these general steps:

Add pattern additions and modifications to the client table.
Perform table compilation.
Restart the Name and Address server.
Check the new parsing results.

This is usually an iterative process.

Name and Address Operator Input Roles and Output Components

This section describes the input roles and output components for the Name and Address operator.

Input Roles

Table 19-1 lists input roles and descriptions for the Name and Address Operator.

Table 19-1 Name and Address Operator Input Roles

Input Role	Description
None	This setting causes generation to fail.
First Name	First name. This can be a nickname or shortened version of the first name.
Middle Name	Middle name or initial. Use when there is only one middle name, or for the first of several middle names (such as "Herbert" in George Herbert Walker Bush).
Middle Name 2	Second middle name; for example, "Walker" in George Herbert Walker Bush.
Middle Name 3	Third middle name; for example, "Louise" in Ethel May Roberta Louise Mertz.
Last Name	Last name, or surname.
First Part Name	First part of the Person name, including: Pre name First name Middle name(s) Use when these components are contained in one source column.
Last Part Name	Last part of Person Name, including: Last name Post Name Use when these components are all contained in one source column.
Pre Name	Information that precedes and qualifies the name; for example, Ms., Mr., or Dr.
Post Name	Generation or other information qualifying the name; for example, Jr. or Ph.D.
Person	Full person name, including: First Part Name (consisting of Pre Name, First Name, and Middle Names) Last Part Name (consisting of Last Name and Post Name) Use when these components are all contained in one source column.
Person 2	Designates a second person if the input includes multiple personal contacts.
Person 3	Designates a third person if the input includes multiple personal contacts.
Firm Name	Name of the company or organization.
Primary Address	Box, route, or street address, including: Street name House number City map grid direction; for example, SW or N Street type; for example, avenue, street, road, etc. This does not include the Unit Designator or the Unit Number.
Secondary Address	The second part of the street address, including: Unit Designator Unit Number For example, in a secondary address of Suite 2100, the Unit Designator is `STE' (a standardization of "Suite") and the Unit Number is `2100'.
Address	Full address line, including: Primary Address Secondary Address Use when these components share one column.
Address 2	Generic address line.
Neighborhood	Neighborhood or barrio, common in South and Latin American addresses.
City	Name of city.
State	Name of state or province.
Postal Code	Postal code, such as a ZIP code in the United States or a Postal Code in Canada.
Country Name	Full country name.
Country Code	The ISO 3166-1993 (E) two- or three-character country code. For example, `US' or `USA' for United States; `CA' or `CAN' for Canada.
Last Line	Last address line, including: City State or province Postal code Use when these components are all contained in one source column.
Line1, Line2, Line3, Line4, Line5, Line6, Line7, Line8, Line9, Line10	Non-discrete roles intended for use as free form name, business, personal, and address text. These generic selections can store any type of name and address data. They do not provide the parser with any information about the data content. Whenever possible, use the discrete input roles provided instead.

Output Components

Table 19-2 lists the Name and Address operator output components and their descriptions.

Table 19-2 Name and Address Operator Output Components

Output Component	Description
None	This setting causes generation to fail.
First Name	The first name found in the input name.
First Name Standardized	Standard version of first name; for example, Theodore for Ted or James for Jim.
Middle Name	Middle name or initial. Use this for a single middle name, or for the first of several middle names (such as "Herbert" in George Herbert Walker Bush).
Middle Name 2	Second middle name; for example, "Walker" in George Herbert Walker Bush.
Middle Name 3	Third middle name; for example, "Louise" in Ethel May Roberta Louise Mertz.
Middle Name Standardized	Standardized version of the middle name; for example, Theodore for Ted or James for Jim. Used when there is only one middle name, or for the first of several middle names (such as "Herbert" in George Herbert Walker Bush).
Middle Name 2 Standardized	Standardized version of the second middle name; for example, Theodore for Ted or James for Jim.
Middle Name 3 Standardized	Standardized version of the third middle name; for example, Theodore for Ted or James for Jim.
Last Name	Last name, or surname.
Pre Name	Title or salutation appearing before a name; for example, Ms. or Dr.
Post Name	Name suffix indicating generation; for example, Sr., Jr., or III.
Other Post Name	Name suffix indicating certification, academic degree, or affiliation; for example, Ph.D., M.D., or R.N.
Relationship	Information related to another person; for example, "Trustee for."
Gender	Probable gender: M= Male F= Female N= Neutral (either Male or Female) Blank=unknown.
Person	First name, middle name, and last name.
Person Count	Number of persons the record references. For example, a record with a Person name of "John and Jane Doe" has a Person Count of 2.
Firm Name	Name of the company or organization, including divisions.
Firm Count	Number of firms referenced in the record.
Address	Full address line, including: Primary Address Secondary Address
Miscellaneous Address	Miscellaneous address information, such as a telephone number or an e-mail address.
Neighborhood	Neighborhood or barrio, common in South and Latin American addresses.
City	Name of city. The US city names may be converted to United States Postal Service preferred names.
State	Name of state or province. For countries such as the United Kingdom, this column may contain a county name.
Postal Code	Full postal code with spaces and other non-alphanumeric characters removed.
Postal Code Formatted	Formatted version of postal code that includes spaces and other non-alphanumeric characters, such as dashes.
Country Name	The full country name.
Country Code	The ISO 3166-1993 (E) two-character country code, as defined by the International Organization for Standardization; for example, `US' for United States or `CA' for Canada.
Country Code 3	The ISO 3166-1993 (E) three-character country code, as defined by the International Organization for Standardization; for example, `USA' for United States, `FRA' for France, or `UKR' for Ukraine.
Last Line	Final address line, including: City State or province (if state contains a county instead of a state or province, it is not included) Formatted postal code if address was fully assigned
Primary Address	Box, route, or street address, including: Street name House number City map grid direction; for example, SW or N Street type; for example, avenue, street, road, etc. This does not include the Unit Designator or the Unit Number.
Secondary Address	The second part of the street address, including: Unit Designator Unit Number For example, in a secondary address of Suite 2100, the Unit Designator is `STE' (a standardization of "Suite") and the Unit Number is `2100'.
Street Name	Name of street.
Street Number	Number that identifies the address, such as a house or building number, sometimes referred to as the primary range. For example, in 200 Oracle Parkway, the Street Number is 200.
Street Type	Street identifier; for example, ST, AVE, RD, DR, or HWY.
Pre Directional	Street directional indicator appearing before the street name; for example, in 100 N University Drive, the Pre Directional is `N'.
Post Directional	Street directional indicator appearing after the street name; for example, in 100 15th Ave. S., the Post Directional is `S'.
Unit Designator	Type of secondary address, such as APT or STE. For example, in a secondary address of Suite 2100, the Unit Designator is `STE' (a standardization of "Suite").
Name Designator	Personal name designation; for example, "ATTN" (to the attention of) or "C/O" (care of).
Unit Number	Number that identifies the secondary address, such as the apartment or suite number. For example, in a secondary address of Suite 2100, the Unit Number is `2100'.
Box Name	The name for a post office box address; for example, for "PO Box 95," the Box Name is `PO BOX'.
Box Number	The number for a post office box address; for example, for "PO Box 95," the Box Number is `95'.
Route Name	Route name for a rural route address. For an address of "Route 5 Box 10," the Route Name is `RTE' (a standardization of "Route").
Route Number	Route number for a rural route address. For an address of "Route 5 Box 10," the Route Number is `5'.
Building Name	Building name, such as "Cannon Bridge House." Building names are common in the United Kingdom.
Parse Status	Trillium parse status code.
Parse Status Description	Text description of the Trillium parse status.
Urbanization Name	Urban unit name used in Puerto Rico. This applies to United States (Puerto Rico) addresses only.
ZIP 5	The five-digit United States postal code. This applies to United States addresses only.
ZIP 4	The four-digit suffix that is added to the five-digit United States postal code to further specify location. This applies to United States addresses only.
Delivery Point	Two-digit postal delivery point, which is combined with a full nine-digit postal code and check digit to form a delivery point bar code. This applies to United States addresses only.
DPBC Check Digit	Check digit for forming a delivery point bar code. This applies to United States addresses only.
CART	Four-character USPS Carrier route. This applies to United States addresses only.
LACS Flag	Indicates whether the address requires a LACS conversion. The Locatable Address Conversion System (LACS) provides new addresses when a 911 emergency system has been implemented. 911 address conversions typically involve changing rural-style addresses to city-style addresses, but on occasion they may involve renaming or renumbering existing city-style addresses. T= Address needs 9-1-1 conversion (from old rural route to new street address) and should be submitted to a LACS vendor F= Address does not require conversion. This applies to United States addresses only.
FIPS County	The three-digit county code as defined by the Federal Information Processing Standard (FIPS). This applies to United States addresses only.
FIPS Code	The complete (state plus county) code assigned to the county by the Federal Information Processing Standard (FIPS). Because FIPS county codes are unique within a state, a complete FIPS Code includes the two-digit state code followed by the three-digit county code. This applies to United States addresses only.
Census ID	United States Census tract and block-group number. The first six digits are the tract number; the final digit is the block-group number within the tract. These codes are used for matching to demographic-coding databases. This applies to United States addresses only.
Metropolitan Statistical Area	Metropolitan Statistical Area (MSA) number. For example, `0000' indicates that the address does not lie within any MSA, and typically indicates a rural area. This applies to United States addresses only.
Minor Census District	Minor Census District. This applies to United States addresses only.
Latitude	Latitude in degrees north of the equator: positive for north of the equator; negative for south (always positive for North America).
Longitude	Longitude in degrees east of the Greenwich Meridian: positive for east of GM; negative for west (always negative for North America).
Address Type	Type of address: F= Firm G= General Delivery H= High-rise apartment or office building HD= High-rise default (the Name and Address operator can detect a finer level of postal code assignment if further input information were available) B= Box R= Rural Code S= Street M= Military P= Post Office Box
Is Found	Indicates whether the address is listed in the postal matching database for the country indicated by the address: T= The address was found in a postal matching database. F= The address was not found in a postal matching database. This can mean that the address is not a legal address, and also that postal matching is not available for the country. This flag is true only if all of the flags listed below are true. If postal matching is available, this flag is the best indicator of record quality.
Is Parsed	Indicates whether the name or address was parsed: T= The name or address was parsed successfully. A name or address is considered to be successfully parsed even if some parsing warnings exist. F= The name or address may not be parsed. Check the status of parsing warning flags (such as Name Warning, City Warning, etc.).
Is Good Group	Indicates whether the name group, address group, or name and address group was processed successfully: For name groups, this means the name has been successfully parsed. For address groups, this means the address has been found in a postal matching database if one is available, or has been successfully parsed if no postal database is installed. For name and address groups, both the name and the address have been successfully processed.
Is Good Address	Indicates whether the address was processed successfully: T= Successfully processed: either the address was found in the postal matching database or, if no postal matching database is installed for the country indicated by the address, the address was successfully parsed. F= Not successfully processed: if a postal matching database is installed for the country indicated by the address, this indicates that address was not found in the database. If no postal matching database is available for the country, this indicates that the address may not be parsed. This flag is easier to use if you have a mix of records from both postal matched and non postal matched countries.
Is Good Name	Indicates whether the name was parsed successfully: T= The name was parsed successfully. Note that a name is considered to be successfully parsed even if some parsing errors exist. F= The name may not be parsed.
Is Address Verifiable	Indicates whether postal matching is available for the country of the address, but does not indicate the outcome of the matching operation: T= Postal matching is available for the country address. F= Matching is not available for the country of the address. This indicates that a postal matching database is installed for the country indicated by the address but matching is not available for the address. No postal matching database is installed for the country indicated by the address.
Name Warning	Indicates whether the parser had difficulty parsing a name: T= The parser had difficulty parsing a name. F= No difficulty parsing name.
Street Warning	Indicates whether the parser had difficulty parsing a street address: T= The parser had difficulty parsing a street address. F= No difficulty parsing street address.
City Warning	Indicates whether the parser had difficulty parsing a city: T= The parser had difficulty parsing a city. F= No difficulty parsing city.
City Found	Indicates whether the postal matcher found the city: T= The postal matcher found the city. F= The postal matcher did not find the city.
Street Name Found	Indicates whether the postal matcher found the street name: T= The postal matcher found the street name. F= The postal matcher did not find the street name.
Street Number Found	Indicates whether the postal matcher found the street number within a valid range of numbers for the named street: T= The postal matcher found the street number. F= The postal matcher did not find the street number within a valid range of numbers for the named street.
Street Components Found	Indicates whether the postal matcher found the street components, such as the Pre Directional or Post Directional: T= The postal matcher found the street components. F= The postal matcher did not find the street components.
Non-ambiguous Match Found	Indicates whether the postal matcher found a matching address in the postal database: T= The postal matcher found a match between the input record and a single entry in the postal database. F= The address is ambiguous. The postal matcher found that the address matched several postal database entries and could not make a selection. For example, if the input address is "100 4th Avenue," but the postal database contains "100 4th Ave N" and "100 4th Ave S," the input's missing directional causes the match to fail.
Address Corrected	Indicates whether the address was corrected in any way during matching. Standardization is not considered correction in this case. T= Some component of the address was changed, aside from standardization. One of the other Corrected flags, indicating the component that was changed, must also be True. F= No components of the address were changed, with the possible exception of standardization.
Postal Code Corrected	Indicates whether the postal code was corrected during matching. Correction can include the addition of a postal extension: T= The postal code was corrected. F= The postal code was not corrected.
City Corrected	Indicates whether the city name was corrected during matching. Postal code input is used to determine the city name preferred by the postal service. T= The city name was corrected during matching. F= The city name was not corrected.
Street Corrected	Indicates whether the street name was corrected during matching. Some correct street names may be changed to an alternate name preferred by the postal service. T= The street name was corrected during matching. F= The street name was not corrected.
Street Components Corrected	Indicates whether any of the street components, such as the Pre Directional or Post Directional, were corrected during matching: T= Street components were corrected. F= Street components were not corrected.
Parsing Country	Country parser that was used for the final parse of the record. Often, the Primary Country parser is used. However, if Warehouse Builder can determine the country of the input record, it will use that country parser if the parser is available. For many countries, a substitute parser is used; for example, the USA parser is used for the US Virgin Islands.
Installation Type	Type of Canadian postal installation: STN= Station RPO= Retail Postal Outlet For example, for the address, "PO Box 7010, Scarborough ON M1S 3C6," the Installation Type is `STN'.
Installation Name	Name of Canadian postal installation. For example, for the address, "PO Box 7010, Scarborough ON M1S 3C6," the Installation Name is `AGINCOURT'.
Locality Name	Name of county or other localization, common in the United Kingdom.
Locality Code	Code for the county or other localization, common in the United Kingdom.
Delivery Office Code	This applies to Hong Kong addresses only.
Delivery Beat Code	This applies to Hong Kong addresses only.

Name and Address Operator Supported Countries

Table 19-3 lists supported countries for the Name and Address Operator.

Table 19-3 Countries Supported by the Name and Address Operator

Name	Postal Code Matching Available
Argentina	No
Australia	Yes
Belgium	No
Brazil	Yes
Canada	Yes
Chile	No
Colombia	No
Denmark	No
France	Yes
Germany	Yes
Hong Kong	Yes
India	No
Ireland	No
Italy	No
Mexico	Yes
Malaysia	No
Netherlands	Yes
New Zealand	No
Peru	No
Philippines	No
Portugal	Yes
Singapore	No
South Africa	No
Spain	Yes
Sweden	No
Switzerland	No
United Arab Emirates	No
United Kingdom	Great Britain Only
United States	Yes
Venezuela	No

Table Maintenance Compiler

The table maintenance compiler is a command line utility, TABMAINT, located in $ORACLE_HOME/nas/bin/platform, where platform is win32, solaris, hpux, or aix. Its usage is:

TABMAINT -parmfile parameterFileName

TABMAINT echoes all input from the standard definitions and user definitions files to standard output along with any errors. Redirect this output to a file to reduce compile time from several minutes to a few seconds. The typical convention is to name the output file TABERR (that is, TABMAINT -parmfile parameterFileName > TABERR). At the completion of the compilation, view the end of TABERR for any compiler errors resulting from syntax errors in the user definition file.

The parameter file specifies input and output file names. Each country requires a separate parameter file (for example, USWDPAT for USA) because the file path and country code prefix of STDDEF differ. The following is an example for a USA parameter file, named pftabmntUS.par:

STDDEF c:\owbServer\owb\NAS_DATA\NorthAm\US\USWDPAT
USERDEF c:\owbServer\owb\NAS_DATA\NorthAm\US\CLWDPAT
TABLEDEF c:\owbServer\owb\NAS_DATA\NorthAm\US\CLTABDEF
TABLEPAT c:\owbServer\owb\NAS_DATA\NorthAm\US\CLTABPAT

Your data installation will differ from this example, depending on the Oracle Home you selected when installing the Oracle9i Warehouse Builder Server Side components.

The following is a usage example of the parameter file shown above:

cd c:\owbServer\owb\bin\win32\
TABMAINT -parmfile pftabmntUS.par > TABERR

Table 19-4 defines the input and output files specified in the parameter file.

Table 19-4 Parameter File Definitions for TABMAINT

Parameter Name	Description	Compiler Reserved Word
STDDEF	The input file maintained by Oracle9i Warehouse Builder Name and Address. This file includes all standard definitions for titles, locations, mailing routes and postal codes, as well as patterns and masks for other name and address components. Its usual name is `xx``WDPAT` where `xx` is the two-character country code. Do not modify this table. Oracle updates it periodically. Modify USERDEF only.	Yes
USERDEF	The optional client input table. This file is combined with the `STDDEF` file to create Word/Phrase and Pattern Tables. Its usual name is `CLWDPAT`. This file initially contains only a comment line. `USERDEF` is the only table in the Table Maintenance process that you can customize.	Yes
TABERR	An optional file composed of both input tables, with appended line numbers and warnings of input errors. Use this file to identify errors in the input. `TABERR` is not a reserved word recognized by the compiler, but a convention to identify this file. You may give it any name and use redirection of standard output to produce it.	No
TABLEDEF	The encoded Word/Phrase Table output by Table Maintenance and used by the parser. Because this file is encoded, you cannot read this file. Its name is `CLTABDEF`.	Yes
TABLEPAT	The encoded Pattern Table output by Table Maintenance and used by the parser. Like the Word/Phrase Table, you annot read this file. Its name is `CLTABPAT`.	Yes

Identifying Parsing Problems

Although you can readily see parsing problems in the output of the Name and Address components, you must investigate various log files to find problems in the input. The following two parameters of nameAddr.properties control tracing/log output.

TraceLevel: This parameter controls tracing of all data flowing into and out of the Name and Address server. Valid values are 0 (disable tracing) and 1 (enable tracing). If tracing is enabled, output is written to NASvrTrace.log, which is located in the owb/bin/admin folder. Tracing degrades performance, so enable it only for testing or troubleshooting. The trace log is useful for verifying that the server receives Name and Address parsing commands and returns a result.
EnableDetailLogging: This parameter enables or disables detailed logging at the service provider level. Valid values are true or false. Enable logging only for troubleshooting. Use detail logging to determine how input is parsed and to identify pattern matching problems.

The parseLog and parseDetail output files each have a three-character country code extension. The logs are located at $ORACLE_HOME/owb/bin/admin. The parse detail log shows the inputs, outputs, and applied pattern matching rules for each input record. Detailed logging degrades performance, so enable it only for diagnostics. Examine the detail log before and after table maintenance to verify effects of the new or modified patterns.

To diagnose parsing problems, edit nameAddr.properties, changing the line EnableDetailLogging=false to EnableDetailLogging=true. When you encounter problems, perform a run with only a few records of interest and examine the parse detail log. Examine the log again after performing the table maintenance. The log grows rapidly, so the EnableDetailLogging configuration item should be false on production runs.

The following portion of a detailed parse log entry shows a pattern-matching problem with the France parser for a record containing the name Oracle Corporation. Since Corporation is not a typical French business name token, the parser incorrectly patterns Oracle Corporation as a first and last name.
```
***RECODED RECORD 
***
1) <<< ORACLE >>> <<< CORPORATION >>>
2) <<< 89 >>> <<< AVENUE >>> <<< FRANCOIS ARAGO >>>
3) <<< 92000 >>> <<< NANTERRE >>>
4) <<< FRANCE >>>

COMPREHENSION = 10
CONFIDENCE LEVEL = 10
LINE PATTERN = NSGG

-->LINE (1) IS A NAME LINE DUE TO RULE 2) FIRST IS A NAME LINE 
PATTERN=FIRST, LAST,
-->LINE (2) IS A STREET LINE DUE TO RULE 3) LINE HAS ATTRIBUTES OF ONLY ONE 
LINE TYPE PATTERN =HSNO, TYPE, STREET-NAME, 
-->LINE (3) IS A GEOG LINE DUE TO RULE 7) GEOGRAPHY NEEDED A CITY PATTERN 
=POSTAL-CODE, CITY, 
-->LINE (4) IS A GEOG LINE DUE TO RULE 15) LAST LINE HAS GEOGRAPHY PATTERN 
=IGNORE
```
To instruct the parser to recognize Oracle Corporation as a business name, add the following line to the French CLWDPAT and execute TABMAINT:
```
'CORPORATION' INS NAME DEF ATT=BUSINESS
```

Table Maintenance

To perform table maintenance, you must enable detailed parse logging. If detailed parse logging is off, stop the Name and Address server, edit nameAddr.properties to enable detailed logging (EnableDetailLogging parameter), and restart the server.

To perform table maintenance:

Execute your Name and Address parsing process and examine the parse detail log to identify parsing problems.
Add new parsing pattern lines to the CLWDPAT file under the appropriate country folder.

If possible, edit similar patterns copied from the standard definition table (having a file name ending with WDPAT). You can also copy patterns from word pattern tables of other countries.
If you have not done so, create the parameter file.

Execute the table maintenance utility:

TABMAINT -parmfile paramFileName > TABERR

If the Name and Address server is running, stop it using NAStop and restart it using NAStart.
Execute the Name and Address parsing process and examine the parse detail log to determine whether the parser corrected the errors.

If your results are incorrect, return to step one.

Common Table Entries

The most common parsing problems involve incorrect identification of personal names and businesses. First name definitions are very important in identifying personal names. Likewise, business patterns are important in identifying businesses. Without these specific definitions, the parser must rely only on combinations of ALPHA and 1ALPHA patterns to make a decision between a personal name and a business. For example, in the Brazil client pattern table, use the following patterns to identify a new first name:

'DEJOARA'               MODIFY NAME DEF ATT=FIRST,GENDER=F
'ATHOS'                 MODIFY NAME DEF ATT=FIRST,GENDER=M

Title attributes also help identify personal names. The parsed title appears in the PRENAME output component. The recode specifies how an input may be recoded for output. For example, you can add the following to a table:

'PADRE'                 MODIFY NAME BEG ATT=TITLE, GENDER=M
'PROFESSOR'             MODIFY NAME BEG ATT=TITLE, RECODE=PROF

Last name definitions are helpful in identifying last names having multiple words. In particular, their identification prevents the parser from interpreting last name parts as first names or middle initials. The common Brazilian name "Costa E Silva" would be parsed as a first name, middle initial, and last name without the following last name definitions.

'COSTA E SILVA'         INSERT NAME END ATT=LAST
'C E SILVA'             INSERT NAME END ATT=LAST

Business definitions may contain entire business names or just keywords specifying a business. Without business definitions, the parser may interpret any business name having a personal name pattern as a personal name. For example, "John Smith Company" would be parsed as a first, middle, and last name if company were not defined as a business pattern.

'COMPANY'               INS NAME BEG ATT=BUSINESS
                        DEF ATT=B-DESCRIPTIVE,CAT='HN21'

Generic definitions may be problematic, however. If you instruct the parser to interpret "Diesel" as a business keyword, it also interprets last names of "Diesel" as business names. If such problems occur, you can delete a business keyword:

'DIESEL'                DELETE

Syntax for USERDEF Entries

This section describes the syntax for USERDEF entries. Note the following about the syntax documentation for USERDEF entries:

Most words are entered literally in USERDEF. However, where indicated, abbreviations are allowed.
Items enclosed in square brackets, [], are optional. Square brackets identify optional items. Do not type the square brackets themselves on the input line.
Items enclosed in curly brackets, {}, may be repeated. Curly brackets identify repeatable items. Do not type the curly brackets themselves on the input line.
The pipe symbol, |, indicates that you select one of the items in the command.

INSERT or MODIFY

Abbreviations

Either enter the commands literally or use the following abbreviations:

INSERT can be abbreviated as INS
MODIFY can be abbreviated as MOD

Format

When the operation is INSERT or MODIFY, the input record format is:

SYMBOL OPERATION [MODIFIER] [LINE-TYPE] [POSITION] KEYWORD=VALUE

Symbols

Example: `INSERT`

'MARY'       INSERT NAME BEG ATT=FIRST,GENDER=F

Example: `MODIFY`

'MARY'       MODIFY NAME BEG ATT=FIRST,GENDER=F

The operation MODIFY constitutes a DELETE and an INSERT. the original entry is deleted and the modified entry is inserted.

Patterns

Example: Name Pattern:

'FIRST 1ALPHA ALPHA'
       INSERT PATTERN NAME DEF
       RECODE='FIRST MIDDLE LAST'
       EXPORT='FIRST(1) MIDDLE(1) LAST(1)'

Example: Street Pattern:

'1ALPHA 1ALPHA 1ALPHA TYPE'
    INS PATTERN STREET DEF
    RECODE='STREET-NAME STREET-NAME STREET-NAME TYPE'

DELETE

Abbreviations

Either enter the command literally or abbreviate DELETE with DEL.

Deleting Symbols

When the operation is DELETE, the record format is SYMBOL DELETE

Example:

'MARY'      DELETE

Deleting Synonyms

When the keyword is SYNONYM, you must enter the actual synonym:

SYMBOL DELETE SYNONYM=VALUE

Example:

'BV'       DELETE SYNONYM=BOULEVARD

Deleting Patterns

When the operation is DELETE PATTERN, you must enter the actual pattern followed by DELETE PATTERN.

Example:

'FIRST 1ALPHA ALPHA'
DELETE PATTERN

Conventions for USERDEF Entries

Case

You can add new entries to the USERDEF file in either upper or lower case letters with one exception: the symbol value for a mask type entry (that is, when the modifier is MASK) must be entered in lower case. The parser interprets all other characters for their literal case status.

Comments

To specify comment line, use an asterisk (*) in column 1 or a double forward slash (//) on the same line as the entry. Everything following the '//' is ignored.
Do not leave the first character of an input record blank. The parser ignores records beginning with a blank.
When writing comments, use alpha, numeric, or alpha-numeric characters only.

Line Lengths

Use the following conventions for table entries that span multiple lines:

Indent every additional line within each entry.
Begin every new entry in column 1.
The maximum line length is 150 characters, including the newline character.
Entry definition length may not exceed 30 characters.
Components of an entry may not exceed more than one line.
Use only one line for types, or symbols, that are considered the "left side of the equation" in a table entry and do not allow them to carry over to a second line. This includes word/phrase definitions, masks and pattern entries.

Multiple Definitions

Occasionally, an entry may contain multiple meanings. The first definition is entered in the standard way. Indent subsequent definitions under the initial operational value.

Example:

'CENTER'      INSERT NAME DEF ATT=BUSINESS
                      STREET END ATT=SEC-TYPE,RECODE=CTR
                      GEOG DEF RECODE=CENTER

Export Line for Name Patterns

Name patterns require export lines. Append a number enclosed in parentheses for each name element.

Example 1: `RICHARD G SMITH AND MARY ROGERS`

In this example, two individuals have separate last names.

'FIRST 1ALPHA ALPHA CONNECTOR FIRST ALPHA'
   INSERT PATTERN NAME DEF
   RECODE='FIRST MIDDLE LAST CONNECTOR FIRST LAST'
   EXPORT='FIRST(1) MIDDLE(1) LAST(1) CONNECTOR(2) FIRST(2) LAST(2)'

Example 2: `RICHARD G AND MARY SMITH`

In this example, two individuals share the same last name. 'LAST(12)' indicates that 'LAST' is shared by NAME1 and NAME2. Note that in numbering connectors, the connector number typically agrees with the name following the connector.

'FIRST 1ALPHA CONNECTOR FIRST ALPHA'
   INSERT PATTERN NAME DEF
   RECODE='FIRST MIDDLE CONNECTOR FIRST LAST'
   EXPORT='FIRST(1) MIDDLE(1) CONNECTOR(2) FIRST(2) LAST(12)'

Quotation Marks

Table Maintenance interprets entries enclosed by single quotation marks as one entity. If you wish to include quotation marks within a SYMBOL or VALUE, use double quotation marks ("). The parser converts double quotation marks specified within a SYMBOL or within a VALUE in an input record to single quotation marks ('). If a recode string contains more than one word, type the entire string with single quotes.

Examples:

'O"BRIEN'           INSERT NAME END ATT=LAST

'AS TRUSTEE FOR'    INSERT SYNONYM='TRUSTEE FOR'

'MEBAR HARBER_'     INSERT GEOG DEF ATT=CITY-CHANGE,
RECODE='MEBAR HARBOR'

Masks

While a word or phrase is a data element, a mask is a description of the word or phrase. Masks use alpha, numeric, or special characters to represent letters, numbers, and, one-to-one, the special characters. A mask defines the characters of the data element using:

'n' to represent a numeric (0 -9)
'a' to represent a letter (a -z, lower case only)

For example, you can use a mask to define any series of five numerals as a ZIP code instead of the user entering each of the 99,999 possible combinations in the USERDEF. The following is an example mask entry:

'nnnnn' INSERT MASK GEOG DEF ATT=ZIPCODE

Masks may include special characters if they are part of the word representation. For example, a mask for a nine-digit ZIP code is entered:

'nnnnn-nnnn' INSERT MASK GEOG DEF ATT=ZIPCODE

Mask Recodes

You can use masks to introduce and/or exclude literals and special characters in your recodes. For example, a mask for a telephone number is entered:

'nnn nnn-nnnn' INSERT MASK MISC DEF ATT=IGNORE,RECODE='(nnn)nnn-nnnn'

It would recode the entry '508 663-9955' to '(508) 663-9955'. A mask for an account number could be recoded and customized in this way:

'aannnnnn' INSERT MASK MISC DEF ATT=IGNORE,RECODE='ACCTNOaannnnnn'

It would recode the entry 'CL123456' to 'ACCTNOCL123456'.

Patterns

A pattern consists of attributes and intrinsic attributes that include any alpha, numeric, or special character representation of a data element. Table 19-5 contains the 14 intrinsic attributes that can be included in a pattern.

Table 19-5 Intrinsic Attributes

Attribute	Definition
1ALPHA	A single letter
1NUMERIC	A single numeral
ALPHA	Letters only
ALPHA - 1NUMERIC	Letters followed by a single numeral
ALPHA-1SPECIAL	Letters followed by a non-letter, non-number, or non-hyphen
ALPHA-NUMERIC	Letters followed by numerals
ALPHA-SPECIAL	Letters followed by special characters
HYPHEN	Hyphen
NUMERIC	Numerals only
NUMERIC-1ALPHA	Numerals followed by one letter
NUMERIC-1SPECIAL	Numerals followed by one special character
NUMERIC-ALPHA	Numerals followed by letters
NUMERIC-SPECIAL	Numerals followed by special characters
OTHER-SPECIAL	Any characters other than letters and numerals, such as hex or packed decimal

Most patterns are defined using recodes. Recoding instructions include information about changing the attributes of the words that fall into the given pattern. The following two examples are valid entries for USERDEF.

'FIRST 1ALPHA ALPHA' 
           INSERT PATTERN NAME DEF 
           RECODE='FIRST MIDDLE LAST'
           EXPORT='FIRST(1) MIDDLE(1) LAST(1)'

'HSNO ALPHA TYPE' 
           INSERT PATTERN STREET DEF
           RECODE='HSNO STREET-NAME TYPE'

Line Types

Classify lines into one of four types to facilitate the parser in identifying each word in each line. An entry can contain more than one line type identification. You can assign the following line types:

Name

Name of either a person or business. Names are usually the first one or two lines in an address record.

'BOOKSTORE' INSERT NAME DEF ATT=BUSINESS,CAT=S5942

Street

All descriptions of streets and numeric addressing, including box numbers, rural routes, and apartment numbers. A street line is usually in the middle of a record, and may be one or more lines.

'LANE' INSERT STREET END ATT=TYPE,RECODE=LN

Geography

The city, state, ZIP code or postal code, and country in the address. Geography lines are usually located at the end of an address record.

'MASSACHUSETTS' INSERT GEOG END ATT=STATE,RECODE=MA

Miscellaneous

Information that does not fit into the other three line types, such as account name or a comment.

'HOLD MAIL' INSERT MISC DEF ATT=HOLD

Positions

You can define a symbol in relation to its position within the address line. The three valid positions are:

Beginning

The first word in a line, any word that follows a title, or any words that appear before a first name (including the first name). For example, in the line

MR JOSEPH SMITH

every word except 'Smith' is at the beginning of the line.

Default

When the physical location of the word in the line is irrelevant, use the Position Default in the definition. A default word may appear anywhere on the line, including the beginning and ending.

Ending

The last word and any further non-alphabetic characters are the ending of a line. For example, in the line

BRIARWOOD ESTATES APT 3

the word Apt and the Apt# are the end of the line.

Logical vs. Physical Beginning and Ending Positions

An address line is not limited to one beginning and ending position. If a line beginning is identified by the Table Maintenance process, and a word with the attributes CONNECTOR (such as AND, &, OR), RELATIONSHIP (such as FBO, In Trust For), DBA (such as Doing Business As), or a comma is identified, then the line has two beginnings and two endings. Examples of beginning and ending line positions are displayed below:

JOHN  J  SMITH   AND   MARY   SMITH
BEG       END    CONN   BEG    END

JOHN J  SMITH   IN TRUST FOR  JOHN J  SMITH  JR
BEG      END    RELATIONSHIP   BEG           END

SMITH,   MARY  A
BEG      BEG   END

The physical beginning of the above lines is the first word on the line. There can be more than one logical beginning of a line. Multiple beginning attributes in a row continue the logical beginning of a line.

JOHN     JAMES    THOMAS  (all are beginning attributes of FIRST NAME)
BEG       BEG       BEG

Keywords

Table 19-6, "Keywords and Descriptions" shows the seven valid keywords and their meanings. Underlined letters show the appropriate abbreviation.

Table 19-6 Keywords and Descriptions

Keyword	Abbreviation	Description
Attribute	Att	The meanings assigned by the parser to identify the token (word/symbol) entries. Example: `'ROAD' INSERT STREET END ATT=TYPE`
Category	Cat	A user-defined, free-form way to categorize name and address elements. Categories should be limited to six characters (based on assigning multiple categories throughout a record), with a maximum of 50 bytes per record for all categories. A category can be any grouping that is useful during parsing to identify name/address components, such as assigning the industry-wide SIC codes to business/company names. A record using multiple categories requires the following syntax: `'BOY SCOUTS' INSERT NAME DEF ATT=BUSINESS,CAT='S8641\|HN25'`
Export	Exp	Enables you to define the pattern of a name written to a string for exporting. The export line indicates which person each attribute belongs to in a name pattern. A typical export line for a name pattern is: `'FIRST 1ALPHA CONNECTOR 1ALPHA ALPHA' INSERT PATTERN NAME DEF RECODE='FIRST MIDDLE CONNECTOR FIRST LAST'EXPORT='FIRST(1) MIDDLE(1) CONNECTOR(2) FIRST(2) LAST(12)`'
Gender	Gen	The code Gender allows you to assign a gender to a name component. M=male, F=female, and N=neuter (gender unknown). First names defined in the table require a gender code. Example: `'MARY' INSERT NAME BEG ATT=FIRST,GENDER=F`
Recode	Rec	For tokens (the word symbol), the value assigned to a Recode is the value the parser uses as output for this word. For patterns, the Recode value is the attribute value used to overwrite the attribute value in the original pattern string during parser output. Example: `'ROAD' INSERT STREET END ATT=TYPE,RECODE=RD`
Special	Spec	Specifies that external logic is needed to identify this name and/or address component, and names the file (subroutine) containing that logic. Example: `'CASH-N-CARRY' 'ALPHA-SPECIAL'`
Synonym	Syn	A shortcut for defining an entry with the same value as a prior entry. For example, `'PBOX' INSERT SYNONYM='PO BOX'` identifies `PBOX` and `PO BOX` as synonyms for the parser. Any data element can be defined with synonyms.

Attributes

The following table shows the valid attributes, their abbreviations, and corresponsing definitions.

Table 19-7 Attributes, Abbreviations, and Descriptions

Attribute	Abbreviation	Description
1ALPHA	Not applicable.	A single letter
1NUMERIC	Not applicable	A single numeral
ALPHA	Not applicable	Letters only
ALPHA-1NUMERIC	Not applicable	Letters followed by a single numeric
ALPHA-1SPECIAL	Not applicable	Letters followed by a non-letter, non-number, or non-hyphen
ALPHA-NUMERIC	Not applicable	Letters followed by numerals
ALPHA-SPECIAL	Not applicable	Letters followed by special characters
HYPHEN	Not applicable	A hyphen
NUMERIC	Not applicable	Numerals only
NUMERIC-1ALPHA	Not applicable	Numerals followed by one letter
NUMERIC-1SPECIAL	Not applicable	Numerals followed by one special character
NUMERIC-ALPHA	Not applicable	Numerals followed by letters
NUMERIC-SPECIAL	Not applicable	Numerals followed by special characters
OTHER-SPECIAL	Not applicable	Any characters other than letters and numerals, such as hex or packed decimal
APARTMENT	APT	Apt., condo, etc.
APARTMENT#	APT#	Usually the value after apartment word
APT-COMPLEX	A-CPLX	Apts., estates, etc.
ATTENTION	ATTN	Secondary name for business
B-DESCRIPTIVE	B-DES	Non-name/address word in business name
B-REDEFINE	B-RDEF	DBA, Doing Business As, etc., related to business name
BOX	Not applicable	Mailbox, etc.
BOX#	Not applicable	Usually the value after Box word
BUS-EDIT	BEDIT	Edits a word/phrase from the business name Example: BUS-EDIT will delete "`INC`" from the business name. The ":" is a required prefix to the word to be edited. `:INC' INSERT NAME DEF ATT=BUS-EDIT Optionally, you can add a category to qualify the case. For example, adding `CAT='HP017021`' to this attribute defines "`017021`" as the low/high first three digits of the zip code in the record containing "`INC`". A business name containing "`INC`" is deleted from the name when the first three digits of the corresponding zip code are between 017 and 021. NOTE: BUS-EDIT requires that the parameter BUSINESS_NAME_EDIT must be set to "Y" in the Parser Parameter file.
BUSINESS	BUS	Name of a business.
BUSINESS?	BUS?	Possible name for a business.
C-TITLE	C-TTL	Compound title, such as Dr./Mrs.
CARE-OF	C-OF	C/O, In Care Of, etc.
CITY	CIT	CitY
CITY-CHANGE	CIT-CHG	Corrected spelling of city entry
COMPLEX-NAME	CPLX-NM	Apt or Business Complex name
CONCATENEE	CON	Part of geography/name/street phrase, such as "Von" or "O" in a name, `of the' in Avenue of the Americas in a street, or "Los" or "St." in geography line/city name.
CONNECTOR	CTR	And, or, with, etc. between names
COUNTRY	CNTRY	Country namE
COUNTY	CTY	Name of county
DEPT	DPT	Dept., Section, Group, etc.
DESCRIPTIVE	DES	Non-name/address word(s), such as Income Acct or General Fund
DIRECTION	DIR	Street direction such as North, Northwest, or NW
DRAWER	DRW	Drawer or abbreviations of drawer
DRAWER#	DRW#	Usually the value after Drawer word
FIRST	FST	First Name
FLOOR	FLR	Floor, basement, penthouse, etc.
FLOOR#	FLR#	Usually the value after Floor word
GENERATION	GENR	Jr., Sr., III
HIGHWAY	HWY	Highway, Expressway, etc.
HIGHWAY#	HWY	Usually the value after Highway word
HSN	Not applicable	House number
IGNORE	IGN	Does not affect name/address, such as "et al"
LAST	LST	Surname
L-TYPE	L-TYP	Leading street type, such as Calle A or Avenue A
LOT	Not applicable	Lot, perhaps for a house not yet built
LOT#	Not applicable	Usually the value after Lot word
MIDDLE	MID	Middle name
MILITARY-COMPLEX	M-CPLX	Base, ship, etc.
NEIGHBORHOOD	NEI	Geographic neighborhood name, usually not USA
NUM-QUAL	N-Q	NUMBER QUALIFIER - Handles "#", "no." etc. when they have no meaning. This is useful when a "#" or other insignificant character is present between apt and apt# - ex. apt # 10
NUMBER	NMB	Ten, fifteen; a house or street number expressed as a word
PBOX	PBOX	Post office b
PBOX#	PBOX#	Usually the value after PBOX word
P-TITLE	P-TTL	Plural title, such as Drs.
POLE	POL	Telephone pole used to identify address
POLE#	POL#	Usually the value after Pole word
POSTAL- CODE	P-CD	Canadian or European format
PROVINCE	PRV	Canadian province
REDEFINE	RDEF	Also Known As (AKA) for a name, Doing Business As (dba) for a business name, etc
RELATIONSHIP	REL	In Trust For, FBO, etc.
ROUTE	RTE	Rural route, etc
ROUTE#	RTE#	Usually the value after Route word
S-CONNECTOR	S-CTR	Street connector
S-DIRECTION	S-DIR	Single direction, such as North or South (vs. Direction which may be compound North West, etc.)
SEC-TYPE	S-TYP	Secondary street type, for streets with more than one, such as Hilltop Circle Lane
SITE	Not applicable	Site, block, etc. (similar to Lot)
SITE#	SIT#	Usually the value after Site word
STAT	ST	Either state or two-character abbreviation for US state
STREET-NAME	STR-NM	Name of the street
STREET-NAME-LONG	STR-NM-L	Allows street names to occupy the display and non-display parser fields when the street name length is greater than 25 characters - designed for use in countries where long street names are common
SUITE	STE	Suite, Rooms, etc.
SUITE#	STE#	Usually the value after the Suite word
TITLE	TTL	Mr., Mrs., Capt., MD, etc.
TYPE	TYP	Street type, such as St., Ave., or Boulevard
UNIT	UNT	Unit, building, etc
UNIT#	UNT#	Usually the value after Unit word
ZIPCODE	ZIP	US zip code, five or nine numerals

Special Routines

In some instances, the parser requires more informatin to correcly identify a pattern. To solve this problem, special routines for specific patterns can be coded during table maintenance. These routines are invoked when data fits the specific data requirement. If the special routine is unsuccessful, the normal action of recode and export occurs.

For example, consider the following:

Name line = 'JOHN NICOLI CHRIS NICOLI'

Since 'CHRIS' is coded in the standard definition table (STDDEF) with a beginning attribute of FIRST and it is not in the beginning of the lin, the resulting pattern is 'FIRST ALPHA ALPHA ALPHA'. It should be noted that "beginning" refers to the physical beginning, after components that are valid for the beginning, or after a connector), .

Under most circumstances, data meeting this shape is recoded as 'FIRST MIDDLE MIDDLE LAST'. Because the word 'NICOLI' is repeated, you can derive that the connector has been omitted. By applying special routine HDTNP01 to the pattern 'FIRST ALPHA ALPHA ALPHA'; the routine looks for a repeating last name and if found, modifies the pattern as if it came in as 'FIRST LAST ALPHA LAST'. It then looks up that pattern for a valid recode. The pattern 'FIRST LAST ALPHA LAST' should be in the table with the appropriate recode. If the pattern is not in the table with an appropriate recode, a normal pattern failure takes place. If the special routine fails its specification, the action contained in the original entry is invoked.

This section lists the special routines you can enter into Table Maintenance to effect name and street pattern results. Note that square brackets, [ ], indicate optional data.

Name Pattern Routines

HDTNP01

Routine for name patterns in the form of:

'[TITLE] WORD WORD [TITLE] WORD WORD'

This routine skips over the titles and checks if the second word and fourth word are equal. If they are, it changes them to LST_ATT (last attribute) and returns a positive pattern hit. If the second and fourth words are not equal, the routine makes no changes.

Example: `'JOHN THOMAS CHRIS THOMAS'`

Coded as:
```
'FIRST FIRST FIRST FIRST' 
```

Entry as may be found in the STDDEF table:

'FIRST FIRST FIRST FIRST'
    INSERT PATTERN NAME DEF
    RECODE='FIRST MIDDLE MIDDLE LAST'
    EXPORT='FIRST(1) MIDDLE(1) MIDDLE(1) LAST(1)'
    SPECIAL=HDTNP01

Special routine HDTNP01 recodes the pattern to:
```
'FIRST LAST FIRST LAST' 
```

Then this pattern is looked up and the final result is the following:

'FIRST LAST FIRST LAST'
    INSERT PATTERN NAME DEF
    RECODE='FIRST LAST FIRST LAST'
    EXPORT='FIRST(1) LAST(1) FIRST(2) LAST(2)'

HDTNP02

This routine for name patterns is written as follows:

'[TITLE] WORD WORD CONNECTOR [TITLE] WORD WORD'

This routine skips over the titles and checks if the second word and last word are equal. If they are, it changes them to LST_ATT (last attribute) and returns a positive pattern hit. If the second and last words are not equal, the routine acts as if it has not been invoked.

Example: `'JOHN THOMAS AND JOYCE THOMAS'`

Coded as:
```
'FIRST FIRST CONNECTOR FIRST FIRST' 
```

Entry as may be found in the STDDEF table:

'FIRST FIRST CONNECTOR FIRST FIRST'
    INSERT PATTERN NAME DEF
    RECODE='FIRST MIDDLE CONNECTOR FIRST LAST'
    EXPORT='FIRST(1) MIDDLE(1) CONNECTOR(2) FIRST(2) LAST(12)'
    SPECIAL=HDTNP02

Routine HDTNP02 recodes the pattern to:
```
'FIRST LAST CONNECTOR FIRST LAST' 
```

Then this pattern is looked up and the final result is the following:

'FIRST LAST CONNECTOR FIRST LAST'
    INSERT PATTERN NAME DEF
    RECODE='FIRST LAST CONNECTOR FIRST LAST
    EXPORT='FIRST(1) LAST(1) CONNECTOR(2) FIRST(2) LAST(2)'

HDTNP03

Routine for name patterns in the form of:

'[TITLE] WORD WORD CONNECTOR [TITLE] WORD WORD WORD'

This routine skips over the titles and checks to determine if the second word and thelast word are equal. If they are equal, it changes them to LST_ATT (last attribute) and returns a positive pattern hit. If not, the routine bahves as if it was not invoked.

Example: `'JOHN THOMAS AND JOYCE ANN THOMAS'`

Coded as:

'FIRST FIRST CONNECTOR FIRST FIRST FIRST'

Entry as may be found in the STDDEF table:

'FIRST FIRST CONNECTOR FIRST FIRST FIRST'
    INSERT PATTERN NAME DEF 
    RECODE='FIRST MIDDLE CONNECTOR FIRST MIDDLE LAST'
    EXPORT='FIRST(1) MIDDLE(1) CONNECTOR(2) FIRST(2) MIDDLE(2) LAST(12)'
    SPECIAL=HDTNP03

Routine HDTNP03 recodes the pattern to:

'FIRST LAST CONNECTOR FIRST MIDDLE LAST'

Then this pattern is looked up and the final result is the following:

'FIRST LAST CONNECTOR FIRST MIDDLE LAST'
    INSERT PATTERN NAME DEF
    RECODE='FIRST LAST CONNECTOR FIRST MIDDLE LAST'
    EXPORT='FIRST(1) LAST(1) CONNECTOR(2) FIRST(2) MIDDLE(2)     									
            LAST(2)'

HDTNP04

This is the routine for name patterns in any form. The first pair of FIRST or ALPHA attributes with the same original value is turned to LST_ATT (last attributes) and all others that match it. Although this routine could replace routines HDTNP01, HDTNP02, and HDTNP03, it should be used sparingly.

Example: `'JOHN ANTHONY THOMAS BILL THOMAS'`

Coded as:
```
'FIRST FIRST FIRST FIRST FIRST' 
```

Entry as may be found in the STDDEF table:

'FIRST FIRST FIRST FIRST FIRST'
    INSERT PATTERN NAME DEF
    RECODE='FIRST MIDDLE MIDDLE MIDDLE LAST'
    EXPORT='FIRST(1) MIDDLE(1) MIDDLE(1) MIDDLE(1) LAST(1)'
    SPECIAL=HDTNP04

Routine HDTNP04 recodes the pattern to:
```
'FIRST FIRST LAST FIRST LAST' 
```

Then the pattern is looked up and the final result is the following:

'FIRST FIRST LAST FIRST LAST'
    INSERT PATTERN NAME DEF
    RECODE='FIRST MIDDLE LAST FIRST LAST'
    EXPORT='FIRST(1) MIDDLE(1) LAST(1) FIRST(2) LAST(2)'

HDTNP05

This routine for name patterns checks if the token carrying an ALPHA with special characters should be a business word. It follows the pattern below:

[*]a-a-a[*]

where the words must be in the form of any number of alpha characters.

Example: `'CASH-N-CARRY'`

Coded as:
```
'ALPHA-SPECIAL' 
```

Entry as may be found in the STDDEF table:

'ALPHA-SPECIAL'
    INSERT NAME PATTERN DEF
    RECODE='IGNORE'
    EXPORT='IGNORE(1)'
    SPECIAL=HDTNP05

Routine HDTNP05 recodes the pattern to:
```
'BUSINESS' 
```

Then this pattern is looked up and the final result is the following:

'BUSINESS'
    INSERT NAME PATTERN DEF
    RECODE='BUSINESS'
    EXPORT='BUSINESS(1)'

HDTNP06

This name routine checks if the second ALPHA token has a beginning attribute of FIRST. It handles the pattern 'ALPHA ALPHA' recoding to 'LAST FIRST'.

Example: `'NICOLI JOHN'`

Coded as:
```
'ALPHA ALPHA'
```

Entry as may be found in the STDDEF table:

'ALPHA ALPHA'
    INSERT PATTERN NAME DEF
    RECODE='FIRST LAST'
    EXPORT='FIRST(1) LAST(1)'
    SPECIAL=HDTNP06

Routine HDTNP06 recodes the pattern to:
```
'ALPHA FIRST' 
```

Then this offsetting pattern is looked up and the final result is the following:

'ALPHA FIRST'
    INSERT PATTERN NAME DEF
    RECODE='LAST FIRST'
    EXPORT='LAST(1) FIRST(1)'

HDTNP07

This name routine converts 0's and 1's for tokens 'O' and 'I' when no other meaning has already been assigned. It handles the pattern 'ALPHA 1ALPHA ALPHA-NUMERIC' when there is no parsed type on intrinsic ALPHA token types.

Example: `'JOHN C N1C0L1'`

Coded as:
```
'FIRST 1ALPHA ALPHA-NUMERIC'
```
Entry as may be found in the STDDEF table:
```
'FIRST 1ALPHA ALPHA-NUMERIC'
    INSERT PATTERN NAME DEF
    RECODE='FIRST MIDDLE LAST'
    EXPORT='FIRST(1) MIDDLE(1) LAST(1)'
    SPECIAL=HDTNP07 
```
Special routine HDTNP07 changes the '1's in 'N1C0L1' to 'I's and the '0' to an 'O', thus turning 'N1C0L1' to 'NICOLI'. This routine recodes alpha-numeric, alpha-1numeric and/or 1numeric-alpha types in this situation to last names.

Street Pattern Routines

HDTSP01

This street routine determines if the ambiguous route value is RD and changes its type to a street type. If the next token (Route#) could be a street direction, the routine changes it as well.

Example 1: `'10 MAIN RD 2'`

Coded as:
```
'HSNO ALPHA ROUTE ROUTE#'
```

Entry as may be found in the STDDEF table:

'HSNO ALPHA ROUTE ROUTE#'
    INSERT PATTERN STREET DEF
    RECODE='HSNO STREET-NAME ROUTE ROUTE#'
    SPECIAL=HDTSP01

Routine HDTSP01 recodes the pattern to:
```
'HSNO ALPHA TYPE ROUTE#' 
```

Then this pattern is looked up and the final result is the following:

'HSNO ALPHA TYPE ROUTE#'
     INSERT PATTERN STREET DEF
     RECODE='HSNO STREET-NAME TYPE ROUTE#'

Example 2: `'10 MAIN RD E'`

Coded as:
```
'HSNO ALPHA ROUTE ROUTE#'
```

Entry as may be found in the STDDEF table:

'HSNO ALPHA ROUTE ROUTE#'
    INSERT PATTERN STREET DEF
    RECODE='HSNO STREET-NAME ROUTE ROUTE#'
    SPECIAL=HDTSP01

Routine HDTSP01 recodes the pattern to:
```
'HSNO ALPHA TYPE S-DIRECTION' 
```

Then this pattern is looked up and the final result is the following:

'HSNO ALPHA TYPE S-DIRECTION'
INSERT PATTERN STREET DEF
RECODE='HSNO STREET-NAME TYPE S-DIRECTION'

HDTSP02

This street routine determines if the right most direction has the value of 'NO' followed by a number of some kind. If so, it turns the 'NO' into an ignore. Also, if the pattern is 'HSNO ROUTE ROUTE#', and ROUTE is a rural route, this routine turns HSNO into BOX#.

Example 1: `'10 MAIN ST NO 3'`

Coded as:
```
'HSNO ALPHA TYPE DIRECTION 1NUMERIC' 
```

Entry as may be found in the STDDEF table:

'HSNO ALPHA TYPE DIRECTION 1NUMERIC'
    INSERT PATTERN STREET DEF
    RECODE='HSNO STREET-NAME TYPE DIRECTION APARTMENT#'
    SPECIAL=HDTSP02

Routine HDTSP02 recodes the pattern to:

'HSNO STREET-NAME TYPE IGNORE APARTMENT#'

Then this pattern is looked up and the final result is the following:

'HSNO STREET-NAME TYPE IGNORE APARTMENT#
    INSERT PATTERN STREET DEF
    RECODE='HSNO STREET-NAME TYPE IGNORE APARTMENT#'

Example 2: `'35 RR 1'`

Coded as:
```
'HSNO ROUTE ROUTE#' 
```

Entry as may be found in the STDDEF table:

'HSNO ROUTE ROUTE#'
    INSERT PATTERN STREET DEF
    RECODE='HSNO STREET-NAME STREET-NAME'
    SPECIAL=HDTSP02

Routine HDTSP02 recodes the routine to:
```
'BOX# ROUTE ROUTE#' 
```

This pattern is then looked up and the final result is the following:

'BOX# ROUTE ROUTE#'
    INSERT PATTERN STREET DEF
    RECODE='BOX# ROUTE ROUTE#'

HDTSP03

This street routine handles routes without a number. 'RFD BOX 25' generates a token with a value of '1'.

Example: `'RFD BOX 25'`

Coded as:
```
'TYPE BOX BOX#' 
```

Entry as may be found in the STDDEF table:

'TYPE BOX BOX#'
    INSERT PATTERN STREET DEF
    RECODE='STREET-NAME BOX BOX#'
    SPECIAL=HDTSP03

Routine HDTSP03 recodes the pattern to:
```
'ROUTE ROUTE# BOX BOX#' 
```
The following pattern should be added to the STDDEF table to achieve the correct result from the recode:
```
'ROUTE ROUTE# BOX BOX#'
    INSERT PATTERN STREET DEF
    RECODE='ROUTE ROUTE# BOX BOX#'
```

HDTSP04

This street routine inspects the last APT-COMPLEX token for a value of 'COURT' and, if the next line does not have the line type of 'STREET', the routine changes the attribute of APT-COMPLEX to a type.

If the word entry in the word pattern table has the attribute APT-COMPLEX and the category code is 'HH32', it is handled the same as a value of 'COURT'.

In the first example, there is only one line containing street information. The special routine is invoked and 'COURT' is recoded to a street type. In the second example, there is an apartment line, plus an additional street line. The routine will not be invoked, leaving 'COURT' as an attribute of APT-COMPLEX.

Example 1

In this example, there is only one line containing street information. The special routine is invoked and 'COURT' is recoded to a street type.

10 RANDALL COURT'  = (S)TREET LINE
'LONDON, ENGLAND'  = (G)EOGRAPHY LINE

Coded as:
```
'HSNO ALPHA APT-COMPLEX'
```

Entry as may be found in the STDDEF table:

HSNO ALPHA APT-COMPLEX'
   INSERT PATTERN STREET DEF ATT=APARTMENT
   RECODE='APARTMENT# COMPLEX-NAME APT-COMPLEX'
   SPECIAL=HDTSP04

Routine HDTSP04 recodes the pattern to:
```
HSNO STREET-NAME TYPE' 
```

Then this pattern is looked up and the final result is the following:

HSNO STREET-NAME TYPE'
INSERT PATTERN STREET DEF
RECODE='HSNO STREET-NAME TYPE'

Example 2

In this example, there is an apartment line, plus an additional street line. The routine will not be invoked, leaving 'COURT' as an attribute of APT-COMPLEX.

10  RANDALL COURT        = (A)PARTMENT LINE
'100 KNIGHTSBRIDGE ROAD  = (S)TREET LINE
'LONDON, ENGLAND'        = (G)EOGRAPHY LINE

Coded as:

'HSNO ALPHA APT-COMPLEX'
'HSNO ALPHA TYPE'

Entry as may be found in the STDDEF table:

10  RANDALL COURT        = (A)PARTMENT LINE
'100 KNIGHTSBRIDGE ROAD  = (S)TREET LINE
'LONDON, ENGLAND'        = (G)EOGRAPHY LINE

Since there is an additional line with street information, Routine HDTSP04 will not change 'COURT' to an attribute of APT-COMPLEX, thus yielding the normal result.

HDTSP06

This street routine looks for coordinate address construct and if the directional letter information is attached to the numeric information, it splits the numeric information from the directional letter information.

Example:

100N 200e

Coded as:
```
'HSNO NUMERIC-1ALPHA
```
A split occurs and the resulting patter is:
```
'HSNO S-DIR STREET-NAME S-DIR'
```

Then this pattern is looked up and the final result is the following:

'HSNO S-DIR STREET-NAME S-DIR'
  INSERT PATTERN STREED DEF
  RECODE='HSNO S-DIR STREET-NAME S-DIR'

19 Data Quality: Name and Address Cleansing

Overview

About the Name and Address Operator

Parse Table Maintenance

Table Maintenance Process

Figure 19-1 Table Maintenance System Input and Output Files

Name and Address Operator Input Roles and Output Components

Input Roles

Table 19-1 Name and Address Operator Input Roles

Output Components

Table 19-2 Name and Address Operator Output Components

Name and Address Operator Supported Countries

Table 19-3 Countries Supported by the Name and Address Operator

Table Maintenance Compiler

Table 19-4 Parameter File Definitions for TABMAINT

Identifying Parsing Problems

Table Maintenance

Common Table Entries

Syntax for USERDEF Entries

INSERT or MODIFY

Abbreviations

Format

Symbols

Example: INSERT

Example: MODIFY

Patterns

Example: Name Pattern:

Example: Street Pattern:

DELETE

Abbreviations

Deleting Symbols

Example:

Deleting Synonyms

Example:

Deleting Patterns

Example:

Conventions for USERDEF Entries

Case

Comments

Line Lengths

Multiple Definitions

Example:

Export Line for Name Patterns

Example 1: RICHARD G SMITH AND MARY ROGERS

Example 2: RICHARD G AND MARY SMITH

Quotation Marks

Examples:

Masks

Mask Recodes

Patterns

Table 19-5 Intrinsic Attributes

Line Types

Name

Street

Geography

Miscellaneous

Positions

Beginning

Default

Ending

Logical vs. Physical Beginning and Ending Positions

Keywords

Table 19-6 Keywords and Descriptions

Attributes

Table 19-7 Attributes, Abbreviations, and Descriptions

Special Routines

Name Pattern Routines

HDTNP01

Example: 'JOHN THOMAS CHRIS THOMAS'

HDTNP02

Example: 'JOHN THOMAS AND JOYCE THOMAS'

HDTNP03

Example: 'JOHN THOMAS AND JOYCE ANN THOMAS'

HDTNP04

Example: 'JOHN ANTHONY THOMAS BILL THOMAS'

HDTNP05

Example: 'CASH-N-CARRY'

HDTNP06

Example: 'NICOLI JOHN'

HDTNP07

19
Data Quality: Name and Address Cleansing

Example: `INSERT`

Example: `MODIFY`

Example 1: `RICHARD G SMITH AND MARY ROGERS`

Example 2: `RICHARD G AND MARY SMITH`

Example: `'JOHN THOMAS CHRIS THOMAS'`

Example: `'JOHN THOMAS AND JOYCE THOMAS'`

Example: `'JOHN THOMAS AND JOYCE ANN THOMAS'`

Example: `'JOHN ANTHONY THOMAS BILL THOMAS'`

Example: `'CASH-N-CARRY'`

Example: `'NICOLI JOHN'`

Example: `'JOHN C N1C0L1'`

Example 1: `'10 MAIN RD 2'`

Example 2: `'10 MAIN RD E'`

Example 1: `'10 MAIN ST NO 3'`

Example 2: `'35 RR 1'`

Example: `'RFD BOX 25'`