Quality Systems Services
Copyright (c) Quality Systems Services 1995 - 2015
Name Data - The Use and Importance of Names
Name data has many uses in business, finding applications in:
Marketing
De-duplication
Data cleansing
Accurate name data is important to both your business and your customers. For your business it represents essential information for marketing and communication with individual customers, and a valuable independent information source for de-duplication of your databases. For your customers, a correctly spelled name on a mailing signals personal attention and competence, whereas a misspelled name inspires no confidence in your business's care for the customer or its attention to detail. A mailing to an inappropriate name, for example the name of a deceased person or a spoof name maliciously supplied, may cause offence to the customer.

What's in a Name?
Names are a rich source of information. In the simplest case we might extract from a name the following components:
Titles
Forenames
Initials
Surnames
This information is typically explicitly represented in the name. In addition, with appropriate processing we might extract more detailed personal information about an individual:
Gender e.g. Miss J. Smith
Marital Status e.g. Mrs J. Smith
Qualifications e.g. Dr. J. Smith PhD
Sensitive Information e.g. Mr J. Smith (deceased)
Inappropriate Information e.g. Mr Donald Duck

Representation of Name Data
We can see that a strategy for utilising name data to its full potential should have several properties. It must provide easy access to the explicit information in the name, and to implicit information such as gender, which can be useful for tasks such as de-duplication. It must also be able to manipulate the name data flexibly to meet the varying needs of tasks including indexing, de-duplication, addressing, and salutations. For example, the name 'John Smith' might optimally be represented in several forms depending on the task:
De-duplication 'MR J SMITH' or 'JOHN SMITH'
Addressing 'Mr. J. Smith'
Salutation 'Dear Mr. Smith'
The clear choice for representation is an explicit field format in which we have fields for title, initials, forenames, surname etc. This representation gives maximum flexibility to manipulate the name for the variety of uses to which it may be put and makes algorithms for tasks such as the generation of correct salutations straightforward.
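As a rough illustration of why an explicit field format makes these tasks straightforward, the sketch below derives the three forms of 'John Smith' listed above from one record. The class name, field names, method names and the 'Dear Sir or Madam' fallback are assumptions made for this example and are not taken from any QSS product.

```python
from dataclasses import dataclass


@dataclass
class Name:
    """One person's name held in explicit fields (illustrative field set)."""
    title: str = ""
    forenames: str = ""
    initials: str = ""   # initials of names that are not spelled out in `forenames`
    surname: str = ""

    def full_initials(self) -> str:
        """Initials of the forenames, followed by any stored initials."""
        from_forenames = [word[0].upper() + "." for word in self.forenames.split()]
        return " ".join(from_forenames + self.initials.split())

    def address_line(self) -> str:
        """Addressing form such as 'Mr. J. Smith'."""
        parts = [self.title, self.full_initials(), self.surname]
        return " ".join(p for p in parts if p)

    def dedup_key(self) -> str:
        """Upper-cased, unpunctuated key such as 'MR J SMITH'."""
        return self.address_line().replace(".", "").upper()

    def salutation(self) -> str:
        """Salutation such as 'Dear Mr. Smith'; the fallback is an assumption."""
        if self.title and self.surname:
            return f"Dear {self.title} {self.surname}"
        return "Dear Sir or Madam"


john = Name(title="Mr.", forenames="John", surname="Smith")
print(john.dedup_key())     # MR J SMITH
print(john.address_line())  # Mr. J. Smith
print(john.salutation())    # Dear Mr. Smith
```

Because every output is computed from the same fields, a correction to one field (for example the title) is reflected in all derived forms.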

Acquisition of Name Data
Often when acquiring personal data from disparate data sources we find that a variety of name formats have been adopted. In more mature data sources we would expect to find some variation of a field-based format, which may or may not be as complete as is desirable. For example, we might find a name such as 'Mr. John H. Smith' represented in various ways:
Title | Forename | Initials | Surname | Full Initials
Mr.   | John     | H.       | Smith   | J. H.
Mr    | John     | J H      | Smith   |
MR    | JOHN     | H        | SMITH   |
Variability may be encountered in conventions of punctuation, casing, completeness of fields, assignment of data to fields etc. Because data is incorporated from multiple sources and entered by multiple data capture operators, we may find much inconsistency across a single data set, and much more across multiple data sets. Clearly when consolidating such data it is desirable to adopt a single comprehensive format, standardising and validating all data against this format.

Free Format Names
A more difficult situation, which unfortunately is often encountered, is the representation of an individual's name in a single 'free format name' field. In this format we may encounter a wide variety of name forms, for example:
Dr J McDonald
Dr. J. McDonald
Doctor John Mc Donald
Doctor John Mc-Donald
Dr. J. McDonald-Smythe
Dr. and Mrs. McDonald
Dr J / Mrs L McDonald
Dr J. McDonald PhD.
Dr J McDonald (deceased)
Here we see just a few sources of variability - capitalisation, punctuation, spacing, alternative forms for titles, hyphenation, fields representing more than one person, and additional information such as qualifications and annotations. Processing such free format data clearly requires very flexible processing techniques. Traditional approaches to processing such data have been naive and applied in an ad hoc fashion on an individual dataset basis, for example assuming that the first word is the title, the last is the surname, etc. Such simple processing is unpredictable and has many limitations: for example, it cannot cope reliably with data in which the free format varies, and it cannot recognise spelling mistakes or non-name data. Often one encounters field-formatted names in which errors have clearly been introduced by inadequate techniques for conversion from free format names.

Corrupted and Non-Name Data
Additional problems are presented by errors in the name data which may have originated at the data entry stage or by inadequate data consolidation techniques, for example:
Misspelled data e.g. Mis Smith
Corrupted data e.g. XX Mr Smith
Other problems often found include:
Company names e.g. 'Dixons The Butchers'
Non-name data e.g. '17 Smith Street'
Undesirable data e.g. 'Mr F. Flintstone'
Identification of non-name or undesirable data and correction of misspelled or corrupted names is clearly a very important task. Such processing has however been far beyond the abilities of conventional name processing systems.
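To make the limitations of ad hoc processing concrete, the sketch below applies the "first word is the title, last word is the surname" rule mentioned earlier to some of the example inputs from this section. The function name and the output keys are invented for illustration; this shows the naive approach, not how NameBase works.

```python
def naive_split(free_text: str) -> dict:
    """Ad hoc rule: the first word is the title, the last word is the surname."""
    words = free_text.split()
    return {
        "title": words[0] if words else "",
        "middle": " ".join(words[1:-1]),
        "surname": words[-1] if len(words) > 1 else "",
    }


samples = [
    "Dr J McDonald",             # handled as intended
    "Doctor John Mc Donald",     # surname becomes 'Donald'
    "Dr J McDonald (deceased)",  # surname becomes '(deceased)'
    "Dr. and Mrs. McDonald",     # two people collapsed into one record
    "XX Mr Smith",               # corruption 'XX' is taken as the title
    "17 Smith Street",           # an address is accepted as a name
]
for sample in samples:
    print(f"{sample!r:30} -> {naive_split(sample)}")
```

Only the first sample is split as intended. The rule never rejects an input, so every other sample silently produces a wrong record.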

NameBase
To address the needs for flexible name processing outlined above, QSS has developed an entirely new name processing system, NameBase. The system has been custom designed, incorporating innovative data representations and processing techniques. NameBase processes both field-formatted and free format name data from any database source, with capabilities including:
Validation of field-formatted names
Conversion of free format names to field format
Flexible user-defined field format
Standardise capitalisation, punctuation, hyphenation etc.
Expand or abbreviate titles
Spelling and punctuation correction
  Automatic insertion/correction of punctuation
  Automatic correction of spelling
  Suggestions for ambiguous spelling correction
Detection of non-name data
  Company names
  Addresses, delivery notes etc.
  Undesirable, deceased, spoof names
Creation of value-added information
  Full initials from forenames
  Determination of gender
  Gender-correct default titles for salutation etc.
Auditing information
  Reason for failing name
  Type of editing applied to correct name
Example Output
Perhaps the most compelling demonstration of the capabilities of NameBase is to examine some example output. This table shows a few examples of free format name fields taken from real databases and their processing by NameBase. In this case the task was to reformat free format names into separate fields, standardising casing and punctuation, abbreviating titles, extracting gender where possible, and generating a full set of initials for de-duplication. Name entries representing multiple people were to be split by NameBase into multiple records.
The 'Status' field encodes the status of processing for a record.

These records show examples of exact name matches, with status 'E' (EXACT), with a variety of casing, spacing and punctuation, and nontrivial surname 'De Hutiray'.
ID | Name                 | Status | Person | Title | Forenames | Initials | Surname    | Gender | Full Initials
1  | MR D. ALLEN          | E      | 1      | Mr.   |           | D.       | Allen      | M      | D.
2  | Mr A G B De Hutiray  | E      | 1      | Mr.   |           | A. G. B. | De Hutiray | M      | A. G. B.
3  | Captain Frank Gurney | E      | 1      | Capt. | Frank     |          | Gurney     | M      | F.
4  | A.Tottingham         | E      | 1      |       |           | A.       | Tottingham | U      | A.
These show how NameBase has recognised a field representing two people, and split this into two records, assigning titles and initials correctly to the appropriate individuals. Other options supported include keeping both individuals as a single record with compound title field.
ID | Name                  | Status | Person | Title | Forenames | Initials | Surname  | Gender | Full Initials
5  | Mr M J & Mrs Davidson | E      | 1      | Mr.   |           | M. J.    | Davidson | M      | M. J.
   | Mr M J & Mrs Davidson | E      | 2      | Mrs.  |           |          | Davidson | F      |
6  | Rev & Mrs Payne       | E      | 1      | Rev.  |           |          | Payne    | U      |
   | Rev & Mrs Payne       | E      | 2      | Mrs.  |           |          | Payne    | F      |
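The splitting of a two-person field can be sketched roughly as follows. This is only an illustration under strong assumptions: the people are joined by '&', '/' or 'and', they share a single surname written last, and every remaining word is an initial. The title table and all names in the code are hypothetical, and surname casing is left as written.

```python
import re

# Hypothetical, tiny title table: lower-case key -> standardised form.
TITLES = {"mr": "Mr.", "mrs": "Mrs.", "miss": "Miss", "dr": "Dr.", "rev": "Rev."}


def split_people(free_text: str) -> list[dict]:
    """Split a field such as 'Mr M J & Mrs Davidson' into one record per person."""
    pieces = re.split(r"\s*(?:&|/|\band\b)\s*", free_text, flags=re.IGNORECASE)
    parts = [piece.split() for piece in pieces]
    surname = parts[-1].pop()  # assumption: one shared surname, written last
    people = []
    for number, words in enumerate(parts, start=1):
        title = ""
        if words and words[0].rstrip(".").lower() in TITLES:
            title = TITLES[words.pop(0).rstrip(".").lower()]
        # Assumption: whatever remains after the title is a run of initials.
        initials = " ".join(w.rstrip(".").upper() + "." for w in words)
        people.append({"person": number, "title": title,
                       "initials": initials, "surname": surname})
    return people


for record in split_people("Mr M J & Mrs Davidson"):
    print(record)
```

For 'Mr M J & Mrs Davidson' this yields person 1 with title 'Mr.' and initials 'M. J.', and person 2 with title 'Mrs.' and no initials, both with surname 'Davidson', matching the assignment shown in the table.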
These records have status 'EE' (EXACT with EDIT) indicating that NameBase has successfully extracted a name after a simple editing of the field. Here are examples of corrupted entries, typing errors, and superfluous components e.g. 'Attn:'. In each case the correct name is extracted automatically.
ID | Name                 | Status | Person | Title | Forenames | Initials | Surname    | Gender | Full Initials
7  | A Mr TURNER          | EE     | 1      | Mr.   |           |          | Turner     | M      |
8  | ,ISS C STYLES        | EE     | 1      | Miss  |           | C.       | Styles     | F      | C.
9  | . Mrs Marsden        | EE     | 1      | Mrs.  |           |          | Marsden    | F      |
10 | Attn: Chris Thompson | EE     | 1      |       | Chris     |          | Thompson   | U      | C.
11 | Attn~: Mrs Shipley   | EE     | 1      | Mrs.  |           |          | Shipley    | F      |
12 | C.O Mrs Robertshaw   | EE     | 1      | Mrs.  |           |          | Robertshaw | F      |
13 | C/O Mr Pearce        | EE     | 1      | Mr.   |           |          | Pearce     | M      |
14 | Customer Miss Mead   | EE     | 1      | Miss  |           |          | Mead       | F      |
This record shows an example of an ambiguous name field (status 'A'). In this case NameBase provides two interpretations of the entry 'MR C O TOOLE', either 'Mr. C. O. Toole' or 'Mr. C. O'Toole', since it has recognised that omission of punctuation from the input may have hidden the common surname 'O'Toole'. NameBase provides options to control how such ambiguities are processed, selecting a best guess or leaving the choice for manual post-processing.
ID | Name         | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials
15 | MR C O TOOLE | A      | 1      | Mr.   |           | C. O.    | Toole   | M      | C. O.
   | MR C O TOOLE | A      | 1      | Mr.   |           | C.       | O'Toole | M      | C.
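One way to picture this kind of ambiguity handling is the sketch below, which returns every reading of an already tokenised field instead of a single guess. The surname set, the function name and the rule itself (a trailing initial 'O' may be a surname prefix that has lost its apostrophe) are assumptions for illustration only.

```python
# Hypothetical stand-in for a list of known surnames that begin with O'.
KNOWN_O_SURNAMES = {"O'Toole", "O'Brien", "O'Neill"}


def interpret(title: str, initials: list[str], surname: str) -> tuple[str, list[dict]]:
    """Return a status code and all readings, the most literal reading first."""
    surname = surname.capitalize()
    readings = [{"title": title, "initials": initials, "surname": surname}]
    # A final initial 'O' followed by a matching surname may hide an O' prefix.
    if initials and initials[-1].upper() == "O" and "O'" + surname in KNOWN_O_SURNAMES:
        readings.append({"title": title, "initials": initials[:-1],
                         "surname": "O'" + surname})
    status = "A" if len(readings) > 1 else "E"
    return status, readings


status, readings = interpret("Mr.", ["C", "O"], "TOOLE")
print(status)
for reading in readings:
    print(reading)
```

A caller can then either pick the first reading as a best guess or queue the record for manual review, mirroring the two options described above.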
This record gives another ambiguous example, with status 'I' (INEXACT). In this case the surname 'Hutchson' appears to be a misspelling and NameBase suggests six corrections to the spelling.
ID | Name         | Status | Person | Title | Forenames | Initials | Surname   | Gender | Full Initials
16 | E W Hutchson | I      | 1      |       |           | E. W.    | Hutchason | U      | E. W.
   | E W Hutchson | I      | 1      |       |           | E. W.    | Hutcheon  | U      | E. W.
   | E W Hutchson | I      | 1      |       |           | E. W.    | Hutcheson | U      | E. W.
   | E W Hutchson | I      | 1      |       |           | E. W.    | Hutchison | U      | E. W.
   | E W Hutchson | I      | 1      |       |           | E. W.    | Hutchon   | U      | E. W.
   | E W Hutchson | I      | 1      |       |           | E. W.    | Hutchson  | U      | E. W.
These records show examples of inputs which have been rejected by NameBase. Record 17 is an example of an address line (status 'NA' - NO MATCH: ADDRESS), records 18-21 are correctly failed as company names (status 'NC' - NO MATCH: COMPANY) and record 22 is given status 'ND' - NO MATCH: DECEASED, indicating that the name refers to a deceased individual.
ID | Name                   | Status | Person | Title | Forenames  | Initials | Surname          | Gender | Full Initials
17 | 165 AINSLIE STREET     | NA     | 1      |       | Ainslie    |          | Street           | U      | A.
   | 165 AINSLIE STREET     | NA     | 1      |       |            |          | Ainslie Street   | U      |
18 | BEVAN FUNNELL LTD      | NC     | 1      |       | Bevan      |          | Funnell          | M      | B.
   | BEVAN FUNNELL LTD      | NC     | 1      |       |            |          | Bevan Funnell    | U      |
19 | Vale Royal Fresh Foods | NC     | 1      |       | Vale       |          | Royal Fresh      | U      | V.
   | Vale Royal Fresh Foods | NC     | 1      |       | Vale Royal |          | Fresh            | U      | V. R.
   | Vale Royal Fresh Foods | NC     | 1      |       |            |          | Vale Royal Fresh | U      |
   | Vale Royal Fresh Foods | NC     | 1      |       | Royal      |          | Vale Fresh       | U      | R.
20 | Aaa Appliances         | NC     | 0      |       |            |          |                  | U      |
21 | Anglia Co-OP           | NC     | 1      |       | Anglia     |          |                  | F      | A.
22 | Mr J Smith deceased    | ND     | 1      | Mr.   |            | J.       | Smith            | M      | J.
These records give examples of entries which match exactly as names but are identified by NameBase as 'suspicious': record 23 (status 'EC' - EXACT but COMPANY like) because it looks like a company name, and record 24 (status 'EU' - EXACT but UNDESIRABLE) because, while being a valid name, it is potentially undesirable (Mr. Mickey Mouse?).
ID | Name        | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials
23 | Anglia Coop | EC     | 1      |       | Anglia    |          | Coop    | F      | A.
24 | Mr M Mouse  | EU     | 1      | Mr.   |           | M.       | Mouse   | M      | M.

NameBase Technology
These tables of example results should demonstrate clearly the flexibility with which NameBase is able to treat name data, operating correctly on complex multi-person names in the presence of typing errors, misspelling, incorrect formatting of input fields, and many other errors encountered in real world databases. The key to the system's accuracy and robustness is the specific system design adopted by QSS and unique to NameBase. Key features include:
Very large database of categorised name components
Formal model of valid name forms
Formal error correction rules
System for explaining and auditing matches
Open architecture
At the heart of NameBase lie two main components - a very large database of categorised name components, and an advanced pattern matching engine using formal specifications of valid name forms and error correction strategies.

Name Component Database
The database contains not only conventional name components such as forenames and surnames, but also surname prefixes, alternative title forms, company name indicators, etc., all annotated with information including classification, gender and frequency. Using such a large database gives NameBase a unique lead over conventional name processing, which has tended to rely on lists of a few tens or hundreds of name components. A proprietary data format allows lightning-speed access to the database while supporting ultra-efficient searches for misspelled entries.
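The idea of a categorised component table with tolerant lookup can be sketched as below. The table contents, the category labels and the function name are invented for illustration, and Python's standard `difflib` module stands in for the proprietary search format, whose details are not described here.

```python
import difflib

# Hypothetical miniature component table: component -> (category, gender).
COMPONENTS = {
    "mr": ("title", "M"),
    "mrs": ("title", "F"),
    "john": ("forename", "M"),
    "de": ("surname_prefix", "U"),
    "ltd": ("company_indicator", "U"),
    "hutcheon": ("surname", "U"),
    "hutcheson": ("surname", "U"),
    "hutchison": ("surname", "U"),
    "smith": ("surname", "U"),
}


def lookup(word: str, cutoff: float = 0.8):
    """Return ('exact' | 'inexact' | 'unknown', list of (component, category, gender))."""
    key = word.strip(".,").lower()
    if key in COMPONENTS:
        return "exact", [(key, *COMPONENTS[key])]
    # Fall back to approximate matching for possibly misspelled entries.
    near = difflib.get_close_matches(key, list(COMPONENTS), n=5, cutoff=cutoff)
    kind = "inexact" if near else "unknown"
    return kind, [(candidate, *COMPONENTS[candidate]) for candidate in near]


print(lookup("Mrs."))
print(lookup("Hutchson"))
print(lookup("Zzzz"))
```

With this toy table, 'Hutchson' is not found exactly but returns the three similar surnames as candidates. A linear scan with `difflib` would not scale to a very large database; it is used here only to show the lookup behaviour.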

Pattern Matching Engine
Utilising the database is an advanced pattern matching engine which uses formal specifications of valid name forms to interpret input fields, constituting a well-defined model of what is and isn't a valid name. The engine in addition uses formally defined error correction strategies to correct errors in input, reinterpret ambiguous name components or edit input to achieve an interpretation. Whereas conventional approaches to name processing have used hard-coded procedures for processing input, the use by NameBase of formally-defined techniques allows the system to explain and justify editing decisions made during processing. Conventional systems have typically only been able to output a 'best guess'; NameBase by contrast outputs a choice of interpretations if the input is ambiguous, status codes indicating reasons for failure, success, or ambiguity of the input, and can output detailed information of the editing performed to match a corrupted input so that intelligent selective post-processing of the results may be carried out to ensure maximal accuracy.
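A much simplified sketch of this style of processing is shown below: each token is given a category, the sequence of categories is compared against a declared set of valid name forms, and a declared correction rule (dropping noise tokens such as 'Attn:') is tried if the first comparison fails, with the edit recorded for auditing. The categories, forms, word lists and the generic 'NO MATCH' status are all assumptions for illustration; the actual NameBase specifications are not published in this document.

```python
# Hypothetical, tiny component lists.
TITLES = {"mr", "mrs", "miss", "dr", "rev"}
NOISE = {"attn", "c/o", "customer"}

# Declared model of valid name forms, as sequences of token categories.
VALID_FORMS = {
    ("TITLE", "WORD"),
    ("TITLE", "INITIAL", "WORD"),
    ("TITLE", "INITIAL", "INITIAL", "WORD"),
    ("TITLE", "WORD", "WORD"),
    ("INITIAL", "WORD"),
    ("WORD", "WORD"),
}


def categorise(token: str) -> str:
    key = token.strip(".,:~").lower()
    if key in TITLES:
        return "TITLE"
    if key in NOISE:
        return "NOISE"
    if len(key) == 1 and key.isalpha():
        return "INITIAL"
    return "WORD"


def match(free_text: str) -> dict:
    tokens = free_text.split()
    cats = [categorise(t) for t in tokens]
    if tuple(cats) in VALID_FORMS:
        return {"status": "E", "form": tuple(cats), "edits": []}
    # Correction rule: delete noise tokens, retry, and record what was removed.
    kept = [c for c in cats if c != "NOISE"]
    removed = [t for t, c in zip(tokens, cats) if c == "NOISE"]
    if removed and tuple(kept) in VALID_FORMS:
        edits = [f"removed {t!r}" for t in removed]
        return {"status": "EE", "form": tuple(kept), "edits": edits}
    return {"status": "NO MATCH", "form": tuple(cats), "edits": []}


for text in ["Mr J Smith", "Attn: Mrs Shipley", "165 AINSLIE STREET"]:
    print(text, "->", match(text))
```

Because the forms and the correction rule are data and not hard-coded branches, the result can state which form matched and which edits were applied, which is the property the paragraph above attributes to formally-defined techniques.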

Open Architecture and Data-centric Approach
A third key element of the design of NameBase is its open architecture. The system is not tied to any particular platform, giving maximum flexibility for incorporation into your existing system environments. An important aspect is that the system's approach is essentially data-centric. Output data from the system contains rich information for manual post-processing, meaning that such processing can be carried out on existing database terminals with low requirements in terms of processing power. Incorporation into complex multi-user environments becomes a simple task supported by the existing database infrastructure.