                       F i l e    I n f o r m a t i o n 

* DESCRIPTION
SOUNDEX.TXT - Documentation for SOUNDEX.

* ASSOCIATED FILES
SOUNDEX.PAS
DEMO.EXE
DEMO.PAS
SOUNDEX.TXT

* KEYWORDS TURBO PASCAL V4.0 SOUND CONVERT STRING DOCUMENTATION 

==========================================================================
The Soundex "sounds-like" Code

As a child, I was intrigued with the concept of homophones --
different words that sound alike.  As a programmer, I find that
homophones can be a real pain.

In 1918, Margaret Odell and Robert Russell registered an algorithm
with the U.S. Patent Office that deals with homophones. Called the
Soundex Algorithm, it allows the retrieval of similar-sounding
words.

Description of a Soundex Code

A Soundex code is a four-character string where the first
character is an uppercase letter, and the remaining characters
are digits.  To create the code, certain letters (all vowels plus
'W' and 'Y') are ignored, and case is ignored for all letters. 
The digits are created by assigning different values to letter-
groups that have similar sounds.  Where adjacent duplicate
letters occur (e.g. 'tt' in letter), only the first letter is
used to generate the code.  If the word is not long enough to
generate a four-character code, the remaining characters are set
to '0'.
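
The rules above can be sketched in a few lines.  This is an
illustration in Python, not the Turbo Pascal code in SOUNDEX.PAS.
The digit groups are those of classic Soundex, which this document
does not list.  Skipping 'H' and discarding non-letters before
coding are assumptions, and the name soundex_code is hypothetical.

```python
# Digit groups of classic Soundex (assumed; not listed in this document).
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3",
          "L": "4", "MN": "5", "R": "6"}
DIGIT = {letter: digit
         for letters, digit in GROUPS.items()
         for letter in letters}


def soundex_code(word):
    """Return a four-character code following the rules described above."""
    # Keep ASCII letters only and ignore case (assumption: everything
    # else is simply dropped).
    letters = [ch.upper() for ch in word if ch.isascii() and ch.isalpha()]
    if not letters:
        return "0000"  # garbage in, consistent result out

    code = letters[0]  # first character is the uppercase first letter
    for prev, ch in zip(letters, letters[1:]):
        if len(code) == 4:
            break
        if ch == prev:
            continue  # adjacent duplicate letter: only the first counts
        digit = DIGIT.get(ch)  # None for vowels, W, Y and (assumed) H
        if digit:
            code += digit
    return code.ljust(4, "0")  # pad short codes with '0'


print(soundex_code("tricks"))    # T622
print(soundex_code("trix"))      # T620
print(soundex_code("@@#^%&*$"))  # 0000
```

Note that only repeated letters are collapsed here, as described
above.  Many other Soundex implementations also collapse adjacent
letters that share a digit, which would turn "tricks" into 'T620'.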

UNIT Soundex

An examination of the Soundex UNIT should make the algorithm
clearer.  The INTERFACE portion of this UNIT has three items
accessible to programs -- a TYPE definition, a FUNCTION, and a
constant.

TYPE CodeString gives a program a self-documenting, error-checking
TYPE for holding Soundex codes, and serves the same purpose for
the internal parts of the UNIT.

FUNCTION SoundexCode takes a string of any length, and converts
it to a proper Soundex code.  Notice that ridiculous input (e.g.
'@@#^%&*$') returns a consistent result ('0000').  However, since
short words can return the same value, it is up to the program to
distinguish short words from garbage -- my preference is to not
allow the function to get garbage in the first place.
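
One way to keep garbage away from the function is to screen the
input before coding it.  The sketch below is in Python rather
than Turbo Pascal, and both the name is_codable and the policy it
applies (letters plus spaces, apostrophes and hyphens, with at
least one letter) are assumptions chosen for illustration, not
part of the UNIT.

```python
# Characters tolerated alongside letters (an assumed policy).
ALLOWED_PUNCTUATION = set(" '-")


def is_codable(text):
    """Return True if text looks like a word or name worth coding."""
    has_letter = False
    for ch in text:
        if ch.isascii() and ch.isalpha():
            has_letter = True
        elif ch not in ALLOWED_PUNCTUATION:
            return False  # any other character marks the input as garbage
    return has_letter  # empty or punctuation-only input is rejected


print(is_codable("O'Brien"))   # True
print(is_codable("@@#^%&*$"))  # False
```

A program would call such a check first and ask for the input
again when it fails, so that '0000' never has to be interpreted.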

The Copyright constant is here to get around Borland's smart
linker.  If this (unused) constant appeared in the IMPLEMENTATION
section, it would not appear in the object code.  Placing it here
allows me to "brand" my programs.  If this were a commercial
application, I would probably not show this part of the interface
in my documentation -- thus harming only those who wished to
"borrow" my work.

The IMPLEMENTATION portion of the UNIT contains only the FUNCTION
SoundexCode.  This function simply loops through the supplied
string, stripping out unneeded characters, and converting the
remainder to the code.

Comparing Soundex Codes

Now that we have a Soundex code, how can we use it?  In
particular, how do the codes vary?

Most, but not all, words that sound alike will generate identical
codes that can be compared directly.  However, consider the word
"tricks", which generates a Soundex code of 'T622'.  One
misspelling could be "trix", which generates 'T620' -- close but
not quite the same.  Notice that the only difference is in the
last character of the code.  As a rough rule of thumb, words that
differ only in their later syllables produce codes that differ
only in their later characters.  This principle could be used to
expand a search for a keyword if the first lookup fails, for
example by comparing only the leading characters of the codes.
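
A widened search of this kind might look like the sketch below.
It is written in Python rather than Turbo Pascal.  The function
find_similar, the index layout (a table from code to words) and
the choice of a three-character prefix are all assumptions made
for illustration.

```python
def find_similar(code, index, prefix_len=3):
    """Look up words by Soundex code, widening the search on a miss.

    index maps a four-character code to a list of words.
    """
    exact = index.get(code)
    if exact:
        return list(exact)  # first lookup succeeded

    # First lookup failed: accept any code sharing the leading characters.
    prefix = code[:prefix_len]
    matches = []
    for other_code in sorted(index):
        if other_code.startswith(prefix):
            matches.extend(index[other_code])
    return matches


# 'T622' is the code given above for "tricks"; 'L360' for "letter"
# assumes the classic Soundex digit groups.
index = {"T622": ["tricks"], "L360": ["letter"]}
print(find_similar("T620", index))  # ['tricks']
```

Here the code for "trix" misses on the exact lookup but still
finds "tricks" through the shared prefix 'T62'.  A shorter prefix
finds more candidates at the cost of more false matches.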

Using Soundex Codes

The uses of a "sounds-like" search are many.  Consider a Pascal
compiler that, when faced with an undefined variable, suggests
like-sounding variables of the correct type which have already
been defined.  Or consider an order-entry system that looks for
like-sounding customer names -- or even like-sounding products.
To use a cliche: the possibilities are limited only by the
imagination.



