
                       Looking for Information....


                   Search and Ye Shall Find.....Maybe!

                  Copyright 1995 by  Peter Neuendorffer


 What it means to search for information on computers and the Internet can
 be illustrated by using real-life comparisons. Much of this article is
 about common sense that is rooted in experience and has been
 incorporated into computer programming theory and practice.

 A central problem of the information explosion, new or old, is how to
 find specific information. I have a consultant friend who had to
 uninstall a program that he couldn't find because the user had put every
 single file into a single large directory on the hard disk.  Liken this
 to having a big warehouse with no shelves. As files proliferate on
 computers, file names and extensions don't give a clue as to what's
 inside an individual file.  So what that Windows 95 has come up with the
 long file name?  Just another elusive title and not much else!

 There are a number of important reference points for a search and for
 search tools: knowing what you are looking for, which ballpark it
 probably is in, and how much time and work space you have.  How much
 stuff you will have to search (the size of the search space) and what
 you know about how it is organized are also relevant.  That's a lot for
 most people to keep track of and to understand.  The noted information
 scientist Norbert Wiener is quoted as asking "Am I walking to lunch or
 coming from lunch? I don't know!"  Not only did he not know what he was
 looking for, he probably didn't even know where he was!

 Finding one's sense of direction when undertaking a search can be
 crucial. The other day I mislaid my house keys. I knew what I was looking
 for. Surely they were in one of the correct obvious places. I ransacked
 the apartment, to no avail. This called for logic.  Could I reconstruct
 exactly what I did when I last came in?  It came to me that as I was
 unloading my bundle of bathroom sundries the telephone rang.  Sure
 enough, my keys were behind the toilet paper in the broom closet.  A
 likely place. I had the What, and the time and space, but not the Where,
 nor had I accounted for human error.

 Indexing of information is common from your local phone tickler to the
 largest mainframes.  One technique, hashing, involves storing
 information not simply one item after another, but at a location
 computed mathematically from the item itself, which makes retrieval
 faster.
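
 As a minimal sketch of the hashing idea, assuming a toy phone list: the
 names, numbers, bucket count, and the character-sum hash below are all
 made up for illustration and are far simpler than what real systems use.
 The point is only that the storage slot is computed from the name, so a
 lookup goes straight to one small bucket.

```python
NUM_BUCKETS = 8  # hypothetical size, chosen only for illustration


def bucket_for(name):
    """Toy hash: sum of the character codes, modulo the bucket count."""
    return sum(ord(ch) for ch in name) % NUM_BUCKETS


def build_index(entries):
    """Place each (name, number) pair in the bucket computed from its name."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for name, number in entries:
        buckets[bucket_for(name)].append((name, number))
    return buckets


def lookup(buckets, name):
    """Search only the one bucket where the name would have been stored."""
    for stored_name, number in buckets[bucket_for(name)]:
        if stored_name == name:
            return number
    return None


# Made-up sample data.
phone_list = [("Martha", "555-0101"), ("Jones", "555-0102"),
              ("Friendly", "555-0103")]
index = build_index(phone_list)
print(lookup(index, "Jones"))
```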

 Searching keywords on the Internet can be a pain, because you often
 don't have the What.  If you do, you don't have both the time and space,
 as a field of 200,000 entries pops up for the word "computer".  How do
 you find what you're looking for if the search system doesn't understand
 what you want?  Even for these crude searches, Internet search engines
 use indexes.  The search engine knows something about the data it is
 searching and how it is organized, in addition to just how to search it.
 A simple tool in computer science for searching lists is taken from how
 we search for a name in a paper phone book: the Binary Search.

 The Binary Search must have an alphabetized list to succeed, because it
 relies on knowing that the list is in alphabetical order.  It uses this
 logic: When we look for Jones in the phone book, we unconsciously turn
 to the middle (actually to the left of middle) of the book.  If we come
 up with Friendly, we know Jones lies further on, so we turn to the
 middle of the remaining pages.  And so on, back and forth by ever
 smaller chunks, till we get to the Jones page.  This search is a lot
 faster than starting at the first name on page one and then proceeding
 name by name.  We are cutting the remaining part of the book in half at
 every step.
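
 The phone book procedure can be sketched in a few lines.  This is an
 illustrative version, not taken from any particular program; the sample
 names and the step counter are assumptions added to show how few looks
 are needed.

```python
def binary_search(names, target):
    """Search a sorted list by repeated halving.

    Returns (position, steps); position is None if target is absent.
    """
    low, high = 0, len(names) - 1
    steps = 0
    while low <= high:
        steps += 1
        middle = (low + high) // 2
        if names[middle] == target:
            return middle, steps
        if names[middle] < target:
            low = middle + 1   # target is later in the book
        else:
            high = middle - 1  # target is earlier in the book
    return None, steps


# Made-up phone book; it must already be in alphabetical order.
phone_book = ["Adams", "Baker", "Friendly", "Garcia",
              "Jones", "Miller", "Smith"]
position, steps = binary_search(phone_book, "Jones")
print(position, steps)
```

 Each pass through the loop discards half of what remains, which is why
 the list must be sorted first: on an unsorted list the "earlier or
 later" decision would be meaningless.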

 This algorithm needs only about log2(N) steps for a list of N names:
 around twenty looks for a million entries.  Many problems in computer
 science are far less tractable; for the hardest ones, such as the
 NP-complete problems, the only known methods could take longer to solve
 a large case than there is time in the universe.  For example, very
 large prime numbers with special properties are hard enough to find that
 two of them have reportedly been covered by a patent.  Finding the exact
 location of Earth relative to Mars in 1000 years is a form of the
 three-body problem, which has no general exact solution, so short of
 waiting 1000 years we can only approximate the answer.

 Fortunately, we are not usually looking for such weighty information, but
 rather, something more like our Aunt Martha's phone number in our
 personal information manager (PIM).  With such a small field to search
 on our computer, the search is fast.  We know the where, the what, have
 the time and work space, and the software knows a lot about how the PIM
 warehouse is organized.

 However, when we get out into the bigger world, it is not simply who owns
 the information but who is able to find it that is the important key.
 Just talking to the IRS or Social Security can be a trial as you wait
 endlessly through a hold pattern with recorded messages like "Please do
 not hang up, or your call will be further delayed." or "Don't give up,
 we'll be with you momentarily."  Maybe!


 If we attempt to index large amounts of information, that's OK, but we
 will have to be prepared to update the index constantly.  When I worked
 for a department store, we were constantly counting inventory as
 shipments came and went and merchandise was sold. The stock numbers were
 set up differently for every type of item and vendor.  Certainly the
 point-of-sale (POS) system, which keeps those lists indexed, is a vast
 improvement.  If only discount coupons didn't bog down the supermarket
 checkout line, violating all of my time, space, and
 know-what-you-are-looking-for rules.

 People question the computer's accuracy, discounting the answer, and
 this is another story.  "That can't be right, check again." or "I hate
 computers, they're always wrong."  If people get their information and
 then misread it or don't use it or reject it, what is the use of it all?
 The famous case in point in science is the University of Utah cold
 fusion experiment, which astounded scientists because it would mean the
 ancient Alchemists were right, at least in spirit, about turning one
 element into another.  Unfortunately, the certainty of the results was
 deemed to be in the noise zone, or not much better than fiction.

 I have a friend who works as an order picker in a warehouse.  He
 punches stock changes into a computer when he physically moves items
 around.  But what if the store's information gets out of kilter with
 reality?  Sales go up, inventory goes down some on paper, but actually
 there is much less in the warehouse than anyone realizes.  Although the
 information was wrong, it was assumed to be right.  The cozy computer
 system then was consistent, but only with itself, not with the real
 stockroom it was supposed to reflect!  That essential match was going to
 pot.  Somewhat like the remark famously attributed to President Hoover
 as the Depression set in: "Prosperity is just around the corner."

 According to Newsweek magazine, a state-of-the-art automated warehouse
 for running shoes ground to a halt.  The workers found that they could
 not move anything into or out of the facility, even though conveyor
 belts kept spewing out merchandise for non-existent orders.  We live in
 the real world, not inside a computer.  Information not matching reality
 is garbage, or at best a theory or a model.

 That kind of organization was described in Lewis Carroll's
 Alice in Wonderland.  At the tea party, the Mad Hatter ordered everyone
 to move down the table when the dishes were dirty.  Garbage in, garbage
 out.  But still, if the search system is not playing with a full deck,
 you can pose a search question which is perfectly reasonable and still
 get garbage for an answer.  It's like dealing with one of those
 Bostonians who are noted for firmly giving patently wrong street
 directions to passers-by who have lost their way.

 What if the search system doesn't have the foggiest idea of where the
 item that you're looking for is located in your search space?  It knows
 nothing about its warehouse except that it contains files which are in
 text format.  You are looking for the word "computer" on your hard
 drive.  As far as the system knows, it is as likely to be in the first
 line of the first file as in the last line of the last file.  You could
 index the drive, but that takes a lot of time, and some space.

 Without an index to use, you start out with zero certainty of finding a
 match, but as you move ahead, you are more sure of finding your answer.
 By the end, you have a 100% probability if it's there at all.  If you
 skip to the middle to start, this is just like rearranging your
 warehouse: the search will still take the same time, or actually longer,
 since you have to rearrange the warehouse.

 The fruitless search takes the longest.  If you ask a salesperson to
 look for an item in the color and size you want, he may come back 15
 minutes later to report that "We don't have it."  It takes so long
 precisely because they do not have it.  It's either there or it isn't,
 but he had to search the entire stockroom to find the answer.  One
 assumes that most stockrooms are well-organized and is surprised when
 this is not the case.
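
 A rough sketch of the unindexed search makes the cost of the fruitless
 case visible.  The stockroom list and the counter of lines examined are
 assumptions made up for illustration.

```python
def linear_search(lines, word):
    """Scan the lines in order, with no knowledge of how they are arranged.

    Returns (line_number, lines_examined); line_number is None on a miss.
    """
    examined = 0
    for line_number, line in enumerate(lines, start=1):
        examined += 1
        if word in line:
            return line_number, examined
    return None, examined


# Made-up stockroom records, in no particular order.
stockroom = ["red shirt small", "blue shirt large",
             "green hat medium", "blue hat small"]

print(linear_search(stockroom, "green"))   # found part way through
print(linear_search(stockroom, "purple"))  # every line had to be checked
```

 A hit can stop early, but "we don't have it" can only be reported after
 every line has been examined.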

 Let's get back to searching your PC for text and presume you do not have
 an index.  You are starting from scratch.  Your work space and time are
 fixed.  You only have till three o'clock, and the computer only has so
 much free disk and memory space to work with.  You can narrow down the
 size of the search space by looking only for DOC files and ignoring
 categories of files like programs.  Or perhaps start the search in a
 subdirectory that is likely to have your information, like \DOC.  The
 file names do tell you something about how the data is organized, but
 they are like labels on packages: they describe the contents only
 imprecisely.  They are somewhat informative, but not detailed, as that
 would take up much more space.  Unfortunately, searching for a
 lower-case a is different from searching for an upper-case A, because
 words like alice and ALICE are stored differently on the computer.
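
 The narrowing steps above might look something like the following
 sketch.  It assumes the DOC files are plain text; the function name
 find_word and the sample file names are hypothetical.  Folding both the
 file contents and the search word to lower case sidesteps the alice
 versus ALICE problem.

```python
import os
import tempfile


def find_word(root, word, extension=".doc"):
    """Yield paths under root whose name ends in extension and whose text
    contains word, ignoring upper and lower case in both comparisons."""
    wanted = word.lower()
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            if not filename.lower().endswith(extension):
                continue  # skip programs and other categories of files
            path = os.path.join(folder, filename)
            with open(path, "r", errors="ignore") as handle:
                if wanted in handle.read().lower():
                    yield path


# Demonstration on a throwaway directory holding made-up files.
with tempfile.TemporaryDirectory() as root:
    samples = {"LETTER.DOC": "Dear ALICE, ...",
               "NOTES.DOC": "grocery list",
               "ALICE.EXE": "alice"}
    for filename, text in samples.items():
        with open(os.path.join(root, filename), "w") as handle:
            handle.write(text)
    matches = sorted(os.path.basename(p) for p in find_word(root, "alice"))

print(matches)
```

 Only LETTER.DOC is reported: NOTES.DOC lacks the word, and ALICE.EXE is
 never opened because its extension rules it out.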

 Another way you can narrow down the search is picking what you are
 looking for carefully.  "Computer" is not going to be a very distinctive
 search word on a computer.  In the middle of a lake, looking for a
 computer would be very helpful because it is a rarer item there than on
 land.  A rarer item on the computer might be "EISA motherboard".  We can
 also search for more than one thing or field at once, using Boolean
 logic.

 Boolean logic involves using AND, OR, and NOT much like arithmetic
 operators.  A Boolean expression is either true or false.  For example,
 we search for ALICE AND COMPUTER.  Both words must be found near each
 other in our files for the expression to evaluate as true.  If Search_it
 (ALICE and COMPUTER) then "we have a match".  It turns out that you can
 string together combinations of the operators AND, OR, and NOT.  Of
 course on the computer, you have to have software that does this.  In
 real life you use these conditions all the time without realizing it.
 "If it's lunch time and I'm hungry then I think I'll eat."

 One of the fundamental aspects of computers is being able to perform
 different actions based on the result of a condition.

             IF BEFORE LUNCH THEN EAT BREAKFAST ELSE EAT LUNCH.
         Or, IF THE_COMPANY_SHOWED_A_PROFIT THEN PAY_STOCKHOLDERS
                ELSE FILE_BANKRUPTCY.

 Each of the actions above could be the name for a procedure, module, or
 program that does all the necessary processing.
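
 In a present-day language the second example might be written roughly
 as follows.  The procedure bodies are placeholders invented for
 illustration; only the names come from the example above.

```python
def pay_stockholders():
    """Placeholder for all the processing a real payout would need."""
    return "paying stockholders"


def file_bankruptcy():
    """Placeholder for all the processing a real filing would need."""
    return "filing for bankruptcy"


def year_end(the_company_showed_a_profit):
    """Choose which procedure to run based on the result of a condition."""
    if the_company_showed_a_profit:
        return pay_stockholders()
    else:
        return file_bankruptcy()


print(year_end(True))
print(year_end(False))
```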

 A Boolean condition could be ALICE and (COMPUTER or GROCERY).  This
 match would be a mention of the word ALICE and also one of the others,
 either COMPUTER or GROCERY.  Structured Query Language (SQL) searches
 utilize this type of condition; they can also combine records that share
 common fields.  But remember, in our search we know very little about
 how our information warehouse is organized.
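
 As a small sketch, the condition ALICE and (COMPUTER or GROCERY) can be
 tested against a handful of texts.  The document names and contents are
 made up, and this version checks whole documents rather than requiring
 the words to be near each other.

```python
def contains(text, word):
    """True if word appears anywhere in text, ignoring case."""
    return word.lower() in text.lower()


def is_match(text):
    """Evaluate ALICE and (COMPUTER or GROCERY) for one document."""
    return contains(text, "ALICE") and (
        contains(text, "COMPUTER") or contains(text, "GROCERY"))


# Made-up documents for illustration.
documents = {
    "memo.txt": "Alice bought a new computer.",
    "list.txt": "Alice went to the grocery store.",
    "ad.txt": "Computer sale this week.",
    "note.txt": "Alice called.",
}
hits = [name for name, text in documents.items() if is_match(text)]
print(hits)
```

 The advertisement fails because ALICE is missing, and the note fails
 because neither of the other two words appears.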

 It turns out, luckily, that Boolean conditions can be chopped in two,
 and each half treated separately.  This is like cutting the cards, and
 then cutting them again, and it can save a good deal of work, much as
 halving did in the phone book example above.  If we are searching for
 two things at once, say ALICE and COMPUTER, as soon as we know that
 ALICE isn't there, we don't have to check for COMPUTER.
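
 Many programming languages apply this shortcut automatically, under the
 name short-circuit evaluation.  The sketch below records which words
 were actually checked; the checks list and the sample sentence are
 assumptions added for illustration.

```python
checks = []  # records every word that was actually looked for


def found(text, word):
    """Look for word in text, noting that the check was performed."""
    checks.append(word)
    return word.lower() in text.lower()


def alice_and_computer(text):
    """Python's 'and' skips the right-hand side when the left is False."""
    return found(text, "ALICE") and found(text, "COMPUTER")


result = alice_and_computer("A memo about computer prices.")
print(result, checks)
```

 ALICE is absent, so the expression is already false and the check for
 COMPUTER is never run, even though that word is in the text.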

 In searching, time and space are at a premium.  You give up one for the
 other and must compromise.  The ready availability of a great variety of
 knowledge bases opens up information, provided one has the ability to
 use search tools and learns to use those tools efficiently and with
 minimum cost.  Although it is currently gauche to say, not all of us
 will live forever.  It would be nice to know that we will find what
 we're looking for before the end!

 Peter Neuendorffer is a regular WindoWatch contributor.  He is the
 creator of Alice and a DOS and Windows programmer.  Peter has very
 recently released a text search program for Windows he calls Bool Text
 Searcher, which can be retrieved as ABOOL11.ZIP.




                                      ww



