
   
                         Archie Project Overall Design
                                       
   Author: rootd // Editor: rootd // Texinfo: rootd // 3 December 1994
   
Brief Reminder of Functional Capabilities

   In order to best understand the design of the archie server, it's
   useful to review the capabilities of the archie system. Note, however,
   that this section is NOT the specification. The specification is the
   user manual given earlier in this document. If there is a difference
   between the user manual and this section, then there is an error in
   this section.
   
   Reviewing the functionality in this manner will help show how the
   archie system design is capable of meeting the requirements.
   
   In the following document, "users" are people who use archie for
   searching ftp sites, and "maintainers" are people who set up archie
   servers. A "system administrator" usually refers to an administrator
   at the user's site, not the maintainer's site (because of the
   privileges necessary to run an archie server, we assume that the
   maintainers are also qualified system administrators).
   
  The Environment Variable Concept
  
   Each of the interfaces that users can use to request archie searches
   has the concept of "environment variables." These are variables which
   determine the behavior of other commands. The SEARCH environment
   variable is an example of this: it allows the user to specify what
   type of search later search commands will execute.
   
  Search Commands
  
   These commands actually cause some sort of search to be performed by
   the archie search engine. There are five types of searches: exact,
   subcase, sub, regex, and whatis.
   
  Maintenance Commands
  
   These commands give users the ability to assist the maintainers in
   improving the archie service. The only current example is the
   "addnewsite" command which requests that a new anonymous ftp site be
   added to the ftp-site-list. As we will see later, this one command
   needs to be handled differently from all the others.
   
  The Telnet Interface
  
   This is a miniature "shell" which allows people to telnet to an archie
   server and request archie searches. Because there is no remote
   software at all, the telnet interface needs to keep track of
   environment variables.
   
  The Email Interface
  
   This is a program which parses incoming email. It's very similar to
   the telnet interface (with the same command structure and environment
   variables) differing only because some environment variables (most of
   which deal with the display type of the interactive terminal) are not
   necessary in the email interface.
   
  The Prospero Interface
  
   This interface provides compatibility with existing clients which use
   the prospero protocol. Because the prospero protocol is a
   connectionless stateless protocol, the remote client needs to keep
   track of environment variables.
   
  Next Generation Interface
  
   This interface allows the "next generation" of archie client to exist,
   allowing the following new functionality to occur: 1) remote client
   can load balance between archie servers, and 2) the remote client will
   use a simple TCP connection to connect to the server (to make the
   creation of new archie clients easier).
   
The Master Archie Server

   Our archie server system has a concept of a "master archie server".
   This site has the responsibility of generating and testing the list of
   all ftp sites which the other archie servers will index. The current
   site list will be available via anonymous ftp.
   
   The master archie server will receive email with addnewsite commands.
   At this point, the master server will perform the following tasks:
   
    1. Get the primary IP address and primary domain name for the site.
    2. Check to see if that IP address is already in the site list.
    3. Anonymously ftp to that site and obtain an "ls -lR".
    4. Check the "ls -lR" for errors which would make it incompatible
       with the archie system.
    5. If appropriate, add the site to the site list.
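
   The five steps above can be sketched as a single function. This is
   only an illustration, not the master server's actual code: the
   function name, the return strings, the in-memory site list, and the
   two injected callables (fetch_listing and listing_is_usable) are all
   hypothetical, and treating the first address returned by the resolver
   as the "primary" IP address is an assumption.

```python
import socket


def handle_addnewsite(hostname, site_list, fetch_listing, listing_is_usable,
                      resolve=socket.gethostbyname_ex):
    """Process one addnewsite request (illustrative sketch only).

    site_list is a list of (primary_name, primary_ip) tuples.
    fetch_listing(name) stands in for the anonymous ftp "ls -lR" step.
    listing_is_usable(listing) stands in for the compatibility check.
    """
    # Step 1: primary domain name and primary IP address.
    primary_name, _aliases, addresses = resolve(hostname)
    primary_ip = addresses[0]  # assumption: first address is the primary one

    # Step 2: reject sites whose IP address is already listed.
    if any(ip == primary_ip for _name, ip in site_list):
        return "duplicate"

    # Steps 3 and 4: obtain the listing and check it.
    listing = fetch_listing(primary_name)
    if not listing_is_usable(listing):
        return "rejected"

    # Step 5: add the site.
    site_list.append((primary_name, primary_ip))
    return "added"
```

   Passing the resolver in as a parameter lets the duplicate check be
   exercised without touching the network.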
       
   To reduce load on this critical link in the archie system, the master
   archie server should not also be a regular archie server. In addition,
   users should not have to know that the master server even exists. They
   should be able to send the "addnewsite" command to any archie server,
   which will do the "correct thing" and forward it to the master server.
   
The Archie Server

   An "archie server" is a server that allows users to submit searches.
   It consists of the following software modules:
   
    1. Telnet interface
    2. Email interface
    3. Prospero interface
    4. Next Generation interface
    5. Queuer
    6. Searcher
    7. Database builder
    8. ls -lR getter
    9. Site list getter
       
   In addition, a database is created by the database builder and
   searched by the searcher. The site list is obtained from the master
   archie server by the site list getter.
   
   Each component is described in detail in a separate chapter. For now,
   we will give a brief description of each module's function, and the
   data passed between modules. The precise format of the information
   passed between modules will be detailed in the appropriate chapter.
   
Interfaces

   An interface communicates with a user and helps the user submit a
   search request. Then the interface sends the request to the queuer.
   There are four types of interfaces in the archie system: telnet,
   email, prospero, and next-generation.
   
   During normal operation, there are many interface processes running on
   the archie server. Each interface process only has to deal with one
   remote user at a time. This is the interface's primary purpose: to
   keep the complicated archie modules (queuer and searcher) from having
   to deal with multiple remote users over unreliable communication
   lines.
   
   When an interface has a search request, it sends the following
   information to the queuer:
   
    1. type of search
    2. number of hits
    3. niceness level
    4. search string
    5. verbose output option
    6. unique search identifier
       
   The information is sent to the queuer through a named pipe. The unique
   search identifier includes two parts, one of which is the pid of the
   interface. When the queuer needs to send the search results back to
   the interface, the queuer puts the data in a file and uses this pid to
   send a signal to the appropriate interface. Upon receipt of the
   signal, the interface knows that the output of the search is ready.
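
   As an illustration of the request record, the sketch below packs the
   six items into one tab-separated line suitable for writing to a
   named pipe. The wire format is an assumption (the precise format is
   left to the appropriate chapter), as are the class and field names
   and the choice of a per-interface sequence number as the second part
   of the unique search identifier.

```python
from dataclasses import dataclass

SEARCH_TYPES = ("exact", "subcase", "sub", "regex", "whatis")


@dataclass(frozen=True)
class SearchRequest:
    search_type: str
    max_hits: int
    niceness: int
    search_string: str
    verbose: bool
    interface_pid: int  # first part of the unique search identifier
    sequence: int       # hypothetical second part of the identifier

    @property
    def identifier(self) -> str:
        return f"{self.interface_pid}.{self.sequence}"

    def encode(self) -> bytes:
        """Return the request as a single newline-terminated record."""
        if self.search_type not in SEARCH_TYPES:
            raise ValueError(f"unknown search type: {self.search_type!r}")
        if "\t" in self.search_string or "\n" in self.search_string:
            raise ValueError("search string may not contain tab or newline")
        fields = [
            self.search_type,
            str(self.max_hits),
            str(self.niceness),
            self.search_string,
            "1" if self.verbose else "0",
            str(self.interface_pid),
            str(self.sequence),
        ]
        return ("\t".join(fields) + "\n").encode("utf-8")

    @classmethod
    def decode(cls, record: bytes) -> "SearchRequest":
        """Parse one record produced by encode()."""
        parts = record.decode("utf-8").rstrip("\n").split("\t")
        if len(parts) != 7:
            raise ValueError("malformed search request")
        return cls(parts[0], int(parts[1]), int(parts[2]), parts[3],
                   parts[4] == "1", int(parts[5]), int(parts[6]))
```

   Keeping each record short matters here: POSIX guarantees that a
   write of at most PIPE_BUF bytes to a pipe or FIFO is atomic, so
   records from different interface processes will not interleave.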
   
   The maintenance command addnewsite is dealt with in a different
   manner. If an addnewsite command is executed, the interface will send
   an email to the master archie site with the addnewsite command.
   
   Now for information specific to each interface.
   
  Telnet Interface
  
   This interface allows users to log in to an archie server directly and
   interactively request searches. A login with no password will be
   created on the archie server allowing people to log in. The telnet
   interface will be the shell (given in the /etc/passwd file) of the
   remote user. Since there is no "remote site client", the telnet
   interface/shell must keep track of all environment variables.
   
   Since we are allowing the general public to log in without
   identification or a password, this interface is a potential security
   risk. Care must be taken to make certain that users are limited to
   archie functionality, and cannot break out of the archie-shell and
   gain unauthorized privileges. As a minimum, the "archie-user" must not
   own any files (designer's note: will having different users own
   different archie processes break our signal-passing mechanism? This
   must be tested and dealt with if necessary).
   
   One telnet interface will be spawned for each remote user.
   
  Email Interface
  
   The fake archie user created for the telnet interface will also have a
   .forward file. This will cause two copies of incoming email to be
   created: one will be added to a log file, and the other will be
   forwarded into the email interface. This will cause one email
   interface to be spawned off for every incoming email archie request.
   
   To prevent email storms, the email interface must not respond to email
   error messages. In addition, all outgoing email should carry a
   "Precedence: bulk" header to minimize the generation of email error
   messages.
   
   The email interface will parse the incoming email one line at a time,
   modifying the appropriate environment variables and generating the
   search commands as needed. In addition to the regular environment
   variables, the email interface will also keep the return address of
   the remote user.
   
   In the event multiple searches are requested in one email, the email
   interface will submit the searches one at a time, and wait for the
   result of one search before submitting the next search. Although this
   is unnecessary and (slightly) inefficient (with regard to the
   computational resources of the server which will have more email
   interface processes running at any one time), this will prevent one
   email with large or difficult searches from making the server
   completely useless until the email is completely processed.
   
  Prospero Interface
  
   The prospero interface provides compatibility with existing archie
   clients. These clients are located on remote machines and have two
   purposes: 1) by using the (non-computationally intensive) prospero
   protocol, remote clients reduce the load on the archie server (telnet
   sessions are much more intensive), and 2) the ability to support
   remote clients allows programmers to provide "friendlier" user
   interface support without requiring changes to the archie server.
   
   Unlike the telnet and email interfaces, a prospero interface is NOT
   spawned off each time a request comes in. Instead, when the archie
   server starts up, several prospero interface daemons are created which
   attempt to read messages on one well-known UDP port. Because of the
   way UDP works, when a UDP packet comes in, ONE prospero daemon gets
   the packet (which contains an entire search request) and then the
   daemon processes the request (by submitting it to the queuer, and
   waiting for the signal indicating that the reply is ready--and then
   sending the reply to the remote client).
   
  Next Generation Interface
  
   inetd.conf is modified to listen on a new "well-known" TCP port, and
   each time a connection request comes in, a next generation interface
   is spawned.
   
   This interface is somewhat more intensive than the prospero interface,
   but is simpler to program (because it's a simple TCP connection) so a
   greater variety of remote clients is to be expected. In addition to
   the normal functionality, this interface will respond to a request for
   server loads. This will cause the interface to print out all the
   information on the server load on all known archie servers, allowing
   remote clients to load balance between servers.
   
Queuer

   There is only one queuer process running on an archie server. It is
   the "mother of all archie processes" and its creation brings the
   archie server on line.
   
   The queuer has two purposes. When first created, the queuer is
   responsible for bringing the archie system on-line. Later, during
   normal operation, the queuer maintains the queue of search requests
   (allowing the searcher to only have to worry about one search at a
   time).
   
   During initialization, the queuer does the following things:
   
    1. spawn off the searcher
    2. spawn off the prospero interfaces
    3. record its PID in a well-known file
    4. create a FIFO for queuer input
       
   Search requests come in through the FIFO, and consist of the following
   information (repeated from the section on interfaces):
   
    1. type of search
    2. number of hits
    3. niceness level
    4. search string
    5. verbose output option
    6. unique search identifier
       
   Since the queuer deals with the order of the queue, the verbose output
   option and the niceness level are for the queuer to use. The other
   data, namely:
   
    1. type of search
    2. number of hits
    3. search string
    4. unique search identifier
       
   are sent to the searcher via the pipe created with the popen command.
   Only one search is sent at a time, and the queuer waits for the
   searcher to respond (the searcher sends a signal to its parent--the
   queuer--each time a search is complete) before sending another search.
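
   A minimal sketch of the queue discipline described above follows. It
   assumes that a lower niceness level means the request is served
   sooner (as with Unix nice values) and that requests of equal
   niceness are served in arrival order; the design does not state
   either rule. The class and method names are hypothetical.

```python
import heapq
import itertools


class SearchQueue:
    """Holds pending searches and releases them one at a time."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order
        self._searcher_busy = False

    def submit(self, niceness, request):
        """Queue a request read from the FIFO."""
        heapq.heappush(self._heap, (niceness, next(self._arrival), request))

    def next_for_searcher(self):
        """Return the next request, or None if the searcher is busy
        or nothing is waiting."""
        if self._searcher_busy or not self._heap:
            return None
        _niceness, _order, request = heapq.heappop(self._heap)
        self._searcher_busy = True
        return request

    def search_finished(self):
        """Call when the searcher signals that a search is complete."""
        self._searcher_busy = False
```

   The busy flag models the rule that only one search is handed to the
   searcher at a time; in the real queuer it would be cleared from the
   handler for the searcher's completion signal.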
   
Searcher

   The searcher has one purpose: to search the database as quickly as
   possible. The searcher obtains the following information from the
   queuer:
   
    1. type of search
    2. number of hits
    3. search string
    4. unique search identifier
       
   The searcher looks through the appropriate database, writes the result
   of the search to a file (in a known directory, with the unique search
   identifier name), and then sends a signal to its parent (the queuer).
   Since the queuer spawned the searcher with the popen() command, the
   searcher can read the search information from standard input (but it
   has to be careful not to block). (designer's note: an error recovery
   mechanism here would be good: perhaps an alarm could go off if a
   search takes a very long time--like 60 seconds, that way if an error
   occurs the entire archie server won't go down until manually
   restarted).
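
   The alarm idea in the designer's note could look something like the
   following. This is a sketch under several assumptions: it is
   Unix-only, it must run in the main thread of the process, and the
   names run_with_timeout and SearchTimeout are invented for this
   example. The 60 second figure from the note would be passed as the
   seconds argument.

```python
import signal


class SearchTimeout(Exception):
    """Raised when a search exceeds its time limit."""


def run_with_timeout(search, seconds):
    """Run search() and raise SearchTimeout if it takes too long."""

    def on_alarm(signum, frame):
        raise SearchTimeout()

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return search()
    finally:
        # Cancel any pending alarm and restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

   On a timeout the searcher would still need to report failure to the
   queuer so that the queue keeps moving; that part is not shown.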
   
Database

   The speed of the searcher is dependent on the format of the database,
   so the database is a little complicated. The database consists of four
   directories, with one file in each directory for each indexed ftp
   site.
   
   The directories are: index, nocaseindex, data, and direc. Files inside
   the index directory are called index files. Files inside the
   nocaseindex directory are called nocaseindex files, and so on.
   
   An example filename would be ftp.cs.pdx.edu-131.252.21.199. Note that
   this gives us the primary domain name and primary IP address, which we
   need for our output (by always using the primary domain name and
   primary ip address, we eliminate the possibility of duplicated ftp
   sites).
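
   Recovering the two parts from such a filename is straightforward. The
   sketch below splits at the last hyphen, since a domain name may
   itself contain hyphens but a dotted IPv4 address cannot. The function
   name is hypothetical.

```python
import ipaddress


def split_site_filename(name):
    """Split 'domain-a.b.c.d' into (domain, ip) or raise ValueError."""
    domain, separator, ip_text = name.rpartition("-")
    if not separator or not domain:
        raise ValueError(f"not a site filename: {name!r}")
    ipaddress.IPv4Address(ip_text)  # raises ValueError if malformed
    return domain, ip_text
```

   For the example above, split_site_filename returns the pair
   ("ftp.cs.pdx.edu", "131.252.21.199").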
   
   The index files consist of the name of each file on the ftp site,
   one per line.
   
   The nocaseindex files consist of the same names, one per line, with
   all upper case characters converted to lowercase characters.
   
   The data file consists of the information on each file listed in the
   index and nocaseindex files. This includes permissions, owner, group,
   link count, modification date, and size. In addition, the "directory
   number" (to be defined in a moment) is also listed on the line. Since
   each line is a specific size (in our prototype, 67 characters,
   including newline), we can lseek() to any specific line in the file.
   
   The index, nocaseindex, and data files for a particular remote site
   are all related. The data on line X in one file refers to the same
   file as the data on line X in the others. This means that we can search
   through the index file relatively quickly (because it only contains
   the names of files through which we are searching) and get the line
   number of the "hit". Then we lseek() to that line number in the data
   file and get the rest of the data on that file (except the directory,
   but we get the directory number). Because the size of the directory
   name varies widely, we can't put it in the data file. As a result, we
   put the directories in a separate file, and give the line number in
   the data file.
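
   The two-step lookup can be sketched as follows: scan the index file
   for matching names, then seek directly to the matching record in the
   data file. The 67-byte record size is the prototype's figure quoted
   above; the function names are hypothetical, line numbers are counted
   from zero, and the layout of the fields inside a record is not
   assumed, so the record is returned as raw bytes.

```python
RECORD_SIZE = 67  # bytes per data-file line in the prototype, newline included


def find_hits(index_path, needle, max_hits):
    """Return line numbers of index entries containing needle (bytes)."""
    hits = []
    with open(index_path, "rb") as index:
        for line_number, raw in enumerate(index):
            if needle in raw.rstrip(b"\n"):
                hits.append(line_number)
                if len(hits) >= max_hits:
                    break
    return hits


def read_record(data_path, line_number, record_size=RECORD_SIZE):
    """Seek to one fixed-width record in the data file and return it."""
    with open(data_path, "rb") as data:
        data.seek(line_number * record_size)
        record = data.read(record_size)
    if len(record) != record_size:
        raise IndexError(f"no record at line {line_number}")
    return record
```

   Because every record has the same length, the seek offset is simply
   the line number times the record size, and no part of the data file
   before the hit needs to be read.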
   
   By separating the file names from the rest of the data, we make the
   file through which we have to search smaller, increasing the speed of
   our search. In addition, it allows us to have a separate file for each
   type of search (which allows us to have the separate index file for
   case-insensitive searches, eliminating the need for our searcher to
   test each character twice while performing case-insensitive searches).
   
Database Builder

   Each night, the database builder will look to see if there are any
   ls-lr files which have been modified more recently than the
   corresponding database files in the INSTALLDIR/database-ng
   directories. If there are ls-lr files that are more recent, the
   database builder will recreate the database from the ls-lr files.
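
   The nightly staleness check might be sketched like this. It assumes
   that the ls-lr directory holds one file per site, named the same as
   the site's database files, and that a site needs rebuilding if any
   of its four database files is missing or older than the listing.
   The function name is hypothetical.

```python
import os

DB_PARTS = ("index", "nocaseindex", "data", "direc")


def sites_needing_rebuild(lslr_dir, database_dir):
    """Return site names whose listing is newer than their database."""
    stale = []
    for site in sorted(os.listdir(lslr_dir)):
        listing_mtime = os.path.getmtime(os.path.join(lslr_dir, site))
        for part in DB_PARTS:
            part_path = os.path.join(database_dir, part, site)
            if (not os.path.exists(part_path)
                    or os.path.getmtime(part_path) < listing_mtime):
                stale.append(site)
                break  # one stale part is enough to rebuild the site
    return stale
```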
   
   The builder will go through the ls-lr file, and create the four parts
   of the database (index, nocaseindex, data, and direc). This will be
   done in a temporary directory in the same filesystem as the
   database-ng directory, and then mv instructions will be used to move
   them in (this reduces the probability that an intermittent error will
   occur if the searcher is searching one of these files at the time it
   is moved--as opposed to copied--there is still a small possibility of
   error that a semaphore system could fix, but the probability is
   sufficiently low that we will worry about this later).
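
   The build-then-move step can be illustrated as below. The staging
   directory is created inside the database directory so that it is on
   the same filesystem, which is what makes each rename atomic on POSIX
   systems. The function name, the ".build-" prefix, and passing the
   file contents in as a dictionary are all choices made for this
   example only.

```python
import os
import shutil
import tempfile


def install_site_files(database_dir, site, parts):
    """Write each part to a staging area, then rename it into place.

    parts maps a part name ("index", "nocaseindex", "data", "direc")
    to the bytes that make up that file for this site.
    """
    staging = tempfile.mkdtemp(prefix=".build-", dir=database_dir)
    try:
        staged = {}
        for part, content in parts.items():
            path = os.path.join(staging, part)
            with open(path, "wb") as handle:
                handle.write(content)
            staged[part] = path
        # Only move files in once every part has been written.
        for part, path in staged.items():
            target_dir = os.path.join(database_dir, part)
            os.makedirs(target_dir, exist_ok=True)
            os.replace(path, os.path.join(target_dir, site))
    finally:
        shutil.rmtree(staging, ignore_errors=True)
```

   Each file is replaced atomically, but the four files are not
   replaced as a group, so a search that runs between two renames can
   still see mismatched files. This is the small remaining window
   mentioned above.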
   
   Upon completing the construction of the database, the builder will
   eliminate that particular ls-lr file (it's no longer needed and uses
   considerable disk space).
   
ls -lR Getter

   Each night, the ls -lR getter will do two things:
   
    1. Determine which ls -lR files need to be obtained
    2. Get them, and put them in the INSTALLDIR/lslr directory with the
       correct name (of the format ftp.cs.pdx.edu-131.252.21.14)
       
   There are two reasons why ls -lR files would need to be obtained:
    1. The current database files are over N days old (archie site
       determines N)
    2. An entry in the ftp site list file does not have a corresponding
       entry in the database.
       
   The ls -lR Getter will then anonymously ftp to the remote site and
   execute the ls -lR command. It will put the returned file in a
   temporary directory while it does the download, and then move it to
   the lslr directory once the download is complete.
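
   The two selection rules can be expressed as one small function. This
   sketch assumes the caller has already collected the build time of
   each site's database files into a dictionary keyed by site name; the
   function and parameter names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def sites_to_fetch(site_list, database_mtimes, now, max_age_days):
    """Return the sites whose ls -lR listing should be fetched tonight.

    site_list       -- site names from the ftp site list file
    database_mtimes -- {site name: time the database files were built}
    now             -- current time in seconds since the epoch
    max_age_days    -- the site-chosen N
    """
    max_age = max_age_days * SECONDS_PER_DAY
    wanted = []
    for site in site_list:
        built = database_mtimes.get(site)
        # Rule 2: no database entry. Rule 1: entry older than N days.
        if built is None or now - built > max_age:
            wanted.append(site)
    return wanted
```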
   
Site List Getter

   The site list is maintained by the Master Archie Server and made
   available via anonymous ftp for anyone to download. Each archie site
   should download it about once a month. This is done manually by the
   archie server administrator so that they have the ability to customize
   the ftp site list (some servers in Australia only index sites in
   Australia for example).
   
