
                            ===========
                            | VBSTATS |
                            ===========

         Experimental Windows Web Server Statistics Reporter
         ===================================================

By: Bob Denny <rdenny@netcom.com>
    July 15, 1994

I hereby place these programs into the public domain without
restriction and WITHOUT ANY WARRANTY OF ANY KIND. All I ask is that
if you distribute copies of the kit, PLEASE mark any changes
CLEARLY, and distribute it as a complete kit.

*** PRESERVE DIRECTORIES WHEN DECOMPRESSING! ***

=============
INTRODUCTION:
=============

This kit contains Visual Basic files for three programs that
together provide the ability to analyze Web server logfiles (in
Common Log Format ONLY), and produce HTML pages (and a usage graph)
describing the server usage. The basic idea came from the "wusage"
program developed by Tom Boutell; however, this program works
entirely differently.

If you aren't familiar with VB and SQL, or don't want to be, stop
here. At this stage, the package is for dedicated tinkerers only.
Also, if you don't have VB Professional, stop here. The package
requires the Professional Edition, as it uses a relational database
and SQL to do the work, and requires you to edit the source and
build the executables. No VB components are included with this kit.

On the good side, the programs are really quite small, hopefully
well documented, and hopefully easy to follow. The database is in
Access/VB-native format, so you don't need SQL Server or anything
like that!

The kit is designed to work automatically with Windows httpd: by
default it checks whether the server is running and, if so, sends
the "cycle logfiles" signal to the server, then pulls the
cycled-out log data into the database. Command line options permit
the kit to be used offline to analyze and report on data from
other kinds of servers (NCSA, CERN, etc.).

There are three programs:

  * logtodb  - Captures common log format files and loads the database
               (creating it if needed)
  * restrict - A GUI "object restriction" editor
  * reporter - The program that generates the HTML and graph

"Restriction" is the process of hiding objects (users, URL targets,
and sites) from consideration in statistics summaries. Objects can
be hidden from top-10 lists (but still counted in the totals) or
hidden completely (not counted even in the totals).

================
GETTING STARTED:
================

BEFORE DOING ANYTHING ELSE, read this document all the way through
and poke around in the code of each of the programs. Then make any
changes in lib/config.bas (common to all three programs' MAK files).
You will have to change at least the server name and the server
admin's name. Now make all three executables in their
subdirectories, and move them to the parent (whtpstat), which we'll
call the "work area".

Next get a copy of a Common Log Format Web server log, preferably
one that has lots of data extending over several weeks, and put it
into the work area. Assuming it is called access.log, run the
following:

  logtodb <work>\access.log <work>\access.mdb

This will create the database and fill in the data from the server
log. The command arguments override the automatic behavior used for
routine operation.

TIP: You'll need either Access 1.1 (not 2.0!) or the "Visual Data"
sample that comes with VB Professional. If you don't have Access
1.1, use the modified modules I supplied in the ./visdata
subdirectory to replace the originals that came with Visual Data,
then build the VISDATA executable. Use it to examine the tables and
QueryDefs in the database after it is built by the logtodb program.
The modifications permit you to see the QueryDef SQL by simply
clicking on the QueryDef name in the Tables/Queries list.

One more thing: If you are using the Windows httpd server, add a
directory ./htdocs/svrstats, the default location into which the
reporter puts the HTML and graph files.

Once you have a loaded database, you're ready to play around.

=============
THE DATABASE:
=============

The database is the heart of this package. It contains both tables
and QueryDefs; the latter are stored SQL statements used to produce
the summary report data and perform maintenance tasks. Here's a
brief description of them.

    Table         Description
    ------------- ---------------------------------------------------
    Accesses      Each access to the server, from a line in the log
    AuthUsers     Each unique user that authenticated via the server
    Objects       Each unique URL target that was accessed
    PastTotals    Weekly totals generated by previous report runs
    RestrictPats  Contains patterns used for visibility restriction
    RestrictKey   (not used, see below)
    Sites         Each unique client hostname or IP address
    TableKey      (not used, see below)

The RestrictKey table was supposed to contain a textual description
of the restriction codes (more on this later). The TableKey table
was supposed to contain a textual description of each table, listed
by "table code" used to identify tables during restriction processing.
(Baffled? Don't worry... this is fun!)

There is always a User with ID=1, called <none>, which links with
unauthenticated accesses.

Most of the SQL queries are in the database as QueryDefs. Many take
parameters ("parameter queries"), most of which are the start and
end date/times for the query. Doing this allows changes to be made to
the database structure (and the query) without changing the code in
the applications. Cool, eh? Well, there are a few SQL statements in
the code anyway, so it ain't pure. Maybe I'll move them into
QueryDefs some day. Here are brief descriptions of the QueryDefs:

    QueryDef               Resulting dynaset (D) or action (A)
    -------------          -------------------------------------------
    AuthUsersByAccessCount (D) Total accesses for each user
    AuthUsersByByteCount   (D) Total byte count for each user
    MarkNewAuthUsers       (A) Init restrict code for newly added users
    MarkNewObjects         (A) Init restrict code for newly added targets
    MarkNewSites           (A) Init restrict code for newly added sites
    ObjectsByAccessCount   (D) Total accesses for each URL target
    PatListSelector        (D) Restrict pattern lists for each table
    PurgeAccesses          (A) Remove logged accesses for a date range
    RemoveAllRestrictions  (A) Unrestrict all users, targets and sites
    RestrictAuthUsers      (A) Apply restriction patterns to users
    RestrictNewAuthUsers   (A) Apply restrictions to newly added users
    RestrictNewObjects     (A) Apply restrictions to newly added targets
    RestrictNewSites       (A) Apply restrictions to newly added sites
    RestrictObjects        (A) Apply restriction patterns to targets
    RestrictSites          (A) Apply restriction patterns to sites
    SitesByAccessCount     (D) Total accesses for each site hostname/IP
    SitesByByteCount       (D) Total byte count for each site hostname/IP
    TotalAccessesByMethod  (D) Total accesses for each method (GET, etc.)
    TotalAccesses          (D) Grand total of accesses
    TotalByteCount         (D) Grand total of byte count
    TotalIndexQueries      (D) Grand total of ISINDEX requests

Some QueryDefs may be obsolete. They will be removed later.
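
To give a feel for how the programs drive one of these parameter
queries, here is a minimal VB 3 data-access sketch. The database
path and dates are made-up examples, and the exact parameter-
setting syntax may vary slightly between DAO versions:

  Dim db As Database
  Dim qd As QueryDef
  Dim ds As Dynaset

  ' Open the stats database (path is an example only)
  Set db = OpenDatabase("C:\HTTPD\LOGS\ACCESS.MDB")

  ' Open the stored parameter query and plug in its parameters
  Set qd = db.OpenQueryDef("ObjectsByAccessCount")
  qd![pStart] = #7/4/94#            ' start of the report interval
  qd![pEnd] = #7/11/94#             ' exclusive end of the interval

  ' Run it; each dynaset row is (Accesses, Object), descending
  Set ds = qd.CreateDynaset()
  Do Until ds.EOF
      Debug.Print ds!Accesses; " "; ds!Object
      ds.MoveNext
  Loop
  ds.Close
  qd.Close
  db.Close

The dates select which interval's rows come back; the first 10 rows
of the resulting dynaset are a top-10 list.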

Here is the ObjectsByAccessCount SQL:

PARAMETERS pStart DateTime, pEnd DateTime;
SELECT DISTINCTROW Count(*) AS Accesses, Objects.ObjectName AS Object
FROM ((Accesses
     INNER JOIN Objects ON Accesses.ObjectID = Objects.ObjectID)
     INNER JOIN AuthUsers ON Accesses.AuthUserID = AuthUsers.AuthUserID)
     INNER JOIN Sites ON Accesses.SiteID = Sites.SiteID
WHERE ((Accesses.ObjectID = [Objects].[ObjectID])
      AND (Accesses.Time >= [pStart] And Accesses.Time < [pEnd])
      AND (Objects.Restrict = 0)
      AND (Sites.Restrict < 2)
      AND (AuthUsers.Restrict < 2))
GROUP BY Objects.ObjectName
ORDER BY Count(*) DESC;

The "ID"s are counter-generated links between the tables. The
complexity arises out of the restriction capability.

First note the PARAMETERS are start and end time. The code sets
these to control the time interval over which the totals are
accumulated.

The SELECT part returns the access total for each "object" (URL
target), GROUPed BY the target name and ORDERed BY the access count,
DESCending. So the dynaset is a list of access counts for each URL
target, sorted by access count, in descending order. The first 10
rows of this are the "Top-10" accessed URLs and the number of
accesses for each.

The first INNER JOIN links to the Objects table so we can get the
URL target name instead of the internal ID which is in the Accesses
table. It also lets us get the restriction code for the object. The
second and third INNER JOINs are used to access the AuthUser and
Site tables in order to get the restriction codes for the site and
the user involved in the access records that make up the total for
each of the objects.

Here's the trick on restrictions. If the object's restriction code
is 0, it always counts. If it is 1, the object is hidden from top-10
lists. If it is 2, the object is ignored altogether.

The implications of this policy are subtle. For example, given an
access to /foo/bar.html by user farkle from site dorfle.com, this
access will be counted in the totals only if all three objects (URL
target, user, and site) have restriction codes of less than 2. If
either user farkle or site dorfle.com has a restriction code of 2,
the access will be ignored. In addition, if the target /foo/bar.html
has a restriction code greater than 0, it will not be counted in
the top-10 URL LIST. However, that access will be counted in the
top-10 SITE LIST and top-10 USER LIST unless farkle or dorfle.com,
respectively, is restricted greater than 0.

Look at the WHERE clause (after the linking phrase): First, the date
range is applied. Next the top-10 restriction is applied to the
primary object (URL target), then the total-hiding restriction is
applied to the secondary objects (user and site).

The Restriction List Editor handles the application and removal of
restrictions so that the user needn't be concerned with all of this
gory detail. Thank God. Meanwhile, you will probably get confused
by all of this in the beginning.

Note that ALL accesses are kept in the database. Restrictions may be
changed at any time, and subsequent reports will reflect the new
restrictions. Consider the alternative, where access records are
filtered at the time they are brought into the database. What if
you change your mind later? Maybe you can see why SQL is so cool.
Consider doing this in C or perl!

I'm out of breath, so let's press on to the description of the 3
programs:

========
LOGTODB:
========

This program handles the importing of Web server Common Log Format
files into the Access database. It will create the empty database if
it doesn't exist.

If nothing is put on its command line, it expects to be run on the
same machine as an NCSA Windows httpd web server, with the server
running. It first signals the server to cycle its logs using the
SIGHTTPD.DLL supplied with the server kit. Then it waits for the
cycled-out log to appear in the server's logs directory as
ACCESS.001. Then it opens the text logfile and imports the data into
the database ACCESS.MDB, which it will create if needed in the
server's logs directory.

Logtodb may be run at any time, as often as desired. It will simply
rotate the data out of the server's textual log into the database,
and then delete the cycled-out text log. It is designed to be run
from a scheduler so it needs no human interaction.

If logtodb encounters a corrupted log record, the record is skipped.
An icon is displayed while the program runs, indicating the number
of records processed, in multiples of 10. No effort is made to
detect duplicate access records; however, unique sites (hostname or
IP address), URL targets, and authenticated users are kept in
separate tables, and the access records link to these other tables
via an internally generated 32-bit ID.
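
To give the flavor of the import step, here is a rough VB sketch of
slicing one Common Log Format line into fields. This is illustrative
only, not the actual logtodb source:

  ' A CLF line: host ident authuser [timestamp] "request" status bytes
  Dim sLine As String, sHost As String
  Dim sStamp As String, sRequest As String
  Dim i As Integer, j As Integer

  sLine = "dorfle.com - farkle [15/Jul/1994:10:30:00 -0700] "
  sLine = sLine & Chr$(34) & "GET /foo/bar.html HTTP/1.0" & Chr$(34) & " 200 1234"

  i = InStr(sLine, " ")                    ' host is the first field
  sHost = Left$(sLine, i - 1)

  i = InStr(sLine, "[")                    ' timestamp is in brackets
  j = InStr(sLine, "]")
  sStamp = Mid$(sLine, i + 1, j - i - 1)

  i = InStr(sLine, Chr$(34))               ' request is in quotes
  j = InStr(i + 1, sLine, Chr$(34))
  sRequest = Mid$(sLine, i + 1, j - i - 1)

Each unique host, user, and URL target found this way is looked up
in its table (and added if absent), and only the resulting IDs are
stored in the Accesses record.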

You can use this program outside of the Windows httpd environment by
adding the pathname of the server log and the pathname of the
database (in that order) to the logtodb command line. That's what
you did in the "getting started" section.

The BUILD_DB module was generated from a working copy of the
database that I originally created with Access 1.1. I used a program
called DB2BAS, which I hacked to generate BASIC that creates the
QueryDefs as well as the tables and fields. The hacked DB2BAS did
much of the job, but I still had to edit the result by hand. Any
changes you make to the database will have to be reflected in
BUILD_DB by hand. Sorry. The hacked DB2BAS is a mess. I'm not sure
it really saved me all that much time.

=========
RESTRICT:
=========

This is a GUI program that is used to maintain the restriction
specifications for the database. There is online help available. The
match patterns used in the restrictions are those of the Access/VB
SQL language, and are documented in the help. A small help button
next to the pattern entry field pops up a pattern reference window.
The patterns are kept in the RestrictPats table in the database.

Once you have made the changes to restrictions, you must choose
Apply Restrictions in the Database menu. This removes old
restrictions and uses the patterns to set the restriction codes
appropriate to the changed patterns and/or restriction levels. You
can also remove restrictions completely by choosing Remove
Restrictions in the database menu.
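
Under the hood, Apply Restrictions amounts to running the action
QueryDefs in sequence, roughly like this (a sketch only, with an
example database path; the real program also handles the New*
variants):

  Dim db As Database
  Dim qd As QueryDef
  Dim qdNames(3) As String
  Dim i As Integer

  qdNames(0) = "RemoveAllRestrictions"     ' clear the old codes
  qdNames(1) = "RestrictAuthUsers"         ' then re-apply patterns
  qdNames(2) = "RestrictObjects"
  qdNames(3) = "RestrictSites"

  Set db = OpenDatabase("C:\HTTPD\LOGS\ACCESS.MDB")  ' example path
  For i = 0 To 3
      Set qd = db.OpenQueryDef(qdNames(i))
      qd.Execute                           ' action query: no dynaset
      qd.Close
  Next i
  db.Close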

Normally, restrict expects to find the database in the Windows httpd
server's logs directory. You can override this by putting the
pathname of the database on the command line.

=========
REPORTER:
=========

This program generates the statistics HTML pages, and a Windows BMP
graph of total accesses by week. You'll have to create a directory
httpd\htdocs\svrstats or edit CONFIG.BAS to change that, so the
program knows where to put its output.

The program may be run at any time. It will produce reports for all
complete weeks from the first day for which information appears in
the database through the just-completed week. Nothing occurring
after last Sunday is reported because the week isn't over yet. In
order to prevent duplication of effort, the reporter records summary
information for each report it generates in the PastTotals table. The
record number is used as the "tag" on the HTML filenames. Next time
the reporter runs, it looks at PastTotals and skips data for which
reports have already been generated.

NOTE: If you want to regenerate reports (perhaps you changed
the restrictions and want past reports re-run), delete all of the
PastTotals records and re-run the reporter. All past reports will be
regenerated. Note that the report numbers will not start back at 1
unless you compress the database after deleting the PastTotals data.
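
In SQL terms the reset is just emptying the table. One way to do it
from VB (an example path, and assuming the ExecuteSQL method works
as in the VB 3 manuals; you could equally do it from VISDATA):

  Dim db As Database
  Dim nRows As Long

  Set db = OpenDatabase("C:\HTTPD\LOGS\ACCESS.MDB")  ' example path
  nRows = db.ExecuteSQL("DELETE FROM PastTotals;")   ' empty the table
  db.Close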

The reporting process is straightforward. It uses the QueryDefs to
produce the statistics information, then takes that and formats it
into HTML reports, one per week. At the end, it produces an "index"
page.

Finally, the reporter uses the data in the PastTotals table to
produce a column chart showing the total accesses by week, using the
VB Graph control. The image is automatically written to disk in
Windows DIB format (".BMP" file).

The program may be run from a scheduler, as it requires no human
interaction. As it runs, an icon displays the week-starting date of
the report being produced. At present, I haven't found a program that
will convert BMP to GIF without human interaction, so this part of
the process must be done manually. I use WinGif. It's only once a
week.

With nothing on the command line, reporter expects the database to
be in the Windows httpd server's logs directory, and will put its
HTML pages and graph into htdocs\svrstats. You'll have to create the
latter directory.

=====================
IDEAS FOR THE FUTURE:
=====================

"You have the sources, DO IT!"

* Add a program to resolve IP addresses into hostnames, including
  PTR-spoof detection, on the sites listed in the database. Then
  disable DNS reverse lookups in the Web server to improve its
  response time.

* Add a GUI admin program so you don't have to edit CONFIG.BAS.

* Add some more graphs. It's easy! Thumbnails, other statistics...
  If you want a challenge, write a querydef that will produce a
  list of total accesses per day for the last 30 days and graph that.
  You could run this one every day.




