                     Basis of AI Backprop
                   Code from April 10, 1996
               Documentation from April 10, 1996

Copyright (c) 1990-96 by Donald R. Tveter

CONTENTS
--------
1. Introduction
2. Making the Simulators
3. A Simple Example
4. Basic Facilities
5. The Format Command
6. Taking Training and Testing Patterns from a File
7. Saving and Restoring Weights
8. Initializing Weights
9. The Seed Values
10. The Algorithm Command
11. The Delta-Bar-Delta Method
12. Quickprop
13. Making a Network
14. Recurrent Networks
15. Miscellaneous Commands
16. Limitations
17. The Pro Version Additions

1. Introduction
---------------
   This manual describes the free version of my Basis of AI Backprop
designed to accompany my not yet published (sigh) textbook, _The Basis
of AI_.  This program contains enough features for students in an
ordinary AI or neural networking course.  More serious users will
probably need the professional version of this software; see:

http://www.mcs.com/~drt/probp.html

or send me email at: drt@mcs.com.  Other free NN software for the
textbook is also available at:

http://www.mcs.com/~drt/svbp.html

   For more on backprop see my "Backpropagator's Review" at:

http://www.mcs.com/~drt/bprefs.html

   Notice: this is use-at-your-own-risk software.  There is no guarantee
that it is bug-free.  Use of this software constitutes acceptance for
use in an as-is condition.  There are no warranties with regard to this
software. In no event shall the author be liable for any damages
whatsoever arising out of or in connection with the use or performance
of this software.

   There are four simulators that can be constructed from the included
files.  The program, bp, does back-propagation using real weights and
arithmetic.  The program, ibp, does back-propagation using 16-bit
integer weights, 16 and 32-bit integer arithmetic and some floating
point arithmetic.  The program, sbp, uses symmetric floating point
weights and its sole purpose is to produce weights for two-layer
networks for use with the Hopfield and Boltzman relaxation algorithms
(included in another package).  The program sibp does the same using
16-bit integer weights.  The integer versions are faster on systems
without floating point hardware; however, these versions sometimes
don't have enough range or precision, and then using the floating
point versions is necessary.  DOS binaries are included here for
systems with floating point hardware.  If you need other versions
write me.

2. Making the Simulators
------------------------
   This code has been written to use either 32-bit floating point
(float) or 64-bit floating point (double) arithmetic.  On System V
machines the standard seems to be that all floating point arithmetic is
done with double precision, so double arithmetic is faster than float,
and therefore double is the default.  Other versions of C (e.g. ANSI C)
will do single precision arithmetic, which will ordinarily be faster on
most machines (I think).  To get 32-bit floating point, set the
compiler flag FLOAT in the makefile.  The function exp, defined in
real.c, is double since System V specifies it as double.  If your C
uses float, change this definition as well.

   For UNIX systems, use either makefile.unx or makereal.unx:
makefile.unx will make any of the programs, while makereal.unx will
only make bp but it keeps the bp object code files around.  For DOS
systems there are likewise two makefiles to choose from, makefile and
makereal.  Makefile is designed to make all four programs but it only
leaves around the object files for ibp, erasing the object files for
sibp, sbp and bp.  On the other hand, makereal only makes bp and it
leaves its object files around.  For 16-bit DOS you need to set the
flag -DDOS16 and for
32-bit DOS, you need to set the flag -DDOS32.  The flags I have in the
DOS makefiles are what I use with Zortech C 3.1.  The code is known not
to compile with at least one version of Turbo C because of an oddity
(or bug?) in the compiler.

   A problem was found with the previous free student version where it
crashed on a Sun when the program hit a call to free in the file bp.c.
This can be solved by removing the two calls to free; the amount of
space you waste is minimal.  I haven't had a report of such a problem
with this version yet, but if it happens, let me know; in all
probability removing a call or two to free in the file io.c will solve
the problem.

   This code will work with basic C compilers; however, the libraries
sometimes vary from system to system.  DOS systems seem to use the
function getch in the conio library for hot key capability.  For a
System V UNIX system the code uses a home-made function called getch for
hot key capability.  This is the default setting for a UNIX system and
it also works with Suns.  If you use BSD UNIX then you need to define
the compiler variable BSD in the cc command by adding the parameter
-DBSD.  To get the hot key feature to work with a NeXT use the
parameter -DNEXT.  At this point I don't know what other variations of
UNIX use so you may need to adapt the ioctl function call in the file
io.c and the files rbp.h and ibp.h to make them fit some other version.
If your system uses some other standard and you can send me the
documentation, I should be able to make it work as well.  If necessary
the hot key option can be removed by removing or commenting out the
line:

#define HOTKEYS

in the rbp.h and ibp.h files.

   There are some other more minor options that can be compiled in or
left out but these are mentioned at other points in the documentation.

   To make a particular executable file use the makefile given with the
data files and make any or all of them like so:

        UNIX                        DOS

    make -f makereal.unx bp       make -f makereal bp
    make -f makefile.unx bp       make bp
    make -f makefile.unx ibp      make ibp
    make -f makefile.unx sibp     make sibp
    make -f makefile.unx sbp      make sbp

If you do get bugs on an odd system and you can let me telnet in to
your system (preferably on a separate login, rather than your personal
login) I will try to fix the problem for you.


3. A Simple Example
-------------------
   Each version would normally be called with the name of a file to read
commands from, as in:

bp xor

After the data file is read, commands are taken from the keyboard.
When no file name is specified bp will take commands from the keyboard
(the stdin file).  Normally you will find it convenient to put the
commands you need to set up the network in a short file; however, it
is possible to type them all in from the keyboard.  If you have more
than a tiny amount of data you should have the data ready in a
training file, and a testing file if you have test data.

   The commands are one-, two- or three-letter commands and most of
them have optional parameters.  The `a', `d', `f' and `q' commands
allow a number of sub-commands on a line.  The maximum length of any
line is 256 characters.  An `*' makes the remainder of the line a
comment.  In addition ctrl-R will run the training.

   Here is an example of a data file to do the xor problem:
           
* input file for the xor problem
           
m 2 1 1 x     * make a 2-1-1 network with extra input-output connections
s 7           * seed the random number function
ci            * clear and initialize the network with random weights

rt {          * read training patterns into memory
1 0 1
0 0 0
0 1 1
1 1 0}

e 0.5         * set eta, the learning rate to 0.5 (and eta2 to 0.5)
a 0.9         * set alpha, the momentum to 0.9

First in this example, the m command will make a network with 2 units in
the input layer, 1 unit in the second layer and 1 unit in the third
layer.  Much of the time a three layer network where the connections
are only between adjacent layers is as complex as a network needs to
be; however, there are problems where having additional connections
between the input units and output units will greatly speed up the
learning process.  The xor problem is one of those problems where the
extra connections help, so the `x' at the end of the command will add
these two extra connections.  The `s' (seed) command sets the seed for
the
random number function.  The "ci" command (clear and initialize) clears
the existing network weights and initializes the weights to random
values between -1 and +1.  The rt (read training set) command gives four
new patterns to be read into the program.  All of them are listed
between the curly brackets ({}).  The input pattern comes first followed
by the output pattern.  The command "e 0.5" sets eta, the learning
rate for the upper layer to 0.5 and eta2 for the lower layers to 0.5 as
well.  The last line sets alpha, the momentum parameter, to 0.9.

   After these commands are executed the following messages and prompt
appear:

Basis of AI Backprop (c) 1990-96 by Donald R. Tveter
   drt@mcs.com - http://www.mcs.com/~drt/home.html
               April 10, 1996 version.
taking commands from stdin now
[ACDFGMNPQTW?!acdefhlmopqrstw]? q

The characters within the square brackets are a list of the possible
commands.  To run 100 iterations of back-propagation and print out the
status of the learning every 10 iterations type "r 100 10" at the
prompt:

[ACDFGMNPQTW?!acdefhlmopqrstw]? r 100 10

This gives:
 
running . . .
   10      0.00 % 0.49947   
   20      0.00 % 0.49798   
   30      0.00 % 0.48713   
   40      0.00 % 0.37061   
   50      0.00 % 0.15681   
   59    100.00 % 0.07121    DONE

The program immediately prints out the "running . . ." message.  After
each 10 iterations a summary of the learning process is printed giving
the percentage of patterns that are right and the average of the
absolute values of the errors of the output units.  The program stops
when each output for each pattern has been learned to within the
required tolerance, in this case the default value of 0.1.  Sometimes
the integer versions will do a few extra iterations before declaring the
problem done because of truncation errors in the arithmetic done to
check for convergence.  Unlike the previous student version the default
for these values is to be "up-to-date"; however, this can be overridden
to save a little on CPU time.

   There are many factors that affect the number of iterations needed
for a network to converge.  For instance if your random number function
doesn't generate the same values as the one with the Zortech 3.1
compiler (which is the same one used by most UNIX C compilers) the
number of iterations it takes will be different.  The integer versions
produce slightly different results than the floating point versions.

Listing Patterns

   To get a listing of the status of each pattern use the `p' command
to give:

[ACDFGMNPQTW?!acdefhlmopqrstw]? p
    1  0.903  e 0.097 ok
    2  0.050  e 0.050 ok
    3  0.935  e 0.065 ok
    4  0.072  e 0.072 ok
   59    (TOL) 100.00 % (4 right  0 wrong)  0.07121 err/unit

The number following the e (for error) is the sum of the absolute values
of the output errors for each pattern.  An `ok' is given to every
pattern that has been learned to within the required tolerance.  To get
the status of one pattern, say, the fourth pattern, type "p 4" to give:

 0.07  (0.072) ok

To get a summary without the complete listing use "p 0".  To get the
output targets for a given pattern, say pattern 3, use "o 3".

   A particular test pattern can be input to the network by giving the
pattern at the prompt:

[ACDFGMNPQTW?!acdefhlmopqrstw]? 1 0
       0.903 

Examining Weights

   It is often interesting to see the values of some particular weights
in the network.  To see a listing of all the weights in a network you
can use the save weights command described later on and then list the
file containing the weights.  However, to see the weights leading into
a particular node, say unit 1 in layer 3, use the w command as in:

[ACDFGMNPQTW?!acdefhlmopqrstw]? w 3 1

layer unit  inuse  unit value    weight   inuse   input from unit
  1     1     1      1.00000     5.38258     1        5.38258
  1     2     1      0.00000    -4.86238     1        0.00000
  2     1     1      1.00000   -10.86713     1      -10.86710
  3     b     1      1.00000     7.71563     2        7.71563
                                              sum =   2.23111

This listing also gives data on how the current activation value of the
node is computed using the weights and the activation values of the
nodes feeding into unit 1 of layer 3.  The `b' unit is the bias (also
called the threshold) unit.  The inuse column to the right of the unit
column is 1 when the unit is in use and 0 when it is not.  In this
free version there are no commands to take weights out of use.  In the
inuse column to the right of the weight column, a 1 indicates a regular
weight in use and a 2 indicates a bias weight in use.

   Besides saving weights you can save all the parameters to a file
with the save everything command as in:

   se saved

At the same time the weights will be written to the current weights
file.  The file saved is virtually the same as the one you get with the
`?' command.  To start over from where you left off you can use:

   bp saved

and this also reads in the patterns and weights.  This command DOES NOT
save training and testing patterns since normally you would have them
in a file of their own.

   To get a short online tutorial on how to use the program you can type
T at the command prompt and get the listing:

   A Tutorial

   The following topics are designed to be read in the order listed.

   To get help on a topic type the code on the right at the prompt.

   Understanding the Menus                                         h1
   Formatting Data for a Classification Problem                    h2
   Formatting Data for Function Approximation                      h3
   Formatting Data for a Recurrent Problem                         h4
   Making a Network for Classification or Function Approximation   h5
   Making a Recurrent Network                                      h6
   Reading the Data                                                h7
   Setting Algorithms and Parameters                               h8
   Running the Program                                             h9
   Saving Almost Everything                                        h10
   To Quit the Program                                             h11

   To end the program the `q' (for quit) command is entered:

[ACDFGMNPQTW?!acdefhlmopqrstw]? q


4. Basic Facilities
-------------------
   There are a very large number of parameters that can be set for the
various algorithms in these programs.  Typing a `?' will get a compact
listing of them all; however, they are packed rather tightly.  To get a
better view of the parameters there are now many upper-case letter
commands that give a listing of parameters in a less compact form.
These screens list parameters, generally on the left of the screen in
the form of the commands you would need to set them.  The center of the
screen gives a short description of the parameter.  Sometimes one or two
lines are inadequate to describe the command so at the far right there
may be a sequence you can type to get more help with the command.

   The most important screen you can look at is the C for commands
screen that summarizes what each menu screen will show:

[ACDFGMNPQTW?!acdefhlmopqrstw]? C

Screen     Includes Information and Parameters on:

  A        algorithm parameters and tolerance
  C        this listing of major command groups
  D        delta-bar-delta parameters
  F        formats: patterns, output, paging, copying screen i/o
  G        gradient descent (plain backpropagation)
  M        miscellaneous commands: shell escape, seed values, clear,
           clear and initialize, quit, kick a network, run command,
           save almost everything
  N        network building: making a network, initializing
           a network, kicking a network
  P        pattern commands: reading patterns, testing patterns
  Q        quickprop parameters
  T        a short tutorial
  W        weight commands: listing, saving, restoring
  ?        a compact listing of everything

One typical menu screen is the A screen that lists the main algorithm
parameters:

[ACDFGMNPQTW?!acdefhlmopqrstw]? A

Algorithm Parameters

a a <char>     sets all act. functions to <char>; {ls}             h aa
a ah s         hidden layer(s) act. function; {ls}                 h aa
a ao s         output layer act. function; {ls}                    h aa
a d d          the output layer derivative term; {cdf}             h ad
a i -          initializes units before using the training set; {+-}
a u p          the weight update algorithm; {Ccdpq}                h au
t 0.100        tolerance/unit for successful learning; (0..1)

f O -          allows out-of-date statistics to print; {+-}
f u -          compute up-to-date statistics; {+-}

The first of these listings is the line:

a a <char>     sets all act. functions to <char>; {ls}             h aa

which doesn't give a parameter value but instead it gives the pattern of
a command designed to set the activation function for the entire
network.  The first sequence is:

a a <char>

and this sequence will change the activation function, but when you
type it in you have to substitute a character code for the activation
function in place of the string <char>.  One of the activation
functions is the linear activation function, denoted by the character
l, so to get this function you can type in the line:

a a l

The first `a' codes for the "algorithm" command, the second `a' codes
for the "activation function" and l is the letter for the function.  The
idea of putting the variable portion of the command within the angle
brackets (<>) is a notation devised by computer scientists to describe
computer languages.  The word inside these brackets describes the kind
of thing that is the variable portion of the command.

   The middle part of the line:

a a <char>     sets all act. functions to <char>; {ls}             h aa

gives a short description of the meaning of the command and within the
curly brackets there is a listing of all the values for all the
activation functions, {ls}.  To get a more detailed explanation of the
options type the sequence on the right, `h aa', and the following comes
up:

   a a <char> sets every activation function to <char>.
   a ah <char> sets the hidden layer activation function to <char>.
   a ao <char> sets the output layer activation function to <char>,
   <char> can be any of the following:

   <char>                  Function                           Range

      l    linear function, x                             (-inf..+inf)
      s    standard sigmoid, 1 / (1 + exp(-x))               (0..+1)

Here you get the code for the function, the function itself and the
range of values the function can take on.  The range portion follows
another standard notation, used by mathematicians.  A ( or ) next to a
number, say 0, means the range runs very close to 0 but never exactly
to 0; in other cases (not shown above) a [ or ] next to a number means
the value can range up to exactly that number.  Thus the range:

(0..+1]

means that the range can run from ALMOST EXACTLY 0 up to exactly +1.

   If we now return to the A screen, the second line was:

a ah s         hidden layer(s) act. function; {ls}                 h aa

Here the idea is to indicate that the activation function for the hidden
layer (or layers) of the network IS NOW the s (standard sigmoid
function).  Again there is a short explanation of this, the set of
codes for the functions and information about how to get more help.
This line can also be taken as a direction for how to set the hidden
layer activation function.  To change it to l you can type in:

a ah l

(Note: normally you would only use the linear activation function in the
output layer of a network.)

   The third line:

a ao s         output layer act. function; {ls}                    h aa

is similar except it states that the activation function for the output
layer is s.


Paging

   In the student version the paging was a simple version of the System
V utility, pg.  Now the paging is more like the common UNIX more
command.  The default page size is 24 lines and it can be reset to
another value with the format command's paging size sub-command.  For
instance to get 12 lines / page instead of 24, use:

f P 12

To get no paging at all use:

f P 0

When the page is full you get the prompt:

More?

At this point you can type:

   q                  to quit viewing the text if you are in a loop,
   a blank            to get another page,
   ^D                 to get another half a page,
   a carriage return  to get one more line and
   c                  to continue without paging.

Mostly paging is needed for loops within the program, like running a
large number of iterations and printing the results, listing the values
of all the patterns or listing weights leading into a particular unit.
Typing the q quits these loops; however, paging can also occur with
some of the longer screen menus that generate lines of output without
running a loop.  For these cases the q does not work.

   Every new command entered from the keyboard sets the page counting
variable to 0; however, if input is being taken from a file other than
stdin the counter is NOT reset.  Most of the time this doesn't matter
since the little data files like the xor example used to set up
parameters don't produce any output anyway; if they do, paging is
in effect.  Having paging here is helpful in case there is a problem
with reading the files.

Interrupts

   In UNIX entering an interrupt will stop the current command and the
program will give the user another prompt.  With DOS entering a ctrl-C
will generate a similar kind of interrupt; however, DOS only checks for
this condition when it has to do i/o.  When the DOS version is in a
training loop the program also checks to see if a key has been hit,
and if that key is the escape key the program will break the training
loop.

Control Command

   One control-key command is available in this version: hitting ctrl-R
will run the training algorithm.  It is shorthand for typing r followed
by a carriage return.

Passing Commands to the Operating System

   By using the `!' command you can pass commands to the operating
system from within the program.  Typical things you might want to do
are list the contents of a directory or list a file; after saving
weights to a file you might want to list them or even edit them and
read them back in.  Here is what you can say for DOS to list the
little data file xor:

! type xor

Once a string has been defined with a ! command it can be re-run simply
by typing the ! followed immediately by a carriage return.

Making a Copy of Your Session

   Sometimes you may want to make a copy of everything that you type in
and the program prints out.  For instance you may get exceptionally good
or bad results using a certain training sequence and an exact record of
what you and the program did could be worth having.  Or you may need an
exact copy of the training or testing set values.  Or you may need lots
of runs where you average the results using another program.  To turn on
the making of a copy use the format command to turn on the copying
process:

f c+

and to turn it off use:

f c-

The text is written to the file copy.

An Alphabetical Listing of the Commands

   The following listing is designed to give you an idea of the set of
commands available.  Details are given in later sections.

a <number>       sets the momentum parameter, alpha
a <options>      the algorithm command
c                clear the network
ci               clear and initialize the network
d <options>      set delta-bar-delta parameters
e <number(s)>    set the learning rate eta
f <options>      lots of formatting options
h <string>       gives more help with certain options
l <layer>        list the values of units on that layer
m <numbers>      make a network
o <number>       list the output targets of the training set pattern
p <options>      list information about training patterns
q                quit
qp               set quickprop parameters
r <options>      run the training algorithm
rt <options>     read the training set patterns
rw <filename>    read the weights
rx <filename>    read the extra training set patterns
s <seeds>        seed value
sb <real>        set bias weights
se <filename>    save almost everything
sw <filename>    save weights
t <options>      list testing file statistics of various sorts
t <real>         tolerance per output unit that must be met
tf <filename>    gives the file name with testing patterns
tr <int>         special test for a recurrent network
trp <int>        special test for a recurrent network
w <layer> <unit> list weights leading into unit

The Summary Line

   The default setting for the summaries produces up-to-date statistics
on the error and on how many patterns are correct.  Here are several
lines of summaries from a problem that has training data and test data
for a classification problem:

   10      0.00 %  49.04 % 0.47087       0.00 %  62.50 % 0.38234 
   20      0.00 %  73.08 % 0.38584       0.00 %  77.88 % 0.38108 
   30      2.88 %  76.92 % 0.35043       4.81 %  79.81 % 0.33285 

The first column is of course the number of iterations, the next column
gives the percentage of training patterns that are correct based on the
tolerance.  The next column gives the percentage of correct training
patterns based on the maximum value of the output units.  The next
column is the average absolute value of the error per output unit.  Note
that many other programs will report the RMS error.  The columns on the
right list the percentage of test set patterns that are correct based on
tolerance, the percentage correct based on maximum value and finally the
average error on the test set.

   Some CPU time can be saved by altering certain parameter settings
that skip some of the forward passes used to determine the current set
of statistics.  The more often you print the statistics the more time
you can save by altering these parameter settings.  The penalty is that
the statistics will be out of date by one iteration with the quickprop,
delta-bar-delta, supersab and regular periodic update methods, and only
approximate for both the "right" and "wrong" continuous update methods.

   In all the update methods the training set statistics are computed
when the program passes back the error.  However, an update of the
weights is then done and these numbers become out of date.  So if 100
iterations have been done the program only has the statistics from
iteration 99, the values that were true before the weights were
changed.  When it is time to print out the statistics the default is
to do another forward pass through the training set to get up-to-date
statistics.  This can be stopped by setting the off by 1 option in the
format command like so:

f O+

The results that print out will now look like this:

   10 -1   0.00 %  64.42 % 0.42463       0.00 %  62.50 % 0.38234 
   20 -1   0.00 %  73.08 % 0.39058       0.00 %  77.88 % 0.38108 
   30 -1   8.65 %  75.00 % 0.35052       4.81 %  79.81 % 0.33285 

where the string "-1" comes right after the number of iterations done
and of course it means the numbers shown are for the previous iteration.
For the test set patterns one pass through the test set has to be made
to get up-to-date statistics on them, so they are always up to date.
Most of the time you will probably be more interested in the test set
results than in the training set results so setting the off by 1 option
saves a little time and getting off by 1 results on the training set
is not important.

   The situation when you use the "right" and "wrong" continuous update
methods is even more complicated.  After the forward pass for one
pattern is done it is checked to see if it is right and the error is
added in to a sum of errors.  Then weights are changed.  Then another
pattern is processed in the same way.  When the weight changes are done
for this second pattern they may well ruin the right/wrong decision for
all the previous patterns.  Thus the number of right and wrong patterns
and the average error can be off by quite a lot.  With the off by 1
option off ("f O-") the program still does a forward pass to get the
up to date statistics on the training set.  However when the off by 1
option is on the statistics look like this:

   10 -1   1.92 %  66.35 % 0.42093 ?    31.73 %  40.38 % 0.46316 
   20 -1  12.50 %  64.42 % 0.38829 ?    40.38 %  44.23 % 0.53761 
   30 -1  34.62 %  75.00 % 0.30939 ?    50.96 %  63.46 % 0.37728 

where the ? after the training set error flags the fact that the numbers
are very suspect.

   Another option in the program is to do an extra forward pass through
the training set even when there are no statistics to print out.  The
option to give you up to date statistics is:

f u+

If you are using the periodic update method, quickprop or dbd you don't
need "f u+" as the program will report the correct values anyway.

   The one line form of the summary is the default but it can be turned
off using:

f s-

With this you get nothing whatsoever and normally you won't want this
unless perhaps you are producing your own customized output.

5. The Format Command (f)
-------------------------
   There are several ways to input and output patterns, numbers and
other values and there is one format command, `f', that is used to set
these options.  In the format command a number of options can be given
on a single line as for example in:

f b+ ir oc wB

Input Patterns

   The programs are able to read pattern values in two different
formats.  Real numbers follow the C language notation and must be
separated by a space.  The letter `H', used in recurrent networks, is
also allowed, as is the letter `x', which has a default value of 0.5.
The value of `x' can be changed; for example, to make `x' -1 use:

f x -1

Real input format is now the default, but if you use the other format
(the compressed format described below) you can reset the format to
real with:

f ir

   The other format is the compressed format, a format consisting of 1s,
0s and the letters `x' and `H'.  In compressed format each value is one
character and it is not necessary to have blanks between the characters.
For example, in compressed format the patterns for xor could be written
out in either of the following ways:

101      10 1
000      00 0
011      01 1
110      11 0

The second example is preferable because it makes it easier to see the
input and the output patterns.

To change to compressed format use:

f ic

Output of Patterns

   Output format is controlled with the `f' command as in:

f or   * output node values using real (the C %f) format
f oc   * output node values using compressed format
f oa   * output node values using analog compressed format
f oe   * output values with e notation

The first sets the output to real numbers.  The second sets the output
to be compressed mode where the value printed will be a `1' when the
unit value is greater than 1.0 - tolerance, a `^' when the value is
above 0.5 but less than 1.0 - tolerance, a `v' when the value is less
than 0.5 but greater than the tolerance.  Below the tolerance value a
`0' is printed.  The tolerance can be changed using the `t' command (not
a part of the format command).  For example, to make all values greater
than 0.8 print as `1' and all values less than 0.2 print as `0' use:

t 0.2

Of course this same tolerance value is also used to check to see if all
the patterns have converged.  The third output format is meant to give
"analog compressed" output.  In this format a `c' is printed when a
value is close enough to its target value.  Otherwise, if the answer is
close to 1, a `1' is printed, if the answer is close to 0, a `0' is
printed, if the answer is above the target but not close to 1, a `^' is
printed and if the answer is below the target but not close to 0, a `v'
is printed.  This output format is designed for problems where the
output is a real number, as for instance, when the problem is to make a
network learn sin(x).  The format "e" writes out node values using
exponential notation with four places to the right of the decimal point.
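As a sketch in Python of the threshold rules just described (this is only
an illustration of the rules as stated above, not the program's actual
code; boundary cases may differ):

```python
def compressed_symbol(value, tolerance):
    """Map a unit's value to its compressed-format character."""
    if value >= 1.0 - tolerance:
        return '1'
    if value > 0.5:
        return '^'
    if value > tolerance:
        return 'v'
    return '0'

def analog_symbol(value, target, tolerance):
    """Map a value to its analog compressed character: `c' when the
    value is close enough to its target, otherwise 1/0/^/v."""
    if abs(value - target) <= tolerance:
        return 'c'
    if value >= 1.0 - tolerance:
        return '1'
    if value <= tolerance:
        return '0'
    return '^' if value > target else 'v'
```

With the default tolerance of 0.1, a unit value of 0.95 prints as `1'
and 0.6 prints as `^'.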

Breaking up the Output Values

   In the compressed formats the default is to print a blank after every
10 values.  This can be altered using the `B' (for inserting breaks)
option within the format ('f') command.  The use for this command is to
separate output values into logical groups to make the output more
readable.  For instance, you may have 24 output units where it makes
sense to insert blanks after the 4th, 7th and 19th positions.  To do
this, specify:

f B 4 7 19

Then for example the output will look like:

  1 10^0 10^ ^000v00000v0 01000 e 0.17577
  2 1010 01v 0^0000v00000 ^1000 e 0.16341
  3 0101 10^ 00^00v00000v 00001 e 0.16887
  4 0100 0^0 000^00000v00 00^00 e 0.19880

The break option allows up to 20 break positions to be specified.  The
default output format is the real format with 10 numbers per line.  For
the output of real values the option specifies when to print a carriage
return rather than when to print a blank.
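The break mechanism amounts to inserting a blank after each listed
(1-based) position; a small sketch of the idea, not the program's code:

```python
def insert_breaks(symbols, breaks):
    """Insert a blank after each listed 1-based position, as the
    `f B' option does for compressed output."""
    out = []
    for i, ch in enumerate(symbols, start=1):
        out.append(ch)
        if i in breaks:
            out.append(' ')
    return ''.join(out)
```

For example, applying breaks 4, 7 and 19 to a 24-character compressed
line reproduces the grouping shown above.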

Pattern Formats

   There are two different types of problems that back-propagation can
handle, the general type of problem where every output unit can take on
an arbitrary value and the classification type of problem where the goal
is to turn on output unit i and turn off all the other output units when
the pattern is of class i.  The xor problem is an example of the general
type of problem.  For an example of a classification problem, suppose
you have a number of data points scattered about through two-dimensional
space and you have to classify the points as either class 1, class 2 or
class 3.  For a pattern of class 1 you can always set up the output:
"1 0 0", for class 2: "0 1 0" and for class 3: "0 0 1", however doing
the translation to bit patterns can be annoying so another notation can
be used.  Instead of specifying the bit patterns you can set the pattern
format option to classification (as opposed to the default value of
general) like so:

f pc

and then the program will read data in the form:

   1.33   3.61   1   *  shorthand for 1 0 0
   0.42  -2.30   2   *  shorthand for 0 1 0
  -0.31   4.30   3   *  shorthand for 0 0 1

and translate it to the bit string form.  To switch to the general form
use "f pg".  Another benefit of the classification format is that when
the program outputs a status line it will also include the percentage of
correct patterns based on the maximum value rather than just on
tolerance.
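The translation the classification format performs can be sketched as
follows (an illustration of the idea, not the program's code):

```python
def class_to_pattern(label, num_classes):
    """Expand a 1-based class label into the one-of-N output
    pattern that `f pc' reads in shorthand form."""
    return [1 if i == label else 0 for i in range(1, num_classes + 1)]
```

So a class 2 pattern in a 3-class problem expands to the bit string
0 1 0.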


Controlling Summaries

   When the program is learning patterns you normally want to have it
print out the status of the learning process at regular intervals.  The
default is to print out a one-line summary of how learning is going
and this is set by using "f s+".  However if you want to customize
exactly what is printed out and you don't want the standard summary, use
"f s-".

Skipping the "running . . ." Message

   Normally whenever you run more training iterations the message,
"running . . ." prints out to reassure you that something is in fact
being done, however this can also be annoying at times.  To get rid of
this message use "f R-" and to bring it back use "f R+".

Ringing the Bell

   To ring the bell when the learning has been completed use "f b+" and
to turn off the bell use "f b-".

Echoing Input

   When you are reading commands from a file it is sometimes worthwhile
to see those commands echoed on the screen.  To do this, use "f e+" and
to turn off the echoing, use "f e-".

Paging

   To set the page size to some value, say, 25, use "f P 25" or to skip
paging use "f P 0".

Making a Copy of Your Session

   To make a copy of what appears on the screen use "f c+" to start
writing to the file "copy" and "f c-" to stop writing to this file.
Ending the session automatically closes this file as well.

Up-To-Date Statistics

   During the ith pass through the network the program will collect
statistics on how many patterns are correct and how much error there is.
It does this so that it will know when to stop the training.  But it
gets these numbers BEFORE the weights are changed in the ith pass.  In
the case of periodic update methods (the periodic, delta-bar-delta,
quickprop and supersab) this is not much of a problem.  If the off by 1
flag is off ("f O-") there is another forward pass done whenever the
statistics are printed out so you get up to date statistics anyway.  If
the off by 1 flag is on ("f O+") you get the string "-1" after the
number of iterations is printed on the summary line.  Getting the
statistics in the off by 1 form is harmless and it saves a little CPU
time.  When the network converges the "-1" flag will not be shown.

   However with the continuous update methods the weights are changed
after each pattern and this skews the statistics gathered by the
training process by quite a lot.  To get an accurate assessment of how
well the training is going when results are printed on the summary line
you either need to have the off by 1 flag set to "f O+" or you need to
set the up to date statistics flag by: "f u+".  The default is to leave
this flag off: "f u-".  Furthermore, if you are training to get an
accurate assessment of how many iterations it takes to learn the
training set you need to set "f u+" (NOT JUST "f O+"!).  The "u+"
setting makes a check after every complete pass through the training
set.  The "f O+" setting only makes a check when it is time to print
the status line.


6. Taking Training and Testing Patterns from Files (rt,rx,tf)
-------------------------------------------------------------
   In the xor example given above the four patterns were part of the
data file and to read them in the following lines were used:

rt {
1 0 1
0 0 0
0 1 1
1 1 0 }

However it is also convenient to take patterns from a file that contains
nothing but a list of patterns (and possibly comments).  To read a new
set of patterns from some file, patterns, use:

rt patterns

To add an extra group of patterns to the current set you can use:

rx patterns

To read in test patterns from say the file, xtest, do the following:

tf xtest

To evaluate all the test patterns without listing them do "t0".  To list
them, use "t".  To list one particular test pattern, say pattern 3, do
"t 3".


7. Saving and Restoring Weights and Related Values (sw,rw,sw+,swe,swem)
-----------------------------------------------------------------------
   Sometimes the amount of time and effort needed to produce a set of
weights to solve a problem is so great that it is more convenient to
save the weights rather than constantly recalculate them.  To save the
weights to the current weights file use "sw".  The weights are then
written on a file called "weights" or to the last file name you have
specified.  The weights file looks like:

59r  m 2 1 1 x aahs aos bh 1.000000 bo 1.000000 Dh 1.000000 Do 1.000000  file = ../xor3.new
 8.926291e+000  1  1 1 to 2 1
-7.945858e+000  1  1 2 to 2 1
 3.898432e+000  2  2 b to 2 1
 5.382575e+000  1  1 1 to 3 1
-4.862383e+000  1  1 2 to 3 1
-1.086713e+001  1  2 1 to 3 1
 7.715632e+000  2  3 b to 3 1

To write the weights the program starts with the second layer, writes
out the weights leading into these units in order with the threshold
weight last, then it moves on to the third layer, and so on.  In
addition to writing out the weights the second column lists whether or
not the weights are in use.  If the weight is in use it is marked with a
1, if it is a bias unit weight it is marked as 2 and if it is not in use
it is marked with a 0.  This is not used in this free version.  The last
4 numbers on each line tell which units the weights run between.  The
first weight listed runs from layer 1 unit 1 to layer 2 unit 1.  The
letter b indicates the weight is a bias unit.  These last 4 values on a
line are ignored when the file is read so in fact if you want to make up
your own weights file you don't need to type them in.  These last four
values are just here for human convenience.  However the inuse values
must be present if you write your own weights file.  And you must use
only one weight per line.
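A sketch of reading one weight line, consistent with the description
above: only the weight value and the in-use flag are significant, the
trailing "from-to" fields being for human readers.

```python
def parse_weight_line(line):
    """Return (weight, inuse_flag) from one line of a weights file.
    The last four fields on a line are ignored, as noted above."""
    fields = line.split()
    return float(fields[0]), int(fields[1])
```

For instance, the first weight line in the example file yields the
value 8.926291 with an in-use flag of 1.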

   To restore these weights type `rw' for restore weights.  At this time
the program reads the header line and sets the total number of
iterations the program has gone through to be the first number it finds
on the header line.  It then reads the character immediately after the
number.  The `r' indicates that the weights will be real numbers
represented as character strings.

   The remaining text on the first line of a weight file is not used by
the restore weights command at this time and it is there to give you a
record of what size and type the network was.  The fact that the rest of
this line is not read by the restore weights program means that before
you read in weights you have to make the proper size network with the
"m" command.  The "m 2 1 1 x" of course means there are 2 units in the
first layer, one in the second, one in the third and the x means there
are extra connections from the input units to the output unit.
Following that the initial command file that was read in is given.

   To save weights to a file other than "weights" you can say: "sw
<filename>", where, of course, <filename> is the file you want to save
to.  To continue saving to the same file you can just do "sw".  If you
type "rw" to restore weights they will come from this current weights
file as well.  You can restore weights from another file by using: "rw
<filename>".  Of course this also sets the name of the file to write to
so if you're not careful you could lose your original weights file.


8. Initializing Weights (c,ci)
------------------------------
   All the weights in the network initially start out at 0 and they are
also set to 0 by using the clear (c) command.  In some problems where
all the weights are 0 the weight changes may cancel themselves out so
that no learning takes place.  Moreover, in most problems the training
process will usually converge faster if the weights start out with small
random values.  To do this use the clear and initialize command as in:

ci 0.5

where the random initial weights will run from -0.5 to +0.5.  If the
value is omitted the last range specified will be used.  The initial
value is 1.
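The effect of "ci r" can be sketched as follows; the program's own
random number generator of course differs, so this only illustrates
the idea:

```python
import random

def clear_and_init(num_weights, r=1.0, seed=0):
    """Sketch of `ci r': replace every weight with a uniform
    random value in the range [-r, +r]."""
    rng = random.Random(seed)
    return [rng.uniform(-r, r) for _ in range(num_weights)]
```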


9. The Seed Value (s)
---------------------
   The initial seed value is set to 0 and this value is as good as any
other; however, networks often do not converge quickly, or at all, with
some sets of initial weights.  To get some other initial random weights
use the seed command as in:

s 7

where the seed is set to 7.  The seed value is of type unsigned.


10. The Algorithm Command (a)
-----------------------------
   A number of different variations on the original back-propagation
algorithm have been proposed in order to speed up convergence and some
of these have been built into these simulators.  These options are set
using the `a' command and a number of options can go on the one line.

Activation Functions

   To set the activation functions use:

a a <char>  * to set the activation function for all layers to <char>.
a ah <char> * to set the hidden layer(s) function to <char>.
a ao <char> * to set the output layer function to <char>.

where <char> can be:

   l  for the linear activation function:  x
   s  for the traditional smooth activation function:
      1.0 / (1.0 + exp(-x))

   The s function is the standard smooth activation function originally
used by researchers and it is still the most commonly used one.  In the
bp program it is implemented by a table look-up (the default); if the
compiler variable LOOKUP is undefined in the file ibp.h, the regular
time-consuming real-valued calculations are done instead.

   The linear activation function gives networks only a very limited
ability to learn patterns, so it is hardly ever used by itself;
however, it is often used in the output layer of networks with 3 or
more layers so that the network can give output values beyond the
range of the other activation functions.  For instance, suppose you
need to train a network to compute some non-linear function and to
produce outputs in the range -10 to 10.  The usual activation
functions are restricted to the range 0 to 1 or -1 to 1, but you can
choose a non-linear function for the network's hidden layers and,
with linear neurons in the output layer, the network can produce
values in the range -10 to 10.
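The two functions, together with the table look-up idea mentioned
above, can be sketched as follows (the table size and range here are
illustrative choices, not the program's actual values):

```python
import math

def linear(x):
    """The l activation function: just x."""
    return x

def sigmoid(x):
    """The s activation function: 1.0 / (1.0 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Sketch of the LOOKUP idea: precompute sigmoid values over a fixed
# range once, then index into the table instead of calling exp.
_RANGE, _SIZE = 16.0, 4096    # illustrative values only
_TABLE = [sigmoid(-_RANGE / 2 + _RANGE * i / (_SIZE - 1))
          for i in range(_SIZE)]

def sigmoid_lookup(x):
    x = max(-_RANGE / 2, min(_RANGE / 2, x))
    i = int((x + _RANGE / 2) / _RANGE * (_SIZE - 1))
    return _TABLE[i]
```

The look-up version trades a little accuracy for speed, which matters
far more on 1996-era hardware than it does today.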


The Derivatives

   The correct derivative for the standard activation function is s(1-s)
where s is the activation value of a unit; however, when s is near 0 or
1 this term gives only very small weight changes during the learning
process.  To counter this problem Fahlman proposed the following
derivative for the output layer:

0.1 + s(1-s)

(For the original description of this method see "Faster-Learning
Variations on Back-Propagation:  An Empirical Study", by Scott E.
Fahlman, in Proceedings of the 1988 Connectionist Models Summer School,
Morgan Kaufmann, 1989.)

   Besides Fahlman's derivative and the original one the differential
step size method (see "Stepsize Variation Methods for Accelerating the
Back-Propagation Algorithm", by Chen and Mars, in IJCNN-90-WASH-DC,
Lawrence Erlbaum, 1990) takes the derivative to be 1 in the layer going
into the output units and uses the correct derivative term for all other
layers.  The learning rate for the inner layers is normally set to some
smaller value.  To set a value for eta2 give two values in the `e'
command as in:

e 0.1 0.01

To set the derivative use the `a' command as in:

a dc   * use the correct derivative for whatever function
a dd   * use the differential step size derivative (default)
a df   * use Fahlman's derivative in only the output layer
a do   * use the original derivative (same as `c' above)
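The four choices can be sketched per unit as follows, where s is the
unit's activation value (an illustration of the rules described above,
not the program's code):

```python
def derivative(s, option, output_layer):
    """Derivative term for the standard activation under the four
    `a d' options: c/o correct, f Fahlman's (output layer only),
    d differential step size."""
    if option in ('c', 'o'):              # correct / original
        return s * (1.0 - s)
    if option == 'f':                      # Fahlman's, output layer only
        return 0.1 + s * (1.0 - s) if output_layer else s * (1.0 - s)
    if option == 'd':                      # differential step size
        return 1.0 if output_layer else s * (1.0 - s)
    raise ValueError(option)
```

Note how for s = 0.9 the correct derivative is only 0.09 while
Fahlman's term gives 0.19, so learning keeps moving near saturation.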

Update Methods

   The choices are the periodic (batch) method, the continuous (online)
method, delta-bar-delta and quickprop.  The following commands set the
update methods:

a uC   * for the "right" continuous update method
a uc   * for the "wrong" continuous update method
a ud   * for the delta-bar-delta method
a up   * for the original periodic update method (default)
a uq   * for the quickprop algorithm


11. The Delta-Bar-Delta Method (d)
----------------------------------
   The delta-bar-delta method attempts to find a learning rate, eta, for
each individual weight.  The parameters are the initial value for the
etas, the amount by which to increase an eta that seems to be too small,
the rate at which to decrease an eta that is too large, a maximum value
for each eta and a parameter used in keeping a running average of the
slopes.  Here are examples of setting these parameters:

d d 0.5    * sets the decay rate to 0.5
d e 0.1    * sets the initial etas to 0.1
d k 0.25   * sets the amount to increase etas by (kappa) to 0.25
d m 10     * sets the maximum eta to 10
d n 0.005  * an experimental noise parameter
d t 0.7    * sets the history parameter, theta, to 0.7

These settings can all be placed on one line:

d d 0.5  e 0.1  k 0.25  m 10  t 0.7

The version implemented here does not use momentum.  The symmetric
versions sbp and srbp do not implement delta-bar-delta.

   The idea behind the delta-bar-delta method is to let the program find
its own learning rate for each weight.  The `e' sub-command sets the
initial value for each of these learning rates.  When the program sees
that the slope of the error surface averages out to be in the same
direction for several iterations for a particular weight the program
increases the eta value by an amount, kappa, given by the `k' parameter.
The network will then move down this slope faster.  When the program
finds the slope changes signs the assumption is that the program has
stepped over to the other side of the minimum and so it cuts down the
learning rate by the decay factor given by the `d' parameter.  For
instance, a d value of 0.5 cuts the learning rate for the weight in
half.  The `m' parameter specifies the maximum allowable value for an
eta.  The `t' parameter (theta) is used to compute a running average of
the slope of the weight and must be in the range 0 <= t < 1.  The
running average at iteration i, a[i], is defined as:

a[i] = (1 - t) * slope[i] + t * a[i-1],

so small values for t make the most recent slope more important than the
previous average of the slope.  Determining the learning rate for
back-propagation automatically is, of course, very desirable and this
method often speeds up convergence by quite a lot.  Unfortunately, bad
choices for the delta-bar-delta parameters give bad results and a lot of
experimentation may be necessary.  If you have n patterns in the
training set try starting e and k around 1/n.  The n parameter is an
experimental noise term that is only used in the integer version.  It
changes a weight in the wrong direction by the amount indicated when the
previous weight change was 0 and the new weight change would be 0 and
the slope is non-zero.  (I found this to be effective in an integer
version of quickprop so I tossed it into delta-bar-delta as well.  If
you find this helps please let me know.)  For more on delta-bar-delta
see "Increased Rates of Convergence" by Robert A. Jacobs, in Neural
Networks, Volume 1, Number 4, 1988.
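The update just described can be sketched for a single weight as
follows; the parameter defaults are the ones used in the examples
above, and the actual weight change would then be -eta * slope (this
is an illustration consistent with the text, not the program's exact
code):

```python
def dbd_step(eta, avg, slope, kappa=0.25, decay=0.5, theta=0.7,
             eta_max=10.0):
    """One delta-bar-delta update of a weight's learning rate.
    avg is the running average of the slope; returns the new
    (eta, avg) pair."""
    if avg * slope > 0:          # same direction: grow eta by kappa
        eta = min(eta + kappa, eta_max)
    elif avg * slope < 0:        # sign change: cut eta by the decay
        eta = eta * decay
    avg = (1.0 - theta) * slope + theta * avg
    return eta, avg
```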


12. Quickprop (qp)
------------------
    Quickprop (see "Faster-Learning Variations on Back-Propagation: An
Empirical Study", by Scott E. Fahlman, in Proceedings of the 1988
Connectionist Models Summer School, Morgan Kaufmann, 1989, or ftp to
archive.cis.ohio-state.edu and look in the directory pub/neuroprose for
the file fahlman.quickprop-tr.ps.Z) may be one of the fastest network
training algorithms.  It is loosely based on Newton's method.

   The parameter mu is used to limit the size of the weight change to
less than or equal to mu times the previous weight change.  Fahlman
suggests mu = 1.75 is generally quite good so this is the initial value
for mu but slightly larger or slightly smaller values are sometimes
better.

   To get the process started quickprop makes the typical backprop
weight change of - eta * slope.  I have found that a good value for the
quickprop eta value is around 1 / n or 2 / n where n is the number of
patterns in the training set.  Other sources often use much larger
values.  In addition Fahlman uses this term at other times.  I had to
wonder if this was a good idea so in this code I've included a
capability to add it in or not add it in.  So far it seems to me that
sometimes adding in this extra term helps and sometimes it doesn't.  The
default is to use the extra term.
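The core of the method can be sketched for one weight as follows,
where slope is the error derivative dE/dw; the program's handling of
the extra -eta * slope term is more involved than shown here, so treat
this only as an illustration:

```python
def quickprop_step(slope, prev_slope, prev_dw, eta=0.5, mu=1.75):
    """One quickprop weight change for a single weight."""
    if prev_dw == 0.0:
        return -eta * slope      # ordinary backprop step to start
    # quadratic (Newton-like) step from the two most recent slopes
    dw = prev_dw * slope / (prev_slope - slope)
    # limit the change to mu times the previous weight change
    limit = mu * abs(prev_dw)
    return max(-limit, min(limit, dw))
```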

   Another factor involved in quickprop comes about from the fact that
the weights often grow very large very quickly.  To minimize this
problem there is a decay factor designed to keep the weights small.
The weight decay is implemented by decreasing the value of the slope
and it is different from the general weight decay that people use and
which is also implemented in this software.  Fahlman recently mentioned
that he now does not use this unless the weights get very large.  I've
found that too large a decay factor can stall out the learning process,
so if your network isn't learning fast enough or isn't learning at all,
one possible fix is to decrease the decay factor.  Note:  in the old
free version the value of the weight decay constant was the value you
enter / 1000 in order to allow small weight decay values in the integer
version; in this version the problem is handled differently so that
what you enter is exactly what you get, not the value divided by 1000.

   I built in one additional feature for the integer version.  I found
that by adding small amounts of noise the time to convergence can be
brought down and the number of failures can be decreased somewhat.  This
seems to be especially true when the weight changes get very small.  The
noise consists of moving uphill in terms of error by a small amount when
the previous weight change was zero.  Good values for the noise seem to
be around 0.005.

   The parameters for quickprop are all set in the `qp' command like
so:

qp d <value>  * set the weight decay factor for all layers to <value>
qp d h 0      * the default weight decay for hidden layer units
qp d o 0.0001 * the default weight decay for output layer units
qp e 0.5      * the default value for eta
qp m 1.75     * the default value for mu
qp n 0        * the default value for noise
qp s+         * the default value is to always include the slope

or a whole series can go on one line:

qp d 0.1 e 0.5 m 1.75 n 0 s+


13. Making a Network (m)
------------------------
   In the simplest form of the make a network command you type an `m'
followed by the number of units in each layer as in:

m 8 4 4 2

Most of the time this type of network is all you will ever need, but
there are others that can be tried and which may sometimes work
better.  One innovation that often speeds up learning is to include
extra connections between the input and output layers.  To get this
type of network you add an x to the end of the m command as in:

m 8 4 2 x

These extra connections are said to be important when the problem to
be solved is almost linear and then the hidden layer units provide some
extra corrections to the output neurons to distort the results from a
purely linear model.

   In the student version, every time you make a network all the
training and testing patterns are thrown out because they are attached
to the network.  (Not true in the pro version.)

   To make a recurrent network with 25 regular input units, twenty
hidden layer units (that are copied to the input layer) and 25 output
units use:

m 25+20 20 25

This means that the first layer will have 45 inputs and the first 25 are
regular input values but the next 20 come from the first hidden layer.
These 20 units are called the short term memory units.  Then there are
20 units in the hidden layer.  This value should match the number of
units given for the short term memory units.  At the moment there is no
check to see that it does.  Finally there are 25 units in the output
layer.  This recurrent network notation also requires a change in the
way training and testing patterns are written down for input into the
program.  For more on this see the next section.
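The copy-down mechanism for a network of the "m 25+20 20 25" form can
be sketched like this, with `forward' standing for any function that
maps a full input vector to its (hidden, output) activations (a sketch
of the idea only):

```python
def run_sequence(forward, patterns, num_hidden):
    """Drive a recurrent net over a sequence: after each pattern
    the hidden activations are copied down to become the next
    pattern's short term memory inputs."""
    stm = [0.0] * num_hidden          # memory starts cleared
    outputs = []
    for p in patterns:
        hidden, out = forward(list(p) + stm)
        outputs.append(out)
        stm = list(hidden)            # copy hidden layer down
    return outputs
```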

14. Recurrent Networks
----------------------
   Recurrent back-propagation networks take values from hidden layer
and/or output layer units and copy them down to the input layer for use
with the next input.  These values that are copied down are a kind of
coded record of what the recent inputs to the network have been and this
gives a network a simple kind of short-term memory, possibly a little
like human short-term memory.  For instance, suppose you want a network
to memorize the two short sequences, "acb" and "bcd".  In the middle of
both of these sequences is the letter, `c'.  In the first case you want
a network to take in `a' and output `c', then take in `c' and output
`b'.  In the second case you want a network to take in `b' and output
`c', then take in `c' and output `d'.  To do this a network needs a
simple memory of what came before the `c'.

   Let the network be a 7-3-4 network where input units 1-4 and output
units 1-4 stand for the letters a-d and the `h' stands for the value of
a hidden layer unit.  So the codes are:

a: 1000
b: 0100
c: 0010
d: 0001

In action, the networks need to do the following.  When `a' is input,
`c' must be output:

   0010     <- output layer

   hhh      <- hidden layer

1000 stm    <- input layer

In this context, when `c' is input, `b' should be output:

   0100

   hhh

0010 stm

For the other string, when `b' is input, `c' is output:

   0010

   hhh

0100 stm

and when `c' is input, `d' is output:

   0001

   hhh

0010 stm

This is easy to do if the network keeps a short-term memory of what its
most recent inputs have been.  Suppose we input a and the output is c:

   0010     <- output layer

   hhh      <- hidden layer

1000 stm    <- input layer

Placing `a' on the input layer generates some kind of code (like a hash
code) on the 3 units in the hidden layer.  On the other hand, placing
`b' on the input units will generate a different code on the hidden
units.  All we need to do is save these hidden unit codes and input them
with a `c'.  In one case the network will output `b' and in the other
case it will output `d'.  In one particular run inputting `a' produced:

     0  0  1  0

  0.993 0.973 0.020

 1  0  0  0  0  0  0

When `c' is input the hidden layer units are copied down to input to
give:

        0  1  0  0

    0.006 0.999 0.461

0  0  1  0  0.993 0.973 0.020

For the other pattern, inputting `b' gave:

    0  0  1  0

  0.986 0.870 0.020

0  1  0  0  0  0  0

Then the input of `c' gave:

          0  0  0  1

      0.005 0.999 0.264

0  0  1  0  0.986 0.870 0.020

   This particular problem can be set up as follows:

m 7 3 4
s 7
ci
t 0.2
rt {
1000 H   0010
0010 H   0100

0100 H   0010
0010 H   0001
}

where the first four values on each line are the normal input.  The H
codes for however many hidden layer units there are.  The last four
values are the desired outputs.

   By the way, this simple problem does not converge particularly fast
and you may need to do a number of runs before you hit on initial values
that will work quickly.  It will work more reliably with more hidden
units.

   Rather than using recurrent networks to memorize sequences of letters,
they are probably more useful for predicting the value of some variable
at time t+1 given its value at t, t-1, t-2, ... .  A very simple example
of this is to produce the value of sin(t+1) given a recent history of
inputs to the net.  Given a value of sin(t) the curve may be going up or
down and the net needs to keep track of this in order to correctly
predict the next value.  The following setup will do this:

m 1+5 5 1
f ir
a aol dd uq
qp e 0.02
ci
rt {
   0.00000  H   0.15636
   0.15636  H   0.30887
   0.30887  H   0.45378

   . . .

  -0.15950  H  -0.00319
  -0.00319  H   0.15321
}

and in fact it converges rather rapidly.  The complete set of data can
be found in the example file rsin.bp.

   Another recurrent network included in the examples is one designed to
memorize two lines of poetry.  The two lines were:

   I the heir of all the ages in the foremost files of time

   For I doubt not through all the ages one increasing purpose runs

but for the sake of making the problem simpler each word was shortened
to 5 characters giving:

   i the heir of all the ages in the frmst files of

   time for i doubt not thru the ages one incre purpo runs

The letters were coded by taking the last 5 bits of their ASCII codes.
See the file poetry.bp.  

   Once upon a time I was wondering what would happen if the poetry
network learned its verses and then the program was given several words
in the middle of the verses.  Would it pick up the sequence and be able
to complete it given 1 or 2 or 3 or n words?  So given for example, the
short sequence "for i doubt" will it be able to "get on track" and
finish the verse?  To test for this there is an extra pair of commands,
tr and trp.  Given a test set (which should be the training set) they
start at every possible place in the test set, input n words and then
check to see if the net produces the right answer.  For this example I
tried n = 3, 4, 5, 6 and 7 with the following results:

[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 3
 TOL:  81.82 %  ERROR: 0.022967
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 4
 TOL:  90.48 %  ERROR: 0.005672
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 5
 TOL:  90.00 %  ERROR: 0.005974
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 6
 TOL: 100.00 %  ERROR: 0.004256
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 7
 TOL: 100.00 %  ERROR: 0.004513

So after getting just 3 words the program was 81.82% right in predicting
the next word to within the desired tolerance.  Given 6 or 7 words it
was getting them all right.  The trp command does the same thing except
it also prints the final output value for each of the tests made.


15. Miscellaneous Commands
--------------------------
   Below is a list of some miscellaneous commands, a short example of
each and a short description of the command.


!   Example: ! ls

Anything after `!' will be passed on to the OS as a command to execute.
An ! followed immediately by a carriage-return will repeat the last
command sent to the OS.

l   Example: l 2

Entering "l 2" will print the values of the units on layer 2, or
whatever layer is specified.

sb  Example: sb -3

Entering "sb -3" will set the bias unit weight to -3.  In the symmetric
versions the weight will be frozen at this value while in the regular
versions it will only be the initial value and should be set after the
other weights are initialized.


16. Limitations
---------------
   Weights in the ibp and sibp programs are 16-bit integer weights where
the real value of the weight has been multiplied by 1024.  The integer
versions cannot handle weights less than -32 or greater than 31.999.
The weight changes are all checked for overflow but there are other
places in these programs where calculations can possibly overflow as
well and none of these places are checked.  Input values for the integer
versions can run from -31.992 to 31.999.  Due to the method used to
implement recurrent connections, input values in the real version are
limited to -31992.0 and above.
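The fixed-point representation just described can be sketched as
follows (a 1024-scaled 16-bit value, matching the ranges quoted above;
the C code of course works with integer types directly):

```python
def to_fixed(w):
    """Scale a real weight by 1024 into a 16-bit integer; weights
    outside roughly -32..31.999 overflow, as noted above."""
    i = int(round(w * 1024))
    if not -32768 <= i <= 32767:
        raise OverflowError("weight out of 16-bit range")
    return i

def from_fixed(i):
    """Recover the real weight from its 16-bit representation."""
    return i / 1024.0
```

The largest representable weight, 32767 / 1024, is where the 31.999
figure comes from.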


17. The Pro Version Additions
-----------------------------
   This section lists the additions to the pro version at this time.
For a more detailed and more up-to-date description see the online pro
version manual at:

http://www.mcs.com/~drt/probp.html

The additional commands are:

ac <units>      add a weight connection between the units
ah <layer>      add a hidden unit to <layer>
b               benchmarking
i <filename>    read input from the file
k <numbers>     give the network a kick
n <options>     dynamic network building parameters
ofu <unit>      turn off a unit
onu <unit>      turn on a unit
ofw <weight>    turn off a weight
onw <weight>    turn on a weight
pw <number>     prune weights
rp              set rprop parameters
s <seeds>       set multiple seed values
ss <options>    set SuperSAB parameters
swem <option>   save weights every minimum flag
sw+             increment the weight file suffix
to              overall tolerance to be met (not per pattern, as with t)
u               the same as p but for recurrent classification problems
v               the same as t but for recurrent classification problems

Benchmarking allows you to make multiple runs of a problem and find the
mean, standard deviation and average CPU time to converge.  You can also
use it to average the outputs of multiple runs and thereby possibly get
a better overall answer.

You can make networks in a cascade type of architecture.  You can make
a new network with a different number of hidden layer units without
losing the training and testing patterns.  You can add hidden layer
units as the network is trained.  You can turn on and off individual
units.


The additional options:

a bh <value>     set the hidden layer bias unit value
a bo <value>     set the output layer bias unit value
a Dh <value>     set the hidden layer sharpness/gain
a Do <value>     set the output layer sharpness/gain
a wd <value>     weight decay
f t <reals>      set target values for classification problems
f wR             saves all weight parameters
f wb             saves weights as binary
f wB             saves all weight parameters as binary
pm               print confusion matrix for training set
tm               print confusion matrix for test set

The activation functions available are:

<char>                  Function                           Range

   a    an efficient approximation of t            [-0.96016..0.96016]
   g    Gaussian function, exp(-(D*x)**2)                 (0..+1]
   l    linear function, D*x                            (-inf..+inf)
   p    piecewise linear version of s                     [0..+1]
   s    standard sigmoid, 1 / (1 + exp(-D*x))             (0..+1)
   t    tanh(D*x)                                         (-1..+1)
   x    D * x / (1 + |D * x|)                             (-1..+1)
   y    (D * x / 2) / (1 + |D * x|) + 0.5                 (0..+1)
   z    (D*x)**2 for x >= 0 and -(D*x)**2 for x < 0     (-inf..+inf)
