
  Program: extract -- a Portable Game Notation (PGN) extractor.
  Copyright (C) 1994 David Barnes
  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 1, or (at your option)
  any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

  David Barnes may be contacted as D.J.Barnes@ukc.ac.uk

Background
==========
These are the sources for a simple program to extract selected games
from a PGN format data file. There are several ways to specify the
criteria on which to extract: textual move sequences, the position
reached after a sequence of moves, and information in the tag fields.
The program includes a semantic analyser which will report errors in
game scores.

This version is significantly different from previous releases in the
inclusion of semantic analysis.  It also introduces the facility to
search for games based on the position reached at the end of a move
sequence (a positional variation) as opposed to simply a textual
sequence of moves (a textual variation).  There is also an option to
attempt to detect duplicate games.  Extracted games may be written out
either including or excluding comments, NAGs and variations.

If you find the program useful, or if you modify it in ways that you
think would be useful to other people, then I would be pleased to hear
from you.

The Files
=========
    COPYING   -- GNU General Public License
    README    -- this file.
    Makefile  -- see Compilation, below.
    Makefile.dos -- see Compilation, below.
    bool.h    -- type definitions.
    decode.c  -- functions for decoding the text of a move.
    defs.h    -- definitions relating to boards.
    extract.exe -- DOS executable.
    lists.c   -- functions for holding the extraction criteria.
    lists.h   -- type and prototype definitions for lists.c
    map.c     -- functions for implementing move semantics.
    map.h     -- more prototypes for the shared functions.
    moves.c   -- functions for collecting moves and variations
    pgn.l     -- the Lex lexical analyser.
    pgn.y     -- the YACC syntax analyser.
    protos.h  -- prototypes for the shared functions.
    taglist.h -- constants for pgn.y
    typedef.h -- type definitions.

Compilation
===========
The sources include a Makefile for the UNIX make program as well as one
for Borland's make. Compilation of the sources requires either UNIX Lex
and YACC or DOS versions of these, such as Flex and Bison, which are
readily available on most archive sites.

Usage and Arguments
===================
The files from which games are to be extracted may either be listed on the
command line:

	extract [flags] file.pgn [file.pgn ... ]

or listed, one per line, in a file whose name is given after the -f flag.

        extract [flags] -ffile_list

The extraction criteria are normally specified with the -v, -x, and/or -t
flags.  The first two select games by opening variation; the third selects
games by strings in the tag fields.  The -T flag is a command-line
alternative to -t, but it is rather long-winded, so -t is preferred.

There follows a brief summary of the different flags taken by extract, such
as is produced by the -h flag.  You are strongly advised to read the remainder
of this file, however, before attempting to use extract in earnest.

    Flags:
      -7 -- output the seven tag roster for each game. Other tags are lost.
      -aoutputfile -- the file to which extracted games are to be appended.
	    See -o flag for overwriting an existing file.
      -C -- don't include comments in the output. Ordinarily these are retained.
      -dduplicatefile -- the file to which duplicate extracted games are to be written.
      -D -- don't output duplicate extracted game scores.
      -ffile_list  -- file_list contains the list of PGN files to be
            searched - one per line. If this argument is used, there
            should be no additional file arguments on the command line.
      -h -- print this message.
      -? -- print this message.
      -llogfile  -- Write all diagnostics to logfile rather than stderr.
      -N -- don't include NAGs in the output. Ordinarily these are retained.
      -ooutputfile -- the file to which extracted games are to be written.
	    Any existing contents of the file are lost (see -a flag).
      -p -- match permutations of the variations.
      -r -- report any errors but don't extract.
      -s -- silent mode: don't report each game as it is extracted.
      -S -- Use a simple soundex algorithm for tag matches. If used, this
	    option must precede the -t or -T options.
      -ttagfile -- file of player, date, or result, extraction criteria.
      -Tcriterion -- player, date, or result, extraction criteria.
      -vvariations -- the file variations contains the textual lines of interest.
      -V -- don't include variations in the output. Ordinarily these are retained.
      -wwidth -- set width as an approximate line width for output.
      -xvariations -- the file variations contains the lines resulting in
	     positions of interest.

Error messages and verbose reports are written to the standard error output
unless the -l flag is used.

Variations
==========
There are two distinct ways to specify variations of interest:
positional variations (-x flag) and textual variations (-v flag).
These have quite different uses and the positional variations will be
quite new to users of previous versions of this program.

The major difference between the two is that positional variations
specify a complete move sequence whose end position is the primary
point of interest, whereas textual variations allow incomplete and
fuzzy move sequences to select games.

Positional Variations
=====================
The variations in which you are interested should be placed in a file
whose name is supplied with the -x flag.  Each variation should be
listed on a single line, without move numbers.  For instance,

    e4 c5 Nf3 d6 d4 cxd4 Nxd4 Nf6 Nc3 a6

indicates that you wish to pick up all games reaching the Najdorf
Variation of the Sicilian Defence.  Games reaching the end position of
this sequence are selected regardless of the route that was taken to
reach it.  This allows various transpositional sequences to be
specified by quoting just one line to reach the required point.
Therefore, games employing the following move order will be picked up
by quoting the line above.

    e4 c5 Nc3 d6 Nge2 Nf6 d4 cxd4 Nxd4 a6

A position is considered to match a required variation if it generates the same
hash value. No attempt is made to actually examine the state of the board. There
is, therefore, the potential for false hits but in my usage of extract I have not
found this to be a problem.
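
As a rough illustration of why one line is enough to catch
transpositions, the following Python sketch hashes the position reached
at the end of a move sequence.  It is not the program's own code: the
Zobrist-style hash, the from-to move notation (e2e4 rather than e4) and
all of the names are assumptions made for the example, and castling, en
passant and promotion are ignored.

```python
import random

FILES = "abcdefgh"
BACK_RANK = "RNBQKBNR"

def initial_board():
    """Return the chess starting position as {square: piece}."""
    board = {}
    for f, piece in zip(FILES, BACK_RANK):
        board[f + "1"] = piece            # White pieces are upper case
        board[f + "2"] = "P"
        board[f + "7"] = "p"              # Black pieces are lower case
        board[f + "8"] = piece.lower()
    return board

# One fixed random key per (square, piece) pair; seeded for repeatability.
_rng = random.Random(1994)
ZOBRIST = {(f + r, p): _rng.getrandbits(64)
           for f in FILES for r in "12345678" for p in "PNBRQKpnbrqk"}

def position_hash(moves):
    """Play from-to moves such as 'e2e4' and hash the final position."""
    board = initial_board()
    for move in moves.split():
        board[move[2:4]] = board.pop(move[0:2])   # a capture just overwrites
    h = 0
    for square, piece in board.items():
        h ^= ZOBRIST[(square, piece)]
    return h

# The two Najdorf move orders quoted above, and a Nimzo-Indian for contrast.
NAJDORF = "e2e4 c7c5 g1f3 d7d6 d2d4 c5d4 f3d4 g8f6 b1c3 a7a6"
TRANSPOSED = "e2e4 c7c5 b1c3 d7d6 g1e2 g8f6 d2d4 c5d4 e2d4 a7a6"
NIMZO = "d2d4 g8f6 c2c4 e7e6 b1c3 f8b4"
```

Both Najdorf move orders give the same value from position_hash,
because the hash depends only on where the pieces finish.  Two different
positions could in principle share a hash value, which is the source of
the potential false hits mentioned above.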

Textual Variations
==================
The variations in which you are interested should be placed in a file
whose name is supplied with the -v flag.  Each variation should be
listed on a single line, without move numbers.  For instance,

    e4 c5 Nf3 d6 d4 cxd4 Nxd4 Nf6 Nc3 a6

indicates that you wish to pick up all games following the normal move
order of the Najdorf Variation of the Sicilian Defence, and

    d4 Nf6 c4 e6 Nc3 Bb4

that you are interested in Nimzo-Indian games.

The extractor in this case is purely textual in nature.  It works by
string matching on moves, so there is no facility for picking up
transpositions automatically.  This means that if you also wanted to
recognise the following as a Najdorf, you would have to add this line
to the variations file in addition to that given above:

    e4 c5 Nc3 d6 Nge2 Nf6 d4 cxd4 Nxd4 a6

However, because of the way in which the matching is done, it is
possible to specify slight alternatives in the way in which individual
moves are written.  Notational alternatives for a single move are
written separated from each other by a non-move character, such as |.
This variation specifies both the shorter and longer ways of writing
the captures in a Najdorf:

    e4 c5 Nf3 d6 d4 cxd4|cd Nxd4|Nd4 Nf6 Nc3 a6

Because of the textual matching of variations specified using this
flag, there is little point in using this form in preference to the
positional matching if you are only interested in finding games that
reach a particular position.  The real use for this form is when you
wish to pick up games in a more general way.  For instance, the
character * may be used in place of any move to indicate that you don't
care what was played at that point.  So the following:

    * b6

means that you are interested in all games in which Black replied
1...b6 regardless of White's first move.  This form is only possible
with textual variations.

In addition, the character ! may be used in front of any move to
indicate that you wish to disallow particular moves from matching at
this point.  For instance, if you want to find Sicilian games where
White did not reply with Nf3 at move 2 you would specify:

    e4 c5 !Nf3

If you wished to disallow 2.Ne2 as well then

    e4 c5 !Nf3|Ne2

does the job.  (Adding parentheses makes no difference as the ! is
applied to all of the following string.)

Care should be taken combining ! * and variation permutations (see -p
below).  Disallowed moves take precedence over * moves.  If a single
disallowed move is found in a game within the length of the variation,
that game is excluded.  This was the most sensible interpretation that
I could find to place on this usage.
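
The in-order matching rules described in this section can be sketched
in Python as follows.  This is only an illustration, not the program's
own code: the function names are invented, | is taken as the separator
between alternatives, and the treatment of games shorter than the
variation is an assumption.

```python
def move_matches(pattern, move):
    """Test one game move against one entry of a textual variation."""
    if pattern == "*":                       # don't-care: anything matches
        return True
    if pattern.startswith("!"):              # none of these may be played
        return move not in pattern[1:].split("|")
    return move in pattern.split("|")        # any one of these may be played

def game_matches(variation, game_moves):
    """True if the game's opening moves match the variation, in order."""
    patterns = variation.split()
    if len(game_moves) < len(patterns):      # assumption: too short, no match
        return False
    return all(move_matches(p, m) for p, m in zip(patterns, game_moves))
```

For instance, game_matches("e4 c5 !Nf3|Ne2", ["e4", "c5", "Nc3"]) is
true, whereas the same variation rejects a game continuing 2.Ne2.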

Textual Variation Permutations
==============================
Normally, textual variations are matched against the moves of the game
in the order in which they are listed.  However, the -p flag requests
that all permutations of a variation are tried against the moves of a
game.  This cuts down on the number of separate transpositional
orderings that it is necessary to list, at the cost of slower matching
of each game.  With the -p flag, the following could be used to look
for Nimzo-Indian games:

    d4 Nf6 c4 e6 Nc3 Bb4

and it will pick up games which start as:

    1. c4 Nf6 2. Nc3 e6 3. d4 Bb4

for instance.  The don't-care symbol (*) may be used with this flag, so

    d4 * c4 * Nc3 *

will pick up Nimzo-Indian, Grunfeld, King's Indian, etc. defences.
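
A naive way to picture the effect of -p is to try every ordering of the
variation against the opening moves of the game, as in the Python
sketch below.  This is an illustration only: the names are invented, it
handles plain moves and * but not ! or alternatives, and it is an
assumption that orderings are taken over the whole line rather than
over White's and Black's moves separately.

```python
from itertools import permutations

def matches_in_order(patterns, moves):
    """Match patterns against moves pairwise; '*' matches any move."""
    return all(p == "*" or p == m for p, m in zip(patterns, moves))

def matches_any_order(variation, game_moves):
    """True if some ordering of the variation matches the game's opening."""
    patterns = variation.split()
    opening = game_moves[:len(patterns)]
    if len(opening) < len(patterns):         # assumption: too short, no match
        return False
    return any(matches_in_order(order, opening)
               for order in permutations(patterns))
```

The number of orderings grows factorially with the length of the
variation, which is one way to see why matching is slower with -p.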

Duplicate Games (-d and -D)
===========================
If either the -d or -D flag is used, extract attempts to recognise
duplicate extracted games.  Two games are considered to be duplicates
if the hash values of their end positions match and an additional
cumulative hash value also matches.  This is not guaranteed to be exact
but it gives a good approximation.  With -d the second and subsequent
copies of a game are written to the specified file.  With -D duplicate
games are suppressed from the output.  The two flags are therefore
mutually exclusive.

Games are only considered to be duplicates on the basis of the moves
played.  A game considered to be a duplicate may contain annotations
and variations not present in the copy found earlier, so it might be
necessary to do some swapping around to obtain those you really wish to
retain.  You should, therefore, use the -D flag with caution if you are
trying to reorganise your master collection rather than selecting out
specific games for examination.

It is a feature of extract that if no extraction criteria are specified
then all games are assumed to be required.  This option then provides
the ability to remove duplicates from a large game file:

    extract -ddupes.pgn -ounique.pgn file.pgn

will extract from file.pgn the unique set of games into unique.pgn and
the duplicates to dupes.pgn.
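
The idea of keying each game on a pair of hash values can be sketched
in Python as below.  This is not the program's own code.  In particular,
the way the cumulative value is formed here (a running sum of the hash
of every position in the game) is an assumption, as are the SHA-256
based position hash, the from-to move notation and all of the names.

```python
import hashlib

FILES = "abcdefgh"

def initial_board():
    """Return the chess starting position as {square: piece}."""
    board = {}
    for f, piece in zip(FILES, "RNBQKBNR"):
        board[f + "1"], board[f + "2"] = piece, "P"
        board[f + "8"], board[f + "7"] = piece.lower(), "p"
    return board

def board_hash(board):
    """Hash a position to a 64-bit integer, independent of dict order."""
    text = ",".join(f"{sq}{pc}" for sq, pc in sorted(board.items()))
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

def game_key(moves):
    """Return (end-position hash, cumulative hash) for from-to moves."""
    board = initial_board()
    cumulative = 0
    for move in moves.split():
        board[move[2:4]] = board.pop(move[0:2])   # a capture just overwrites
        cumulative = (cumulative + board_hash(board)) % 2**64
    return board_hash(board), cumulative

def split_duplicates(games):
    """Split games into first occurrences and later copies."""
    seen, unique, dupes = set(), [], []
    for game in games:
        key = game_key(game)
        (dupes if key in seen else unique).append(game)
        seen.add(key)
    return unique, dupes
```

In this sketch the end-position hash alone would treat two games that
transpose into the same final position as copies of each other; the
cumulative value is what keeps them apart.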

Tag Criteria in a File (-t)
===========================
There are two ways to specify that you wish to use information in the
tag fields as extraction criteria.  The -t flag takes a file name
argument and is the preferred method because of its ease of use.  In
the file are listed tag name and string pairs corresponding to the
extraction criteria you wish to use in addition to, or as an
alternative to the variation lists.  Each line of this file should be
of the form:

    PGN-Tag-name Tag-string

and it is very important that the file should be terminated by a
non-tag name token (* is fine).

For instance:

    White "Tal"
    *

This requests that only those games where Tal had the White pieces are
to be considered for extraction.  If you wish to limit the year in
which those games were played you might list:

    White "Tal"
    Date "1962"
    *

Multiple pairs with the same tag name are or-ed together so:

    Date "1960"
    Date "1961"
    Date "1962"
    *

will select all games from the three listed years.  In general,
different tags are and-ed together, so:

    White "Tal"
    Black "Fischer"
    Date "1962"
    Result "1-0"
    *

selects only those games that Tal won with the White pieces against
Fischer in 1962.

It is important to note that:

    White "Tal"
    Black "Tal"
    *

does not find all games played by a Tal, but only those that he played
against himself.  In order to overcome this, I have introduced a
non-PGN tag that should only be used in the extraction criteria file:

    Player "Tal"
    Date "1962"
    *

finds all games from 1962 in which Tal had either the White pieces or
the Black.  In effect, the White and Black player lists are or-ed
together rather than and-ed using this pseudo-tag.

A criterion matches a tag when it is a prefix of the complete tag
string.  So

    Player "Karpov"
    *
    
would match:

	[White "Karpov"]
	[White "Karpov, A"]
	[White "Karpov, An"]
	[White "Karpov, Alexander"]

but not

	[White "Anatoli Karpov"]

See the -S flag for a soundex facility with tag matching.
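
The rules of this section (prefix matching, or-ing of repeated tags,
and-ing of different tags, and the Player pseudo-tag) can be summarised
in a short Python sketch.  It is an illustration rather than the
program's own code: the function names are invented, and only * is
recognised as the terminating token.

```python
def parse_criteria(text):
    """Read 'Tag "string"' lines up to the terminating * into a dict."""
    criteria = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "*":                      # terminator ends the list
            break
        name, _, value = line.partition(" ")
        criteria.setdefault(name, []).append(value.strip().strip('"'))
    return criteria

def game_selected(tags, criteria):
    """Apply the criteria to one game's tags, given as {name: string}."""
    for name, wanted in criteria.items():
        if name == "Player":                 # pseudo-tag: either colour
            actual = [tags.get("White", ""), tags.get("Black", "")]
        else:
            actual = [tags.get(name, "")]
        # Same tag name: any listed prefix will do (or-ed together).
        if not any(a.startswith(w) for a in actual for w in wanted):
            return False                     # different tags are and-ed
    return True
```

With the criteria Player "Karpov", a game tagged [White "Karpov, A"] is
selected and one tagged [White "Anatoli Karpov"] is not, as in the
lists above.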

Tag Criteria on the Command Line (-T)
=====================================
The alternative -T argument is for use where command line arguments are
more convenient -- perhaps where extract is being invoked from another
program.  The tag coverage is not as extensive as with a tag file, and
the syntax is rather cumbersome.  It is used as follows: after the -T
comes a single letter from the limited set [bdprw] to select string
prefixes of the tag fields of a game.  For instance:

  -TwPlayer -- Extract games where Player has the White pieces.
  -TbPlayer -- Extract games where Player has the Black pieces.
  -TpPlayer -- Extract games where Player has either colour.
  -TdDate   -- Extract games played on Date.
  -TrResult -- Extract games with result Result.

For example,

    extract -TwTal -TbFischer file.pgn

would extract games from file.pgn in which Tal had the White pieces and
Fischer the Black.

Criteria of the same tag type are or-ed together, so

    extract -Tr1-0 -Tr0-1 file.pgn

extracts only decisive games.

Criteria of different tag types are and-ed together so

    extract -TwTal -Td1962 -Tr1-0 file.pgn

would extract only those games in which Tal played with the White
pieces in 1962 and won.
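
One way to read the -T syntax is as a compact spelling of the tag
name and prefix pairs used with -t.  The Python sketch below makes that
mapping explicit.  It is an illustration only, not how the program
parses its arguments; the names are invented, and mapping p to the
Player pseudo-tag is an assumption based on the description of -t.

```python
# Selector letter after -T, and the tag it is taken to stand for.
TAG_FOR_LETTER = {"w": "White", "b": "Black", "p": "Player",
                  "d": "Date", "r": "Result"}

def criteria_from_args(args):
    """Collect -T<letter><prefix> arguments as {tag name: [prefixes]}."""
    criteria = {}
    for arg in args:
        if arg.startswith("-T") and len(arg) > 3:
            letter, prefix = arg[2], arg[3:]
            if letter not in TAG_FOR_LETTER:
                raise ValueError("unknown -T selector: " + letter)
            # Repeated selectors accumulate, matching the or-ing rule.
            criteria.setdefault(TAG_FOR_LETTER[letter], []).append(prefix)
    return criteria
```

Under this reading, the arguments -TwTal -Td1962 -Tr1-0 correspond to a
tag file containing White "Tal", Date "1962" and Result "1-0".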

Soundex Matching (-S)
=====================
There is a simple soundex algorithm available that attempts soundex
matches on White, Black, Site, Event, and Annotator tags if the -S flag
is used in combination with either -t or -T.  The -S flag should
precede all -t and -T arguments.  It should be noted that the soundex
matching does produce false matches.  The algorithm was supplied by
John Brogan (jwbrogan@Netaxs.com).
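
This file does not describe the algorithm itself, so the Python sketch
below substitutes the classic American Soundex coding purely to show
the kind of matching involved.  It should not be read as the code
supplied with the program, and the function name is invented.

```python
# Consonant groups of classic Soundex; vowels, h, w and y carry no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"],
                                start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    """Return the four-character classic Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()              # first letter is kept as is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:            # skip repeats of the same code
            result += code
        if ch not in "hw":                   # h and w do not break a repeat
            prev = code
    return (result + "000")[:4]              # pad or cut to four characters
```

Under this coding "Karpov" and "Karpoff" both give K611 and so would
match each other, while "Kasparov" gives K216.  Unrelated names can
share a code too, which is the kind of false match warned about above.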

Limitations
===========
The moves, variations, and commentary of each game are held internally
and reformatted when a game is extracted, rather than reproducing the
original text of the game source.

Tags not listed in taglist.h are not copied.

Duplicate detection is not guaranteed to be exact, although I have not
yet come across any failures.

Acknowledgements
================
I would like to thank all those who used the program and made
suggestions for things to add.  In particular, thanks to Michael Kerry
whose help led to better determination of game boundaries in earlier
versions, and V. Armando Sole (SOLE@EMBL-Hamburg.de) whose own filter
program was the inspiration for adding textual variation permutations.
John Brogan (jwbrogan@Netaxs.com) suggested adding the ! notation to
the variation file and provided the spur for duplicate detection.  He
also provided the code for soundex matching.  Thanks, also, to all of
those people on the net who provide games in PGN format.  Finally,
thanks of course to Steven Edwards (sje@world.std.com) for his work on
developing the PGN standard.

David Barnes
D.J.Barnes@ukc.ac.uk
Date of this version: 8th September 1994.

Changes to the Original Release
===============================
7th Sep 1994: Added -D flag.
6th Sep 1994: Added -C and -V flags and soundex matching.
5th Sep 1994:
    Integrated the positional variation code from a separately
    developed program.
    Added -N flag.
    Added ! to the textual variation syntax.
    Removed the writing to extract.pgn that was present in an
    earlier unreleased version.
    Added -d flag.
8th July 1994:
    Added -o flag.
    Discarded writing to standard output in DOS version because of
    extensive problems trying to make this work with redirected
    output.  Instead, output is written to the file extract.pgn.
6th July 1994:	 Added -7 flag.
9th May 1994:	 Added -p flag for variation permutations.
6th May 1994:	 Added * as a don't-care move in variations files.
26th April 1994: Added the -t flag for files of extraction criteria.
25th April 1994: Added the -T flag for extraction criteria.
22nd April 1994: Added the -f flag for handling lists of PGN files.
13th April 1994:
    Cleaned up the game-length determination by reading/writing files
    in binary-mode.
    Added -a flag for appending to existing .pgn files.
    Added multiple input files.
    Made verbose output the default behaviour.
