#ifndef RXPLIB_INCLUDED
#define RXPLIB_INCLUDED

/*     Regular expression library header file
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       vv. 1.01            M.I.Barlow 3-10-97
 */

/*{{{  ___________________________________________________ SUPPORTED SYNTAX ___ */
/*
   REGULAR EXPRESSIONS

       The library associated with this header file supports the following
       regular expression elements, drawn from the Unix egrep and sed
       utilities. When library features extend these, lines in the
       description are prefixed with a dash; they may be turned off
       by using the function 'rxp_configure' (see below), as may
       any of the features listed here:

       LITERAL CHARACTERS and ESCAPES

	   Any ASCII character except . * + ? [ ( ) | and \ matches itself.
	   Each of the special characters may be prefixed with \, and will
	   then match itself literally. Both forms are "atomic", for closure
	   purposes (see below). In addition, the normal C string escapes
	   \b \t \r \n and \x<hh> (where <h> is a hex digit) will have
	   their usual meanings provided that they have not been disabled.
	   Note that octal escapes are *not* supported because I don't like
	   them :-I. Escapes in general cannot be disabled, as this would
	   prevent them being used to protect the special characters
	   associated with each of the other features when the user
	   wants to use them literally.

       THE WILD CHARACTER

	   The dot . matches any single ASCII character and is atomic.

       CHARACTER SETS and their COMPLEMENTS

	   Enclosed by brackets, Eg: [ABC], sets match any character in the
	   set, i.e. either A or B or C. Sets may contain ranges, such as
	   0-9, which denote the given endpoints and all characters between
	   them. If a literal ] is required it must be the first character in
	   the set; if - is needed it must be first or last. A complement set
	   has ^ as its first character and matches any character NOT present
	   in the ensuing list. Sets are also atomic.

       ALTERNATION

	   Two or more regular expressions separated by vertical bars | match
	   any string which matches one of the alternative forms given. The
	   given alternatives are assessed from left to right, so if any
	   alternative is an exact leading subset of one to its right it
	   will always be preferred for matching. Alternative lists must be
	   grouped (see below) to be considered atomic.

       GROUPING and CAPTURES

	   One or more regular expression elements enclosed in parentheses ()
	   are considered to constitute an atomic sub-expression, and during
	   matching that portion of the subject string which matches the
	   parenthesized part of the pattern is stored (captured) for later
	   reference. Groups are nestable, and are numbered from 1 to 9 as
	   their opening parenthesis ( appears in the pattern, reading from
	   left to right. They may be referred to subsequently using the
	   atomic elements \1 to \9, which match the captured sub-string.
	   Empty captures match exactly nothing.

       CLOSURES

	   Any atomic sub-expression may be followed by a "closure", and will
	   then match a range of repeated instances of itself SUCH THAT the
	   following element (if any) does not fail to match. The available
	   closures are: * which matches zero or more repeats; + which
	   matches one or more; ? which matches zero or one; \{<n>\} which
	   matches exactly <n> copies; \{<n>,\} which matches <n> or more;
	   \{,<n>\} which matches up to <n> copies; and \{<n>,<m>\} which
	   matches between <n> and <m> copies. When a closure is applied
	   to a group the string captured is the one which fulfils the first
	   match, Eg: "(cat|dog)+\1" matches "dogcatdog", but not "catcatdog".

       ANCHORS

	   An initial caret ^ in a toplevel alternative "anchors" the
	   expression so that it will only be matched by text occurring at
	   the beginning of a line. Similarly, a trailing $ anchors the
	   expression to the end of a line. With both present, a line must
	   match the given regexp exactly. 'Line' in this context refers
	   to a null-terminated string passed to the library's find function,
	   and need not necessarily correspond to a line in a file. If either
	   anchor is intended literally it may be escaped with a \.

       CASE SENSITIVITY

	 - The egrep utility makes no provision for case-insensitive matching
	 - of alphabetic characters other than by using many [sS][eE][tT][sS].
	 - This library extends its syntax by matching anything enclosed in
	 - escaped angle brackets \< and \> in a case-insensitive manner.
	 - This feature is limited to individual sub-expressions within
	 - groups and alternative lists; i.e. both | and ( possess an
	 - implicit leading \>.

   REPLACEMENT STRING MASKS

       Utilities such as sed support replacement of text matching a regexp
       with a string containing references to the captured contents of groups.
       A function is provided to perform this task. Within the replacement
       mask string passed to it the backslash \ again represents an "escape";
       most (including \\) are simply interpreted as the literal following
       character but \0 refers to the whole of the matched text, and \1-\9
       to the captured text fragments which matched groups within it. The
       actual strings are substituted when these are encountered.

       The normal C string escapes \b \t \r \n and \x<hh> will have their
       usual meanings provided that they are enabled.

     - This library extends the syntax of sed by allowing modification of
     - the case of alphabetic characters within captured strings, for
     - example: \-1 forces lowercase, \+1 forces uppercase, \=1 forces
     - an initial capital and \_1 emits a string of blanks of the same
     - length as the captured string. The characters +, -, = and _ are
     - referred to as "modifiers".

     - String slicing is also supported; the following may appear between
     - the \ (and any modifier) and the numeral: (<n>) picks character No.
     - <n> only, (,<n>) characters up to and including <n>, (<n>,) characters
     - from <n> onwards and (<n>,<m>) the slice from <n> to <m> inclusive.
     - Character numbering is zero-based so that, for example if the text
     - which matched was "AbCdEfG" and the replacement mask was "\=(3,6)0"
     - the result would be "Def". Note that slices and modifiers are *not*
     - available to captures within regexps, such as "(Hi|Ho)-de-\1!", as
     - I'm not convinced that there's a need for them.

     - Unknown modifiers result in the literal characters themselves being
     - written into the replacement string. (Unavoidable, sorry.) Malformed
     - slices are reported as an RxpBadRange error.

     - If the original regexp had more than one toplevel alternative the
     - replacement may too; the separator is again the vertical bar |
     - character. If there are fewer alternatives in the replacement than
     - the original regexp the last one is used. This feature allows
     - multiple (disjoint) related searches and replacements to be
     - carried out simultaneously. The vertical bar must be escaped with
     - a backslash \| if it is intended literally.

     - The alternatives within replacement masks may contain a leading
     - format specifier in the form of a \L, \R or \C. These have no effect
     - if the resulting replacement string is longer than \0 would have
     - been, but if it is shorter then \L (left justify) causes it to be
     - padded with trailing spaces to yield the same length, and \R (right
     - justify) provides leading spaces. \C tries to centre it but biases
     - to the left in the case of an odd number of spaces.
*/
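
/* A minimal sketch of the \L, \R and \C padding rule described above,
   kept in a comment so that it adds nothing to this header. The name
   'justify_model' and its arguments are hypothetical; it is not a library
   function. It assumes that "biases to the left" means the odd space is
   placed after the text:

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper, not part of the library: pads 'repl' with spaces
// to 'width' characters (the length \0 would have had) according to
// 'mode', which is 'L', 'R' or 'C'. Returns a malloc'd string, or NULL
// if allocation fails. Strings already 'width' or longer are copied
// unchanged.
char *justify_model(const char *repl, size_t width, char mode)
{
    size_t len = strlen(repl);
    size_t pad = (len < width) ? width - len : 0;
    size_t lead;
    char  *out;

    if (mode == 'R')      lead = pad;        // all spaces in front
    else if (mode == 'C') lead = pad / 2;    // odd space goes behind
    else                  lead = 0;          // 'L': all spaces behind

    out = malloc(len + pad + 1);
    if (out == NULL) return NULL;

    memset(out, ' ', lead);
    memcpy(out + lead, repl, len);
    memset(out + lead + len, ' ', pad - lead);
    out[len + pad] = '\0';
    return out;
}
```

   Under that assumption a two-character replacement centred in a
   five-character match gets one leading and two trailing spaces.
*/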
/*}}}   */
/*{{{  ___________________________________________________________ TYPEDEFS ___ */

/*{{{  RxpPatn */
/* - - - - - - - - - - - an opaque pointer to a regexp pattern's parse tree */

struct RxpTree { int i; }; /* DO NOT USE! this is merely a supplementary
			      dummy type declaration to provide a target
			      for pointers...
			   */

typedef struct RxpTree *RxpPatn;   /* opaque pointer */

/*}}}   */
/*{{{  RxpError */
/* - - - - - - - - - - - - -  enumerated result of this library's functions */

/* N.B: When these values are returned from create_pattern (except for
	RxpOK) all results imply that the pattern tree's memory has
	already been freed for you.
*/

typedef enum
{
    /* results */

    RxpOK       = 0,   /* pattern created / match found / replacement done */
    RxpFail,           /* empty pattern source text / no match seen        */

    /* errors */

    RxpBadSet,         /* malformed set [...] seen                         */
    RxpBadClosure,     /* invalid closure (nothing to apply it to)         */
    RxpBadRange,       /* bad explicit closure range / substitution slice  */
    RxpBigRange,       /* neither n nor m may be > 255 in \{n,m\}          */
    RxpBadGroup,       /* no terminating ) seen after one or more (        */
    RxpBadEscape,      /* solitary \ seen at end of pattern source text    */
    RxpBadHex,         /* \x<hh> with one or both <h> not a hex digit      */
    RxpBadCapture,     /* \0 is not a valid capture specifier in patterns  */

    /* exceptions */

    RxpNullArg,        /* null pointer(s) were passed into the function    */
    RxpAllocFail,      /* a memory allocation failure occurred             */
    RxpCorrupt         /* broken link in tree (heap may be corrupted)      */
}
RxpError;

/*}}}   */
/*{{{  RxpMatch */
/* - - - - - - - -   struct of results from search for a text-pattern match */

typedef struct RxpMatch
{
    int   which;      /* No. of toplevel alternative that matched, or -1  */
    char *fptr[10];   /* points to first matching character               */
    char *bptr[10];   /* points one place past last matching character    */
}
RxpMatch;

/* The indices of fptr and bptr are used as follows: 0 refers to the whole
   of the text that matched the expression, 1-9 refer to the embedded sub-
   strings which matched groups 1-9 within the pattern (if any). Unused
   locations will be NULL. Note that if an fptr and its bptr are equal, the
   match is empty: the regexp was able to match nothing, as "x*" does
   when applied to a string that does not start with an "x".
*/
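
/* Because each bptr points one place past the last matching character,
   the length of a match is simply bptr - fptr. A minimal sketch of
   copying one out follows, kept in a comment so that it adds nothing to
   this header; 'copy_capture' is a hypothetical name, not a library
   function:

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper, not part of the library: copies the characters
// in the half-open range [fptr, bptr) into a new null-terminated
// string. Returns NULL for an unused slot (either pointer NULL) or if
// allocation fails; an empty match yields "".
char *copy_capture(const char *fptr, const char *bptr)
{
    size_t len;
    char  *out;

    if (fptr == NULL || bptr == NULL) return NULL;

    len = (size_t)(bptr - fptr);
    out = malloc(len + 1);
    if (out == NULL) return NULL;

    memcpy(out, fptr, len);
    out[len] = '\0';
    return out;
}
```

   Given an RxpMatch 'm', copy_capture(m.fptr[1], m.bptr[1]) would yield
   the text of group 1, to be released with free() when done.
*/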

/*}}}   */

/*}}}   */
/*{{{  _________________________________________________________ PROTOTYPES ___ */

/*{{{  rxp_create_pattern() */
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

extern RxpError                                /* status code             */
rxp_create_pattern( char     *source,          /* regexp source text      */
		    RxpPatn  *patn,            /* parse tree root         */
		    char    **error_locus );   /* points (near) to errors */

/*  Scans the null terminated string 'source' and tries to compile a regexp
    pattern into a parse tree. If the result is RxpOK '*patn' will be left
    pointing to a valid tree's root node. If not, it will be (RxpPatn)0,
    '*error_locus' will point to (or just after) the offending part of
    'source' and all heap memory used will have been freed.

    See definition of RxpError above for possible errors. Which of these
    can, or cannot occur will depend on which features are enabled (see
    the 'rxp_configure' function below). Note that '*patn' must be new;
    if you re-use a tree from a previous call you must first call the
    following function on it...
*/
/*}}}   */
/*{{{  rxp_destroy_pattern() */
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

extern RxpError                         /* RxpOK or RxpCorrupt             */
rxp_destroy_pattern( RxpPatn *patn );   /* pointer got from create_pattern */

/*  Frees all memory allocated to a parse tree whose root is pointed to
    by 'patn', then sets the pointer to NULL. If '*patn' points to a
    well-formed parse tree, or is already NULL, returns RxpOK. If not,
    as much memory as possible is freed, and RxpCorrupt is returned.
    No other RxpError is ever returned; in particular RxpNullArg is NOT
    returned if '*patn' is NULL, making multiple calls on the same data
    safe and side-effect free.
*/
/*}}}   */
/*{{{  rxp_find_match() */
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

extern RxpError                            /* status code                 */
rxp_find_match( RxpPatn    patn,           /* pattern to use              */
		char      *text,           /* string to scan              */
		const int  sol,            /* '*text' is start-of-line    */
		RxpMatch  *result   );     /* place to store results      */

/*  Scans the null terminated string 'text' and tries to match the pattern
    contained in the parse tree whose root is pointed to by 'patn'. Returns:

      RxpOK       If a match is found. The fields of 'result' are filled in
		  with data about which characters fulfilled the match. See
		  the definition of RxpMatch above for details; in particular
		  (result->bptr[0]) will point to the place to start scanning
		  for another match within the remainder of 'text', if desired.

      RxpFail     If no match is found.

      RxpNullArg  If 'patn', 'text' or 'result' is NULL.

      RxpCorrupt  If the tree is broken.

    No other RxpError will ever be returned and this function never allocates
    any heap memory. The flag 'sol' is used if a start-of-line anchor appears
    in the regexp. It determines whether 'text' should be considered the
    logical start-of-line or not. If it is zero it prevents matching of the
    anchor even if '*text' would ordinarily have done so, as you would
    wish when re-scanning for further matches at or beyond (result->bptr[0]).
*/
/*}}}   */
/*{{{  rxp_create_replacement() */
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

extern RxpError                                    /* status code           */
rxp_create_replacement( char            *mask,     /* replacement mask      */
			const RxpMatch   data,     /* match data to use     */
			char           **result ); /* output string or NULL */

/* The replacement source 'mask' is scanned twice for substitution escapes.
   Using the strings pointed to by fields within 'data' on the first pass,
   the length of the result string is calculated. A buffer of this size is
   then allocated and the second pass copies the source, plus any captured
   strings (where appropriate) into it. The value returned depends, to some
   extent, upon which features are enabled (see the 'rxp_configure'
   function below). The full set of possibilities is:

     RxpOK          If successful; '*result' will be left pointing to the
		    output string (it will otherwise be NULL).

     RxpAllocFail   If the buffer memory allocation fails.

     RxpNullArg     If 'mask' or 'result' is NULL.

     RxpBadEscape   If 'mask' has an isolated trailing backslash.

     RxpBadRange    If slices are enabled and an invalid substitution slice
		    operation was seen in 'mask'.

     RxpBadHex      If C escapes are enabled and an invalid hexadecimal
		    escape sequence was seen in 'mask'.

     RxpBadCapture  If \1 - \9 were used but groups were disabled.

   No other RxpError is ever returned, and no memory apart from the output
   string (which may be simply passed to free()) is ever allocated.
*/

/*}}}   */
/*{{{  rxp_configure() */
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

extern int                              /* features enabled after call */
rxp_configure( int enable_mask,         /* features to activate        */
	       int disable_mask );      /* features to turn off        */

/*  Used to change the behaviour of library functions. Both the arguments
    and the return value are bit-masks, with a bit assigned to each
    feature, controlling whether that feature is enabled. The arguments
    are examined and all features whose bit is set in 'disable_mask' are
    turned off. Next, all features whose bit is set in 'enable_mask'
    are activated. Finally, the set of enabled features is returned.

    N.B:  These changes only take effect when a new pattern or
	  replacement is created.

	  By default, all features are enabled at start-up.

	  The only feature that cannot be turned off is escapes;
	  without them none of the others would work, and you might
	  as well use strcmp().

    The constants below assign bits to features. All features are
    independent; the first two affect both searches and replacements,
    the next block only affects searches and the final ones only
    affect replacements:
*/

#define RXP_C_ESCAPES     (0x0001)   /* allow \b, \t, \r, \n and \x<hh>     */
#define RXP_GROUPS        (0x0002)   /* allow (regexp) and hence \1 to \9   */

#define RXP_WILD_CHAR     (0x0004)   /* allow . to stand for any character  */
#define RXP_SETS          (0x0008)   /* allow [...] and [^...]              */
#define RXP_ALTERNATIVES  (0x0010)   /* allow regexp|regexp                 */
#define RXP_CLOSURES      (0x0020)   /* allow *, +, ?, \{n,m\} and variants */
#define RXP_ANCHORS       (0x0040)   /* allow ^regexp and regexp$           */
#define RXP_CASE_INSENS   (0x0080)   /* allow \< and \>                     */

#define RXP_MODIFIERS     (0x0100)   /* allow \[+-=_][0-9]                  */
#define RXP_SLICES        (0x0200)   /* allow \(n,m)[0-9] and variants      */
#define RXP_JUSTIFY       (0x0400)   /* allow \Lrepl., \Rrepl. and \Crepl.  */
#define RXP_REP_ALTS      (0x0800)   /* allow repl.|repl.|...               */

/*  Thus, all non-standard extensions to the syntax of egrep and sed may
    be disabled using:
			(void)rxp_configure(0x7f,0xfff);
*/
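
/*  The order in which the two masks are applied matters when a bit is set
    in both. A minimal model of the rule described above follows, kept in
    a comment so that it adds nothing to this header; 'configure_model' is
    a hypothetical name and is not how the library is necessarily written:

```c
// Hypothetical model, not the library's code: the feature set that
// results when 'disable_mask' is applied first and 'enable_mask' second.
int configure_model(int current, int enable_mask, int disable_mask)
{
    current &= ~disable_mask;   // turn off the requested features
    current |= enable_mask;     // then activate the requested features
    return current;
}
```

    Starting from the default of 0xfff (everything enabled), the masks
    0x7f and 0xfff leave 0x7f: RXP_C_ESCAPES through RXP_ANCHORS remain
    on, while RXP_CASE_INSENS and the four replacement extensions are
    off. A bit set in both masks ends up enabled.
*/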
/*}}}   */

/*}}}   */

#endif /* ndef RXPLIB_INCLUDED */
