Diffstat (limited to 'Utilities/PCRE/man/html/pcrepattern.3.html')
-rw-r--r-- Utilities/PCRE/man/html/pcrepattern.3.html | 1268
1 files changed, 1268 insertions, 0 deletions
diff --git a/Utilities/PCRE/man/html/pcrepattern.3.html b/Utilities/PCRE/man/html/pcrepattern.3.html
new file mode 100644
index 0000000..11bb198
--- /dev/null
+++ b/Utilities/PCRE/man/html/pcrepattern.3.html
@@ -0,0 +1,1268 @@
+<!-- manual page source format generated by PolyglotMan v3.2, -->
+<!-- available at http://polyglotman.sourceforge.net/ -->
+
+<html>
+<head>
+<title>PCRE(3) manual page</title>
+</head>
+<body bgcolor='white'>
+<a href='#toc'>Table of Contents</a><p>
+
+<h2><a name='sect0' href='#toc0'>Name</a></h2>
+PCRE - Perl-compatible regular expressions
+<h2><a name='sect1' href='#toc1'>PCRE Regular Expression Details</a></h2>
+
+<p>
+The syntax and semantics of the regular expressions supported by PCRE are
+described below. Regular expressions are also described in the Perl documentation
+and in a number of books, some of which have copious examples. Jeffrey Friedl&rsquo;s
+"Mastering Regular Expressions", published by O&rsquo;Reilly, covers regular expressions
+in great detail. This description of PCRE&rsquo;s regular expressions is intended
+as reference material. <p>
+The original operation of PCRE was on strings of
+one-byte characters. However, there is now also support for UTF-8 character
+strings. To use this, you must build PCRE to include UTF-8 support, and then
+call <b>pcre_compile()</b> with the PCRE_UTF8 option. How this affects pattern
+matching is mentioned in several places below. There is also a summary of
+UTF-8 features in the section on UTF-8 support in the main <b>pcre</b> page.
+<p>
+A regular expression is a pattern that is matched against a subject string
+from left to right. Most characters stand for themselves in a pattern, and
+match the corresponding characters in the subject. As a trivial example,
+the pattern <p>
+ The quick brown fox<br>
+ <p>
+matches a portion of a subject string that is identical to itself. The
+power of regular expressions comes from the ability to include alternatives
+and repetitions in the pattern. These are encoded in the pattern by the
+use of <i>metacharacters</i>, which do not stand for themselves but instead are
+interpreted in some special way. <p>
+There are two different sets of metacharacters:
+those that are recognized anywhere in the pattern except within square
+brackets, and those that are recognized in square brackets. Outside square
+brackets, the metacharacters are as follows: <p>
+ \ general escape character with several uses<br>
+ ^ assert start of string (or line, in multiline mode)<br>
+ $ assert end of string (or line, in multiline mode)<br>
+ . match any character except newline (by default)<br>
+ [ start character class definition<br>
+ | start of alternative branch<br>
+ ( start subpattern<br>
+ ) end subpattern<br>
+ ? extends the meaning of (<br>
+ also 0 or 1 quantifier<br>
+ also quantifier minimizer<br>
+ * 0 or more quantifier<br>
+ + 1 or more quantifier<br>
+ also "possessive quantifier"<br>
+ { start min/max quantifier<br>
+ <p>
+Part of a pattern that is in square brackets is called a "character class".
+In a character class the only metacharacters are: <p>
+ \ general escape character<br>
+ ^ negate the class, but only if the first character<br>
+ - indicates character range<br>
+ [ POSIX character class (only if followed by POSIX<br>
+ syntax)<br>
+ ] terminates the character class<br>
+ <p>
+The following sections describe the use of each of the metacharacters.
+
+<h2><a name='sect2' href='#toc2'>Backslash</a></h2>
+ <p>
+The backslash character has several uses. Firstly, if it is followed
+by a non-alphanumeric character, it takes away any special meaning that
+character may have. This use of backslash as an escape character applies
+both inside and outside character classes. <p>
+For example, if you want to match
+a * character, you write \* in the pattern. This escaping action applies
+whether or not the following character would otherwise be interpreted as
+a metacharacter, so it is always safe to precede a non-alphanumeric with
+backslash to specify that it stands for itself. In particular, if you want
+to match a backslash, you write \\. <p>
+If a pattern is compiled with the PCRE_EXTENDED
+option, whitespace in the pattern (other than in a character class) and
+characters between a # outside a character class and the next newline character
+are ignored. An escaping backslash can be used to include a whitespace or
+# character as part of the pattern. <p>
+If you want to remove the special meaning
+from a sequence of characters, you can do so by putting them between \Q
+and \E. This is different from Perl in that $ and @ are handled as literals
+in \Q...\E sequences in PCRE, whereas in Perl, $ and @ cause variable interpolation.
+Note the following examples: <p>
+ Pattern PCRE matches Perl matches<br>
+ <p>
+ \Qabc$xyz\E abc$xyz abc followed by the<br>
+ contents of $xyz<br>
+ \Qabc\$xyz\E abc\$xyz abc\$xyz<br>
+ \Qabc\E\$\Qxyz\E abc$xyz abc$xyz<br>
+ <p>
+The \Q...\E sequence is recognized both inside and outside character classes.
+
+<h3><a name='sect3' href='#toc3'>Non-printing characters</a></h3>
+ <p>
+A second use of backslash provides a way of encoding
+non-printing characters in patterns in a visible manner. There is no restriction
+on the appearance of non-printing characters, apart from the binary zero
+that terminates a pattern, but when a pattern is being prepared by text
+editing, it is usually easier to use one of the following escape sequences
+than the binary character it represents: <p>
+ \a alarm, that is, the BEL character (hex 07)<br>
+ \cx "control-x", where x is any character<br>
+ \e escape (hex 1B)<br>
+ \f formfeed (hex 0C)<br>
+ \n newline (hex 0A)<br>
+ \r carriage return (hex 0D)<br>
+ \t tab (hex 09)<br>
+ \ddd character with octal code ddd, or backreference<br>
+ \xhh character with hex code hh<br>
+ \x{hhh..} character with hex code hhh... (UTF-8 mode only)<br>
+ <p>
+The precise effect of \cx is as follows: if x is a lower case letter, it
+is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
+Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
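+<p>
+The arithmetic of the \cx rule above can be sketched as follows. This is an
+illustrative Python model of the two steps just described, not PCRE source
+code; the helper name control_escape is hypothetical.

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx denotes, per the rule described above."""
    if len(x) != 1:
        raise ValueError("expected exactly one character")
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 (hex 40) of the character code is inverted.
    return ord(x) ^ 0x40


for ch in ("z", "{", ";"):
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```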
+<p>
+After \x, from zero to two hexadecimal digits are read (letters can be in
+upper or lower case). In UTF-8 mode, any number of hexadecimal digits may
+appear between \x{ and }, but the value of the character code must be less
+than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters
+other than hexadecimal digits appear between \x{ and }, or if there is no
+terminating }, this form of escape is not recognized. Instead, the initial
+\x will be interpreted as a basic hexadecimal escape, with no following
+digits, giving a character whose value is zero. <p>
+Characters whose value is
+less than 256 can be defined by either of the two syntaxes for \x when PCRE
+is in UTF-8 mode. There is no difference in the way they are handled. For
+example, \xdc is exactly the same as \x{dc}. <p>
+After \0 up to two further octal
+digits are read. In both cases, if there are fewer than two digits, just
+those that are present are used. Thus the sequence \0\x\07 specifies two binary
+zeros followed by a BEL character (code value 7). Make sure you supply two
+digits after the initial zero if the pattern character that follows is
+itself an octal digit. <p>
+The handling of a backslash followed by a digit other
+than 0 is complicated. Outside a character class, PCRE reads it and any
+following digits as a decimal number. If the number is less than 10, or
+if there have been at least that many previous capturing left parentheses
+in the expression, the entire sequence is taken as a <i>back reference</i>. A description
+of how this works is given later, following the discussion of parenthesized
+subpatterns. <p>
+Inside a character class, or if the decimal number is greater
+than 9 and there have not been that many capturing subpatterns, PCRE re-reads
+up to three octal digits following the backslash, and generates a single
+byte from the least significant 8 bits of the value. Any subsequent digits
+stand for themselves. For example: <p>
+ \040 is another way of writing a space<br>
+ \40 is the same, provided there are fewer than 40<br>
+ previous capturing subpatterns<br>
+ \7 is always a back reference<br>
+ \11 might be a back reference, or another way of<br>
+ writing a tab<br>
+ \011 is always a tab<br>
+ \0113 is a tab followed by the character "3"<br>
+ \113 might be a back reference, otherwise the<br>
+ character with octal code 113<br>
+ \377 might be a back reference, otherwise<br>
+ the byte consisting entirely of 1 bits<br>
+ \81 is either a back reference, or a binary zero<br>
+ followed by the two characters "8" and "1"<br>
+ <p>
+Note that octal values of 100 or greater must not be introduced by a leading
+zero, because no more than three octal digits are ever read. <p>
+All the sequences
+that define a single byte value or a single UTF-8 character (in UTF-8 mode)
+can be used both inside and outside character classes. In addition, inside
+a character class, the sequence \b is interpreted as the backspace character
+(hex 08), and the sequence \X is interpreted as the character "X". Outside
+a character class, these sequences have different meanings (see below).
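+<p>
+The rules above for a backslash followed by digits, outside a character class,
+can be modelled roughly as below. This is a hedged Python sketch of the
+documented decision rule only; it is not PCRE's parser. The function name
+classify_digit_escape and its return format are hypothetical, and it assumes
+the caller passes the whole run of digits that follows the backslash.

```python
def classify_digit_escape(digits: str, groups_so_far: int):
    """Classify a backslash followed by `digits`, outside a character class.

    Returns ("backref", n, rest) or ("byte", value, rest), where `rest`
    holds any trailing digits that stand for themselves.
    """
    if not digits or not digits.isdigit():
        raise ValueError("expected one or more decimal digits")

    if digits[0] == "0":
        # \0 is followed by up to two further octal digits.
        end = 1
        while end < min(len(digits), 3) and digits[end] in "01234567":
            end += 1
        return ("byte", int(digits[:end], 8), digits[end:])

    number = int(digits)
    if number < 10 or number <= groups_so_far:
        # Small numbers, or numbers covered by earlier capturing groups.
        return ("backref", number, "")

    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    end = 0
    while end < min(len(digits), 3) and digits[end] in "01234567":
        end += 1
    value = (int(digits[:end], 8) & 0xFF) if end else 0
    return ("byte", value, digits[end:])


for text, groups in [("040", 0), ("40", 0), ("7", 0), ("11", 0), ("11", 11),
                     ("0113", 0), ("377", 0), ("81", 0)]:
    print(f"\\{text} with {groups} groups:", classify_digit_escape(text, groups))
```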
+
+<h3><a name='sect4' href='#toc4'>Generic character types</a></h3>
+ <p>
+The third use of backslash is for specifying
+generic character types. The following are always recognized: <p>
+ \d any decimal digit<br>
+ \D any character that is not a decimal digit<br>
+ \s any whitespace character<br>
+ \S any character that is not a whitespace character<br>
+ \w any "word" character<br>
+ \W any "non-word" character<br>
+ <p>
+Each pair of escape sequences partitions the complete set of characters
+into two disjoint sets. Any given character matches one, and only one, of
+each pair. <p>
+These character type sequences can appear both inside and outside
+character classes. They each match one character of the appropriate type.
+If the current matching point is at the end of the subject string, all
+of them fail, since there is no character to match. <p>
+For compatibility with
+Perl, \s does not match the VT character (code 11). This makes it different
+from the POSIX "space" class. The \s characters are HT (9), LF (10),
+FF (12), CR (13), and space (32). <p>
+A "word" character is an underscore or
+any character less than 256 that is a letter or digit. The definition of
+letters and digits is controlled by PCRE&rsquo;s low-valued character tables, and
+may vary if locale-specific matching is taking place (see "Locale support"
+ in the <b>pcreapi</b> page). For example, in the "fr_FR" (French) locale, some
+character codes greater than 128 are used for accented letters, and these
+are matched by \w. <p>
+In UTF-8 mode, characters with values greater than 128
+never match \d, \s, or \w, and always match \D, \S, and \W. This is true even
+when Unicode character property support is available.
+<h3><a name='sect5' href='#toc5'>Unicode character
+properties</a></h3>
+ <p>
+When PCRE is built with Unicode character property support,
+three additional escape sequences to match generic character types are
+available when UTF-8 mode is selected. They are: <p>
+ \p{<i>xx</i>} a character with the <i>xx</i> property<br>
+ \P{<i>xx</i>} a character without the <i>xx</i> property<br>
+ \X an extended Unicode sequence<br>
+ <p>
+The property names represented by <i>xx</i> above are limited to the Unicode
+general category properties. Each character has exactly one such property,
+specified by a two-letter abbreviation. For compatibility with Perl, negation
+can be specified by including a circumflex between the opening brace and
+the property name. For example, \p{^Lu} is the same as \P{Lu}. <p>
+If only one letter
+is specified with \p or \P, it includes all the properties that start with
+that letter. In this case, in the absence of negation, the curly brackets
+in the escape sequence are optional; these two examples have the same effect:
+<p>
+ \p{L}<br>
+ \pL<br>
+ <p>
+The following property codes are supported: <p>
+ C Other<br>
+ Cc Control<br>
+ Cf Format<br>
+ Cn Unassigned<br>
+ Co Private use<br>
+ Cs Surrogate<br>
+ <p>
+ L Letter<br>
+ Ll Lower case letter<br>
+ Lm Modifier letter<br>
+ Lo Other letter<br>
+ Lt Title case letter<br>
+ Lu Upper case letter<br>
+ <p>
+ M Mark<br>
+ Mc Spacing mark<br>
+ Me Enclosing mark<br>
+ Mn Non-spacing mark<br>
+ <p>
+ N Number<br>
+ Nd Decimal number<br>
+ Nl Letter number<br>
+ No Other number<br>
+ <p>
+ P Punctuation<br>
+ Pc Connector punctuation<br>
+ Pd Dash punctuation<br>
+ Pe Close punctuation<br>
+ Pf Final punctuation<br>
+ Pi Initial punctuation<br>
+ Po Other punctuation<br>
+ Ps Open punctuation<br>
+ <p>
+ S Symbol<br>
+ Sc Currency symbol<br>
+ Sk Modifier symbol<br>
+ Sm Mathematical symbol<br>
+ So Other symbol<br>
+ <p>
+ Z Separator<br>
+ Zl Line separator<br>
+ Zp Paragraph separator<br>
+ Zs Space separator<br>
+ <p>
+Extended properties such as "Greek" or "InMusicalSymbols" are not supported
+by PCRE. <p>
+Specifying caseless matching does not affect these escape sequences.
+For example, \p{Lu} always matches only upper case letters. <p>
+The \X escape
+matches any number of Unicode characters that form an extended Unicode
+sequence. \X is equivalent to <p>
+ (?&gt;\PM\pM*)<br>
+ <p>
+That is, it matches a character without the "mark" property, followed
+by zero or more characters with the "mark" property, and treats the sequence
+as an atomic group (see below). Characters with the "mark" property are
+typically accents that affect the preceding character. <p>
+Matching characters
+by Unicode property is not fast, because PCRE has to search a structure
+that contains data for over fifteen thousand characters. That is why the
+traditional escape sequences such as \d and \w do not use Unicode properties
+in PCRE.
+<h3><a name='sect6' href='#toc6'>Simple assertions</a></h3>
+ <p>
+The fourth use of backslash is for certain
+simple assertions. An assertion specifies a condition that has to be met
+at a particular point in a match, without consuming any characters from
+the subject string. The use of subpatterns for more complicated assertions
+is described below. The backslashed assertions are: <p>
+ \b matches at a word boundary<br>
+ \B matches when not at a word boundary<br>
+ \A matches at start of subject<br>
+ \Z matches at end of subject or before newline at end<br>
+ \z matches at end of subject<br>
+ \G matches at first matching position in subject<br>
+ <p>
+These assertions may not appear in character classes (but note that \b
+has a different meaning, namely the backspace character, inside a character
+class). <p>
+A word boundary is a position in the subject string where the current
+character and the previous character do not both match \w or \W (i.e. one matches
+\w and the other matches \W), or the start or end of the string if the first
+or last character matches \w, respectively. <p>
+The \A, \Z, and \z assertions differ
+from the traditional circumflex and dollar (described in the next section)
+in that they only ever match at the very start and end of the subject string,
+whatever options are set. Thus, they are independent of multiline mode. These
+three assertions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options,
+which affect only the behaviour of the circumflex and dollar metacharacters.
+However, if the <i>startoffset</i> argument of <b>pcre_exec()</b> is non-zero, indicating
+that matching is to start at a point other than the beginning of the subject,
+\A can never match. The difference between \Z and \z is that \Z matches before
+a newline that is the last character of the string as well as at the end
+of the string, whereas \z matches only at the end. <p>
+The \G assertion is true
+only when the current matching position is at the start point of the match,
+as specified by the <i>startoffset</i> argument of <b>pcre_exec()</b>. It differs from
+\A when the value of <i>startoffset</i> is non-zero. By calling <b>pcre_exec()</b> multiple
+times with appropriate arguments, you can mimic Perl&rsquo;s /g option, and it
+is in this kind of implementation where \G can be useful. <p>
+Note, however,
+that PCRE&rsquo;s interpretation of \G, as the start of the current match, is subtly
+different from Perl&rsquo;s, which defines it as the end of the previous match.
+In Perl, these can be different when the previously matched string was
+empty. Because PCRE does just one match at a time, it cannot reproduce this
+behaviour. <p>
+If all the alternatives of a pattern begin with \G, the expression
+is anchored to the starting match position, and the "anchored" flag is
+set in the compiled regular expression.
+<h2><a name='sect7' href='#toc7'>Circumflex and Dollar</a></h2>
+ <p>
+Outside
+a character class, in the default matching mode, the circumflex character
+is an assertion that is true only if the current matching point is at the
+start of the subject string. If the <i>startoffset</i> argument of <b>pcre_exec()</b>
+is non-zero, circumflex can never match if the PCRE_MULTILINE option is
+unset. Inside a character class, circumflex has an entirely different meaning
+ (see below). <p>
+Circumflex need not be the first character of the pattern
+if a number of alternatives are involved, but it should be the first thing
+in each alternative in which it appears if the pattern is ever to match
+that branch. If all possible alternatives start with a circumflex, that
+is, if the pattern is constrained to match only at the start of the subject,
+it is said to be an "anchored" pattern. (There are also other constructs
+that can cause a pattern to be anchored.) <p>
+A dollar character is an assertion
+that is true only if the current matching point is at the end of the subject
+string, or immediately before a newline character that is the last character
+in the string (by default). Dollar need not be the last character of the
+pattern if a number of alternatives are involved, but it should be the
+last item in any branch in which it appears. Dollar has no special meaning
+in a character class. <p>
+The meaning of dollar can be changed so that it matches
+only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY
+option at compile time. This does not affect the \Z assertion. <p>
+The meanings
+of the circumflex and dollar characters are changed if the PCRE_MULTILINE
+option is set. When this is the case, they match immediately after and immediately
+before an internal newline character, respectively, in addition to matching
+at the start and end of the subject string. For example, the pattern /^abc$/
+matches the subject string "def\nabc" (where \n represents a newline character)
+in multiline mode, but not otherwise. Consequently, patterns that are anchored
+in single line mode because all branches start with ^ are not anchored in
+multiline mode, and a match for circumflex is possible when the <i>startoffset</i>
+argument of <b>pcre_exec()</b> is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored
+if PCRE_MULTILINE is set. <p>
+Note that the sequences \A, \Z, and \z can be used
+to match the start and end of the subject in both modes, and if all branches
+of a pattern start with \A it is always anchored, whether PCRE_MULTILINE
+is set or not.
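+<p>
+The effect of multiline mode on circumflex and dollar can be tried with the
+example subject above. The sketch below uses Python's re module as a stand-in,
+on the assumption that it behaves like PCRE for ^, $, and \A in this case;
+re.MULTILINE plays the role of PCRE_MULTILINE. It does not call PCRE itself.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ and $ also match after and before an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# \A is unaffected by multiline mode, so this pattern stays anchored.
anchored = re.search(r"\Aabc$", subject, re.MULTILINE)

print(single)
print(multi.group(), multi.start())
print(anchored)
```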
+<h2><a name='sect8' href='#toc8'>Full Stop (period, Dot)</a></h2>
+ <p>
+Outside a character class, a dot
+in the pattern matches any one character in the subject, including a non-printing
+character, but not (by default) newline. In UTF-8 mode, a dot matches any
+UTF-8 character, which might be more than one byte long, except (by default)
+newline. If the PCRE_DOTALL option is set, dots match newlines as well. The
+handling of dot is entirely independent of the handling of circumflex and
+dollar, the only relationship being that they both involve newline characters.
+Dot has no special meaning in a character class.
+<h2><a name='sect9' href='#toc9'>Matching a Single Byte</a></h2>
+
+<p>
+Outside a character class, the escape sequence \C matches any one byte,
+both in and out of UTF-8 mode. Unlike a dot, it can match a newline. The feature
+is provided in Perl in order to match individual bytes in UTF-8 mode. Because
+it breaks up UTF-8 characters into individual bytes, what remains in the
+string may be a malformed UTF-8 string. For this reason, the \C escape sequence
+is best avoided. <p>
+PCRE does not allow \C to appear in lookbehind assertions
+ (described below), because in UTF-8 mode this would make it impossible
+to calculate the length of the lookbehind.
+<h2><a name='sect10' href='#toc10'>Square Brackets and Character
+Classes</a></h2>
+ <p>
+An opening square bracket introduces a character class, terminated
+by a closing square bracket. A closing square bracket on its own is not
+special. If a closing square bracket is required as a member of the class,
+it should be the first data character in the class (after an initial circumflex,
+if present) or escaped with a backslash. <p>
+A character class matches a single
+character in the subject. In UTF-8 mode, the character may occupy more than
+one byte. A matched character must be in the set of characters defined by
+the class, unless the first character in the class definition is a circumflex,
+in which case the subject character must not be in the set defined by the
+class. If a circumflex is actually required as a member of the class, ensure
+it is not the first character, or escape it with a backslash. <p>
+For example,
+the character class [aeiou] matches any lower case vowel, while [^aeiou]
+matches any character that is not a lower case vowel. Note that a circumflex
+is just a convenient notation for specifying the characters that are in
+the class by enumerating those that are not. A class that starts with a
+circumflex is not an assertion: it still consumes a character from the
+subject string, and therefore it fails if the current pointer is at the
+end of the string. <p>
+In UTF-8 mode, characters with values greater than 255
+can be included in a class as a literal string of bytes, or by using the
+\x{ escaping mechanism. <p>
+When caseless matching is set, any letters in a class
+represent both their upper case and lower case versions, so for example,
+a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does
+not match "A", whereas a caseful version would. When running in UTF-8 mode,
+PCRE supports the concept of case for characters with values greater than
+128 only when it is compiled with Unicode property support. <p>
+The newline
+character is never treated in any special way in character classes, whatever
+the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class such
+as [^a] will always match a newline. <p>
+The minus (hyphen) character can be
+used to specify a range of characters in a character class. For example,
+[d-m] matches any letter between d and m, inclusive. If a minus character
+is required in a class, it must be escaped with a backslash or appear in
+a position where it cannot be interpreted as indicating a range, typically
+as the first or last character in the class. <p>
+It is not possible to have
+the literal character "]" as the end character of a range. A pattern such
+as [W-]46] is interpreted as a class of two characters ("W" and "-") followed
+by a literal string "46]", so it would match "W46]" or "-46]". However, if
+the "]" is escaped with a backslash it is interpreted as the end of range,
+so [W-\]46] is interpreted as a class containing a range followed by two
+other characters. The octal or hexadecimal representation of "]" can also
+be used to end a range. <p>
+Ranges operate in the collating sequence of character
+values. They can also be used for characters specified numerically, for
+example [\000-\037]. In UTF-8 mode, ranges can include characters whose values
+are greater than 255, for example [\x{100}-\x{2ff}]. <p>
+If a range that includes
+letters is used when caseless matching is set, it matches the letters in
+either case. For example, [W-c] is equivalent to [][\\^_&#96;wxyzabc], matched caselessly,
+and in non-UTF-8 mode, if character tables for the "fr_FR" locale are in
+use, [\xc8-\xcb] matches accented E characters in both cases. In UTF-8 mode,
+PCRE supports the concept of case for characters with values greater than
+128 only when it is compiled with Unicode property support. <p>
+The character
+types \d, \D, \p, \P, \s, \S, \w, and \W may also appear in a character class,
+and add the characters that they match to the class. For example, [\dABCDEF]
+matches any hexadecimal digit. A circumflex can conveniently be used with
+the upper case character types to specify a more restricted set of characters
+than the matching lower case type. For example, the class [^\W_] matches any
+letter or digit, but not underscore. <p>
+The only metacharacters that are recognized
+in character classes are backslash, hyphen (only where it can be interpreted
+as specifying a range), circumflex (only at the start), opening square
+bracket (only when it can be interpreted as introducing a POSIX class name
+- see the next section), and the terminating closing square bracket. However,
+escaping other non-alphanumeric characters does no harm.
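+<p>
+A few of the classes discussed above can be checked with a short script. This
+is a sketch using Python's re module rather than PCRE, on the assumption that
+these particular classes behave the same way; re.ASCII is used so that \d and
+\w stay limited to ASCII characters.

```python
import re

# Upper case hexadecimal digits: \d adds 0-9 to the class.
hex_upper = re.compile(r"[\dABCDEF]", re.ASCII)

# Negating \W and underscore leaves only letters and digits.
alnum_only = re.compile(r"[^\W_]", re.ASCII)

# A negated class still consumes one character, and newline is not special.
not_a = re.compile(r"[^a]")

hex_hits = [c for c in "09AFaf_g" if hex_upper.fullmatch(c)]
alnum_hits = [c for c in "aZ5_- " if alnum_only.fullmatch(c)]
newline_ok = not_a.fullmatch("\n") is not None
empty_ok = not_a.fullmatch("") is not None

print(hex_hits, alnum_hits, newline_ok, empty_ok)
```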
+<h2><a name='sect11' href='#toc11'>POSIX Character Classes</a></h2>
+ <p>
+Perl supports the POSIX notation for character classes. This uses
+names enclosed by [: and :] within the enclosing square brackets. PCRE also
+supports this notation. For example, <p>
+ [01[:alpha:]%]<br>
+ <p>
+matches "0", "1", any alphabetic character, or "%". The supported class
+names are <p>
+ alnum letters and digits<br>
+ alpha letters<br>
+ ascii character codes 0 - 127<br>
+ blank space or tab only<br>
+ cntrl control characters<br>
+ digit decimal digits (same as \d)<br>
+ graph printing characters, excluding space<br>
+ lower lower case letters<br>
+ print printing characters, including space<br>
+ punct printing characters, excluding letters and digits<br>
+ space white space (not quite the same as \s)<br>
+ upper upper case letters<br>
+ word "word" characters (same as \w)<br>
+ xdigit hexadecimal digits<br>
+ <p>
+The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
+and space (32). Notice that this list includes the VT character (code 11).
+This makes "space" different to \s, which does not include VT (for Perl
+compatibility). <p>
+The name "word" is a Perl extension, and "blank" is a GNU
+extension from Perl 5.8. Another Perl extension is negation, which is indicated
+by a ^ character after the colon. For example, <p>
+ [12[:^digit:]]<br>
+ <p>
+matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
+syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are
+not supported, and an error is given if they are encountered. <p>
+In UTF-8 mode,
+characters with values greater than 128 do not match any of the POSIX character
+classes.
+<h2><a name='sect12' href='#toc12'>Vertical Bar</a></h2>
+ <p>
+Vertical bar characters are used to separate alternative
+patterns. For example, the pattern <p>
+ gilbert|sullivan<br>
+ <p>
+matches either "gilbert" or "sullivan". Any number of alternatives may
+appear, and an empty alternative is permitted (matching the empty string).
+The matching process tries each alternative in turn, from left to right,
+and the first one that succeeds is used. If the alternatives are within
+a subpattern (defined below), "succeeds" means matching the rest of
+the main pattern as well as the alternative in the subpattern.
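+<p>
+The left-to-right ordering of alternatives can be observed with a small
+experiment. The sketch below uses Python's re module, which is assumed to
+follow the same first-successful-alternative rule; the patterns other than
+gilbert|sullivan are made up for illustration.

```python
import re

# Either alternative may match.
either = re.search(r"gilbert|sullivan", "arthur sullivan").group()

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.search(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the pattern:
# "a" alone cannot be followed by "c" here, so the second alternative is used.
rest_matters = re.fullmatch(r"(a|ab)c", "abc").group(1)

print(either, first_wins, rest_matters)
```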
+<h2><a name='sect13' href='#toc13'>Internal
+Option Setting</a></h2>
+ <p>
+The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
+and PCRE_EXTENDED options can be changed from within the pattern by a sequence
+of Perl option letters enclosed between "(?" and ")". The option letters
+are <p>
+ i for PCRE_CASELESS<br>
+ m for PCRE_MULTILINE<br>
+ s for PCRE_DOTALL<br>
+ x for PCRE_EXTENDED<br>
+ <p>
+For example, (?im) sets caseless, multiline matching. It is also possible
+to unset these options by preceding the letter with a hyphen, and a combined
+setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE
+while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. If a letter
+appears both before and after the hyphen, the option is unset. <p>
+When an option
+change occurs at top level (that is, not inside subpattern parentheses),
+the change applies to the remainder of the pattern that follows. If the
+change is placed right at the start of a pattern, PCRE extracts it into
+the global options (and it will therefore show up in data extracted by
+the <b>pcre_fullinfo()</b> function). <p>
+An option change within a subpattern affects
+only that part of the current pattern that follows it, so <p>
+ (a(?i)b)c<br>
+ <p>
+matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
+used). By this means, options can be made to have different settings in
+different parts of the pattern. Any changes made in one alternative do carry
+on into subsequent branches within the same subpattern. For example, <p>
+ (a(?i)b|c)<br>
+ <p>
+matches "ab", "aB", "c", and "C", even though when matching "C" the first
+branch is abandoned before the option setting. This is because the effects
+of option settings happen at compile time. There would be some very weird
+behaviour otherwise. <p>
+The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
+can be changed in the same way as the Perl-compatible options by using the
+characters U and X respectively. The (?X) flag setting is special in that
+it must always occur earlier in the pattern than any of the additional
+features it turns on, even when it is at top level. It is best to put it
+at the start.
+<h2><a name='sect14' href='#toc14'>Subpatterns</a></h2>
+ <p>
+Subpatterns are delimited by parentheses (round
+brackets), which can be nested. Turning part of a pattern into a subpattern
+does two things: <p>
+1. It localizes a set of alternatives. For example, the
+pattern <p>
+ cat(aract|erpillar|)<br>
+ <p>
+matches one of the words "cat", "cataract", or "caterpillar". Without the
+parentheses, it would match "cataract", "erpillar" or the empty string.
+<p>
+2. It sets up the subpattern as a capturing subpattern. This means that,
+when the whole pattern matches, that portion of the subject string that
+matched the subpattern is passed back to the caller via the <i>ovector</i> argument
+of <b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
+from 1) to obtain numbers for the capturing subpatterns. <p>
+For example, if
+the string "the red king" is matched against the pattern <p>
+ the ((red|white) (king|queen))<br>
+ <p>
+the captured substrings are "red king", "red", and "king", and are numbered
+1, 2, and 3, respectively. <p>
+The fact that plain parentheses fulfil two functions
+is not always helpful. There are often times when a grouping subpattern
+is required without a capturing requirement. If an opening parenthesis is
+followed by a question mark and a colon, the subpattern does not do any
+capturing, and is not counted when computing the number of any subsequent
+capturing subpatterns. For example, if the string "the white queen" is matched
+against the pattern <p>
+ the ((?:red|white) (king|queen))<br>
+ <p>
+the captured substrings are "white queen" and "queen", and are numbered
+1 and 2. The maximum number of capturing subpatterns is 65535, and the maximum
+depth of nesting of all subpatterns, both capturing and non-capturing, is
+200. <p>
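+The numbering rules above can be sketched outside PCRE. This is a minimal
+illustration that assumes Python&rsquo;s <b>re</b> module as a stand-in; it is not
+the PCRE API, but it counts opening parentheses in the same way. The variable
+names are invented for the example.

```python
import re

# Capturing groups are numbered by their opening parenthesis, left to right.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
all_groups = capturing.groups()

# (?: ... ) groups without capturing, so the following group moves up to 2.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")
fewer_groups = non_capturing.groups()

# The empty third alternative lets the pattern match plain "cat" as well.
plain_cat = re.fullmatch(r"cat(aract|erpillar|)", "cat") is not None

print(all_groups)
print(fewer_groups)
print(plain_cat)
```

+<p>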
+As a convenient shorthand, if any option settings are required at the
+start of a non-capturing subpattern, the option letters may appear between
+the "?" and the ":". Thus the two patterns <p>
+ (?i:saturday|sunday)<br>
+ (?:(?i)saturday|sunday)<br>
+ <p>
+match exactly the same set of strings. Because alternative branches are
+tried from left to right, and options are not reset until the end of the
+subpattern is reached, an option setting in one branch does affect subsequent
+branches, so the above patterns match "SUNDAY" as well as "Saturday".
+
+<h2><a name='sect15' href='#toc15'>Named Subpatterns</a></h2>
+ <p>
+Identifying capturing parentheses by number is simple,
+but it can be very hard to keep track of the numbers in complicated regular
+expressions. Furthermore, if an expression is modified, the numbers may
+change. To help with this difficulty, PCRE supports the naming of subpatterns,
+something that Perl does not provide. The Python syntax (?P&lt;name&gt;...) is used.
+Names consist of alphanumeric characters and underscores, and must be unique
+within a pattern. <p>
+Named capturing parentheses are still allocated numbers
+as well as names. The PCRE API provides function calls for extracting the
+name-to-number translation table from a compiled pattern. There is also a
+convenience function for extracting a captured substring by name. For further
+details see the <b>pcreapi</b> documentation.
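+<p>
+As a hedged sketch of the idea, the same (?P&lt;name&gt;...) syntax can be tried in
+Python&rsquo;s <b>re</b> module, from which PCRE borrowed it. The pattern, the subject
+string and the variable names below are assumptions made for illustration;
+<b>groupindex</b> is Python&rsquo;s counterpart of a name-to-number table, not a PCRE
+function.

```python
import re

date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = date_pattern.fullmatch("2004-09")

# A named group can be read by name or by its ordinary number.
by_name = match.group("year")
by_number = match.group(1)

# The compiled pattern exposes the name-to-number mapping.
name_to_number = dict(date_pattern.groupindex)

print(by_name, by_number, name_to_number)
```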
+<h2><a name='sect16' href='#toc16'>Repetition</a></h2>
+ <p>
+Repetition is specified
+by quantifiers, which can follow any of the following items: <p>
+ a literal data character<br>
+ the . metacharacter<br>
+ the \C escape sequence<br>
+ the \X escape sequence (in UTF-8 mode with Unicode properties)<br>
+ an escape such as \d that matches a single character<br>
+ a character class<br>
+ a back reference (see next section)<br>
+ a parenthesized subpattern (unless it is an assertion)<br>
+ <p>
+The general repetition quantifier specifies a minimum and maximum number
+of permitted matches, by giving the two numbers in curly brackets (braces),
+separated by a comma. The numbers must be less than 65536, and the first
+must be less than or equal to the second. For example: <p>
+ z{2,4}<br>
+ <p>
+matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
+character. If the second number is omitted, but the comma is present, there
+is no upper limit; if the second number and the comma are both omitted,
+the quantifier specifies an exact number of required matches. Thus <p>
+ [aeiou]{3,}<br>
+ <p>
+matches at least 3 successive vowels, but may match many more, while <p>
+
+ \d{8}<br>
+ <p>
+matches exactly 8 digits. An opening curly bracket that appears in a position
+where a quantifier is not allowed, or one that does not match the syntax
+of a quantifier, is taken as a literal character. For example, {,6} is not
+a quantifier, but a literal string of four characters. <p>
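+The three quantifier forms above can be checked with a short sketch. It assumes
+Python&rsquo;s <b>re</b> module as a stand-in for PCRE, and the helper <b>whole_match</b>
+is a hypothetical name. Python, unlike PCRE as described here, reads {,6} as
+a quantifier, so that case is deliberately left out.

```python
import re

def whole_match(pattern, subject):
    """Return True when the pattern matches the entire subject string."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
bounded = [whole_match(r"z{2,4}", s) for s in ("z", "zz", "zzzz", "zzzzz")]

# {3,}: at least three, with no upper limit.
open_ended = whole_match(r"[aeiou]{3,}", "aeiouaeiou")

# {8}: exactly eight.
exact = [whole_match(r"\d{8}", s) for s in ("12345678", "1234567")]

print(bounded, open_ended, exact)
```

+<p>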
+In UTF-8 mode, quantifiers
+apply to UTF-8 characters rather than to individual bytes. Thus, for example,
+\x{100}{2} matches two UTF-8 characters, each of which is represented by
+a two-byte sequence. Similarly, when Unicode property support is available,
+\X{3} matches three Unicode extended sequences, each of which may be several
+bytes long (and they may be of different lengths). <p>
+The quantifier {0} is
+permitted, causing the expression to behave as if the previous item and
+the quantifier were not present. <p>
+For convenience (and historical compatibility)
+the three most common quantifiers have single-character abbreviations: <p>
+
+ * is equivalent to {0,}<br>
+ + is equivalent to {1,}<br>
+ ? is equivalent to {0,1}<br>
+ <p>
+It is possible to construct infinite loops by following a subpattern that
+can match no characters with a quantifier that has no upper limit, for
+example: <p>
+ (a?)*<br>
+ <p>
+Earlier versions of Perl and PCRE used to give an error at compile time
+for such patterns. However, because there are cases where this can be useful,
+such patterns are now accepted, but if any repetition of the subpattern
+does in fact match no characters, the loop is forcibly broken. <p>
+By default,
+the quantifiers are "greedy", that is, they match as much as possible (up
+to the maximum number of permitted times), without causing the rest of
+the pattern to fail. The classic example of where this gives problems is
+in trying to match comments in C programs. These appear between /* and */
+and within the comment, individual * and / characters may appear. An attempt
+to match C comments by applying the pattern <p>
+ /\*.*\*/<br>
+ <p>
+to the string <p>
+ /* first comment */ not comment /* second comment */<br>
+ <p>
+fails, because it matches the entire string owing to the greediness of
+the .* item. <p>
+However, if a quantifier is followed by a question mark, it
+ceases to be greedy, and instead matches the minimum number of times possible,
+so the pattern <p>
+ /\*.*?\*/<br>
+ <p>
+does the right thing with the C comments. The meaning of the various quantifiers
+is not otherwise changed, just the preferred number of matches. Do not confuse
+this use of question mark with its use as a quantifier in its own right.
+Because it has two uses, it can sometimes appear doubled, as in <p>
+ \d??\d<br>
+ <p>
+which matches one digit by preference, but can match two if that is the
+only way the rest of the pattern matches. <p>
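+The difference between greedy and minimizing repetition can be seen in a small
+sketch. It assumes Python&rsquo;s <b>re</b> module as a stand-in for PCRE, since both
+treat a question mark after a quantifier in the same way; the variable names
+are illustrative only.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d?? prefers zero digits, so one digit is enough for \d??\d ...
one_digit = re.match(r"\d??\d", "12").group()
# ... and two are taken only when the rest of the match requires it.
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject, lazy, one_digit, two_digits)
```

+<p>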
+If the PCRE_UNGREEDY option is
+set (an option which is not available in Perl), the quantifiers are not
+greedy by default, but individual ones can be made greedy by following
+them with a question mark. In other words, it inverts the default behaviour.
+<p>
+When a parenthesized subpattern is quantified with a minimum repeat count
+that is greater than 1 or with a limited maximum, more memory is required
+for the compiled pattern, in proportion to the size of the minimum or maximum.
+<p>
+If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
+to Perl&rsquo;s /s) is set, thus allowing the . to match newlines, the pattern
+is implicitly anchored, because whatever follows will be tried against
+every character position in the subject string, so there is no point in
+retrying the overall match at any position after the first. PCRE normally
+treats such a pattern as though it were preceded by \A. <p>
+In cases where it
+is known that the subject string contains no newlines, it is worth setting
+PCRE_DOTALL in order to obtain this optimization, or alternatively using
+^ to indicate anchoring explicitly. <p>
+However, there is one situation where
+the optimization cannot be used. When .* is inside capturing parentheses
+that are the subject of a backreference elsewhere in the pattern, a match
+at the start may fail, and a later one succeed. Consider, for example: <p>
+
+ (.*)abc\1<br>
+ <p>
+If the subject is "xyz123abc123" the match point is the fourth character.
+For this reason, such a pattern is not implicitly anchored. <p>
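+A quick check of this example, assuming Python&rsquo;s <b>re</b> module as a stand-in
+for PCRE (note that Python reports zero-based offsets, so the fourth character
+is offset 3):

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because \1 cannot be matched after "abc".
match_start = match.start()
captured = match.group(1)

print(match_start, captured)
```

+<p>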
+When a capturing
+subpattern is repeated, the value captured is the substring that matched
+the final iteration. For example, after <p>
+ (tweedle[dume]{3}\s*)+<br>
+ <p>
+has matched "tweedledum tweedledee" the value of the captured substring
+is "tweedledee". However, if there are nested capturing subpatterns, the
+corresponding captured values may have been set in previous iterations.
+For example, after <p>
+ /(a|(b))+/<br>
+ <p>
+matches "aba" the value of the second captured substring is "b".
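+<p>
+Both examples can be reproduced in a short sketch, assuming Python&rsquo;s <b>re</b>
+module as a stand-in for PCRE; in these two cases it keeps captured values in
+the way described above. The variable names are illustrative.

```python
import re

# The outer group is overwritten on every iteration; the last one wins.
last_iteration = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                              "tweedledum tweedledee").group(1)

# Group 2 was set on the second iteration ("b") and is not cleared when the
# third iteration matches "a" through the other alternative.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()

print(last_iteration, nested)
```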
+<h2><a name='sect17' href='#toc17'>Atomic
+Grouping and Possessive Quantifiers</a></h2>
+ <p>
+With both maximizing and minimizing
+repetition, failure of what follows normally causes the repeated item to
+be re-evaluated to see if a different number of repeats allows the rest
+of the pattern to match. Sometimes it is useful to prevent this, either
+to change the nature of the match, or to cause it to fail earlier than it
+otherwise might, when the author of the pattern knows there is no point
+in carrying on. <p>
+Consider, for example, the pattern \d+foo when applied to
+the subject line <p>
+ 123456bar<br>
+ <p>
+After matching all 6 digits and then failing to match "foo", the normal
+action of the matcher is to try again with only 5 digits matching the \d+
+item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
+(a term taken from Jeffrey Friedl&rsquo;s book) provides the means for specifying
+that once a subpattern has matched, it is not to be re-evaluated in this
+way. <p>
+If we use atomic grouping for the previous example, the matcher would
+give up immediately on failing to match "foo" the first time. The notation
+is a kind of special parenthesis, starting with (?&gt; as in this example:
+<p>
+ (?&gt;\d+)foo<br>
+ <p>
+This kind of parenthesis "locks up" the part of the pattern it contains
+once it has matched, and a failure further into the pattern is prevented
+from backtracking into it. Backtracking past it to previous items, however,
+works as normal. <p>
+An alternative description is that a subpattern of this
+type matches the string of characters that an identical standalone pattern
+would match, if anchored at the current point in the subject string. <p>
+Atomic
+grouping subpatterns are not capturing subpatterns. Simple cases such as
+the above example can be thought of as a maximizing repeat that must swallow
+everything it can. So, while both \d+ and \d+? are prepared to adjust the
+number of digits they match in order to make the rest of the pattern match,
+(?&gt;\d+) can only match an entire sequence of digits. <p>
+Atomic groups in general
+can of course contain arbitrarily complicated subpatterns, and can be nested.
+However, when the subpattern for an atomic group is just a single repeated
+item, as in the example above, a simpler notation, called a "possessive
+quantifier" can be used. This consists of an additional + character following
+a quantifier. Using this notation, the previous example can be rewritten
+as <p>
+ \d++foo<br>
+ <p>
+Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
+option is ignored. They are a convenient notation for the simpler forms
+of atomic group. However, there is no difference in the meaning or processing
+of a possessive quantifier and the equivalent atomic group. <p>
+The possessive
+quantifier syntax is an extension to the Perl syntax. It originates in Sun&rsquo;s
+Java package. <p>
+When a pattern contains an unlimited repeat inside a subpattern
+that can itself be repeated an unlimited number of times, the use of an
+atomic group is the only way to avoid some failing matches taking a very
+long time indeed. The pattern <p>
+ (\D+|&lt;\d+&gt;)*[!?]<br>
+ <p>
+matches an unlimited number of substrings that either consist of non-digits,
+or digits enclosed in &lt;&gt;, followed by either ! or ?. When it matches, it runs
+quickly. However, if it is applied to <p>
+ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa<br>
+ <p>
+it takes a long time before reporting failure. This is because the string
+can be divided between the internal \D+ repeat and the external * repeat
+in a large number of ways, and all have to be tried. (The example uses [!?]
+rather than a single character at the end, because both PCRE and Perl have
+an optimization that allows for fast failure when a single character is
+used. They remember the last single character that is required for a match,
+and fail early if it is not present in the string.) If the pattern is changed
+so that it uses an atomic group, like this: <p>
+ ((?&gt;\D+)|&lt;\d+&gt;)*[!?]<br>
+ <p>
+sequences of non-digits cannot be broken, and failure happens quickly.
+
+<h2><a name='sect18' href='#toc18'>Back References</a></h2>
+ <p>
+Outside a character class, a backslash followed by a
+digit greater than 0 (and possibly further digits) is a back reference
+to a capturing subpattern earlier (that is, to its left) in the pattern,
+provided there have been that many previous capturing left parentheses.
+<p>
+However, if the decimal number following the backslash is less than 10,
+it is always taken as a back reference, and causes an error only if there
+are not that many capturing left parentheses in the entire pattern. In other
+words, the parentheses that are referenced need not be to the left of the
+reference for numbers less than 10. See the subsection entitled "Non-printing
+characters" above for further details of the handling of digits following
+a backslash. <p>
+A back reference matches whatever actually matched the capturing
+subpattern in the current subject string, rather than anything matching
+the subpattern itself (see "Subpatterns as subroutines" below for a
+way of doing that). So the pattern <p>
+ (sens|respons)e and \1ibility<br>
+ <p>
+matches "sense and sensibility" and "response and responsibility", but
+not "sense and responsibility". If caseful matching is in force at the time
+of the back reference, the case of letters is relevant. For example, <p>
+ ((?i)rah)\s+\1<br>
+ <p>
+matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
+capturing subpattern is matched caselessly. <p>
+Back references to named subpatterns
+use the Python syntax (?P=name). We could rewrite the above example as follows:
+<p>
+ (?P&lt;p1&gt;(?i)rah)\s+(?P=p1)<br>
+ <p>
+There may be more than one back reference to the same subpattern. If a
+subpattern has not actually been used in a particular match, any back references
+to it always fail. For example, the pattern <p>
+ (a|(bc))\2<br>
+ <p>
+always fails if it starts to match "a" rather than "bc". Because there
+may be many capturing parentheses in a pattern, all digits following the
+backslash are taken as part of a potential back reference number. If the
+pattern continues with a digit character, some delimiter must be used to
+terminate the back reference. If the PCRE_EXTENDED option is set, this can
+be whitespace. Otherwise an empty comment (see "Comments" below) can
+be used. <p>
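+The behaviour of numbered, named and unset back references can be sketched as
+follows. This assumes Python&rsquo;s <b>re</b> module as a stand-in for PCRE; the
+(?i) setting from the example above is omitted because Python does not accept
+it in the middle of a pattern, and all variable names are illustrative.

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
same_word = numbered.fullmatch("sense and sensibility") is not None
mixed_words = numbered.fullmatch("sense and responsibility") is not None

# A named group is referenced with (?P=name).
named = re.compile(r"(?P<p1>rah)\s+(?P=p1)")
named_ok = named.fullmatch("rah rah") is not None

# \2 refers to (bc); if the "a" branch was taken, group 2 is unset and the
# back reference fails.
unset = re.compile(r"(a|(bc))\2")
via_bc = unset.fullmatch("bcbc") is not None
via_a = unset.fullmatch("a") is not None

print(same_word, mixed_words, named_ok, via_bc, via_a)
```

+<p>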
+A back reference that occurs inside the parentheses to which it
+refers fails when the subpattern is first used, so, for example, (a\1) never
+matches. However, such references can be useful inside repeated subpatterns.
+For example, the pattern <p>
+ (a|b\1)+<br>
+ <p>
+matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration
+of the subpattern, the back reference matches the character string corresponding
+to the previous iteration. In order for this to work, the pattern must be
+such that the first iteration does not need to match the back reference.
+This can be done using alternation, as in the example above, or by a quantifier
+with a minimum of zero.
+<h2><a name='sect19' href='#toc19'>Assertions</a></h2>
+ <p>
+An assertion is a test on the characters
+following or preceding the current matching point that does not actually
+consume any characters. The simple assertions coded as \b, \B, \A, \G, \Z, \z,
+^ and $ are described above. <p>
+More complicated assertions are coded as
+subpatterns. There are two kinds: those that look ahead of the current position
+in the subject string, and those that look behind it. An assertion subpattern
+is matched in the normal way, except that it does not cause the current
+matching position to be changed. <p>
+Assertion subpatterns are not capturing
+subpatterns, and may not be repeated, because it makes no sense to assert
+the same thing several times. If any kind of assertion contains capturing
+subpatterns within it, these are counted for the purposes of numbering
+the capturing subpatterns in the whole pattern. However, substring capturing
+is carried out only for positive assertions, because it does not make sense
+for negative assertions.
+<h3><a name='sect20' href='#toc20'>Lookahead assertions</a></h3>
+ <p>
+Lookahead assertions start
+with (?= for positive assertions and (?! for negative assertions. For example,
+<p>
+ \w+(?=;)<br>
+ <p>
+matches a word followed by a semicolon, but does not include the semicolon
+in the match, and <p>
+ foo(?!bar)<br>
+ <p>
+matches any occurrence of "foo" that is not followed by "bar". Note that
+the apparently similar pattern <p>
+ (?!foo)bar<br>
+ <p>
+does not find an occurrence of "bar" that is preceded by something other
+than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion
+(?!foo) is always true when the next three characters are "bar". A lookbehind
+assertion is needed to achieve the other effect. <p>
+If you want to force a
+matching failure at some point in a pattern, the most convenient way to
+do it is with (?!) because an empty string always matches, so an assertion
+that requires there not to be an empty string must always fail.
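+<p>
+The lookahead examples above can be checked with a short sketch, assuming
+Python&rsquo;s <b>re</b> module as a stand-in for PCRE. The subject strings and
+variable names are invented for illustration; offsets are zero-based.

```python
import re

# The semicolon is required but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "return value;").group()

# Only the "foo" at offset 7 is not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" begins, so it is always true there.
bar_anywhere = re.search(r"(?!foo)bar", "foobar").start()

# (?!) can never succeed, so the search finds nothing.
forced_failure = re.search(r"(?!)", "any subject")

print(word_before_semicolon, foo_not_bar, bar_anywhere, forced_failure)
```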
+<h3><a name='sect21' href='#toc21'>Lookbehind
+assertions</a></h3>
+ <p>
+Lookbehind assertions start with (?&lt;= for positive assertions
+and (?&lt;! for negative assertions. For example, <p>
+ (?&lt;!foo)bar<br>
+ <p>
+does find an occurrence of "bar" that is not preceded by "foo". The contents
+of a lookbehind assertion are restricted such that all the strings it matches
+must have a fixed length. However, if there are several alternatives, they
+do not all have to have the same fixed length. Thus <p>
+ (?&lt;=bullock|donkey)<br>
+ <p>
+is permitted, but <p>
+ (?&lt;!dogs?|cats?)<br>
+ <p>
+causes an error at compile time. Branches that match different length strings
+are permitted only at the top level of a lookbehind assertion. This is an
+extension compared with Perl (at least for 5.8), which requires all branches
+to match the same length of string. An assertion such as <p>
+ (?&lt;=ab(c|de))<br>
+ <p>
+is not permitted, because its single top-level branch can match two different
+lengths, but it is acceptable if rewritten to use two top-level branches:
+<p>
+ (?&lt;=abc|abde)<br>
+ <p>
+The implementation of lookbehind assertions is, for each alternative,
+to temporarily move the current position back by the fixed width and then
+try to match. If there are insufficient characters before the current position,
+the match is deemed to fail. <p>
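+A minimal sketch of a negative lookbehind follows, assuming Python&rsquo;s <b>re</b>
+module as a stand-in for PCRE. Python is stricter than PCRE here: it requires
+the whole lookbehind to have one fixed width, so the (?&lt;=bullock|donkey)
+example is not reproduced. Variable names are illustrative.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")

# "bar" is preceded by "foo", so the assertion rejects it.
after_foo = not_after_foo.search("foobar")

# "bar" is preceded by "baz", so it is found at offset 3.
after_baz = not_after_foo.search("bazbar").start()

# With fewer than three characters before "bar", the lookbehind itself cannot
# match, so the negative assertion is satisfied.
at_start = not_after_foo.search("bar").start()

print(after_foo, after_baz, at_start)
```

+<p>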
+PCRE does not allow the \C escape (which matches
+a single byte in UTF-8 mode) to appear in lookbehind assertions, because
+it makes it impossible to calculate the length of the lookbehind. The \X
+escape, which can match different numbers of bytes, is also not permitted.
+<p>
+Atomic groups can be used in conjunction with lookbehind assertions to
+specify efficient matching at the end of the subject string. Consider a
+simple pattern such as <p>
+ abcd$<br>
+ <p>
+when applied to a long string that does not match. Because matching proceeds
+from left to right, PCRE will look for each "a" in the subject and then
+see if what follows matches the rest of the pattern. If the pattern is specified
+as <p>
+ ^.*abcd$<br>
+ <p>
+the initial .* matches the entire string at first, but when this fails
+(because there is no following "a"), it backtracks to match all but the
+last character, then all but the last two characters, and so on. Once again
+the search for "a" covers the entire string, from right to left, so we
+are no better off. However, if the pattern is written as <p>
+ ^(?&gt;.*)(?&lt;=abcd)<br>
+ <p>
+or, equivalently, using the possessive quantifier syntax, <p>
+ ^.*+(?&lt;=abcd)<br>
+ <p>
+there can be no backtracking for the .* item; it can match only the entire
+string. The subsequent lookbehind assertion does a single test on the last
+four characters. If it fails, the match fails immediately. For long strings,
+this approach makes a significant difference to the processing time.
+<h3><a name='sect22' href='#toc22'>Using
+multiple assertions</a></h3>
+ <p>
+Several assertions (of any sort) may occur in succession.
+For example, <p>
+ (?&lt;=\d{3})(?&lt;!999)foo<br>
+ <p>
+matches "foo" preceded by three digits that are not "999". Notice that
+each of the assertions is applied independently at the same point in the
+subject string. First there is a check that the previous three characters
+are all digits, and then there is a check that the same three characters
+are not "999". This pattern does <i>not</i> match "foo" preceded by six characters,
+the first three of which are digits and the last three of which are not "999".
+For example, it doesn&rsquo;t match "123abcfoo". A pattern to do that is <p>
+ (?&lt;=\d{3}...)(?&lt;!999)foo<br>
+ <p>
+This time the first assertion looks at the preceding six characters, checking
+that the first three are digits, and then the second assertion checks that
+the preceding three characters are not "999". <p>
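+The two patterns can be compared side by side in a short sketch, assuming
+Python&rsquo;s <b>re</b> module as a stand-in for PCRE. The helper <b>found</b> and
+the test subjects other than "123abcfoo" are assumptions made for the example.

```python
import re

adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """Return True when the pattern matches somewhere in the subject."""
    return pattern.search(subject) is not None

# Both assertions look at the same three characters just before "foo".
adjacent_results = [found(adjacent, s)
                    for s in ("123foo", "999foo", "123abcfoo")]

# Here the first assertion looks back six characters, the second only three.
six_back_results = [found(six_back, s)
                    for s in ("123abcfoo", "123999foo", "123foo")]

print(adjacent_results, six_back_results)
```

+<p>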
+Assertions can be nested in
+any combination. For example, <p>
+ (?&lt;=(?&lt;!foo)bar)baz<br>
+ <p>
+matches an occurrence of "baz" that is preceded by "bar" which in turn
+is not preceded by "foo", while <p>
+ (?&lt;=\d{3}(?!999)...)foo<br>
+ <p>
+is another pattern that matches "foo" preceded by three digits and any
+three characters that are not "999".
+<h2><a name='sect23' href='#toc23'>Conditional Subpatterns</a></h2>
+ <p>
+It is possible
+to cause the matching process to obey a subpattern conditionally or to
+choose between two alternative subpatterns, depending on the result of
+an assertion, or whether a previous capturing subpattern matched or not.
+The two possible forms of conditional subpattern are <p>
+ (?(condition)yes-pattern)<br>
+ (?(condition)yes-pattern|no-pattern)<br>
+ <p>
+If the condition is satisfied, the yes-pattern is used; otherwise the no-pattern
+(if present) is used. If there are more than two alternatives in the subpattern,
+a compile-time error occurs. <p>
+There are three kinds of condition. If the text
+between the parentheses consists of a sequence of digits, the condition
+is satisfied if the capturing subpattern of that number has previously
+matched. The number must be greater than zero. Consider the following pattern,
+which contains non-significant white space to make it more readable (assume
+the PCRE_EXTENDED option) and to divide it into three parts for ease of
+discussion: <p>
+ ( \( )? [^()]+ (?(1) \) )<br>
+ <p>
+The first part matches an optional opening parenthesis, and if that character
+is present, sets it as the first captured substring. The second part matches
+one or more characters that are not parentheses. The third part is a conditional
+subpattern that tests whether the first set of parentheses matched or not.
+If they did, that is, if the subject started with an opening parenthesis, the
+condition is true, and so the yes-pattern is executed and a closing parenthesis
+is required. Otherwise, since no-pattern is not present, the subpattern matches
+nothing. In other words, this pattern matches a sequence of non-parentheses,
+optionally enclosed in parentheses. <p>
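+The numbered-condition example can be tried as written in a short sketch. It
+assumes Python&rsquo;s <b>re</b> module as a stand-in for PCRE, with <b>re.VERBOSE</b>
+playing the role of PCRE_EXTENDED; the subject strings and names are
+illustrative.

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

def whole_match(subject):
    """Return True when the pattern matches the entire subject string."""
    return optional_parens.fullmatch(subject) is not None

# A closing parenthesis is required exactly when an opening one was captured.
results = {s: whole_match(s) for s in ("(abc)", "abc", "(abc", "abc)")}

print(results)
```

+<p>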
+If the condition is the string (R),
+it is satisfied if a recursive call to the pattern or subpattern has been
+made. At "top level", the condition is false. This is a PCRE extension. Recursive
+patterns are described in the section entitled "Recursive Patterns" below. <p>
+If the condition is not a sequence
+of digits or (R), it must be an assertion. This may be a positive or negative
+lookahead or lookbehind assertion. Consider this pattern, again containing
+non-significant white space, and with the two alternatives on the second
+line: <p>
+ (?(?=[^a-z]*[a-z])<br>
+ \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )<br>
+ <p>
+The condition is a positive lookahead assertion that matches an optional
+sequence of non-letters followed by a letter. In other words, it tests for
+the presence of at least one letter in the subject. If a letter is found,
+the subject is matched against the first alternative; otherwise it is matched
+against the second. This pattern matches strings in one of the two forms
+dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
+<h2><a name='sect24' href='#toc24'>Comments</a></h2>
+
+<p>
+The sequence (?# marks the start of a comment that continues up to the
+next closing parenthesis. Nested parentheses are not permitted. The characters
+that make up a comment play no part in the pattern matching at all. <p>
+If the
+PCRE_EXTENDED option is set, an unescaped # character outside a character
+class introduces a comment that continues up to the next newline character
+in the pattern.
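+<p>
+Both comment styles can be sketched as follows, assuming Python&rsquo;s <b>re</b>
+module as a stand-in for PCRE, with <b>re.VERBOSE</b> playing the role of
+PCRE_EXTENDED. The date-like pattern and the subject string are assumptions
+chosen for illustration.

```python
import re

# (?# ... ) comments are skipped wherever they appear in the pattern.
inline = re.compile(r"\d{2}(?#day)-[a-z]{3}(?#month)-\d{2}(?#year)")

# In extended mode, an unescaped # starts a comment that runs to the newline.
extended = re.compile(r"""
    \d{2} -      # two-digit day
    [a-z]{3} -   # three-letter month
    \d{2}        # two-digit year
""", re.VERBOSE)

inline_ok = inline.fullmatch("09-sep-04") is not None
extended_ok = extended.fullmatch("09-sep-04") is not None

print(inline_ok, extended_ok)
```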
+<h2><a name='sect25' href='#toc25'>Recursive Patterns</a></h2>
+ <p>
+Consider the problem of matching a
+string in parentheses, allowing for unlimited nested parentheses. Without
+the use of recursion, the best that can be done is to use a pattern that
+matches up to some fixed depth of nesting. It is not possible to handle
+an arbitrary nesting depth. Perl provides a facility that allows regular
+expressions to recurse (amongst other things). It does this by interpolating
+Perl code in the expression at run time, and the code can refer to the
+expression itself. A Perl pattern to solve the parentheses problem can be
+created like this: <p>
+ $re = qr{\( (?: (?&gt;[^()]+) | (?p{$re}) )* \)}x;<br>
+ <p>
+The (?p{...}) item interpolates Perl code at run time, and in this case refers
+recursively to the pattern in which it appears. Obviously, PCRE cannot support
+the interpolation of Perl code. Instead, it supports some special syntax
+for recursion of the entire pattern, and also for individual subpattern
+recursion. <p>
+The special item that consists of (? followed by a number greater
+than zero and a closing parenthesis is a recursive call of the subpattern
+of the given number, provided that it occurs inside that subpattern. (If
+not, it is a "subroutine" call, which is described in the next section.)
+The special item (?R) is a recursive call of the entire regular expression.
+<p>
+For example, this PCRE pattern solves the nested parentheses problem (assume
+the PCRE_EXTENDED option is set so that white space is ignored): <p>
+ \( ( (?&gt;[^()]+) | (?R) )* \)<br>
+ <p>
+First it matches an opening parenthesis. Then it matches any number of
+substrings which can either be a sequence of non-parentheses, or a recursive
+match of the pattern itself (that is, a correctly parenthesized substring).
+Finally there is a closing parenthesis. <p>
+If this were part of a larger pattern,
+you would not want to recurse the entire pattern, so instead you could
+use this: <p>
+ ( \( ( (?&gt;[^()]+) | (?1) )* \) )<br>
+ <p>
+We have put the pattern into parentheses, and caused the recursion to
+refer to them instead of the whole pattern. In a larger pattern, keeping
+track of parenthesis numbers can be tricky. It may be more convenient to
+use named parentheses instead. For this, PCRE uses (?P&gt;name), which is an
+extension to the Python syntax that PCRE uses for named parentheses (Perl
+does not provide named parentheses). We could rewrite the above example
+as follows: <p>
+ (?P&lt;pn&gt; \( ( (?&gt;[^()]+) | (?P&gt;pn) )* \) )<br>
+ <p>
+This particular example pattern contains nested unlimited repeats, and
+so the use of atomic grouping for matching strings of non-parentheses is
+important when applying the pattern to strings that do not match. For example,
+when this pattern is applied to <p>
+ (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()<br>
+ <p>
+it yields "no match" quickly. However, if atomic grouping is not used,
+the match runs for a very long time indeed because there are so many different
+ways the + and * repeats can carve up the subject, and all have to be tested
+before failure can be reported. <p>
+At the end of a match, the values set for
+any capturing subpatterns are those from the outermost level of the recursion
+at which the subpattern value is set. If you want to obtain intermediate
+values, a callout function can be used (see the "Callouts" section below and the <b>pcrecallout</b>
+ documentation). If the pattern above is matched against <p>
+ (ab(cd)ef)<br>
+ <p>
+the value for the capturing parentheses is "ef", which is the last value
+taken on at the top level. If an additional pair of parentheses is added
+around the repeated group, giving <p>
+
+ \( ( ( (?&gt;[^()]+) | (?R) )* ) \)<br>
+ <p>
+the string they capture is "ab(cd)ef", the contents of the top level parentheses.
+If there are more than 15 capturing parentheses in a pattern, PCRE has
+to obtain extra memory to store data during a recursion, which it does
+by using <b>pcre_malloc</b>, freeing it via <b>pcre_free</b> afterwards. If no memory
+can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. <p>
+Do
+not confuse the (?R) item with the condition (R), which tests for recursion.
+Consider this pattern, which matches text in angle brackets, allowing for
+arbitrary nesting. Only digits are allowed in nested brackets (that is,
+when recursing), whereas any characters are permitted at the outer level.
+<p>
+ &lt; (?: (?(R) \d++ | [^&lt;&gt;]*+) | (?R)) * &gt;<br>
+ <p>
+In this pattern, (?(R) is the start of a conditional subpattern, with
+two different alternatives for the recursive and non-recursive cases. The
+(?R) item is the actual recursive call.
+<h2><a name='sect26' href='#toc26'>Subpatterns As Subroutines</a></h2>
+ <p>
+If
+the syntax for a recursive subpattern reference (either by number or by
+name) is used outside the parentheses to which it refers, it operates like
+a subroutine in a programming language. An earlier example pointed out that
+the pattern <p>
+ (sens|respons)e and \1ibility<br>
+ <p>
+matches "sense and sensibility" and "response and responsibility", but
+not "sense and responsibility". If instead the pattern <p>
+ (sens|respons)e and (?1)ibility<br>
+ <p>
+is used, it does match "sense and responsibility" as well as the other
+two strings. Such references must, however, follow the subpattern to which
+they refer.
+<h2><a name='sect27' href='#toc27'>Callouts</a></h2>
+ <p>
+Perl has a feature whereby using the sequence (?{...})
+causes arbitrary Perl code to be obeyed in the middle of matching a regular
+expression. This makes it possible, amongst other things, to extract different
+substrings that match the same pair of parentheses when there is a repetition.
+<p>
+PCRE provides a similar feature, but of course it cannot obey arbitrary
+Perl code. The feature is called "callout". The caller of PCRE provides an
+external function by putting its entry point in the global variable <i>pcre_callout</i>.
+By default, this variable contains NULL, which disables all calling out.
+<p>
+Within a regular expression, (?C) indicates the points at which the external
+function is to be called. If you want to identify different callout points,
+you can put a number less than 256 after the letter C. The default value
+is zero. For example, this pattern has two callout points: <p>
+ (?C1)dabc(?C2)def<br>
+ <p>
+If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are
+automatically installed before each item in the pattern. They are all numbered
+255. <p>
+During matching, when PCRE reaches a callout point (and <i>pcre_callout</i>
+is set), the external function is called. It is provided with the number
+of the callout, the position in the pattern, and, optionally, one item
+of data originally supplied by the caller of <b>pcre_exec()</b>. The callout function
+may cause matching to proceed, to backtrack, or to fail altogether. A complete
+description of the interface to the callout function is given in the <b>pcrecallout</b>
+ documentation. <p>
+ Last updated: 09 September 2004 <br>
+Copyright (c) 1997-2004 University of Cambridge. <p>
+
+<hr><p>
+<a name='toc'><b>Table of Contents</b></a><p>
+<ul>
+<li><a name='toc0' href='#sect0'>Name</a></li>
+<li><a name='toc1' href='#sect1'>Pcre Regular Expression Details</a></li>
+<li><a name='toc2' href='#sect2'>Backslash</a></li>
+<ul>
+<li><a name='toc3' href='#sect3'>Non-printing characters</a></li>
+<li><a name='toc4' href='#sect4'>Generic character types</a></li>
+<li><a name='toc5' href='#sect5'>Unicode character properties</a></li>
+<li><a name='toc6' href='#sect6'>Simple assertions</a></li>
+</ul>
+<li><a name='toc7' href='#sect7'>Circumflex and Dollar</a></li>
+<li><a name='toc8' href='#sect8'>Full Stop (period, Dot)</a></li>
+<li><a name='toc9' href='#sect9'>Matching a Single Byte</a></li>
+<li><a name='toc10' href='#sect10'>Square Brackets and Character Classes</a></li>
+<li><a name='toc11' href='#sect11'>Posix Character Classes</a></li>
+<li><a name='toc12' href='#sect12'>Vertical Bar</a></li>
+<li><a name='toc13' href='#sect13'>Internal Option Setting</a></li>
+<li><a name='toc14' href='#sect14'>Subpatterns</a></li>
+<li><a name='toc15' href='#sect15'>Named Subpatterns</a></li>
+<li><a name='toc16' href='#sect16'>Repetition</a></li>
+<li><a name='toc17' href='#sect17'>Atomic Grouping and Possessive Quantifiers</a></li>
+<li><a name='toc18' href='#sect18'>Back References</a></li>
+<li><a name='toc19' href='#sect19'>Assertions</a></li>
+<ul>
+<li><a name='toc20' href='#sect20'>Lookahead assertions</a></li>
+<li><a name='toc21' href='#sect21'>Lookbehind assertions</a></li>
+<li><a name='toc22' href='#sect22'>Using multiple assertions</a></li>
+</ul>
+<li><a name='toc23' href='#sect23'>Conditional Subpatterns</a></li>
+<li><a name='toc24' href='#sect24'>Comments</a></li>
+<li><a name='toc25' href='#sect25'>Recursive Patterns</a></li>
+<li><a name='toc26' href='#sect26'>Subpatterns As Subroutines</a></li>
+<li><a name='toc27' href='#sect27'>Callouts</a></li>
+</ul>
+</body>
+</html>