PCRE(3) manual page

From 7fd9fe181150f166a098eaf4e006f878c28cb770 Mon Sep 17 00:00:00 2001 From: Gluzskiy Alexandr Date: Mon, 15 Feb 2010 05:51:01 +0300 Subject: sort --- Utilities/PCRE/man/html/pcreapi.3.html | 1069 ++++++++++++++++++++++++++++++++ 1 file changed, 1069 insertions(+) create mode 100644 Utilities/PCRE/man/html/pcreapi.3.html (limited to 'Utilities/PCRE/man/html/pcreapi.3.html') diff --git a/Utilities/PCRE/man/html/pcreapi.3.html b/Utilities/PCRE/man/html/pcreapi.3.html new file mode 100644 index 0000000..a083204 --- /dev/null +++ b/Utilities/PCRE/man/html/pcreapi.3.html @@ -0,0 +1,1069 @@ + + + + + +PCRE(3) manual page + + +Table of Contents

+ +

Name

+PCRE - Perl-compatible regular expressions +

PCRE Native API

+#include <pcre.h> +

+ +
+pcre *pcre_compile(const char *pattern, int options, const char **errptr, +int *erroffset, const unsigned char *tableptr);

+
+pcre_extra *pcre_study(const pcre *code, int options, const char **errptr); +

+
+int pcre_exec(const pcre *code, const pcre_extra *extra, const char +*subject, int length, int startoffset, int options, int *ovector, int +ovecsize);

+
+int pcre_copy_named_substring(const pcre *code, const char *subject, int +*ovector, int stringcount, const char *stringname, char *buffer, int +buffersize);

+
+int pcre_copy_substring(const char *subject, int *ovector, int stringcount, +int stringnumber, char *buffer, int buffersize);

+
+int pcre_get_named_substring(const pcre *code, const char *subject, int +*ovector, int stringcount, const char *stringname, const char **stringptr); +

+
+int pcre_get_stringnumber(const pcre *code, const char *name);

+
+int pcre_get_substring(const char *subject, int *ovector, int stringcount, +int stringnumber, const char **stringptr);

+
+int pcre_get_substring_list(const char *subject, int *ovector, int stringcount, +const char ***listptr);

+
+void pcre_free_substring(const char *stringptr);

+
+void pcre_free_substring_list(const char **stringptr);

+
+const unsigned char *pcre_maketables(void);

+
+int pcre_fullinfo(const pcre *code, const pcre_extra *extra, int what, +void *where);

+
+int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

+
+int pcre_config(int what, void *where);

+
+char *pcre_version(void);

+
+void *(*pcre_malloc)(size_t);

+
+void (*pcre_free)(void *);

+
+void *(*pcre_stack_malloc)(size_t);

+
+void (*pcre_stack_free)(void *);

+
+int (*pcre_callout)(pcre_callout_block *); +

PCRE API Overview

+PCRE has +its own native API, which is described in this document. There is also a +set of wrapper functions that correspond to the POSIX regular expression +API. These are described in the pcreposix documentation.

+The native API +function prototypes are defined in the header file pcre.h, and on Unix systems +the library itself is called libpcre. It can normally be accessed by adding +-lpcre to the command for linking an application that uses PCRE. The header +file defines the macros PCRE_MAJOR and PCRE_MINOR to contain the major +and minor release numbers for the library. Applications can use these to +include support for different releases of PCRE.

+The functions pcre_compile(), +pcre_study(), and pcre_exec() are used for compiling and matching regular +expressions. A sample program that demonstrates the simplest way of using +them is provided in the file called pcredemo.c in the source distribution. +The pcresample documentation describes how to run it.

+In addition to the +main compiling and matching functions, there are convenience functions +for extracting captured substrings from a matched subject string. They are: +

+ pcre_copy_substring()
+ pcre_copy_named_substring()
+ pcre_get_substring()
+ pcre_get_named_substring()
+ pcre_get_substring_list()
+ pcre_get_stringnumber()
+

+pcre_free_substring() and pcre_free_substring_list() are also provided, +to free the memory used for extracted strings.

+The function pcre_maketables() +is used to build a set of character tables in the current locale for passing +to pcre_compile() or pcre_exec(). This is an optional facility that is provided +for specialist use. Most commonly, no special tables are passed, in which +case internal tables that are generated when PCRE is built are used.

+The +function pcre_fullinfo() is used to find out information about a compiled +pattern; pcre_info() is an obsolete version that returns only some of the +available information, but is retained for backwards compatibility. The +function pcre_version() returns a pointer to a string containing the version +of PCRE and its date of release.

+The global variables pcre_malloc and pcre_free +initially contain the entry points of the standard malloc() and free() +functions, respectively. PCRE calls the memory management functions via +these variables, so a calling program can replace them if it wishes to +intercept the calls. This should be done before calling any PCRE functions. +
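As an illustration of intercepting these calls, the sketch below defines a pair of counting wrappers whose signatures match the pcre_malloc and pcre_free variables. The names counting_malloc, counting_free and live_blocks are hypothetical and not part of PCRE; the assignment that installs them is shown only as a comment so that the fragment stands alone.

```c
#include <stdlib.h>

/* Hypothetical counter of blocks currently allocated through the wrappers. */
static size_t live_blocks = 0;

/* Same signature as the pcre_malloc variable: void *(*)(size_t). */
static void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL)
        live_blocks++;
    return block;
}

/* Same signature as the pcre_free variable: void (*)(void *). */
static void counting_free(void *block)
{
    if (block != NULL) {
        live_blocks--;
        free(block);
    }
}

/* In a program that includes <pcre.h>, install the wrappers before
   calling any PCRE function:

       pcre_malloc = counting_malloc;
       pcre_free   = counting_free;
*/
```

A non-zero live_blocks at program exit would then point to a compiled pattern or table block that was never released.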

+The global variables pcre_stack_malloc and pcre_stack_free are also indirections +to memory management functions. These special functions are used only when +PCRE is compiled to use the heap for remembering data, instead of recursive +function calls. This is a non-standard way of building PCRE, for use in environments +that have limited stacks. Because of the greater use of memory management, +it runs more slowly. Separate functions are provided so that special-purpose +external code can be used for this case. When used, these functions are +always called in a stack-like manner (last obtained, first freed), and always +for memory blocks of the same size.
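Because the blocks are all the same size and are released in last-obtained, first-freed order, one possible special-purpose allocator is a simple cache of freed blocks. The following is a minimal sketch under that assumption, not PCRE's own code; frame_malloc, frame_free and the cached_block type are hypothetical names. It asserts the same-size property rather than handling mixed sizes, and it never returns cached blocks to the system.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node overlaid on a freed block. */
typedef struct cached_block {
    struct cached_block *next;
} cached_block;

static cached_block *free_list = NULL;  /* most recently freed block first */
static size_t block_size = 0;           /* size seen on the first request */

/* Same signature as pcre_stack_malloc: void *(*)(size_t). */
static void *frame_malloc(size_t size)
{
    if (size < sizeof(cached_block))
        size = sizeof(cached_block);    /* leave room for the list link */
    if (block_size == 0)
        block_size = size;
    assert(size == block_size);         /* relies on the same-size property */

    if (free_list != NULL) {            /* reuse the newest freed block */
        cached_block *block = free_list;
        free_list = block->next;
        return block;
    }
    return malloc(size);
}

/* Same signature as pcre_stack_free: void (*)(void *). */
static void frame_free(void *ptr)
{
    cached_block *block = ptr;
    if (block == NULL)
        return;
    block->next = free_list;            /* keep the block for reuse */
    free_list = block;
}
```

In a build of PCRE that uses the heap, such functions would be installed by assigning them to pcre_stack_malloc and pcre_stack_free. The sketch is not thread-safe as written.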

+The global variable pcre_callout initially +contains NULL. It can be set by the caller to a "callout" function, which +PCRE will then call at specified points during a matching operation. Details +are given in the pcrecallout documentation. +

Multithreading

+The PCRE +functions can be used in multithreaded applications, with the proviso +that the memory management functions pointed to by pcre_malloc, pcre_free, +pcre_stack_malloc, and pcre_stack_free, and the callout function pointed +to by pcre_callout, are shared by all threads.

+The compiled form of a regular +expression is not altered during matching, so the same compiled pattern +can safely be used by several threads at once. +

Saving Precompiled Patterns for Later Use

+The compiled form of a regular expression can be saved and +re-used at a later time, possibly by a different program, and even on a +host other than the one on which it was compiled. Details are given in the + pcreprecompile documentation. +

Checking Build-time Options

+int pcre_config(int +what, void *where);

+The function pcre_config() makes it possible for a +PCRE client to discover which optional features have been compiled into +the PCRE library. The pcrebuild documentation has more details about these +optional features.

+The first argument for pcre_config() is an integer, specifying +which information is required; the second argument is a pointer to a variable +into which the information is placed. The following information is available: +

+ PCRE_CONFIG_UTF8
+

+The output is an integer that is set to one if UTF-8 support is available; +otherwise it is set to zero.

+ PCRE_CONFIG_UNICODE_PROPERTIES
+

+The output is an integer that is set to one if support for Unicode character +properties is available; otherwise it is set to zero.

+ PCRE_CONFIG_NEWLINE
+

+The output is an integer that is set to the value of the code that is +used for the newline character. It is either linefeed (10) or carriage return +(13), and should normally be the standard character for your operating +system.

+ PCRE_CONFIG_LINK_SIZE
+

+The output is an integer that contains the number of bytes used for internal +linkage in compiled regular expressions. The value is 2, 3, or 4. Larger +values allow larger regular expressions to be compiled, at the expense +of slower matching. The default value of 2 is sufficient for all but the +most massive patterns, since it allows the compiled pattern to be up to +64K in size.

+ PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
+

+The output is an integer that contains the threshold above which the POSIX +interface uses malloc() for output vectors. Further details are given in +the pcreposix documentation.

+ PCRE_CONFIG_MATCH_LIMIT
+

+The output is an integer that gives the default limit for the number of +internal matching function calls in a pcre_exec() execution. Further details +are given with pcre_exec() below.

+ PCRE_CONFIG_STACKRECURSE
+

+The output is an integer that is set to one if internal recursion is implemented +by recursive function calls that use the stack to remember their state. +This is the usual way that PCRE is compiled. The output is zero if PCRE +was compiled to use blocks of data on the heap instead of recursive function +calls. In this case, pcre_stack_malloc and pcre_stack_free are called to +manage memory blocks on the heap, thus avoiding the use of the stack. + +

Compiling a Pattern

+pcre *pcre_compile(const char *pattern, int options, + const char **errptr, int *erroffset, const unsigned char *tableptr); +

+The function pcre_compile() is called to compile a pattern into an internal +form. The pattern is a C string terminated by a binary zero, and is passed +in the pattern argument. A pointer to a single block of memory that is obtained +via pcre_malloc is returned. This contains the compiled code and related +data. The pcre type is defined for the returned block; this is a typedef +for a structure whose contents are not externally defined. It is up to the +caller to free the memory when it is no longer required.

+Although the compiled +code of a PCRE regex is relocatable, that is, it does not depend on memory +location, the complete pcre data block is not fully relocatable, because +it may contain a copy of the tableptr argument, which is an address (see +below).

+The options argument contains independent bits that affect the compilation. +It should be zero if no options are required. The available options are +described below. Some of them, in particular, those that are compatible +with Perl, can also be set and unset from within the pattern (see the detailed +description in the pcrepattern documentation). For these options, the +contents of the options argument specify their initial settings at the +start of compilation and execution. The PCRE_ANCHORED option can be set +at the time of matching as well as at compile time.

+If errptr is NULL, pcre_compile() +returns NULL immediately. Otherwise, if compilation of a pattern fails, +pcre_compile() returns NULL, and sets the variable pointed to by errptr +to point to a textual error message. The offset from the start of the pattern +to the character where the error was discovered is placed in the variable +pointed to by erroffset, which must not be NULL. If it is, an immediate +error is given.

+If the final argument, tableptr, is NULL, PCRE uses a default +set of character tables that are built when PCRE is compiled, using the +default C locale. Otherwise, tableptr must be an address that is the result +of a call to pcre_maketables(). This value is stored with the compiled pattern, +and used again by pcre_exec(), unless another table pointer is passed to +it. For more discussion, see the section on locale support below.

+This code +fragment shows a typical straightforward call to pcre_compile():

+ pcre *re;
+ const char *error;
+ int erroffset;
+ re = pcre_compile(
+ "^A.*Z", /* the pattern */
+ 0, /* default options */
+ &error, /* for error message */
+ &erroffset, /* for error offset */
+ NULL); /* use default character tables */
+

+The following names for option bits are defined in the pcre.h header file: +

+ PCRE_ANCHORED
+

+If this bit is set, the pattern is forced to be "anchored", that is, it +is constrained to match only at the first matching point in the string +that is being searched (the "subject string"). This effect can also be achieved +by appropriate constructs in the pattern itself, which is the only way +to do it in Perl.

+ PCRE_AUTO_CALLOUT
+

+If this bit is set, pcre_compile() automatically inserts callout items, +all with number 255, before each pattern item. For discussion of the callout +facility, see the pcrecallout documentation.

+ PCRE_CASELESS
+

+If this bit is set, letters in the pattern match both upper and lower +case letters. It is equivalent to Perl’s /i option, and it can be changed +within a pattern by a (?i) option setting. When running in UTF-8 mode, case +support for high-valued characters is available only when PCRE is built +with Unicode character property support.

+ PCRE_DOLLAR_ENDONLY
+

+If this bit is set, a dollar metacharacter in the pattern matches only +at the end of the subject string. Without this option, a dollar also matches +immediately before the final character if it is a newline (but not before +any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE +is set. There is no equivalent to this option in Perl, and no way to set +it within a pattern.

+ PCRE_DOTALL
+

+If this bit is set, a dot metacharacter in the pattern matches all characters, +including newlines. Without it, newlines are excluded. This option is equivalent +to Perl’s /s option, and it can be changed within a pattern by a (?s) option +setting. A negative class such as [^a] always matches a newline character, +independent of the setting of this option.

+ PCRE_EXTENDED
+

+If this bit is set, whitespace data characters in the pattern are totally +ignored except when escaped or inside a character class. Whitespace does +not include the VT character (code 11). In addition, characters between +an unescaped # outside a character class and the next newline character, +inclusive, are also ignored. This is equivalent to Perl’s /x option, and +it can be changed within a pattern by a (?x) option setting.

+This option +makes it possible to include comments inside complicated patterns. Note, +however, that this applies only to data characters. Whitespace characters +may never appear within special character sequences in a pattern, for example +within the sequence (?( which introduces a conditional subpattern.

+ PCRE_EXTRA
+

+This option was invented in order to turn on additional functionality +of PCRE that is incompatible with Perl, but it is currently of very little +use. When set, any backslash in a pattern that is followed by a letter that +has no special meaning causes an error, thus reserving these combinations +for future expansion. By default, as in Perl, a backslash followed by a +letter with no special meaning is treated as a literal. There are at present +no other features controlled by this option. It can also be set by a (?X) +option setting within a pattern.

+ PCRE_MULTILINE
+

+By default, PCRE treats the subject string as consisting of a single line +of characters (even if it actually contains newlines). The "start of line" +metacharacter (^) matches only at the start of the string, while the "end +of line" metacharacter ($) matches only at the end of the string, or before +a terminating newline (unless PCRE_DOLLAR_ENDONLY is set). This is the same +as Perl.

+When PCRE_MULTILINE is set, the "start of line" and "end of +line" constructs match immediately following or immediately before any +newline in the subject string, respectively, as well as at the very start +and end. This is equivalent to Perl’s /m option, and it can be changed within +a pattern by a (?m) option setting. If there are no "\n" characters in a +subject string, or no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE +has no effect.

+ PCRE_NO_AUTO_CAPTURE
+

+If this option is set, it disables the use of numbered capturing parentheses +in the pattern. Any opening parenthesis that is not followed by ? behaves +as if it were followed by ?: but named parentheses can still be used for +capturing (and they acquire numbers in the usual way). There is no equivalent +of this option in Perl.

+ PCRE_UNGREEDY
+

+This option inverts the "greediness" of the quantifiers so that they are +not greedy by default, but become greedy if followed by "?". It is not compatible +with Perl. It can also be set by a (?U) option setting within the pattern. +

+ PCRE_UTF8
+

+This option causes PCRE to regard both the pattern and the subject as +strings of UTF-8 characters instead of single-byte character strings. However, +it is available only when PCRE is built to include UTF-8 support. If not, +the use of this option provokes an error. Details of how this option changes +the behaviour of PCRE are given in the section on UTF-8 support in the +main pcre page.

+ PCRE_NO_UTF8_CHECK
+

+When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is +automatically checked. If an invalid UTF-8 sequence of bytes is found, pcre_compile() +returns an error. If you already know that your pattern is valid, and you +want to skip this check for performance reasons, you can set the PCRE_NO_UTF8_CHECK +option. When it is set, the effect of passing an invalid UTF-8 string as +a pattern is undefined. It may cause your program to crash. Note that this +option can also be passed to pcre_exec(), to suppress the UTF-8 validity +checking of subject strings. +

Studying a Pattern

+pcre_extra *pcre_study(const +pcre *code, int options, const char **errptr);

+If a compiled pattern is +going to be used several times, it is worth spending more time analyzing +it in order to speed up the time taken for matching. The function pcre_study() +takes a pointer to a compiled pattern as its first argument. If studying +the pattern produces additional information that will help speed up matching, +pcre_study() returns a pointer to a pcre_extra block, in which the study_data +field points to the results of the study.

+The returned value from pcre_study() +can be passed directly to pcre_exec(). However, a pcre_extra block also +contains other fields that can be set by the caller before the block is +passed; these are described below in the section on matching a pattern. +

+If studying the pattern does not produce any additional information, pcre_study() +returns NULL. In that circumstance, if the calling program wants to pass +any of the other fields to pcre_exec(), it must set up its own pcre_extra +block.

+The second argument of pcre_study() contains option bits. At present, +no options are defined, and this argument should always be zero.

+The third +argument for pcre_study() is a pointer for an error message. If studying +succeeds (even if no data is returned), the variable it points to is set +to NULL. Otherwise it points to a textual error message. You should therefore +test the error pointer for NULL after calling pcre_study(), to be sure +that it has run successfully.

+This is a typical call to pcre_study():

+ pcre_extra *pe;
+ pe = pcre_study(
+ re, /* result of pcre_compile() */
+ 0, /* no options exist */
+ &error); /* set to NULL or points to a message */
+

+At present, studying a pattern is useful only for non-anchored patterns +that do not have a single fixed starting character. A bitmap of possible +starting bytes is created. +

Locale Support

+PCRE handles caseless matching, +and determines whether characters are letters, digits, or whatever, by +reference to a set of tables, indexed by character value. (When running +in UTF-8 mode, this applies only to characters with codes less than 128. +Higher-valued codes never match escapes such as \w or \d, but can be tested +with \p if PCRE is built with Unicode character property support.)

+An internal +set of tables is created in the default C locale when PCRE is built. This +is used when the final argument of pcre_compile() is NULL, and is sufficient +for many applications. An alternative set of tables can, however, be supplied. +These may be created in a different locale from the default. As more and +more applications change to using Unicode, the need for this locale support +is expected to die away.

+External tables are built by calling the pcre_maketables() +function, which has no arguments, in the relevant locale. The result can +then be passed to pcre_compile() or pcre_exec() as often as necessary. For +example, to build and use tables that are appropriate for the French locale +(where accented characters with values greater than 128 are treated as +letters), the following code could be used:

+ setlocale(LC_CTYPE, "fr_FR");
+ tables = pcre_maketables();
+ re = pcre_compile(..., tables);
+

+When pcre_maketables() runs, the tables are built in memory that is obtained +via pcre_malloc. It is the caller’s responsibility to ensure that the memory +containing the tables remains available for as long as it is needed.

+The +pointer that is passed to pcre_compile() is saved with the compiled pattern, +and the same tables are used via this pointer by pcre_study() and normally +also by pcre_exec(). Thus, by default, for any single pattern, compilation, +studying and matching all happen in the same locale, but different patterns +can be compiled in different locales.

+It is possible to pass a table pointer +or NULL (indicating the use of the internal tables) to pcre_exec(). Although +not intended for this purpose, this facility could be used to match a pattern +in a different locale from the one in which it was compiled. Passing table +pointers at run time is discussed below in the section on matching a pattern. + +

Information About a Pattern

+int pcre_fullinfo(const pcre *code, const +pcre_extra *extra, int what, void *where);

+The pcre_fullinfo() function +returns information about a compiled pattern. It replaces the obsolete pcre_info() +function, which is nevertheless retained for backwards compatibility (and +is documented below).

+The first argument for pcre_fullinfo() is a pointer +to the compiled pattern. The second argument is the result of pcre_study(), +or NULL if the pattern was not studied. The third argument specifies which +piece of information is required, and the fourth argument is a pointer +to a variable to receive the data. The yield of the function is zero for +success, or one of the following negative numbers:

+ PCRE_ERROR_NULL the argument code was NULL
+ the argument where was NULL
+ PCRE_ERROR_BADMAGIC the "magic number" was not found
+ PCRE_ERROR_BADOPTION the value of what was invalid
+

+The "magic number" is placed at the start of each compiled pattern as +a simple check against passing an arbitrary memory pointer. Here is a typical +call of pcre_fullinfo(), to obtain the length of the compiled pattern: +

+ int rc;
+ size_t length;
+ rc = pcre_fullinfo(
+ re, /* result of pcre_compile() */
+ pe, /* result of pcre_study(), or NULL */
+ PCRE_INFO_SIZE, /* what is required */
+ &length); /* where to put the data */
+

+The possible values for the third argument are defined in pcre.h, and are +as follows:

+ PCRE_INFO_BACKREFMAX
+

+Return the number of the highest back reference in the pattern. The fourth +argument should point to an int variable. Zero is returned if there are +no back references.

+ PCRE_INFO_CAPTURECOUNT
+

+Return the number of capturing subpatterns in the pattern. The fourth argument +should point to an int variable.

+ PCRE_INFO_DEFAULTTABLES
+

+Return a pointer to the internal default character tables within PCRE. +The fourth argument should point to an unsigned char * variable. This information +call is provided for internal use by the pcre_study() function. External +callers can cause PCRE to use its internal tables by passing a NULL table +pointer.

+ PCRE_INFO_FIRSTBYTE
+

+Return information about the first byte of any matched string, for a non-anchored +pattern. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name +is still recognized for backwards compatibility.)

+If there is a fixed first +byte, for example, from a pattern such as (cat|cow|coyote), it is returned +in the integer pointed to by where. Otherwise, if either

+(a) the pattern +was compiled with the PCRE_MULTILINE option, and every branch starts with +"^", or

+(b) every branch of the pattern starts with ".*" and PCRE_DOTALL +is not set (if it were set, the pattern would be anchored),

+-1 is returned, +indicating that the pattern matches only at the start of a subject string +or after any newline within the string. Otherwise -2 is returned. For anchored +patterns, -2 is returned.

+ PCRE_INFO_FIRSTTABLE
+

+If the pattern was studied, and this resulted in the construction of a +256-bit table indicating a fixed set of bytes for the first byte in any +matching string, a pointer to the table is returned. Otherwise NULL is returned. +The fourth argument should point to an unsigned char * variable.

+ PCRE_INFO_LASTLITERAL
+

+Return the value of the rightmost literal byte that must exist in any +matched string, other than at its start, if such a byte has been recorded. +The fourth argument should point to an int variable. If there is no such +byte, -1 is returned. For anchored patterns, a last literal byte is recorded +only if it follows something of variable length. For example, for the pattern +/^a\d+z\d+/ the returned value is "z", but for /^a\dz\d/ the returned value is +-1.

+ PCRE_INFO_NAMECOUNT
+ PCRE_INFO_NAMEENTRYSIZE
+ PCRE_INFO_NAMETABLE
+

+PCRE supports the use of named as well as numbered capturing parentheses. +The names are just an additional way of identifying the parentheses, which +still acquire numbers. A convenience function called pcre_get_named_substring() +is provided for extracting an individual captured substring by name. It +is also possible to extract the data directly, by first converting the +name to a number in order to access the correct pointers in the output +vector (described with pcre_exec() below). To do the conversion, you need +to use the name-to-number map, which is described by these three values.

+The +map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives +the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size of each +entry; both of these return an int value. The entry size depends on the +length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the +first entry of the table (a pointer to char). The first two bytes of each +entry are the number of the capturing parenthesis, most significant byte +first. The rest of the entry is the corresponding name, zero terminated. +The names are in alphabetical order. For example, consider the following +pattern (assume PCRE_EXTENDED is set, so white space - including newlines +- is ignored):

+ (?P<date> (?P<year>(\d\d)?\d\d) -
+ (?P<month>\d\d) - (?P<day>\d\d) )
+

+There are four named subpatterns, so the table has four entries, and each +entry in the table is eight bytes long. The table is as follows, with non-printing +bytes shown in hexadecimal, and undefined bytes shown as ??:

+ 00 01 d a t e 00 ??
+ 00 05 d a y 00 ?? ??
+ 00 04 m o n t h 00
+ 00 02 y e a r 00 ??
+

+When writing code to extract data from named subpatterns using the name-to-number +map, remember that the length of each entry is likely to be different for +each compiled pattern.
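The layout described above can be walked with a few lines of C. The sketch below is independent of the library: it takes the three values that PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE would supply and returns the number of a named subpattern. The function name lookup_group_number is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find a name in the name-to-number map.
   table      - pointer to the first entry
   count      - number of entries
   entry_size - size of each entry in bytes
   Returns the capturing parenthesis number, or -1 if the name is absent. */
static int lookup_group_number(const char *table, int count,
                               int entry_size, const char *name)
{
    int i;

    for (i = 0; i < count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;

        /* Bytes 0 and 1 hold the number, most significant byte first;
           the zero-terminated name starts at byte 2. */
        if (strcmp((const char *)entry + 2, name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

A linear scan keeps the sketch short. Because the names are stored in alphabetical order, a binary search would also work for patterns with many named subpatterns.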

+ PCRE_INFO_OPTIONS
+

+Return a copy of the options with which the pattern was compiled. The fourth +argument should point to an unsigned long int variable. These option bits +are those specified in the call to pcre_compile(), modified by any top-level +option settings within the pattern itself.

+A pattern is automatically anchored +by PCRE if all of its top-level alternatives begin with one of the following: +

+ ^ unless PCRE_MULTILINE is set
+ \A always
+ \G always
+ .* if PCRE_DOTALL is set and there are no back
+ references to the subpattern in which .* appears
+

+For such patterns, the PCRE_ANCHORED bit is set in the options returned +by pcre_fullinfo().

+ PCRE_INFO_SIZE
+

+Return the size of the compiled pattern, that is, the value that was passed +as the argument to pcre_malloc() when PCRE was getting memory in which +to place the compiled data. The fourth argument should point to a size_t +variable.

+ PCRE_INFO_STUDYSIZE
+

+Return the size of the data block pointed to by the study_data field in +a pcre_extra block. That is, it is the value that was passed to pcre_malloc() +when PCRE was getting memory into which to place the data created by pcre_study(). +The fourth argument should point to a size_t variable. +

Obsolete Info Function

+ +

+int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

+The pcre_info() +function is now obsolete because its interface is too restrictive to return +all the available data about a compiled pattern. New programs should use +pcre_fullinfo() instead. The yield of pcre_info() is the number of capturing +subpatterns, or one of the following negative numbers:

+ PCRE_ERROR_NULL the argument code was NULL
+ PCRE_ERROR_BADMAGIC the "magic number" was not found
+

+If the optptr argument is not NULL, a copy of the options with which the +pattern was compiled is placed in the integer it points to (see PCRE_INFO_OPTIONS +above).

+If the pattern is not anchored and the firstcharptr argument is +not NULL, it is used to pass back information about the first character +of any matched string (see PCRE_INFO_FIRSTBYTE above). +

Matching a Pattern

+ +

+int pcre_exec(const pcre *code, const pcre_extra *extra, const char +*subject, int length, int startoffset, int options, int *ovector, int +ovecsize);

+The function pcre_exec() is called to match a subject string +against a compiled pattern, which is passed in the code argument. If the +pattern has been studied, the result of the study should be passed in the +extra argument.

+In most applications, the pattern will have been compiled +(and optionally studied) in the same process that calls pcre_exec(). However, +it is possible to save compiled patterns and study data, and then use them +later in different processes, possibly even on different hosts. For a discussion +about this, see the pcreprecompile documentation.

+Here is an example of +a simple call to pcre_exec():

+ int rc;
+ int ovector[30];
+ rc = pcre_exec(
+ re, /* result of pcre_compile() */
+ NULL, /* we didn’t study the pattern */
+ "some string", /* the subject string */
+ 11, /* the length of the subject string */
+ 0, /* start at offset 0 in the subject */
+ 0, /* default options */
+ ovector, /* vector of integers for substring information */
+ 30); /* number of elements in the vector (NOT size in bytes) */
+ +

Extra data for pcre_exec()

+If the extra argument is not NULL, it must +point to a pcre_extra data block. The pcre_study() function returns such +a block (when it doesn’t return NULL), but you can also create one for yourself, +and pass additional information in it. The fields in a pcre_extra block +are as follows:

+ unsigned long int flags;
+ void *study_data;
+ unsigned long int match_limit;
+ void *callout_data;
+ const unsigned char *tables;
+

+The flags field is a bitmap that specifies which of the other fields are +set. The flag bits are:

+ PCRE_EXTRA_STUDY_DATA
+ PCRE_EXTRA_MATCH_LIMIT
+ PCRE_EXTRA_CALLOUT_DATA
+ PCRE_EXTRA_TABLES
+

+Other flag bits should be set to zero. The study_data field is set in the +pcre_extra block that is returned by pcre_study(), together with the appropriate +flag bit. You should not set this yourself, but you may add to the block +by setting the other fields and their corresponding flag bits.

+The match_limit +field provides a means of preventing PCRE from using up a vast amount of +resources when running patterns that are not going to match, but which +have a very large number of possibilities in their search trees. The classic +example is the use of nested unlimited repeats.

+Internally, PCRE uses a +function called match() which it calls repeatedly (sometimes recursively). +The limit is imposed on the number of times this function is called during +a match, which has the effect of limiting the amount of recursion and backtracking +that can take place. For patterns that are not anchored, the count starts +from zero for each position in the subject string.

+The default limit for the library can be set when PCRE is built; the built-in default is 10 million, which handles all but the most extreme cases. You can lower the limit by supplying pcre_exec() with a pcre_extra block in which match_limit is set to a smaller value, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
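The effect of a call budget can be sketched with a toy matcher. This is not PCRE's internal match() function: toy_exec(), toy_match(), toy_star() and the TOY_* codes are hypothetical names invented for this illustration, the matcher is anchored, and it understands only literal characters and x*. It shows only the general idea of counting calls and giving up once a limit is passed.

```c
#include <stddef.h>

/* Toy result codes. TOY_LIMIT reuses the value -8 that this page lists for
   PCRE_ERROR_MATCHLIMIT, purely for familiarity. */
enum { TOY_NOMATCH = 0, TOY_MATCH = 1, TOY_LIMIT = -8 };

typedef struct {
    unsigned long calls;   /* calls to toy_match() so far */
    unsigned long limit;   /* budget, analogous to match_limit */
} toy_budget;

static int toy_match(const char *pat, const char *text, toy_budget *b);

/* Match zero or more of c, then the rest of the pattern. The repeat is
   greedy and gives back one character at a time on failure (backtracking). */
static int toy_star(char c, const char *pat, const char *text, toy_budget *b)
{
    const char *t = text;

    while (*t != '\0' && *t == c)
        t++;
    for (;;) {
        int rc = toy_match(pat, t, b);
        if (rc != TOY_NOMATCH)
            return rc;              /* a match, or the budget ran out */
        if (t == text)
            return TOY_NOMATCH;
        t--;
    }
}

/* Anchored match of pat at text; every call is counted against the budget. */
static int toy_match(const char *pat, const char *text, toy_budget *b)
{
    if (++b->calls > b->limit)
        return TOY_LIMIT;
    if (pat[0] == '\0')
        return TOY_MATCH;
    if (pat[1] == '*')
        return toy_star(pat[0], pat + 2, text, b);
    if (*text != '\0' && *text == pat[0])
        return toy_match(pat + 1, text + 1, b);
    return TOY_NOMATCH;
}

/* Entry point: returns TOY_MATCH, TOY_NOMATCH or TOY_LIMIT. */
int toy_exec(const char *pat, const char *text, unsigned long limit)
{
    toy_budget b = { 0, limit };
    return toy_match(pat, text, &b);
}
```

With several adjacent repeats and a subject that cannot match, such as a*a*a*a*b against a run of "a" characters, the number of calls grows quickly with the subject length, so a small budget stops the search early while a generous one lets it run to an ordinary failure.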

+The callout_data field is used in conjunction with the "callout" feature, which is described in the pcrecallout documentation.

+The tables field +is used to pass a character tables pointer to pcre_exec(); this overrides +the value that is stored with the compiled pattern. A non-NULL value is stored +with the compiled pattern only if custom tables were supplied to pcre_compile() +via its tableptr argument. If NULL is passed to pcre_exec() using this mechanism, +it forces PCRE’s internal tables to be used. This facility is helpful when +re-using patterns that have been saved after compiling with an external +set of tables, because the external tables might be at a different address +when pcre_exec() is called. See the pcreprecompile documentation for a +discussion of saving compiled patterns for later use. +

Option bits for pcre_exec()

+ +

+The unused bits of the options argument for pcre_exec() must be zero. The +only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, PCRE_NOTEOL, +PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.

+ PCRE_ANCHORED
+

+The PCRE_ANCHORED option limits pcre_exec() to matching at the first matching position. If a pattern was compiled with PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it cannot be made unanchored at matching time.

+ PCRE_NOTBOL
+

+This option specifies that the first character of the subject string is not the beginning of a line, so the circumflex metacharacter should not match before it. Setting this without PCRE_MULTILINE (at compile time) causes circumflex never to match. This option affects only the behaviour of the circumflex metacharacter. It does not affect \A.

+ PCRE_NOTEOL
+

+This option specifies that the end of the subject string is not the end +of a line, so the dollar metacharacter should not match it nor (except +in multiline mode) a newline immediately before it. Setting this without +PCRE_MULTILINE (at compile time) causes dollar never to match. This option +affects only the behaviour of the dollar metacharacter. It does not affect +\Z or \z.

+ PCRE_NOTEMPTY
+

+An empty string is not considered to be a valid match if this option is +set. If there are alternatives in the pattern, they are tried. If all the +alternatives match the empty string, the entire match fails. For example, +if the pattern

+ a?b?
+

+is applied to a string not beginning with "a" or "b", it matches the empty +string at the start of the subject. With PCRE_NOTEMPTY set, this match is +not valid, so PCRE searches further into the string for occurrences of +"a" or "b".

+Perl has no direct equivalent of PCRE_NOTEMPTY, but it does +make a special case of a pattern match of the empty string within its split() +function, and when using the /g modifier. It is possible to emulate Perl’s +behaviour after matching a null string by first trying the match again +at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then if that +fails by advancing the starting offset (see below) and trying an ordinary +match again. There is some code that demonstrates how to do this in the +pcredemo.c sample program.

+ PCRE_NO_UTF8_CHECK
+

+When PCRE_UTF8 is set at compile time, the validity of the subject as +a UTF-8 string is automatically checked when pcre_exec() is subsequently +called. The value of startoffset is also checked to ensure that it points +to the start of a UTF-8 character. If an invalid UTF-8 sequence of bytes is +found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset +contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.

+If you +already know that your subject is valid, and you want to skip these checks +for performance reasons, you can set the PCRE_NO_UTF8_CHECK option when +calling pcre_exec(). You might want to do this for the second and subsequent +calls to pcre_exec() if you are making repeated calls to find all the matches +in a single subject string. However, you should be sure that the value of +startoffset points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK +is set, the effect of passing an invalid UTF-8 string as a subject, or a +value of startoffset that does not point to the start of a UTF-8 character, +is undefined. Your program may crash.
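Before setting PCRE_NO_UTF8_CHECK, a caller may want a cheap sanity check on startoffset. The helper below is a minimal sketch under that assumption; utf8_offset_is_char_start() is a hypothetical name, not part of the PCRE API. It tests only whether the byte at the offset is a UTF-8 continuation byte (bit pattern 10xxxxxx) and does not validate the subject as a whole.

```c
#include <stddef.h>

/* Returns 1 if offset can be the start of a character in a UTF-8 subject,
   0 otherwise. Only the byte at the offset is inspected. */
int utf8_offset_is_char_start(const char *subject, int length, int offset)
{
    if (subject == NULL || offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;   /* the end of the subject is a valid position */
    /* Continuation bytes have the top two bits set to 10. */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

A check like this catches an offset that lands in the middle of a multi-byte character, which is one of the two conditions the text warns about; an invalid subject still needs the full check that pcre_exec() performs by default.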

+ PCRE_PARTIAL
+

+This option turns on the partial matching feature. If the subject string +fails to match the pattern, but at some point during the matching process +the end of the subject was reached (that is, the subject partially matches +the pattern and the failure to match occurred only because there were not +enough subject characters), pcre_exec() returns PCRE_ERROR_PARTIAL instead +of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is used, there are restrictions +on what may appear in the pattern. These are discussed in the pcrepartial + documentation. +

The string to be matched by pcre_exec()

+The subject string +is passed to pcre_exec() as a pointer in subject, a length in length, and +a starting byte offset in startoffset. In UTF-8 mode, the byte offset must +point to the start of a UTF-8 character. Unlike the pattern string, the subject +may contain binary zero bytes. When the starting offset is zero, the search +for a match starts at the beginning of the subject, and this is by far +the most common case.

+A non-zero starting offset is useful when searching +for another match in the same subject by calling pcre_exec() again after +a previous success. Setting startoffset differs from just passing over a +shortened string and setting PCRE_NOTBOL in the case of a pattern that +begins with any kind of lookbehind. For example, consider the pattern

+ \Biss\B
+

+which finds occurrences of "iss" in the middle of words. (\B matches only if the current position in the subject is not a word boundary.) When applied to the string "Mississippi" the first call to pcre_exec() finds the first occurrence. If pcre_exec() is called again with just the remainder of the subject, namely "issippi", it does not match, because \B is always false at the start of the subject, which is deemed to be a word boundary. However, if pcre_exec() is passed the entire string again, but with startoffset set to 4, it finds the second occurrence of "iss" because it is able to look behind the starting point to discover that it is preceded by a letter.
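The difference between the two calls comes down to what lies before the starting position. The sketch below is an illustration only, not PCRE code: at_word_boundary() and is_word_char() are hypothetical helpers, and "word character" is assumed here to mean an ASCII letter, digit or underscore.

```c
#include <ctype.h>
#include <stddef.h>

/* Assumption for this sketch: word characters are [A-Za-z0-9_]. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position pos (0..length) is a word boundary, that is, if
   exactly one of the characters on either side is a word character. The
   positions before the start and after the end count as non-word. */
int at_word_boundary(const char *subject, int length, int pos)
{
    int before, after;

    if (subject == NULL || pos < 0 || pos > length)
        return 0;
    before = (pos > 0) && is_word_char((unsigned char)subject[pos - 1]);
    after = (pos < length) && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

At position 4 of the whole subject the preceding character is a letter, so the position is not a boundary and \B can succeed. At position 0 of the shortened string nothing precedes the first letter, so the position is a boundary and \B fails.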

+If a non-zero starting offset is passed when the pattern is anchored, one +attempt to match at the given offset is made. This can only succeed if the +pattern does not require the match to be at the start of the subject. +

How pcre_exec() returns captured substrings

+In general, a pattern matches a +certain portion of the subject, and in addition, further substrings from +the subject may be picked out by parts of the pattern. Following the usage +in Jeffrey Friedl’s book, this is called "capturing" in what follows, and +the phrase "capturing subpattern" is used for a fragment of a pattern that +picks out a substring. PCRE supports several other kinds of parenthesized +subpattern that do not cause substrings to be captured.

+Captured substrings +are returned to the caller via a vector of integer offsets whose address +is passed in ovector. The number of elements in the vector is passed in +ovecsize, which must be a non-negative number. Note: this argument is NOT +the size of ovector in bytes.

+The first two-thirds of the vector is used +to pass back captured substrings, each substring using a pair of integers. +The remaining third of the vector is used as workspace by pcre_exec() while +matching capturing subpatterns, and is not available for passing back information. +The length passed in ovecsize should always be a multiple of three. If it +is not, it is rounded down.

+When a match is successful, information about +captured substrings is returned in pairs of integers, starting at the beginning +of ovector, and continuing up to two-thirds of its length at the most. The +first element of a pair is set to the offset of the first character in +a substring, and the second is set to the offset of the first character +after the end of a substring. The first pair, ovector[0] and ovector[1], +identify the portion of the subject string matched by the entire pattern. +The next pair is used for the first capturing subpattern, and so on. The +value returned by pcre_exec() is the number of pairs that have been set. +If there are no capturing subpatterns, the return value from a successful +match is 1, indicating that just the first pair of offsets has been set. +
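The pair layout described above can be read directly, without copying any text. The helper below is a minimal sketch; capture_span() is a hypothetical name, and rc is assumed to be the positive value returned by a successful pcre_exec() call.

```c
#include <stddef.h>

/* Looks up pair n in ovector. On success stores a pointer to the first
   character of the substring in *start and returns its length in bytes.
   Returns -1 if pair n was not set by the match. The result points into
   subject and is not zero-terminated. */
int capture_span(const char *subject, const int *ovector, int rc, int n,
                 const char **start)
{
    if (subject == NULL || ovector == NULL || start == NULL)
        return -1;
    if (n < 0 || n >= rc)
        return -1;                  /* pair n is beyond what was set */
    if (ovector[2 * n] < 0)
        return -1;                  /* negative offset: nothing to point at */
    *start = subject + ovector[2 * n];
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

Pair 0 gives the whole match, pair 1 the first capturing subpattern, and so on. Because the second offset of a pair is the position just after the substring, the length is simply the difference of the two offsets.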

+Some convenience functions are provided for extracting the captured substrings +as separate strings. These are described in the following section.

+It is possible for capturing subpattern number n+1 to match some part of the subject when subpattern n has not been used at all. For example, if the string "abc" is matched against the pattern (a|(z))(bc), subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset values corresponding to the unused subpattern are set to -1.
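A caller can test for an unused subpattern by looking at the offsets themselves. The following is a small sketch; capture_is_set() is a hypothetical helper, and rc is assumed to be the positive return value of pcre_exec().

```c
#include <stddef.h>

/* Returns 1 if subpattern n took part in the match, 0 if it is unset
   (offsets of -1) or lies beyond the pairs reported by the match. */
int capture_is_set(const int *ovector, int rc, int n)
{
    if (ovector == NULL || n < 0 || n >= rc)
        return 0;
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

For the example in the text, the first eight elements of ovector would be expected to hold 0, 3, 0, 1, -1, -1, 1, 3 with a return value of 4, so the helper reports subpatterns 1 and 3 as set and subpattern 2 as unset.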

+If a capturing subpattern is matched +repeatedly, it is the last portion of the string that it matched that is +returned.

+If the vector is too small to hold all the captured substring +offsets, it is used as far as possible (up to two-thirds of its length), +and the function returns a value of zero. In particular, if the substring +offsets are not of interest, pcre_exec() may be called with ovector passed +as NULL and ovecsize as zero. However, if the pattern contains back references +and the ovector is not big enough to remember the related substrings, PCRE +has to get additional memory for use during matching. Thus it is usually +advisable to supply an ovector.

+Note that pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT (or the obsolete pcre_info() function) can be used to find out how many capturing subpatterns there are in a compiled pattern. The smallest size for ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3.
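The sizing rule and the handling of a zero return value can be written down as two small helpers. Both names, ovector_elements_for() and captured_pair_count(), are hypothetical and exist only for this sketch.

```c
/* Smallest ovecsize that holds the whole match plus n captured
   substrings, following the (n+1)*3 rule. */
int ovector_elements_for(int n)
{
    return (n < 0) ? 0 : (n + 1) * 3;
}

/* Number of offset pairs that hold usable data after pcre_exec():
   the return value itself when it is positive, one third of the vector
   (rounded down) when it is zero, and none when the call failed. */
int captured_pair_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 3;
    return 0;
}
```

A pattern with nine capturing subpatterns therefore needs a vector of at least 30 integers, which is the size used in the pcre_exec() example earlier on this page.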

Return values from pcre_exec()

+If pcre_exec() fails, it returns +a negative number. The following are defined in the header file:

+ PCRE_ERROR_NOMATCH (-1)
+

+The subject string did not match the pattern.

+ PCRE_ERROR_NULL (-2)
+

+Either code or subject was passed as NULL, or ovector was NULL and ovecsize +was not zero.

+ PCRE_ERROR_BADOPTION (-3)
+

+An unrecognized bit was set in the options argument.

+ PCRE_ERROR_BADMAGIC (-4)
+

+PCRE stores a 4-byte "magic number" at the start of the compiled code, +to catch the case when it is passed a junk pointer and to detect when a +pattern that was compiled in an environment of one endianness is run in +an environment with the other endianness. This is the error that PCRE gives +when the magic number is not present.

+ PCRE_ERROR_UNKNOWN_NODE (-5)
+

+While running the pattern match, an unknown item was encountered in the +compiled pattern. This error could be caused by a bug in PCRE or by overwriting +of the compiled pattern.

+ PCRE_ERROR_NOMEMORY (-6)
+

+If a pattern contains back references, but the ovector that is passed +to pcre_exec() is not big enough to remember the referenced substrings, +PCRE gets a block of memory at the start of matching to use for this purpose. +If the call via pcre_malloc() fails, this error is given. The memory is +automatically freed at the end of matching.

+ PCRE_ERROR_NOSUBSTRING (-7)
+

+This error is used by the pcre_copy_substring(), pcre_get_substring(), +and pcre_get_substring_list() functions (see below). It is never returned +by pcre_exec().

+ PCRE_ERROR_MATCHLIMIT (-8)
+

+The recursion and backtracking limit, as specified by the match_limit field in a pcre_extra structure (or defaulted), was reached. See the description above.

+ PCRE_ERROR_CALLOUT (-9)
+

+This error is never generated by pcre_exec() itself. It is provided for +use by callout functions that want to yield a distinctive error code. See +the pcrecallout documentation for details.

+ PCRE_ERROR_BADUTF8 (-10)
+

+A string that contains an invalid UTF-8 byte sequence was passed as a subject. +

+ PCRE_ERROR_BADUTF8_OFFSET (-11)
+

+The UTF-8 byte sequence that was passed as a subject was valid, but the +value of startoffset did not point to the beginning of a UTF-8 character. +

+ PCRE_ERROR_PARTIAL (-12)
+

+The subject string did not match, but it did match partially. See the +pcrepartial documentation for details of partial matching.

+ PCRE_ERROR_BAD_PARTIAL (-13)
+

+The PCRE_PARTIAL option was used with a compiled pattern containing items +that are not supported for partial matching. See the pcrepartial documentation +for details of partial matching.

+ PCRE_ERROR_INTERNAL (-14)
+

+An unexpected internal error has occurred. This error could be caused by +a bug in PCRE or by overwriting of the compiled pattern.

+ PCRE_ERROR_BADCOUNT (-15)
+

+This error is given if the value of the ovecsize argument is negative. + +
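For diagnostics it can be handy to turn a negative return value into its symbolic name. The function below is a sketch that uses only the numeric values listed in this section; exec_error_name() is a hypothetical helper and not part of the PCRE API. Code that includes pcre.h should prefer the macros themselves over the literal numbers.

```c
/* Maps a pcre_exec() return value to the name given on this page.
   Non-negative values are successful matches, not errors. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_NODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BAD_PARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    default:  return (rc >= 0) ? "match" : "unknown error";
    }
}
```

Callers typically treat PCRE_ERROR_NOMATCH (and PCRE_ERROR_PARTIAL, when partial matching was requested) as ordinary outcomes and report the remaining codes as failures.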

Extracting Captured Substrings by Number

+int pcre_copy_substring(const char *subject, int *ovector, int stringcount, int stringnumber, char *buffer, int buffersize);

+
+int pcre_get_substring(const char *subject, int *ovector, int stringcount, int stringnumber, const char **stringptr);

+
+int pcre_get_substring_list(const char *subject, int *ovector, int stringcount, const char ***listptr);

+Captured substrings can be accessed directly +by using the offsets returned by pcre_exec() in ovector. For convenience, +the functions pcre_copy_substring(), pcre_get_substring(), and pcre_get_substring_list() +are provided for extracting captured substrings as new, separate, zero-terminated +strings. These functions identify substrings by number. The next section +describes functions for extracting named substrings. A substring that contains +a binary zero is correctly extracted and has a further zero added on the +end, but the result is not, of course, a C string.

+The first three arguments +are the same for all three of these functions: subject is the subject string +that has just been successfully matched, ovector is a pointer to the vector +of integer offsets that was passed to pcre_exec(), and stringcount is the +number of substrings that were captured by the match, including the substring +that matched the entire regular expression. This is the value returned by +pcre_exec() if it is greater than zero. If pcre_exec() returned zero, indicating +that it ran out of space in ovector, the value passed as stringcount should +be the number of elements in the vector divided by three.

+The functions +pcre_copy_substring() and pcre_get_substring() extract a single substring, +whose number is given as stringnumber. A value of zero extracts the substring +that matched the entire pattern, whereas higher values extract the captured +substrings. For pcre_copy_substring(), the string is placed in buffer, whose +length is given by buffersize, while for pcre_get_substring() a new block +of memory is obtained via pcre_malloc, and its address is returned via +stringptr. The yield of the function is the length of the string, not including +the terminating zero, or one of

+ PCRE_ERROR_NOMEMORY (-6)
+

+The buffer was too small for pcre_copy_substring(), or the attempt to +get memory failed for pcre_get_substring().

+ PCRE_ERROR_NOSUBSTRING (-7)
+

+There is no substring whose number is stringnumber.
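The behaviour described for copying a substring by number can be mimicked by hand from the offsets. The function below is an illustrative sketch, not the library's implementation; copy_capture() is a hypothetical name, and the literal return values -6 and -7 simply mirror the codes listed above.

```c
#include <string.h>

/* Copies captured substring stringnumber into buffer as a zero-terminated
   string. Returns the length without the terminating zero, -7 if there is
   no such substring, or -6 if the buffer is too small. An unset substring
   is copied as an empty string. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int stringnumber, char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return -7;                  /* PCRE_ERROR_NOSUBSTRING on this page */
    start = ovector[2 * stringnumber];
    length = (start < 0) ? 0 : ovector[2 * stringnumber + 1] - start;
    if (buffersize < length + 1)
        return -6;                  /* PCRE_ERROR_NOMEMORY on this page */
    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

The buffer has to be one byte larger than the substring to leave room for the terminating zero, which is why a three-byte substring does not fit in a three-byte buffer.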

+The pcre_get_substring_list() +function extracts all available substrings and builds a list of pointers +to them. All this is done in a single block of memory that is obtained via +pcre_malloc. The address of the memory block is returned via listptr, which +is also the start of the list of string pointers. The end of the list is +marked by a NULL pointer. The yield of the function is zero if all went +well, or

+ PCRE_ERROR_NOMEMORY (-6)
+

+if the attempt to get the memory block failed.
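The single-block layout can be sketched as follows: the pointer list sits at the start of the block and the string data follows it. This is an illustration under stated assumptions rather than the library's code; build_capture_list() and capture_length() are hypothetical names, and plain malloc() stands in for pcre_malloc.

```c
#include <stdlib.h>
#include <string.h>

/* Length of substring i, treating an unset substring as empty. */
static int capture_length(const int *ovector, int i)
{
    return (ovector[2 * i] < 0) ? 0 : ovector[2 * i + 1] - ovector[2 * i];
}

/* Builds a NULL-terminated list of zero-terminated copies of all captured
   substrings in one malloc() block. Returns NULL on failure. */
const char **build_capture_list(const char *subject, const int *ovector,
                                int stringcount)
{
    size_t size;
    char **list;
    char *p;
    int i;

    if (subject == NULL || ovector == NULL || stringcount < 0)
        return NULL;

    /* Room for the pointers, the NULL terminator, and every string. */
    size = ((size_t)stringcount + 1) * sizeof(char *);
    for (i = 0; i < stringcount; i++)
        size += (size_t)capture_length(ovector, i) + 1;

    list = (char **)malloc(size);
    if (list == NULL)
        return NULL;

    p = (char *)(list + stringcount + 1);   /* string data follows the list */
    for (i = 0; i < stringcount; i++) {
        int length = capture_length(ovector, i);
        if (length > 0)
            memcpy(p, subject + ovector[2 * i], (size_t)length);
        p[length] = '\0';
        list[i] = p;
        p += length + 1;
    }
    list[stringcount] = NULL;
    return (const char **)list;
}
```

Because everything lives in one allocation, a single free() of the returned pointer releases both the list and the strings, which parallels the way one call to pcre_free_substring_list() releases the result of pcre_get_substring_list().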

+When any of these functions encounters a substring that is unset, which can happen when capturing subpattern number n+1 matches some part of the subject, but subpattern n has not been used at all, it returns an empty string. This can be distinguished from a genuine zero-length substring by inspecting the appropriate offset in ovector, which is negative for unset substrings.

+The two convenience functions +pcre_free_substring() and pcre_free_substring_list() can be used to free +the memory returned by a previous call of pcre_get_substring() or pcre_get_substring_list(), +respectively. They do nothing more than call the function pointed to by +pcre_free, which of course could be called directly from a C program. However, +PCRE is used in some situations where it is linked via a special interface +to another programming language which cannot use pcre_free directly; it +is for these cases that the functions are provided. +

Extracting Captured Substrings by Name

+int pcre_get_stringnumber(const pcre *code, const char *name);

+
+int pcre_copy_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, char *buffer, int buffersize);

+
+int pcre_get_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, const char **stringptr);

+To extract a substring by name, you first have to find the associated number. For example, for this pattern

+ (a+)b(?P<xxx>\d+)...
+

+the number of the subpattern called "xxx" is 2. You can find the number +from the name by calling pcre_get_stringnumber(). The first argument is +the compiled pattern, and the second is the name. The yield of the function +is the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no +subpattern of that name.

+Given the number, you can extract the substring +directly, or use one of the functions described in the previous section. +For convenience, there are also two functions that do the whole job.

+Most +of the arguments of pcre_copy_named_substring() and pcre_get_named_substring() +are the same as those for the similarly named functions that extract by +number. As these are described in the previous section, they are not re-described +here. There are just two differences:

+First, instead of a substring number, +a substring name is given. Second, there is an extra argument, given at +the start, which is a pointer to the compiled pattern. This is needed in +order to gain access to the name-to-number translation table.

+These functions +call pcre_get_stringnumber(), and if it succeeds, they then call pcre_copy_substring() +or pcre_get_substring(), as appropriate.

+ Last updated: 09 September 2004 +
+Copyright (c) 1997-2004 University of Cambridge.

+ +

+Table of Contents

Name
Pcre Native API
Pcre API Overview
Multithreading
Saving Precompiled Patterns for Later Use
Checking Build-time Options
Compiling a Pattern
Studying a Pattern
Locale Support
Information About a Pattern
Obsolete Info Function
Matching a Pattern

Extra data for pcre_exec()
Option bits for pcre_exec()
The string to be matched by pcre_exec()
How pcre_exec() returns captured substrings
Return values from pcre_exec()

Extracting Captured Substrings by Number
Extracting Captured Substrings by Name

+ + -- cgit v1.2.3