From dc60ced6cf17e76579e28929cea0c1891fdc4acb Mon Sep 17 00:00:00 2001 From: Aidan Lister Date: Tue, 7 Dec 2004 03:29:16 +0000 Subject: [PATCH] whitespace fixes git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@174209 c90b9560-bf6c-de11-be94-00142212c4b1 --- reference/pcre/pattern.syntax.xml | 1120 ++++++++++++----------------- 1 file changed, 474 insertions(+), 646 deletions(-) diff --git a/reference/pcre/pattern.syntax.xml b/reference/pcre/pattern.syntax.xml index d638be7d5c..ebf9677682 100644 --- a/reference/pcre/pattern.syntax.xml +++ b/reference/pcre/pattern.syntax.xml @@ -1,5 +1,5 @@ - + @@ -38,109 +38,105 @@ - PCRE does not allow repeat quantifiers on lookahead - assertions. Perl permits them, but they do not mean what you - might think. For example, (?!a){3} does not assert that the - next three characters are not "a". It just asserts that the - next character is not "a" three times. + PCRE does not allow repeat quantifiers on lookahead + assertions. Perl permits them, but they do not mean what you + might think. For example, (?!a){3} does not assert that the + next three characters are not "a". It just asserts that the + next character is not "a" three times. - Capturing subpatterns that occur inside negative - lookahead assertions are counted, but their entries in the - offsets vector are never set. Perl sets its numerical - variables from any such patterns that are matched before the - assertion fails to match something (thereby succeeding), but - only if the negative lookahead assertion contains just one - branch. + Capturing subpatterns that occur inside negative + lookahead assertions are counted, but their entries in the + offsets vector are never set. Perl sets its numerical + variables from any such patterns that are matched before the + assertion fails to match something (thereby succeeding), but + only if the negative lookahead assertion contains just one + branch. 
- Though binary zero characters are supported in the subject string, - they are not allowed in a pattern string because it is passed as a - normal C string, terminated by zero. The escape sequence "\\x00" can - be used in the pattern to represent a binary zero. + Though binary zero characters are supported in the subject string, + they are not allowed in a pattern string because it is passed as a + normal C string, terminated by zero. The escape sequence "\\x00" can + be used in the pattern to represent a binary zero. - The following Perl escape sequences are not supported: - \l, \u, \L, \U, \E, \Q. In fact these are implemented by - Perl's general string-handling and are not part of its - pattern matching engine. + The following Perl escape sequences are not supported: + \l, \u, \L, \U, \E, \Q. In fact these are implemented by + Perl's general string-handling and are not part of its + pattern matching engine. - The Perl \G assertion is not supported as it is not - relevant to single pattern matches. + The Perl \G assertion is not supported as it is not + relevant to single pattern matches. - Fairly obviously, PCRE does not support the (?{code}) - construction. + Fairly obviously, PCRE does not support the (?{code}) + construction. - There are at the time of writing some oddities in Perl - 5.005_02 concerned with the settings of captured strings - when part of a pattern is repeated. For example, matching - "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value - "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 - unset. However, if the pattern is changed to - /^(aa(b(b))?)+$/ then $2 (and $3) get set. - In Perl 5.004 $2 is set in both cases, and that is also &true; - of PCRE. If in the future Perl changes to a consistent state - that is different, PCRE may change to follow. + There are at the time of writing some oddities in Perl + 5.005_02 concerned with the settings of captured strings + when part of a pattern is repeated. 
For example, matching + "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value + "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 + unset. However, if the pattern is changed to + /^(aa(b(b))?)+$/ then $2 (and $3) get set. + In Perl 5.004 $2 is set in both cases, and that is also &true; + of PCRE. If in the future Perl changes to a consistent state + that is different, PCRE may change to follow. - Another as yet unresolved discrepancy is that in Perl - 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string - "a", whereas in PCRE it does not. However, in both Perl and - PCRE /^(a)?a/ matched against "a" leaves $1 unset. + Another as yet unresolved discrepancy is that in Perl + 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string + "a", whereas in PCRE it does not. However, in both Perl and + PCRE /^(a)?a/ matched against "a" leaves $1 unset. - PCRE provides some extensions to the Perl regular - expression facilities: + PCRE provides some extensions to the Perl regular + expression facilities: - Although lookbehind assertions must match fixed length - strings, each alternative branch of a lookbehind assertion - can match a different length of string. Perl 5.005 requires - them all to have the same length. + Although lookbehind assertions must match fixed length + strings, each alternative branch of a lookbehind assertion + can match a different length of string. Perl 5.005 requires + them all to have the same length. - If PCRE_DOLLAR_ENDONLY - is set and PCRE_MULTILINE is + If PCRE_DOLLAR_ENDONLY + is set and PCRE_MULTILINE is not set, the $ meta-character matches only at the very end of the string. - If PCRE_EXTRA is + If PCRE_EXTRA is set, a backslash followed by a letter with no special meaning is faulted. - If PCRE_UNGREEDY is + If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are. 
@@ -155,307 +151,202 @@ Regular Expression Details - - Introduction - - The syntax and semantics of the regular expressions - supported by PCRE are described below. Regular expressions are - also described in the Perl documentation and in a number of - other books, some of which have copious examples. Jeffrey - Friedl's "Mastering Regular Expressions", published by - O'Reilly (ISBN 1-56592-257-3), covers them in great detail. - The description here is intended as reference documentation. - - - A regular expression is a pattern that is matched against a - subject string from left to right. Most characters stand for - themselves in a pattern, and match the corresponding - characters in the subject. As a trivial example, the pattern - The quick brown fox - matches a portion of a subject string that is identical to - itself. - + + Introduction + + The syntax and semantics of the regular expressions + supported by PCRE are described below. Regular expressions are + also described in the Perl documentation and in a number of + other books, some of which have copious examples. Jeffrey + Friedl's "Mastering Regular Expressions", published by + O'Reilly (ISBN 1-56592-257-3), covers them in great detail. + The description here is intended as reference documentation. + + + A regular expression is a pattern that is matched against a + subject string from left to right. Most characters stand for + themselves in a pattern, and match the corresponding + characters in the subject. As a trivial example, the pattern + The quick brown fox + matches a portion of a subject string that is identical to + itself. + Meta-characters - The power of regular expressions comes from the - ability to include alternatives and repetitions in the - pattern. These are encoded in the pattern by the use of - meta-characters, which do not stand for themselves but instead - are interpreted in some special way. 
- - - There are two different sets of meta-characters: those that - are recognized anywhere in the pattern except within square - brackets, and those that are recognized in square brackets. - Outside square brackets, the meta-characters are as follows: + The power of regular expressions comes from the + ability to include alternatives and repetitions in the + pattern. These are encoded in the pattern by the use of + meta-characters, which do not stand for themselves but instead + are interpreted in some special way. + + + There are two different sets of meta-characters: those that + are recognized anywhere in the pattern except within square + brackets, and those that are recognized in square brackets. + Outside square brackets, the meta-characters are as follows: \ - - - general escape character with several uses - - + general escape character with several uses ^ - - - assert start of subject (or line, in multiline mode) - - + assert start of subject (or line, in multiline mode) $ - - - assert end of subject (or line, in multiline mode) - - + assert end of subject (or line, in multiline mode) . - - - match any character except newline (by default) - - + match any character except newline (by default) [ - - - start character class definition - - + start character class definition ] - - - end character class definition - - + end character class definition | - - - start of alternative branch - - + start of alternative branch ( - - - start subpattern - - + start subpattern ) - - - end subpattern - - + end subpattern ? 
- - - extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer - - + extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer * - - - 0 or more quantifier - - + 0 or more quantifier + - - - 1 or more quantifier - - + 1 or more quantifier { - - - start min/max quantifier - - + start min/max quantifier } - - - end min/max quantifier - - + end min/max quantifier - Part of a pattern that is in square brackets is called a - "character class". In a character class the only - meta-characters are: + Part of a pattern that is in square brackets is called a + "character class". In a character class the only + meta-characters are: + \ - - - general escape character - - + general escape character ^ - - - negate the class, but only if the first character - - + negate the class, but only if the first character - - - - indicates character range - - + indicates character range ] - - - terminates the character class - - + terminates the character class - The following sections describe the use of each of the - meta-characters. - + + The following sections describe the use of each of the + meta-characters. + - - backslash + + + backslash + + The backslash character has several uses. Firstly, if it is + followed by a non-alphanumeric character, it takes away any + special meaning that character may have. This use of + backslash as an escape character applies both inside and + outside character classes. + + + For example, if you want to match a "*" character, you write + "\*" in the pattern. This applies whether or not the + following character would otherwise be interpreted as a + meta-character, so it is always safe to precede a non-alphanumeric + with "\" to specify that it stands for itself. In + particular, if you want to match a backslash, you write "\\". 
+ + + If a pattern is compiled with the + PCRE_EXTENDED option, + whitespace in the pattern (other than in a character class) and + characters between a "#" outside a character class and the next newline + character are ignored. An escaping backslash can be used to include a + whitespace or "#" character as part of the pattern. + + + A second use of backslash provides a way of encoding + non-printing characters in patterns in a visible manner. There + is no restriction on the appearance of non-printing characters, + apart from the binary zero that terminates a pattern, + but when a pattern is being prepared by text editing, it is + usually easier to use one of the following escape sequences + than the binary character it represents: + - The backslash character has several uses. Firstly, if it is - followed by a non-alphanumeric character, it takes away any - special meaning that character may have. This use of - backslash as an escape character applies both inside and - outside character classes. - - - For example, if you want to match a "*" character, you write - "\*" in the pattern. This applies whether or not the - following character would otherwise be interpreted as a - meta-character, so it is always safe to precede a non-alphanumeric - with "\" to specify that it stands for itself. In - particular, if you want to match a backslash, you write "\\". - - - If a pattern is compiled with the PCRE_EXTENDED option, - whitespace in the pattern (other than in a character class) and - characters between a "#" outside a character class and the next newline - character are ignored. An escaping backslash can be used to include a - whitespace or "#" character as part of the pattern. - - - A second use of backslash provides a way of encoding - non-printing characters in patterns in a visible manner. 
There - is no restriction on the appearance of non-printing characters, - apart from the binary zero that terminates a pattern, - but when a pattern is being prepared by text editing, it is - usually easier to use one of the following escape sequences - than the binary character it represents: - - \a - - - alarm, that is, the BEL character (hex 07) - - + alarm, that is, the BEL character (hex 07) \cx - - - "control-x", where x is any character - - + "control-x", where x is any character \e - - - escape (hex 1B) - - + escape (hex 1B) \f - - - formfeed (hex 0C) - - + formfeed (hex 0C) \n - - - newline (hex 0A) - - + newline (hex 0A) \r - - - carriage return (hex 0D) - - + carriage return (hex 0D) \t - - - tab (hex 09) - - + tab (hex 09) \xhh - - - character with hex code hh - - + character with hex code hh \ddd - - - character with octal code ddd, or backreference - - + character with octal code ddd, or backreference - + The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted @@ -496,83 +387,63 @@ stand for themselves. 
For example: - - - \040 - - - is another way of writing a space - - - - - \40 - - - is the same, provided there are fewer than 40 - previous capturing subpatterns - - - - - \7 - - - is always a back reference - - - - - \11 - - - might be a back reference, or another way of - writing a tab - - - - - \011 - - - is always a tab - - - - - \0113 - - - is a tab followed by the character "3" - - - - - \113 - - - is the character with octal code 113 (since there - can be no more than 99 back references) - - - - - \377 - - - is a byte consisting entirely of 1 bits - - - - - \81 - - - is either a back reference, or a binary zero - followed by the two characters "8" and "1" - - - + + + \040 + is another way of writing a space + + + \40 + + + is the same, provided there are fewer than 40 + previous capturing subpatterns + + + + + \7 + is always a back reference + + + \11 + + + might be a back reference, or another way of + writing a tab + + + + + \011 + is always a tab + + + \0113 + is a tab followed by the character "3" + + + \113 + + + is the character with octal code 113 (since there + can be no more than 99 back references) + + + + + \377 + is a byte consisting entirely of 1 bits + + + \81 + + + is either a back reference, or a binary zero + followed by the two characters "8" and "1" + + + @@ -592,56 +463,32 @@ character types: - - - \d - - - any decimal digit - - - - - \D - - - any character that is not a decimal digit - - - - - \s - - - any whitespace character - - - - - \S - - - any character that is not a whitespace character - - - - - \w - - - any "word" character - - - - - \W - - - any "non-word" character - - - - + + + \d + any decimal digit + + + \D + any character that is not a decimal digit + + + \s + any whitespace character + + + \S + any character that is not a whitespace character + + + \w + any "word" character + + + \W + any "non-word" character + + Each pair of escape sequences partitions the complete set of @@ -677,44 +524,28 @@ \b - - - word boundary - - 
+ word boundary \B - - - not a word boundary - - + not a word boundary \A - - - start of subject (independent of multiline mode) - - + start of subject (independent of multiline mode) \Z - + - end of subject or newline at end (independent of - multiline mode) + end of subject or newline at end (independent of + multiline mode) \z - - - end of subject(independent of multiline mode) - - + end of subject (independent of multiline mode) @@ -738,8 +569,7 @@ ever match at the very start and end of the subject string, whatever options are set. They are not affected by the PCRE_MULTILINE or - PCRE_DOLLAR_ENDONLY + PCRE_DOLLAR_ENDONLY options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of @@ -750,60 +580,59 @@ Circumflex and dollar - Outside a character class, in the default matching mode, the - circumflex character is an assertion which is true only if - the current matching point is at the start of the subject - string. Inside a character class, circumflex has an entirely - different meaning (see below). - - - Circumflex need not be the first character of the pattern if - a number of alternatives are involved, but it should be the - first thing in each alternative in which it appears if the - pattern is ever to match that branch. If all possible - alternatives start with a circumflex, that is, if the pattern is - constrained to match only at the start of the subject, it is - said to be an "anchored" pattern. (There are also other - constructs that can cause a pattern to be anchored.) - - - A dollar character is an assertion which is &true; only if the - current matching point is at the end of the subject string, - or immediately before a newline character that is the last - character in the string (by default). Dollar need not be the - last character of the pattern if a number of alternatives - are involved, but it should be the last item in any branch - in which it appears.
Dollar has no special meaning in a - character class. - - - The meaning of dollar can be changed so that it matches only - at the very end of the string, by setting the - PCRE_DOLLAR_ENDONLY - option at compile or matching time. This - does not affect the \Z assertion. - - - The meanings of the circumflex and dollar characters are - changed if the PCRE_MULTILINE option - is set. When this is the case, they match immediately after and - immediately before an internal "\n" character, respectively, in addition - to matching at the start and end of the subject string. For example, the - pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, - but not otherwise. Consequently, patterns that are anchored in single - line mode because all branches start with "^" are not anchored in - multiline mode. The PCRE_DOLLAR_ENDONLY - option is ignored if PCRE_MULTILINE is - set. - - - Note that the sequences \A, \Z, and \z can be used to match - the start and end of the subject in both modes, and if all - branches of a pattern start with \A is it always anchored, - whether PCRE_MULTILINE is set or not. + Outside a character class, in the default matching mode, the + circumflex character is an assertion which is true only if + the current matching point is at the start of the subject + string. Inside a character class, circumflex has an entirely + different meaning (see below). + + + Circumflex need not be the first character of the pattern if + a number of alternatives are involved, but it should be the + first thing in each alternative in which it appears if the + pattern is ever to match that branch. If all possible + alternatives start with a circumflex, that is, if the pattern is + constrained to match only at the start of the subject, it is + said to be an "anchored" pattern. (There are also other + constructs that can cause a pattern to be anchored.) 
+ + + A dollar character is an assertion which is &true; only if the + current matching point is at the end of the subject string, + or immediately before a newline character that is the last + character in the string (by default). Dollar need not be the + last character of the pattern if a number of alternatives + are involved, but it should be the last item in any branch + in which it appears. Dollar has no special meaning in a + character class. + + + The meaning of dollar can be changed so that it matches only + at the very end of the string, by setting the + PCRE_DOLLAR_ENDONLY + option at compile or matching time. This does not affect the \Z assertion. + + + The meanings of the circumflex and dollar characters are + changed if the + PCRE_MULTILINE option + is set. When this is the case, they match immediately after and + immediately before an internal "\n" character, respectively, in addition + to matching at the start and end of the subject string. For example, the + pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, + but not otherwise. Consequently, patterns that are anchored in single + line mode because all branches start with "^" are not anchored in + multiline mode. The + PCRE_DOLLAR_ENDONLY + option is ignored if + PCRE_MULTILINE is + set. + + + Note that the sequences \A, \Z, and \z can be used to match + the start and end of the subject in both modes, and if all + branches of a pattern start with \A it is always anchored, + whether PCRE_MULTILINE is set or not. @@ -812,8 +641,8 @@ Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing - character, but not (by default) newline. If the PCRE_DOTALL + character, but not (by default) newline. If the + PCRE_DOTALL option is set, then dots match newlines as well.
The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they @@ -825,90 +654,90 @@ Square brackets - An opening square bracket introduces a character class, - terminated by a closing square bracket. A closing square - bracket on its own is not special. If a closing square - bracket is required as a member of the class, it should be - the first data character in the class (after an initial - circumflex, if present) or escaped with a backslash. - - - A character class matches a single character in the subject; - the character must be in the set of characters defined by - the class, unless the first character in the class is a - circumflex, in which case the subject character must not be in - the set defined by the class. If a circumflex is actually - required as a member of the class, ensure it is not the - first character, or escape it with a backslash. - - - For example, the character class [aeiou] matches any lower - case vowel, while [^aeiou] matches any character that is not - a lower case vowel. Note that a circumflex is just a - convenient notation for specifying the characters which are in - the class by enumerating those that are not. It is not an - assertion: it still consumes a character from the subject - string, and fails if the current pointer is at the end of - the string. - - - When caseless matching is set, any letters in a class - represent both their upper case and lower case versions, so - for example, a caseless [aeiou] matches "A" as well as "a", - and a caseless [^aeiou] does not match "A", whereas a - caseful version would. - - - The newline character is never treated in any special way in - character classes, whatever the setting of the PCRE_DOTALL - or PCRE_MULTILINE - options is. A class such as [^a] will always match a newline. - - - The minus (hyphen) character can be used to specify a range - of characters in a character class. 
For example, [d-m] - matches any letter between d and m, inclusive. If a minus - character is required in a class, it must be escaped with a - backslash or appear in a position where it cannot be - interpreted as indicating a range, typically as the first or last - character in the class. - - - It is not possible to have the literal character "]" as the - end character of a range. A pattern such as [W-]46] is - interpreted as a class of two characters ("W" and "-") - followed by a literal string "46]", so it would match "W46]" or - "-46]". However, if the "]" is escaped with a backslash it - is interpreted as the end of range, so [W-\]46] is - interpreted as a single class containing a range followed by two - separate characters. The octal or hexadecimal representation - of "]" can also be used to end a range. - - - Ranges operate in ASCII collating sequence. They can also be - used for characters specified numerically, for example - [\000-\037]. If a range that includes letters is used when - caseless matching is set, it matches the letters in either - case. For example, [W-c] is equivalent to [][\^_`wxyzabc], - matched caselessly, and if character tables for the "fr" - locale are in use, [\xc8-\xcb] matches accented E characters - in both cases. - - - The character types \d, \D, \s, \S, \w, and \W may also - appear in a character class, and add the characters that - they match to the class. For example, [\dABCDEF] matches any - hexadecimal digit. A circumflex can conveniently be used - with the upper case character types to specify a more - restricted set of characters than the matching lower case type. - For example, the class [^\W_] matches any letter or digit, - but not underscore. - - - All non-alphanumeric characters other than \, -, ^ (at the - start) and the terminating ] are non-special in character - classes, but it does no harm if they are escaped. + An opening square bracket introduces a character class, + terminated by a closing square bracket. 
A closing square + bracket on its own is not special. If a closing square + bracket is required as a member of the class, it should be + the first data character in the class (after an initial + circumflex, if present) or escaped with a backslash. + + + A character class matches a single character in the subject; + the character must be in the set of characters defined by + the class, unless the first character in the class is a + circumflex, in which case the subject character must not be in + the set defined by the class. If a circumflex is actually + required as a member of the class, ensure it is not the + first character, or escape it with a backslash. + + + For example, the character class [aeiou] matches any lower + case vowel, while [^aeiou] matches any character that is not + a lower case vowel. Note that a circumflex is just a + convenient notation for specifying the characters which are in + the class by enumerating those that are not. It is not an + assertion: it still consumes a character from the subject + string, and fails if the current pointer is at the end of + the string. + + + When caseless matching is set, any letters in a class + represent both their upper case and lower case versions, so + for example, a caseless [aeiou] matches "A" as well as "a", + and a caseless [^aeiou] does not match "A", whereas a + caseful version would. + + + The newline character is never treated in any special way in + character classes, whatever the setting of the PCRE_DOTALL + or PCRE_MULTILINE + options is. A class such as [^a] will always match a newline. + + + The minus (hyphen) character can be used to specify a range + of characters in a character class. For example, [d-m] + matches any letter between d and m, inclusive. If a minus + character is required in a class, it must be escaped with a + backslash or appear in a position where it cannot be + interpreted as indicating a range, typically as the first or last + character in the class. 
+ + + It is not possible to have the literal character "]" as the + end character of a range. A pattern such as [W-]46] is + interpreted as a class of two characters ("W" and "-") + followed by a literal string "46]", so it would match "W46]" or + "-46]". However, if the "]" is escaped with a backslash it + is interpreted as the end of range, so [W-\]46] is + interpreted as a single class containing a range followed by two + separate characters. The octal or hexadecimal representation + of "]" can also be used to end a range. + + + Ranges operate in ASCII collating sequence. They can also be + used for characters specified numerically, for example + [\000-\037]. If a range that includes letters is used when + caseless matching is set, it matches the letters in either + case. For example, [W-c] is equivalent to [][\^_`wxyzabc], + matched caselessly, and if character tables for the "fr" + locale are in use, [\xc8-\xcb] matches accented E characters + in both cases. + + + The character types \d, \D, \s, \S, \w, and \W may also + appear in a character class, and add the characters that + they match to the class. For example, [\dABCDEF] matches any + hexadecimal digit. A circumflex can conveniently be used + with the upper case character types to specify a more + restricted set of characters than the matching lower case type. + For example, the class [^\W_] matches any letter or digit, + but not underscore. + + + All non-alphanumeric characters other than \, -, ^ (at the + start) and the terminating ] are non-special in character + classes, but it does no harm if they are escaped. @@ -917,9 +746,7 @@ Vertical bar characters are used to separate alternative patterns. For example, the pattern - - gilbert|sullivan - + gilbert|sullivan matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). 
The matching process tries @@ -934,104 +761,105 @@ Internal option setting - The settings of PCRE_CASELESS, - PCRE_MULTILINE, - PCRE_DOTALL, - PCRE_UNGREEDY, - and PCRE_EXTENDED can be changed from within the pattern by - a sequence of Perl option letters enclosed between "(?" and - ")". The option letters are + The settings of PCRE_CASELESS, + PCRE_MULTILINE, + PCRE_DOTALL, + PCRE_UNGREEDY, + and PCRE_EXTENDED + can be changed from within the pattern by + a sequence of Perl option letters enclosed between "(?" and + ")". The option letters are: - - Internal option letters - - - - i - for PCRE_CASELESS - - - m - for PCRE_MULTILINE - - - s - for PCRE_DOTALL - - - x - for PCRE_EXTENDED - - - U - for PCRE_UNGREEDY - - - -
-
- - For example, (?im) sets caseless, multiline matching. It is - also possible to unset these options by preceding the letter - with a hyphen, and a combined setting and unsetting such as - (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while - unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. - If a letter appears both before and after the hyphen, the - option is unset. - - - The scope of these option changes depends on where in the - pattern the setting occurs. For settings that are outside - any subpattern (defined below), the effect is the same as if - the options were set or unset at the start of matching. The - following patterns all behave in exactly the same way: - + + Internal option letters + + + + i + for PCRE_CASELESS + + + m + for PCRE_MULTILINE + + + s + for PCRE_DOTALL + + + x + for PCRE_EXTENDED + + + U + for PCRE_UNGREEDY + + + +
+
+ + For example, (?im) sets caseless, multiline matching. It is + also possible to unset these options by preceding the letter + with a hyphen, and a combined setting and unsetting such as + (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while + unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. + If a letter appears both before and after the hyphen, the + option is unset. + + + The scope of these option changes depends on where in the + pattern the setting occurs. For settings that are outside + any subpattern (defined below), the effect is the same as if + the options were set or unset at the start of matching. The + following patterns all behave in exactly the same way: + - - (?i)abc - a(?i)bc - ab(?i)c - abc(?i) - + + (?i)abc + a(?i)bc + ab(?i)c + abc(?i) + - - which in turn is the same as compiling the pattern abc with - PCRE_CASELESS set. - In other words, such "top level" settings apply to the whole - pattern (unless there are other changes inside subpatterns). - If there is more than one setting of the same option at top level, - the rightmost setting is used. - - - If an option change occurs inside a subpattern, the effect - is different. This is a change of behaviour in Perl 5.005. - An option change inside a subpattern affects only that part - of the subpattern that follows it, so + + which in turn is the same as compiling the pattern abc with + PCRE_CASELESS set. + In other words, such "top level" settings apply to the whole + pattern (unless there are other changes inside subpatterns). + If there is more than one setting of the same option at top level, + the rightmost setting is used. + + + If an option change occurs inside a subpattern, the effect + is different. This is a change of behaviour in Perl 5.005. + An option change inside a subpattern affects only that part + of the subpattern that follows it, so - (a(?i)b)c + (a(?i)b)c - matches abc and aBc and no other strings (assuming - PCRE_CASELESS is not used). 
By this means, options can be - made to have different settings in different parts of the - pattern. Any changes made in one alternative do carry on - into subsequent branches within the same subpattern. For - example, + matches abc and aBc and no other strings (assuming + PCRE_CASELESS is not used). By this means, options can be + made to have different settings in different parts of the + pattern. Any changes made in one alternative do carry on + into subsequent branches within the same subpattern. For + example, - (a(?i)b|c) + (a(?i)b|c) - matches "ab", "aB", "c", and "C", even though when matching - "C" the first branch is abandoned before the option setting. - This is because the effects of option settings happen at - compile time. There would be some very weird behaviour otherwise. - - - The PCRE-specific options PCRE_UNGREEDY and - PCRE_EXTRA can - be changed in the same way as the Perl-compatible options by - using the characters U and X respectively. The (?X) flag - setting is special in that it must always occur earlier in - the pattern than any of the additional features it turns on, - even when it is at top level. It is best put at the start. + matches "ab", "aB", "c", and "C", even though when matching + "C" the first branch is abandoned before the option setting. + This is because the effects of option settings happen at + compile time. There would be some very weird behaviour otherwise. + + + The PCRE-specific options PCRE_UNGREEDY and + PCRE_EXTRA can + be changed in the same way as the Perl-compatible options by + using the characters U and X respectively. The (?X) flag + setting is special in that it must always occur earlier in + the pattern than any of the additional features it turns on, + even when it is at top level. It is best put at the start.
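
Reviewer note, not part of the patch: the subpattern scoping rules in the "Internal option setting" text above can be checked with a short PHP sketch. It assumes the PCRE functions (`preg_match`) are available; the helper name `matches()` is hypothetical and exists only for this illustration. Patterns are anchored with `^` and `$` so that only whole-string matches count. Only the start-of-pattern form of a top-level setting is exercised, because the effect of a top-level setting placed later in the pattern may differ between PCRE versions.

```php
<?php
// Hypothetical helper for this illustration: true when $pattern matches $subject.
function matches(string $pattern, string $subject): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $subject) === 1;
}

// (?i) at the very start behaves like compiling the pattern with the "i" modifier.
$topLevel = matches('/^(?i)abc$/', 'ABC');          // true
$modifier = matches('/^abc$/i', 'ABC');             // true

// An option change inside a subpattern affects only the rest of that subpattern:
// "b" becomes caseless, while "a" and "c" stay case-sensitive.
$scoped = [
    'abc' => matches('/^(a(?i)b)c$/', 'abc'),       // true
    'aBc' => matches('/^(a(?i)b)c$/', 'aBc'),       // true
    'aBC' => matches('/^(a(?i)b)c$/', 'aBC'),       // false: "c" is outside the subpattern
    'Abc' => matches('/^(a(?i)b)c$/', 'Abc'),       // false: "a" precedes the option change
];

// A change made in one alternative carries into later branches of the same subpattern.
$branch = [
    'aB' => matches('/^(a(?i)b|c)$/', 'aB'),        // true
    'C'  => matches('/^(a(?i)b|c)$/', 'C'),         // true
];
```

The array keys are the subject strings, so dumping `$scoped` or `$branch` with `var_dump()` shows at a glance which subjects each pattern accepts.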