From c0c80687671c628489190b095bd569783cde89ed Mon Sep 17 00:00:00 2001 From: Jakub Vrana Date: Tue, 23 Dec 2003 13:07:58 +0000 Subject: [PATCH] literallayout changed to para git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@147253 c90b9560-bf6c-de11-be94-00142212c4b1 --- .../pcre/functions/pcre.pattern.syntax.xml | 448 +++++++++++------- 1 file changed, 282 insertions(+), 166 deletions(-) diff --git a/reference/pcre/functions/pcre.pattern.syntax.xml b/reference/pcre/functions/pcre.pattern.syntax.xml index 2e3e279e06..e46e1332e7 100644 --- a/reference/pcre/functions/pcre.pattern.syntax.xml +++ b/reference/pcre/functions/pcre.pattern.syntax.xml @@ -1,5 +1,5 @@ - + @@ -159,7 +159,8 @@ Friedl's "Mastering Regular Expressions", published by O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description here is intended as reference documentation. - + + A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding @@ -742,13 +743,14 @@ Circumflex and dollar - + Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below). - + + Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the @@ -757,7 +759,8 @@ constrained to match only at the start of the subject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.) 
- + + A dollar character is an assertion which is &true; only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last @@ -766,13 +769,15 @@ are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class. - + + The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching time. This does not affect the \Z assertion. - + + The meanings of the circumflex and dollar characters are changed if the PCRE_MULTILINE option is set. When this is the case, they match immediately after and immediately @@ -784,17 +789,18 @@ because all branches start with "^" are not anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set. - + + Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with \A it is always anchored, whether PCRE_MULTILINE is set or not. - + FULL STOP - + Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) newline. If the PCRE_DOTALL @@ -803,19 +809,20 @@ circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no special meaning in a character class. - + Square brackets - + An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash. 
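The rule for placing a closing square bracket inside a class can be tried out with a small sketch. It uses Python's `re` module as a stand-in, on the assumption that it treats this case the same way as PCRE; the subject string and variable names are illustrative only.

```python
import re

subject = "a]b[c"

# A closing bracket placed first in the class is a literal member.
first_in_class = re.findall(r"[]ab]", subject)

# Escaping the closing bracket with a backslash gives the same class.
escaped = re.findall(r"[ab\]]", subject)

# After an initial circumflex, a leading "]" is still literal,
# so this negated class matches anything except "]", "a" and "b".
negated = re.findall(r"[^]ab]", subject)

print(first_in_class, escaped, negated)
```

The first two classes are equivalent; the third is their negation.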
- + + A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a @@ -823,7 +830,8 @@ the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash. - + + For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a @@ -832,18 +840,21 @@ assertion: it still consumes a character from the subject string, and fails if the current pointer is at the end of the string. - + + When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would. - + + The newline character is never treated in any special way in character classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class such as [^a] will always match a newline. - + + The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus @@ -851,7 +862,8 @@ backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class. - + + It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") @@ -861,7 +873,8 @@ interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of "]" can also be used to end a range. - + + Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037]. 
If a range that includes letters is used when @@ -870,7 +883,8 @@ matched caselessly, and if character tables for the "fr" locale are in use, [\xc8-\xcb] matches accented E characters in both cases. - + + The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any @@ -879,20 +893,21 @@ restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore. - + + All non-alphanumeric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped. - + Vertical bar - + Vertical bar characters are used to separate alternative patterns. For example, the pattern - gilbert|sullivan + gilbert|sullivan matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted @@ -902,56 +917,82 @@ subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern. - + Internal option setting - - The settings of PCRE_CASELESS , - PCRE_MULTILINE , - PCRE_DOTALL , + + The settings of PCRE_CASELESS, + PCRE_MULTILINE, + PCRE_DOTALL, and PCRE_EXTENDED can be changed from within the pattern by a sequence of Perl option letters enclosed between "(?" and ")". The option letters are - i for PCRE_CASELESS - m for PCRE_MULTILINE - s for PCRE_DOTALL - x for PCRE_EXTENDED - + + Internal option letters + + + + i + for PCRE_CASELESS + + + m + for PCRE_MULTILINE + + + s + for PCRE_DOTALL + + + x + for PCRE_EXTENDED + + + +
+
+ For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and a combined setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while - unsetting PCRE_DOTALL and PCRE_EXTENDED , is also permitted. + unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. If a letter appears both before and after the hyphen, the option is unset. - + + The scope of these option changes depends on where in the pattern the setting occurs. For settings that are outside any subpattern (defined below), the effect is the same as if the options were set or unset at the start of matching. The following patterns all behave in exactly the same way: + + (?i)abc a(?i)bc ab(?i)c abc(?i) + + which in turn is the same as compiling the pattern abc with PCRE_CASELESS set. In other words, such "top level" settings apply to the whole pattern (unless there are other changes inside subpatterns). If there is more than one setting of the same option at top level, the rightmost setting is used. - + + If an option change occurs inside a subpattern, the effect is different. This is a change of behaviour in Perl 5.005. An option change inside a subpattern affects only that part of the subpattern that follows it, so - (a(?i)b)c + (a(?i)b)c matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used). By this means, options can be @@ -960,13 +1001,14 @@ into subsequent branches within the same subpattern. For example, - (a(?i)b|c) + (a(?i)b|c) matches "ab", "aB", "c", and "C", even though when matching "C" the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise. 
- + + The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the same way as the Perl-compatible options by @@ -974,25 +1016,27 @@ setting is special in that it must always occur earlier in the pattern than any of the additional features it turns on, even when it is at top level. It is best put at the start. -
+
subpatterns - + Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern does two things: - + + 1. It localizes a set of alternatives. For example, the pattern - cat(aract|erpillar|) + cat(aract|erpillar|) matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string. - + + 2. It sets up the subpattern as a capturing subpattern (as defined above). When the whole pattern matches, that portion of the subject string that matched the subpattern is @@ -1001,15 +1045,17 @@ pcre_exec. Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns. - + + For example, if the string "the red king" is matched against the pattern - the ((red|white) (king|queen)) + the ((red|white) (king|queen)) the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3. - + + The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an @@ -1019,49 +1065,57 @@ if the string "the white queen" is matched against the pattern - the ((?:red|white) (king|queen)) + the ((?:red|white) (king|queen)) the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 99, and the maximum number of all subpatterns, both capturing and non-capturing, is 200. - + + As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns + + (?i:saturday|sunday) (?:(?i)saturday|sunday) + + match exactly the same set of strings. 
Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday". - + Repetition - + Repetition is specified by quantifiers, which can follow any of the following items: - a single character, possibly escaped - the . metacharacter - a character class - a back reference (see next section) - a parenthesized subpattern (unless it is an assertion - - see below) - + + a single character, possibly escaped + the . metacharacter + a character class + a back reference (see next section) + a parenthesized subpattern (unless it is an assertion - + see below) + + + The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example: - z{2,4} + z{2,4} matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, @@ -1069,42 +1123,63 @@ second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus - [aeiou]{3,} + [aeiou]{3,} matches at least 3 successive vowels, but may match many more, while - \d{8} + \d{8} matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters. - + + The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present. 
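The brace quantifiers described above can be checked with a short sketch. It uses Python's `re` module as a stand-in for PCRE (an assumption; the two agree on these particular patterns, although Python treats `{,6}` as a quantifier, so that case is left out). The helper `full` is a hypothetical name introduced for the example.

```python
import re

def full(pattern, subject):
    """Return True when the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four repetitions of "z".
bounded = [full(r"z{2,4}", s) for s in ("z", "zz", "zzzz", "zzzzz")]

# [aeiou]{3,}: at least three vowels, with no upper limit.
at_least_three = full(r"[aeiou]{3,}", "aeiouaeiou")

# \d{8}: exactly eight digits.
exactly_eight = (full(r"\d{8}", "20031223"), full(r"\d{8}", "2003122"))

# z{0}: behaves as if the item and its quantifier were absent.
zero_repeat = full(r"az{0}b", "ab")

print(bounded, at_least_three, exactly_eight, zero_repeat)
```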
- + + For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations: - * is equivalent to {0,} - + is equivalent to {1,} - ? is equivalent to {0,1} - + + Single-character quantifiers + + + + * + equivalent to {0,} + + + + + equivalent to {1,} + + + ? + equivalent to {0,1} + + + +
+
+ It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example: - (a?)* - + (a?)* + + Earlier versions of Perl and PCRE used to give an error at compile time for such patterns. However, because there are cases where this can be useful, such patterns are now accepted, but if any repetition of the subpattern does in fact match no characters, the loop is forcibly broken. - + + By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to @@ -1114,20 +1189,21 @@ * and / characters may appear. An attempt to match C comments by applying the pattern - /\*.*\*/ + /\*.*\*/ to the string - /* first command */ not comment /* second comment */ + /* first command */ not comment /* second comment */ fails, because it matches the entire string due to the greediness of the .* item. - + + However, if a quantifier is followed by a question mark, then it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern - /\*.*?\*/ + /\*.*?\*/ does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred @@ -1136,22 +1212,25 @@ Because it has two uses, it can sometimes appear doubled, as in - \d??\d + \d??\d which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches. - + + If the PCRE_UNGREEDY option is set (an option which is not available in Perl) then the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour. 
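The difference between greedy and non-greedy quantifiers on the C comment subject can be sketched as follows. Python's `re` module is assumed here as a stand-in for PCRE; it has no equivalent of PCRE_UNGREEDY, so only the per-quantifier question mark is shown.

```python
import re

subject = "/* first command */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the whole string is one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Non-greedy .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the
# match requires it (here, fullmatch demands the whole subject).
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```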
- + + When a parenthesized subpattern is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more store is required for the compiled pattern, in proportion to the size of the minimum or maximum. - + + If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent to Perl's /s) is set, thus allowing the . to match newlines, then the pattern is implicitly anchored, @@ -1163,11 +1242,12 @@ no newlines, it is worth setting PCRE_DOTALL when the pattern begins with .* in order to obtain this optimization, or alternatively using ^ to indicate anchoring explicitly. - + + When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. For example, after - (tweedle[dume]{3}\s*)+ + (tweedle[dume]{3}\s*)+ has matched "tweedledum tweedledee" the value of the captured substring is "tweedledee". However, if there are @@ -1175,22 +1255,23 @@ values may have been set in previous iterations. For example, after - /(a|(b))+/ + /(a|(b))+/ matches "aba" the value of the second captured substring is "b". -
+
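The behaviour of captured values under repetition, described at the end of the section above, can be observed with a minimal sketch. It assumes Python's `re` module behaves like PCRE for these two patterns; other engines may reset nested captures on each iteration.

```python
import re

# The outer group is repeated twice; only the final iteration is kept.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The nested group (b) takes part only in the second of three
# iterations, yet its value is still set after the match.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer_value, inner_value = nested.group(1), nested.group(2)

print(last_iteration, outer_value, inner_value)
```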
BACK REFERENCES - + Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier (i.e. to its left) in the pattern, provided there have been that many previous capturing left parentheses. - + + However, if the decimal number following the backslash is less than 10, it is always taken as a back reference, and causes an error only if there are not that many capturing @@ -1199,29 +1280,31 @@ the reference for numbers less than 10. See the section entitled "Backslash" above for further details of the handling of digits following a backslash. - + + A back reference matches whatever actually matched the capturing subpattern in the current subject string, rather than anything matching the subpattern itself. So the pattern - (sens|respons)e and \1ibility + (sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If caseful matching is in force at the time of the back reference, then the case of letters is relevant. For example, - ((?i)rah)\s+\1 + ((?i)rah)\s+\1 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original capturing subpattern is matched caselessly. - + + There may be more than one back reference to the same subpattern. If a subpattern has not actually been used in a particular match, then any back references to it always fail. For example, the pattern - (a|(bc))\2 + (a|(bc))\2 always fails if it starts to match "a" rather than "bc". Because there may be up to 99 back references, all digits @@ -1230,13 +1313,14 @@ character, then some delimiter must be used to terminate the back reference. If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty comment can be used. - + + A back reference that occurs inside the parentheses to which it refers fails when the subpattern is first used, so, for example, (a\1) never matches. 
However, such references can be useful inside repeated subpatterns. For example, the pattern - (a|b\1)+ + (a|b\1)+ matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of the subpattern, the back reference matches @@ -1245,12 +1329,12 @@ that the first iteration does not need to match the back reference. This can be done using alternation, as in the example above, or by a quantifier with a minimum of zero. - + Assertions - + An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple assertions coded as \b, @@ -1258,34 +1342,36 @@ assertions are coded as subpatterns. There are two kinds: those that look ahead of the current position in the subject string, and those that look behind it. - + + An assertion subpattern is matched in the normal way, except that it does not cause the current matching position to be changed. Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example, - \w+(?=;) + \w+(?=;) matches a word followed by a semicolon, but does not include the semicolon in the match, and - foo(?!bar) + foo(?!bar) matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern - (?!foo)bar + (?!foo)bar does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always &true; when the next three characters are "bar". A lookbehind assertion is needed to achieve this effect. - + + Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example, - (?<!foo)bar + (?<!foo)bar does find an occurrence of "bar" that is not preceded by "foo". The contents of a lookbehind assertion are restricted @@ -1293,11 +1379,11 @@ length. However, if there are several alternatives, they do not all have to have the same fixed length. 
Thus - (?<=bullock|donkey) + (?<=bullock|donkey) is permitted, but - (?<!dogs?|cats?) + (?<!dogs?|cats?) causes an error at compile time. Branches that match different length strings are permitted only at the top level of @@ -1305,13 +1391,13 @@ Perl 5.005, which requires all branches to match the same length of string. An assertion such as - (?<=ab(c|de)) + (?<=ab(c|de)) is not permitted, because its single top-level branch can match two different lengths, but it is acceptable if rewritten to use two top-level branches: - (?<=abc|abde) + (?<=abc|abde) The implementation of lookbehind assertions is, for each alternative, to temporarily move the current position back @@ -1321,11 +1407,12 @@ once-only subpatterns can be particularly useful for matching at the ends of strings; an example is given at the end of the section on once-only subpatterns. - + + Several assertions (of any sort) may occur in succession. For example, - (?<=\d{3})(?<!999)foo + (?<=\d{3})(?<!999)foo matches "foo" preceded by three digits that are not "999". Notice that each of the assertions is applied independently @@ -1337,25 +1424,28 @@ of which are not "999". For example, it doesn't match "123abcfoo". A pattern to do that is - (?<=\d{3}...)(?<!999)foo - + (?<=\d{3}...)(?<!999)foo + + This time the first assertion looks at the preceding six characters, checking that the first three are digits, and then the second assertion checks that the preceding three characters are not "999". - + + Assertions can be nested in any combination. For example, - (?<=(?<!foo)bar)baz + (?<=(?<!foo)bar)baz matches an occurrence of "baz" that is preceded by "bar" which in turn is not preceded by "foo", while - (?<=\d{3}(?!999)...)foo + (?<=\d{3}(?!999)...)foo is another pattern which matches "foo" preceded by three digits and any three characters that are not "999". - + + Assertion subpatterns are not capturing subpatterns, and may not be repeated, because it makes no sense to assert the same thing several times. 
If any kind of assertion contains @@ -1364,15 +1454,16 @@ pattern. However, substring capturing is carried out only for positive assertions, because it does not make sense for negative assertions. - + + Assertions count towards the maximum of 200 parenthesized subpatterns. - + Once-only subpatterns - + With both maximizing and minimizing repetition, failure of what follows normally causes the repeated item to be re-evaluated to see if a different number of repeats allows the @@ -1381,12 +1472,14 @@ to cause it to fail earlier than it otherwise might, when the author of the pattern knows there is no point in carrying on. - + + Consider, for example, the pattern \d+foo when applied to the subject line - 123456bar - + 123456bar + + After matching all 6 digits and then failing to match "foo", the normal action of the matcher is to try again with only 5 digits matching the \d+ item, and then with 4, and so on, @@ -1397,40 +1490,45 @@ the first time. The notation is another kind of special parenthesis, starting with (?> as in this example: - (?>\d+)bar - + (?>\d+)bar + + This kind of parenthesis "locks up" the part of the pattern it contains once it has matched, and a failure further into the pattern is prevented from backtracking into it. Backtracking past it to previous items, however, works as normal. - + + An alternative description is that a subpattern of this type matches the string of characters that an identical standalone pattern would match, if anchored at the current point in the subject string. - + + Once-only subpatterns are not capturing subpatterns. Simple cases such as the above example can be thought of as a maximizing repeat that must swallow everything it can. So, while both \d+ and \d+? are prepared to adjust the number of digits they match in order to make the rest of the pattern match, (?>\d+) can only match an entire sequence of digits. - + + This construction can of course contain arbitrarily complicated subpatterns, and it can be nested. 
- + + Once-only subpatterns can be used in conjunction with look-behind assertions to specify efficient matching at the end of the subject string. Consider a simple pattern such as - abcd$ + abcd$ when applied to a long string which does not match. Because matching proceeds from left to right, PCRE will look for each "a" in the subject and then see if what follows matches the rest of the pattern. If the pattern is specified as - ^.*abcd$ + ^.*abcd$ then the initial .* matches the entire string at first, but when this fails (because there is no following "a"), it @@ -1439,28 +1537,29 @@ for "a" covers the entire string, from right to left, so we are no better off. However, if the pattern is written as - ^(?>.*)(?<=abcd) + ^(?>.*)(?<=abcd) then there can be no backtracking for the .* item; it can match only the entire string. The subsequent lookbehind assertion does a single test on the last four characters. If it fails, the match fails immediately. For long strings, this approach makes a significant difference to the processing time. - + + When a pattern contains an unlimited repeat inside a subpattern that can itself be repeated an unlimited number of times, the use of a once-only subpattern is the only way to avoid some failing matches taking a very long time indeed. The pattern - (\D+|<\d+>)*[!?] + (\D+|<\d+>)*[!?] matches an unlimited number of substrings that either consist of non-digits, or digits enclosed in <>, followed by either ! or ?. When it matches, it runs quickly. However, if it is applied to - aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa it takes a long time before reporting failure. This is because the string can be divided between the two repeats in @@ -1472,29 +1571,33 @@ match, and fail early if it is not present in the string.) If the pattern is changed to - ((?>\D+)|<\d+>)*[!?] + ((?>\D+)|<\d+>)*[!?] sequences of non-digits cannot be broken, and failure happens quickly. 
- + Conditional subpatterns - + It is possible to cause the matching process to obey a subpattern conditionally or to choose between two alternative subpatterns, depending on the result of an assertion, or whether a previous capturing subpattern matched or not. The two possible forms of conditional subpattern are + + (?(condition)yes-pattern) (?(condition)yes-pattern|no-pattern) - + + If the condition is satisfied, the yes-pattern is used; otherwise the no-pattern (if present) is used. If there are more than two alternatives in the subpattern, a compile-time error occurs. - + + There are two kinds of condition. If the text between the parentheses consists of a sequence of digits, then the condition is satisfied if the capturing subpattern of that @@ -1503,8 +1606,9 @@ more readable (assume the PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: - ( \( )? [^()]+ (?(1) \) ) - + ( \( )? [^()]+ (?(1) \) ) + + The first part matches an optional opening parenthesis, and if that character is present, sets it as the first captured substring. The second part matches one or more characters @@ -1517,16 +1621,20 @@ subpattern matches nothing. In other words, this pattern matches a sequence of non-parentheses, optionally enclosed in parentheses. - + + If the condition is not a sequence of digits, it must be an assertion. This may be a positive or negative lookahead or lookbehind assertion. Consider this pattern, again containing non-significant white space, and with the two alternatives on the second line: + + (?(?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) - + + The condition is a positive lookahead assertion that matches an optional sequence of non-letters followed by a letter. In other words, it tests for the presence of at least one @@ -1535,26 +1643,27 @@ matched against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. 
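The first kind of condition, a reference to a numbered capturing subpattern, can be exercised with a brief sketch. It uses Python's `re` module as a stand-in for PCRE (an assumption), with `re.VERBOSE` playing the role of PCRE_EXTENDED. Python does not accept an assertion as the condition, so the second example above is not reproduced.

```python
import re

# Optional "(" captured as group 1, then non-parentheses, then ")"
# only if group 1 took part in the match.
pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ("abc", "(abc)", "(abc", "abc)")
results = {s: pattern.fullmatch(s) is not None for s in subjects}

print(results)
```

Only the balanced forms match as a whole: a sequence of non-parentheses, either bare or enclosed in one pair of parentheses.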
- + Comments - + The sequence (?# marks the start of a comment which continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters that make up a comment play no part in the pattern matching at all. - + + If the PCRE_EXTENDED option is set, an unescaped # character outside a character class introduces a comment that continues up to the next newline character in the pattern. - + Recursive patterns - + Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can be done is to use a pattern @@ -1568,41 +1677,43 @@ option is set so that white space is ignored): - \( ( (?>[^()]+) | (?R) )* \) - + \( ( (?>[^()]+) | (?R) )* \) + + First it matches an opening parenthesis. Then it matches any number of substrings which can either be a sequence of non-parentheses, or a recursive match of the pattern itself (i.e. a correctly parenthesized substring). Finally there is a closing parenthesis. - + + This particular example pattern contains nested unlimited repeats, and so the use of a once-only subpattern for matching strings of non-parentheses is important when applying the pattern to strings that do not match. For example, when it is applied to - (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() + (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() it yields "no match" quickly. However, if a once-only subpattern is not used, the match runs for a very long time indeed because there are so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported. - + + The values set for any capturing subpatterns are those from the outermost level of the recursion at which the subpattern value is set. If the pattern above is matched against - (ab(cd)ef) + (ab(cd)ef) the value for the capturing parentheses is "ef", which is the last value taken on at the top level. 
If additional parentheses are added, giving - \( ( ( (?>[^()]+) | (?R) )* ) \) - ^ ^ - ^ ^ then the string they capture + \( ( ( (?>[^()]+) | (?R) )* ) \) + then the string they capture is "ab(cd)ef", the contents of the top level parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a @@ -1611,12 +1722,12 @@ saves data for the first 15 capturing parentheses only, as there is no way to give an out-of-memory error from within a recursion. - + Performances - + Certain items that may appear in patterns are more efficient than others. It is more efficient to use a character class like [aeiou] than a set of alternatives such as (a|e|i|o|u). @@ -1624,7 +1735,8 @@ required behaviour is usually the most efficient. Jeffrey Friedl's book contains a lot of discussion about optimizing regular expressions for efficient performance. - + + When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is implicitly anchored by PCRE, since it can match only at the start of a subject string. However, if @@ -1634,25 +1746,28 @@ match from the character immediately following one of them instead of from the very start. For example, the pattern - (.*) second + (.*) second matches the subject "first\nand second" (where \n stands for a newline character) with the first captured substring being "and". In order to do this, PCRE has to retry the match starting after every newline in the subject. - + + If you are using such a pattern with subject strings that do not contain newlines, the best performance is obtained by - setting PCRE_DOTALL , or starting the pattern with ^.* to + setting PCRE_DOTALL, or starting the pattern with ^.* to indicate explicit anchoring. That saves PCRE from having to scan along the subject looking for a newline to restart at. - + + Beware of patterns that contain nested indefinite repeats. These can take a long time to run when applied to a string that does not match. 
Consider the pattern fragment - (a+)* - + (a+)* + + This can match "aaaa" in 33 different ways, and this number increases very rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 times, and for each of @@ -1661,11 +1776,12 @@ that the entire match is going to fail, PCRE has in principle to try every possible variation, and this can take an extremely long time. - + + An optimization catches some of the more simple cases such as - (a+)*b + (a+)*b where a literal character follows. Before embarking on the standard matching procedure, PCRE checks that there is a "b" @@ -1674,13 +1790,13 @@ literal this optimization cannot be used. You can see the difference by comparing the behaviour of - (a+)*\d + (a+)*\d with the pattern above. The former gives a failure almost instantly when applied to a whole line of "a" characters, whereas the latter takes an appreciable time with strings longer than about 20 characters. - +