php-doc-en/reference/parle/pattern.matching.xml

<?xml version="1.0" encoding="utf-8"?>
<!-- $Revision$ -->

<chapter xml:id="parle.pattern.matching" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
 <title>Parle pattern matching</title>
 <titleabbrev>Pattern matching</titleabbrev>
 <para>
  Parle supports regex matching similar to flex. Also supported are the following POSIX character sets: <literal>[:alnum:]</literal>, <literal>[:alpha:]</literal>, <literal>[:blank:]</literal>, <literal>[:cntrl:]</literal>, <literal>[:digit:]</literal>, <literal>[:graph:]</literal>, <literal>[:lower:]</literal>, <literal>[:print:]</literal>, <literal>[:punct:]</literal>, <literal>[:space:]</literal>, <literal>[:upper:]</literal> and <literal>[:xdigit:]</literal>.
 </para>
 <para>
  The Unicode character classes are currently not enabled by default, pass --enable-parle-utf32 to make them available. A particular encoding can be mapped with a correctly constructed regex. For example, to match the EURO symbol encoded in UTF-8, the regular expression <literal>[\xe2][\x82][\xac]</literal> can be used. The pattern for an UTF-8 encoded string could be <literal>[ -\x7f]{+}[\x80-\xbf]{+}[\xc2-\xdf]{+}[\xe0-\xef]{+}[\xf0-\xff]+</literal>.
 </para>
 <section xml:id="parle.regex.chars">
  <title>Character representations</title>
  <para>
   <table>
    <title>Character representations</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>\a</entry><entry>Alert (bell).</entry>
      </row>
      <row>
       <entry>\b</entry><entry>Backspace.</entry>
      </row>
      <row>
       <entry>\e</entry><entry>ESC character, \x1b.</entry>
      </row>
      <row>
       <entry>\n</entry><entry>Newline.</entry>
      </row>
      <row>
       <entry>\r</entry><entry>Carriage return.</entry>
      </row>
      <row>
       <entry>\f</entry><entry>Form feed, \x0c.</entry>
      </row>
      <row>
       <entry>\t</entry><entry>Horizontal tab, \x09.</entry>
      </row>
      <row>
       <entry>\v</entry><entry>Vertical tab, \x0b.</entry>
      </row>
      <row>
       <entry>\oct</entry><entry>Character specified by a three-digit octal code.</entry>
      </row>
      <row>
       <entry>\xhex</entry><entry>Character specified by a hex code.</entry>
      </row>
      <row>
       <entry>\cchar</entry><entry>Named control character.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
 <section xml:id="parle.regex.charclass">
  <title>Character classes</title>
  <para>
   <table>
    <title>Character classes</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>[...]</entry><entry>A single character listed or contained within a listed range. Ranges can be combined with the <literal>{+}</literal> and <literal>{-}</literal> operators. For example <literal>[a-z]{+}[0-9]</literal> is the same as <literal>[0-9a-z]</literal> and <literal>[a-z]{-}[aeiou]</literal> is the same as <literal>[b-df-hj-np-tv-z]</literal>.</entry>
      </row>
      <row>
       <entry>[^...]</entry><entry>A single character not listed and not contained within a listed range.</entry>
      </row>
      <row>
       <entry>.</entry><entry>Any character, default <literal>[^\n].</literal></entry>
      </row>
      <row>
       <entry>\d</entry><entry>Digit character, <literal>[0-9]</literal>.</entry>
      </row>
      <row>
       <entry>\D</entry><entry>Non-digit character, <literal>[^0-9]</literal>.</entry>
      </row>
      <row>
       <entry>\s</entry><entry>White space character, <literal>[ \t\n\r\f\v]</literal>.</entry>
      </row>
      <row>
       <entry>\S</entry><entry>Non-white space character, <literal>[^ \t\n\r\f\v]</literal>.</entry>
      </row>
      <row>
       <entry>\w</entry><entry>Word character, <literal>[a-zA-Z0-9_]</literal>.</entry>
      </row>
      <row>
       <entry>\W</entry><entry>Non-word character, <literal>[^a-zA-Z0-9_]</literal>.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
 <section xml:id="parle.regex.unicodecharclass">
  <title>Unicode character classes</title>
  <para>
   <table>
    <title>Unicode character classes</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>\p{C}</entry><entry>Other.</entry>
      </row>
      <row>
       <entry>\p{Cc}</entry><entry>Other, control.</entry>
      </row>
      <row>
       <entry>\p{Cf}</entry><entry>Other, format.</entry>
      </row>
      <row>
       <entry>\p{Co}</entry><entry>Other, private use.</entry>
      </row>
      <row>
       <entry>\p{Cs}</entry><entry>Other, surrogate.</entry>
      </row>
      <row>
       <entry>\p{L}</entry><entry>Letter.</entry>
      </row>
      <row>
       <entry>\p{LC}</entry><entry>Letter, cased.</entry>
      </row>
      <row>
       <entry>\p{Ll}</entry><entry>Letter, lowercase.</entry>
      </row>
      <row>
       <entry>\p{Lm}</entry><entry>Letter, modifier.</entry>
      </row>
      <row>
       <entry>\p{Lo}</entry><entry>Letter, other.</entry>
      </row>
      <row>
       <entry>\p{Lt}</entry><entry>Letter, titlecase.</entry>
      </row>
      <row>
       <entry>\p{Lu}</entry><entry>Letter, uppercase.</entry>
      </row>
      <row>
       <entry>\p{M}</entry><entry>Mark.</entry>
      </row>
      <row>
       <entry>\p{Mc}</entry><entry>Mark, space combining.</entry>
      </row>
      <row>
       <entry>\p{Me}</entry><entry>Mark, enclosing.</entry>
      </row>
      <row>
       <entry>\p{Mn}</entry><entry>Mark, nonspacing.</entry>
      </row>
      <row>
       <entry>\p{N}</entry><entry>Number.</entry>
      </row>
      <row>
       <entry>\p{Nd}</entry><entry>Number, decimal digit.</entry>
      </row>
      <row>
       <entry>\p{Nl}</entry><entry>Number, letter.</entry>
      </row>
      <row>
       <entry>\p{No}</entry><entry>Number, other.</entry>
      </row>
      <row>
       <entry>\p{P}</entry><entry>Punctuation.</entry>
      </row>
      <row>
       <entry>\p{Pc}</entry><entry>Punctiation, connector.</entry>
      </row>
      <row>
       <entry>\p{Pd}</entry><entry>Punctuation, dash.</entry>
      </row>
      <row>
       <entry>\p{Pe}</entry><entry>Punctuation, close.</entry>
      </row>
      <row>
       <entry>\p{Pf}</entry><entry>Punctuation, final quote.</entry>
      </row>
      <row>
       <entry>\p{Pi}</entry><entry>Punctuation, initial quote.</entry>
      </row>
      <row>
       <entry>\p{Po}</entry><entry>Punctuation, other.</entry>
      </row>
      <row>
       <entry>\p{Ps}</entry><entry>Punctuation, open.</entry>
      </row>
      <row>
       <entry>\p{S}</entry><entry>Symbol.</entry>
      </row>
      <row>
       <entry>\p{Sc}</entry><entry>Symbol, currency.</entry>
      </row>
      <row>
       <entry>\p{Sk}</entry><entry>Symbol, modifier.</entry>
      </row>
      <row>
       <entry>\p{Sm}</entry><entry>Symbol, math.</entry>
      </row>
      <row>
       <entry>\p{So}</entry><entry>Symbol, other.</entry>
      </row>
      <row>
       <entry>\p{Z}</entry><entry>Separator.</entry>
      </row>
      <row>
       <entry>\p{Zl}</entry><entry>Separator, line.</entry>
      </row>
      <row>
       <entry>\p{Zp}</entry><entry>Separator, paragraph.</entry>
      </row>
      <row>
       <entry>\p{Zs}</entry><entry>Separator, space.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
  <para>
   These character clasess are only available, if the option --enable-parle-utf32 was passed at the compilation time.
  </para>
 </section>
 <section xml:id="parle.regex.alternation">
  <title>Alternation and repetition</title>
  <para>
   <table>
    <title>Alternation and repetition</title>
    <tgroup cols="3">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Greedy</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>...|...</entry><entry>-</entry><entry>Try subpatterns in alternation.</entry>
      </row>
      <row>
       <entry>*</entry><entry>yes</entry><entry>Match 0 or more times.</entry>
      </row>
      <row>
       <entry>+</entry><entry>yes</entry><entry>Match 1 or more times.</entry>
      </row>
      <row>
       <entry>?</entry><entry>yes</entry><entry>Match 0 or 1 times.</entry>
      </row>
      <row>
       <entry>{n}</entry><entry>no</entry><entry>Match exactly n times.</entry>
      </row>
      <row>
       <entry>{n,}</entry><entry>yes</entry><entry>Match at least n times.</entry>
      </row>
      <row>
       <entry>{n,m}</entry><entry>yes</entry><entry>Match at least n times but no more than m times.</entry>
      </row>
      <row>
       <entry>*?</entry><entry>no</entry><entry>Match 0 or more times.</entry>
      </row>
      <row>
       <entry>+?</entry><entry>no</entry><entry>Match 1 or more times.</entry>
      </row>
      <row>
       <entry>??</entry><entry>no</entry><entry>Match 0 or 1 times.</entry>
      </row>
      <row>
       <entry>{n,}?</entry><entry>no</entry><entry>Match at least n times.</entry>
      </row>
      <row>
       <entry>{n,m}?</entry><entry>no</entry><entry>Match at least n times but no more than m times.</entry>
      </row>
      <row>
       <entry>{MACRO}</entry><entry>-</entry><entry>Include the regex MACRO in the current regex.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
 <section xml:id="parle.regex.anchors">
  <title>Anchors</title>
  <para>
   <table>
    <title>Anchors</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>^</entry><entry>Start of string or after a newline.</entry>
      </row>
      <row>
       <entry>$</entry><entry>End of string or before a newline.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
 <section xml:id="parle.regex.grouping">
  <title>Grouping</title>
  <para>
   <table>
    <title>Grouping</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>(...)</entry><entry>Group a regular expression to override default operator precedence.</entry>
      </row>
      <row>
       <entry valign="top">(?r-s:pattern)</entry>
       <entry>
        Apply option r and omit option s while interpreting pattern. Options may be zero or more of the characters i, s, or x.
        <table>
         <title>Options</title>
          <tgroup cols="2">
           <thead>
            <row>
             <entry>Option</entry><entry>Description</entry>
            </row>
           </thead>
          <tbody>
            <row>
             <entry>i</entry><entry>Case insensitive.</entry>
            </row>
            <row>
             <entry>-i</entry><entry>Case sensitive.</entry>
            </row>
            <row>
             <entry>s</entry><entry>Alters the meaning of '.' to match any character whatsoever.</entry>
            </row>
            <row>
             <entry>-s</entry><entry>Alters the meaning of '.' to match any character except '\n'.</entry>
            </row>
            <row>
             <entry>x</entry><entry>Ignores comments and whitespace in patterns. Whitespace is ignored unless it is backslash-escaped, contained within ""s, or appears inside a character range.</entry>
            </row>
          </tbody>
         </tgroup>
        </table>
        These options can be applied globally at the rules level by passing a combination of the bit flags to the lexer.
       </entry>
      </row>
      <row>
       <entry>(?# comment )</entry><entry>Omit everything within (). The first ) character encountered ends the pattern. It is not possible for the comment to contain a ) character. The comment may span lines.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode: sgml
sgml-omittag:t
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
indent-tabs-mode:nil
sgml-parent-document:nil
sgml-default-dtd-file:"~/.phpdoc/manual.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:nil
sgml-local-ecat-files:nil
End:
vim600: syn=xml fen fdm=syntax fdl=2 si
vim: et tw=78 syn=sgml
vi: ts=1 sw=1
-->