.. _regex-howto:

****************************
  Regular Expression HOWTO
****************************

:Author: A.M. Kuchling <amk@amk.ca>

.. TODO:
   Document lookbehind assertions
   Better way of displaying a RE, a string, and what it matches
   Mention optional argument to match.groups()
   Unicode (at least a reference)


.. topic:: Abstract

   This document is an introductory tutorial to using regular expressions in Python
   with the :mod:`re` module.  It provides a gentler introduction than the
   corresponding section in the Library Reference.


Introduction
============

Regular expressions (called REs, or regexes, or regex patterns) are essentially
a tiny, highly specialized programming language embedded inside Python and made
available through the :mod:`re` module. Using this little language, you specify
the rules for the set of possible strings that you want to match; this set might
contain English sentences, or e-mail addresses, or TeX commands, or anything you
like.  You can then ask questions such as "Does this string match the pattern?",
or "Is there a match for the pattern anywhere in this string?".  You can also
use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are
then executed by a matching engine written in C.  For advanced use, it may be
necessary to pay careful attention to how the engine will execute a given RE,
and write the RE in a certain way in order to produce bytecode that runs faster.
Optimization isn't covered in this document, because it requires that you have a
good understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so not all
possible string processing tasks can be done using regular expressions.  There
are also tasks that *can* be done with regular expressions, but the expressions
turn out to be very complicated.  In these cases, you may be better off writing
Python code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.


Simple Patterns
===============

We'll start by learning about the simplest possible regular expressions.  Since
regular expressions are used to operate on strings, we'll begin with the most
common task: matching characters.

For a detailed explanation of the computer science underlying regular
expressions (deterministic and non-deterministic finite automata), you can refer
to almost any textbook on writing compilers.


Matching Characters
-------------------

Most letters and characters will simply match themselves.  For example, the
regular expression ``test`` will match the string ``test`` exactly.  (You can
enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST``
as well; more about this later.)

There are exceptions to this rule; some characters are special
:dfn:`metacharacters`, and don't match themselves.  Instead, they signal that
some out-of-the-ordinary thing should be matched, or they affect other portions
of the RE by repeating them or changing their meaning.  Much of this document is
devoted to discussing various metacharacters and what they do.

Here's a complete list of the metacharacters; their meanings will be discussed
in the rest of this HOWTO.

.. code-block:: none

   . ^ $ * + ? { } [ ] \ | ( )

The first metacharacters we'll look at are ``[`` and ``]``. They're used for
specifying a character class, which is a set of characters that you wish to
match.  Characters can be listed individually, or a range of characters can be
indicated by giving two characters and separating them by a ``'-'``.  For
example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this
is the same as ``[a-c]``, which uses a range to express the same set of
characters.  If you wanted to match only lowercase letters, your RE would be
``[a-z]``.

Metacharacters (except ``\``) are not active inside classes.  For example, ``[akm$]`` will
match any of the characters ``'a'``, ``'k'``, ``'m'``, or ``'$'``; ``'$'`` is
usually a metacharacter, but inside a character class it's stripped of its
special nature.

You can match the characters not listed within the class by :dfn:`complementing`
the set.  This is indicated by including a ``'^'`` as the first character of the
class. For example, ``[^5]`` will match any character except ``'5'``.  If the
caret appears elsewhere in a character class, it does not have special meaning.
For example: ``[5^]`` will match either a ``'5'`` or a ``'^'``.
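
The character-class rules above can be checked with a short experiment.  This
is only an illustrative sketch: the sample string and the variable names are
made up, and :func:`re.findall`, which returns every match as a list, is used
purely for convenience.

.. code-block:: python

   import re

   sample = "a5^k$c"  # hypothetical sample text

   listed = re.findall(r"[abc]", sample)         # ['a', 'c']
   ranged = re.findall(r"[a-c]", sample)         # same set, written as a range
   with_dollar = re.findall(r"[akm$]", sample)   # '$' is literal here: ['a', 'k', '$']
   not_five = re.findall(r"[^5]", sample)        # every character except '5'
   five_or_caret = re.findall(r"[5^]", sample)   # '^' is not first, so it is literal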

Perhaps the most important metacharacter is the backslash, ``\``.  As in Python
string literals, the backslash can be followed by various characters to signal
various special sequences.  It's also used to escape all the metacharacters so
you can still match them in patterns; for example, if you need to match a ``[``
or ``\``, you can precede them with a backslash to remove their special
meaning: ``\[`` or ``\\``.

Some of the special sequences beginning with ``'\'`` represent
predefined sets of characters that are often useful, such as the set
of digits, the set of letters, or the set of anything that isn't
whitespace.

Let's take an example: ``\w`` matches any alphanumeric character.  If
the regex pattern is expressed in bytes, this is equivalent to the
class ``[a-zA-Z0-9_]``.  If the regex pattern is a string, ``\w`` will
match all the characters marked as letters in the Unicode database
provided by the :mod:`unicodedata` module.  You can use the more
restricted definition of ``\w`` in a string pattern by supplying the
:const:`re.ASCII` flag when compiling the regular expression.
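
A small sketch may make the three behaviours of ``\w`` easier to compare.  The
sample text is an assumption, chosen only because it contains one non-ASCII
letter; the variable names are likewise illustrative.

.. code-block:: python

   import re

   # Hypothetical sample: 'caf', an e with an acute accent, then '_1'.
   text = "caf\u00e9_1"

   # String pattern: \w is Unicode-aware, so the accented letter is included.
   unicode_words = re.findall(r"\w+", text)

   # String pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
   ascii_words = re.findall(r"\w+", text, re.ASCII)      # ['caf', '_1']

   # Bytes pattern: \w only matches ASCII alphanumerics and the underscore.
   bytes_words = re.findall(rb"\w+", text.encode("utf-8"))  # [b'caf', b'_1']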

The following list of special sequences isn't complete. For a complete
list of sequences and expanded class definitions for Unicode string
patterns, see the last part of :ref:`Regular Expression Syntax
<re-syntax>` in the Standard Library reference.  In general, the
Unicode versions match any character that's in the appropriate
category in the Unicode database.

``\d``
   Matches any decimal digit; this is equivalent to the class ``[0-9]``.

``\D``
   Matches any non-digit character; this is equivalent to the class ``[^0-9]``.

``\s``
   Matches any whitespace character; this is equivalent to the class ``[
   \t\n\r\f\v]``.

``\S``
   Matches any non-whitespace character; this is equivalent to the class ``[^
   \t\n\r\f\v]``.

``\w``
   Matches any alphanumeric character; this is equivalent to the class
   ``[a-zA-Z0-9_]``.

``\W``
   Matches any non-alphanumeric character; this is equivalent to the class
   ``[^a-zA-Z0-9_]``.

These sequences can be included inside a character class.  For example,
``[\s,.]`` is a character class that will match any whitespace character, or
``','`` or ``'.'``.
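
As a quick illustration of a special sequence used inside a class, the sketch
below collects every character matched by ``[\s,.]`` in a made-up sentence.
The sentence and names are hypothetical.

.. code-block:: python

   import re

   sentence = "one, two. three"  # hypothetical sample text

   # Each comma, period, and whitespace character is matched individually.
   separators = re.findall(r"[\s,.]", sentence)   # [',', ' ', '.', ' ']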

The final metacharacter in this section is ``.``.  It matches anything except a
newline character, and there's an alternate mode (:const:`re.DOTALL`) where it will
match even a newline.  ``.`` is often used where you want to match "any
character".
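
The effect of :const:`re.DOTALL` can be seen by matching ``.`` against a string
that contains a newline.  This is a minimal sketch with an invented
three-character string.

.. code-block:: python

   import re

   text = "a\nb"  # hypothetical sample: two letters separated by a newline

   default_dot = re.findall(r".", text)             # ['a', 'b']
   dotall_dot = re.findall(r".", text, re.DOTALL)   # ['a', '\n', 'b']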


Repeating Things
----------------

Being able to match varying sets of characters is the first thing regular
expressions can do that isn't already possible with the methods available on
strings.  However, if that was the only additional capability of regexes, they
wouldn't be much of an advance. Another capability is that you can specify that
portions of the RE must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at is ``*``.  ``*``
doesn't match the literal character ``'*'``; instead, it specifies that the
previous character can be matched zero or more times, instead of exactly once.

For example, ``ca*t`` will match ``'ct'`` (0 ``'a'`` characters), ``'cat'`` (1 ``'a'``),
``'caaat'`` (3 ``'a'`` characters), and so forth.
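
The ``ca*t`` example can be tried directly.  The sketch below assumes you want
the *whole* string to match, so it uses the pattern method
:meth:`~re.Pattern.fullmatch`, which this HOWTO has not introduced yet; the
list of candidate strings is made up.

.. code-block:: python

   import re

   pattern = re.compile(r"ca*t")
   candidates = ["ct", "cat", "caaat", "cbt"]  # hypothetical test strings

   # Map each candidate to True if the entire string matches the pattern.
   results = {s: pattern.fullmatch(s) is not None for s in candidates}
   # 'ct', 'cat' and 'caaat' match; 'cbt' does not.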

Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
engine will try to repeat it as many times as possible. If later portions of the
pattern don't match, the matching engine will then back up and try again with
fewer repetitions.

A step-by-step example will make this more obvious.  Let's consider the
expression ``a[bcd]*b``.  This matches the letter ``'a'``, zero or more letters
from the class ``[bcd]``, and finally ends with a ``'b'``.  Now imagine matching
this RE against the string ``'abcbd'``.

+------+-----------+---------------------------------+
| Step | Matched   | Explanation                     |
+======+===========+=================================+
| 1    | ``a``     | The ``a`` in the RE matches.    |
+------+-----------+---------------------------------+
| 2    | ``abcbd`` | The engine matches ``[bcd]*``,  |
|      |           | going as far as it can, which   |
|      |           | is to the end of the string.    |
+------+-----------+---------------------------------+
| 3    | *Failure* | The engine tries to match       |
|      |           | ``b``, but the current position |
|      |           | is at the end of the string, so |
|      |           | it fails.                       |
+------+-----------+---------------------------------+
| 4    | ``abcb``  | Back up, so that ``[bcd]*``     |
|      |           | matches one less character.     |
+------+-----------+---------------------------------+
| 5    | *Failure* | Try ``b`` again, but the        |
|      |           | current position is at the last |
|      |           | character, which is a ``'d'``.  |
+------+-----------+---------------------------------+
| 6    | ``abc``   | Back up again, so that          |
|      |           | ``[bcd]*`` is only matching     |
|      |           | ``bc``.                         |
+------+-----------+---------------------------------+
| 7    | ``abcb``  | Try ``b`` again.  This time     |
|      |           | the character at the            |
|      |           | current position is ``'b'``, so |
|      |           | it succeeds.                    |
+------+-----------+---------------------------------+

The end of the RE has now been reached, and it has matched ``'abcb'``.  This
demonstrates how the matching engine goes as far as it can at first, and if no
match is found it will then progressively back up and retry the rest of the RE
again and again.  It will back up until it has tried zero matches for
``[bcd]*``, and if that subsequently fails, the engine will conclude that the
string doesn't match the RE at all.
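
You can confirm the outcome of this walkthrough in the interpreter.  The sketch
below only shows the final result, not the individual backtracking steps, and
uses :func:`re.match`, which is described later in this HOWTO.  The second
input string is an invented example of a case where backing up never helps.

.. code-block:: python

   import re

   # Greedy matching followed by backtracking settles on 'abcb'.
   m = re.match(r"a[bcd]*b", "abcbd")
   matched = m.group()   # 'abcb'
   span = m.span()       # (0, 4)

   # With no 'b' after the 'a', every amount of backing up fails.
   no_match = re.match(r"a[bcd]*b", "acd")   # None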

Another repeating metacharacter is ``+``, which matches one or more times.  Pay
careful attention to the difference between ``*`` and ``+``; ``*`` matches
*zero* or more times, so whatever's being repeated may not be present at all,
while ``+`` requires at least *one* occurrence.  To use a similar example,
``ca+t`` will match ``'cat'`` (1 ``'a'``), ``'caaat'`` (3 ``'a'``\ s), but won't
match ``'ct'``.

There are two more repeating operators or quantifiers.  The question mark character, ``?``,
matches either once or zero times; you can think of it as marking something as
being optional.  For example, ``home-?brew`` matches either ``'homebrew'`` or
``'home-brew'``.
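
Both quantifiers can be exercised with a few candidate strings.  As before,
this is a sketch under the assumption that whole-string matching is wanted, so
it relies on :func:`re.fullmatch`; the candidate lists are hypothetical.

.. code-block:: python

   import re

   # '+' needs at least one 'a', so 'ct' is rejected.
   plus_matches = [s for s in ["ct", "cat", "caaat"]
                   if re.fullmatch(r"ca+t", s)]

   # '?' allows zero or one hyphen, so a double hyphen is rejected.
   optional_matches = [s for s in ["homebrew", "home-brew", "home--brew"]
                       if re.fullmatch(r"home-?brew", s)]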

The most complicated quantifier is ``{m,n}``, where *m* and *n* are
decimal integers.  This quantifier means there must be at least *m* repetitions,
and at most *n*.  For example, ``a/{1,3}b`` will match ``'a/b'``, ``'a//b'``, and
``'a///b'``.  It won't match ``'ab'``, which has no slashes, or ``'a////b'``, which
has four.

You can omit either *m* or *n*; in that case, a reasonable value is assumed for
the missing value.  Omitting *m* is interpreted as a lower limit of 0, while
omitting *n* results in an upper bound of infinity.
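
The following sketch applies the bounded form and the two open-ended forms to
the same strings.  The patterns ``a/{,2}b`` and ``a/{2,}b`` are illustrative
choices that do not appear in the text above, and :func:`re.fullmatch` is
again assumed so that the whole string must match.

.. code-block:: python

   import re

   strings = ["ab", "a/b", "a//b", "a///b", "a////b"]

   # Between one and three slashes.
   bounded = [s for s in strings if re.fullmatch(r"a/{1,3}b", s)]

   # m omitted: zero to two slashes.
   no_lower = [s for s in strings if re.fullmatch(r"a/{,2}b", s)]

   # n omitted: two or more slashes.
   no_upper = [s for s in strings if re.fullmatch(r"a/{2,}b", s)]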

Readers of a reductionist bent may notice that the three other quantifiers can
all be expressed using this notation.  ``{0,}`` is the same as ``*``, ``{1,}``
is equivalent to ``+``, and ``{0,1}`` is the same as ``?``.  It's better to use
``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier
to read.
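
If you want to convince yourself of these equivalences, a spot check along the
following lines is enough.  It is only a sketch: it compares each short form
with its ``{m,n}`` spelling on three made-up strings, which illustrates the
claim but does not prove it.

.. code-block:: python

   import re

   samples = ["ct", "cat", "caaat"]  # hypothetical test strings
   pairs = [
       (r"ca*t", r"ca{0,}t"),
       (r"ca+t", r"ca{1,}t"),
       (r"ca?t", r"ca{0,1}t"),
   ]

   # True if both spellings agree on every sample string.
   all_agree = all(
       bool(re.fullmatch(short, s)) == bool(re.fullmatch(long, s))
       for short, long in pairs
       for s in samples
   )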


Using Regular Expressions
=========================

Now that we've looked at some simple regular expressions, how do we actually use
them in Python?  The :mod:`re` module provides an interface to the regular
expression engine, allowing you to compile REs into objects and then perform
matches with them.


Compiling Regular Expressions
-----------------------------

Regular expressions are compiled into pattern objects, which have
methods for various operations such as searching for pattern matches or
performing string substitutions. ::

   >>> import re
   >>> p = re.compile('ab*')
   >>> p
   re.compile('ab*')

:func:`re.compile` also accepts an optional *flags* argument, used to enable
various special features and syntax variations.  We'll go over the available
settings later, but for now a single example will do::

   >>> p = re.compile('ab*', re.IGNORECASE)
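
To see what the flag changes, you can compile the same RE with and without
:const:`re.IGNORECASE` and try both on an uppercase string.  This sketch uses
the :meth:`~re.Pattern.match` method, which is covered later in this HOWTO; the
input string and variable names are made up.

.. code-block:: python

   import re

   case_sensitive = re.compile(r"ab*")
   case_insensitive = re.compile(r"ab*", re.IGNORECASE)

   strict = case_sensitive.match("ABBB")      # None: 'A' is not 'a'
   relaxed = case_insensitive.match("ABBB")   # a match object for 'ABBB'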

The RE is passed to :func:`re.compile` as a string.  REs are handled as strings
because regular expressions aren't part of the core Python language, and no
special syntax was created for expressing them.  (There are applications that
don't need REs at all, so there's no need to bloat the language specification by
including them.) Instead, the :mod:`re` module is simply a C extension module
included with Python, just like the :mod:`socket` or :mod:`zlib` modules.

Putting REs in strings keeps the Python language simpler, but has one
disadvantage which is the topic of the next section.


.. _the-backslash-plague:

The Backslash Plague
--------------------

As stated earlier, regular expressions use the backslash character (``'\'``) to
indicate special forms or to allow special characters to be used without
invoking their special meaning. This conflicts with Python's usage of the same
character for the same purpose in string literals.

Let's say you want to write a RE that matches the string ``\section``, which
might be found in a LaTeX file.  To figure out what to write in the program
code, start with the desired string to be matched.  Next, you must escape any
backslashes and other metacharacters by preceding them with a backslash,
resulting in the string ``\\section``.  This is the string that must be passed
to :func:`re.compile`.  However, to express it as a Python string literal,
both backslashes must be escaped *again*.

+-------------------+------------------------------------------+
| Characters        | Stage                                    |
+===================+==========================================+
| ``\section``      | Text string to be matched                |
+-------------------+------------------------------------------+
| ``\\section``     | Escaped backslash for :func:`re.compile` |
+-------------------+------------------------------------------+
| ``"\\\\section"`` | Escaped backslashes for a string literal |
+-------------------+------------------------------------------+

In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE
string, because the regular expression must be ``\\``, and each backslash must
be expressed as ``\\`` inside a regular Python string literal.  In REs that
feature backslashes repeatedly, this leads to lots of repeated backslashes and
makes the resulting strings difficult to understand.

The solution is to use Python's raw string notation for regular expressions;
backslashes are not handled in any special way in a string literal prefixed with
``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``,
while ``"\n"`` is a one-character string containing a newline. Regular
expressions will often be written in Python code using this raw string notation.
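
The two spellings of the ``\section`` pattern can be compared directly.  In
this sketch the LaTeX line being matched is an invented example, and
:func:`re.match` is used ahead of its introduction later in this HOWTO.

.. code-block:: python

   import re

   escaped = "\\\\section"   # regular string literal: four backslashes typed
   raw = r"\\section"        # raw string literal: two backslashes typed

   # Both literals produce the same pattern text.
   same_pattern = escaped == raw   # True

   latex_line = r"\section{Introduction}"  # hypothetical input text
   m = re.match(raw, latex_line)
   matched = m.group()   # a backslash followed by 'section'

   # "\n" is one character (a newline); r"\n" is a backslash and an 'n'.
   newline_lengths = (len("\n"), len(r"\n"))   # (1, 2)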

In addition, special escape sequences that are valid in regular expressions,
but not valid as Python string literals, now result in a
:exc:`DeprecationWarning` and will eventually become a :exc:`SyntaxError`,
which means the sequences will be invalid if raw string notation or escaping
the backslashes isn't used.


+-------------------+------------------+
| Regular String    | Raw string       |
+===================+==================+
| ``"ab*"``         | ``r"ab*"``       |
+-------------------+------------------+
| ``"\\\\section"`` | ``r"\\section"`` |
+-------------------+------------------+
| ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"``  |
+-------------------+------------------+


Performing Matches
------------------

Once you have an object representing a compiled regular expression, what do you
do with it?  Pattern objects have several methods and attributes.
Only the most significant ones will be covered here; consult the :mod:`re` docs
for a complete listing.

+------------------+-----------------------------------------------+
| Method/Attribute | Purpose                                       |
+==================+===============================================+
| ``match()``      | Determine if the RE matches at the beginning  |
|                  | of the string.                                |
+------------------+-----------------------------------------------+
| ``search()``     | Scan through a string, looking for any        |
|                  | location where this RE matches.               |
+------------------+-----------------------------------------------+
| ``findall()``    | Find all substrings where the RE matches, and |
|                  | return them as a list.                        |
+------------------+-----------------------------------------------+
| ``finditer()``   | Find all substrings where the RE matches, and |
|                  | return them as an :term:`iterator`.           |
+------------------+-----------------------------------------------+

:meth:`~re.Pattern.match` and :meth:`~re.Pattern.search` return ``None`` if no match can be found.  If
they're successful, a :ref:`match object <match-objects>` instance is returned,
containing information about the match: where it starts and ends, the substring
it matched, and more.

You can learn about this by interactively experimenting with the :mod:`re`
module.  If you have :mod:`tkinter` available, you may also want to look at
:source:`Tools/demo/redemo.py`, a demonstration program included with the
Python distribution.  It allows you to enter REs and strings, and displays
whether the RE matches or fails. :file:`redemo.py` can be quite useful when
trying to debug a complicated RE.

This HOWTO uses the standard Python interpreter for its examples. First, run the
Python interpreter, import the :mod:`re` module, and compile a RE::

   >>> import re
   >>> p = re.compile('[a-z]+')
   >>> p
   re.compile('[a-z]+')

Now, you can try matching various strings against the RE ``[a-z]+``.  An empty
string shouldn't match at all, since ``+`` means 'one or more repetitions'.
:meth:`~re.Pattern.match` should return ``None`` in this case, which will cause the
interpreter to print no output.  You can explicitly print the result of
:meth:`!match` to make this clear. ::

   >>> p.match("")
   >>> print(p.match(""))
   None

Now, let's try it on a string that it should match, such as ``tempo``.  In this
case, :meth:`~re.Pattern.match` will return a :ref:`match object <match-objects>`, so you
should store the result in a variable for later use. ::

   >>> m = p.match('tempo')
   >>> m
   <re.Match object; span=(0, 5), match='tempo'>
4127db96d56Sopenharmony_ci
Now you can query the :ref:`match object <match-objects>` for information
about the matching string.  Match object instances
also have several methods and attributes; the most important ones are:

+------------------+--------------------------------------------+
| Method/Attribute | Purpose                                    |
+==================+============================================+
| ``group()``      | Return the string matched by the RE        |
+------------------+--------------------------------------------+
| ``start()``      | Return the starting position of the match  |
+------------------+--------------------------------------------+
| ``end()``        | Return the ending position of the match    |
+------------------+--------------------------------------------+
| ``span()``       | Return a tuple containing the (start, end) |
|                  | positions of the match                     |
+------------------+--------------------------------------------+

Trying these methods will soon clarify their meaning::

   >>> m.group()
   'tempo'
   >>> m.start(), m.end()
   (0, 5)
   >>> m.span()
   (0, 5)

:meth:`~re.Match.group` returns the substring that was matched by the RE.  :meth:`~re.Match.start`
and :meth:`~re.Match.end` return the starting and ending index of the match. :meth:`~re.Match.span`
returns both start and end indexes in a single tuple.  Since the :meth:`~re.Pattern.match`
method only checks if the RE matches at the start of a string, :meth:`!start`
will always be zero.  However, the :meth:`~re.Pattern.search` method of patterns
scans through the string, so the match may not start at zero in that
case. ::

   >>> print(p.match('::: message'))
   None
   >>> m = p.search('::: message'); print(m)
   <re.Match object; span=(4, 11), match='message'>
   >>> m.group()
   'message'
   >>> m.span()
   (4, 11)

In actual programs, the most common style is to store the
:ref:`match object <match-objects>` in a variable, and then check if it was
``None``.  This usually looks like::

   p = re.compile( ... )
   m = p.match( 'string goes here' )
   if m:
       print('Match found: ', m.group())
   else:
       print('No match')
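
As a minimal sketch of this idiom with the placeholders filled in, the
following uses the pattern ``[a-z]+`` and a helper named ``first_word``; both
are assumptions chosen only for illustration:

.. code-block:: python

   import re

   # Illustrative pattern (an assumption): one or more lowercase letters.
   word = re.compile(r'[a-z]+')

   def first_word(text):
       """Return the lowercase word at the start of text, or None."""
       m = word.match(text)
       if m is None:          # match() returns None when nothing matches
           return None
       return m.group()

   print(first_word('tempo 120'))   # tempo
   print(first_word('120 tempo'))   # None

Testing ``m is None`` and testing ``if m:`` behave the same way here, because
a match object is always true.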

Two pattern methods return all of the matches for a pattern.
:meth:`~re.Pattern.findall` returns a list of matching strings::

   >>> p = re.compile(r'\d+')
   >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
   ['12', '11', '10']

The ``r`` prefix, which makes the literal a raw string literal, is needed in
this example because ``\d`` is an escape sequence that regular expressions
recognize but Python's normal "cooked" string literals do not.  Such
unrecognized escape sequences now result in a :exc:`DeprecationWarning`
(a :exc:`SyntaxWarning` as of Python 3.12) and will eventually become a
:exc:`SyntaxError`.  See :ref:`the-backslash-plague`.

:meth:`~re.Pattern.findall` has to create the entire list before it can be returned as the
result.  The :meth:`~re.Pattern.finditer` method returns a sequence of
:ref:`match object <match-objects>` instances as an :term:`iterator`::

   >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
   >>> iterator  #doctest: +ELLIPSIS
   <callable_iterator object at 0x...>
   >>> for match in iterator:
   ...     print(match.span())
   ...
   (0, 2)
   (22, 24)
   (29, 31)
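
To make the difference between the two methods concrete, here is a small
sketch that runs both on the same string.  The variable names are
illustrative assumptions:

.. code-block:: python

   import re

   digits = re.compile(r'\d+')
   text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

   # findall() returns only the matched strings.
   numbers = digits.findall(text)

   # finditer() yields match objects one at a time, so the position of
   # each match is available as well as its text.
   located = [(m.group(), m.span()) for m in digits.finditer(text)]

   print(numbers)
   print(located)

Because :meth:`~re.Pattern.finditer` produces its results lazily, a loop over
it can stop early without the remaining matches ever being computed.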


Module-Level Functions
----------------------

You don't have to create a pattern object and call its methods; the
:mod:`re` module also provides top-level functions called :func:`~re.match`,
:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth.  These functions
take the same arguments as the corresponding pattern method with
the RE string added as the first argument, and still return either ``None`` or a
:ref:`match object <match-objects>` instance. ::

   >>> print(re.match(r'From\s+', 'Fromage amk'))
   None
   >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')  #doctest: +ELLIPSIS
   <re.Match object; span=(0, 5), match='From '>

Under the hood, these functions simply create a pattern object for you
and call the appropriate method on it.  They also store the compiled
object in a cache, so future calls using the same RE won't need to
parse the pattern again and again.

Should you use these module-level functions, or should you get the
pattern and call its methods yourself?  If you're accessing a regex
within a loop, pre-compiling it will save a few function calls.
Outside of loops, there's not much difference thanks to the internal
cache.
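
The two styles can be compared side by side.  This is only a sketch; the
names ``from_re`` and ``lines`` are assumptions made for the example:

.. code-block:: python

   import re

   line = 'From amk Thu May 14 19:12:10 1998'

   # Module-level function: the RE string is passed as the first argument.
   m1 = re.match(r'From\s+', line)

   # Pattern-object form: compile once, then call the method.
   from_re = re.compile(r'From\s+')
   m2 = from_re.match(line)

   print(m1.span(), m2.span())

   # Inside a loop, the pre-compiled object avoids repeating the cache
   # lookup on every iteration.
   lines = [line, 'Fromage amk', 'From: nobody']
   hits = [text for text in lines if from_re.match(text)]
   print(hits)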


Compilation Flags
-----------------

Compilation flags let you modify some aspects of how regular expressions work.
Flags are available in the :mod:`re` module under two names, a long name such as
:const:`IGNORECASE` and a short, one-letter form such as :const:`I`.  (If you're
familiar with Perl's pattern modifiers, the one-letter forms use the same
letters; the short form of :const:`re.VERBOSE` is :const:`re.X`, for example.)
Multiple flags can be specified by bitwise OR-ing them; ``re.I | re.M`` sets
both the :const:`I` and :const:`M` flags, for example.

Here's a table of the available flags, followed by a more detailed explanation
of each one.

+---------------------------------+--------------------------------------------+
| Flag                            | Meaning                                    |
+=================================+============================================+
| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
|                                 | ``\s`` and ``\d`` match only on ASCII      |
|                                 | characters with the respective property.   |
+---------------------------------+--------------------------------------------+
| :const:`DOTALL`, :const:`S`     | Make ``.`` match any character, including  |
|                                 | newlines.                                  |
+---------------------------------+--------------------------------------------+
| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches.               |
+---------------------------------+--------------------------------------------+
| :const:`LOCALE`, :const:`L`     | Do a locale-aware match.                   |
+---------------------------------+--------------------------------------------+
| :const:`MULTILINE`, :const:`M`  | Multi-line matching, affecting ``^`` and   |
|                                 | ``$``.                                     |
+---------------------------------+--------------------------------------------+
| :const:`VERBOSE`, :const:`X`    | Enable verbose REs, which can be organized |
| (for 'extended')                | more cleanly and understandably.           |
+---------------------------------+--------------------------------------------+

.. data:: I
          IGNORECASE
   :noindex:

   Perform case-insensitive matching; character class and literal strings will
   match letters by ignoring case.  For example, ``[A-Z]`` will match lowercase
   letters, too. Full Unicode matching also works unless the :const:`ASCII`
   flag is used to disable non-ASCII matches.  When the Unicode patterns
   ``[a-z]`` or ``[A-Z]`` are used in combination with the :const:`IGNORECASE`
   flag, they will match the 52 ASCII letters and 4 additional non-ASCII
   letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
   Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and
   'K' (U+212A, Kelvin sign).  ``Spam`` will match ``'Spam'``, ``'spam'``,
   ``'spAM'``, or ``'ſpam'`` (the latter is matched only in Unicode mode).
   This lowercasing doesn't take the current locale into account;
   it will if you also set the :const:`LOCALE` flag.
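
   The ``Spam`` example can be tried directly.  This is a small sketch; the
   list of candidate strings is an assumption, and ``'\u017f'`` is the long s
   mentioned above:

   .. code-block:: python

      import re

      candidates = ['Spam', 'spam', 'spAM', '\u017fpam', 'eggs']

      # Unicode mode (the default for str patterns).
      spam = re.compile(r'Spam', re.IGNORECASE)
      matched = [s for s in candidates if spam.fullmatch(s)]

      # Adding ASCII restricts case-insensitive matching to ASCII letters.
      ascii_spam = re.compile(r'Spam', re.IGNORECASE | re.ASCII)
      ascii_matched = [s for s in candidates if ascii_spam.fullmatch(s)]

      print(matched)
      print(ascii_matched)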


.. data:: L
          LOCALE
   :noindex:

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching dependent
   on the current locale instead of the Unicode database.

   Locales are a feature of the C library intended to help in writing programs
   that take account of language differences.  For example, if you're
   processing encoded French text, you'd want to be able to write ``\w+`` to
   match words, but ``\w`` only matches the character class ``[A-Za-z]`` in
   bytes patterns; it won't match bytes corresponding to ``é`` or ``ç``.
   If your system is configured properly and a French locale is selected,
   certain C functions will tell the program that the byte corresponding to
   ``é`` should also be considered a letter.
   Setting the :const:`LOCALE` flag when compiling a regular expression will cause
   the resulting compiled object to use these C functions for ``\w``; this is
   slower, but also enables ``\w+`` to match French words as you'd expect.
   The use of this flag is discouraged in Python 3 as the locale mechanism
   is very unreliable, it only handles one "culture" at a time, and it only
   works with 8-bit locales.  Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.


.. data:: M
          MULTILINE
   :noindex:

   (``^`` and ``$`` haven't been explained yet; they'll be introduced in section
   :ref:`more-metacharacters`.)

   Usually ``^`` matches only at the beginning of the string, and ``$`` matches
   only at the end of the string and immediately before the newline (if any) at the
   end of the string. When this flag is specified, ``^`` matches at the beginning
   of the string and at the beginning of each line within the string, immediately
   following each newline.  Similarly, the ``$`` metacharacter matches both at
   the end of the string and at the end of each line (immediately preceding each
   newline).
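
   A short sketch of the difference, using an assumed three-line string:

   .. code-block:: python

      import re

      text = 'one\ntwo\nthree'

      # Without MULTILINE, ^ and $ only see the ends of the whole string.
      starts_default = re.findall(r'^\w+', text)
      ends_default = re.findall(r'\w+$', text)

      # With MULTILINE, every line start and line end qualifies.
      starts_multi = re.findall(r'^\w+', text, re.MULTILINE)
      ends_multi = re.findall(r'\w+$', text, re.MULTILINE)

      print(starts_default, ends_default)
      print(starts_multi, ends_multi)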


.. data:: S
          DOTALL
   :noindex:

   Makes the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
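
   For instance, in this sketch (the sample text is an assumption) ``.*``
   stops at the first newline unless :const:`DOTALL` is given:

   .. code-block:: python

      import re

      text = 'line one\nline two'

      # By default '.' refuses to cross the newline.
      default_match = re.match(r'.*', text).group()

      # With DOTALL, '.' matches the newline as well.
      dotall_match = re.match(r'.*', text, re.DOTALL).group()

      print(repr(default_match))
      print(repr(dotall_match))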


.. data:: A
          ASCII
   :noindex:

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching. This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
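
   The effect is easiest to see on text that contains non-ASCII letters or
   digits.  In this sketch the sample strings are assumptions; ``'\u00e9'`` is
   ``é`` and ``'\u0663'`` is the Arabic-Indic digit three:

   .. code-block:: python

      import re

      word = 'caf\u00e9'
      digits = '3\u0663'

      # Unicode matching (the default for str patterns).
      unicode_words = re.findall(r'\w+', word)
      unicode_digits = re.findall(r'\d', digits)

      # ASCII-only matching.
      ascii_words = re.findall(r'\w+', word, re.ASCII)
      ascii_digits = re.findall(r'\d', digits, re.ASCII)

      print(unicode_words, ascii_words)
      print(unicode_digits, ascii_digits)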


.. data:: X
          VERBOSE
   :noindex:

   This flag allows you to write regular expressions that are more readable by
   granting you more flexibility in how you can format them.  When this flag has
   been specified, whitespace within the RE string is ignored, except when the
   whitespace is in a character class or preceded by an unescaped backslash; this
   lets you organize and indent the RE more clearly.  This flag also lets you put
   comments within a RE that will be ignored by the engine; comments are marked by
   a ``'#'`` that's neither in a character class nor preceded by an unescaped
   backslash.

   For example, here's a RE that uses :const:`re.VERBOSE`; see how much easier it
   is to read? ::

      charref = re.compile(r"""
       &[#]                # Start of a numeric entity reference
       (
           0[0-7]+         # Octal form
         | [0-9]+          # Decimal form
         | x[0-9a-fA-F]+   # Hexadecimal form
       )
       ;                   # Trailing semicolon
      """, re.VERBOSE)

   Without the verbose setting, the RE would look like this::

      charref = re.compile("&#(0[0-7]+"
                           "|[0-9]+"
                           "|x[0-9a-fA-F]+);")

   In the above example, Python's automatic concatenation of string literals has
   been used to break up the RE into smaller pieces, but it's still more difficult
   to understand than the version using :const:`re.VERBOSE`.
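
   As a quick check that the two spellings describe the same pattern, this
   sketch compiles both and tries them on a few sample strings.  The names
   ``verbose``, ``compact`` and ``samples`` are assumptions made for the
   example:

   .. code-block:: python

      import re

      verbose = re.compile(r"""
       &[#]                # Start of a numeric entity reference
       (
           0[0-7]+         # Octal form
         | [0-9]+          # Decimal form
         | x[0-9a-fA-F]+   # Hexadecimal form
       )
       ;                   # Trailing semicolon
      """, re.VERBOSE)

      compact = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

      samples = ['&#65;', '&#x41;', '&#0101;', '&amp;']
      for sample in samples:
          v = verbose.match(sample)
          c = compact.match(sample)
          # Both patterns agree on whether the sample matches and on group 1.
          print(sample, v.group(1) if v else None, c.group(1) if c else None)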


More Pattern Power
==================

So far we've only covered a part of the features of regular expressions.  In
this section, we'll cover some new metacharacters, and how to use groups to
retrieve portions of the text that was matched.


.. _more-metacharacters:

More Metacharacters
-------------------

There are some metacharacters that we haven't covered yet.  Most of them will be
covered in this section.

Some of the remaining metacharacters to be discussed are :dfn:`zero-width
assertions`.  They don't cause the engine to advance through the string;
instead, they consume no characters at all, and simply succeed or fail.  For
example, ``\b`` is an assertion that the current position is located at a word
boundary; the position isn't changed by the ``\b`` at all.  This means that
zero-width assertions should never be repeated, because if they match once at a
given location, they can obviously be matched an infinite number of times.

``|``
   Alternation, or the "or" operator.   If *A* and *B* are regular expressions,
   ``A|B`` will match any string that matches either *A* or *B*. ``|`` has very
   low precedence in order to make it work reasonably when you're alternating
   multi-character strings. ``Crow|Servo`` will match either ``'Crow'`` or ``'Servo'``,
   not ``'Cro'``, a ``'w'`` or an ``'S'``, and ``'ervo'``.

   To match a literal ``'|'``, use ``\|``, or enclose it inside a character class,
   as in ``[|]``.

``^``
   Matches at the beginning of lines.  Unless the :const:`MULTILINE` flag has been
   set, this will only match at the beginning of the string.  In :const:`MULTILINE`
   mode, this also matches immediately after each newline within the string.

   For example, if you wish to match the word ``From`` only at the beginning of a
   line, the RE to use is ``^From``. ::

      >>> print(re.search('^From', 'From Here to Eternity'))  #doctest: +ELLIPSIS
      <re.Match object; span=(0, 4), match='From'>
      >>> print(re.search('^From', 'Reciting From Memory'))
      None

   To match a literal ``'^'``, use ``\^``.

``$``
   Matches at the end of a line, which is defined as either the end of the string,
   or any location followed by a newline character. ::

      >>> print(re.search('}$', '{block}'))  #doctest: +ELLIPSIS
      <re.Match object; span=(6, 7), match='}'>
      >>> print(re.search('}$', '{block} '))
      None
      >>> print(re.search('}$', '{block}\n'))  #doctest: +ELLIPSIS
      <re.Match object; span=(6, 7), match='}'>

   To match a literal ``'$'``, use ``\$`` or enclose it inside a character class,
   as in ``[$]``.

``\A``
   Matches only at the start of the string.  When not in :const:`MULTILINE` mode,
   ``\A`` and ``^`` are effectively the same.  In :const:`MULTILINE` mode, they're
   different: ``\A`` still matches only at the beginning of the string, but ``^``
   may match at any location inside the string that follows a newline character.

``\Z``
   Matches only at the end of the string.

``\b``
   Word boundary.  This is a zero-width assertion that matches only at the
   beginning or end of a word.  A word is defined as a sequence of alphanumeric
   characters, so the end of a word is indicated by whitespace or a
   non-alphanumeric character.

   The following example matches ``class`` only when it's a complete word; it won't
   match when it's contained inside another word. ::

      >>> p = re.compile(r'\bclass\b')
      >>> print(p.search('no class at all'))
      <re.Match object; span=(3, 8), match='class'>
      >>> print(p.search('the declassified algorithm'))
      None
      >>> print(p.search('one subclass is'))
      None

   There are two subtleties you should remember when using this special sequence.
   First, this is the worst collision between Python's string literals and regular
   expression sequences.  In Python's string literals, ``\b`` is the backspace
   character, ASCII value 8.  If you're not using raw strings, then Python will
   convert the ``\b`` to a backspace, and your RE won't match as you expect it to.
   The following example looks the same as our previous RE, but omits the ``'r'``
   in front of the RE string. ::

      >>> p = re.compile('\bclass\b')
      >>> print(p.search('no class at all'))
      None
      >>> print(p.search('\b' + 'class' + '\b'))
      <re.Match object; span=(0, 7), match='\x08class\x08'>

   Second, inside a character class, where there's no use for this assertion,
   ``\b`` represents the backspace character, for compatibility with Python's
   string literals.

``\B``
   Another zero-width assertion, this is the opposite of ``\b``, only matching when
   the current position is not at a word boundary.
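
The entries for ``\A``, ``\Z`` and ``\B`` above have no examples of their own,
so here is a small sketch that contrasts them with ``^``, ``$`` and ``\b``.
The sample strings and variable names are assumptions made for illustration:

.. code-block:: python

   import re

   text = 'From here\nFrom there\n'

   # In MULTILINE mode ^ matches at every line start, \A only at the
   # start of the string.
   caret_hits = re.findall(r'^From', text, re.MULTILINE)
   start_hits = re.findall(r'\AFrom', text, re.MULTILINE)

   # $ also matches just before a trailing newline; \Z matches only at
   # the very end of the string.
   dollar_match = re.search(r'there$', text)
   end_match = re.search(r'there\Z', text)

   # \B succeeds only where there is no word boundary.
   inner = re.search(r'\Bclass\B', 'the declassified algorithm')
   whole = re.search(r'\Bclass\B', 'no class at all')

   print(caret_hits, start_hits)
   print(dollar_match.span(), end_match)
   print(inner.span(), whole)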


Grouping
--------

Frequently you need to obtain more information than just whether the RE matched
or not.  Regular expressions are often used to dissect strings by writing a RE
divided into several subgroups which match different components of interest.
For example, an RFC-822 header line is divided into a header name and a value,
separated by a ``':'``, like this:

.. code-block:: none

   From: author@example.com
   User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
   MIME-Version: 1.0
   To: editor@example.com

This can be handled by writing a regular expression which matches an entire
header line, and has one group which matches the header name, and another group
which matches the header's value.

Groups are marked by the ``'('``, ``')'`` metacharacters. ``'('`` and ``')'``
have much the same meaning as they do in mathematical expressions; they group
together the expressions contained inside them, and you can repeat the contents
of a group with a quantifier, such as ``*``, ``+``, ``?``, or
``{m,n}``.  For example, ``(ab)*`` will match zero or more repetitions of
``ab``. ::

   >>> p = re.compile('(ab)*')
   >>> print(p.match('ababababab').span())
   (0, 10)

Groups indicated with ``'('``, ``')'`` also capture the starting and ending
index of the text that they match; this can be retrieved by passing an argument
to :meth:`~re.Match.group`, :meth:`~re.Match.start`, :meth:`~re.Match.end`, and
:meth:`~re.Match.span`.  Groups are
numbered starting with 0.  Group 0 is always present; it's the whole RE, so
:ref:`match object <match-objects>` methods all have group 0 as their default
argument.  Later we'll see how to express groups that don't capture the span
of text that they match. ::

   >>> p = re.compile('(a)b')
   >>> m = p.match('ab')
   >>> m.group()
   'ab'
   >>> m.group(0)
   'ab'

Subgroups are numbered from left to right, from 1 upward.  Groups can be nested;
to determine the number, just count the opening parenthesis characters, going
from left to right. ::

   >>> p = re.compile('(a(b)c)d')
   >>> m = p.match('abcd')
   >>> m.group(0)
   'abcd'
   >>> m.group(1)
   'abc'
   >>> m.group(2)
   'b'
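
The same group numbers can be passed to :meth:`~re.Match.start`,
:meth:`~re.Match.end` and :meth:`~re.Match.span`.  A small sketch, using an
assumed pattern for a page range:

.. code-block:: python

   import re

   m = re.search(r'(\d+)-(\d+)', 'pages 10-25')

   # With no argument (or 0) the methods describe the whole match.
   whole = (m.group(), m.span())

   # With a group number they describe just that subgroup.
   first = (m.group(1), m.start(1), m.end(1))
   second = (m.group(2), m.span(2))

   print(whole)
   print(first)
   print(second)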

:meth:`~re.Match.group` can be passed multiple group numbers at a time, in which case it
will return a tuple containing the corresponding values for those groups. ::

   >>> m.group(2,1,2)
   ('b', 'abc', 'b')

The :meth:`~re.Match.groups` method returns a tuple containing the strings for all the
subgroups, from 1 up to however many there are. ::

   >>> m.groups()
   ('abc', 'b')
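
Returning to the header lines shown at the start of this section, here is one
possible sketch of a two-group pattern.  The pattern itself and the name
``header`` are assumptions made for illustration, not a complete RFC-822
parser:

.. code-block:: python

   import re

   # Group 1: the header name (anything up to the colon).
   # Group 2: the header value (the rest of the line).
   header = re.compile(r'([^:\s]+):\s*(.*)')

   lines = [
       'From: author@example.com',
       'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
       'MIME-Version: 1.0',
       'To: editor@example.com',
   ]

   # groups() returns (name, value) for each line.
   parsed = [header.match(line).groups() for line in lines]
   for name, value in parsed:
       print(name, '->', value)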

Backreferences in a pattern allow you to specify that the contents of an earlier
capturing group must also be found at the current location in the string.  For
example, ``\1`` will succeed if the exact contents of group 1 can be found at
the current position, and fails otherwise.  Remember that Python's string
literals also use a backslash followed by numbers to allow including arbitrary
characters in a string, so be sure to use a raw string when incorporating
backreferences in a RE.

For example, the following RE detects doubled words in a string. ::

   >>> p = re.compile(r'\b(\w+)\s+\1\b')
   >>> p.search('Paris in the the spring').group()
   'the the'
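
The following sketch applies the same RE to a string with more than one
doubled word, and shows why the raw string matters.  The sample text is an
assumption made for illustration:

.. code-block:: python

   import re

   # Raw string: the backslash and the digit both reach the regex engine.
   doubled = re.compile(r'\b(\w+)\s+\1\b')

   text = 'Paris in the the spring is is nice'
   repeated = [m.group(1) for m in doubled.finditer(text)]
   print(repeated)

   # Without the r prefix, Python itself interprets '\1' as an octal
   # escape, producing the single character chr(1).
   print(len(r'\1'), len('\1'))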
8687db96d56Sopenharmony_ci
8697db96d56Sopenharmony_ciBackreferences like this aren't often useful for just searching through a string
8707db96d56Sopenharmony_ci--- there are few text formats which repeat data in this way --- but you'll soon
8717db96d56Sopenharmony_cifind out that they're *very* useful when performing string substitutions.
8727db96d56Sopenharmony_ci
8737db96d56Sopenharmony_ci
8747db96d56Sopenharmony_ciNon-capturing and Named Groups
8757db96d56Sopenharmony_ci------------------------------
8767db96d56Sopenharmony_ci
8777db96d56Sopenharmony_ciElaborate REs may use many groups, both to capture substrings of interest, and
8787db96d56Sopenharmony_cito group and structure the RE itself.  In complex REs, it becomes difficult to
8797db96d56Sopenharmony_cikeep track of the group numbers.  There are two features which help with this
problem.  Both of them use a common syntax for regular expression extensions, so
we'll look at that first.

Perl 5 is well known for its powerful additions to standard regular expressions.
For these new features the Perl developers couldn't choose new single-keystroke metacharacters
or new special sequences beginning with ``\`` without making Perl's regular
expressions confusingly different from standard REs.  If they had chosen ``&`` as a
new metacharacter, for example, old expressions would have assumed that ``&`` was
a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.

The solution chosen by the Perl developers was to use ``(?...)`` as the
extension syntax.  ``?`` immediately after a parenthesis was a syntax error
because the ``?`` would have nothing to repeat, so this didn't introduce any
compatibility problems.  The characters immediately after the ``?`` indicate
what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
assertion) and ``(?:foo)`` is something else (a non-capturing group containing
the subexpression ``foo``).

Python supports several of Perl's extensions and adds its own extension
syntax on top of Perl's.  If the first character after the
question mark is a ``P``, you know that it's an extension that's
specific to Python.

Now that we've looked at the general extension syntax, we can return
to the features that simplify working with groups in complex REs.

Sometimes you'll want to use a group to denote a part of a regular expression,
but aren't interested in retrieving the group's contents. You can make this fact
explicit by using a non-capturing group: ``(?:...)``, where you can replace the
``...`` with any other regular expression. ::

   >>> m = re.match("([abc])+", "abc")
   >>> m.groups()
   ('c',)
   >>> m = re.match("(?:[abc])+", "abc")
   >>> m.groups()
   ()

Except for the fact that you can't retrieve the contents of what the group
matched, a non-capturing group behaves exactly the same as a capturing group;
you can put anything inside it, repeat it with a repetition metacharacter such
as ``*``, and nest it within other groups (capturing or non-capturing).
``(?:...)`` is particularly useful when modifying an existing pattern, since you
can add new groups without changing how all the other groups are numbered.  It
should be mentioned that there's no performance difference in searching between
capturing and non-capturing groups; neither form is any faster than the other.

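As a small sketch of the numbering point (the patterns and the sample text are
made up for illustration), adding a capturing group in front of an existing
pattern shifts the later group numbers, while a non-capturing group leaves them
alone::

   import re

   # Hypothetical starting pattern: group 1 is the year, group 2 the month.
   original = re.compile(r'(\d{4})-(\d{2})')

   # Adding an optional prefix as a capturing group shifts every later group.
   capturing = re.compile(r'(date:\s*)?(\d{4})-(\d{2})')

   # Using (?:...) for the new group keeps the old numbering intact.
   non_capturing = re.compile(r'(?:date:\s*)?(\d{4})-(\d{2})')

   text = 'date: 2024-05'
   print(original.search(text).group(1))       # 2024
   print(capturing.match(text).group(1))       # the 'date: ' prefix
   print(non_capturing.match(text).group(1))   # 2024
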
A more significant feature is named groups: instead of referring to them by
numbers, groups can be referenced by a name.

The syntax for a named group is one of the Python-specific extensions:
``(?P<name>...)``.  *name* is, obviously, the name of the group.  Named groups
behave exactly like capturing groups, and additionally associate a name
with a group.  The :ref:`match object <match-objects>` methods that deal with
capturing groups all accept either integers that refer to the group by number
or strings that contain the desired group's name.  Named groups are still
given numbers, so you can retrieve information about a group in two ways::

   >>> p = re.compile(r'(?P<word>\b\w+\b)')
   >>> m = p.search('(((( Lots of punctuation )))')
   >>> m.group('word')
   'Lots'
   >>> m.group(1)
   'Lots'

Additionally, you can retrieve named groups as a dictionary with
:meth:`~re.Match.groupdict`::

   >>> m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe')
   >>> m.groupdict()
   {'first': 'Jane', 'last': 'Doe'}

Named groups are handy because they let you use easily remembered names, instead
of having to remember numbers.  Here's an example RE from the :mod:`imaplib`
module::

   InternalDate = re.compile(r'INTERNALDATE "'
           r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
           r'(?P<year>[0-9][0-9][0-9][0-9])'
           r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
           r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
           r'"')

It's obviously much easier to retrieve ``m.group('zonem')`` than to
remember to retrieve group 9.

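A minimal usage sketch follows.  The pattern is copied from above; the sample
string is invented for this example and is not taken from :mod:`imaplib`
itself::

   import re

   InternalDate = re.compile(r'INTERNALDATE "'
           r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
           r'(?P<year>[0-9][0-9][0-9][0-9])'
           r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
           r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
           r'"')

   # Hypothetical sample input, made up for illustration.
   sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
   m = InternalDate.search(sample)

   print(m.group('zonem'))                 # 00
   print(m.group(9))                       # 00 (the same group, by number)
   print(m.group('day', 'mon', 'year'))    # ('17', 'Jul', '1996')
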
The syntax for backreferences in an expression such as ``(...)\1`` refers to the
number of the group.  There's naturally a variant that uses the group name
instead of the number. This is another Python extension: ``(?P=name)`` indicates
that the contents of the group called *name* should again be matched at the
current point.  The regular expression for finding doubled words,
``\b(\w+)\s+\1\b``, can also be written as ``\b(?P<word>\w+)\s+(?P=word)\b``::

   >>> p = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
   >>> p.search('Paris in the the spring').group()
   'the the'


Lookahead Assertions
--------------------

Another zero-width assertion is the lookahead assertion.  Lookahead assertions
are available in both positive and negative form, and look like this:

``(?=...)``
   Positive lookahead assertion.  This succeeds if the contained regular
   expression, represented here by ``...``, successfully matches at the current
   location, and fails otherwise. But, once the contained expression has been
   tried, the matching engine doesn't advance at all; the rest of the pattern is
   tried right where the assertion started.

``(?!...)``
   Negative lookahead assertion.  This is the opposite of the positive assertion;
   it succeeds if the contained expression *doesn't* match at the current position
   in the string.

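To see that a lookahead checks the text without consuming it, here is a brief
sketch; the patterns and sample strings are invented for illustration::

   import re

   # Match a word only if it is immediately followed by a colon.
   # The colon is checked by the lookahead but is not part of the match.
   p = re.compile(r'\w+(?=:)')
   m = p.search('Subject: hello')
   print(m.group())   # Subject
   print(m.end())     # 7, so the colon at index 7 was not consumed

   # The negative form: match 'foo' only when it is not followed by 'bar'.
   n = re.compile(r'foo(?!bar)')
   print(n.search('foobar'))            # None
   print(n.search('foobaz').group())    # foo
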
To make this concrete, let's look at a case where a lookahead is useful.
Consider a simple pattern to match a filename and split it apart into a base
name and an extension, separated by a ``.``.  For example, in ``news.rc``,
``news`` is the base name, and ``rc`` is the filename's extension.

The pattern to match this is quite simple:

``.*[.].*$``

Notice that the ``.`` needs to be treated specially because it's a
metacharacter, so it's inside a character class to only match that
specific character.  Also notice the trailing ``$``; this is added to
ensure that all the rest of the string must be included in the
extension.  This regular expression matches ``foo.bar`` and
``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``.

Now, consider complicating the problem a bit; what if you want to match
filenames where the extension is not ``bat``? Some incorrect attempts:

``.*[.][^b].*$``

The first attempt above tries to exclude ``bat`` by requiring
that the first character of the extension is not a ``b``.  This is wrong,
because the pattern also doesn't match ``foo.bar``.

``.*[.]([^b]..|.[^a].|..[^t])$``

The expression gets messier when you try to patch up the first solution by
requiring one of the following cases to match: the first character of the
extension isn't ``b``; the second character isn't ``a``; or the third character
isn't ``t``.  This accepts ``foo.bar`` and rejects ``autoexec.bat``, but it
requires a three-letter extension and won't accept a filename with a two-letter
extension such as ``sendmail.cf``.  We'll complicate the pattern again in an
effort to fix it.

``.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$``

In the third attempt, the second and third letters are both made optional in
order to allow matching extensions shorter than three characters, such as
``sendmail.cf``.

The pattern's getting really complicated now, which makes it hard to read and
understand.  Worse, if the problem changes and you want to exclude both ``bat``
and ``exe`` as extensions, the pattern would get even more complicated and
confusing.

A negative lookahead cuts through all this confusion:

``.*[.](?!bat$)[^.]*$``

The negative lookahead means: if the expression ``bat$``
doesn't match at this point, try the rest of the pattern; if ``bat$`` does
match, the whole pattern will fail.  The trailing ``$`` is required to ensure
that something like ``sample.batch``, where the extension only starts with
``bat``, will be allowed.  The ``[^.]*`` makes sure that the pattern works
when there are multiple dots in the filename.

Excluding another filename extension is now easy; simply add it as an
alternative inside the assertion.  The following pattern excludes filenames that
end in either ``bat`` or ``exe``:

``.*[.](?!bat$|exe$)[^.]*$``


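As a quick check of the final pattern, the following sketch runs it over a
handful of filenames; the list of names is made up for illustration::

   import re

   # Reject filenames whose extension is exactly 'bat' or 'exe'.
   p = re.compile(r'.*[.](?!bat$|exe$)[^.]*$')

   filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
                'printers.conf', 'sample.batch', 'setup.exe',
                'archive.tar.gz']

   # Keep only the names that the pattern accepts.
   accepted = [name for name in filenames if p.match(name)]
   print(accepted)

Only ``autoexec.bat`` and ``setup.exe`` are filtered out; ``sample.batch``
is kept because its extension merely starts with ``bat``.
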
Modifying Strings
=================

Up to this point, we've simply performed searches against a static string.
Regular expressions are also commonly used to modify strings in various ways,
using the following pattern methods:

+------------------+-----------------------------------------------+
| Method/Attribute | Purpose                                       |
+==================+===============================================+
| ``split()``      | Split the string into a list, splitting it    |
|                  | wherever the RE matches                       |
+------------------+-----------------------------------------------+
| ``sub()``        | Find all substrings where the RE matches, and |
|                  | replace them with a different string          |
+------------------+-----------------------------------------------+
| ``subn()``       | Does the same thing as :meth:`!sub`, but      |
|                  | returns the new string and the number of      |
|                  | replacements                                  |
+------------------+-----------------------------------------------+


Splitting Strings
-----------------

The :meth:`~re.Pattern.split` method of a pattern splits a string apart
wherever the RE matches, returning a list of the pieces. It's similar to the
:meth:`~str.split` method of strings but provides much more generality in the
delimiters that you can split by; string :meth:`!split` only supports splitting by
whitespace or by a fixed string.  As you'd expect, there's a module-level
:func:`re.split` function, too.


.. method:: .split(string [, maxsplit=0])
   :noindex:

   Split *string* by the matches of the regular expression.  If capturing
   parentheses are used in the RE, then their contents will also be returned as
   part of the resulting list.  If *maxsplit* is nonzero, at most *maxsplit* splits
   are performed.

You can limit the number of splits made, by passing a value for *maxsplit*.
When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the
remainder of the string is returned as the final element of the list.  In the
following example, the delimiter is any sequence of non-alphanumeric characters.
::

   >>> p = re.compile(r'\W+')
   >>> p.split('This is a test, short and sweet, of split().')
   ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
   >>> p.split('This is a test, short and sweet, of split().', 3)
   ['This', 'is', 'a', 'test, short and sweet, of split().']

Sometimes you're not only interested in what the text between delimiters is, but
also need to know what the delimiter was.  If capturing parentheses are used in
the RE, then their values are also returned as part of the list.  Compare the
following calls::

   >>> p = re.compile(r'\W+')
   >>> p2 = re.compile(r'(\W+)')
   >>> p.split('This... is a test.')
   ['This', 'is', 'a', 'test', '']
   >>> p2.split('This... is a test.')
   ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']

The module-level function :func:`re.split` adds the RE to be used as the first
argument, but is otherwise the same. ::

   >>> re.split(r'[\W]+', 'Words, words, words.')
   ['Words', 'words', 'words', '']
   >>> re.split(r'([\W]+)', 'Words, words, words.')
   ['Words', ', ', 'words', ', ', 'words', '.', '']
   >>> re.split(r'[\W]+', 'Words, words, words.', 1)
   ['Words', 'words, words.']


Search and Replace
------------------

Another common task is to find all the matches for a pattern, and replace them
with a different string.  The :meth:`~re.Pattern.sub` method takes a replacement value,
which can be either a string or a function, and the string to be processed.

.. method:: .sub(replacement, string[, count=0])
   :noindex:

   Returns the string obtained by replacing the leftmost non-overlapping
   occurrences of the RE in *string* by the replacement *replacement*.  If the
   pattern isn't found, *string* is returned unchanged.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer.  The default value of 0 means
   to replace all occurrences.

Here's a simple example of using the :meth:`~re.Pattern.sub` method.  It replaces colour
names with the word ``colour``::

   >>> p = re.compile('(blue|white|red)')
   >>> p.sub('colour', 'blue socks and red shoes')
   'colour socks and colour shoes'
   >>> p.sub('colour', 'blue socks and red shoes', count=1)
   'colour socks and red shoes'

The :meth:`~re.Pattern.subn` method does the same work, but returns a 2-tuple containing the
new string value and the number of replacements that were performed::

   >>> p = re.compile('(blue|white|red)')
   >>> p.subn('colour', 'blue socks and red shoes')
   ('colour socks and colour shoes', 2)
   >>> p.subn('colour', 'no colours at all')
   ('no colours at all', 0)

Empty matches are replaced only when they're not adjacent to a previous empty match.
::

   >>> p = re.compile('x*')
   >>> p.sub('-', 'abxd')
   '-a-b--d-'

If *replacement* is a string, any backslash escapes in it are processed.  That
is, ``\n`` is converted to a single newline character, ``\r`` is converted to a
carriage return, and so forth. Unknown escapes such as ``\&`` are left alone.
Backreferences, such as ``\6``, are replaced with the substring matched by the
corresponding group in the RE.  This lets you incorporate portions of the
original text in the resulting replacement string.

This example matches the word ``section`` followed by a string enclosed in
``{``, ``}``, and changes ``section`` to ``subsection``::

   >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
   >>> p.sub(r'subsection{\1}', 'section{First} section{second}')
   'subsection{First} subsection{second}'

There's also a syntax for referring to named groups as defined by the
``(?P<name>...)`` syntax.  ``\g<name>`` will use the substring matched by the
group named ``name``, and ``\g<number>`` uses the corresponding group number.
``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous in a
replacement string such as ``\g<2>0``.  (``\20`` would be interpreted as a
reference to group 20, not a reference to group 2 followed by the literal
character ``'0'``.)  The following substitutions are all equivalent, but use all
three variations of the replacement string. ::

   >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
   >>> p.sub(r'subsection{\1}', 'section{First}')
   'subsection{First}'
   >>> p.sub(r'subsection{\g<1>}', 'section{First}')
   'subsection{First}'
   >>> p.sub(r'subsection{\g<name>}', 'section{First}')
   'subsection{First}'

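The ambiguity between ``\g<2>0`` and ``\20`` can be seen in a short sketch;
the two-group pattern and the input string are invented for illustration::

   import re

   # Two groups: a lowercase letter and a digit.
   p = re.compile(r'([a-z])([0-9])')

   # \g<2>0 means "group 2, then a literal zero".
   result = p.sub(r'\g<2>0', 'a1 b2')
   print(result)   # 10 20

   # \20 is read as a reference to group 20.  This pattern has no such
   # group, so the replacement template is rejected with re.error.
   try:
       p.sub(r'\20', 'a1 b2')
   except re.error as exc:
       print('error:', exc)
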
*replacement* can also be a function, which gives you even more control.  If
*replacement* is a function, the function is called for every non-overlapping
occurrence of *pattern*.  On each call, the function is passed a
:ref:`match object <match-objects>` argument for the match and can use this
information to compute the desired replacement string and return it.

In the following example, the replacement function translates decimals into
hexadecimal::

   >>> def hexrepl(match):
   ...     "Return the hex string for a decimal number"
   ...     value = int(match.group())
   ...     return hex(value)
   ...
   >>> p = re.compile(r'\d+')
   >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
   'Call 0xffd2 for printing, 0xc000 for user code.'

When using the module-level :func:`re.sub` function, the pattern is passed as
the first argument.  The pattern may be provided as an object or as a string; if
you need to specify regular expression flags, you can pass them with the
*flags* keyword argument, use a pattern object as the first parameter, or use
embedded modifiers in the pattern string, e.g. ``sub("(?i)b+", "x", "bbbb BBBB")``
returns ``'x x'``.


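A short sketch comparing ways of supplying flags to the module-level
function, reusing the sample text from the paragraph above; all three calls
are expected to give the same result::

   import re

   text = 'bbbb BBBB'

   # 1. Embedded modifier inside the pattern string.
   embedded = re.sub('(?i)b+', 'x', text)

   # 2. A compiled pattern object passed as the first argument.
   pattern = re.compile('b+', re.IGNORECASE)
   compiled = re.sub(pattern, 'x', text)

   # 3. The flags keyword argument accepted by re.sub().
   keyword = re.sub('b+', 'x', text, flags=re.IGNORECASE)

   print(embedded, compiled, keyword, sep=' | ')   # x x | x x | x x
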
Common Problems
===============

Regular expressions are a powerful tool for some applications, but in some ways
their behaviour isn't intuitive and at times they don't behave the way you may
expect them to.  This section will point out some of the most common pitfalls.


Use String Methods
------------------

Sometimes using the :mod:`re` module is a mistake.  If you're matching a fixed
string, or a single character class, and you're not using any :mod:`re` features
such as the :const:`~re.IGNORECASE` flag, then the full power of regular expressions
may not be required. Strings have several methods for performing operations with
fixed strings and they're usually much faster, because the implementation is a
single small C loop that's been optimized for the purpose, instead of the large,
more generalized regular expression engine.

One example might be replacing a single fixed string with another one; for
example, you might replace ``word`` with ``deed``.  :func:`re.sub` seems like the
function to use for this, but consider the :meth:`~str.replace` method.  Note that
:meth:`!replace` will also replace ``word`` inside words, turning ``swordfish``
into ``sdeedfish``, but the naive RE ``word`` would have done that, too.  (To
avoid performing the substitution on parts of words, the pattern would have to
be ``\bword\b``, in order to require that ``word`` have a word boundary on
either side.  This takes the job beyond :meth:`!replace`'s abilities.)

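The three approaches can be compared side by side in a brief sketch; the
sample sentence is made up for illustration::

   import re

   text = 'A word about swordfish.'

   # str.replace() substitutes every occurrence, even inside other words.
   replaced = text.replace('word', 'deed')
   print(replaced)     # A deed about sdeedfish.

   # The naive RE behaves the same way.
   naive = re.sub('word', 'deed', text)
   print(naive)        # A deed about sdeedfish.

   # Word boundaries restrict the substitution to the standalone word.
   bounded = re.sub(r'\bword\b', 'deed', text)
   print(bounded)      # A deed about swordfish.
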
Another common task is deleting every occurrence of a single character from a
string or replacing it with another single character.  You might do this with
something like ``re.sub('\n', ' ', S)``, but :meth:`~str.translate` is capable of
doing both tasks and will be faster than any regular expression operation can
be.

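A minimal sketch of both tasks with :meth:`~str.translate`, using translation
tables built by :meth:`str.maketrans`; the sample string ``S`` is invented for
this example::

   S = 'line one\nline two\nline three'

   # Replace every newline with a space.
   to_space = str.maketrans('\n', ' ')
   spaced = S.translate(to_space)
   print(spaced)     # line one line two line three

   # Delete every newline: the third argument to str.maketrans() lists
   # the characters that should be removed.
   drop_newlines = str.maketrans('', '', '\n')
   joined = S.translate(drop_newlines)
   print(joined)     # line oneline twoline three
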
In short, before turning to the :mod:`re` module, consider whether your problem
can be solved with a faster and simpler string method.


match() versus search()
-----------------------

The :func:`~re.match` function only checks if the RE matches at the beginning of the
string while :func:`~re.search` will scan forward through the string for a match.
It's important to keep this distinction in mind.  Remember, :func:`!match` will
only report a successful match that starts at position 0; if the match wouldn't
start at position 0, :func:`!match` will *not* report it. ::

   >>> print(re.match('super', 'superstition').span())
   (0, 5)
   >>> print(re.match('super', 'insuperable'))
   None

On the other hand, :func:`~re.search` will scan forward through the string,
reporting the first match it finds. ::

   >>> print(re.search('super', 'superstition').span())
   (0, 5)
   >>> print(re.search('super', 'insuperable').span())
   (2, 7)

Sometimes you'll be tempted to keep using :func:`re.match`, and just add ``.*``
to the front of your RE.  Resist this temptation and use :func:`re.search`
instead.  The regular expression compiler does some analysis of REs in order to
speed up the process of looking for a match.  One such analysis figures out what
the first character of a match must be; for example, a pattern starting with
``Crow`` must match starting with a ``'C'``.  The analysis lets the engine
quickly scan through the string looking for the starting character, only trying
the full match if a ``'C'`` is found.

Adding ``.*`` defeats this optimization, requiring scanning to the end of the
string and then backtracking to find a match for the rest of the RE.  Use
:func:`re.search` instead.


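A brief sketch of the two approaches on a made-up sentence: both calls locate
``Crow``, but the ``.*`` version also swallows everything before it, so the
reported span differs::

   import re

   text = 'The Crow flies at midnight'

   # Preferred: let search() scan forward for the pattern.
   m1 = re.search('Crow', text)
   print(m1.span())    # (4, 8)

   # Discouraged: match() with a leading .* succeeds as well, but the
   # match now covers everything from the start of the string.
   m2 = re.match('.*Crow', text)
   print(m2.span())    # (0, 8)
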
Greedy versus Non-Greedy
------------------------

When repeating a regular expression, as in ``a*``, the resulting action is to
consume as much of the string as possible.  This fact often bites you when
you're trying to match a pair of balanced delimiters, such as the angle brackets
surrounding an HTML tag.  The naive pattern for matching a single HTML tag
doesn't work because of the greedy nature of ``.*``. ::

   >>> s = '<html><head><title>Title</title>'
   >>> len(s)
   32
   >>> print(re.match('<.*>', s).span())
   (0, 32)
   >>> print(re.match('<.*>', s).group())
   <html><head><title>Title</title>

The RE matches the ``'<'`` in ``'<html>'``, and the ``.*`` consumes the rest of
the string.  There's still more left in the RE, though, and the ``>`` can't
match at the end of the string, so the regular expression engine has to
backtrack character by character until it finds a match for the ``>``.  The
final match extends from the ``'<'`` in ``'<html>'`` to the ``'>'`` in
``'</title>'``, which isn't what you want.

In this case, the solution is to use the non-greedy quantifiers ``*?``, ``+?``,
``??``, or ``{m,n}?``, which match as *little* text as possible.  In the above
example, the ``'>'`` is tried immediately after the first ``'<'`` matches, and
when it fails, the engine advances a character at a time, retrying the ``'>'``
at every step.  This produces just the right result::

   >>> print(re.match('<.*?>', s).group())
   <html>

(Note that parsing HTML or XML with regular expressions is painful.
Quick-and-dirty patterns will handle common cases, but HTML and XML have special
cases that will break the obvious regular expression; by the time you've written
a regular expression that handles all of the possible cases, the patterns will
be *very* complicated.  Use an HTML or XML parser module for such tasks.)


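As one possible illustration of the parser route, the sketch below uses the
standard library's :mod:`html.parser` to list the start tags in the same
string.  The ``TagCollector`` class is a hypothetical helper written for this
example, not part of any library::

   from html.parser import HTMLParser

   class TagCollector(HTMLParser):
       """Hypothetical helper that records the name of every start tag."""

       def __init__(self):
           super().__init__()
           self.tags = []

       def handle_starttag(self, tag, attrs):
           # Called by the parser once for each opening tag it encounters.
           self.tags.append(tag)

   s = '<html><head><title>Title</title>'
   collector = TagCollector()
   collector.feed(s)
   collector.close()
   print(collector.tags)   # ['html', 'head', 'title']
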
Using re.VERBOSE
----------------

By now you've probably noticed that regular expressions are a very compact
notation, but they're not terribly readable.  REs of moderate complexity can
become lengthy collections of backslashes, parentheses, and metacharacters,
making them difficult to read and understand.

For such REs, specifying the :const:`re.VERBOSE` flag when compiling the regular
expression can be helpful, because it allows you to format the regular
expression more clearly.

The ``re.VERBOSE`` flag has several effects.  Whitespace in the regular
expression that *isn't* inside a character class is ignored.  This means that an
expression such as ``dog | cat`` is equivalent to the less readable ``dog|cat``,
but ``[a b]`` will still match the characters ``'a'``, ``'b'``, or a space.  In
addition, you can also put comments inside a RE; comments extend from a ``#``
character to the next newline.  When used with triple-quoted strings, this
enables REs to be formatted more neatly::

13657db96d56Sopenharmony_ci   pat = re.compile(r"""
13667db96d56Sopenharmony_ci    \s*                 # Skip leading whitespace
13677db96d56Sopenharmony_ci    (?P<header>[^:]+)   # Header name
13687db96d56Sopenharmony_ci    \s* :               # Whitespace, and a colon
13697db96d56Sopenharmony_ci    (?P<value>.*?)      # The header's value -- *? used to
13707db96d56Sopenharmony_ci                        # lose the following trailing whitespace
13717db96d56Sopenharmony_ci    \s*$                # Trailing whitespace to end-of-line
13727db96d56Sopenharmony_ci   """, re.VERBOSE)
13737db96d56Sopenharmony_ci
13747db96d56Sopenharmony_ciThis is far more readable than::
13757db96d56Sopenharmony_ci
13767db96d56Sopenharmony_ci   pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
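A quick way to convince yourself that the two forms behave the same is to run
both against the same input.  The following is a minimal sketch; the names
``verbose_pat`` and ``compact_pat`` and the sample header line are made up for
this illustration.

```python
import re

verbose_pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# Made-up header line for illustration.
line = "  From: author@example.com   "

verbose_groups = verbose_pat.match(line).groupdict()
compact_groups = compact_pat.match(line).groupdict()
print(verbose_groups == compact_groups)  # True
print(verbose_groups["header"])          # From
print(repr(verbose_groups["value"]))     # ' author@example.com'

# Whitespace outside a character class is ignored in verbose mode...
print(re.fullmatch(r"dog | cat", "cat", re.VERBOSE) is not None)  # True
# ...but whitespace inside a character class still counts.
print(re.fullmatch(r"[a b]", " ", re.VERBOSE) is not None)        # True
# Without the flag, the spaces in the pattern are literal.
print(re.fullmatch(r"dog | cat", "cat") is not None)              # False
```

Note that the captured value keeps the space that follows the colon, because
the pattern only strips whitespace at the end of the line.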


Feedback
========

Regular expressions are a complicated topic.  Did this document help you
understand them?  Were there parts that were unclear, or problems you
encountered that weren't covered here?  If so, please send suggestions for
improvements to the author.

The most complete book on regular expressions is almost certainly Jeffrey
Friedl's *Mastering Regular Expressions*, published by O'Reilly.  Unfortunately,
it exclusively concentrates on Perl and Java's flavours of regular expressions,
and doesn't contain any Python material at all, so it won't be useful as a
reference for programming in Python.  (The first edition covered Python's
now-removed :mod:`!regex` module, which won't help you much.)  Consider checking
it out from your library.