:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re/`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a bytes pattern or
vice versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning.  This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.  Also note that any invalid escape sequence in a Python string
literal now generates a :exc:`DeprecationWarning`, and in the future this
will become a :exc:`SyntaxError`.  This happens even if the sequence is a
valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline.  Usually patterns will be expressed in Python code using this raw
string notation.

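The difference between the two notations can be sketched as follows.  This is
a minimal illustration using only the standard library; the sample text
``C:\temp`` is an assumption chosen for the example.

```python
import re

# Two spellings of the same two-character pattern (backslash, backslash),
# which the regex engine reads as "one literal backslash".
escaped = "\\\\"   # regular string literal
raw = r"\\"        # raw string literal

same_pattern = escaped == raw

# r"\n" keeps the backslash; "\n" is a single newline character.
lengths = (len(r"\n"), len("\n"))

# Hypothetical sample text with exactly one backslash, at index 2.
match = re.search(raw, r"C:\temp")
print(same_pattern, lengths, match.start())
```

Both spellings compile to the same pattern; the raw form is simply easier to
read and harder to get wrong.
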
Most regular expression operations are available both as module-level
functions and as methods on
:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
that don't require you to compile a regex object first, but lack some
fine-tuning parameters.

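As a rough sketch of that trade-off, the example below calls ``findall`` both
ways.  The sample text is an assumption, and the *pos* argument of the
compiled method is shown as one example of a parameter that the module-level
shortcut does not accept.

```python
import re

text = "spam, eggs and spam"

# Module-level shortcut: no compile step needed.
via_function = re.findall(r"spam", text)

# Compiled pattern: same result, and reusable across calls.
compiled = re.compile(r"spam")
via_method = compiled.findall(text)

# The method form also takes a start position (here: skip the first word).
later_only = compiled.findall(text, 5)

print(via_function, via_method, later_only)
```
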
.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contains
low-precedence operations, boundary conditions between *A* and *B*, or numbered
group references.  Thus, complex expressions can easily be constructed from
simpler primitive expressions like the ones described here.  For details of the
theory and implementation of regular expressions, consult the Friedl book
[Frie09]_, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows.  For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition operators or quantifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.

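The grouping rule can be checked with a short sketch.  The variable names are
illustrative; the only behaviour assumed beyond the text above is that a
directly nested quantifier is reported as :exc:`re.error` at compile time.

```python
import re

six_as = re.compile(r"(?:a{6})*")

matches_twelve = six_as.fullmatch("a" * 12) is not None
matches_empty = six_as.fullmatch("") is not None      # zero is a multiple of six
matches_seven = six_as.fullmatch("a" * 7) is not None

# Stacking a second quantifier directly on the first is rejected.
try:
    re.compile(r"a{6}*")
    nested_error = None
except re.error as exc:
    nested_error = str(exc)

print(matches_twelve, matches_empty, matches_seven, nested_error)
```
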

The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.)  In the default mode, this matches any character except a newline.  If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
   matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
   matches only ``'foo'``.  More interestingly, searching for ``foo.$`` in
   ``'foo1\nfoo2\n'`` matches ``'foo2'`` normally, but ``'foo1'`` in
   :const:`MULTILINE` mode; searching for a single ``$`` in ``'foo\n'`` will find
   two (empty) matches: one just before the newline, and one at the end of the
   string.

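The ``$`` examples above can be reproduced directly; this sketch uses the same
strings as the description and only standard :mod:`re` calls.

```python
import re

text = "foo1\nfoo2\n"

# Default mode: "$" only matches at the very end (or before a final newline).
default_hit = re.search(r"foo.$", text).group()

# MULTILINE mode: "$" also matches before every newline.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare "$" finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

print(default_hit, multiline_hit, dollar_positions)
```
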
.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible.  ``ab*`` will match ``'a'``, ``'ab'``, or
   ``'a'`` followed by any number of ``'b'``\ s.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'``\ s; it
   will not match just ``'a'``.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either ``'a'`` or ``'ab'``.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` quantifiers are all :dfn:`greedy`; they match
   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``.  Adding ``?`` after the quantifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched.  Using the RE ``<.*?>`` will match
   only ``'<a>'``.

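A minimal sketch of the greedy and non-greedy behaviour, using the same
pattern and string as the description above:

```python
import re

text = "<a> b <c>"

# Greedy: ".*" runs to the last ">" it can still match.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: ".*?" stops at the first ">" that lets the match succeed.
minimal = re.match(r"<.*?>", text).group()

print(repr(greedy), repr(minimal))
```
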
.. index::
   single: *+; in regular expressions
   single: ++; in regular expressions
   single: ?+; in regular expressions

``*+``, ``++``, ``?+``
   Like the ``'*'``, ``'+'``, and ``'?'`` quantifiers, those where ``'+'`` is
   appended also match as many times as possible.
   However, unlike the true greedy quantifiers, these do not allow
   backtracking when the expression following them fails to match.
   These are known as :dfn:`possessive` quantifiers.
   For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
   all 4 ``'a'``\ s, but, when the final ``'a'`` is encountered, the
   expression is backtracked so that in the end the ``a*`` ends up matching
   3 ``'a'``\ s total, and the fourth ``'a'`` is matched by the final ``'a'``.
   However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
   match all 4 ``'a'``\ s, but when the final ``'a'`` fails to find any more
   characters to match, the expression cannot be backtracked and will thus
   fail to match.
   ``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
   and ``(?>x?)`` respectively.

   .. versionadded:: 3.11

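The ``a*a`` versus ``a*+a`` comparison can be sketched as below.  Because
possessive quantifiers need Python 3.11 or later, the possessive pattern is
only compiled behind a version check; on older interpreters the variable is
simply left as ``None`` (an assumption made so the sketch runs anywhere).

```python
import re
import sys

text = "aaaa"

# Greedy: "a*" gives back one character so the trailing "a" can match.
greedy = re.fullmatch(r"a*a", text)

# Possessive: "a*+" keeps all four characters, so the trailing "a" fails.
possessive = None
if sys.version_info >= (3, 11):
    possessive = re.fullmatch(r"a*+a", text)

print(greedy is not None, possessive is not None)
```
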
.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match.  For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible.  For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound.  As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible.  This is the
   non-greedy version of the previous quantifier.  For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

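The counted forms described so far can be exercised with a short sketch; the
strings are the ones used in the descriptions above.

```python
import re

six = "aaaaaa"

# Greedy range takes as many as allowed; the non-greedy form takes as few.
greedy_len = len(re.match(r"a{3,5}", six).group())
minimal_len = len(re.match(r"a{3,5}?", six).group())

# Open-ended upper bound: at least four "a" characters, then "b".
four_ok = re.fullmatch(r"a{4,}b", "aaaab") is not None
many_ok = re.fullmatch(r"a{4,}b", "a" * 1000 + "b") is not None
three_ok = re.fullmatch(r"a{4,}b", "aaab") is not None

print(greedy_len, minimal_len, four_ok, many_ok, three_ok)
```
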
``{m,n}+``
   Causes the resulting RE to match from *m* to *n* repetitions of the
   preceding RE, attempting to match as many repetitions as possible
   *without* establishing any backtracking points.
   This is the possessive version of the quantifier above.
   For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}+aa``
   attempts to match 5 ``'a'`` characters, then, requiring 2 more ``'a'``\ s,
   will need more characters than available and thus fail, while
   ``a{3,5}aa`` will match with ``a{3,5}`` capturing 5, then 4 ``'a'``\ s
   by backtracking and then the final 2 ``'a'``\ s are matched by the final
   ``aa`` in the pattern.
   ``x{m,n}+`` is equivalent to ``(?>x{m,n})``.

   .. versionadded:: 3.11

.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string.  However, if Python would
   recognize the resulting sequence, the backslash should be doubled.  This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters.  In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets.  For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set.  If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched.  For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``.  ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
     and parentheses.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future.  This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``.  To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.

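The set rules listed above can be illustrated with a few calls.  This is a
sketch only; the sample strings are assumptions chosen to exercise each rule.

```python
import re

# Individually listed characters.
listed = re.findall(r"[amk]", "make")

# Ranges: two-digit numbers from 00 to 59.
in_range = re.fullmatch(r"[0-5][0-9]", "59") is not None
out_of_range = re.fullmatch(r"[0-5][0-9]", "60") is not None

# An escaped "-" is a literal hyphen, not a range operator.
hyphen = re.findall(r"[a\-z]", "a-z")

# Special characters are literal inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")

# A leading "^" complements the set.
not_five = re.findall(r"[^5]", "1525")

# Two equivalent ways to include a literal "]".
escaped_bracket = re.findall(r"[()[\]{}]", "f(x)[0]{}")
leading_bracket = re.findall(r"[]()[{}]", "f(x)[0]{}")

print(listed, hyphen, specials, not_five, escaped_bracket)
```
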
.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*.  An arbitrary number of REs can be separated by the
   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match.  In other words, the ``'|'`` operator is never
   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

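A small sketch of the left-to-right rule; the patterns ``a|ab`` and ``ab|a``
are illustrative choices, not taken from the text above.

```python
import re

# The first alternative that matches wins, even if a later one is longer.
short_first = re.match(r"a|ab", "ab").group()

# Reordering the alternatives changes the result.
long_first = re.match(r"ab|a", "ab").group()

# Two ways to match a literal vertical bar.
escaped_bar = re.findall(r"\|", "a|b|c")
class_bar = re.findall(r"[|]", "a|b|c")

print(short_first, long_first, escaped_bar, class_bar)
```

When a longer alternative shares a prefix with a shorter one, listing the
longer alternative first is the usual way to make it win.
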
.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below.  To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise).  The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flags* argument to the
   :func:`re.compile` function.  Flags should be used first in the
   expression string.

   .. versionchanged:: 3.11
      This construction can only be used at the start of the expression.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses.  Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for that part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``.  Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default).  In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

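A rough sketch of how inline flags scope their effect.  The patterns and
sample strings are assumptions for illustration; the scoped ``(?a:...)`` form
relies on the Python 3.7 behaviour noted above.

```python
import re

# Global inline flag: the whole pattern ignores case.
global_flag = re.search(r"(?i)hello world", "HELLO WORLD") is not None

# Scoped inline flag: only "hello" ignores case; " World" stays case-sensitive.
scoped_hit = re.search(r"(?i:hello) World", "HELLO World") is not None
scoped_miss = re.search(r"(?i:hello) World", "hello WORLD") is not None

# In a str pattern, (?a:...) restricts \w to ASCII inside the group only.
unicode_word = re.fullmatch(r"\w", "é") is not None
ascii_word = re.fullmatch(r"(?a:\w)", "é") is not None

print(global_flag, scoped_hit, scoped_miss, unicode_word, ascii_word)
```
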
``(?>...)``
   Attempts to match ``...`` as if it were a separate regular expression, and
   if successful, continues to match the rest of the pattern following it.
   If the subsequent pattern fails to match, the stack can only be unwound
   to a point *before* the ``(?>...)`` because once exited, the expression,
   known as an :dfn:`atomic group`, has thrown away all stack points within
   itself.
   Thus, ``(?>.*).`` would never match anything because first the ``.*``
   would match all characters possible, then, having nothing left to match,
   the final ``.`` would fail to match.
   Since there are no stack points saved in the atomic group, and there is
   no stack point before it, the entire expression would thus fail to match.

   .. versionadded:: 3.11

.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*.  Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression.  A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts.  If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

   .. deprecated:: 3.11
      Group *name* containing characters outside the ASCII range
      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` patterns.

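The three reference contexts in the table can be sketched with the same
``quote`` pattern.  The sample text and the replacement word ``REDACTED`` are
assumptions made for the example.

```python
import re

pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = 'say "hi" now'

# 1. Inside the pattern: (?P=quote) forces the closing quote to match the
#    opening one.
m = pattern.search(text)
whole = m.group()

# 2. On the match object: by name or by number.
by_name = m.group("quote")
by_number = m.group(1)
quote_end = m.end("quote")

# 3. In a replacement string: \g<name> or \1.
named_sub = pattern.sub(r"\g<quote>REDACTED\g<quote>", text)
numbered_sub = pattern.sub(r"\1REDACTED\1", text)

print(whole, by_name, quote_end, named_sub)
```
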
.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

4367db96d56Sopenharmony_ci.. index:: single: (?=; in regular expressions
4377db96d56Sopenharmony_ci
4387db96d56Sopenharmony_ci``(?=...)``
4397db96d56Sopenharmony_ci   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
4407db96d56Sopenharmony_ci   called a :dfn:`lookahead assertion`.  For example, ``Isaac (?=Asimov)`` will match
4417db96d56Sopenharmony_ci   ``'Isaac '`` only if it's followed by ``'Asimov'``.
4427db96d56Sopenharmony_ci
4437db96d56Sopenharmony_ci.. index:: single: (?!; in regular expressions
4447db96d56Sopenharmony_ci
4457db96d56Sopenharmony_ci``(?!...)``
4467db96d56Sopenharmony_ci   Matches if ``...`` doesn't match next.  This is a :dfn:`negative lookahead assertion`.
4477db96d56Sopenharmony_ci   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
4487db96d56Sopenharmony_ci   followed by ``'Asimov'``.
4497db96d56Sopenharmony_ci
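As a small illustration of the two lookahead forms, the sketch below reuses the ``Isaac`` patterns from the text; the variable names are arbitrary. Note that the assertion itself is never part of the matched text.

```python
import re

# Positive lookahead: 'Isaac ' matches only when 'Asimov' follows.
pos_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: 'Isaac ' matches only when 'Asimov' does not follow.
neg_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')

# The lookahead text is not consumed, so the match ends after the space.
matched_text = pos_hit.group(0)
matched_end = pos_hit.end()
```
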
4507db96d56Sopenharmony_ci.. index:: single: (?<=; in regular expressions
4517db96d56Sopenharmony_ci
4527db96d56Sopenharmony_ci``(?<=...)``
4537db96d56Sopenharmony_ci   Matches if the current position in the string is preceded by a match for ``...``
4547db96d56Sopenharmony_ci   that ends at the current position.  This is called a :dfn:`positive lookbehind
4557db96d56Sopenharmony_ci   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
4567db96d56Sopenharmony_ci   lookbehind will back up 3 characters and check if the contained pattern matches.
4577db96d56Sopenharmony_ci   The contained pattern must only match strings of some fixed length, meaning that
4587db96d56Sopenharmony_ci   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
4597db96d56Sopenharmony_ci   patterns which start with positive lookbehind assertions will not match at the
4607db96d56Sopenharmony_ci   beginning of the string being searched; you will most likely want to use the
4617db96d56Sopenharmony_ci   :func:`search` function rather than the :func:`match` function:
4627db96d56Sopenharmony_ci
4637db96d56Sopenharmony_ci      >>> import re
4647db96d56Sopenharmony_ci      >>> m = re.search('(?<=abc)def', 'abcdef')
4657db96d56Sopenharmony_ci      >>> m.group(0)
4667db96d56Sopenharmony_ci      'def'
4677db96d56Sopenharmony_ci
4687db96d56Sopenharmony_ci   This example looks for a word following a hyphen:
4697db96d56Sopenharmony_ci
4707db96d56Sopenharmony_ci      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
4717db96d56Sopenharmony_ci      >>> m.group(0)
4727db96d56Sopenharmony_ci      'egg'
4737db96d56Sopenharmony_ci
4747db96d56Sopenharmony_ci   .. versionchanged:: 3.5
4757db96d56Sopenharmony_ci      Added support for group references of fixed length.
4767db96d56Sopenharmony_ci
4777db96d56Sopenharmony_ci.. index:: single: (?<!; in regular expressions
4787db96d56Sopenharmony_ci
4797db96d56Sopenharmony_ci``(?<!...)``
4807db96d56Sopenharmony_ci   Matches if the current position in the string is not preceded by a match for
4817db96d56Sopenharmony_ci   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
4827db96d56Sopenharmony_ci   positive lookbehind assertions, the contained pattern must only match strings of
4837db96d56Sopenharmony_ci   some fixed length.  Patterns which start with negative lookbehind assertions may
4847db96d56Sopenharmony_ci   match at the beginning of the string being searched.
4857db96d56Sopenharmony_ci
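A negative lookbehind can be sketched in the same style as the hyphen example above. The sample strings here are illustrative; the pattern selects words that are not immediately preceded by a hyphen.

```python
import re

# Words that do not directly follow a hyphen.
words = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')

# Unlike a positive lookbehind, a negative one can succeed at position 0,
# because nothing precedes the start of the string.
at_start = re.match(r'(?<!-)\w+', 'spam')
```
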
4867db96d56Sopenharmony_ci.. _re-conditional-expression:
4877db96d56Sopenharmony_ci.. index:: single: (?(; in regular expressions
4887db96d56Sopenharmony_ci
4897db96d56Sopenharmony_ci``(?(id/name)yes-pattern|no-pattern)``
4907db96d56Sopenharmony_ci   Will try to match with ``yes-pattern`` if the group with given *id* or
4917db96d56Sopenharmony_ci   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
4927db96d56Sopenharmony_ci   optional and can be omitted. For example,
4937db96d56Sopenharmony_ci   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
4947db96d56Sopenharmony_ci   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
4957db96d56Sopenharmony_ci   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
4967db96d56Sopenharmony_ci
4977db96d56Sopenharmony_ci   .. deprecated:: 3.11
4987db96d56Sopenharmony_ci      Group *id* containing anything except ASCII digits.
4997db96d56Sopenharmony_ci      Group *name* containing characters outside the ASCII range
5007db96d56Sopenharmony_ci      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` replacement strings.
5017db96d56Sopenharmony_ci
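The email pattern above can be checked against the four strings mentioned in the text. This is only a sketch; ``email`` and ``results`` are illustrative names.

```python
import re

# Group 1 captures an optional '<'. The conditional then requires '>' if
# group 1 matched, and the end of the string otherwise.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = ['<user@host.com>', 'user@host.com',
              '<user@host.com', 'user@host.com>']
results = {s: email.match(s) is not None for s in candidates}
```
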
5027db96d56Sopenharmony_ci
The special sequences consist of ``'\'`` and a character from the list below.
If the character after the backslash is not an ASCII digit or an ASCII letter,
then the resulting RE will match that character.  For example, ``\$`` matches the
character ``'$'``.
5077db96d56Sopenharmony_ci
5087db96d56Sopenharmony_ci.. index:: single: \ (backslash); in regular expressions
5097db96d56Sopenharmony_ci
5107db96d56Sopenharmony_ci``\number``
5117db96d56Sopenharmony_ci   Matches the contents of the group of the same number.  Groups are numbered
5127db96d56Sopenharmony_ci   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
5137db96d56Sopenharmony_ci   but not ``'thethe'`` (note the space after the group).  This special sequence
5147db96d56Sopenharmony_ci   can only be used to match one of the first 99 groups.  If the first digit of
5157db96d56Sopenharmony_ci   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
5167db96d56Sopenharmony_ci   a group match, but as the character with octal value *number*. Inside the
5177db96d56Sopenharmony_ci   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
5187db96d56Sopenharmony_ci   characters.
5197db96d56Sopenharmony_ci
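The difference between a numeric backreference and an octal escape can be seen in a short sketch. The names below are illustrative.

```python
import re

# \1 refers back to whatever group 1 matched.
repeated = re.compile(r'(.+) \1')
hit_words = repeated.search('the the')
hit_digits = repeated.search('55 55')
miss = repeated.search('thethe')      # no space between the repetitions

# Three octal digits, or a leading 0, denote a character, not a group.
octal_a = re.match(r'\101', 'A')      # 0o101 == 65 == ord('A')
nul = re.match(r'\0', '\x00')
```
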
5207db96d56Sopenharmony_ci.. index:: single: \A; in regular expressions
5217db96d56Sopenharmony_ci
5227db96d56Sopenharmony_ci``\A``
5237db96d56Sopenharmony_ci   Matches only at the start of the string.
5247db96d56Sopenharmony_ci
5257db96d56Sopenharmony_ci.. index:: single: \b; in regular expressions
5267db96d56Sopenharmony_ci
5277db96d56Sopenharmony_ci``\b``
5287db96d56Sopenharmony_ci   Matches the empty string, but only at the beginning or end of a word.
5297db96d56Sopenharmony_ci   A word is defined as a sequence of word characters.  Note that formally,
5307db96d56Sopenharmony_ci   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
5317db96d56Sopenharmony_ci   (or vice versa), or between ``\w`` and the beginning/end of the string.
5327db96d56Sopenharmony_ci   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
5337db96d56Sopenharmony_ci   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
5347db96d56Sopenharmony_ci
5357db96d56Sopenharmony_ci   By default Unicode alphanumerics are the ones used in Unicode patterns, but
5367db96d56Sopenharmony_ci   this can be changed by using the :const:`ASCII` flag.  Word boundaries are
5377db96d56Sopenharmony_ci   determined by the current locale if the :const:`LOCALE` flag is used.
5387db96d56Sopenharmony_ci   Inside a character range, ``\b`` represents the backspace character, for
5397db96d56Sopenharmony_ci   compatibility with Python's string literals.
5407db96d56Sopenharmony_ci
5417db96d56Sopenharmony_ci.. index:: single: \B; in regular expressions
5427db96d56Sopenharmony_ci
5437db96d56Sopenharmony_ci``\B``
5447db96d56Sopenharmony_ci   Matches the empty string, but only when it is *not* at the beginning or end
5457db96d56Sopenharmony_ci   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
5467db96d56Sopenharmony_ci   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
5477db96d56Sopenharmony_ci   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
5487db96d56Sopenharmony_ci   patterns are Unicode alphanumerics or the underscore, although this can
5497db96d56Sopenharmony_ci   be changed by using the :const:`ASCII` flag.  Word boundaries are
5507db96d56Sopenharmony_ci   determined by the current locale if the :const:`LOCALE` flag is used.
5517db96d56Sopenharmony_ci
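The ``\b`` and ``\B`` examples from the two entries above can be verified directly. This is a sketch with illustrative variable names.

```python
import re

word = re.compile(r'\bfoo\b')
word_hits = [s for s in ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
             if word.search(s)]

inner = re.compile(r'py\B')
inner_hits = [s for s in ['python', 'py3', 'py2', 'py', 'py.', 'py!']
              if inner.search(s)]

# Inside a character class, \b means the backspace character (0x08).
backspace = re.search(r'[\b]', 'a\x08c')
```
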
5527db96d56Sopenharmony_ci.. index:: single: \d; in regular expressions
5537db96d56Sopenharmony_ci
5547db96d56Sopenharmony_ci``\d``
5557db96d56Sopenharmony_ci   For Unicode (str) patterns:
5567db96d56Sopenharmony_ci      Matches any Unicode decimal digit (that is, any character in
5577db96d56Sopenharmony_ci      Unicode character category [Nd]).  This includes ``[0-9]``, and
5587db96d56Sopenharmony_ci      also many other digit characters.  If the :const:`ASCII` flag is
5597db96d56Sopenharmony_ci      used only ``[0-9]`` is matched.
5607db96d56Sopenharmony_ci
5617db96d56Sopenharmony_ci   For 8-bit (bytes) patterns:
5627db96d56Sopenharmony_ci      Matches any decimal digit; this is equivalent to ``[0-9]``.
5637db96d56Sopenharmony_ci
5647db96d56Sopenharmony_ci.. index:: single: \D; in regular expressions
5657db96d56Sopenharmony_ci
5667db96d56Sopenharmony_ci``\D``
5677db96d56Sopenharmony_ci   Matches any character which is not a decimal digit. This is
5687db96d56Sopenharmony_ci   the opposite of ``\d``. If the :const:`ASCII` flag is used this
5697db96d56Sopenharmony_ci   becomes the equivalent of ``[^0-9]``.
5707db96d56Sopenharmony_ci
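A brief sketch of how ``\d`` and ``\D`` differ between Unicode and ASCII matching. The sample string uses U+0663 (ARABIC-INDIC DIGIT THREE) as one example of a non-ASCII decimal digit.

```python
import re

text = 'a1\u0663'   # 'a', ASCII digit 1, ARABIC-INDIC DIGIT THREE

unicode_digits = re.findall(r'\d', text)
ascii_digits = re.findall(r'\d', text, re.ASCII)
non_digits = re.findall(r'\D', text)

# Bytes patterns only ever match [0-9].
bytes_digits = re.findall(rb'\d', b'a1b2')
```
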
5717db96d56Sopenharmony_ci.. index:: single: \s; in regular expressions
5727db96d56Sopenharmony_ci
5737db96d56Sopenharmony_ci``\s``
5747db96d56Sopenharmony_ci   For Unicode (str) patterns:
5757db96d56Sopenharmony_ci      Matches Unicode whitespace characters (which includes
5767db96d56Sopenharmony_ci      ``[ \t\n\r\f\v]``, and also many other characters, for example the
5777db96d56Sopenharmony_ci      non-breaking spaces mandated by typography rules in many
5787db96d56Sopenharmony_ci      languages). If the :const:`ASCII` flag is used, only
5797db96d56Sopenharmony_ci      ``[ \t\n\r\f\v]`` is matched.
5807db96d56Sopenharmony_ci
5817db96d56Sopenharmony_ci   For 8-bit (bytes) patterns:
5827db96d56Sopenharmony_ci      Matches characters considered whitespace in the ASCII character set;
5837db96d56Sopenharmony_ci      this is equivalent to ``[ \t\n\r\f\v]``.
5847db96d56Sopenharmony_ci
5857db96d56Sopenharmony_ci.. index:: single: \S; in regular expressions
5867db96d56Sopenharmony_ci
5877db96d56Sopenharmony_ci``\S``
5887db96d56Sopenharmony_ci   Matches any character which is not a whitespace character. This is
5897db96d56Sopenharmony_ci   the opposite of ``\s``. If the :const:`ASCII` flag is used this
5907db96d56Sopenharmony_ci   becomes the equivalent of ``[^ \t\n\r\f\v]``.
5917db96d56Sopenharmony_ci
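The effect of the ASCII flag on ``\s`` can be illustrated with a non-breaking space (U+00A0). This is a sketch, and the sample text is arbitrary.

```python
import re

text = 'a\u00a0b c\td'   # contains NO-BREAK SPACE, a space and a tab

unicode_ws = re.findall(r'\s', text)
ascii_ws = re.findall(r'\s', text, re.ASCII)
non_ws = re.findall(r'\S+', text)

# In a bytes pattern the byte 0xA0 is not treated as whitespace.
bytes_ws = re.findall(rb'\s', b'a\xa0b c')
```
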
5927db96d56Sopenharmony_ci.. index:: single: \w; in regular expressions
5937db96d56Sopenharmony_ci
5947db96d56Sopenharmony_ci``\w``
5957db96d56Sopenharmony_ci   For Unicode (str) patterns:
5967db96d56Sopenharmony_ci      Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`)
5977db96d56Sopenharmony_ci      as well as the underscore (``_``).
5987db96d56Sopenharmony_ci      If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
5997db96d56Sopenharmony_ci
6007db96d56Sopenharmony_ci   For 8-bit (bytes) patterns:
6017db96d56Sopenharmony_ci      Matches characters considered alphanumeric in the ASCII character set;
6027db96d56Sopenharmony_ci      this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
6037db96d56Sopenharmony_ci      used, matches characters considered alphanumeric in the current locale
6047db96d56Sopenharmony_ci      and the underscore.
6057db96d56Sopenharmony_ci
6067db96d56Sopenharmony_ci.. index:: single: \W; in regular expressions
6077db96d56Sopenharmony_ci
6087db96d56Sopenharmony_ci``\W``
6097db96d56Sopenharmony_ci   Matches any character which is not a word character. This is
6107db96d56Sopenharmony_ci   the opposite of ``\w``. If the :const:`ASCII` flag is used this
6117db96d56Sopenharmony_ci   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
6127db96d56Sopenharmony_ci   used, matches characters which are neither alphanumeric in the current locale
6137db96d56Sopenharmony_ci   nor the underscore.
6147db96d56Sopenharmony_ci
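The following sketch shows how accented letters are treated by ``\w`` with and without the ASCII flag. The sample text is illustrative.

```python
import re

text = 'na\u00efve_caf\u00e9 42'   # "naïve_café 42"

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, re.ASCII)

# \W is the complement: here only the hyphen is a non-word character.
non_word = re.findall(r'\W', 'a-b_c')
```
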
6157db96d56Sopenharmony_ci.. index:: single: \Z; in regular expressions
6167db96d56Sopenharmony_ci
6177db96d56Sopenharmony_ci``\Z``
6187db96d56Sopenharmony_ci   Matches only at the end of the string.
6197db96d56Sopenharmony_ci
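``\A`` and ``\Z`` differ from ``^`` and ``$`` in that they ignore line structure. A short sketch with illustrative strings:

```python
import re

# With MULTILINE, ^ matches after each newline, but \A matches only at
# the very start of the string.
caret_hit = re.search(r'^abc', 'x\nabc', re.MULTILINE)
start_miss = re.search(r'\Aabc', 'x\nabc', re.MULTILINE)

# $ also matches just before a trailing newline; \Z matches only at the
# very end of the string.
dollar_hit = re.search(r'abc$', 'abc\n')
end_miss = re.search(r'abc\Z', 'abc\n')
end_hit = re.search(r'abc\Z', 'abc')
```
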
6207db96d56Sopenharmony_ci.. index::
6217db96d56Sopenharmony_ci   single: \a; in regular expressions
6227db96d56Sopenharmony_ci   single: \b; in regular expressions
6237db96d56Sopenharmony_ci   single: \f; in regular expressions
6247db96d56Sopenharmony_ci   single: \n; in regular expressions
6257db96d56Sopenharmony_ci   single: \N; in regular expressions
6267db96d56Sopenharmony_ci   single: \r; in regular expressions
6277db96d56Sopenharmony_ci   single: \t; in regular expressions
6287db96d56Sopenharmony_ci   single: \u; in regular expressions
6297db96d56Sopenharmony_ci   single: \U; in regular expressions
6307db96d56Sopenharmony_ci   single: \v; in regular expressions
6317db96d56Sopenharmony_ci   single: \x; in regular expressions
6327db96d56Sopenharmony_ci   single: \\; in regular expressions
6337db96d56Sopenharmony_ci
6347db96d56Sopenharmony_ciMost of the standard escapes supported by Python string literals are also
6357db96d56Sopenharmony_ciaccepted by the regular expression parser::
6367db96d56Sopenharmony_ci
6377db96d56Sopenharmony_ci   \a      \b      \f      \n
6387db96d56Sopenharmony_ci   \N      \r      \t      \u
6397db96d56Sopenharmony_ci   \U      \v      \x      \\
6407db96d56Sopenharmony_ci
6417db96d56Sopenharmony_ci(Note that ``\b`` is used to represent word boundaries, and means "backspace"
6427db96d56Sopenharmony_cionly inside character classes.)
6437db96d56Sopenharmony_ci
6447db96d56Sopenharmony_ci``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
6457db96d56Sopenharmony_cipatterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
6467db96d56Sopenharmony_ciletters are reserved for future use and treated as errors.
6477db96d56Sopenharmony_ci
6487db96d56Sopenharmony_ciOctal escapes are included in a limited form.  If the first digit is a 0, or if
6497db96d56Sopenharmony_cithere are three octal digits, it is considered an octal escape. Otherwise, it is
6507db96d56Sopenharmony_cia group reference.  As for string literals, octal escapes are always at most
6517db96d56Sopenharmony_cithree digits in length.
6527db96d56Sopenharmony_ci
6537db96d56Sopenharmony_ci.. versionchanged:: 3.3
6547db96d56Sopenharmony_ci   The ``'\u'`` and ``'\U'`` escape sequences have been added.
6557db96d56Sopenharmony_ci
6567db96d56Sopenharmony_ci.. versionchanged:: 3.6
6577db96d56Sopenharmony_ci   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.
6587db96d56Sopenharmony_ci
6597db96d56Sopenharmony_ci.. versionchanged:: 3.8
6607db96d56Sopenharmony_ci   The ``'\N{name}'`` escape sequence has been added. As in string literals,
6617db96d56Sopenharmony_ci   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).
6627db96d56Sopenharmony_ci
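The escape rules above can be exercised with a small helper. ``is_valid`` is a hypothetical function written for this sketch, not part of the module, and the ``\N{...}`` line assumes Python 3.8 or later.

```python
import re

def is_valid(pattern):
    """Return True if *pattern* compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Escapes handled by the regular expression parser itself (raw strings).
em_dash = re.search(r'\N{EM DASH}', 'a\u2014b')
by_code_point = re.search(r'\u2014', 'a\u2014b')
hex_escape = re.search(r'\x41', 'CAB')

unknown_letter_ok = is_valid(r'\q')          # unknown ASCII-letter escape
bytes_unicode_ok = is_valid(rb'\u0041')      # \u is not allowed in bytes patterns
bytes_hex_ok = is_valid(rb'\x41')
```
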
6637db96d56Sopenharmony_ci
6647db96d56Sopenharmony_ci.. _contents-of-module-re:
6657db96d56Sopenharmony_ci
6667db96d56Sopenharmony_ciModule Contents
6677db96d56Sopenharmony_ci---------------
6687db96d56Sopenharmony_ci
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled form.
6737db96d56Sopenharmony_ci
6747db96d56Sopenharmony_ci
6757db96d56Sopenharmony_ciFlags
6767db96d56Sopenharmony_ci^^^^^
6777db96d56Sopenharmony_ci
6787db96d56Sopenharmony_ci.. versionchanged:: 3.6
6797db96d56Sopenharmony_ci   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
6807db96d56Sopenharmony_ci   :class:`enum.IntFlag`.
6817db96d56Sopenharmony_ci
6827db96d56Sopenharmony_ci
6837db96d56Sopenharmony_ci.. class:: RegexFlag
6847db96d56Sopenharmony_ci
6857db96d56Sopenharmony_ci   An :class:`enum.IntFlag` class containing the regex options listed below.
6867db96d56Sopenharmony_ci
6877db96d56Sopenharmony_ci   .. versionadded:: 3.11 - added to ``__all__``
6887db96d56Sopenharmony_ci
6897db96d56Sopenharmony_ci.. data:: A
6907db96d56Sopenharmony_ci          ASCII
6917db96d56Sopenharmony_ci
6927db96d56Sopenharmony_ci   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
6937db96d56Sopenharmony_ci   perform ASCII-only matching instead of full Unicode matching.  This is only
6947db96d56Sopenharmony_ci   meaningful for Unicode patterns, and is ignored for byte patterns.
6957db96d56Sopenharmony_ci   Corresponds to the inline flag ``(?a)``.
6967db96d56Sopenharmony_ci
6977db96d56Sopenharmony_ci   Note that for backward compatibility, the :const:`re.U` flag still
6987db96d56Sopenharmony_ci   exists (as well as its synonym :const:`re.UNICODE` and its embedded
6997db96d56Sopenharmony_ci   counterpart ``(?u)``), but these are redundant in Python 3 since
7007db96d56Sopenharmony_ci   matches are Unicode by default for strings (and Unicode matching
7017db96d56Sopenharmony_ci   isn't allowed for bytes).
7027db96d56Sopenharmony_ci
7037db96d56Sopenharmony_ci
7047db96d56Sopenharmony_ci.. data:: DEBUG
7057db96d56Sopenharmony_ci
   Display debug information about the compiled expression.
7077db96d56Sopenharmony_ci   No corresponding inline flag.
7087db96d56Sopenharmony_ci
7097db96d56Sopenharmony_ci
7107db96d56Sopenharmony_ci.. data:: I
7117db96d56Sopenharmony_ci          IGNORECASE
7127db96d56Sopenharmony_ci
7137db96d56Sopenharmony_ci   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
7147db96d56Sopenharmony_ci   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
7157db96d56Sopenharmony_ci   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
7167db96d56Sopenharmony_ci   non-ASCII matches.  The current locale does not change the effect of this
7177db96d56Sopenharmony_ci   flag unless the :const:`re.LOCALE` flag is also used.
7187db96d56Sopenharmony_ci   Corresponds to the inline flag ``(?i)``.
7197db96d56Sopenharmony_ci
7207db96d56Sopenharmony_ci   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
7217db96d56Sopenharmony_ci   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
7227db96d56Sopenharmony_ci   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
7237db96d56Sopenharmony_ci   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
7247db96d56Sopenharmony_ci   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
7257db96d56Sopenharmony_ci   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
7267db96d56Sopenharmony_ci   and 'A' to 'Z' are matched.
7277db96d56Sopenharmony_ci
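A sketch of the case-insensitive behaviour described above, including the Kelvin sign (U+212A) case. The variable names are illustrative.

```python
import re

# Full Unicode case folding applies by default for str patterns.
umlaut = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE)               # ü vs Ü
umlaut_ascii = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE | re.ASCII)

# [a-z] with IGNORECASE also matches a few non-ASCII letters, such as
# the Kelvin sign, unless ASCII matching is requested.
kelvin = re.fullmatch(r'[a-z]', '\u212a', re.IGNORECASE)
kelvin_ascii = re.fullmatch(r'[a-z]', '\u212a', re.IGNORECASE | re.ASCII)

# The inline form (?i) is equivalent to passing the flag.
inline = re.fullmatch(r'(?i)[A-Z]+', 'abc')
```
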
7287db96d56Sopenharmony_ci.. data:: L
7297db96d56Sopenharmony_ci          LOCALE
7307db96d56Sopenharmony_ci
   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale.  This flag can be used only with bytes
   patterns.  The use of this flag is discouraged because the locale mechanism
   is very unreliable: it handles only one "culture" at a time and works only
   with 8-bit locales.  Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.
   Corresponds to the inline flag ``(?L)``.
7397db96d56Sopenharmony_ci
7407db96d56Sopenharmony_ci   .. versionchanged:: 3.6
7417db96d56Sopenharmony_ci      :const:`re.LOCALE` can be used only with bytes patterns and is
7427db96d56Sopenharmony_ci      not compatible with :const:`re.ASCII`.
7437db96d56Sopenharmony_ci
7447db96d56Sopenharmony_ci   .. versionchanged:: 3.7
7457db96d56Sopenharmony_ci      Compiled regular expression objects with the :const:`re.LOCALE` flag no
7467db96d56Sopenharmony_ci      longer depend on the locale at compile time.  Only the locale at
7477db96d56Sopenharmony_ci      matching time affects the result of matching.
7487db96d56Sopenharmony_ci
7497db96d56Sopenharmony_ci
7507db96d56Sopenharmony_ci.. data:: M
7517db96d56Sopenharmony_ci          MULTILINE
7527db96d56Sopenharmony_ci
7537db96d56Sopenharmony_ci   When specified, the pattern character ``'^'`` matches at the beginning of the
7547db96d56Sopenharmony_ci   string and at the beginning of each line (immediately following each newline);
7557db96d56Sopenharmony_ci   and the pattern character ``'$'`` matches at the end of the string and at the
7567db96d56Sopenharmony_ci   end of each line (immediately preceding each newline).  By default, ``'^'``
7577db96d56Sopenharmony_ci   matches only at the beginning of the string, and ``'$'`` only at the end of the
7587db96d56Sopenharmony_ci   string and immediately before the newline (if any) at the end of the string.
7597db96d56Sopenharmony_ci   Corresponds to the inline flag ``(?m)``.
7607db96d56Sopenharmony_ci
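The effect of :const:`MULTILINE` on ``'^'`` and ``'$'`` can be sketched as follows. The sample text is illustrative.

```python
import re

text = 'one\ntwo\n'

# Default: ^ matches only at the start of the string, and $ only at the
# end or just before the final newline.
first_default = re.findall(r'^\w+', text)
last_default = re.findall(r'\w+$', text)

# MULTILINE: ^ and $ also match at the start and end of every line.
first_multi = re.findall(r'^\w+', text, re.MULTILINE)
last_multi = re.findall(r'\w+$', text, re.MULTILINE)
```
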
7617db96d56Sopenharmony_ci.. data:: NOFLAG
7627db96d56Sopenharmony_ci
   Indicates that no flag is applied; the value is ``0``.  This flag may be used
   as a default value for a function keyword argument or as a base value that
   will be conditionally ORed with other flags.  Example of use as a default
   value::

      def myfunc(pattern, text, flag=re.NOFLAG):
          return re.match(pattern, text, flag)
7707db96d56Sopenharmony_ci
7717db96d56Sopenharmony_ci   .. versionadded:: 3.11
7727db96d56Sopenharmony_ci
7737db96d56Sopenharmony_ci.. data:: S
7747db96d56Sopenharmony_ci          DOTALL
7757db96d56Sopenharmony_ci
7767db96d56Sopenharmony_ci   Make the ``'.'`` special character match any character at all, including a
7777db96d56Sopenharmony_ci   newline; without this flag, ``'.'`` will match anything *except* a newline.
7787db96d56Sopenharmony_ci   Corresponds to the inline flag ``(?s)``.
7797db96d56Sopenharmony_ci
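A minimal sketch of the :const:`DOTALL` behaviour, using an illustrative three-character string:

```python
import re

text = 'a\nb'

without_flag = re.match(r'a.b', text)             # '.' does not match '\n'
with_flag = re.match(r'a.b', text, re.DOTALL)     # '.' matches any character
inline_flag = re.match(r'(?s)a.b', text)          # inline equivalent
```
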
7807db96d56Sopenharmony_ci
7817db96d56Sopenharmony_ci.. data:: X
7827db96d56Sopenharmony_ci          VERBOSE
7837db96d56Sopenharmony_ci
7847db96d56Sopenharmony_ci   .. index:: single: # (hash); in regular expressions
7857db96d56Sopenharmony_ci
7867db96d56Sopenharmony_ci   This flag allows you to write regular expressions that look nicer and are
7877db96d56Sopenharmony_ci   more readable by allowing you to visually separate logical sections of the
7887db96d56Sopenharmony_ci   pattern and add comments. Whitespace within the pattern is ignored, except
7897db96d56Sopenharmony_ci   when in a character class, or when preceded by an unescaped backslash,
7907db96d56Sopenharmony_ci   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``. For example, ``(? :``
7917db96d56Sopenharmony_ci   and ``* ?`` are not allowed.
7927db96d56Sopenharmony_ci   When a line contains a ``#`` that is not in a character class and is not
7937db96d56Sopenharmony_ci   preceded by an unescaped backslash, all characters from the leftmost such
7947db96d56Sopenharmony_ci   ``#`` through the end of the line are ignored.
7957db96d56Sopenharmony_ci
7967db96d56Sopenharmony_ci   This means that the two following regular expression objects that match a
7977db96d56Sopenharmony_ci   decimal number are functionally equal::
7987db96d56Sopenharmony_ci
7997db96d56Sopenharmony_ci      a = re.compile(r"""\d +  # the integral part
8007db96d56Sopenharmony_ci                         \.    # the decimal point
8017db96d56Sopenharmony_ci                         \d *  # some fractional digits""", re.X)
8027db96d56Sopenharmony_ci      b = re.compile(r"\d+\.\d*")
8037db96d56Sopenharmony_ci
8047db96d56Sopenharmony_ci   Corresponds to the inline flag ``(?x)``.
8057db96d56Sopenharmony_ci
8067db96d56Sopenharmony_ci
8077db96d56Sopenharmony_ciFunctions
8087db96d56Sopenharmony_ci^^^^^^^^^
8097db96d56Sopenharmony_ci
8107db96d56Sopenharmony_ci.. function:: compile(pattern, flags=0)
8117db96d56Sopenharmony_ci
8127db96d56Sopenharmony_ci   Compile a regular expression pattern into a :ref:`regular expression object
8137db96d56Sopenharmony_ci   <re-objects>`, which can be used for matching using its
8147db96d56Sopenharmony_ci   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
8157db96d56Sopenharmony_ci   below.
8167db96d56Sopenharmony_ci
8177db96d56Sopenharmony_ci   The expression's behaviour can be modified by specifying a *flags* value.
8187db96d56Sopenharmony_ci   Values can be any of the following variables, combined using bitwise OR (the
8197db96d56Sopenharmony_ci   ``|`` operator).
8207db96d56Sopenharmony_ci
8217db96d56Sopenharmony_ci   The sequence ::
8227db96d56Sopenharmony_ci
8237db96d56Sopenharmony_ci      prog = re.compile(pattern)
8247db96d56Sopenharmony_ci      result = prog.match(string)
8257db96d56Sopenharmony_ci
8267db96d56Sopenharmony_ci   is equivalent to ::
8277db96d56Sopenharmony_ci
8287db96d56Sopenharmony_ci      result = re.match(pattern, string)
8297db96d56Sopenharmony_ci
8307db96d56Sopenharmony_ci   but using :func:`re.compile` and saving the resulting regular expression
8317db96d56Sopenharmony_ci   object for reuse is more efficient when the expression will be used several
8327db96d56Sopenharmony_ci   times in a single program.
8337db96d56Sopenharmony_ci
8347db96d56Sopenharmony_ci   .. note::
8357db96d56Sopenharmony_ci
8367db96d56Sopenharmony_ci      The compiled versions of the most recent patterns passed to
8377db96d56Sopenharmony_ci      :func:`re.compile` and the module-level matching functions are cached, so
8387db96d56Sopenharmony_ci      programs that use only a few regular expressions at a time needn't worry
8397db96d56Sopenharmony_ci      about compiling regular expressions.
8407db96d56Sopenharmony_ci
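As an illustration of combining flags and reusing a compiled object, consider the sketch below. The pattern and strings are arbitrary examples.

```python
import re

flags = re.IGNORECASE | re.MULTILINE    # flags are combined with bitwise OR
prog = re.compile(r'^hello', flags)

# The compiled object can be reused for several operations.
all_lines = prog.findall('Hello\nhello world')
via_compiled = prog.match('HELLO there')

# The module-level function gives the same result for a single call.
via_module = re.match(r'^hello', 'HELLO there', flags)
```
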
8417db96d56Sopenharmony_ci
8427db96d56Sopenharmony_ci.. function:: search(pattern, string, flags=0)
8437db96d56Sopenharmony_ci
8447db96d56Sopenharmony_ci   Scan through *string* looking for the first location where the regular expression
8457db96d56Sopenharmony_ci   *pattern* produces a match, and return a corresponding :ref:`match object
8467db96d56Sopenharmony_ci   <match-objects>`.  Return ``None`` if no position in the string matches the
8477db96d56Sopenharmony_ci   pattern; note that this is different from finding a zero-length match at some
8487db96d56Sopenharmony_ci   point in the string.
8497db96d56Sopenharmony_ci
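The distinction between ``None`` and a zero-length match is easy to miss. The sketch below shows both outcomes; the names are illustrative.

```python
import re

# 'x*' can match the empty string, so a match object is returned.
empty = re.search(r'x*', 'abc')
empty_span = empty.span()

# 'x+' needs at least one 'x', so there is no match at all.
nothing = re.search(r'x+', 'abc')

# search() reports the first location where the pattern matches.
first = re.search(r'b+', 'abbcbbb')
```
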
8507db96d56Sopenharmony_ci
8517db96d56Sopenharmony_ci.. function:: match(pattern, string, flags=0)
8527db96d56Sopenharmony_ci
8537db96d56Sopenharmony_ci   If zero or more characters at the beginning of *string* match the regular
8547db96d56Sopenharmony_ci   expression *pattern*, return a corresponding :ref:`match object
8557db96d56Sopenharmony_ci   <match-objects>`.  Return ``None`` if the string does not match the pattern;
8567db96d56Sopenharmony_ci   note that this is different from a zero-length match.
8577db96d56Sopenharmony_ci
8587db96d56Sopenharmony_ci   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
8597db96d56Sopenharmony_ci   at the beginning of the string and not at the beginning of each line.
8607db96d56Sopenharmony_ci
8617db96d56Sopenharmony_ci   If you want to locate a match anywhere in *string*, use :func:`search`
8627db96d56Sopenharmony_ci   instead (see also :ref:`search-vs-match`).
8637db96d56Sopenharmony_ci
8647db96d56Sopenharmony_ci
8657db96d56Sopenharmony_ci.. function:: fullmatch(pattern, string, flags=0)
8667db96d56Sopenharmony_ci
8677db96d56Sopenharmony_ci   If the whole *string* matches the regular expression *pattern*, return a
8687db96d56Sopenharmony_ci   corresponding :ref:`match object <match-objects>`.  Return ``None`` if the
8697db96d56Sopenharmony_ci   string does not match the pattern; note that this is different from a
8707db96d56Sopenharmony_ci   zero-length match.
8717db96d56Sopenharmony_ci
8727db96d56Sopenharmony_ci   .. versionadded:: 3.4
8737db96d56Sopenharmony_ci
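The three matching functions can be compared on the same input. This is a sketch with illustrative strings.

```python
import re

# match() anchors at the start of the string; fullmatch() must also
# reach the end of the string.
prefix = re.match(r'\d+', '123abc')
whole_miss = re.fullmatch(r'\d+', '123abc')
whole_hit = re.fullmatch(r'\d+', '123')

# A zero-length match is still a match object, not None.
zero_length = re.match(r'\d*', 'abc')

# Even with MULTILINE, match() only tries the start of the string,
# whereas search() can find the second line.
match_second_line = re.match(r'^b', 'a\nb', re.MULTILINE)
search_second_line = re.search(r'^b', 'a\nb', re.MULTILINE)
```
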
8747db96d56Sopenharmony_ci
8757db96d56Sopenharmony_ci.. function:: split(pattern, string, maxsplit=0, flags=0)
8767db96d56Sopenharmony_ci
8777db96d56Sopenharmony_ci   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
8787db96d56Sopenharmony_ci   used in *pattern*, then the text of all groups in the pattern are also returned
8797db96d56Sopenharmony_ci   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
8807db96d56Sopenharmony_ci   splits occur, and the remainder of the string is returned as the final element
8817db96d56Sopenharmony_ci   of the list. ::
8827db96d56Sopenharmony_ci
8837db96d56Sopenharmony_ci      >>> re.split(r'\W+', 'Words, words, words.')
8847db96d56Sopenharmony_ci      ['Words', 'words', 'words', '']
8857db96d56Sopenharmony_ci      >>> re.split(r'(\W+)', 'Words, words, words.')
8867db96d56Sopenharmony_ci      ['Words', ', ', 'words', ', ', 'words', '.', '']
8877db96d56Sopenharmony_ci      >>> re.split(r'\W+', 'Words, words, words.', 1)
8887db96d56Sopenharmony_ci      ['Words', 'words, words.']
8897db96d56Sopenharmony_ci      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
8907db96d56Sopenharmony_ci      ['0', '3', '9']
8917db96d56Sopenharmony_ci
8927db96d56Sopenharmony_ci   If there are capturing groups in the separator and it matches at the start of
8937db96d56Sopenharmony_ci   the string, the result will start with an empty string.  The same holds for
8947db96d56Sopenharmony_ci   the end of the string::
8957db96d56Sopenharmony_ci
8967db96d56Sopenharmony_ci      >>> re.split(r'(\W+)', '...words, words...')
8977db96d56Sopenharmony_ci      ['', '...', 'words', ', ', 'words', '...', '']
8987db96d56Sopenharmony_ci
8997db96d56Sopenharmony_ci   That way, separator components are always found at the same relative
9007db96d56Sopenharmony_ci   indices within the result list.
9017db96d56Sopenharmony_ci
9027db96d56Sopenharmony_ci   Empty matches for the pattern split the string only when not adjacent
9037db96d56Sopenharmony_ci   to a previous empty match.
9047db96d56Sopenharmony_ci
9057db96d56Sopenharmony_ci      >>> re.split(r'\b', 'Words, words, words.')
9067db96d56Sopenharmony_ci      ['', 'Words', ', ', 'words', ', ', 'words', '.']
9077db96d56Sopenharmony_ci      >>> re.split(r'\W*', '...words...')
9087db96d56Sopenharmony_ci      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
9097db96d56Sopenharmony_ci      >>> re.split(r'(\W*)', '...words...')
9107db96d56Sopenharmony_ci      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
9117db96d56Sopenharmony_ci
9127db96d56Sopenharmony_ci   .. versionchanged:: 3.1
9137db96d56Sopenharmony_ci      Added the optional flags argument.
9147db96d56Sopenharmony_ci
9157db96d56Sopenharmony_ci   .. versionchanged:: 3.7
      Added support for splitting on a pattern that could match an empty string.
9177db96d56Sopenharmony_ci

.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings or tuples.  The *string* is scanned left-to-right, and matches
   are returned in the order found.  Empty matches are included in the result.

   The result depends on the number of capturing groups in the pattern.
   If there are no groups, return a list of strings matching the whole
   pattern.  If there is exactly one group, return a list of strings
   matching that group.  If multiple groups are present, return a list
   of tuples of strings matching the groups.  Non-capturing groups do not
   affect the form of the result.

      >>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
      ['foot', 'fell', 'fastest']
      >>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
      [('width', '20'), ('height', '10')]

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.

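The way the shape of the :func:`findall` result follows the number of capturing
groups can be sketched as below.  The sample text and variable names are
illustrative, not part of the module.

```python
import re

text = 'set width=20 and height=10'

# No capturing groups: a list of whole matches.
no_groups = re.findall(r'\w+=\d+', text)

# Exactly one capturing group: a list of that group's strings.
one_group = re.findall(r'\w+=(\d+)', text)

# Several capturing groups: a list of tuples.
two_groups = re.findall(r'(\w+)=(\d+)', text)

# A non-capturing group does not change the shape of the result.
non_capturing = re.findall(r'(?:\w+)=(\d+)', text)

print(no_groups)      # ['width=20', 'height=10']
print(one_group)      # ['20', '10']
print(two_groups)     # [('width', '20'), ('height', '10')]
print(non_capturing)  # ['20', '10']
```

A list of 2-tuples such as ``two_groups`` can be passed straight to
``dict()`` when the first group is a unique key.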

.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
   all non-overlapping matches for the RE *pattern* in *string*.  The *string*
   is scanned left-to-right, and matches are returned in the order found.  Empty
   matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.

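Because :func:`finditer` yields match objects rather than strings, positions
are available alongside the matched text.  A small sketch, with illustrative
names:

```python
import re

text = 'which foot or hand fell fastest'

# Collect the span and text of every word starting with 'f'.
hits = [(m.start(), m.end(), m.group())
        for m in re.finditer(r'\bf[a-z]*', text)]

for start, end, word in hits:
    print(start, end, word)
# 6 10 foot
# 19 23 fell
# 24 31 fastest
```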

.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
   *string* is returned unchanged.  *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed.  That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth.  Unknown escapes of ASCII letters are reserved for future use and
   treated as errors.  Other unknown escapes such as ``\&`` are left alone.
   Backreferences, such as ``\6``, are replaced with the substring matched by
   group 6 in the pattern.  For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*.  The function takes a single :ref:`match object <match-objects>`
   argument, and returns the replacement string.  For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or a :ref:`pattern object <re-objects>`.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer.  If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
   ``'-a-b--d-'``.

   .. index:: single: \g; in regular expressions

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      are now errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      are now errors.

   .. versionchanged:: 3.7
      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.

   .. deprecated:: 3.11
      Group *id* containing anything except ASCII digits.
      Group *name* containing characters outside the ASCII range
      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` replacement strings.

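The replacement forms described for :func:`sub` can be compared side by side.
This is only a sketch; the sample strings and variable names are made up for
illustration.

```python
import re

# \g<name> refers to a named group.
swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)',
                 r'\g<last>, \g<first>',
                 'Isaac Newton')

# \g<1>0 is group 1 followed by a literal '0'; r'\10' would mean group 10.
times_ten = re.sub(r'(\d+)', r'\g<1>0', 'width 2 height 5')

# count limits how many occurrences are replaced.
limited = re.sub(r'\d+', '#', '1 2 3', count=2)

# A function receives each match object and returns the replacement text.
doubled = re.sub(r'\d+',
                 lambda m: str(int(m.group(0)) * 2),
                 '3 apples and 10 pears')

print(swapped)    # Newton, Isaac
print(times_ten)  # width 20 height 50
print(limited)    # # # 3
print(doubled)    # 6 apples and 20 pears
```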

.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

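The count returned by :func:`subn` is handy for checking whether anything
changed.  A brief sketch with illustrative names:

```python
import re

# Collapse every run of whitespace to a single space and count the runs.
cleaned, replacements = re.subn(r'\s+', ' ', 'too   many    spaces here')

# When nothing matches, the string comes back unchanged with a count of 0.
unchanged, none_made = re.subn(r'\d', '#', 'no digits')

print(cleaned, replacements)   # too many spaces here 3
print(unchanged, none_made)    # no digits 0
```

Note that the single space before ``here`` also counts as a substitution, even
though the text at that position is unchanged.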

.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it.  For example::

      >>> print(re.escape('https://www.python.org'))
      https://www\.python\.org

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped there.  For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.

   .. versionchanged:: 3.7
      Only characters that can have special meaning in a regular expression
      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
      ``"`"`` are no longer escaped.

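A short sketch contrasting the two cases covered under :func:`escape`:
escaping text used as a *pattern*, and doubling backslashes in text used as a
*replacement*.  The sample strings are hypothetical.

```python
import re

# Hypothetical user-supplied text containing regex metacharacters.
user_text = 'cost (USD) + tax?'
haystack = 'total cost (USD) + tax? yes'

# Used raw, the metacharacters change the meaning and nothing matches.
raw_hit = re.search(user_text, haystack)

# Escaped, the text is matched literally.
literal_hit = re.search(re.escape(user_text), haystack)

# For a replacement string, double the backslashes instead of calling
# re.escape(); otherwise '\n' and '\t' below would become control characters.
windows_path = r'C:\new\table'
safe_repl = windows_path.replace('\\', r'\\')
result = re.sub(r'PATH', safe_repl, 'dir=PATH')

print(raw_hit)               # None
print(literal_hit.group())   # cost (USD) + tax?
print(result)                # dir=C:\new\table
```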

.. function:: purge()

   Clear the regular expression cache.


Exceptions
^^^^^^^^^^

.. exception:: error(msg, pattern=None, pos=None)

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching.  It is never an
   error if a string contains no match for a pattern.  The error instance has
   the following additional attributes:

   .. attribute:: msg

      The unformatted error message.

   .. attribute:: pattern

      The regular expression pattern.

   .. attribute:: pos

      The index in *pattern* where compilation failed (may be ``None``).

   .. attribute:: lineno

      The line corresponding to *pos* (may be ``None``).

   .. attribute:: colno

      The column corresponding to *pos* (may be ``None``).

   .. versionchanged:: 3.5
      Added additional attributes.
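
A minimal sketch of catching :exc:`error` and reading its attributes when a
pattern fails to compile.  The invalid pattern and the variable name ``caught``
are chosen purely for illustration.

```python
import re

bad_pattern = r'(unclosed'
caught = None

try:
    re.compile(bad_pattern)
except re.error as exc:
    # Keep a reference; the name bound by "as" is cleared after the block.
    caught = exc

# The attributes describe what went wrong and where.
print(caught.msg)
print(caught.pattern, caught.pos, caught.lineno, caught.colno)
```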

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:

.. method:: Pattern.search(string[, pos[, endpos]])

   Scan through *string* looking for the first location where this regular
   expression produces a match, and return a corresponding :ref:`match object
   <match-objects>`.  Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``.  This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.search(string, 0, 50)`` is equivalent to
   ``rx.search(string[:50], 0)``. ::

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <re.Match object; span=(0, 1), match='d'>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

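The difference between *pos* and slicing, and the equivalence between *endpos*
and truncation, can be seen with anchored patterns.  This is a sketch; the
names are illustrative.

```python
import re

text = 'ab'

# '^' refers to the real start of the string, so starting the scan at
# index 1 does not make it match there.
anchored = re.compile(r'^b')
with_pos = anchored.search(text, 1)

# Slicing builds a new string whose real start is 'b'.
with_slice = anchored.search(text[1:])

# endpos behaves like truncation: '$' matches at endpos.
tail = re.compile(r'a$')
with_endpos = tail.search(text, 0, 1)
with_truncation = tail.search(text[:1], 0)

print(with_pos)                # None
print(with_slice.span())       # (0, 1)
print(with_endpos.span())      # (0, 1)
print(with_truncation.span())  # (0, 1)
```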

.. method:: Pattern.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :ref:`match object <match-objects>`.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :ref:`match object <match-objects>`.  Return ``None`` if the string does not
   match the pattern; note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as the full string does not match.
      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4


.. method:: Pattern.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: Pattern.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region as for :meth:`~Pattern.search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region as for :meth:`~Pattern.search`.

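A brief sketch of restricting :meth:`~Pattern.findall` and
:meth:`~Pattern.finditer` to a region with *pos* and *endpos*.  The sample text
is illustrative; note that reported spans stay relative to the full string.

```python
import re

word = re.compile(r'\w+')
text = 'alpha beta gamma delta'

# Only indices 6 through 15 are scanned, which covers 'beta gamma'.
words_in_region = word.findall(text, 6, 16)
spans_in_region = [m.span() for m in word.finditer(text, 6, 16)]

print(words_in_region)   # ['beta', 'gamma']
print(spans_in_region)   # [(6, 10), (11, 16)]
```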

.. method:: Pattern.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: Pattern.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.


.. attribute:: Pattern.flags

   The regex matching flags.  This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :data:`UNICODE` if the pattern is a Unicode string.


.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers.  The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.

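The pattern attributes above can be inspected on any compiled object.  A small
sketch, assuming a made-up pattern with one named and one unnamed group:

```python
import re

rx = re.compile(r'(?P<key>\w+)=(\d+)', re.IGNORECASE)

group_count = rx.groups               # number of capturing groups
name_to_index = dict(rx.groupindex)   # named groups only
source_text = rx.pattern              # the original pattern string

# flags combines the explicit flag with the implicit UNICODE flag
# that applies to str patterns.
has_ignorecase = bool(rx.flags & re.IGNORECASE)
has_unicode = bool(rx.flags & re.UNICODE)

print(group_count, name_to_index)    # 2 {'key': 1}
print(has_ignorecase, has_unicode)   # True True
```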

.. versionchanged:: 3.7
   Added support for :func:`copy.copy` and :func:`copy.deepcopy`.  Compiled
   regular expression objects are considered atomic.


.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.
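
A minimal sketch of :meth:`~Match.expand`, mixing named and numeric
backreferences with a character escape.  The template strings are illustrative.

```python
import re

m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Isaac Newton')

# Named and numeric backreferences can be mixed in one template.
reordered = m.expand(r'\g<last>, \1')

# Character escapes in the template are processed as well.
two_lines = m.expand(r'\2\n\1')

print(reordered)        # Newton, Isaac
print(repr(two_lines))  # 'Newton\nIsaac'
```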

.. method:: Match.group([group1, ...])

   Return one or more subgroups of the match.  If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group.  If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name.  If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``.  This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   Named groups are supported as well::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton")
      >>> m['first_name']
      'Isaac'
      >>> m['last_name']
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern.  The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match.  These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name.  The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``.  For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

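The *default* argument of :meth:`~Match.groupdict` works the same way as for
:meth:`~Match.groups`.  A sketch with an optional named group; the group names
``whole`` and ``frac`` are made up for illustration.

```python
import re

# The fractional part is optional, so 'frac' may not participate.
m = re.match(r'(?P<whole>\d+)(?:\.(?P<frac>\d+))?', '24')

without_default = m.groupdict()
with_default = m.groupdict('0')

print(without_default)   # {'whole': '24', 'frac': None}
print(with_default)      # {'whole': '24', 'frac': '0'}
```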

.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match.  For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.

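The three cases described for :meth:`~Match.start`, :meth:`~Match.end` and
:meth:`~Match.span` (a normal match, an empty match, and a group that did not
contribute) can be sketched as follows.  The optional ``(x)?`` group is added
here purely for illustration.

```python
import re

m = re.search(r'b(c?)(x)?', 'cba')

whole = m.span()      # the whole match
empty = m.span(1)     # group 1 matched the empty string
missing = m.span(2)   # group 2 did not contribute to the match

# A group number beyond the pattern's groups raises IndexError.
try:
    m.start(3)
    out_of_range = False
except IndexError:
    out_of_range = True

print(whole, empty, missing)   # (1, 2) (2, 2) (-1, -1)
print(out_of_range)            # True
```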

.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.

14397db96d56Sopenharmony_ci
14407db96d56Sopenharmony_ci.. attribute:: Match.re
14417db96d56Sopenharmony_ci
14427db96d56Sopenharmony_ci   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
14437db96d56Sopenharmony_ci   :meth:`~Pattern.search` method produced this match instance.
14447db96d56Sopenharmony_ci
14457db96d56Sopenharmony_ci
14467db96d56Sopenharmony_ci.. attribute:: Match.string
14477db96d56Sopenharmony_ci
14487db96d56Sopenharmony_ci   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
14497db96d56Sopenharmony_ci
14507db96d56Sopenharmony_ci
14517db96d56Sopenharmony_ci.. versionchanged:: 3.7
14527db96d56Sopenharmony_ci   Added support of :func:`copy.copy` and :func:`copy.deepcopy`.  Match objects
14537db96d56Sopenharmony_ci   are considered atomic.
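
The match object attributes described above can be inspected together in a
short session.  This is only a minimal sketch: the pattern, the group names
``word`` and ``number``, and the input string are made up for illustration.

```python
import re

# Hypothetical pattern and input, chosen only for illustration.
pattern = re.compile(r"(?P<word>[a-z]+)|(?P<number>\d+)")
text = "abc 123"

# Start looking at index 4 and do not go beyond index 7.
m = pattern.search(text, 4, 7)

print(m.span())          # (4, 7)
print(m.span("word"))    # (-1, -1): this group did not contribute
print(m.pos, m.endpos)   # 4 7
print(m.lastindex)       # 2
print(m.lastgroup)       # number
print(m.re is pattern)   # True
print(m.string)          # abc 123
```

Note that ``m.string`` is the whole string that was passed in, not just the
slice between ``pos`` and ``endpos``.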


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

   >>> pair = re.compile(r".*(.).*\1")
   >>> pair.match("717ak").group(1)
   '7'

   # Error because match() returns None when nothing matches,
   # and None doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'
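
Because a failed match returns ``None``, code that calls
:meth:`~Match.group` usually checks the result first.  The following is one
possible way to wrap the check; ``pair_rank`` is a hypothetical helper name
that is not part of the module.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_rank(hand):
    """Return the card that forms a pair in *hand*, or None if there is none.

    Hypothetical helper, shown only to illustrate the None check.
    """
    m = pair.match(hand)
    return m.group(1) if m is not None else None

print(pair_rank("717ak"))  # 7
print(pair_rank("718ak"))  # None
print(pair_rank("354aa"))  # a
```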


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`.  Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings.  The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings
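
Unlike :c:func:`scanf`, a regular expression only captures text, so the
numeric fields have to be converted explicitly.  A minimal sketch of applying
the expression above follows; the variable names are illustrative only.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

m = re.search(r"(\S+) - (\d+) errors, (\d+) warnings", line)
if m is not None:
    filename = m.group(1)
    # Groups are always strings; convert the counts by hand.
    errors, warnings = int(m.group(2)), int(m.group(3))
    print(filename, errors, warnings)
```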


.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers different primitive operations based on regular expressions:

+ :func:`re.match` checks for a match only at the beginning of the string
+ :func:`re.search` checks for a match anywhere in the string
  (this is what Perl does by default)
+ :func:`re.fullmatch` checks whether the entire string is a match


For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>
   >>> re.fullmatch("p.*n", "python") # Match
   <re.Match object; span=(0, 6), match='python'>
   >>> re.fullmatch("r.*n", "python") # No match

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match("X", "A\nB\nX", re.MULTILINE)  # No match
   >>> re.search("^X", "A\nB\nX", re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>
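
To compare the three operations side by side, one could run the same pattern
through each of them.  The ``compare`` helper below is hypothetical and exists
only for this sketch.

```python
import re

def compare(pattern, string, flags=0):
    """Report which of match/search/fullmatch succeed (hypothetical helper)."""
    return {
        "match": re.match(pattern, string, flags) is not None,
        "search": re.search(pattern, string, flags) is not None,
        "fullmatch": re.fullmatch(pattern, string, flags) is not None,
    }

print(compare("c", "abcdef"))
# {'match': False, 'search': True, 'fullmatch': False}
print(compare("abc", "abcdef"))
# {'match': True, 'search': True, 'fullmatch': False}
print(compare("abcdef", "abcdef"))
# {'match': True, 'search': True, 'fullmatch': True}
print(compare("^X", "A\nB\nX", re.MULTILINE))
# {'match': False, 'search': True, 'fullmatch': False}
```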


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern.  The
function is invaluable for converting textual data into data structures that can be
easily read and modified by Python as demonstrated in the following example that
creates a phonebook.

First, here is the input.  Normally it would come from a file; here we use
triple-quoted string syntax:

.. doctest::

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address.  We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
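
The split fields can then be stored in a data structure that supports lookup by
name.  The sketch below builds a plain dictionary; the ``phonebook`` layout and
its ``"phone"`` and ``"address"`` keys are assumptions made for illustration.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(":? ", entry, maxsplit=3)
    phonebook[f"{first} {last}"] = {"phone": phone, "address": address}

print(phonebook["Frank Burger"]["phone"])    # 925.541.7625
print(phonebook["Ross McFluff"]["address"])  # 155 Elm Street
```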


Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
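
Because the shuffle is random, each call gives a different result.  If
repeatable output is wanted, for instance in a test, one option is to draw from
a private :class:`random.Random` instance.  The ``munge`` wrapper and its
``seed`` parameter are assumptions of this sketch, not part of the example
above.

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word (hypothetical wrapper)."""
    rng = random.Random(seed)  # private generator, seeded for repeatability

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=0)
print(munged)
print(munged == munge(text, seed=0))  # True: same seed, same result
```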


Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly\b", text)
   ['carefully', 'quickly']
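
The shape of the list returned by :func:`findall` depends on how many capturing
groups the pattern contains.  The variations below reuse the same sentence; the
extra patterns are illustrative only.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# No groups: each item is the whole match.
print(re.findall(r"\w+ly\b", text))        # ['carefully', 'quickly']

# One group: each item is the text of that group.
print(re.findall(r"(\w+)ly\b", text))      # ['careful', 'quick']

# Several groups: each item is a tuple of group texts.
print(re.findall(r"(\w)\w*(ly)\b", text))  # [('c', 'ly'), ('q', 'ly')]
```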


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings.  Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly\b", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly
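
When the positions are needed later rather than printed, the match objects can
be reduced to plain tuples.  A minimal sketch, where the ``positions`` name is
only illustrative:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Keep (start, end, matched text) for every adverb.
positions = [(m.start(), m.end(), m.group())
             for m in re.finditer(r"\w+ly\b", text)]
print(positions)  # [(7, 16, 'carefully'), (40, 47, 'quickly')]

# Each (start, end) pair slices the original string back to the match.
for start, end, word in positions:
    print(text[start:end] == word)  # True
```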


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
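
Comparing string lengths makes the difference between the two notations
visible.  The short sketch below uses a made-up Windows-style path as input.

```python
import re

# A raw string keeps the backslash; a normal literal turns \n into a newline.
print(len(r"\n"), len("\n"))  # 2 1

# Both spellings produce the same two-character pattern string.
print(r"\\" == "\\\\")        # True
print(len(r"\\"))             # 2

# Splitting on a literal backslash (hypothetical input path).
path = r"C:\temp"
print(re.split(r"\\", path))  # ['C:', 'temp']
```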


Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters.  This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.  The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    from typing import NamedTuple
    import re

    class Token(NamedTuple):
        type: str
        value: str
        line: int
        column: int

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group()
            column = mo.start() - line_start
            if kind == 'NUMBER':
                value = float(value) if '.' in value else int(value)
            elif kind == 'ID' and value in keywords:
                kind = value
            elif kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
                continue
            elif kind == 'SKIP':
                continue
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(type='IF', value='IF', line=2, column=4)
    Token(type='ID', value='quantity', line=2, column=7)
    Token(type='THEN', value='THEN', line=2, column=16)
    Token(type='ID', value='total', line=3, column=8)
    Token(type='ASSIGN', value=':=', line=3, column=14)
    Token(type='ID', value='total', line=3, column=17)
    Token(type='OP', value='+', line=3, column=23)
    Token(type='ID', value='price', line=3, column=25)
    Token(type='OP', value='*', line=3, column=31)
    Token(type='ID', value='quantity', line=3, column=33)
    Token(type='END', value=';', line=3, column=41)
    Token(type='ID', value='tax', line=4, column=8)
    Token(type='ASSIGN', value=':=', line=4, column=12)
    Token(type='ID', value='price', line=4, column=15)
    Token(type='OP', value='*', line=4, column=21)
    Token(type='NUMBER', value=0.05, line=4, column=23)
    Token(type='END', value=';', line=4, column=27)
    Token(type='ENDIF', value='ENDIF', line=5, column=4)
    Token(type='END', value=';', line=5, column=9)
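
The core of the technique is that each alternative is a named group, so
:attr:`Match.lastgroup` reports which category matched.  The reduced sketch
below isolates that idea; its four-entry specification and the ``kinds``
function are hypothetical and much smaller than the tokenizer above.

```python
import re

# Hypothetical specification; order matters because the first
# alternative that matches at a position wins.
spec = [
    ("NUMBER", r"\d+"),
    ("OP", r"[+\-*/]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
master = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in spec))

def kinds(code):
    """Return (category, text) pairs for *code* (hypothetical helper)."""
    result = []
    for mo in master.finditer(code):
        if mo.lastgroup == "SKIP":
            continue
        if mo.lastgroup == "MISMATCH":
            raise ValueError(f"{mo.group()!r} unexpected at index {mo.start()}")
        result.append((mo.lastgroup, mo.group()))
    return result

print(kinds("12 + 34"))  # [('NUMBER', '12'), ('OP', '+'), ('NUMBER', '34')]
```

Keeping the catch-all ``MISMATCH`` entry last ensures that it only matches
characters that no earlier category accepts.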


.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.