:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re/`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning.
This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal. Also, please note that any invalid escape sequences in Python's
usage of the backslash in string literals now generate a :exc:`DeprecationWarning`
and in the future this will become a :exc:`SyntaxError`. This behaviour
will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline.  Usually patterns will be expressed in Python code using this raw
string notation.

It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
that don't require you to compile a regex object first, but miss some
fine-tuning parameters.

.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match AB.  This holds unless *A* or *B* contain low precedence
operations; boundary conditions between *A* and *B*; or have numbered group
references.  Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here.  For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows.
For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition operators or quantifiers (``*``, ``+``, ``?``, ``{m,n}``, etc) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.)  In the default mode, this matches any character except a newline.
   If the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` quantifiers are all :dfn:`greedy`; they match
   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``.  Adding ``?`` after the quantifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched.  Using the RE ``<.*?>`` will match
   only ``'<a>'``.

.. index::
   single: *+; in regular expressions
   single: ++; in regular expressions
   single: ?+; in regular expressions

``*+``, ``++``, ``?+``
   Like the ``'*'``, ``'+'``, and ``'?'`` quantifiers, those where ``'+'`` is
   appended also match as many times as possible.
   However, unlike the true greedy quantifiers, these do not allow
   back-tracking when the expression following it fails to match.
   These are known as :dfn:`possessive` quantifiers.
   For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
   all 4 ``'a'``\ s, but, when the final ``'a'`` is encountered, the
   expression is backtracked so that in the end the ``a*`` ends up matching
   3 ``'a'``\ s total, and the fourth ``'a'`` is matched by the final ``'a'``.
   However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
   match all 4 ``'a'``, but when the final ``'a'`` fails to find any more
   characters to match, the expression cannot be backtracked and will thus
   fail to match.
   ``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
   and ``(?>x?)`` correspondingly.

   .. versionadded:: 3.11

.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match.  For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible.  For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound.  As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible.  This is the
   non-greedy version of the previous quantifier.  For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``{m,n}+``
   Causes the resulting RE to match from *m* to *n* repetitions of the
   preceding RE, attempting to match as many repetitions as possible
   *without* establishing any backtracking points.
   This is the possessive version of the quantifier above.
   For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}+aa``
   attempts to match 5 ``'a'`` characters, then, requiring 2 more ``'a'``\ s,
   will need more characters than available and thus fail, while
   ``a{3,5}aa`` will match with ``a{3,5}`` capturing 5, then 4 ``'a'``\ s
   by backtracking and then the final 2 ``'a'``\ s are matched by the final
   ``aa`` in the pattern.
   ``x{m,n}+`` is equivalent to ``(?>x{m,n})``.

   .. versionadded:: 3.11

.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string.
   However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice.  This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters.  In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets.  For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set.  If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched.  For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``.  ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
     and parentheses.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future.
     This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``.  To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.

.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*.  An arbitrary number of REs can be separated by the
   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match.  In other words, the ``'|'`` operator is never
   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below.  To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise).  The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.)
   The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
   :func:`re.compile` function.  Flags should be used first in the
   expression string.

   .. versionchanged:: 3.11
      This construction can only be used at the start of the expression.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses.  Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for the part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``.  Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default).  In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

``(?>...)``
   Attempts to match ``...`` as if it was a separate regular expression, and
   if successful, continues to match the rest of the pattern following it.
   If the subsequent pattern fails to match, the stack can only be unwound
   to a point *before* the ``(?>...)`` because once exited, the expression,
   known as an :dfn:`atomic group`, has thrown away all stack points within
   itself.
   Thus, ``(?>.*).`` would never match anything because first the ``.*``
   would match all characters possible, then, having nothing left to match,
   the final ``.`` would fail to match.
   Since there are no stack points saved in the Atomic Group, and there is
   no stack point before it, the entire expression would thus fail to match.

   .. versionadded:: 3.11

.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*.  Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression.  A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts.  If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e.
   matching a string quoted with either single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

   .. deprecated:: 3.11
      Group *name* containing characters outside the ASCII range
      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` patterns.

.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

.. index:: single: (?=; in regular expressions

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
   called a :dfn:`lookahead assertion`.  For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next.  This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position.  This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
   Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length.  Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

.. _re-conditional-expression:
.. index:: single: (?(; in regular expressions

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.

   .. deprecated:: 3.11
      Group *id* containing anything except ASCII digits.
      Group *name* containing characters outside the ASCII range
      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` replacement strings.


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character.  For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number.  Groups are numbered
   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group).  This special sequence
   can only be used to match one of the first 99 groups.
   If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters.  Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag.  Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

.. index:: single: \B; in regular expressions

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag.  Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.

.. index:: single: \d; in regular expressions

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]).  This includes ``[0-9]``, and
      also many other digit characters.  If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.

.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]``.

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`)
      as well as the underscore (``_``).
      If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``.
      If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.

.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current locale
   nor the underscore.

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form.  If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference.  As for string literals, octal escapes are always at most
three digits in length.
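
As a rough illustration of the octal-versus-group rule above, the following
sketch checks how a few numeric escapes are interpreted. It uses only
standard ``re`` calls; the sample patterns and variable names are made up for
this example.

```python
import re

# Three octal digits: \101 is the character with octal value 101 ('A'),
# not a reference to group 101.
octal_match = re.fullmatch(r"\101", "A")

# A leading 0 also forces an octal escape: \0 is the NUL character.
nul_match = re.fullmatch(r"\0", "\x00")

# Otherwise the escape is a group reference: \1 repeats what group 1 matched.
backref_match = re.fullmatch(r"(ab)\1", "abab")

# Inside a character class, \b means backspace rather than a word boundary.
backspace_match = re.fullmatch(r"[\b]", "\x08")

print(all([octal_match, nul_match, backref_match, backspace_match]))
```

Each call returns a match object, so the final ``print`` is expected to show ``True``.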

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.

.. versionchanged:: 3.8
   The ``'\N{name}'`` escape sequence has been added. As in string literals,
   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled
form.


Flags
^^^^^

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.


.. class:: RegexFlag

   An :class:`enum.IntFlag` class containing the regex options listed below.

   .. versionadded:: 3.11 - added to ``__all__``

.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching.  This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
   Corresponds to the inline flag ``(?a)``.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).


.. data:: DEBUG

   Display debug information about the compiled expression.
   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
   non-ASCII matches.  The current locale does not change the effect of this
   flag unless the :const:`re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.
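
   A small sketch of the interaction between :const:`IGNORECASE` and
   :const:`ASCII` described above. The sample strings and variable names are
   chosen for this illustration only.

   ```python
   import re

   # Full Unicode case-insensitive matching applies to str patterns by default.
   unicode_hit = re.fullmatch("ü", "Ü", flags=re.IGNORECASE)

   # With re.ASCII, case-insensitive matching is limited to ASCII letters,
   # so the non-ASCII pair no longer matches.
   ascii_miss = re.fullmatch("ü", "Ü", flags=re.IGNORECASE | re.ASCII)

   # ASCII letters are still matched case-insensitively under re.ASCII.
   ascii_hit = re.fullmatch("[A-Z]+", "spam", flags=re.IGNORECASE | re.ASCII)

   print(unicode_hit is not None, ascii_miss is None, ascii_hit is not None)
   ```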

   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.

.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale.  This flag can be used only with bytes
   patterns.  The use of this flag is discouraged as the locale mechanism
   is very unreliable, it only handles one "culture" at a time, and it only
   works with 8-bit locales.  Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.
   Corresponds to the inline flag ``(?L)``.

   .. versionchanged:: 3.6
      :const:`re.LOCALE` can be used only with bytes patterns and is
      not compatible with :const:`re.ASCII`.

   .. versionchanged:: 3.7
      Compiled regular expression objects with the :const:`re.LOCALE` flag no
      longer depend on the locale at compile time.
      Only the locale at
      matching time affects the result of matching.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline).  By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
   Corresponds to the inline flag ``(?m)``.

.. data:: NOFLAG

   Indicates no flag being applied, the value is ``0``.  This flag may be used
   as a default value for a function keyword argument or as a base value that
   will be conditionally ORed with other flags.  Example of use as a default
   value::

      def myfunc(pattern, text, flag=re.NOFLAG):
          return re.match(pattern, text, flag)

   .. versionadded:: 3.11

.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
   Corresponds to the inline flag ``(?s)``.
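
To make the effect of :const:`MULTILINE` and :const:`DOTALL` concrete, here is
a minimal sketch on a two-line string. The sample text and variable names are
assumptions made for this illustration.

```python
import re

text = "first line\nsecond line"

# By default, '^' matches only at the very start of the string.
default_starts = re.findall(r"^\w+", text)

# With MULTILINE, '^' also matches immediately after each newline.
multiline_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Without DOTALL, '.' stops at the newline; with DOTALL it crosses it.
without_dotall = re.match(r".*", text).group(0)
with_dotall = re.match(r".*", text, flags=re.DOTALL).group(0)

# Flags are combined with bitwise OR.
combined = re.compile(r"^second.*", flags=re.MULTILINE | re.DOTALL)

print(default_starts)        # ['first']
print(multiline_starts)      # ['first', 'second']
print(repr(without_dotall))  # 'first line'
print(with_dotall == text)   # True
```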


.. data:: X
          VERBOSE

   .. index:: single: # (hash); in regular expressions

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``. For example, ``(? :``
   and ``* ?`` are not allowed.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")

   Corresponds to the inline flag ``(?x)``.


Functions
^^^^^^^^^

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :ref:`match object
   <match-objects>`.  Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :ref:`match object
   <match-objects>`.  Return ``None`` if the string does not match the pattern;
   note that this is different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: fullmatch(pattern, string, flags=0)

   If the whole *string* matches the regular expression *pattern*, return a
   corresponding :ref:`match object <match-objects>`.  Return ``None`` if the
   string does not match the pattern; note that this is different from a
   zero-length match.
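
   The differences between :func:`search`, :func:`match` and :func:`fullmatch`
   can be sketched as follows. The sample string and variable names are
   illustrative assumptions.

   ```python
   import re

   text = "spam and eggs"

   # search() scans the whole string for the first match.
   found = re.search(r"eggs", text)

   # match() only succeeds if the pattern matches at the start of the string.
   not_at_start = re.match(r"eggs", text)
   prefix = re.match(r"spam", text)

   # fullmatch() requires the pattern to cover the entire string.
   partial = re.fullmatch(r"spam", text)
   whole = re.fullmatch(r"spam.*", text)

   # A zero-length match is still a match object, which is different from None.
   empty = re.match(r"x*", text)

   print(found.span(), not_at_start, prefix.group(0), partial, whole.group(0))
   print(repr(empty.group(0)))
   ```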

   .. versionadded:: 3.4


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern are also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string.  The same holds for
   the end of the string::

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list.
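
   One way to take advantage of those fixed relative indices, sketched here
   for a separator with exactly one capturing group (the variable names are
   made up for this example), is to slice the result into fields and
   separators:

   ```python
   import re

   parts = re.split(r"(\W+)", "...words, words...")

   # With one capturing group, even indices hold the text between separators
   # and odd indices hold the captured separators themselves.
   fields = parts[0::2]
   separators = parts[1::2]

   print(fields)      # ['', 'words', 'words', '']
   print(separators)  # ['...', ', ', '...']
   ```

   With more than one capturing group the stride changes accordingly, so this
   slicing is only valid for the single-group case shown.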

   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.

      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
      >>> re.split(r'(\W*)', '...words...')
      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.7
      Added support of splitting on a pattern that could match an empty string.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings or tuples.  The *string* is scanned left-to-right, and matches
   are returned in the order found.  Empty matches are included in the result.

   The result depends on the number of capturing groups in the pattern.
   If there are no groups, return a list of strings matching the whole
   pattern.  If there is exactly one group, return a list of strings
   matching that group.  If multiple groups are present, return a list
   of tuples of strings matching the groups.  Non-capturing groups do not
   affect the form of the result.

      >>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
      ['foot', 'fell', 'fastest']
      >>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
      [('width', '20'), ('height', '10')]

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
   all non-overlapping matches for the RE *pattern* in *string*.  The *string*
   is scanned left-to-right, and matches are returned in the order found.  Empty
   matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
   *string* is returned unchanged.  *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed.  That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth.
   Unknown escapes of ASCII letters are reserved for future use and
   treated as errors.  Other unknown escapes such as ``\&`` are left alone.
   Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*.  The function takes a single :ref:`match object <match-objects>`
   argument, and returns the replacement string.  For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or a :ref:`pattern object <re-objects>`.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer.  If omitted or zero, all
   occurrences will be replaced.
   Empty matches for the pattern are replaced only
   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
   ``'-a-b--d-'``.

   .. index:: single: \g; in regular expressions

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      now are errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      now are errors.

   .. versionchanged:: 3.7
      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.

   .. deprecated:: 3.11
      Group *id* containing anything except ASCII digits.
      Group *name* containing characters outside the ASCII range
      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` replacement strings.


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.


.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it.
   For example::

      >>> print(re.escape('https://www.python.org'))
      https://www\.python\.org

      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; there, only backslashes should be escaped. For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.

   .. versionchanged:: 3.7
      Only characters that can have special meaning in a regular expression
      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
      ``"`"`` are no longer escaped.


.. function:: purge()

   Clear the regular expression cache.


Exceptions
^^^^^^^^^^

.. exception:: error(msg, pattern=None, pos=None)

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern. The error instance has
   the following additional attributes:

   .. attribute:: msg

      The unformatted error message.

   .. attribute:: pattern

      The regular expression pattern.

   .. attribute:: pos

      The index in *pattern* where compilation failed (may be ``None``).

   .. attribute:: lineno

      The line corresponding to *pos* (may be ``None``).

   .. attribute:: colno

      The column corresponding to *pos* (may be ``None``).

   .. versionchanged:: 3.5
      Added additional attributes.

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:

.. method:: Pattern.search(string[, pos[, endpos]])

   Scan through *string* looking for the first location where this regular
   expression produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``. This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match.
   If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.search(string, 0, 50)`` is equivalent to
   ``rx.search(string[:50], 0)``. ::

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")  # Match at index 0
      <re.Match object; span=(0, 1), match='d'>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


.. method:: Pattern.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :ref:`match object <match-objects>`.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")  # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)  # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :ref:`match object <match-objects>`. Return ``None`` if the string does not
   match the pattern; note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")  # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")  # No match as not the full string matches.
      >>> pattern.fullmatch("doggie", 1, 3)  # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4


.. method:: Pattern.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: Pattern.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region, as for :meth:`~Pattern.search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region, as for :meth:`~Pattern.search`.


.. method:: Pattern.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: Pattern.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.


.. attribute:: Pattern.flags

   The regex matching flags. This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :data:`UNICODE` if the pattern is a Unicode string.


.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
   regular expression objects are considered atomic.


.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

.. method:: Match.group([group1, ...])

   Returns one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)  # The entire match
      'Isaac Newton'
      >>> m.group(1)  # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)  # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)  # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)  # Returns only the last match.
      'c3'


.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``.
   This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]  # The entire match
      'Isaac Newton'
      >>> m[1]  # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]  # The second parenthesized subgroup.
      'Newton'

   Named groups are supported as well::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton")
      >>> m['first_name']
      'Isaac'
      >>> m['last_name']
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match.
   These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()  # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')  # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``. For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match.
   For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.


.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string at which the RE engine started looking for a match.
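
As an illustrative sketch, the position accessors described above (``start()``,
``end()``, ``span()`` and ``pos``) can be compared side by side. The pattern and
the sample string are made up for this example; the second group is optional
and does not take part in the match, so its indices are reported as ``-1``.

```python
import re

# Hypothetical pattern: a word, optionally followed by "=<digits>".
pattern = re.compile(r"(\w+)(=\d+)?")

# Start scanning at index 4, skipping the leading "set ".
m = pattern.search("set width", 4)

assert m.pos == 4                 # the pos argument passed to search()
assert m.span() == (4, 9)         # whole match: "width"
assert m.start(1) == 4 and m.end(1) == 9
assert m.group(2) is None         # group 2 did not participate
assert m.span(2) == (-1, -1)      # so its indices are reported as -1
# Slicing the original string with start/end reproduces the group text.
assert m.string[m.start(1):m.end(1)] == m.group(1)
```

Checking ``m.group(g) is None`` (or ``m.start(g) == -1``) before slicing avoids
silently using ``-1`` as a string index.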


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.


.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
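
The attributes above can be inspected together on a single match. The following
is a minimal sketch; the patterns and the sample text are assumptions chosen
only to make each attribute visible.

```python
import re

# Hypothetical pattern with one named and one unnamed group.
pattern = re.compile(r"(?P<key>\w+)=(\d+)")
text = "width=20 height=10"

# Restrict the search to the slice text[9:18].
m = pattern.search(text, 9, 18)

assert m.group() == "height=10"
assert (m.pos, m.endpos) == (9, 18)   # the limits passed to search()
assert m.lastindex == 2               # group 2 was the last group to match
assert m.lastgroup is None            # and group 2 has no name
assert m.re is pattern                # the pattern object that produced the match
assert m.string is text               # the full string, not only the searched slice

# With alternation, lastgroup names whichever branch matched.
alt = re.compile(r"(?P<word>[a-z]+)|(?P<num>\d+)")
m2 = alt.search("42")
assert m2.lastgroup == "num" and m2.lastindex == 2
```

Dispatching on ``lastgroup`` in this way is a common basis for small
tokenizers built from named alternatives.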


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))  # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, that is, two cards of the same value.
To match this with a regular expression, one could use backreferences as follows::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))  # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))  # No pairs.
   >>> displaymatch(pair.match("354aa"))  # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

   >>> pair = re.compile(r".*(.).*\1")
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


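The :exc:`AttributeError` shown above can be avoided by testing the result of
``match()`` before calling ``group()``. The sketch below is one way to do that;
``paired_card`` is a hypothetical helper introduced only for this illustration
and is not part of the :mod:`re` module.

```python
import re

pair = re.compile(r".*(.).*\1")

def paired_card(hand):
    """Return a card that appears at least twice in hand, or None."""
    match = pair.match(hand)
    # match is None when no card repeats, so guard before calling group().
    return match.group(1) if match else None

assert paired_card("717ak") == "7"
assert paired_card("718ak") is None
assert paired_card("354aa") == "a"
```

Returning ``None`` for a hand without a pair mirrors the convention used by
``match()`` itself, so callers can apply the same ``if`` test.
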

Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers different primitive operations based on regular expressions:

+ :func:`re.match` checks for a match only at the beginning of the string
+ :func:`re.search` checks for a match anywhere in the string
  (this is what Perl does by default)
+ :func:`re.fullmatch` checks whether the entire string is a match


For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>
   >>> re.fullmatch("p.*n", "python") # Match
   <re.Match object; span=(0, 6), match='python'>
   >>> re.fullmatch("r.*n", "python") # No match

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.
::

   >>> re.match("X", "A\nB\nX", re.MULTILINE)  # No match
   >>> re.search("^X", "A\nB\nX", re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern. The
method is invaluable for converting textual data into data structures that can be
easily read and modified by Python as demonstrated in the following example that
creates a phonebook.

First, here is the input. Normally it may come from a file; here we are using
triple-quoted string syntax:

.. doctest::

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function. This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.
For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly\b", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly\b", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.
For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>


Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters. This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.
The technique is
to combine those into a single master regular expression and to loop over
successive matches::

   from typing import NamedTuple
   import re

   class Token(NamedTuple):
       type: str
       value: str
       line: int
       column: int

   def tokenize(code):
       keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
       token_specification = [
           ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
           ('ASSIGN',   r':='),           # Assignment operator
           ('END',      r';'),            # Statement terminator
           ('ID',       r'[A-Za-z]+'),    # Identifiers
           ('OP',       r'[+\-*/]'),      # Arithmetic operators
           ('NEWLINE',  r'\n'),           # Line endings
           ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
           ('MISMATCH', r'.'),            # Any other character
       ]
       tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
       line_num = 1
       line_start = 0
       for mo in re.finditer(tok_regex, code):
           kind = mo.lastgroup
           value = mo.group()
           column = mo.start() - line_start
           if kind == 'NUMBER':
               value = float(value) if '.' in value else int(value)
           elif kind == 'ID' and value in keywords:
               kind = value
           elif kind == 'NEWLINE':
               line_start = mo.end()
               line_num += 1
               continue
           elif kind == 'SKIP':
               continue
           elif kind == 'MISMATCH':
               raise RuntimeError(f'{value!r} unexpected on line {line_num}')
           yield Token(kind, value, line_num, column)

   statements = '''
       IF quantity THEN
           total := total + price * quantity;
           tax := price * 0.05;
       ENDIF;
   '''

   for token in tokenize(statements):
       print(token)

The tokenizer produces the following output::

   Token(type='IF', value='IF', line=2, column=4)
   Token(type='ID', value='quantity', line=2, column=7)
   Token(type='THEN', value='THEN', line=2, column=16)
   Token(type='ID', value='total', line=3, column=8)
   Token(type='ASSIGN', value=':=', line=3, column=14)
   Token(type='ID', value='total', line=3, column=17)
   Token(type='OP', value='+', line=3, column=23)
   Token(type='ID', value='price', line=3, column=25)
   Token(type='OP', value='*', line=3, column=31)
   Token(type='ID', value='quantity', line=3, column=33)
   Token(type='END', value=';', line=3, column=41)
   Token(type='ID', value='tax', line=4, column=8)
   Token(type='ASSIGN', value=':=', line=4, column=12)
   Token(type='ID', value='price', line=4, column=15)
   Token(type='OP', value='*', line=4, column=21)
   Token(type='NUMBER', value=0.05, line=4, column=23)
   Token(type='END', value=';', line=4, column=27)
   Token(type='ENDIF', value='ENDIF', line=5, column=4)
   Token(type='END', value=';', line=5, column=9)


.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.