.. _regex-howto:

****************************
  Regular Expression HOWTO
****************************

:Author: A.M. Kuchling <amk@amk.ca>

.. TODO:
   Document lookbehind assertions
   Better way of displaying a RE, a string, and what it matches
   Mention optional argument to match.groups()
   Unicode (at least a reference)


.. topic:: Abstract

   This document is an introductory tutorial to using regular expressions in Python
   with the :mod:`re` module. It provides a gentler introduction than the
   corresponding section in the Library Reference.


Introduction
============

Regular expressions (called REs, or regexes, or regex patterns) are essentially
a tiny, highly specialized programming language embedded inside Python and made
available through the :mod:`re` module. Using this little language, you specify
the rules for the set of possible strings that you want to match; this set might
contain English sentences, or e-mail addresses, or TeX commands, or anything you
like. You can then ask questions such as "Does this string match the pattern?",
or "Is there a match for the pattern anywhere in this string?".
You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are
then executed by a matching engine written in C. For advanced use, it may be
necessary to pay careful attention to how the engine will execute a given RE,
and write the RE in a certain way in order to produce bytecode that runs faster.
Optimization isn't covered in this document, because it requires that you have a
good understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so not all
possible string processing tasks can be done using regular expressions. There
are also tasks that *can* be done with regular expressions, but the expressions
turn out to be very complicated. In these cases, you may be better off writing
Python code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.


Simple Patterns
===============

We'll start by learning about the simplest possible regular expressions. Since
regular expressions are used to operate on strings, we'll begin with the most
common task: matching characters.

For a detailed explanation of the computer science underlying regular
expressions (deterministic and non-deterministic finite automata), you can refer
to almost any textbook on writing compilers.


Matching Characters
-------------------

Most letters and characters will simply match themselves. For example, the
regular expression ``test`` will match the string ``test`` exactly. (You can
enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST``
as well; more about this later.)

There are exceptions to this rule; some characters are special
:dfn:`metacharacters`, and don't match themselves. Instead, they signal that
some out-of-the-ordinary thing should be matched, or they affect other portions
of the RE by repeating them or changing their meaning. Much of this document is
devoted to discussing various metacharacters and what they do.

Here's a complete list of the metacharacters; their meanings will be discussed
in the rest of this HOWTO.

.. code-block:: none

   . ^ $ * + ? { } [ ] \ | ( )

The first metacharacters we'll look at are ``[`` and ``]``. They're used for
specifying a character class, which is a set of characters that you wish to
match.
Characters can be listed individually, or a range of characters can be
indicated by giving two characters and separating them by a ``'-'``. For
example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this
is the same as ``[a-c]``, which uses a range to express the same set of
characters. If you wanted to match only lowercase letters, your RE would be
``[a-z]``.

Metacharacters (except ``\``) are not active inside classes. For example, ``[akm$]`` will
match any of the characters ``'a'``, ``'k'``, ``'m'``, or ``'$'``; ``'$'`` is
usually a metacharacter, but inside a character class it's stripped of its
special nature.

You can match the characters not listed within the class by :dfn:`complementing`
the set. This is indicated by including a ``'^'`` as the first character of the
class. For example, ``[^5]`` will match any character except ``'5'``. If the
caret appears elsewhere in a character class, it does not have special meaning.
For example: ``[5^]`` will match either a ``'5'`` or a ``'^'``.

Perhaps the most important metacharacter is the backslash, ``\``. As in Python
string literals, the backslash can be followed by various characters to signal
various special sequences.
It's also used to escape all the metacharacters so
you can still match them in patterns; for example, if you need to match a ``[``
or ``\``, you can precede them with a backslash to remove their special
meaning: ``\[`` or ``\\``.

Some of the special sequences beginning with ``'\'`` represent
predefined sets of characters that are often useful, such as the set
of digits, the set of letters, or the set of anything that isn't
whitespace.

Let's take an example: ``\w`` matches any alphanumeric character. If
the regex pattern is expressed in bytes, this is equivalent to the
class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will
match all the characters marked as letters in the Unicode database
provided by the :mod:`unicodedata` module. You can use the more
restricted definition of ``\w`` in a string pattern by supplying the
:const:`re.ASCII` flag when compiling the regular expression.

The following list of special sequences isn't complete. For a complete
list of sequences and expanded class definitions for Unicode string
patterns, see the last part of :ref:`Regular Expression Syntax
<re-syntax>` in the Standard Library reference. In general, the
Unicode versions match any character that's in the appropriate
category in the Unicode database.

``\d``
   Matches any decimal digit; this is equivalent to the class ``[0-9]``.

``\D``
   Matches any non-digit character; this is equivalent to the class ``[^0-9]``.

``\s``
   Matches any whitespace character; this is equivalent to the class
   ``[ \t\n\r\f\v]``.

``\S``
   Matches any non-whitespace character; this is equivalent to the class
   ``[^ \t\n\r\f\v]``.

``\w``
   Matches any alphanumeric character; this is equivalent to the class
   ``[a-zA-Z0-9_]``.

``\W``
   Matches any non-alphanumeric character; this is equivalent to the class
   ``[^a-zA-Z0-9_]``.

These sequences can be included inside a character class. For example,
``[\s,.]`` is a character class that will match any whitespace character, or
``','`` or ``'.'``.

The final metacharacter in this section is ``.``. It matches anything except a
newline character, and there's an alternate mode (:const:`re.DOTALL`) where it will
match even a newline. ``.`` is often used where you want to match "any
character".
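
The character-class rules above can be checked interactively. The following is
a minimal sketch rather than part of the original HOWTO: the sample strings and
variable names are invented for illustration, and it uses :func:`re.findall`,
which is introduced later in this document, simply because it lists every
match.

```python
import re

# A range inside a class: each lowercase ASCII letter matches on its own.
lowercase = re.findall(r'[a-z]', "aZ-b9")

# '$' is an ordinary character inside a class.
literal_dollar = re.findall(r'[akm$]', "make $")

# A leading '^' complements the class; anywhere else it is a literal caret.
not_five = re.findall(r'[^5]', "1525")
five_or_caret = re.findall(r'[5^]', "5^6")

# Predefined sequences work on their own or inside a class.
digits = re.findall(r'\d', "a1b22")
separators = re.findall(r'[\s,.]', "a, b.c")

# In a str pattern, \w is Unicode-aware unless re.ASCII is supplied.
word_unicode = re.findall(r'\w', "\u00e9_1")
word_ascii = re.findall(r'\w', "\u00e9_1", re.ASCII)

# '.' skips newlines unless re.DOTALL is given.
dot_default = re.findall(r'.', "a\nb")
dot_all = re.findall(r'.', "a\nb", re.DOTALL)

print(lowercase, literal_dollar, not_five, five_or_caret)
print(digits, separators, word_unicode, word_ascii)
print(dot_default, dot_all)
```

Each call returns a list of one-character strings, so it is easy to see exactly
which characters a class accepts.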


Repeating Things
----------------

Being able to match varying sets of characters is the first thing regular
expressions can do that isn't already possible with the methods available on
strings. However, if that was the only additional capability of regexes, they
wouldn't be much of an advance. Another capability is that you can specify that
portions of the RE must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at is ``*``. ``*``
doesn't match the literal character ``'*'``; instead, it specifies that the
previous character can be matched zero or more times, instead of exactly once.

For example, ``ca*t`` will match ``'ct'`` (0 ``'a'`` characters), ``'cat'`` (1 ``'a'``),
``'caaat'`` (3 ``'a'`` characters), and so forth.

Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
engine will try to repeat it as many times as possible. If later portions of the
pattern don't match, the matching engine will then back up and try again with
fewer repetitions.

A step-by-step example will make this more obvious. Let's consider the
expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters
from the class ``[bcd]``, and finally ends with a ``'b'``.
Now imagine matching
this RE against the string ``'abcbd'``.

+------+-----------+---------------------------------+
| Step | Matched   | Explanation                     |
+======+===========+=================================+
| 1    | ``a``     | The ``a`` in the RE matches.    |
+------+-----------+---------------------------------+
| 2    | ``abcbd`` | The engine matches ``[bcd]*``,  |
|      |           | going as far as it can, which   |
|      |           | is to the end of the string.    |
+------+-----------+---------------------------------+
| 3    | *Failure* | The engine tries to match       |
|      |           | ``b``, but the current position |
|      |           | is at the end of the string, so |
|      |           | it fails.                       |
+------+-----------+---------------------------------+
| 4    | ``abcb``  | Back up, so that ``[bcd]*``     |
|      |           | matches one less character.     |
+------+-----------+---------------------------------+
| 5    | *Failure* | Try ``b`` again, but the        |
|      |           | current position is at the last |
|      |           | character, which is a ``'d'``.  |
+------+-----------+---------------------------------+
| 6    | ``abc``   | Back up again, so that          |
|      |           | ``[bcd]*`` is only matching     |
|      |           | ``bc``.                         |
+------+-----------+---------------------------------+
| 7    | ``abcb``  | Try ``b`` again.  This time     |
|      |           | the character at the            |
|      |           | current position is ``'b'``, so |
|      |           | it succeeds.                    |
+------+-----------+---------------------------------+

The end of the RE has now been reached, and it has matched ``'abcb'``. This
demonstrates how the matching engine goes as far as it can at first, and if no
match is found it will then progressively back up and retry the rest of the RE
again and again. It will back up until it has tried zero matches for
``[bcd]*``, and if that subsequently fails, the engine will conclude that the
string doesn't match the RE at all.

Another repeating metacharacter is ``+``, which matches one or more times. Pay
careful attention to the difference between ``*`` and ``+``; ``*`` matches
*zero* or more times, so whatever's being repeated may not be present at all,
while ``+`` requires at least *one* occurrence. To use a similar example,
``ca+t`` will match ``'cat'`` (1 ``'a'``), ``'caaat'`` (3 ``'a'``\ s), but won't
match ``'ct'``.

There are two more repeating operators or quantifiers. The question mark character, ``?``,
matches either once or zero times; you can think of it as marking something as
being optional. For example, ``home-?brew`` matches either ``'homebrew'`` or
``'home-brew'``.
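
The behaviour of ``*``, ``+`` and ``?`` described so far can be confirmed with
a few lines of Python. This is an illustrative sketch, not part of the original
text: the candidate strings and variable names are made up, and it relies on
:func:`re.fullmatch` (which succeeds only when the whole string matches) and
:func:`re.match`, both of which are discussed only later or not at all in this
HOWTO.

```python
import re

candidates = ("ct", "cat", "caaat")

# '*' allows zero or more repetitions, so 'ct' is accepted.
star = [s for s in candidates if re.fullmatch(r'ca*t', s)]

# '+' requires at least one repetition, so 'ct' is rejected.
plus = [s for s in candidates if re.fullmatch(r'ca+t', s)]

# '?' makes the hyphen optional, but does not allow two of them.
optional = [s for s in ("homebrew", "home-brew", "home--brew")
            if re.fullmatch(r'home-?brew', s)]

# Greedy matching with backtracking: [bcd]* first consumes 'bcbd',
# then gives characters back until the final 'b' can match.
greedy = re.match(r'a[bcd]*b', 'abcbd').group()

print(star, plus, optional, greedy)
```

The last line reproduces the outcome of the step-by-step table: the match ends
at the last ``'b'`` that still lets the whole pattern succeed.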

The most complicated quantifier is ``{m,n}``, where *m* and *n* are
decimal integers. This quantifier means there must be at least *m* repetitions,
and at most *n*. For example, ``a/{1,3}b`` will match ``'a/b'``, ``'a//b'``, and
``'a///b'``. It won't match ``'ab'``, which has no slashes, or ``'a////b'``, which
has four.

You can omit either *m* or *n*; in that case, a reasonable value is assumed for
the missing value. Omitting *m* is interpreted as a lower limit of 0, while
omitting *n* results in an upper bound of infinity.

Readers of a reductionist bent may notice that the three other quantifiers can
all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}``
is equivalent to ``+``, and ``{0,1}`` is the same as ``?``. It's better to use
``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier
to read.


Using Regular Expressions
=========================

Now that we've looked at some simple regular expressions, how do we actually use
them in Python? The :mod:`re` module provides an interface to the regular
expression engine, allowing you to compile REs into objects and then perform
matches with them.


Compiling Regular Expressions
-----------------------------

Regular expressions are compiled into pattern objects, which have
methods for various operations such as searching for pattern matches or
performing string substitutions. ::

   >>> import re
   >>> p = re.compile('ab*')
   >>> p
   re.compile('ab*')

:func:`re.compile` also accepts an optional *flags* argument, used to enable
various special features and syntax variations. We'll go over the available
settings later, but for now a single example will do::

   >>> p = re.compile('ab*', re.IGNORECASE)

The RE is passed to :func:`re.compile` as a string. REs are handled as strings
because regular expressions aren't part of the core Python language, and no
special syntax was created for expressing them. (There are applications that
don't need REs at all, so there's no need to bloat the language specification by
including them.) Instead, the :mod:`re` module is simply a C extension module
included with Python, just like the :mod:`socket` or :mod:`zlib` modules.

Putting REs in strings keeps the Python language simpler, but has one
disadvantage which is the topic of the next section.


.. _the-backslash-plague:

The Backslash Plague
--------------------

As stated earlier, regular expressions use the backslash character (``'\'``) to
indicate special forms or to allow special characters to be used without
invoking their special meaning. This conflicts with Python's usage of the same
character for the same purpose in string literals.

Let's say you want to write a RE that matches the string ``\section``, which
might be found in a LaTeX file. To figure out what to write in the program
code, start with the desired string to be matched. Next, you must escape any
backslashes and other metacharacters by preceding them with a backslash,
resulting in the string ``\\section``. The resulting string that must be passed
to :func:`re.compile` must be ``\\section``. However, to express this as a
Python string literal, both backslashes must be escaped *again*.

+-------------------+------------------------------------------+
| Characters        | Stage                                    |
+===================+==========================================+
| ``\section``      | Text string to be matched                |
+-------------------+------------------------------------------+
| ``\\section``     | Escaped backslash for :func:`re.compile` |
+-------------------+------------------------------------------+
| ``"\\\\section"`` | Escaped backslashes for a string literal |
+-------------------+------------------------------------------+

In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE
string, because the regular expression must be ``\\``, and each backslash must
be expressed as ``\\`` inside a regular Python string literal. In REs that
feature backslashes repeatedly, this leads to lots of repeated backslashes and
makes the resulting strings difficult to understand.

The solution is to use Python's raw string notation for regular expressions;
backslashes are not handled in any special way in a string literal prefixed with
``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``,
while ``"\n"`` is a one-character string containing a newline. Regular
expressions will often be written in Python code using this raw string notation.
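
The equivalence between the doubly escaped literal and the raw string can be
verified directly. The snippet below is a small sketch added for illustration;
the variable names and the sample LaTeX fragment are assumptions, not taken
from the original document.

```python
import re

# Hypothetical sample text containing one literal backslash.
latex_line = r"\section{Introduction}"

# The same regular expression, written two ways.
cooked = "\\\\section"   # every backslash escaped for the string literal
raw = r"\\section"       # raw string: backslashes reach the re module as-is

# Both literals produce an identical 9-character pattern string.
same_pattern = (cooked == raw)
pattern_length = len(raw)

# Either spelling matches the text '\section'.
matched = re.search(raw, latex_line).group()

# Raw strings leave '\n' as two characters; cooked strings make a newline.
raw_len, cooked_len = len(r"\n"), len("\n")

print(same_pattern, pattern_length, matched, raw_len, cooked_len)
```

Because the two literals are equal as strings, choosing the raw form changes
only how readable the source code is, not what the :mod:`re` module receives.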

In addition, special escape sequences that are valid in regular expressions,
but not valid as Python string literals, now result in a
:exc:`DeprecationWarning` and will eventually become a :exc:`SyntaxError`,
which means the sequences will be invalid if raw string notation or escaping
the backslashes isn't used.


+-------------------+------------------+
| Regular String    | Raw string       |
+===================+==================+
| ``"ab*"``         | ``r"ab*"``       |
+-------------------+------------------+
| ``"\\\\section"`` | ``r"\\section"`` |
+-------------------+------------------+
| ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"``  |
+-------------------+------------------+


Performing Matches
------------------

Once you have an object representing a compiled regular expression, what do you
do with it? Pattern objects have several methods and attributes.
Only the most significant ones will be covered here; consult the :mod:`re` docs
for a complete listing.

+------------------+-----------------------------------------------+
| Method/Attribute | Purpose                                       |
+==================+===============================================+
| ``match()``      | Determine if the RE matches at the beginning  |
|                  | of the string.                                |
+------------------+-----------------------------------------------+
| ``search()``     | Scan through a string, looking for any        |
|                  | location where this RE matches.               |
+------------------+-----------------------------------------------+
| ``findall()``    | Find all substrings where the RE matches, and |
|                  | return them as a list.                        |
+------------------+-----------------------------------------------+
| ``finditer()``   | Find all substrings where the RE matches, and |
|                  | return them as an :term:`iterator`.           |
+------------------+-----------------------------------------------+

:meth:`~re.Pattern.match` and :meth:`~re.Pattern.search` return ``None`` if no match can be found. If
they're successful, a :ref:`match object <match-objects>` instance is returned,
containing information about the match: where it starts and ends, the substring
it matched, and more.

You can learn about this by interactively experimenting with the :mod:`re`
module.
If you have :mod:`tkinter` available, you may also want to look at
:source:`Tools/demo/redemo.py`, a demonstration program included with the
Python distribution. It allows you to enter REs and strings, and displays
whether the RE matches or fails. :file:`redemo.py` can be quite useful when
trying to debug a complicated RE.

This HOWTO uses the standard Python interpreter for its examples. First, run the
Python interpreter, import the :mod:`re` module, and compile a RE::

   >>> import re
   >>> p = re.compile('[a-z]+')
   >>> p
   re.compile('[a-z]+')

Now, you can try matching various strings against the RE ``[a-z]+``. An empty
string shouldn't match at all, since ``+`` means 'one or more repetitions'.
:meth:`~re.Pattern.match` should return ``None`` in this case, which will cause the
interpreter to print no output. You can explicitly print the result of
:meth:`!match` to make this clear. ::

   >>> p.match("")
   >>> print(p.match(""))
   None

Now, let's try it on a string that it should match, such as ``tempo``. In this
case, :meth:`~re.Pattern.match` will return a :ref:`match object <match-objects>`, so you
should store the result in a variable for later use. ::

   >>> m = p.match('tempo')
   >>> m
   <re.Match object; span=(0, 5), match='tempo'>

Now you can query the :ref:`match object <match-objects>` for information
about the matching string. Match object instances
also have several methods and attributes; the most important ones are:

+------------------+--------------------------------------------+
| Method/Attribute | Purpose                                    |
+==================+============================================+
| ``group()``      | Return the string matched by the RE        |
+------------------+--------------------------------------------+
| ``start()``      | Return the starting position of the match  |
+------------------+--------------------------------------------+
| ``end()``        | Return the ending position of the match    |
+------------------+--------------------------------------------+
| ``span()``       | Return a tuple containing the (start, end) |
|                  | positions of the match                     |
+------------------+--------------------------------------------+

Trying these methods will soon clarify their meaning::

   >>> m.group()
   'tempo'
   >>> m.start(), m.end()
   (0, 5)
   >>> m.span()
   (0, 5)

:meth:`~re.Match.group` returns the substring that was matched
by the RE. :meth:`~re.Match.start`
and :meth:`~re.Match.end` return the starting and ending index of the match. :meth:`~re.Match.span`
returns both start and end indexes in a single tuple. Since the :meth:`~re.Pattern.match`
method only checks if the RE matches at the start of a string, :meth:`!start`
will always be zero. However, the :meth:`~re.Pattern.search` method of patterns
scans through the string, so the match may not start at zero in that
case. ::

   >>> print(p.match('::: message'))
   None
   >>> m = p.search('::: message'); print(m)
   <re.Match object; span=(4, 11), match='message'>
   >>> m.group()
   'message'
   >>> m.span()
   (4, 11)

In actual programs, the most common style is to store the
:ref:`match object <match-objects>` in a variable, and then check if it was
``None``. This usually looks like::

   p = re.compile( ... )
   m = p.match( 'string goes here' )
   if m:
       print('Match found: ', m.group())
   else:
       print('No match')

Two pattern methods return all of the matches for a pattern.
:meth:`~re.Pattern.findall` returns a list of matching strings::

   >>> p = re.compile(r'\d+')
   >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
   ['12', '11', '10']

The ``r`` prefix, making the literal a raw string literal, is needed in this
example because escape sequences in a normal "cooked" string literal that are
not recognized by Python, as opposed to regular expressions, now result in a
:exc:`DeprecationWarning` and will eventually become a :exc:`SyntaxError`.  See
:ref:`the-backslash-plague`.

:meth:`~re.Pattern.findall` has to create the entire list before it can be returned as the
result.  The :meth:`~re.Pattern.finditer` method returns a sequence of
:ref:`match object <match-objects>` instances as an :term:`iterator`::

   >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
   >>> iterator  #doctest: +ELLIPSIS
   <callable_iterator object at 0x...>
   >>> for match in iterator:
   ...     print(match.span())
   ...
   (0, 2)
   (22, 24)
   (29, 31)


Module-Level Functions
----------------------

You don't have to create a pattern object and call its methods; the
:mod:`re` module also provides top-level functions called :func:`~re.match`,
:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth.  These functions
take the same arguments as the corresponding pattern method with
the RE string added as the first argument, and still return either ``None`` or a
:ref:`match object <match-objects>` instance. ::

   >>> print(re.match(r'From\s+', 'Fromage amk'))
   None
   >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')  #doctest: +ELLIPSIS
   <re.Match object; span=(0, 5), match='From '>

Under the hood, these functions simply create a pattern object for you
and call the appropriate method on it.  They also store the compiled
object in a cache, so future calls using the same RE won't need to
parse the pattern again and again.

Should you use these module-level functions, or should you get the
pattern and call its methods yourself?  If you're accessing a regex
within a loop, pre-compiling it will save a few function calls.
Outside of loops, there's not much difference thanks to the internal
cache.
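
As a rough illustration of the two styles, the following sketch runs the same
``From\s+`` pattern through the module-level function and through a
precompiled pattern object.  The sample lines and the names ``lines``,
``module_hits``, ``from_re`` and ``compiled_hits`` are made up for this
example; both approaches should select the same lines. ::

   import re

   # Hypothetical sample data, not taken from a real mailbox.
   lines = ['From amk Thu May 14', 'Fromage amk', 'From  guido Fri May 15']

   # Module-level function: the pattern string is passed on every call and
   # the compiled form is looked up in the module's internal cache.
   module_hits = [line for line in lines if re.match(r'From\s+', line)]

   # Precompiled pattern: compile once, then call the method inside the loop.
   from_re = re.compile(r'From\s+')
   compiled_hits = [line for line in lines if from_re.match(line)]

   print(module_hits == compiled_hits)   # True
   print(compiled_hits)                  # the two lines starting with 'From '
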


Compilation Flags
-----------------

Compilation flags let you modify some aspects of how regular expressions work.
Flags are available in the :mod:`re` module under two names, a long name such as
:const:`IGNORECASE` and a short, one-letter form such as :const:`I`.  (If you're
familiar with Perl's pattern modifiers, the one-letter forms use the same
letters; the short form of :const:`re.VERBOSE` is :const:`re.X`, for example.)
Multiple flags can be specified by bitwise OR-ing them; ``re.I | re.M`` sets
both the :const:`I` and :const:`M` flags, for example.

Here's a table of the available flags, followed by a more detailed explanation
of each one.

+---------------------------------+--------------------------------------------+
| Flag                            | Meaning                                    |
+=================================+============================================+
| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
|                                 | ``\s`` and ``\d`` match only on ASCII      |
|                                 | characters with the respective property.   |
+---------------------------------+--------------------------------------------+
| :const:`DOTALL`, :const:`S`     | Make ``.`` match any character, including  |
|                                 | newlines.                                  |
+---------------------------------+--------------------------------------------+
| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches.               |
+---------------------------------+--------------------------------------------+
| :const:`LOCALE`, :const:`L`     | Do a locale-aware match.                   |
+---------------------------------+--------------------------------------------+
| :const:`MULTILINE`, :const:`M`  | Multi-line matching, affecting ``^`` and   |
|                                 | ``$``.                                     |
+---------------------------------+--------------------------------------------+
| :const:`VERBOSE`, :const:`X`    | Enable verbose REs, which can be organized |
| (for 'extended')                | more cleanly and understandably.           |
+---------------------------------+--------------------------------------------+


.. data:: I
          IGNORECASE
   :noindex:

   Perform case-insensitive matching; character classes and literal strings will
   match letters by ignoring case.  For example, ``[A-Z]`` will match lowercase
   letters, too. Full Unicode matching also works unless the :const:`ASCII`
   flag is used to disable non-ASCII matches.  When the Unicode patterns
   ``[a-z]`` or ``[A-Z]`` are used in combination with the :const:`IGNORECASE`
   flag, they will match the 52 ASCII letters and 4 additional non-ASCII
   letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
   Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and
   'K' (U+212A, Kelvin sign).  ``Spam`` will match ``'Spam'``, ``'spam'``,
   ``'spAM'``, or ``'ſpam'`` (the latter is matched only in Unicode mode).
   This lowercasing doesn't take the current locale into account;
   it will if you also set the :const:`LOCALE` flag.


.. data:: L
          LOCALE
   :noindex:

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching dependent
   on the current locale instead of the Unicode database.

   Locales are a feature of the C library intended to help in writing programs
   that take account of language differences.  For example, if you're
   processing encoded French text, you'd want to be able to write ``\w+`` to
   match words, but ``\w`` only matches the character class ``[A-Za-z0-9_]`` in
   bytes patterns; it won't match bytes corresponding to ``é`` or ``ç``.
   If your system is configured properly and a French locale is selected,
   certain C functions will tell the program that the byte corresponding to
   ``é`` should also be considered a letter.
   Setting the :const:`LOCALE` flag when compiling a regular expression will cause
   the resulting compiled object to use these C functions for ``\w``; this is
   slower, but also enables ``\w+`` to match French words as you'd expect.
   The use of this flag is discouraged in Python 3 because the locale mechanism
   is very unreliable: it only handles one "culture" at a time, and it only
   works with 8-bit locales.  Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.


.. data:: M
          MULTILINE
   :noindex:

   (``^`` and ``$`` haven't been explained yet;  they'll be introduced in section
   :ref:`more-metacharacters`.)

   Usually ``^`` matches only at the beginning of the string, and ``$`` matches
   only at the end of the string and immediately before the newline (if any) at the
   end of the string. When this flag is specified, ``^`` matches at the beginning
   of the string and at the beginning of each line within the string, immediately
   following each newline.  Similarly, the ``$`` metacharacter matches both at
   the end of the string and at the end of each line (immediately preceding each
   newline).


.. data:: S
          DOTALL
   :noindex:

   Makes the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: A
          ASCII
   :noindex:

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` perform ASCII-only
   matching instead of full Unicode matching. This is only meaningful for
   Unicode patterns, and is ignored for byte patterns.


.. data:: X
          VERBOSE
   :noindex:

   This flag allows you to write regular expressions that are more readable by
   granting you more flexibility in how you can format them.  When this flag has
   been specified, whitespace within the RE string is ignored, except when the
   whitespace is in a character class or preceded by an unescaped backslash; this
   lets you organize and indent the RE more clearly.  This flag also lets you put
   comments within a RE that will be ignored by the engine; comments are marked by
   a ``'#'`` that's neither in a character class nor preceded by an unescaped
   backslash.

   For example, here's a RE that uses :const:`re.VERBOSE`; see how much easier it
   is to read? ::

      charref = re.compile(r"""
       &[#]                # Start of a numeric entity reference
       (
           0[0-7]+         # Octal form
         | [0-9]+          # Decimal form
         | x[0-9a-fA-F]+   # Hexadecimal form
       )
       ;                   # Trailing semicolon
      """, re.VERBOSE)

   Without the verbose setting, the RE would look like this::

      charref = re.compile("&#(0[0-7]+"
                           "|[0-9]+"
                           "|x[0-9a-fA-F]+);")

   In the above example, Python's automatic concatenation of string literals has
   been used to break up the RE into smaller pieces, but it's still more difficult
   to understand than the version using :const:`re.VERBOSE`.


More Pattern Power
==================

So far we've only covered a part of the features of regular expressions.  In
this section, we'll cover some new metacharacters, and how to use groups to
retrieve portions of the text that was matched.


.. _more-metacharacters:

More Metacharacters
-------------------

There are some metacharacters that we haven't covered yet.  Most of them will be
covered in this section.

Some of the remaining metacharacters to be discussed are :dfn:`zero-width
assertions`.  They don't cause the engine to advance through the string;
instead, they consume no characters at all, and simply succeed or fail.  For
example, ``\b`` is an assertion that the current position is located at a word
boundary; the position isn't changed by the ``\b`` at all.  This means that
zero-width assertions should never be repeated, because if they match once at a
given location, they can obviously be matched an infinite number of times.

``|``
   Alternation, or the "or" operator.   If *A* and *B* are regular expressions,
   ``A|B`` will match any string that matches either *A* or *B*. ``|`` has very
   low precedence in order to make it work reasonably when you're alternating
   multi-character strings. ``Crow|Servo`` will match either ``'Crow'`` or ``'Servo'``,
   not ``'Cro'``, a ``'w'`` or an ``'S'``, and ``'ervo'``.

   To match a literal ``'|'``, use ``\|``, or enclose it inside a character class,
   as in ``[|]``.

``^``
   Matches at the beginning of lines.  Unless the :const:`MULTILINE` flag has been
   set, this will only match at the beginning of the string.  In :const:`MULTILINE`
   mode, this also matches immediately after each newline within the string.

   For example, if you wish to match the word ``From`` only at the beginning of a
   line, the RE to use is ``^From``. ::

      >>> print(re.search('^From', 'From Here to Eternity'))  #doctest: +ELLIPSIS
      <re.Match object; span=(0, 4), match='From'>
      >>> print(re.search('^From', 'Reciting From Memory'))
      None

   To match a literal ``'^'``, use ``\^``.

``$``
   Matches at the end of a line, which is defined as either the end of the string,
   or any location followed by a newline character.     ::

      >>> print(re.search('}$', '{block}'))  #doctest: +ELLIPSIS
      <re.Match object; span=(6, 7), match='}'>
      >>> print(re.search('}$', '{block} '))
      None
      >>> print(re.search('}$', '{block}\n'))  #doctest: +ELLIPSIS
      <re.Match object; span=(6, 7), match='}'>

   To match a literal ``'$'``, use ``\$`` or enclose it inside a character class,
   as in  ``[$]``.

``\A``
   Matches only at the start of the string.  When not in :const:`MULTILINE` mode,
   ``\A`` and ``^`` are effectively the same.  In :const:`MULTILINE` mode, they're
   different: ``\A`` still matches only at the beginning of the string, but ``^``
   may match at any location inside the string that follows a newline character.

``\Z``
   Matches only at the end of the string.

``\b``
   Word boundary.  This is a zero-width assertion that matches only at the
   beginning or end of a word.  A word is defined as a sequence of alphanumeric
   characters, so the end of a word is indicated by whitespace or a
   non-alphanumeric character.

   The following example matches ``class`` only when it's a complete word; it won't
   match when it's contained inside another word. ::

      >>> p = re.compile(r'\bclass\b')
      >>> print(p.search('no class at all'))
      <re.Match object; span=(3, 8), match='class'>
      >>> print(p.search('the declassified algorithm'))
      None
      >>> print(p.search('one subclass is'))
      None

   There are two subtleties you should remember when using this special sequence.
   First, this is the worst collision between Python's string literals and regular
   expression sequences.  In Python's string literals, ``\b`` is the backspace
   character, ASCII value 8.  If you're not using raw strings, then Python will
   convert the ``\b`` to a backspace, and your RE won't match as you expect it to.
   The following example looks the same as our previous RE, but omits the ``'r'``
   in front of the RE string. ::

      >>> p = re.compile('\bclass\b')
      >>> print(p.search('no class at all'))
      None
      >>> print(p.search('\b' + 'class' + '\b'))
      <re.Match object; span=(0, 7), match='\x08class\x08'>

   Second, inside a character class, where there's no use for this assertion,
   ``\b`` represents the backspace character, for compatibility with Python's
   string literals.

``\B``
   Another zero-width assertion, this is the opposite of ``\b``, only matching when
   the current position is not at a word boundary.


Grouping
--------

Frequently you need to obtain more information than just whether the RE matched
or not.  Regular expressions are often used to dissect strings by writing a RE
divided into several subgroups which match different components of interest.
For example, an RFC-822 header line is divided into a header name and a value,
separated by a ``':'``, like this:

.. code-block:: none

   From: author@example.com
   User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
   MIME-Version: 1.0
   To: editor@example.com

This can be handled by writing a regular expression which matches an entire
header line, and has one group which matches the header name, and another group
which matches the header's value.

Groups are marked by the ``'('``, ``')'`` metacharacters. ``'('`` and ``')'``
have much the same meaning as they do in mathematical expressions; they group
together the expressions contained inside them, and you can repeat the contents
of a group with a quantifier, such as ``*``, ``+``, ``?``, or
``{m,n}``.  For example, ``(ab)*`` will match zero or more repetitions of
``ab``. ::

   >>> p = re.compile('(ab)*')
   >>> print(p.match('ababababab').span())
   (0, 10)

Groups indicated with ``'('``, ``')'`` also capture the starting and ending
index of the text that they match; this can be retrieved by passing an argument
to :meth:`~re.Match.group`, :meth:`~re.Match.start`, :meth:`~re.Match.end`, and
:meth:`~re.Match.span`.  Groups are
numbered starting with 0.  Group 0 is always present; it's the whole RE, so
:ref:`match object <match-objects>` methods all have group 0 as their default
argument.
Later we'll see how to express groups that don't capture the span
of text that they match. ::

   >>> p = re.compile('(a)b')
   >>> m = p.match('ab')
   >>> m.group()
   'ab'
   >>> m.group(0)
   'ab'

Subgroups are numbered from left to right, from 1 upward.  Groups can be nested;
to determine the number, just count the opening parenthesis characters, going
from left to right. ::

   >>> p = re.compile('(a(b)c)d')
   >>> m = p.match('abcd')
   >>> m.group(0)
   'abcd'
   >>> m.group(1)
   'abc'
   >>> m.group(2)
   'b'

:meth:`~re.Match.group` can be passed multiple group numbers at a time, in which case it
will return a tuple containing the corresponding values for those groups. ::

   >>> m.group(2,1,2)
   ('b', 'abc', 'b')

The :meth:`~re.Match.groups` method returns a tuple containing the strings for all the
subgroups, from 1 up to however many there are. ::

   >>> m.groups()
   ('abc', 'b')

Backreferences in a pattern allow you to specify that the contents of an earlier
capturing group must also be found at the current location in the string.
For
example, ``\1`` will succeed if the exact contents of group 1 can be found at
the current position, and fails otherwise.  Remember that Python's string
literals also use a backslash followed by numbers to allow including arbitrary
characters in a string, so be sure to use a raw string when incorporating
backreferences in a RE.

For example, the following RE detects doubled words in a string. ::

   >>> p = re.compile(r'\b(\w+)\s+\1\b')
   >>> p.search('Paris in the the spring').group()
   'the the'

Backreferences like this aren't often useful for just searching through a string
--- there are few text formats which repeat data in this way --- but you'll soon
find out that they're *very* useful when performing string substitutions.


Non-capturing and Named Groups
------------------------------

Elaborate REs may use many groups, both to capture substrings of interest, and
to group and structure the RE itself.  In complex REs, it becomes difficult to
keep track of the group numbers.  There are two features which help with this
problem.  Both of them use a common syntax for regular expression extensions, so
we'll look at that first.

Perl 5 is well known for its powerful additions to standard regular expressions.
For these new features the Perl developers couldn't choose new single-keystroke metacharacters
or new special sequences beginning with ``\`` without making Perl's regular
expressions confusingly different from standard REs.  If they chose ``&`` as a
new metacharacter, for example, old expressions would be assuming that ``&`` was
a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.

The solution chosen by the Perl developers was to use ``(?...)`` as the
extension syntax.  ``?`` immediately after a parenthesis was a syntax error
because the ``?`` would have nothing to repeat, so this didn't introduce any
compatibility problems.  The characters immediately after the ``?``  indicate
what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
assertion) and ``(?:foo)`` is something else (a non-capturing group containing
the subexpression ``foo``).

Python supports several of Perl's extensions and adds an extension
syntax to Perl's extension syntax.  If the first character after the
question mark is a ``P``, you know that it's an extension that's
specific to Python.

Now that we've looked at the general extension syntax, we can return
to the features that simplify working with groups in complex REs.
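
To make the extension syntax a little more concrete, here is a small sketch
that puts three ``(?...)`` forms side by side on the same input.  The input
string ``'foobar'`` and the variable names ``m1``, ``m2`` and ``m3`` are
arbitrary choices for illustration; each form is explained in more detail
later in this document. ::

   import re

   # (?:foo) is a non-capturing group: it consumes 'foo' but creates no group.
   m1 = re.match(r'(?:foo)bar', 'foobar')
   print(m1.group(), m1.groups())      # foobar ()

   # (?=foo) is a positive lookahead: it checks for 'foo' without consuming
   # it, so \w+ still starts matching at position 0.
   m2 = re.match(r'(?=foo)\w+', 'foobar')
   print(m2.group())                   # foobar

   # (?P<prefix>foo) uses the Python-specific 'P' extension: a named group.
   m3 = re.match(r'(?P<prefix>foo)bar', 'foobar')
   print(m3.group('prefix'))           # foo
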

Sometimes you'll want to use a group to denote a part of a regular expression,
but aren't interested in retrieving the group's contents. You can make this fact
explicit by using a non-capturing group: ``(?:...)``, where you can replace the
``...`` with any other regular expression. ::

   >>> m = re.match("([abc])+", "abc")
   >>> m.groups()
   ('c',)
   >>> m = re.match("(?:[abc])+", "abc")
   >>> m.groups()
   ()

Except for the fact that you can't retrieve the contents of what the group
matched, a non-capturing group behaves exactly the same as a capturing group;
you can put anything inside it, repeat it with a repetition metacharacter such
as ``*``, and nest it within other groups (capturing or non-capturing).
``(?:...)`` is particularly useful when modifying an existing pattern, since you
can add new groups without changing how all the other groups are numbered.  It
should be mentioned that there's no performance difference in searching between
capturing and non-capturing groups; neither form is any faster than the other.

A more significant feature is named groups: instead of referring to them by
numbers, groups can be referenced by a name.

The syntax for a named group is one of the Python-specific extensions:
``(?P<name>...)``.  *name* is, obviously, the name of the group.
Named groups
behave exactly like capturing groups, and additionally associate a name
with a group.  The :ref:`match object <match-objects>` methods that deal with
capturing groups all accept either integers that refer to the group by number
or strings that contain the desired group's name.  Named groups are still
given numbers, so you can retrieve information about a group in two ways::

   >>> p = re.compile(r'(?P<word>\b\w+\b)')
   >>> m = p.search( '(((( Lots of punctuation )))' )
   >>> m.group('word')
   'Lots'
   >>> m.group(1)
   'Lots'

Additionally, you can retrieve named groups as a dictionary with
:meth:`~re.Match.groupdict`::

   >>> m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe')
   >>> m.groupdict()
   {'first': 'Jane', 'last': 'Doe'}

Named groups are handy because they let you use easily remembered names, instead
of having to remember numbers.
Here's an example RE from the :mod:`imaplib`
module::

   InternalDate = re.compile(r'INTERNALDATE "'
           r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
           r'(?P<year>[0-9][0-9][0-9][0-9])'
           r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
           r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
           r'"')

It's obviously much easier to retrieve ``m.group('zonem')``, instead of having
to remember to retrieve group 9.

The syntax for backreferences in an expression such as ``(...)\1`` refers to the
number of the group. There's naturally a variant that uses the group name
instead of the number. This is another Python extension: ``(?P=name)`` indicates
that the contents of the group called *name* should again be matched at the
current point. The regular expression for finding doubled words,
``\b(\w+)\s+\1\b`` can also be written as ``\b(?P<word>\w+)\s+(?P=word)\b``::

   >>> p = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
   >>> p.search('Paris in the the spring').group()
   'the the'


Lookahead Assertions
--------------------

Another zero-width assertion is the lookahead assertion.
Lookahead assertions
are available in both positive and negative form, and look like this:

``(?=...)``
   Positive lookahead assertion. This succeeds if the contained regular
   expression, represented here by ``...``, successfully matches at the current
   location, and fails otherwise. But, once the contained expression has been
   tried, the matching engine doesn't advance at all; the rest of the pattern is
   tried right where the assertion started.

``(?!...)``
   Negative lookahead assertion. This is the opposite of the positive assertion;
   it succeeds if the contained expression *doesn't* match at the current position
   in the string.

To make this concrete, let's look at a case where a lookahead is useful.
Consider a simple pattern to match a filename and split it apart into a base
name and an extension, separated by a ``.``. For example, in ``news.rc``,
``news`` is the base name, and ``rc`` is the filename's extension.

The pattern to match this is quite simple:

``.*[.].*$``

Notice that the ``.`` needs to be treated specially because it's a
metacharacter, so it's inside a character class to only match that
specific character.
Also notice the trailing ``$``; this is added to
ensure that all the rest of the string must be included in the
extension. This regular expression matches ``foo.bar`` and
``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``.

Now, consider complicating the problem a bit; what if you want to match
filenames where the extension is not ``bat``? Some incorrect attempts:

``.*[.][^b].*$``

The first attempt above tries to exclude ``bat`` by requiring
that the first character of the extension is not a ``b``. This is wrong,
because the pattern also doesn't match ``foo.bar``.

``.*[.]([^b]..|.[^a].|..[^t])$``

The expression gets messier when you try to patch up the first solution by
requiring one of the following cases to match: the first character of the
extension isn't ``b``; the second character isn't ``a``; or the third character
isn't ``t``. This accepts ``foo.bar`` and rejects ``autoexec.bat``, but it
requires a three-letter extension and won't accept a filename with a two-letter
extension such as ``sendmail.cf``. We'll complicate the pattern again in an
effort to fix it.
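
As a quick check, the first two attempts can be run against a few of the
filenames mentioned above. This is only an illustrative sketch: the names
``first_attempt``, ``second_attempt``, ``filenames``, ``first_results`` and
``second_results`` are made up for this example, and ``match()`` is assumed
here because both patterns are meant to describe the whole filename.

```python
import re

# The first two (incorrect) attempts at "extension is not bat".
first_attempt = re.compile(r'.*[.][^b].*$')
second_attempt = re.compile(r'.*[.]([^b]..|.[^a].|..[^t])$')

# Desired outcome: accept foo.bar and sendmail.cf, reject autoexec.bat.
filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf']

# Map each filename to True if the pattern matches it, False otherwise.
first_results = {name: bool(first_attempt.match(name)) for name in filenames}
second_results = {name: bool(second_attempt.match(name)) for name in filenames}

print(first_results)
# {'foo.bar': False, 'autoexec.bat': False, 'sendmail.cf': True}
print(second_results)
# {'foo.bar': True, 'autoexec.bat': False, 'sendmail.cf': False}
```

Neither attempt gets all three filenames right: the first wrongly rejects
``foo.bar``, and the second wrongly rejects ``sendmail.cf``.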

``.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$``

In the third attempt, the second and third letters are both made optional in
order to allow matching extensions shorter than three characters, such as
``sendmail.cf``.

The pattern's getting really complicated now, which makes it hard to read and
understand. Worse, if the problem changes and you want to exclude both ``bat``
and ``exe`` as extensions, the pattern would get even more complicated and
confusing.

A negative lookahead cuts through all this confusion:

``.*[.](?!bat$)[^.]*$``

The negative lookahead means: if the expression ``bat$``
doesn't match at this point, try the rest of the pattern; if ``bat$`` does
match, the whole pattern will fail. The trailing ``$`` is required to ensure
that something like ``sample.batch``, where the extension only starts with
``bat``, will be allowed. The ``[^.]*`` makes sure that the pattern works
when there are multiple dots in the filename.

Excluding another filename extension is now easy; simply add it as an
alternative inside the assertion.
The following pattern excludes filenames that
end in either ``bat`` or ``exe``:

``.*[.](?!bat$|exe$)[^.]*$``


Modifying Strings
=================

Up to this point, we've simply performed searches against a static string.
Regular expressions are also commonly used to modify strings in various ways,
using the following pattern methods:

+------------------+-----------------------------------------------+
| Method/Attribute | Purpose                                       |
+==================+===============================================+
| ``split()``      | Split the string into a list, splitting it    |
|                  | wherever the RE matches                       |
+------------------+-----------------------------------------------+
| ``sub()``        | Find all substrings where the RE matches, and |
|                  | replace them with a different string          |
+------------------+-----------------------------------------------+
| ``subn()``       | Does the same thing as :meth:`!sub`, but      |
|                  | returns the new string and the number of      |
|                  | replacements                                  |
+------------------+-----------------------------------------------+


Splitting Strings
-----------------

The :meth:`~re.Pattern.split` method of a pattern splits a string apart
wherever the RE matches, returning a list of the pieces. It's similar to the
:meth:`~str.split` method of strings but provides much more generality in the
delimiters that you can split by; string :meth:`!split` only supports splitting by
whitespace or by a fixed string. As you'd expect, there's a module-level
:func:`re.split` function, too.


.. method:: .split(string [, maxsplit=0])
   :noindex:

   Split *string* by the matches of the regular expression. If capturing
   parentheses are used in the RE, then their contents will also be returned as
   part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit* splits
   are performed.

You can limit the number of splits made, by passing a value for *maxsplit*.
When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the
remainder of the string is returned as the final element of the list. In the
following example, the delimiter is any sequence of non-alphanumeric characters.
::

   >>> p = re.compile(r'\W+')
   >>> p.split('This is a test, short and sweet, of split().')
   ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
   >>> p.split('This is a test, short and sweet, of split().', 3)
   ['This', 'is', 'a', 'test, short and sweet, of split().']

Sometimes you're not only interested in what the text between delimiters is, but
also need to know what the delimiter was. If capturing parentheses are used in
the RE, then their values are also returned as part of the list. Compare the
following calls::

   >>> p = re.compile(r'\W+')
   >>> p2 = re.compile(r'(\W+)')
   >>> p.split('This... is a test.')
   ['This', 'is', 'a', 'test', '']
   >>> p2.split('This... is a test.')
   ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']

The module-level function :func:`re.split` adds the RE to be used as the first
argument, but is otherwise the same.
::

   >>> re.split(r'[\W]+', 'Words, words, words.')
   ['Words', 'words', 'words', '']
   >>> re.split(r'([\W]+)', 'Words, words, words.')
   ['Words', ', ', 'words', ', ', 'words', '.', '']
   >>> re.split(r'[\W]+', 'Words, words, words.', 1)
   ['Words', 'words, words.']


Search and Replace
------------------

Another common task is to find all the matches for a pattern, and replace them
with a different string. The :meth:`~re.Pattern.sub` method takes a replacement value,
which can be either a string or a function, and the string to be processed.

.. method:: .sub(replacement, string[, count=0])
   :noindex:

   Returns the string obtained by replacing the leftmost non-overlapping
   occurrences of the RE in *string* by the replacement *replacement*. If the
   pattern isn't found, *string* is returned unchanged.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. The default value of 0 means
   to replace all occurrences.

Here's a simple example of using the :meth:`~re.Pattern.sub` method.
It replaces colour
names with the word ``colour``::

   >>> p = re.compile('(blue|white|red)')
   >>> p.sub('colour', 'blue socks and red shoes')
   'colour socks and colour shoes'
   >>> p.sub('colour', 'blue socks and red shoes', count=1)
   'colour socks and red shoes'

The :meth:`~re.Pattern.subn` method does the same work, but returns a 2-tuple containing the
new string value and the number of replacements that were performed::

   >>> p = re.compile('(blue|white|red)')
   >>> p.subn('colour', 'blue socks and red shoes')
   ('colour socks and colour shoes', 2)
   >>> p.subn('colour', 'no colours at all')
   ('no colours at all', 0)

Empty matches are replaced only when they're not adjacent to a previous empty match.
::

   >>> p = re.compile('x*')
   >>> p.sub('-', 'abxd')
   '-a-b--d-'

If *replacement* is a string, any backslash escapes in it are processed. That
is, ``\n`` is converted to a single newline character, ``\r`` is converted to a
carriage return, and so forth. Unknown escapes such as ``\&`` are left alone.
Backreferences, such as ``\6``, are replaced with the substring matched by the
corresponding group in the RE.
This lets you incorporate portions of the
original text in the resulting replacement string.

This example matches the word ``section`` followed by a string enclosed in
``{``, ``}``, and changes ``section`` to ``subsection``::

   >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
   >>> p.sub(r'subsection{\1}','section{First} section{second}')
   'subsection{First} subsection{second}'

There's also a syntax for referring to named groups as defined by the
``(?P<name>...)`` syntax. ``\g<name>`` will use the substring matched by the
group named ``name``, and ``\g<number>`` uses the corresponding group number.
``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous in a
replacement string such as ``\g<2>0``. (``\20`` would be interpreted as a
reference to group 20, not a reference to group 2 followed by the literal
character ``'0'``.) The following substitutions are all equivalent, but use all
three variations of the replacement string.
::

   >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
   >>> p.sub(r'subsection{\1}','section{First}')
   'subsection{First}'
   >>> p.sub(r'subsection{\g<1>}','section{First}')
   'subsection{First}'
   >>> p.sub(r'subsection{\g<name>}','section{First}')
   'subsection{First}'

*replacement* can also be a function, which gives you even more control. If
*replacement* is a function, the function is called for every non-overlapping
occurrence of *pattern*. On each call, the function is passed a
:ref:`match object <match-objects>` argument for the match and can use this
information to compute the desired replacement string and return it.

In the following example, the replacement function translates decimals into
hexadecimal::

   >>> def hexrepl(match):
   ...     "Return the hex string for a decimal number"
   ...     value = int(match.group())
   ...     return hex(value)
   ...
   >>> p = re.compile(r'\d+')
   >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
   'Call 0xffd2 for printing, 0xc000 for user code.'

When using the module-level :func:`re.sub` function, the pattern is passed as
the first argument.
The pattern may be provided as an object or as a string; if
you need to specify regular expression flags, you must either use a
pattern object as the first parameter, or use embedded modifiers in the
pattern string, e.g. ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.


Common Problems
===============

Regular expressions are a powerful tool for some applications, but in some ways
their behaviour isn't intuitive and at times they don't behave the way you may
expect them to. This section will point out some of the most common pitfalls.


Use String Methods
------------------

Sometimes using the :mod:`re` module is a mistake. If you're matching a fixed
string, or a single character class, and you're not using any :mod:`re` features
such as the :const:`~re.IGNORECASE` flag, then the full power of regular expressions
may not be required. Strings have several methods for performing operations with
fixed strings and they're usually much faster, because the implementation is a
single small C loop that's been optimized for the purpose, instead of the large,
more generalized regular expression engine.

One example might be replacing a single fixed string with another one; for
example, you might replace ``word`` with ``deed``.
:func:`re.sub` seems like the
function to use for this, but consider the :meth:`~str.replace` method. Note that
:meth:`!replace` will also replace ``word`` inside words, turning ``swordfish``
into ``sdeedfish``, but the naive RE ``word`` would have done that, too. (To
avoid performing the substitution on parts of words, the pattern would have to
be ``\bword\b``, in order to require that ``word`` have a word boundary on
either side. This takes the job beyond :meth:`!replace`'s abilities.)

Another common task is deleting every occurrence of a single character from a
string or replacing it with another single character. You might do this with
something like ``re.sub('\n', ' ', S)``, but :meth:`~str.translate` is capable of
doing both tasks and will be faster than any regular expression operation can
be.

In short, before turning to the :mod:`re` module, consider whether your problem
can be solved with a faster and simpler string method.


match() versus search()
-----------------------

The :func:`~re.match` function only checks if the RE matches at the beginning of the
string while :func:`~re.search` will scan forward through the string for a match.
It's important to keep this distinction in mind.
Remember, :func:`!match` will
only report a successful match that starts at position 0; if the match wouldn't
start at position 0, :func:`!match` will *not* report it. ::

   >>> print(re.match('super', 'superstition').span())
   (0, 5)
   >>> print(re.match('super', 'insuperable'))
   None

On the other hand, :func:`~re.search` will scan forward through the string,
reporting the first match it finds. ::

   >>> print(re.search('super', 'superstition').span())
   (0, 5)
   >>> print(re.search('super', 'insuperable').span())
   (2, 7)

Sometimes you'll be tempted to keep using :func:`re.match`, and just add ``.*``
to the front of your RE. Resist this temptation and use :func:`re.search`
instead. The regular expression compiler does some analysis of REs in order to
speed up the process of looking for a match. One such analysis figures out what
the first character of a match must be; for example, a pattern starting with
``Crow`` must match starting with a ``'C'``. The analysis lets the engine
quickly scan through the string looking for the starting character, only trying
the full match if a ``'C'`` is found.

Adding ``.*`` defeats this optimization, requiring scanning to the end of the
string and then backtracking to find a match for the rest of the RE.
Use
:func:`re.search` instead.


Greedy versus Non-Greedy
------------------------

When repeating a regular expression, as in ``a*``, the resulting action is to
consume as much of the string as possible. This fact often bites you when
you're trying to match a pair of balanced delimiters, such as the angle brackets
surrounding an HTML tag. The naive pattern for matching a single HTML tag
doesn't work because of the greedy nature of ``.*``. ::

   >>> s = '<html><head><title>Title</title>'
   >>> len(s)
   32
   >>> print(re.match('<.*>', s).span())
   (0, 32)
   >>> print(re.match('<.*>', s).group())
   <html><head><title>Title</title>

The RE matches the ``'<'`` in ``'<html>'``, and the ``.*`` consumes the rest of
the string. There's still more left in the RE, though, and the ``>`` can't
match at the end of the string, so the regular expression engine has to
backtrack character by character until it finds a match for the ``>``. The
final match extends from the ``'<'`` in ``'<html>'`` to the ``'>'`` in
``'</title>'``, which isn't what you want.

In this case, the solution is to use the non-greedy quantifiers ``*?``, ``+?``,
``??``, or ``{m,n}?``, which match as *little* text as possible.
In the above
example, the ``'>'`` is tried immediately after the first ``'<'`` matches, and
when it fails, the engine advances a character at a time, retrying the ``'>'``
at every step. This produces just the right result::

   >>> print(re.match('<.*?>', s).group())
   <html>

(Note that parsing HTML or XML with regular expressions is painful.
Quick-and-dirty patterns will handle common cases, but HTML and XML have special
cases that will break the obvious regular expression; by the time you've written
a regular expression that handles all of the possible cases, the patterns will
be *very* complicated. Use an HTML or XML parser module for such tasks.)


Using re.VERBOSE
----------------

By now you've probably noticed that regular expressions are a very compact
notation, but they're not terribly readable. REs of moderate complexity can
become lengthy collections of backslashes, parentheses, and metacharacters,
making them difficult to read and understand.

For such REs, specifying the :const:`re.VERBOSE` flag when compiling the regular
expression can be helpful, because it allows you to format the regular
expression more clearly.

The ``re.VERBOSE`` flag has several effects.
Whitespace in the regular
expression that *isn't* inside a character class is ignored. This means that an
expression such as ``dog | cat`` is equivalent to the less readable ``dog|cat``,
but ``[a b]`` will still match the characters ``'a'``, ``'b'``, or a space. In
addition, you can also put comments inside a RE; comments extend from a ``#``
character to the next newline. When used with triple-quoted strings, this
enables REs to be formatted more neatly::

   pat = re.compile(r"""
    \s*                 # Skip leading whitespace
    (?P<header>[^:]+)   # Header name
    \s* :               # Whitespace, and a colon
    (?P<value>.*?)      # The header's value -- *? used to
                        # lose the following trailing whitespace
    \s*$                # Trailing whitespace to end-of-line
   """, re.VERBOSE)

This is far more readable than::

   pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")


Feedback
========

Regular expressions are a complicated topic. Did this document help you
understand them? Were there parts that were unclear, or problems you
encountered that weren't covered here? If so, please send suggestions for
improvements to the author.

The most complete book on regular expressions is almost certainly Jeffrey
Friedl's Mastering Regular Expressions, published by O'Reilly. Unfortunately,
it exclusively concentrates on Perl and Java's flavours of regular expressions,
and doesn't contain any Python material at all, so it won't be useful as a
reference for programming in Python. (The first edition covered Python's
now-removed :mod:`!regex` module, which won't help you much.) Consider checking
it out from your library.