:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers", including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operator <operators>` and
:ref:`delimiter <delimiters>` tokens and :data:`Ellipsis` are returned using
the generic :data:`~token.OP` token type.  The exact
type can be determined by checking the ``exact_type`` property on the
:term:`named tuple` returned from :func:`tokenize.tokenize`.

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`.tokenize` generator requires one argument, *readline*, which
   must be a callable object which provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple
   item) is the *physical* line.  The 5-tuple is returned as a
   :term:`named tuple` with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`~token.OP` tokens.  For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`.tokenize` determines the source encoding of the file by looking for
   a UTF-8 BOM or encoding cookie, according to :pep:`263`.
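
For example, a short sketch that tokenizes a small in-memory source (the
``source`` bytes below are purely illustrative) and prints the exact operator
name of each :data:`~token.OP` token::

    import io
    import token
    import tokenize

    source = b"coords = (1, 2)\n"
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        if tok.type == token.OP:
            # tok.type is the generic OP; tok.exact_type names the
            # specific operator, e.g. EQUAL, LPAR, COMMA or RPAR.
            print(tok.string, token.tok_name[tok.exact_type])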

.. function:: generate_tokens(readline)

   Tokenize a source reading unicode strings instead of bytes.

   Like :func:`.tokenize`, the *readline* argument is a callable returning
   a single line of input.  However, :func:`generate_tokens` expects *readline*
   to return a str object rather than bytes.

   The result is an iterator yielding named tuples, exactly like
   :func:`.tokenize`.  It does not yield an :data:`~token.ENCODING` token.

All constants from the :mod:`token` module are also exported from
:mod:`tokenize`.

Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string as the spacing between tokens (column
   positions) may change.

   It returns bytes, encoded using the :data:`~token.ENCODING` token, which
   is the first token sequence output by :func:`.tokenize`.  If there is no
   encoding token in the input, it returns a str instead.
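
As a rough sketch (the variable names here are only illustrative), a token
stream produced by :func:`.tokenize` can be passed straight back to
:func:`untokenize`; because the stream begins with an :data:`~token.ENCODING`
token, the result is bytes::

    import io
    import tokenize

    source = b"total = 1 + 2\n"
    tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
    rebuilt = tokenize.untokenize(tokens)
    # rebuilt is bytes and tokenizes back to the same token types and
    # strings, although the spacing between tokens is not guaranteed.
    print(rebuilt.decode('utf-8'))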

:func:`.tokenize` needs to detect the encoding of source files it tokenizes.
The function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file.  It requires one argument,
   readline, in the same way as the :func:`.tokenize` generator.

   It will call readline a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`.  If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.  Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as an encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`.open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.


.. function:: open(filename)

   Open a file in read only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2
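
A minimal sketch of the two helpers together (``'hello.py'`` is just the
example file name used elsewhere in this section)::

    import tokenize

    # Report the encoding that would be used to decode the file ...
    with open('hello.py', 'rb') as f:
        encoding, first_lines = tokenize.detect_encoding(f.readline)
    print(encoding)        # e.g. 'utf-8'

    # ... and open it as text using that detected encoding.
    with tokenize.open('hello.py') as f:
        print(f.readline())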

.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

Note that unclosed single-quoted strings do not cause an error to be
raised.  They are tokenized as :data:`~token.ERRORTOKEN`, followed by the
tokenization of their contents.
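
As an illustrative sketch, iterating over such an unterminated source raises
:exc:`TokenError` while the stream is being consumed (the source bytes below
are made up, and the exact message text may vary between Python versions)::

    import io
    import tokenize

    source = b"values = [1,\n          2,\n          3\n"   # bracket never closed
    try:
        list(tokenize.tokenize(io.BytesIO(source).readline))
    except tokenize.TokenError as err:
        print(err)   # e.g. ('EOF in multi-line statement', (4, 0))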

.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. cmdoption:: -h, --help

   show this help message and exit

.. cmdoption:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified, its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: shell-session

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the :option:`-e` option:

.. code-block:: shell-session

    $ python -m tokenize -e hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          LPAR           '('
    1,14-1,15:          RPAR           ')'
    1,15-1,16:          COLON          ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           LPAR           '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          RPAR           ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           LPAR           '('
    4,10-4,11:          RPAR           ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

Example of tokenizing a file programmatically, reading unicode
strings instead of bytes with :func:`generate_tokens`::

    import tokenize

    with tokenize.open('hello.py') as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            print(token)

Or reading bytes directly with :func:`.tokenize`::

    import tokenize

    with open('hello.py', 'rb') as f:
        tokens = tokenize.tokenize(f.readline)
        for token in tokens:
            print(token)
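
Since comments are returned as :data:`~token.COMMENT` tokens, the same
approach can be used to collect them.  A small sketch (``list_comments`` is a
hypothetical helper, and ``'hello.py'`` is the example file from above)::

    import tokenize

    def list_comments(path):
        """Return (line number, text) pairs for every comment in *path*."""
        with tokenize.open(path) as f:
            return [(tok.start[0], tok.string)
                    for tok in tokenize.generate_tokens(f.readline)
                    if tok.type == tokenize.COMMENT]

    for lineno, text in list_comments('hello.py'):
        print(lineno, text)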