:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python. The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operators` and :ref:`delimiters`
tokens are returned using the generic :data:`token.OP` token type. The exact
type can be determined by checking the ``exact_type`` property on the
:term:`named tuple` returned from :func:`tokenize.tokenize`.

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`.tokenize` generator requires one argument, *readline*, which
   must be a callable object which provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects. Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found. The line passed (the last tuple
   item) is the *physical* line. The 5-tuple is returned as a
   :term:`named tuple` with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`token.OP` tokens. For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`.tokenize` determines the source encoding of the file by looking
   for a UTF-8 BOM or encoding cookie, according to :pep:`263`.


All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are three additional token type values:

.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline. The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.


.. data:: ENCODING

   Token value that indicates the encoding used to decode the source bytes
   into text. The first token returned by :func:`.tokenize` will always be an
   ENCODING token.

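
For example, the token stream for a small piece of source held in memory can
be examined by wrapping the bytes in :class:`io.BytesIO` and passing its
``readline`` method to :func:`.tokenize`. The snippet below is only a sketch;
the source line is arbitrary and the exact tokens emitted may differ slightly
between Python versions::

    import io
    import token
    import tokenize

    source = b"answer = 6 * 7  # a comment\n"

    # tokenize() expects a readline callable returning bytes, so wrap the
    # source in a BytesIO object and pass its readline method.
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        # Each token is a named tuple: type, string, start, end, line.
        print(token.tok_name[tok.type], repr(tok.string))
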

Another function is provided to reverse the tokenization process. This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code. The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string. The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured. The guarantee applies only to the
   token type and token string, as the spacing between tokens (column
   positions) may change.

   When the token sequence includes an ENCODING token, which is the first
   token output by :func:`.tokenize`, the result is instead returned as
   bytes, encoded using that encoding.


:func:`.tokenize` needs to detect the encoding of source files it tokenizes.
The function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file. It requires one argument,
   *readline*, in the same way as the :func:`.tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised. Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as an encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`.open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.

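
   For example, the encoding of a source file can be checked without fully
   tokenizing it. This is only a sketch; ``hello.py`` stands in for any
   Python source file and must be opened in binary mode so that *readline*
   returns bytes::

      import tokenize

      with open('hello.py', 'rb') as f:
          # detect_encoding() reads at most two lines and returns the
          # encoding name along with the raw (undecoded) lines it read.
          encoding, lines = tokenize.detect_encoding(f.readline)

      print(encoding)   # for instance 'utf-8'
      print(lines)      # the lines read while looking for the encoding
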

.. function:: open(filename)

   Open a file in read-only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2

.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

Note that unclosed single-quoted strings do not cause an error to be
raised. They are tokenized as ``ERRORTOKEN``, followed by the tokenization of
their contents.


.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. cmdoption:: -h, --help

   show this help message and exit

.. cmdoption:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified, its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line. The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: sh

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the ``-e`` option:

.. code-block:: sh

    $ python -m tokenize -e hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          LPAR           '('
    1,14-1,15:          RPAR           ')'
    1,15-1,16:          COLON          ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           LPAR           '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          RPAR           ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           LPAR           '('
    4,10-4,11:          RPAR           ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

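
Example of listing the exact token names programmatically rather than from the
command line. The following sketch assumes the same ``hello.py`` file as above
and uses the ``exact_type`` property together with the :data:`token.tok_name`
mapping from the :mod:`token` module::

    import token
    import tokenize

    with open('hello.py', 'rb') as f:
        for tok in tokenize.tokenize(f.readline):
            # For OP tokens, exact_type names the specific operator or
            # delimiter (LPAR, RPAR, COLON, ...); for all other tokens it
            # is the same as tok.type.
            name = token.tok_name[tok.exact_type]
            print(tok.start, tok.end, name, repr(tok.string))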