:mod:`!tokenize` --- Tokenizer for Python source
================================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers", including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operator <operators>` and
:ref:`delimiter <delimiters>` tokens and :data:`Ellipsis` are returned using
the generic :data:`~token.OP` token type.  The exact type can be determined by
checking the ``exact_type`` property on the :term:`named tuple` returned from
:func:`tokenize.tokenize`.


.. warning::

   Note that the functions in this module are only designed to parse
   syntactically valid Python code (code that does not raise when parsed
   using :func:`ast.parse`).  The behavior of the functions in this module is
   **undefined** when providing invalid Python code and it can change at any
   point.

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`.tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple
   item) is the *physical* line.  The 5-tuple is returned as a
   :term:`named tuple` with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`~token.OP` tokens.  For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`.tokenize` determines the source encoding of the file by looking
   for a UTF-8 BOM or encoding cookie, according to :pep:`263`.

.. function:: generate_tokens(readline)

   Tokenize a source reading unicode strings instead of bytes.

   Like :func:`.tokenize`, the *readline* argument is a callable returning
   a single line of input.  However, :func:`generate_tokens` expects *readline*
   to return a str object rather than bytes.

   The result is an iterator yielding named tuples, exactly like
   :func:`.tokenize`.  It does not yield an :data:`~token.ENCODING` token.

All constants from the :mod:`token` module are also exported from
:mod:`tokenize`.
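For example, the ``exact_type`` property distinguishes the individual
operators and delimiters that are all reported with the generic
:data:`~token.OP` type.  A minimal sketch, with a purely illustrative
expression as input::

    import io
    import tokenize
    from token import tok_name

    # generate_tokens() expects a readline callable that returns str.
    source = "total = n + 1\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # tok.type is OP for both '=' and '+'; tok.exact_type tells them
        # apart (EQUAL and PLUS, respectively).
        print(tok_name[tok.type], tok_name[tok.exact_type], repr(tok.string))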
Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string as the spacing between tokens (column
   positions) may change.

   It returns bytes, encoded using the :data:`~token.ENCODING` token, which
   is the first token sequence output by :func:`.tokenize`.  If there is no
   encoding token in the input, it returns a str instead.


:func:`.tokenize` needs to detect the encoding of source files it tokenizes.
The function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file.  It requires one argument,
   *readline*, in the same way as the :func:`.tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`.  If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.  Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as an encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`.open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.


.. function:: open(filename)

   Open a file in read-only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2
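Both :func:`.open` and :func:`.tokenize` rely on :func:`detect_encoding`; the
following is a small sketch of calling it directly, where the in-memory source
and its encoding cookie are purely illustrative::

    import io
    import tokenize

    # A source fragment declaring its encoding with a PEP 263 cookie.
    source = b"# -*- coding: utf-8 -*-\nspam = 'eggs'\n"
    encoding, lines = tokenize.detect_encoding(io.BytesIO(source).readline)

    print(encoding)  # 'utf-8'
    print(lines)     # the undecoded line(s) read while detecting the encoding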
.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. option:: -h, --help

   show this help message and exit

.. option:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output, where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: shell-session

   $ python -m tokenize hello.py
   0,0-0,0:            ENCODING       'utf-8'
   1,0-1,3:            NAME           'def'
   1,4-1,13:           NAME           'say_hello'
   1,13-1,14:          OP             '('
   1,14-1,15:          OP             ')'
   1,15-1,16:          OP             ':'
   1,16-1,17:          NEWLINE        '\n'
   2,0-2,4:            INDENT         '    '
   2,4-2,9:            NAME           'print'
   2,9-2,10:           OP             '('
   2,10-2,25:          STRING         '"Hello, World!"'
   2,25-2,26:          OP             ')'
   2,26-2,27:          NEWLINE        '\n'
   3,0-3,1:            NL             '\n'
   4,0-4,0:            DEDENT         ''
   4,0-4,9:            NAME           'say_hello'
   4,9-4,10:           OP             '('
   4,10-4,11:          OP             ')'
   4,11-4,12:          NEWLINE        '\n'
   5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the :option:`-e` option:

.. code-block:: shell-session

   $ python -m tokenize -e hello.py
   0,0-0,0:            ENCODING       'utf-8'
   1,0-1,3:            NAME           'def'
   1,4-1,13:           NAME           'say_hello'
   1,13-1,14:          LPAR           '('
   1,14-1,15:          RPAR           ')'
   1,15-1,16:          COLON          ':'
   1,16-1,17:          NEWLINE        '\n'
   2,0-2,4:            INDENT         '    '
   2,4-2,9:            NAME           'print'
   2,9-2,10:           LPAR           '('
   2,10-2,25:          STRING         '"Hello, World!"'
   2,25-2,26:          RPAR           ')'
   2,26-2,27:          NEWLINE        '\n'
   3,0-3,1:            NL             '\n'
   4,0-4,0:            DEDENT         ''
   4,0-4,9:            NAME           'say_hello'
   4,9-4,10:           LPAR           '('
   4,10-4,11:          RPAR           ')'
   4,11-4,12:          NEWLINE        '\n'
   5,0-5,0:            ENDMARKER      ''

Example of tokenizing a file programmatically, reading unicode
strings instead of bytes with :func:`generate_tokens`::

    import tokenize

    with tokenize.open('hello.py') as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            print(token)

Or reading bytes directly with :func:`.tokenize`::

    import tokenize

    with open('hello.py', 'rb') as f:
        tokens = tokenize.tokenize(f.readline)
        for token in tokens:
            print(token)
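Finally, a sketch of a round trip through :func:`untokenize`, assuming the
same ``hello.py`` as in the examples above::

    import tokenize

    with open('hello.py', 'rb') as f:
        tokens = list(tokenize.tokenize(f.readline))

    # The stream starts with an ENCODING token, so untokenize() returns
    # bytes; decode them with the reported encoding to get a str back.
    source = tokenize.untokenize(tokens)
    print(source.decode(tokens[0].string))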