:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers", including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operator <operators>` and
:ref:`delimiter <delimiters>` tokens and :data:`Ellipsis` are returned using
the generic :data:`~token.OP` token type.  The exact type can be determined by
checking the ``exact_type`` property on the :term:`named tuple` returned from
:func:`tokenize.tokenize`.

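For instance, a minimal sketch (an illustration added here, not one of the
module's own examples) that distinguishes the generic and exact types of an
operator token::

    import token
    import tokenize
    from io import BytesIO

    source = b"1 + 2\n"
    for tok in tokenize.tokenize(BytesIO(source).readline):
        if tok.type == token.OP:
            # tok.type is the generic OP; tok.exact_type names the operator
            print(token.tok_name[tok.type], token.tok_name[tok.exact_type])
    # prints: OP PLUS
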
Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`.tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple
   item) is the *physical* line.  The 5-tuple is returned as a
   :term:`named tuple` with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`~token.OP` tokens.  For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`.tokenize` determines the source encoding of the file by looking
   for a UTF-8 BOM or encoding cookie, according to :pep:`263`.

.. function:: generate_tokens(readline)

   Tokenize a source reading unicode strings instead of bytes.

   Like :func:`.tokenize`, the *readline* argument is a callable returning
   a single line of input.  However, :func:`generate_tokens` expects
   *readline* to return a str object rather than bytes.

   The result is an iterator yielding named tuples, exactly like
   :func:`.tokenize`.  It does not yield an :data:`~token.ENCODING` token.

All constants from the :mod:`token` module are also exported from
:mod:`tokenize`.

Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string, as the spacing between tokens (column
   positions) may change.

   It returns bytes, encoded using the :data:`~token.ENCODING` token, which
   is the first token sequence output by :func:`.tokenize`.  If there is no
   encoding token in the input, it returns a str instead.


:func:`.tokenize` needs to detect the encoding of source files it tokenizes.
The function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file.  It requires one argument,
   *readline*, in the same way as the :func:`.tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`.  If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.  Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as an encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`.open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.

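   For example, a minimal sketch (an illustration added here, not one of the
   module's documented examples) of detecting the encoding of an in-memory
   source buffer::

      from io import BytesIO
      from tokenize import detect_encoding

      source = b'# -*- coding: utf-8 -*-\nspam = 1\n'
      encoding, first_lines = detect_encoding(BytesIO(source).readline)
      print(encoding)     # 'utf-8'
      print(first_lines)  # the raw, still-undecoded line(s) read so far
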
.. function:: open(filename)

   Open a file in read-only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2

.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

Note that unclosed single-quoted strings do not cause an error to be
raised.  They are tokenized as :data:`~token.ERRORTOKEN`, followed by the
tokenization of their contents.


.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. cmdoption:: -h, --help

   show this help message and exit

.. cmdoption:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any).

.. code-block:: shell-session

   $ python -m tokenize hello.py
   0,0-0,0:            ENCODING       'utf-8'
   1,0-1,3:            NAME           'def'
   1,4-1,13:           NAME           'say_hello'
   1,13-1,14:          OP             '('
   1,14-1,15:          OP             ')'
   1,15-1,16:          OP             ':'
   1,16-1,17:          NEWLINE        '\n'
   2,0-2,4:            INDENT         '    '
   2,4-2,9:            NAME           'print'
   2,9-2,10:           OP             '('
   2,10-2,25:          STRING         '"Hello, World!"'
   2,25-2,26:          OP             ')'
   2,26-2,27:          NEWLINE        '\n'
   3,0-3,1:            NL             '\n'
   4,0-4,0:            DEDENT         ''
   4,0-4,9:            NAME           'say_hello'
   4,9-4,10:           OP             '('
   4,10-4,11:          OP             ')'
   4,11-4,12:          NEWLINE        '\n'
   5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the :option:`-e` option:

.. code-block:: shell-session

   $ python -m tokenize -e hello.py
   0,0-0,0:            ENCODING       'utf-8'
   1,0-1,3:            NAME           'def'
   1,4-1,13:           NAME           'say_hello'
   1,13-1,14:          LPAR           '('
   1,14-1,15:          RPAR           ')'
   1,15-1,16:          COLON          ':'
   1,16-1,17:          NEWLINE        '\n'
   2,0-2,4:            INDENT         '    '
   2,4-2,9:            NAME           'print'
   2,9-2,10:           LPAR           '('
   2,10-2,25:          STRING         '"Hello, World!"'
   2,25-2,26:          RPAR           ')'
   2,26-2,27:          NEWLINE        '\n'
   3,0-3,1:            NL             '\n'
   4,0-4,0:            DEDENT         ''
   4,0-4,9:            NAME           'say_hello'
   4,9-4,10:           LPAR           '('
   4,10-4,11:          RPAR           ')'
   4,11-4,12:          NEWLINE        '\n'
   5,0-5,0:            ENDMARKER      ''

Example of tokenizing a file programmatically, reading unicode
strings instead of bytes with :func:`generate_tokens`::

    import tokenize

    with tokenize.open('hello.py') as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            print(token)

Or reading bytes directly with :func:`.tokenize`::

    import tokenize

    with open('hello.py', 'rb') as f:
        tokens = tokenize.tokenize(f.readline)
        for token in tokens:
            print(token)

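As a further illustration (a sketch added here, not one of the module's own
examples), a token stream produced by :func:`.tokenize` can be fed straight
back to :func:`untokenize`::

    import tokenize
    from io import BytesIO

    source = b"x = {'a': 1}\n"
    tokens = list(tokenize.tokenize(BytesIO(source).readline))
    regenerated = tokenize.untokenize(tokens)
    # regenerated is bytes (the stream began with an ENCODING token); it
    # tokenizes back to the same (type, string) pairs, though the spacing
    # between tokens is not guaranteed to match the original exactly.
    print(regenerated.decode('utf-8'))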