:mod:`!tokenize` --- Tokenizer for Python source
================================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers", including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operator <operators>` and
:ref:`delimiter <delimiters>` tokens and :data:`Ellipsis` are returned using
the generic :data:`~token.OP` token type.  The exact
type can be determined by checking the ``exact_type`` property on the
:term:`named tuple` returned from :func:`tokenize.tokenize`.

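For example, a short sketch along these lines (the expression being tokenized
is purely illustrative) reports the exact type of each ``OP`` token::

    import io
    import token
    import tokenize

    source = b"x = (1 + 2) * 3\n"
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        if tok.type == token.OP:
            # tok.type is always the generic OP here; tok.exact_type
            # distinguishes EQUAL, LPAR, PLUS, RPAR and STAR.
            print(tok.string, token.tok_name[tok.exact_type])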

.. warning::

   Note that the functions in this module are only designed to parse
   syntactically valid Python code (code that does not raise when parsed
   using :func:`ast.parse`).  The behavior of the functions in this module is
   **undefined** when providing invalid Python code and it can change at any
   point.

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`.tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found. The line passed (the last tuple item)
   is the *physical* line.  The 5-tuple is returned as a :term:`named tuple`
   with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`~token.OP` tokens.  For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`.tokenize` determines the source encoding of the file by looking for a
   UTF-8 BOM or encoding cookie, according to :pep:`263`.
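
   For example, a minimal sketch (the source line below is invented) that
   prints the position information carried by each token::

      import io
      import tokenize

      source = b"answer = 42\n"
      for tok in tokenize.tokenize(io.BytesIO(source).readline):
          # start and end are (row, column) pairs; line is the physical
          # line the token was found on.
          print(tok.start, tok.end, repr(tok.string), repr(tok.line))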

.. function:: generate_tokens(readline)

   Tokenize a source reading unicode strings instead of bytes.

   Like :func:`.tokenize`, the *readline* argument is a callable returning
   a single line of input. However, :func:`generate_tokens` expects *readline*
   to return a str object rather than bytes.

   The result is an iterator yielding named tuples, exactly like
   :func:`.tokenize`. It does not yield an :data:`~token.ENCODING` token.

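   For example, a minimal sketch (the sample string is invented) that feeds an
   in-memory ``str`` to :func:`generate_tokens`::

      import io
      import tokenize

      source = "greeting = 'hello'\n"
      for tok in tokenize.generate_tokens(io.StringIO(source).readline):
          # Unlike tokenize(), the stream starts directly with the first
          # token of the source; there is no ENCODING token.
          print(tok)
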
All constants from the :mod:`token` module are also exported from
:mod:`tokenize`.

Another function is provided to reverse the tokenization process. This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

    Converts tokens back into Python source code.  The *iterable* must return
    sequences with at least two elements, the token type and the token string.
    Any additional sequence elements are ignored.

    The reconstructed script is returned as a single string.  The result is
    guaranteed to tokenize back to match the input so that the conversion is
    lossless and round-trips are assured.  The guarantee applies only to the
    token type and token string as the spacing between tokens (column
    positions) may change.

    It returns bytes, encoded using the :data:`~token.ENCODING` token, which
    is the first token sequence output by :func:`.tokenize`. If there is no
    encoding token in the input, it returns a str instead.

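    For example, a rough sketch of the round trip (the input string below is
    only an example)::

        import io
        import tokenize

        source = b"x = 1 + 2\n"
        toks = list(tokenize.tokenize(io.BytesIO(source).readline))
        new_source = tokenize.untokenize(toks)
        # new_source is bytes because the token stream begins with an
        # ENCODING token; its spacing is not guaranteed to match the
        # input, but tokenizing it again yields the same token types
        # and strings.
        print(new_source)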

:func:`.tokenize` needs to detect the encoding of source files it tokenizes. The
function it uses to do this is available:

.. function:: detect_encoding(readline)

    The :func:`detect_encoding` function is used to detect the encoding that
    should be used to decode a Python source file. It requires one argument,
    *readline*, in the same way as the :func:`.tokenize` generator.

    It will call *readline* a maximum of twice, and return the encoding used
    (as a string) and a list of any lines (not decoded from bytes) it has read
    in.

    It detects the encoding from the presence of a UTF-8 BOM or an encoding
    cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
    but disagree, a :exc:`SyntaxError` will be raised. Note that if the BOM is found,
    ``'utf-8-sig'`` will be returned as an encoding.

    If no encoding is specified, then the default of ``'utf-8'`` will be
    returned.

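    For example, a small sketch along these lines (the buffer contents are
    invented) inspects an in-memory source buffer::

        import io
        import tokenize

        buf = io.BytesIO(b"# -*- coding: utf-8 -*-\nprint('hi')\n")
        encoding, lines = tokenize.detect_encoding(buf.readline)
        # encoding is a string such as 'utf-8'; lines holds the raw,
        # undecoded line(s) consumed while detecting it.
        print(encoding, lines)
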
    Use :func:`.open` to open Python source files: it uses
    :func:`detect_encoding` to detect the file encoding.


.. function:: open(filename)

   Open a file in read-only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2

.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. option:: -h, --help

   show this help message and exit

.. option:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified, its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s)  #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output, where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: shell-session

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the :option:`-e` option:

.. code-block:: shell-session

    $ python -m tokenize -e hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          LPAR           '('
    1,14-1,15:          RPAR           ')'
    1,15-1,16:          COLON          ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           LPAR           '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          RPAR           ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           LPAR           '('
    4,10-4,11:          RPAR           ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

Example of tokenizing a file programmatically, reading unicode
strings instead of bytes with :func:`generate_tokens`::

    import tokenize

    with tokenize.open('hello.py') as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            print(token)

Or reading bytes directly with :func:`.tokenize`::

    import tokenize

    with open('hello.py', 'rb') as f:
        tokens = tokenize.tokenize(f.readline)
        for token in tokens:
            print(token)