:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.

.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>

**Source code:** :source:`Lib/shlex.py`

--------------

The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell. This will often be useful
for writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

The :mod:`shlex` module defines the following functions:


.. function:: split(s, comments=False, posix=True)

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string). This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.


.. function:: quote(s)

   Return a shell-escaped version of the string *s*. The returned value is a
   string that can safely be used as one token in a shell command line, for
   cases where you cannot use a list.

   This idiom would be unsafe:

      >>> filename = 'somefile; rm -rf ~'
      >>> command = 'ls -l {}'.format(filename)
      >>> print(command)  # executed by a shell: boom!
      ls -l somefile; rm -rf ~

   :func:`quote` lets you plug the security hole:

      >>> from shlex import quote
      >>> command = 'ls -l {}'.format(quote(filename))
      >>> print(command)
      ls -l 'somefile; rm -rf ~'
      >>> remote_command = 'ssh home {}'.format(quote(command))
      >>> print(remote_command)
      ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

   The quoting is compatible with UNIX shells and with :func:`split`:

      >>> from shlex import split
      >>> remote_command = split(remote_command)
      >>> remote_command
      ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
      >>> command = split(remote_command[-1])
      >>> command
      ['ls', '-l', 'somefile; rm -rf ~']

   .. versionadded:: 3.3

The :mod:`shlex` module defines the following class:


.. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object. The initialization argument, if present, specifies where to read
   characters from. It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string. If no argument is given, input will be taken from ``sys.stdin``.
   The second optional argument is a filename string, which sets the initial
   value of the :attr:`~shlex.infile` attribute. If the *instream*
   argument is omitted or equal to ``sys.stdin``, this second argument
   defaults to "stdin". The *posix* argument defines the operational mode:
   when *posix* is not true (default), the :class:`~shlex.shlex` instance will
   operate in compatibility mode. When operating in POSIX mode,
   :class:`~shlex.shlex` will try to be as close as possible to the POSIX shell
   parsing rules. The *punctuation_chars* argument provides a way to make the
   behaviour even closer to how real shells parse.
   This can take a number of
   values: the default value, ``False``, preserves the behaviour seen under
   Python 3.5 and earlier. If set to ``True``, then parsing of the characters
   ``();<>|&`` is changed: any run of these characters (considered punctuation
   characters) is returned as a single token. If set to a non-empty string of
   characters, those characters will be used as the punctuation characters. Any
   characters in the :attr:`wordchars` attribute that appear in
   *punctuation_chars* will be removed from :attr:`wordchars`. See
   :ref:`improved-shell-compatibility` for more information.

   .. versionchanged:: 3.6
      The *punctuation_chars* parameter was added.

.. seealso::

   Module :mod:`configparser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token. If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack. Otherwise, read one from the input stream. If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.


.. method:: shlex.read_token()

   Read a raw token. Ignore the pushback stack, and do not interpret source
   requests. (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below) this method is given the following token as argument, and expected
   to return a tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument. If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone. Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.


.. method:: shlex.push_source(newstream, newfile=None)

   Push an input source stream onto the input stack. If the filename argument is
   specified it will later be available for use in error messages. This is the
   same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.


.. method:: shlex.error_leader(infile=None, lineno=None)

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens. By
   default, includes all ASCII alphanumerics and underscore. In POSIX mode, the
   accented characters in the Latin-1 set are also included. If
   :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
   appear in filename specifications and command line parameters, will also be
   included in this attribute, and any characters which appear in
   ``punctuation_chars`` will be removed from ``wordchars`` if they are present
   there.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped. Whitespace bounds
   tokens. By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be considered as escape. This will be only used in POSIX
   mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes. The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell). By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`. This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments. If this attribute is ``True``,
   :attr:`punctuation_chars` will have no effect, and splitting will happen
   only on whitespace. When using :attr:`punctuation_chars`, which is
   intended to provide parsing closer to that implemented by shells, it is
   advisable to leave ``whitespace_split`` as ``False`` (the default value).


.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests. It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default. If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells. That is, the immediately following token
   will be opened as a filename and input will be taken from that stream until
   EOF, at which point the :meth:`~io.IOBase.close` method of that stream will be
   called and the input source will again become the original input stream.
   Source requests may be stacked any number of levels deep.


.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior. If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer. It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``), in non-POSIX mode, and to ``None`` in POSIX mode.


.. attribute:: shlex.punctuation_chars

   Characters that will be considered punctuation. Runs of punctuation
   characters will be returned as a single token. However, note that no
   semantic validity checking will be performed: for example, ``>>>`` could be
   returned as a token, even though it may not be recognised as such by shells.

   .. versionadded:: 3.6


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token.
  If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`. The escape characters retain their
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character.

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

.. _improved-shell-compatibility:

Improved Compatibility with Shells
----------------------------------

.. versionadded:: 3.6

The :class:`shlex` class provides compatibility with the parsing performed by
common Unix shells like ``bash``, ``dash``, and ``sh``. To take advantage of
this compatibility, specify the ``punctuation_chars`` argument in the
constructor. This defaults to ``False``, which preserves pre-3.6 behaviour.
However, if it is set to ``True``, then parsing of the characters ``();<>|&``
is changed: any run of these characters is returned as a single token.
While
this is short of a full parser for shells (which would be out of scope for the
standard library, given the multiplicity of shells out there), it does allow
you to perform processing of command lines more easily than you could
otherwise. To illustrate, you can see the difference in the following snippet:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> import shlex
   >>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
   >>> list(shlex.shlex(text))
   ['a', '&', '&', 'b', ';', 'c', '&', '&', 'd', '|', '|', 'e', ';', 'f', '>',
   "'abc'", ';', '(', 'def', '"ghi"', ')']
   >>> list(shlex.shlex(text, punctuation_chars=True))
   ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', "'abc'",
   ';', '(', 'def', '"ghi"', ')']

Of course, tokens will be returned which are not valid for shells, and you'll
need to implement your own error checks on the returned tokens.

Instead of passing ``True`` as the value for the punctuation_chars parameter,
you can pass a string with specific characters, which will be used to determine
which characters constitute punctuation. For example::

    >>> import shlex
    >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
    >>> list(s)
    ['a', '&', '&', 'b', '||', 'c']

.. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
   attribute is augmented with the characters ``~-./*?=``. That is because these
   characters can appear in file names (including wildcards) and command-line
   arguments (e.g. ``--color=auto``). Hence::

      >>> import shlex
      >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
      ...                 punctuation_chars=True)
      >>> list(s)
      ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']

For best effect, ``punctuation_chars`` should be set in conjunction with
``posix=True``. (Note that ``posix=False`` is the default for
:class:`~shlex.shlex`.)
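
As a minimal sketch of that recommendation, the snippet below combines
``posix=True`` with ``punctuation_chars=True`` on the same sample text used
earlier in this section. The ``tokenize`` helper is a hypothetical name
introduced only for illustration; it is not part of the :mod:`shlex` module.

```python
import shlex


def tokenize(command_line):
    """Hypothetical helper: tokenize with POSIX rules and punctuation grouping."""
    # posix=True strips quotes from quoted words; punctuation_chars=True
    # returns runs of the characters ();<>|& as single tokens.
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    # Iterating over the lexer yields tokens until end of input.
    return list(lexer)


tokens = tokenize("a && b; c && d || e; f >'abc'; (def \"ghi\")")
print(tokens)
```

Compared with the non-POSIX results shown earlier, the operators are still
grouped (``&&``, ``||``), but the quoted words come back as ``abc`` and
``ghi`` with their quotes removed, which is usually what is wanted when the
tokens are passed on as command arguments.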