'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

import codecs
import io
import pickle
import re
import sys

__all__ = ['dis', 'genops', 'optimize']

bytes_types = pickle.bytes_types

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness. dis() does a lot of this already.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


# "A pickle" is a program for a virtual pickle machine (PM, but more accurately
# called an unpickling machine). It's a sequence of opcodes, interpreted by the
# PM, building an arbitrarily complex Python object.
#
# For the most part, the PM is very simple: there are no looping, testing, or
# conditional instructions, no arithmetic and no function calls. Opcodes are
# executed once each, from first to last, until a STOP opcode is reached.
#
# The PM has two data areas, "the stack" and "the memo".
#
# Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
# integer object on the stack, whose value is gotten from a decimal string
# literal immediately following the INT opcode in the pickle bytestream. Other
# opcodes take Python objects off the stack.
# The result of unpickling is whatever object is left on the stack when the
# final STOP opcode is executed.
#
# The memo is simply an array of objects, or it can be implemented as a dict
# mapping little integers to objects. The memo serves as the PM's "long term
# memory", and the little integers indexing the memo are akin to variable
# names. Some opcodes store the object on top of the stack into the memo at
# a given index (without popping it), and others push a memo object at a
# given index onto the stack again.
#
# At heart, that's all the PM has. Subtleties arise for these reasons:
#
# + Object identity. Objects can be arbitrarily complex, and subobjects
#   may be shared (for example, the list [a, a] refers to the same object a
#   twice). It can be vital that unpickling recreate an isomorphic object
#   graph, faithfully reproducing sharing.
#
# + Recursive objects. For example, after "L = []; L.append(L)", L is a
#   list, and L[0] is the same list. This is related to the object identity
#   point, and some sequences of pickle opcodes are subtle in order to
#   get the right result in all cases.
#
# + Things pickle doesn't know everything about. Examples of things pickle
#   does know everything about are Python's builtin scalar and container
#   types, like ints and tuples. They generally have opcodes dedicated to
#   them. For things like module references and instances of user-defined
#   classes, pickle's knowledge is limited. Historically, many enhancements
#   have been made to the pickle protocol in order to do a better (faster,
#   and/or more compact) job on those.
#
# + Backward compatibility and micro-optimization. As explained below,
#   pickle opcodes never go away, not even when better ways to do a thing
#   get invented. The repertoire of the PM just keeps growing over time.
#   For example, protocol 0 had two opcodes for building Python integers (INT
#   and LONG), protocol 1 added three more for more-efficient pickling of short
#   integers, and protocol 2 added two more for more-efficient pickling of
#   long integers (before protocol 2, the only ways to pickle a Python long
#   took time quadratic in the number of digits, for both pickling and
#   unpickling). "Opcode bloat" isn't so much a subtlety as a source of
#   wearying complication.
#
#
# Pickle protocols:
#
# For compatibility, the meaning of a pickle opcode never changes. Instead new
# pickle opcodes get added, and each version's unpickler can handle all the
# pickle opcodes in all protocol versions to date. So old pickles continue to
# be readable forever. The pickler can generally be told to restrict itself to
# the subset of opcodes available under previous protocol versions too, so that
# users can create pickles under the current version readable by older
# versions. However, a pickle does not contain its version number embedded
# within it. If an older unpickler tries to read a pickle using a later
# protocol, the result is most likely an exception due to seeing an unknown (in
# the older unpickler) opcode.
#
# The original pickle used what's now called "protocol 0", and what was called
# "text mode" before Python 2.3. The entire pickle bytestream is made up of
# printable 7-bit ASCII characters, plus the newline character, in protocol 0.
# That's why it was called text mode. Protocol 0 is small and elegant, but
# sometimes painfully inefficient.
#
# The second major set of additions is now called "protocol 1", and was called
# "binary mode" before Python 2.3. This added many opcodes with arguments
# consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
# bytes.
# Binary mode pickles can be substantially smaller than equivalent
# text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
# int as 4 bytes following the opcode, which is cheaper to unpickle than the
# (perhaps) 11-character decimal string attached to INT. Protocol 1 also added
# a number of opcodes that operate on many stack elements at once (like APPENDS
# and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
#
# The third major set of additions came in Python 2.3, and is called "protocol
# 2". This added:
#
# - A better way to pickle instances of new-style classes (NEWOBJ).
#
# - A way for a pickle to identify its protocol (PROTO).
#
# - Time- and space-efficient pickling of long ints (LONG{1,4}).
#
# - Shortcuts for small tuples (TUPLE{1,2,3}).
#
# - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
#
# - The "extension registry", a vector of popular objects that can be pushed
#   efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
#   the registry contents are predefined (there's nothing akin to the memo's
#   PUT).
#
# Another independent change with Python 2.3 is the abandonment of any
# pretense that it might be safe to load pickles received from untrusted
# parties -- no sufficient security analysis has been done to guarantee
# this and there isn't a use case that warrants the expense of such an
# analysis.
#
# To this end, all tests for __safe_for_unpickling__ or for
# copyreg.safe_constructors are removed from the unpickling code.
# References to these variables in the descriptions below are to be seen
# as describing unpickling in Python 2.2 and before.


# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed.
# If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int
TAKEN_FROM_ARGUMENT4U = -4  # num bytes is 4-byte unsigned little-endian int
TAKEN_FROM_ARGUMENT8U = -5  # num bytes is 8-byte unsigned little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4,4U,8U} are negative values for
        # variable-length cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4,
                                             TAKEN_FROM_ARGUMENT4U,
                                             TAKEN_FROM_ARGUMENT8U))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import io
    >>> read_uint1(io.BytesIO(b'\xff'))
    255
    """

    data = f.read(1)
    if data:
        return data[0]
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    name='uint1',
    n=1,
    reader=read_uint1,
    doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import io
    >>> read_uint2(io.BytesIO(b'\xff\x00'))
    255
    >>> read_uint2(io.BytesIO(b'\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2',
    n=2,
    reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import io
    >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4',
    n=4,
    reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")


def read_uint4(f):
    r"""
    >>> import io
    >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<I", data)[0]
    raise ValueError("not enough data in stream to read uint4")

uint4 = ArgumentDescriptor(
    name='uint4',
    n=4,
    reader=read_uint4,
    doc="Four-byte unsigned integer, little-endian.")


def read_uint8(f):
    r"""
    >>> import io
    >>> read_uint8(io.BytesIO(b'\xff\x00\x00\x00\x00\x00\x00\x00'))
    255
    >>> read_uint8(io.BytesIO(b'\xff' * 8)) == 2**64-1
    True
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack("<Q", data)[0]
    raise ValueError("not enough data in stream to read uint8")

uint8 = ArgumentDescriptor(
    name='uint8',
    n=8,
    reader=read_uint8,
    doc="Eight-byte unsigned integer, little-endian.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import io
    >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(io.BytesIO(b"\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around b''

    >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
    ''

    >>> read_stringnl(io.BytesIO(b"''\n"))
    ''

    >>> read_stringnl(io.BytesIO(b'"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        # The string must be bracketed by a matching pair of quotes.
        for q in (b'"', b"'"):
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    if decode:
        data = codecs.escape_decode(data)[0].decode("ascii")
    return data

stringnl = ArgumentDescriptor(
    name='stringnl',
    n=UP_TO_NEWLINE,
    reader=read_stringnl,
    doc="""A newline-terminated string.

    This is a repr-style string, with embedded escapes, and
    bracketing quotes.
    """)

def read_stringnl_noescape(f):
    return read_stringnl(f, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

    This is a str-style string, without embedded escapes,
    or bracketing quotes. It should consist solely of
    printable ASCII characters.
    """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import io
    >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes. They should
    consist solely of printable ASCII characters.
    The pair is returned as a single string, with
    a single blank separating the two strings.
    """)


def read_string1(f):
    r"""
    >>> import io
    >>> read_string1(io.BytesIO(b"\x00"))
    ''
    >>> read_string1(io.BytesIO(b"\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_string1,
    doc="""A counted string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many
    bytes.
    """)


def read_string4(f):
    r"""
    >>> import io
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_string4,
    doc="""A counted string.

    The first argument is a 4-byte little-endian signed int giving
    the number of bytes in the string, and the second argument is
    that many bytes.
    """)


def read_bytes1(f):
    r"""
    >>> import io
    >>> read_bytes1(io.BytesIO(b"\x00"))
    b''
    >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
    b'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
                     (n, len(data)))

bytes1 = ArgumentDescriptor(
    name="bytes1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_bytes1,
    doc="""A counted bytes string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes, and the second argument is that many bytes.
    """)


def read_bytes4(f):
    r"""
    >>> import io
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    b'abc'
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
    """

    n = read_uint4(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
                     (n, len(data)))

bytes4 = ArgumentDescriptor(
    name="bytes4",
    n=TAKEN_FROM_ARGUMENT4U,
    reader=read_bytes4,
    doc="""A counted bytes string.

    The first argument is a 4-byte little-endian unsigned int giving
    the number of bytes, and the second argument is that many bytes.
    """)


def read_bytes8(f):
    r"""
    >>> import io, struct, sys
    >>> read_bytes8(io.BytesIO(b"\x00\x00\x00\x00\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes8(io.BytesIO(b"\x03\x00\x00\x00\x00\x00\x00\x00abcdef"))
    b'abc'
    >>> bigsize8 = struct.pack("<Q", sys.maxsize//3)
    >>> read_bytes8(io.BytesIO(bigsize8 + b"abcdef"))  #doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError: expected ... bytes in a bytes8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytes8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes8, but only %d remain" %
                     (n, len(data)))

bytes8 = ArgumentDescriptor(
    name="bytes8",
    n=TAKEN_FROM_ARGUMENT8U,
    reader=read_bytes8,
    doc="""A counted bytes string.

    The first argument is an 8-byte little-endian unsigned int giving
    the number of bytes, and the second argument is that many bytes.
    """)


def read_bytearray8(f):
    r"""
    >>> import io, struct, sys
    >>> read_bytearray8(io.BytesIO(b"\x00\x00\x00\x00\x00\x00\x00\x00abc"))
    bytearray(b'')
    >>> read_bytearray8(io.BytesIO(b"\x03\x00\x00\x00\x00\x00\x00\x00abcdef"))
    bytearray(b'abc')
    >>> bigsize8 = struct.pack("<Q", sys.maxsize//3)
    >>> read_bytearray8(io.BytesIO(bigsize8 + b"abcdef"))  #doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError: expected ... bytes in a bytearray8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytearray8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return bytearray(data)
    raise ValueError("expected %d bytes in a bytearray8, but only %d remain" %
                     (n, len(data)))

bytearray8 = ArgumentDescriptor(
    name="bytearray8",
    n=TAKEN_FROM_ARGUMENT8U,
    reader=read_bytearray8,
    doc="""A counted bytearray.

    The first argument is an 8-byte little-endian unsigned int giving
    the number of bytes, and the second argument is that many bytes.
    """)

def read_unicodestringnl(f):
    r"""
    >>> import io
    >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
    True
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return str(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

    This is raw-unicode-escape encoded, so consists of
    printable ASCII characters, and may contain embedded
    escape sequences.
    """)


def read_unicodestring1(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc)])  # little-endian 1-byte length
    >>> t = read_unicodestring1(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring1(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring1, but only 6 remain
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring1, but only %d "
                     "remain" % (n, len(data)))

unicodestring1 = ArgumentDescriptor(
    name="unicodestring1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_unicodestring1,
    doc="""A counted Unicode string.

    The first argument is a 1-byte unsigned int giving the
    number of bytes in the string, and the second argument
    -- the UTF-8 encoding of the Unicode string -- contains
    that many bytes.
    """)


def read_unicodestring4(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
    >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_uint4(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT4U,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

    The first argument is a 4-byte little-endian unsigned int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_unicodestring8(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc)]) + b'\0' * 7  # little-endian 8-byte length
    >>> t = read_unicodestring8(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring8(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("unicodestring8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring8, but only %d "
                     "remain" % (n, len(data)))

unicodestring8 = ArgumentDescriptor(
    name="unicodestring8",
    n=TAKEN_FROM_ARGUMENT8U,
    reader=read_unicodestring8,
    doc="""A counted Unicode string.

    The first argument is an 8-byte little-endian unsigned int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_decimalnl_short(f):
    r"""
    >>> import io
    >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
    1234

    >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: invalid literal for int() with base 10: b'1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)

    # There's a hack for True and False here.
    if s == b"00":
        return False
    elif s == b"01":
        return True

    return int(s)

def read_decimalnl_long(f):
    r"""
    >>> import io

    >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
    1234

    >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
    123456789012345678901234
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s[-1:] == b'L':
        s = s[:-1]
    return int(s)


decimalnl_short = ArgumentDescriptor(
    name='decimalnl_short',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_short,
    doc="""A newline-terminated decimal integer literal.

    This never has a trailing 'L', and the integer fit
    in a short Python int on the box where the pickle
    was written -- but there's no guarantee it will fit
    in a short Python int on the box where the pickle
    is read.
    """)

decimalnl_long = ArgumentDescriptor(
    name='decimalnl_long',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_long,
    doc="""A newline-terminated decimal integer literal.

    This has a trailing 'L', and can represent integers
    of any size.
    """)


def read_floatnl(f):
    r"""
    >>> import io
    >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
    name='floatnl',
    n=UP_TO_NEWLINE,
    reader=read_floatnl,
    doc="""A newline-terminated decimal floating literal.

    In general this requires 17 significant digits for roundtrip
    identity, and pickling then unpickling infinities, NaNs, and
    minus zero doesn't work across boxes, or on some boxes even
    on itself (e.g., Windows can't read the strings it produces
    for infinities or NaNs).
    """)

def read_float8(f):
    r"""
    >>> import io, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(io.BytesIO(raw + b"\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
    name='float8',
    n=8,
    reader=read_float8,
    doc="""An 8-byte binary representation of a float, big-endian.

    The format is unique to Python, and shared with the struct
    module (format string '>d') "in theory" (the struct and pickle
    implementations don't share the code -- they should). It's
    strongly related to the IEEE-754 double format, and, in normal
    cases, is in fact identical to the big-endian 754 double format.
    On other boxes the dynamic range is limited to that of a 754
    double, and "add a half and chop" rounding is used to reduce
    the precision to 53 bits. However, even on a 754 box,
    infinities, NaNs, and minus zero may not be handled correctly
    (may not survive roundtrip pickling intact).
    """)

# Protocol 2 formats

from pickle import decode_long

def read_long1(f):
    r"""
    >>> import io
    >>> read_long1(io.BytesIO(b"\x00"))
    0
    >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
    255
    >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
    32767
    >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
    -256
    >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
    -32768
    """

    n = read_uint1(f)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    name="long1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_long1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the int 0.
    """)

def read_long4(f):
    r"""
    >>> import io
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
    255
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
    32767
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
    -256
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
    -32768
    >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
    0
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    name="long4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_long4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the int 0, although
    LONG1 should really be used then instead (and in any case where # of
    bytes < 256).
    """)


##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc

    def __repr__(self):
        return self.name


pyint = pylong = StackObject(
    name='int',
    obtype=int,
    doc="A Python integer object.")

pyinteger_or_bool = StackObject(
    name='int_or_bool',
    obtype=(int, bool),
    doc="A Python integer or boolean object.")

pybool = StackObject(
    name='bool',
    obtype=bool,
    doc="A Python boolean object.")

pyfloat = StackObject(
    name='float',
    obtype=float,
    doc="A Python float object.")

pybytes_or_str = pystring = StackObject(
    name='bytes_or_str',
    obtype=(bytes, str),
    doc="A Python bytes or (Unicode) string object.")

pybytes = StackObject(
    name='bytes',
    obtype=bytes,
    doc="A Python bytes object.")

pybytearray = StackObject(
    name='bytearray',
    obtype=bytearray,
    doc="A Python bytearray object.")

pyunicode = StackObject(
    name='str',
    obtype=str,
    doc="A Python (Unicode) string object.")

pynone = StackObject(
    name="None",
    obtype=type(None),
    doc="The Python None object.")

pytuple = StackObject(
    name="tuple",
    obtype=tuple,
    doc="A Python tuple object.")

pylist = StackObject(
    name="list",
    obtype=list,
    doc="A Python list object.")

pydict = StackObject(
    name="dict",
    obtype=dict,
    doc="A Python dict object.")

pyset = StackObject(
    name="set",
    obtype=set,
    doc="A Python set object.")

pyfrozenset = StackObject(
    name="frozenset",
    obtype=frozenset,
    doc="A Python frozenset object.")

pybuffer = StackObject(
    name='buffer',
    obtype=object,
    doc="A Python buffer-like object.")

anyobject = StackObject(
    name='any',
    obtype=object,
    doc="Any kind of object whatsoever.")

markobject = StackObject(
    name="mark",
    obtype=StackObject,
    doc="""'The mark' is a unique object.

Opcodes that operate on a variable number of objects
generally don't embed the count of objects in the opcode,
or pull it off the stack. Instead the MARK opcode is used
to push a special marker object on the stack, and then
some other opcodes grab all the objects from the top of
the stack down to (but not including) the topmost marker
object.
""")

stackslice = StackObject(
    name="stackslice",
    obtype=StackObject,
    doc="""An object representing a contiguous slice of the stack.

This is used in conjunction with markobject, to represent all
of the stack following the topmost markobject. For example,
the POP_MARK opcode changes the stack from

    [..., markobject, stackslice]
to
    [...]

No matter how many objects are on the stack after the topmost
markobject, POP_MARK gets rid of all of them (including the
topmost markobject too).
""")

##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):

    __slots__ = (
        # symbolic name of opcode; a string
        'name',

        # the code used in a bytestream to represent the opcode; a
        # one-character string
        'code',

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type. Note that
        # arg.reader(s) can be used to read and decode the argument from
        # the bytestream s, and arg.doc documents the format of the raw
        # argument bytes. If the opcode doesn't have an argument embedded
        # in the bytestream, arg should be None.
        'arg',

        # what the stack looks like before this opcode runs; a list
        'stack_before',

        # what the stack looks like after this opcode runs; a list
        'stack_after',

        # the protocol number in which this opcode was introduced; an int
        'proto',

        # human-readable docs for this opcode; a string
        'doc',
    )

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(code, str)
        assert len(code) == 1
        self.code = code

        assert arg is None or isinstance(arg, ArgumentDescriptor)
        self.arg = arg

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
        self.proto = proto

        assert isinstance(doc, str)
        self.doc = doc

I = OpcodeInfo
opcodes = [

    # Ways to spell integers.

    I(name='INT',
      code='I',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[pyinteger_or_bool],
      proto=0,
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box. The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
      True gets pickled as INT + "01\\n" (the bytes "I01\\n"), and False as
      INT + "00\\n" (the bytes "I00\\n"). Leading zeroes are never produced
      for a genuine integer. The 2.3 (and later) unpicklers special-case
      these and return bool instead; earlier unpicklers ignore the leading
      "0" and return the int.
      """),

    I(name='BININT',
      code='J',
      arg=int4,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.
      """),

    I(name='BININT1',
      code='K',
      arg=uint1,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,
      in range(256).
      """),

    I(name='BININT2',
      code='M',
      arg=uint2,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16). Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.
      """),

    I(name='LONG',
      code='L',
      arg=decimalnl_long,
      stack_before=[],
      stack_after=[pyint],
      proto=0,
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long. There doesn't seem to be a real purpose
      to the trailing 'L'.

      Note that LONG takes time quadratic in the number of digits when
      unpickling (this is simply due to the nature of decimal->binary
      conversion). Proto 2 added linear-time (in C; still quadratic-time
      in Python) LONG1 and LONG4 opcodes.
      """),

    I(name="LONG1",
      code='\x8a',
      arg=long1,
      stack_before=[],
      stack_after=[pyint],
      proto=2,
      doc="""Long integer using one-byte length.

      A more efficient encoding of a Python long; the long1 encoding
      says it all."""),

    I(name="LONG4",
      code='\x8b',
      arg=long4,
      stack_before=[],
      stack_after=[pyint],
      proto=2,
      doc="""Long integer using four-byte length.

      A more efficient encoding of a Python long; the long4 encoding
      says it all."""),

    # Ways to spell strings (8-bit, not Unicode).

    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pybytes_or_str],
      proto=0,
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes. The argument extends until the next
      newline character.
      These are usually decoded into a str instance using the encoding
      given to the Unpickler constructor, or the default, 'ASCII'. If the
      encoding given was 'bytes', however, they will be decoded as a bytes
      object instead.
      """),

    I(name='BINSTRING',
      code='T',
      arg=string4,
      stack_before=[],
      stack_after=[pybytes_or_str],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 4-byte little-endian
      signed int giving the number of bytes in the string, and the
      second is that many bytes, which are taken literally as the string
      content. These are usually decoded into a str instance using the
      encoding given to the Unpickler constructor, or the default,
      'ASCII'. If the encoding given was 'bytes', however, they will be
      decoded as a bytes object instead.
      """),

    I(name='SHORT_BINSTRING',
      code='U',
      arg=string1,
      stack_before=[],
      stack_after=[pybytes_or_str],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content. These are
      usually decoded into a str instance using the encoding given to
      the Unpickler constructor, or the default, 'ASCII'. If the
      encoding given was 'bytes', however, they will be decoded as a bytes
      object instead.
      """),

    # Bytes (protocol 3 and higher)

    I(name='BINBYTES',
      code='B',
      arg=bytes4,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 4-byte little-endian unsigned int
      giving the number of bytes, and the second is that many bytes, which are
      taken literally as the bytes content.
      """),

    I(name='SHORT_BINBYTES',
      code='C',
      arg=bytes1,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes, and the second is that many bytes, which are taken
      literally as the bytes content.
      """),

    I(name='BINBYTES8',
      code='\x8e',
      arg=bytes8,
      stack_before=[],
      stack_after=[pybytes],
      proto=4,
      doc="""Push a Python bytes object.

      There are two arguments: the first is an 8-byte little-endian unsigned
      int giving the number of bytes, and the second is that many bytes,
      which are taken literally as the bytes content.
      """),

    # Bytearray (protocol 5 and higher)

    I(name='BYTEARRAY8',
      code='\x96',
      arg=bytearray8,
      stack_before=[],
      stack_after=[pybytearray],
      proto=5,
      doc="""Push a Python bytearray object.

      There are two arguments: the first is an 8-byte unsigned int giving
      the number of bytes in the bytearray, and the second is that many bytes,
      which are taken literally as the bytearray content.
      """),

    # Out-of-band buffer (protocol 5 and higher)

    I(name='NEXT_BUFFER',
      code='\x97',
      arg=None,
      stack_before=[],
      stack_after=[pybuffer],
      proto=5,
      doc="Push an out-of-band buffer object."),

    I(name='READONLY_BUFFER',
      code='\x98',
      arg=None,
      stack_before=[pybuffer],
      stack_after=[pybuffer],
      proto=5,
      doc="Make an out-of-band buffer object read-only."),

    # Ways to spell None.

    I(name='NONE',
      code='N',
      arg=None,
      stack_before=[],
      stack_after=[pynone],
      proto=0,
      doc="Push None on the stack."),

    # Ways to spell bools, starting with proto 2. See INT for how this was
    # done before proto 2.

    I(name='NEWTRUE',
      code='\x88',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="Push True onto the stack."),

    I(name='NEWFALSE',
      code='\x89',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="Push False onto the stack."),

    # Ways to spell Unicode strings.

    I(name='UNICODE',
      code='V',
      arg=unicodestringnl,
      stack_before=[],
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences. The argument extends
      until the next newline character.
      """),

    I(name='SHORT_BINUNICODE',
      code='\x8c',
      arg=unicodestring1,
      stack_before=[],
      stack_after=[pyunicode],
      proto=4,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 1-byte unsigned int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    I(name='BINUNICODE',
      code='X',
      arg=unicodestring4,
      stack_before=[],
      stack_after=[pyunicode],
      proto=1,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 4-byte little-endian unsigned int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    I(name='BINUNICODE8',
      code='\x8d',
      arg=unicodestring8,
      stack_before=[],
      stack_after=[pyunicode],
      proto=4,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is an 8-byte little-endian unsigned
      int giving the number of bytes in the string.
      The second is that many bytes, and is the UTF-8 encoding of the
      Unicode string.
      """),

    # Ways to spell floats.

    I(name='FLOAT',
      code='F',
      arg=floatnl,
      stack_before=[],
      stack_after=[pyfloat],
      proto=0,
      doc="""Newline-terminated decimal float literal.

      The argument is repr(a_float), and in general requires 17 significant
      digits for roundtrip conversion to be an identity (this is so for
      IEEE-754 double precision values, which is what Python float maps to
      on most boxes).

      In general, FLOAT cannot be used to transport infinities, NaNs, or
      minus zero across boxes (or even on a single box, if the platform C
      library can't read the strings it produces for such things -- Windows
      is like that), but may do less damage than BINFLOAT on boxes with
      greater precision or dynamic range than IEEE-754 double.
      """),

    I(name='BINFLOAT',
      code='G',
      arg=float8,
      stack_before=[],
      stack_after=[pyfloat],
      proto=1,
      doc="""Float stored in binary form, with 8 bytes of data.

      This generally requires less than half the space of FLOAT encoding.
      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
      minus zero; it raises an exception if the exponent exceeds the range
      of an IEEE-754 double, and retains no more than 53 bits of precision
      (if there are more than that, "add a half and chop" rounding is used
      to cut it back to 53 significant bits).
      """),

    # Ways to build lists.

    I(name='EMPTY_LIST',
      code=']',
      arg=None,
      stack_before=[],
      stack_after=[pylist],
      proto=1,
      doc="Push an empty list."),

    I(name='APPEND',
      code='a',
      arg=None,
      stack_before=[pylist, anyobject],
      stack_after=[pylist],
      proto=0,
      doc="""Append an object to a list.

      Stack before:  ... pylist anyobject
      Stack after:   ...
                         pylist+[anyobject]

      although pylist is really extended in-place.
      """),

    I(name='APPENDS',
      code='e',
      arg=None,
      stack_before=[pylist, markobject, stackslice],
      stack_after=[pylist],
      proto=1,
      doc="""Extend a list by a slice of stack objects.

      Stack before:  ... pylist markobject stackslice
      Stack after:   ... pylist+stackslice

      although pylist is really extended in-place.
      """),

    I(name='LIST',
      code='l',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pylist],
      proto=0,
      doc="""Build a list out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python list, which single list object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before:  ... markobject 1 2 3 'abc'
      Stack after:   ... [1, 2, 3, 'abc']
      """),

    # Ways to build tuples.

    I(name='EMPTY_TUPLE',
      code=')',
      arg=None,
      stack_before=[],
      stack_after=[pytuple],
      proto=1,
      doc="Push an empty tuple."),

    I(name='TUPLE',
      code='t',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pytuple],
      proto=0,
      doc="""Build a tuple out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python tuple, which single tuple object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before:  ... markobject 1 2 3 'abc'
      Stack after:   ... (1, 2, 3, 'abc')
      """),

    I(name='TUPLE1',
      code='\x85',
      arg=None,
      stack_before=[anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a one-tuple out of the topmost item on the stack.

      This code pops one value off the stack and pushes a tuple of
      length 1 whose one item is that value back onto it. In other
      words:

          stack[-1] = tuple(stack[-1:])
      """),

    I(name='TUPLE2',
      code='\x86',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a two-tuple out of the top two items on the stack.

      This code pops two values off the stack and pushes a tuple of
      length 2 whose items are those values back onto it. In other
      words:

          stack[-2:] = [tuple(stack[-2:])]
      """),

    I(name='TUPLE3',
      code='\x87',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a three-tuple out of the top three items on the stack.

      This code pops three values off the stack and pushes a tuple of
      length 3 whose items are those values back onto it. In other
      words:

          stack[-3:] = [tuple(stack[-3:])]
      """),

    # Ways to build dicts.

    I(name='EMPTY_DICT',
      code='}',
      arg=None,
      stack_before=[],
      stack_after=[pydict],
      proto=1,
      doc="Push an empty dict."),

    I(name='DICT',
      code='d',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pydict],
      proto=0,
      doc="""Build a dict out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python dict, which single dict object replaces all of the
      stack from the topmost markobject onward. The stack slice alternates
      key, value, key, value, .... For example,

      Stack before:  ... markobject 1 2 3 'abc'
      Stack after:   ...
                         {1: 2, 3: 'abc'}
      """),

    I(name='SETITEM',
      code='s',
      arg=None,
      stack_before=[pydict, anyobject, anyobject],
      stack_after=[pydict],
      proto=0,
      doc="""Add a key+value pair to an existing dict.

      Stack before:  ... pydict key value
      Stack after:   ... pydict

      where pydict has been modified via pydict[key] = value.
      """),

    I(name='SETITEMS',
      code='u',
      arg=None,
      stack_before=[pydict, markobject, stackslice],
      stack_after=[pydict],
      proto=1,
      doc="""Add an arbitrary number of key+value pairs to an existing dict.

      The slice of the stack following the topmost markobject is taken as
      an alternating sequence of keys and values, added to the dict
      immediately under the topmost markobject. Everything at and after the
      topmost markobject is popped, leaving the mutated dict at the top
      of the stack.

      Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
      Stack after:   ... pydict

      where pydict has been modified via pydict[key_i] = value_i for i in
      1, 2, ..., n, and in that order.
      """),

    # Ways to build sets

    I(name='EMPTY_SET',
      code='\x8f',
      arg=None,
      stack_before=[],
      stack_after=[pyset],
      proto=4,
      doc="Push an empty set."),

    I(name='ADDITEMS',
      code='\x90',
      arg=None,
      stack_before=[pyset, markobject, stackslice],
      stack_after=[pyset],
      proto=4,
      doc="""Add an arbitrary number of items to an existing set.

      The slice of the stack following the topmost markobject is taken as
      a sequence of items, added to the set immediately under the topmost
      markobject. Everything at and after the topmost markobject is popped,
      leaving the mutated set at the top of the stack.

      Stack before:  ... pyset markobject item_1 ... item_n
      Stack after:   ...
                         pyset

      where pyset has been modified via pyset.add(item_i) for i in
      1, 2, ..., n, and in that order.
      """),

    # Way to build frozensets

    I(name='FROZENSET',
      code='\x91',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pyfrozenset],
      proto=4,
      doc="""Build a frozenset out of the topmost slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python frozenset, which single frozenset object replaces all
      of the stack from the topmost markobject onward. For example,

      Stack before:  ... markobject 1 2 3
      Stack after:   ... frozenset({1, 2, 3})
      """),

    # Stack manipulation.

    I(name='POP',
      code='0',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="Discard the top stack item, shrinking the stack by one item."),

    I(name='DUP',
      code='2',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject, anyobject],
      proto=0,
      doc="Push the top stack item onto the stack again, duplicating it."),

    I(name='MARK',
      code='(',
      arg=None,
      stack_before=[],
      stack_after=[markobject],
      proto=0,
      doc="""Push markobject onto the stack.

      markobject is a unique object, used by other opcodes to identify a
      region of the stack containing a variable number of objects for them
      to work on. See markobject.doc for more detail.
      """),

    I(name='POP_MARK',
      code='1',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[],
      proto=1,
      doc="""Pop all the stack objects at and above the topmost markobject.

      When an opcode using a variable number of stack objects is done,
      POP_MARK is used to remove those objects, and to remove the markobject
      that delimited their starting position on the stack.
      """),

    # Memo manipulation. There are really only two operations (get and put),
    # each in all-text, "short binary", and "long binary" flavors.

    I(name='GET',
      code='g',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following. BINGET and LONG_BINGET are space-optimized
      versions.
      """),

    I(name='BINGET',
      code='h',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 1-byte unsigned
      integer following.
      """),

    I(name='LONG_BINGET',
      code='j',
      arg=uint4,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 4-byte unsigned
      little-endian integer following.
      """),

    I(name='PUT',
      code='p',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[],
      proto=0,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the newline-
      terminated decimal string following. BINPUT and LONG_BINPUT are
      space-optimized versions.
      """),

    I(name='BINPUT',
      code='q',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 1-byte
      unsigned integer following.
      """),

    I(name='LONG_BINPUT',
      code='r',
      arg=uint4,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 4-byte
      unsigned little-endian integer following.
      """),

    I(name='MEMOIZE',
      code='\x94',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=4,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write is the number of
      elements currently present in the memo.
      """),

    # Access the extension registry (predefined objects). Akin to the GET
    # family.

    I(name='EXT1',
      code='\x82',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      This code and the similar EXT2 and EXT4 allow using a registry
      of popular objects that are pickled by name, typically classes.
      It is envisioned that through a global negotiation and
      registration process, third parties can set up a mapping between
      ints and object names.

      In order to guarantee pickle interchangeability, the extension
      code registry ought to be global, although a range of codes may
      be reserved for private use.

      EXT1 has a 1-byte integer argument. This is used to index into the
      extension registry, and the object at that index is pushed on the stack.
      """),

    I(name='EXT2',
      code='\x83',
      arg=uint2,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT2 has a two-byte integer argument.
      """),

    I(name='EXT4',
      code='\x84',
      arg=int4,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT4 has a four-byte integer argument.
      """),

    # Push a class object, or module function, on the stack, via its module
    # and name.

    I(name='GLOBAL',
      code='c',
      arg=stringnl_noescape_pair,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push a global object (module.attr) on the stack.

      Two newline-terminated strings follow the GLOBAL opcode. The first is
      taken as a module name, and the second as a class name. The class
      object module.class is pushed on the stack. More accurately, the
      object returned by self.find_class(module, class) is pushed on the
      stack, so unpickling subclasses can override this form of lookup.
      """),

    I(name='STACK_GLOBAL',
      code='\x93',
      arg=None,
      stack_before=[pyunicode, pyunicode],
      stack_after=[anyobject],
      proto=4,
      doc="""Push a global object (module.attr) on the stack.
      """),

    # Ways to build objects of classes pickle doesn't know about directly
    # (user-defined classes). I despair of documenting this accurately
    # and comprehensibly -- you really have to read the pickle code to
    # find all the special cases.

    I(name='REDUCE',
      code='R',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object built from a callable and an argument tuple.

      The opcode is named to remind one of the __reduce__() method.

      Stack before:  ... callable pytuple
      Stack after:   ... callable(*pytuple)

      The callable and the argument tuple are the first two items returned
      by a __reduce__ method. Applying the callable to the argtuple is
      supposed to reproduce the original object, or at least get it started.
      If the __reduce__ method returns a 3-tuple, the last component is an
      argument to be passed to the object's __setstate__, and then the REDUCE
      opcode is followed by code to create setstate's argument, and then a
      BUILD opcode to apply __setstate__ to that argument.

      If not isinstance(callable, type), REDUCE complains unless the
      callable has been registered with the copyreg module's
      safe_constructors dict, or the callable has a magic
      '__safe_for_unpickling__' attribute with a true value. I'm not sure
      why it does this, but I've sure seen this complaint often enough when
      I didn't want to <wink>.
      """),

    I(name='BUILD',
      code='b',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Finish building an object, via __setstate__ or dict update.

      Stack before:  ... anyobject argument
      Stack after:   ... anyobject

      where anyobject may have been mutated, as follows:

      If the object has a __setstate__ method,

          anyobject.__setstate__(argument)

      is called.

      Else the argument must be a dict, the object must have a __dict__, and
      the object is updated via

          anyobject.__dict__.update(argument)
      """),

    I(name='INST',
      code='i',
      arg=stringnl_noescape_pair,
      stack_before=[markobject, stackslice],
      stack_after=[anyobject],
      proto=0,
      doc="""Build a class instance.

      This is the protocol 0 version of protocol 1's OBJ opcode.
      INST is followed by two newline-terminated strings, giving a
      module and class name, just as for the GLOBAL opcode (and see
      GLOBAL for more details about that). self.find_class(module, name)
      is used to get a class object.

      In addition, all the objects on the stack following the topmost
      markobject are gathered into a tuple and popped (along with the
      topmost markobject), just as for the TUPLE opcode.

      Now it gets complicated. If all of these are true:

        + The argtuple is empty (markobject was at the top of the stack
          at the start).

        + The class object does not have a __getinitargs__ attribute.

      then we want to create an old-style class instance without invoking
      its __init__() method (pickle has waffled on this over the years; not
      calling __init__() is current wisdom). In this case, an instance of
      an old-style dummy class is created, and then we try to rebind its
      __class__ attribute to the desired class object. If this succeeds,
      the new instance object is pushed on the stack, and we're done.

      Else (the argtuple is not empty, it's not an old-style class object,
      or the class object does have a __getinitargs__ attribute), the code
      first insists that the class object have a __safe_for_unpickling__
      attribute. Unlike the __safe_for_unpickling__ check in REDUCE,
      it doesn't matter whether this attribute has a true or false value, it
      only matters whether it exists (XXX this is a bug). If
      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.

      Else (the class object does have a __safe_for_unpickling__ attr),
      the class object obtained from INST's arguments is applied to the
      argtuple obtained from the stack, and the resulting instance object
      is pushed on the stack.

      NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.
      NOTE: the distinction between old-style and new-style classes does
            not make sense in Python 3.
      """),

    I(name='OBJ',
      code='o',
      arg=None,
      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      proto=1,
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it. The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created. This
      can be much more efficient (in both time and space) than repeatedly
      embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream. Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

      Stack before:  ... markobject classobject stackslice
      Stack after:   ... new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this is
      a bug). See INST for the gory details.

      NOTE: In Python 2.3, INST and OBJ are identical except for how they
      get the class object. That was always the intent; the implementations
      had diverged for accidental reasons.
      """),

    I(name='NEWOBJ',
      code='\x81',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=2,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple (the tuple being the stack
      top). Call these cls and args. They are popped off the stack,
      and the value returned by cls.__new__(cls, *args) is pushed back
      onto the stack.
      """),

    I(name='NEWOBJ_EX',
      code='\x92',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[anyobject],
      proto=4,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple and by a keyword argument dict
      (the dict being the stack top). Call these cls, args, and kwargs.
      They are popped off the stack, and the value returned by
      cls.__new__(cls, *args, **kwargs) is pushed back onto the stack.
      """),

    # Machine control.

    I(name='PROTO',
      code='\x80',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=2,
      doc="""Protocol version indicator.

      For protocol 2 and above, a pickle must start with this opcode.
      The argument is the protocol version, an int in range(2, 256).
      """),

    I(name='STOP',
      code='.',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode. The object at the top of the stack
      is popped, and that's the result of unpickling. The stack should be
      empty then.
      """),

    # Framing support.

    I(name='FRAME',
      code='\x95',
      arg=uint8,
      stack_before=[],
      stack_after=[],
      proto=4,
      doc="""Indicate the beginning of a new frame.

      The unpickler may use this opcode to safely prefetch data from its
      underlying stream.
      """),

    # Ways to deal with persistent IDs.

    I(name='PERSID',
      code='P',
      arg=stringnl_noescape,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means.
      PERSID's argument is a newline-terminated str-style (no embedded
      escapes, no bracketing quote characters) string, which *is* "the
      persistent ID". The unpickler passes this string to
      self.persistent_load(). Whatever object that returns is pushed on
      the stack. There is no implementation of persistent_load() in
      Python's unpickler: it must be supplied by an unpickler subclass.
      """),

    I(name='BINPERSID',
      code='Q',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=1,
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream). The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack. See PERSID for more detail.
      """),
]
del I

# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

def assure_pickle_consistency(verbose=False):

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print("skipping %r: it doesn't look like an opcode name" % name)
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, bytes) or len(picklecode) != 1:
            if verbose:
                print(("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode)))
            continue
        picklecode = picklecode.decode("latin-1")
        if picklecode in copy:
            if verbose:
                print("checking name %r w/ code %r for consistency" % (
                      name, picklecode))
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one. Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append("    name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
del assure_pickle_consistency

##############################################################################
# A pickle opcode generator.
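
# Usage sketch (an illustrative addition, not part of the original module).
# It shows one way the opcode generator in this section might be used: to
# list the opcode names of a pickle in order. The helper name
# _opcode_names_sketch is hypothetical. It goes through the public
# pickletools.genops() entry point, so it can also be run on its own.
def _opcode_names_sketch(data):
    """Return the names of the opcodes in the pickle 'data', in order."""
    import pickletools
    # Each item is an (OpcodeInfo, decoded argument, position) triple.
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]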

def _genops(data, yield_end_pos=False):
    if isinstance(data, bytes_types):
        data = io.BytesIO(data)

    if hasattr(data, "tell"):
        getpos = data.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = data.read(1)
        opcode = code2op.get(code.decode("latin-1"))
        if opcode is None:
            if code == b"":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 "<unknown>" if pos is None else pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(data)
        if yield_end_pos:
            yield opcode, arg, pos, getpos()
        else:
            yield opcode, arg, pos
        if code == b'.':
            assert opcode.name == 'STOP'
            break

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or bytes object, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object. If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode. If the pickle is a bytes object,
    it's wrapped in a BytesIO object, and the latter's tell() result is
    used. Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """
    return _genops(pickle)

##############################################################################
# A pickle optimizer.
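
# Usage sketch (an illustrative addition, not part of the original module).
# It shows the intended effect of the optimizer in this section: PUT opcodes
# that no GET refers to are dropped, and the pickle still loads to an equal
# object. The helper name _optimize_sketch is hypothetical. It calls the
# public pickletools.optimize(), so it can also be run on its own.
def _optimize_sketch(obj, protocol=2):
    """Return (original, optimized) pickles of obj, both as bytes."""
    import pickle
    import pickletools
    original = pickle.dumps(obj, protocol)
    optimized = pickletools.optimize(original)
    # Removing unreferenced PUTs must not change what the pickle loads as.
    assert pickle.loads(optimized) == obj
    return original, optimized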

def optimize(p):
    'Optimize a pickle (a bytes object) by removing unused PUT opcodes'
    put = 'PUT'
    get = 'GET'
    oldids = set()          # set of all PUT ids
    newids = {}             # maps each id used by a GET opcode to its new id
    opcodes = []            # (op, idx) or (pos, end_pos)
    proto = 0
    protoheader = b''
    for opcode, arg, pos, end_pos in _genops(p, yield_end_pos=True):
        if 'PUT' in opcode.name:
            oldids.add(arg)
            opcodes.append((put, arg))
        elif opcode.name == 'MEMOIZE':
            idx = len(oldids)
            oldids.add(idx)
            opcodes.append((put, idx))
        elif 'FRAME' in opcode.name:
            # Frames are dropped here and regenerated by the framer below.
            pass
        elif 'GET' in opcode.name:
            if opcode.proto > proto:
                proto = opcode.proto
            newids[arg] = None
            opcodes.append((get, arg))
        elif opcode.name == 'PROTO':
            if arg > proto:
                proto = arg
            if pos == 0:
                protoheader = p[pos:end_pos]
            else:
                opcodes.append((pos, end_pos))
        else:
            opcodes.append((pos, end_pos))
    del oldids

    # Copy the opcodes except for PUTS without a corresponding GET
    out = io.BytesIO()
    # Write the PROTO header before any framing
    out.write(protoheader)
    pickler = pickle._Pickler(out, proto)
    if proto >= 4:
        pickler.framer.start_framing()
    idx = 0
    for op, arg in opcodes:
        frameless = False
        if op is put:
            if arg not in newids:
                continue
            data = pickler.put(idx)
            newids[arg] = idx
            idx += 1
        elif op is get:
            data = pickler.get(newids[arg])
        else:
            data = p[op:arg]
            frameless = len(data) > pickler.framer._FRAME_SIZE_TARGET
        pickler.framer.commit_frame(force=frameless)
        if frameless:
            pickler.framer.file_write(data)
        else:
            pickler.write(data)
    pickler.framer.end_framing()
    return out.getvalue()

##############################################################################
# A symbolic pickle disassembler.
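
# Usage sketch (an illustrative addition, not part of the original module).
# It shows one way to capture the disassembler's output in a string rather
# than printing it to sys.stdout, by passing a text buffer as 'out'. The
# helper name _dis_to_string_sketch is hypothetical. It calls the public
# pickletools.dis(), so it can also be run on its own.
def _dis_to_string_sketch(data, annotate=0):
    """Return the symbolic disassembly of the pickle 'data' as a str."""
    import io
    import pickletools
    out = io.StringIO()
    pickletools.dis(data, out=out, annotate=annotate)
    return out.getvalue()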

def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or bytes object, containing at least
    one pickle. The pickle is disassembled from the current position,
    through the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo. It
    may be mutated by dis(), if the pickle contains PUT, BINPUT,
    LONG_BINPUT, or MEMOIZE opcodes. Passing the same memo object to
    another dis() call then allows disassembly to proceed across multiple
    pickles that were all created by the same pickler with the same memo.
    Ordinarily you don't need to worry about this.

    Optional arg 'indentlevel' is the number of blanks by which to indent
    a new MARK level. It defaults to 4.

    Optional arg 'annotate', if nonzero, instructs dis() to add a short
    description of the opcode on each line of disassembled output.
    The value given to 'annotate' must be an integer and is used as a
    hint for the column where annotation should start. The default
    value is 0, meaning no annotations.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    annocol = annotate  # column hint for annotations
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print("%5d:" % pos, end=' ', file=out)

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"):
            if opcode.name == "MEMOIZE":
                memo_idx = len(memo)
                markmsg = "(as %d)" % memo_idx
            else:
                assert arg is not None
                memo_idx = arg
            if memo_idx in memo:
                # Report memo_idx rather than arg: arg is None for MEMOIZE.
                errormsg = "memo key %r already defined" % memo_idx
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[memo_idx] = stack[-1]
        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        if annotate:
            line += ' ' * (annocol - len(line))
            # make a mild effort to align annotations
            annocol = len(line)
            if annocol > 50:
                annocol = annotate
            line += ' ' + opcode.doc.split('\n', 1)[0]
        print(line, file=out)

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print("highest protocol among opcodes =", maxproto, file=out)
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)

# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

_dis_test = r"""
>>> import pickle
>>> x = [1, 2, (3, 4), {b'abc': "def"}]
>>> pkl0 = pickle.dumps(x, 0)
>>> dis(pkl0)
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: I    INT        1
    8: a    APPEND
    9: I    INT        2
   12: a    APPEND
   13: (    MARK
   14: I        INT        3
   17: I        INT        4
   20: t        TUPLE      (MARK at 13)
   21: p    PUT        1
   24: a    APPEND
   25: (    MARK
   26: d        DICT       (MARK at 25)
   27: p    PUT        2
   30: c    GLOBAL     '_codecs encode'
   46: p    PUT        3
   49: (    MARK
   50: V        UNICODE    'abc'
   55: p        PUT        4
   58: V        UNICODE    'latin1'
   66: p        PUT        5
   69: t        TUPLE      (MARK at 49)
   70: p    PUT        6
   73: R    REDUCE
   74: p    PUT        7
   77: V    UNICODE    'def'
   82: p    PUT        8
   85: s    SETITEM
   86: a    APPEND
   87: .    STOP
highest protocol among opcodes = 0

Try again with a "binary" pickle.

>>> pkl1 = pickle.dumps(x, 1)
>>> dis(pkl1)
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: K        BININT1    1
    6: K        BININT1    2
    8: (        MARK
    9: K            BININT1    3
   11: K            BININT1    4
   13: t            TUPLE      (MARK at 8)
   14: q        BINPUT     1
   16: }        EMPTY_DICT
   17: q        BINPUT     2
   19: c        GLOBAL     '_codecs encode'
   35: q        BINPUT     3
   37: (        MARK
   38: X            BINUNICODE 'abc'
   46: q            BINPUT     4
   48: X            BINUNICODE 'latin1'
   59: q            BINPUT     5
   61: t            TUPLE      (MARK at 37)
   62: q        BINPUT     6
   64: R        REDUCE
   65: q        BINPUT     7
   67: X        BINUNICODE 'def'
   75: q        BINPUT     8
   77: s        SETITEM
   78: e        APPENDS    (MARK at 3)
   79: .    STOP
highest protocol among opcodes = 1

Exercise the INST/OBJ/BUILD family.

>>> import pickletools
>>> dis(pickle.dumps(pickletools.dis, 0))
    0: c    GLOBAL     'pickletools dis'
   17: p    PUT        0
   20: .    STOP
highest protocol among opcodes = 0

>>> from pickletools import _Example
>>> x = [_Example(42)] * 2
>>> dis(pickle.dumps(x, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: c    GLOBAL     'copy_reg _reconstructor'
   30: p    PUT        1
   33: (    MARK
   34: c        GLOBAL     'pickletools _Example'
   56: p        PUT        2
   59: c        GLOBAL     '__builtin__ object'
   79: p        PUT        3
   82: N        NONE
   83: t        TUPLE      (MARK at 33)
   84: p    PUT        4
   87: R    REDUCE
   88: p    PUT        5
   91: (    MARK
   92: d        DICT       (MARK at 91)
   93: p    PUT        6
   96: V    UNICODE    'value'
  103: p    PUT        7
  106: I    INT        42
  110: s    SETITEM
  111: b    BUILD
  112: a    APPEND
  113: g    GET        5
  116: a    APPEND
  117: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: c        GLOBAL     'copy_reg _reconstructor'
   29: q        BINPUT     1
   31: (        MARK
   32: c            GLOBAL     'pickletools _Example'
   54: q            BINPUT     2
   56: c            GLOBAL     '__builtin__ object'
   76: q            BINPUT     3
   78: N            NONE
   79: t            TUPLE      (MARK at 31)
   80: q        BINPUT     4
   82: R        REDUCE
   83: q        BINPUT     5
   85: }        EMPTY_DICT
   86: q        BINPUT     6
   88: X        BINUNICODE 'value'
   98: q        BINPUT     7
  100: K        BININT1    42
  102: s        SETITEM
  103: b        BUILD
  104: h        BINGET     5
  106: e        APPENDS    (MARK at 3)
  107: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2

Try protocol 3 with annotations:

>>> dis(pickle.dumps(T, 3), annotate=1)
    0: \x80 PROTO      3 Protocol version indicator.
    2: ]    EMPTY_LIST   Push an empty list.
    3: q    BINPUT     0 Store the stack top into the memo.  The stack is not popped.
    5: h    BINGET     0 Read an object from the memo and push it on the stack.
    7: \x85 TUPLE1       Build a one-tuple out of the topmost item on the stack.
    8: q    BINPUT     1 Store the stack top into the memo.  The stack is not popped.
   10: a    APPEND       Append an object to a list.
   11: 0    POP          Discard the top stack item, shrinking the stack by one item.
   12: h    BINGET     1 Read an object from the memo and push it on the stack.
   14: .    STOP         Stop the unpickling machine.
highest protocol among opcodes = 2

"""

_memo_test = r"""
>>> import pickle
>>> import io
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
0
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(
        description='disassemble one or more pickle files')
    parser.add_argument(
        'pickle_file', type=argparse.FileType('br'),
        nargs='*', help='the pickle file')
    parser.add_argument(
        '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
        help='the file where the output should be written')
    parser.add_argument(
        '-m', '--memo', action='store_true',
        help='preserve memo between disassemblies')
    parser.add_argument(
        '-l', '--indentlevel', default=4, type=int,
        help='the number of blanks by which to indent a new MARK level')
    parser.add_argument(
        '-a', '--annotate', action='store_true',
        help='annotate each line with a short opcode description')
    parser.add_argument(
        '-p', '--preamble', default="==> {name} <==",
        help='if more than one pickle file is specified, print this before'
        ' each disassembly')
    parser.add_argument(
        '-t', '--test', action='store_true',
        help='run self-test suite')
    parser.add_argument(
        '-v', action='store_true',
        help='run verbosely; only affects self-test run')
    args = parser.parse_args()
    if args.test:
        _test()
    else:
        annotate = 30 if args.annotate else 0
        if not args.pickle_file:
            parser.print_help()
        elif len(args.pickle_file) == 1:
            dis(args.pickle_file[0], args.output, None,
                args.indentlevel, annotate)
        else:
            memo = {} if args.memo else None
            for f in args.pickle_file:
                preamble = args.preamble.format(name=f.name)
                args.output.write(preamble + '\n')
                dis(f, args.output, memo, args.indentlevel, annotate)