'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

import codecs
import io
import pickle
import re
import sys

__all__ = ['dis', 'genops', 'optimize']

bytes_types = pickle.bytes_types

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness. dis() does a lot of this already.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


# "A pickle" is a program for a virtual pickle machine (PM, but more accurately
# called an unpickling machine). It's a sequence of opcodes, interpreted by the
# PM, building an arbitrarily complex Python object.
#
# For the most part, the PM is very simple: there are no looping, testing, or
# conditional instructions, no arithmetic and no function calls. Opcodes are
# executed once each, from first to last, until a STOP opcode is reached.
#
# The PM has two data areas, "the stack" and "the memo".
#
# Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
# integer object on the stack, whose value is gotten from a decimal string
# literal immediately following the INT opcode in the pickle bytestream. Other
# opcodes take Python objects off the stack.
# The result of unpickling is whatever object is left on the stack when the
# final STOP opcode is executed.
#
# The memo is simply an array of objects, or it can be implemented as a dict
# mapping little integers to objects. The memo serves as the PM's "long term
# memory", and the little integers indexing the memo are akin to variable
# names. Some opcodes pop a stack object into the memo at a given index,
# and others push a memo object at a given index onto the stack again.
#
# At heart, that's all the PM has. Subtleties arise for these reasons:
#
# + Object identity. Objects can be arbitrarily complex, and subobjects
#   may be shared (for example, the list [a, a] refers to the same object a
#   twice). It can be vital that unpickling recreate an isomorphic object
#   graph, faithfully reproducing sharing.
#
# + Recursive objects. For example, after "L = []; L.append(L)", L is a
#   list, and L[0] is the same list. This is related to the object identity
#   point, and some sequences of pickle opcodes are subtle in order to
#   get the right result in all cases.
#
# + Things pickle doesn't know everything about. Examples of things pickle
#   does know everything about are Python's builtin scalar and container
#   types, like ints and tuples. They generally have opcodes dedicated to
#   them. For things like module references and instances of user-defined
#   classes, pickle's knowledge is limited. Historically, many enhancements
#   have been made to the pickle protocol in order to do a better (faster,
#   and/or more compact) job on those.
#
# + Backward compatibility and micro-optimization. As explained below,
#   pickle opcodes never go away, not even when better ways to do a thing
#   get invented. The repertoire of the PM just keeps growing over time.
#   For example, protocol 0 had two opcodes for building Python integers (INT
#   and LONG), protocol 1 added three more for more-efficient pickling of short
#   integers, and protocol 2 added two more for more-efficient pickling of
#   long integers (before protocol 2, the only ways to pickle a Python long
#   took time quadratic in the number of digits, for both pickling and
#   unpickling). "Opcode bloat" isn't so much a subtlety as a source of
#   wearying complication.
#
#
# Pickle protocols:
#
# For compatibility, the meaning of a pickle opcode never changes. Instead new
# pickle opcodes get added, and each version's unpickler can handle all the
# pickle opcodes in all protocol versions to date. So old pickles continue to
# be readable forever. The pickler can generally be told to restrict itself to
# the subset of opcodes available under previous protocol versions too, so that
# users can create pickles under the current version readable by older
# versions. However, a pickle written with a protocol before 2 does not
# contain its version number embedded within it (protocol 2 added the PROTO
# opcode for that). If an older unpickler tries to read a pickle using a later
# protocol, the result is most likely an exception due to seeing an unknown (in
# the older unpickler) opcode.
#
# The original pickle used what's now called "protocol 0", and what was called
# "text mode" before Python 2.3. The entire pickle bytestream is made up of
# printable 7-bit ASCII characters, plus the newline character, in protocol 0.
# That's why it was called text mode. Protocol 0 is small and elegant, but
# sometimes painfully inefficient.
#
# The second major set of additions is now called "protocol 1", and was called
# "binary mode" before Python 2.3. This added many opcodes with arguments
# consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
# bytes.
# Binary mode pickles can be substantially smaller than equivalent text mode
# pickles, and sometimes faster too; e.g., BININT represents a 4-byte int as
# 4 bytes following the opcode, which is cheaper to unpickle than the
# (perhaps) 11-character decimal string attached to INT. Protocol 1 also added
# a number of opcodes that operate on many stack elements at once (like APPENDS
# and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
#
# The third major set of additions came in Python 2.3, and is called "protocol
# 2". This added:
#
# - A better way to pickle instances of new-style classes (NEWOBJ).
#
# - A way for a pickle to identify its protocol (PROTO).
#
# - Time- and space-efficient pickling of long ints (LONG{1,4}).
#
# - Shortcuts for small tuples (TUPLE{1,2,3}).
#
# - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
#
# - The "extension registry", a vector of popular objects that can be pushed
#   efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
#   the registry contents are predefined (there's nothing akin to the memo's
#   PUT).
#
# Another independent change with Python 2.3 is the abandonment of any
# pretense that it might be safe to load pickles received from untrusted
# parties -- no sufficient security analysis has been done to guarantee
# this and there isn't a use case that warrants the expense of such an
# analysis.
#
# To this end, all tests for __safe_for_unpickling__ or for
# copyreg.safe_constructors are removed from the unpickling code.
# References to these variables in the descriptions below are to be seen
# as describing unpickling in Python 2.2 and before.


# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed.
# If you want, e.g., XML, write a little program to generate XML from the
# objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1  = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4  = -3   # num bytes is 4-byte signed little-endian int
TAKEN_FROM_ARGUMENT4U = -4   # num bytes is 4-byte unsigned little-endian int
TAKEN_FROM_ARGUMENT8U = -5   # num bytes is 8-byte unsigned little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4,4U,8U} are negative values for
        # variable-length cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4,
                                             TAKEN_FROM_ARGUMENT4U,
                                             TAKEN_FROM_ARGUMENT8U))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import io
    >>> read_uint1(io.BytesIO(b'\xff'))
    255
    """

    data = f.read(1)
    if data:
        return data[0]
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    name='uint1',
    n=1,
    reader=read_uint1,
    doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import io
    >>> read_uint2(io.BytesIO(b'\xff\x00'))
    255
    >>> read_uint2(io.BytesIO(b'\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2',
    n=2,
    reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import io
    >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4',
    n=4,
    reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")


def read_uint4(f):
    r"""
    >>> import io
    >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<I", data)[0]
    raise ValueError("not enough data in stream to read uint4")

uint4 = ArgumentDescriptor(
    name='uint4',
    n=4,
    reader=read_uint4,
    doc="Four-byte unsigned integer, little-endian.")


def read_uint8(f):
    r"""
    >>> import io
    >>> read_uint8(io.BytesIO(b'\xff\x00\x00\x00\x00\x00\x00\x00'))
    255
    >>> read_uint8(io.BytesIO(b'\xff' * 8)) == 2**64-1
    True
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack("<Q", data)[0]
    raise ValueError("not enough data in stream to read uint8")

uint8 = ArgumentDescriptor(
    name='uint8',
    n=8,
    reader=read_uint8,
    doc="Eight-byte unsigned integer, little-endian.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import io
    >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(io.BytesIO(b"\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around b''

    >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
    ''

    >>> read_stringnl(io.BytesIO(b"''\n"))
    ''

    >>> read_stringnl(io.BytesIO(b'"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in (b'"', b"'"):
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    if decode:
        data = codecs.escape_decode(data)[0].decode("ascii")
    return data

stringnl = ArgumentDescriptor(
    name='stringnl',
    n=UP_TO_NEWLINE,
    reader=read_stringnl,
    doc="""A newline-terminated string.

                   This is a repr-style string, with embedded escapes, and
                   bracketing quotes.
                   """)

def read_stringnl_noescape(f):
    return read_stringnl(f, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

                        This is a str-style string, without embedded escapes,
                        or bracketing quotes. It should consist solely of
                        printable ASCII characters.
                        """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import io
    >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

                             These are str-style strings, without embedded
                             escapes, or bracketing quotes. They should
                             consist solely of printable ASCII characters.
                             The pair is returned as a single string, with
                             a single blank separating the two strings.
                             """)


def read_string1(f):
    r"""
    >>> import io
    >>> read_string1(io.BytesIO(b"\x00"))
    ''
    >>> read_string1(io.BytesIO(b"\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_string1,
    doc="""A counted string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes in the string, and the second argument is that many
              bytes.
              """)


def read_string4(f):
    r"""
    >>> import io
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_string4,
    doc="""A counted string.

              The first argument is a 4-byte little-endian signed int giving
              the number of bytes in the string, and the second argument is
              that many bytes.
              """)


def read_bytes1(f):
    r"""
    >>> import io
    >>> read_bytes1(io.BytesIO(b"\x00"))
    b''
    >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
    b'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
                     (n, len(data)))

bytes1 = ArgumentDescriptor(
    name="bytes1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_bytes1,
    doc="""A counted bytes string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes in the string, and the second argument is that many
              bytes.
              """)


def read_bytes4(f):
    r"""
    >>> import io
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    b'abc'
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
    """

    n = read_uint4(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
                     (n, len(data)))

bytes4 = ArgumentDescriptor(
    name="bytes4",
    n=TAKEN_FROM_ARGUMENT4U,
    reader=read_bytes4,
    doc="""A counted bytes string.

              The first argument is a 4-byte little-endian unsigned int giving
              the number of bytes, and the second argument is that many bytes.
              """)


def read_bytes8(f):
    r"""
    >>> import io, struct, sys
    >>> read_bytes8(io.BytesIO(b"\x00\x00\x00\x00\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes8(io.BytesIO(b"\x03\x00\x00\x00\x00\x00\x00\x00abcdef"))
    b'abc'
    >>> bigsize8 = struct.pack("<Q", sys.maxsize//3)
    >>> read_bytes8(io.BytesIO(bigsize8 + b"abcdef"))  #doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError: expected ... bytes in a bytes8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytes8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes8, but only %d remain" %
                     (n, len(data)))

bytes8 = ArgumentDescriptor(
    name="bytes8",
    n=TAKEN_FROM_ARGUMENT8U,
    reader=read_bytes8,
    doc="""A counted bytes string.

              The first argument is an 8-byte little-endian unsigned int giving
              the number of bytes, and the second argument is that many bytes.
              """)

def read_unicodestringnl(f):
    r"""
    >>> import io
    >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
    True
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return str(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

                      This is raw-unicode-escape encoded, so consists of
                      printable ASCII characters, and may contain embedded
                      escape sequences.
                      """)


def read_unicodestring1(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc)])  # little-endian 1-byte length
    >>> t = read_unicodestring1(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring1(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring1, but only 6 remain
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring1, but only %d "
                     "remain" % (n, len(data)))

unicodestring1 = ArgumentDescriptor(
    name="unicodestring1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_unicodestring1,
    doc="""A counted Unicode string.

                    The first argument is a 1-byte unsigned int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


def read_unicodestring4(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
    >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_uint4(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT4U,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

                    The first argument is a 4-byte little-endian unsigned int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


def read_unicodestring8(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc)]) + b'\0' * 7  # little-endian 8-byte length
    >>> t = read_unicodestring8(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring8(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("unicodestring8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring8, but only %d "
                     "remain" % (n, len(data)))

unicodestring8 = ArgumentDescriptor(
    name="unicodestring8",
    n=TAKEN_FROM_ARGUMENT8U,
    reader=read_unicodestring8,
    doc="""A counted Unicode string.

                    The first argument is an 8-byte little-endian unsigned int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


def read_decimalnl_short(f):
    r"""
    >>> import io
    >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
    1234

    >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: invalid literal for int() with base 10: b'1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)

    # There's a hack for True and False here.
    if s == b"00":
        return False
    elif s == b"01":
        return True

    return int(s)

def read_decimalnl_long(f):
    r"""
    >>> import io

    >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
    1234

    >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
    123456789012345678901234
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s[-1:] == b'L':
        s = s[:-1]
    return int(s)


decimalnl_short = ArgumentDescriptor(
    name='decimalnl_short',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_short,
    doc="""A newline-terminated decimal integer literal.

                          This never has a trailing 'L', and the integer fit
                          in a short Python int on the box where the pickle
                          was written -- but there's no guarantee it will fit
                          in a short Python int on the box where the pickle
                          is read.
                          """)

decimalnl_long = ArgumentDescriptor(
    name='decimalnl_long',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_long,
    doc="""A newline-terminated decimal integer literal.

                         This has a trailing 'L', and can represent integers
                         of any size.
                         """)


def read_floatnl(f):
    r"""
    >>> import io
    >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
    name='floatnl',
    n=UP_TO_NEWLINE,
    reader=read_floatnl,
    doc="""A newline-terminated decimal floating literal.

              In general this requires 17 significant digits for roundtrip
              identity, and pickling then unpickling infinities, NaNs, and
              minus zero doesn't work across boxes, or on some boxes even
              on itself (e.g., Windows can't read the strings it produces
              for infinities or NaNs).
              """)

def read_float8(f):
    r"""
    >>> import io, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(io.BytesIO(raw + b"\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
    name='float8',
    n=8,
    reader=read_float8,
    doc="""An 8-byte binary representation of a float, big-endian.

             The format is unique to Python, and shared with the struct
             module (format string '>d') "in theory" (the struct and pickle
             implementations don't share the code -- they should). It's
             strongly related to the IEEE-754 double format, and, in normal
             cases, is in fact identical to the big-endian 754 double format.
             On other boxes the dynamic range is limited to that of a 754
             double, and "add a half and chop" rounding is used to reduce
             the precision to 53 bits. However, even on a 754 box,
             infinities, NaNs, and minus zero may not be handled correctly
             (may not survive roundtrip pickling intact).
             """)

# Protocol 2 formats

from pickle import decode_long

def read_long1(f):
    r"""
    >>> import io
    >>> read_long1(io.BytesIO(b"\x00"))
    0
    >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
    255
    >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
    32767
    >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
    -256
    >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
    -32768
    """

    n = read_uint1(f)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    name="long1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_long1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the long 0L.
    """)

def read_long4(f):
    r"""
    >>> import io
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
    255
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
    32767
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
    -256
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
    -32768
    >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
    0
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    name="long4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_long4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the int 0, although
    LONG1 should really be used then instead (and in any case where
    # of bytes < 256).
    """)


##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc

    def __repr__(self):
        return self.name


pyint = pylong = StackObject(
    name='int',
    obtype=int,
    doc="A Python integer object.")

pyinteger_or_bool = StackObject(
    name='int_or_bool',
    obtype=(int, bool),
    doc="A Python integer or boolean object.")

pybool = StackObject(
    name='bool',
    obtype=bool,
    doc="A Python boolean object.")

pyfloat = StackObject(
    name='float',
    obtype=float,
    doc="A Python float object.")

pybytes_or_str = pystring = StackObject(
    name='bytes_or_str',
    obtype=(bytes, str),
    doc="A Python bytes or (Unicode) string object.")

pybytes = StackObject(
    name='bytes',
    obtype=bytes,
    doc="A Python bytes object.")

pyunicode = StackObject(
    name='str',
    obtype=str,
    doc="A Python (Unicode) string object.")

pynone = StackObject(
    name="None",
    obtype=type(None),
    doc="The Python None object.")

pytuple = StackObject(
    name="tuple",
    obtype=tuple,
    doc="A Python tuple object.")

pylist = StackObject(
    name="list",
    obtype=list,
    doc="A Python list object.")

pydict = StackObject(
    name="dict",
    obtype=dict,
    doc="A Python dict object.")

pyset = StackObject(
    name="set",
    obtype=set,
    doc="A Python set object.")

pyfrozenset = StackObject(
    name="frozenset",
    obtype=frozenset,
    doc="A Python frozenset object.")

anyobject = StackObject(
    name='any',
    obtype=object,
    doc="Any kind of object whatsoever.")

markobject = StackObject(
    name="mark",
    obtype=StackObject,
    doc="""'The mark' is a unique object.

Opcodes that operate on a variable number of objects
generally don't embed the count of objects in the opcode,
or pull it off the stack. Instead the MARK opcode is used
to push a special marker object on the stack, and then
some other opcodes grab all the objects from the top of
the stack down to (but not including) the topmost marker
object.
""")

stackslice = StackObject(
    name="stackslice",
    obtype=StackObject,
    doc="""An object representing a contiguous slice of the stack.

This is used in conjunction with markobject, to represent all
of the stack following the topmost markobject. For example,
the POP_MARK opcode changes the stack from

    [..., markobject, stackslice]
to
    [...]

No matter how many objects are on the stack after the topmost
markobject, POP_MARK gets rid of all of them (including the
topmost markobject too).
""")

##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):

    __slots__ = (
        # symbolic name of opcode; a string
        'name',

        # the code used in a bytestream to represent the opcode; a
        # one-character string
        'code',

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type. Note that
        # arg.reader(s) can be used to read and decode the argument from
        # the bytestream s, and arg.doc documents the format of the raw
        # argument bytes. If the opcode doesn't have an argument embedded
        # in the bytestream, arg should be None.
        'arg',

        # what the stack looks like before this opcode runs; a list
        'stack_before',

        # what the stack looks like after this opcode runs; a list
        'stack_after',

        # the protocol number in which this opcode was introduced; an int
        'proto',

        # human-readable docs for this opcode; a string
        'doc',
    )

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(code, str)
        assert len(code) == 1
        self.code = code

        assert arg is None or isinstance(arg, ArgumentDescriptor)
        self.arg = arg

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
        self.proto = proto

        assert isinstance(doc, str)
        self.doc = doc

I = OpcodeInfo
opcodes = [

    # Ways to spell integers.

    I(name='INT',
      code='I',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[pyinteger_or_bool],
      proto=0,
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box. The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
      True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
      Leading zeroes are never produced for a genuine integer. The 2.3
      (and later) unpicklers special-case these and return bool instead;
      earlier unpicklers ignore the leading "0" and return the int.
      """),

    I(name='BININT',
      code='J',
      arg=int4,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.
      """),

    I(name='BININT1',
      code='K',
      arg=uint1,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,
      in range(256).
      """),

    I(name='BININT2',
      code='M',
      arg=uint2,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16). Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.
      """),

    I(name='LONG',
      code='L',
      arg=decimalnl_long,
      stack_before=[],
      stack_after=[pyint],
      proto=0,
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long. There doesn't seem to be a real purpose
      to the trailing 'L'.

      Note that LONG takes time quadratic in the number of digits when
      unpickling (this is simply due to the nature of decimal->binary
      conversion). Proto 2 added linear-time (in C; still quadratic-time
      in Python) LONG1 and LONG4 opcodes.
      """),

    I(name="LONG1",
      code='\x8a',
      arg=long1,
      stack_before=[],
      stack_after=[pyint],
      proto=2,
      doc="""Long integer using one-byte length.

      A more efficient encoding of a Python long; the long1 encoding
      says it all."""),

    I(name="LONG4",
      code='\x8b',
      arg=long4,
      stack_before=[],
      stack_after=[pyint],
      proto=2,
      doc="""Long integer using four-byte length.

      A more efficient encoding of a Python long; the long4 encoding
      says it all."""),

    # Ways to spell strings (8-bit, not Unicode).

    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pybytes_or_str],
      proto=0,
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes. The argument extends until the next
      newline character. These are usually decoded into a str instance
      using the encoding given to the Unpickler constructor, or the default,
      'ASCII'. If the encoding given was 'bytes' however, they will be
      decoded as a bytes object instead.
      """),

    I(name='BINSTRING',
      code='T',
      arg=string4,
      stack_before=[],
      stack_after=[pybytes_or_str],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 4-byte little-endian
      signed int giving the number of bytes in the string, and the
      second is that many bytes, which are taken literally as the string
      content. These are usually decoded into a str instance using the
      encoding given to the Unpickler constructor, or the default,
      'ASCII'. If the encoding given was 'bytes' however, they will be
      decoded as a bytes object instead.
      """),

    I(name='SHORT_BINSTRING',
      code='U',
      arg=string1,
      stack_before=[],
      stack_after=[pybytes_or_str],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content. These are
      usually decoded into a str instance using the encoding given to
      the Unpickler constructor, or the default, 'ASCII'. If the
      encoding given was 'bytes' however, they will be decoded as a bytes
      object instead.
      """),

    # Bytes (protocol 3 and higher; older protocols don't support bytes
    # at all)

    I(name='BINBYTES',
      code='B',
      arg=bytes4,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 4-byte little-endian unsigned int
      giving the number of bytes, and the second is that many bytes, which are
      taken literally as the bytes content.
      """),

    I(name='SHORT_BINBYTES',
      code='C',
      arg=bytes1,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes, and the second is that many bytes, which are taken
      literally as the bytes content.
      """),

    I(name='BINBYTES8',
      code='\x8e',
      arg=bytes8,
      stack_before=[],
      stack_after=[pybytes],
      proto=4,
      doc="""Push a Python bytes object.

      There are two arguments: the first is an 8-byte unsigned int giving
      the number of bytes, and the second is that many bytes, which are
      taken literally as the bytes content.
      """),

    # Ways to spell None.

    I(name='NONE',
      code='N',
      arg=None,
      stack_before=[],
      stack_after=[pynone],
      proto=0,
      doc="Push None on the stack."),

    # Ways to spell bools, starting with proto 2. See INT for how this was
    # done before proto 2.

    I(name='NEWTRUE',
      code='\x88',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""True.

      Push True onto the stack."""),

    I(name='NEWFALSE',
      code='\x89',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""False.

      Push False onto the stack."""),

    # Ways to spell Unicode strings.

    I(name='UNICODE',
      code='V',
      arg=unicodestringnl,
      stack_before=[],
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences. The argument extends
      until the next newline character.
      """),

    I(name='SHORT_BINUNICODE',
      code='\x8c',
      arg=unicodestring1,
      stack_before=[],
      stack_after=[pyunicode],
      proto=4,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 1-byte unsigned int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    I(name='BINUNICODE',
      code='X',
      arg=unicodestring4,
      stack_before=[],
      stack_after=[pyunicode],
      proto=1,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 4-byte little-endian unsigned int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    I(name='BINUNICODE8',
      code='\x8d',
      arg=unicodestring8,
      stack_before=[],
      stack_after=[pyunicode],
      proto=4,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is an 8-byte little-endian unsigned
      int giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    # Ways to spell floats.

    I(name='FLOAT',
      code='F',
      arg=floatnl,
      stack_before=[],
      stack_after=[pyfloat],
      proto=0,
      doc="""Newline-terminated decimal float literal.

      The argument is repr(a_float), and in general requires 17 significant
      digits for roundtrip conversion to be an identity (this is so for
      IEEE-754 double precision values, which is what Python float maps to
      on most boxes).

      In general, FLOAT cannot be used to transport infinities, NaNs, or
      minus zero across boxes (or even on a single box, if the platform C
      library can't read the strings it produces for such things -- Windows
      is like that), but may do less damage than BINFLOAT on boxes with
      greater precision or dynamic range than IEEE-754 double.
      """),

    I(name='BINFLOAT',
      code='G',
      arg=float8,
      stack_before=[],
      stack_after=[pyfloat],
      proto=1,
      doc="""Float stored in binary form, with 8 bytes of data.

      This generally requires less than half the space of FLOAT encoding.
      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
      minus zero, raises an exception if the exponent exceeds the range of
      an IEEE-754 double, and retains no more than 53 bits of precision (if
      there are more than that, "add a half and chop" rounding is used to
      cut it back to 53 significant bits).
      """),

    # Ways to build lists.

    I(name='EMPTY_LIST',
      code=']',
      arg=None,
      stack_before=[],
      stack_after=[pylist],
      proto=1,
      doc="Push an empty list."),

    I(name='APPEND',
      code='a',
      arg=None,
      stack_before=[pylist, anyobject],
      stack_after=[pylist],
      proto=0,
      doc="""Append an object to a list.

      Stack before:  ... pylist anyobject
      Stack after:   ... pylist+[anyobject]

      although pylist is really extended in-place.
      """),

    I(name='APPENDS',
      code='e',
      arg=None,
      stack_before=[pylist, markobject, stackslice],
      stack_after=[pylist],
      proto=1,
      doc="""Extend a list by a slice of stack objects.

      Stack before:  ... pylist markobject stackslice
      Stack after:   ... pylist+stackslice

      although pylist is really extended in-place.
      """),

    I(name='LIST',
      code='l',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pylist],
      proto=0,
      doc="""Build a list out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python list, which single list object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... [1, 2, 3, 'abc']
      """),

    # Ways to build tuples.

    I(name='EMPTY_TUPLE',
      code=')',
      arg=None,
      stack_before=[],
      stack_after=[pytuple],
      proto=1,
      doc="Push an empty tuple."),

    I(name='TUPLE',
      code='t',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pytuple],
      proto=0,
      doc="""Build a tuple out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python tuple, which single tuple object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... (1, 2, 3, 'abc')
      """),

    I(name='TUPLE1',
      code='\x85',
      arg=None,
      stack_before=[anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a one-tuple out of the topmost item on the stack.

      This code pops one value off the stack and pushes a tuple of
      length 1 whose one item is that value back onto it. In other
      words:

          stack[-1] = tuple(stack[-1:])
      """),

    I(name='TUPLE2',
      code='\x86',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a two-tuple out of the top two items on the stack.

      This code pops two values off the stack and pushes a tuple of
      length 2 whose items are those values back onto it. In other
      words:

          stack[-2:] = [tuple(stack[-2:])]
      """),

    I(name='TUPLE3',
      code='\x87',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a three-tuple out of the top three items on the stack.

      This code pops three values off the stack and pushes a tuple of
      length 3 whose items are those values back onto it. In other
      words:

          stack[-3:] = [tuple(stack[-3:])]
      """),

    # Ways to build dicts.

    I(name='EMPTY_DICT',
      code='}',
      arg=None,
      stack_before=[],
      stack_after=[pydict],
      proto=1,
      doc="Push an empty dict."),

    I(name='DICT',
      code='d',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pydict],
      proto=0,
      doc="""Build a dict out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python dict, which single dict object replaces all of the
      stack from the topmost markobject onward. The stack slice alternates
      key, value, key, value, .... For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... {1: 2, 3: 'abc'}
      """),

    I(name='SETITEM',
      code='s',
      arg=None,
      stack_before=[pydict, anyobject, anyobject],
      stack_after=[pydict],
      proto=0,
      doc="""Add a key+value pair to an existing dict.

      Stack before:  ... pydict key value
      Stack after:   ... pydict

      where pydict has been modified via pydict[key] = value.
      """),

    I(name='SETITEMS',
      code='u',
      arg=None,
      stack_before=[pydict, markobject, stackslice],
      stack_after=[pydict],
      proto=1,
      doc="""Add an arbitrary number of key+value pairs to an existing dict.

      The slice of the stack following the topmost markobject is taken as
      an alternating sequence of keys and values, added to the dict
      immediately under the topmost markobject. Everything at and after the
      topmost markobject is popped, leaving the mutated dict at the top
      of the stack.

      Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
      Stack after:   ... pydict

      where pydict has been modified via pydict[key_i] = value_i for i in
      1, 2, ..., n, and in that order.
      """),

    # Ways to build sets

    I(name='EMPTY_SET',
      code='\x8f',
      arg=None,
      stack_before=[],
      stack_after=[pyset],
      proto=4,
      doc="Push an empty set."),

    I(name='ADDITEMS',
      code='\x90',
      arg=None,
      stack_before=[pyset, markobject, stackslice],
      stack_after=[pyset],
      proto=4,
      doc="""Add an arbitrary number of items to an existing set.

      The slice of the stack following the topmost markobject is taken as
      a sequence of items, added to the set immediately under the topmost
      markobject. Everything at and after the topmost markobject is popped,
      leaving the mutated set at the top of the stack.

      Stack before:  ... pyset markobject item_1 ... item_n
      Stack after:   ... pyset

      where pyset has been modified via pyset.add(item_i) for i in
      1, 2, ..., n, and in that order.
      """),

    # Way to build frozensets

    I(name='FROZENSET',
      code='\x91',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pyfrozenset],
      proto=4,
      doc="""Build a frozenset out of the topmost slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python frozenset, which single frozenset object replaces all
      of the stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3
      Stack after:  ... frozenset({1, 2, 3})
      """),

    # Stack manipulation.

    I(name='POP',
      code='0',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="Discard the top stack item, shrinking the stack by one item."),

    I(name='DUP',
      code='2',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject, anyobject],
      proto=0,
      doc="Push the top stack item onto the stack again, duplicating it."),

    I(name='MARK',
      code='(',
      arg=None,
      stack_before=[],
      stack_after=[markobject],
      proto=0,
      doc="""Push markobject onto the stack.

      markobject is a unique object, used by other opcodes to identify a
      region of the stack containing a variable number of objects for them
      to work on. See markobject.doc for more detail.
      """),

    I(name='POP_MARK',
      code='1',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[],
      proto=1,
      doc="""Pop all the stack objects at and above the topmost markobject.

      When an opcode using a variable number of stack objects is done,
      POP_MARK is used to remove those objects, and to remove the markobject
      that delimited their starting position on the stack.
      """),

    # Memo manipulation. There are really only two operations (get and put),
    # each in all-text, "short binary", and "long binary" flavors.

    I(name='GET',
      code='g',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following. BINGET and LONG_BINGET are space-optimized
      versions.
      """),

    I(name='BINGET',
      code='h',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 1-byte unsigned
      integer following.
      """),

    I(name='LONG_BINGET',
      code='j',
      arg=uint4,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 4-byte unsigned
      little-endian integer following.
      """),

    I(name='PUT',
      code='p',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[],
      proto=0,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the newline-
      terminated decimal string following. BINPUT and LONG_BINPUT are
      space-optimized versions.
      """),

    I(name='BINPUT',
      code='q',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 1-byte
      unsigned integer following.
      """),

    I(name='LONG_BINPUT',
      code='r',
      arg=uint4,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 4-byte
      unsigned little-endian integer following.
      """),

    I(name='MEMOIZE',
      code='\x94',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=4,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write is the number of
      elements currently present in the memo.
      """),

    # Access the extension registry (predefined objects). Akin to the GET
    # family.

    I(name='EXT1',
      code='\x82',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      This code and the similar EXT2 and EXT4 allow using a registry
      of popular objects that are pickled by name, typically classes.
      It is envisioned that through a global negotiation and
      registration process, third parties can set up a mapping between
      ints and object names.

      In order to guarantee pickle interchangeability, the extension
      code registry ought to be global, although a range of codes may
      be reserved for private use.

      EXT1 has a 1-byte integer argument. This is used to index into the
      extension registry, and the object at that index is pushed on the stack.
      """),

    I(name='EXT2',
      code='\x83',
      arg=uint2,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT2 has a two-byte integer argument.
      """),

    I(name='EXT4',
      code='\x84',
      arg=int4,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT4 has a four-byte integer argument.
      """),

    # Push a class object, or module function, on the stack, via its module
    # and name.

    I(name='GLOBAL',
      code='c',
      arg=stringnl_noescape_pair,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push a global object (module.attr) on the stack.

      Two newline-terminated strings follow the GLOBAL opcode. The first is
      taken as a module name, and the second as a class name. The class
      object module.class is pushed on the stack. More accurately, the
      object returned by self.find_class(module, class) is pushed on the
      stack, so unpickling subclasses can override this form of lookup.
      """),

    I(name='STACK_GLOBAL',
      code='\x93',
      arg=None,
      stack_before=[pyunicode, pyunicode],
      stack_after=[anyobject],
      proto=4,
      doc="""Push a global object (module.attr) on the stack.
      """),

    # Ways to build objects of classes pickle doesn't know about directly
    # (user-defined classes). I despair of documenting this accurately
    # and comprehensibly -- you really have to read the pickle code to
    # find all the special cases.

    I(name='REDUCE',
      code='R',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object built from a callable and an argument tuple.

      The opcode is named to remind you of the __reduce__() method.

      Stack before: ... callable pytuple
      Stack after:  ... callable(*pytuple)

      The callable and the argument tuple are the first two items returned
      by a __reduce__ method. Applying the callable to the argtuple is
      supposed to reproduce the original object, or at least get it started.
      If the __reduce__ method returns a 3-tuple, the last component is an
      argument to be passed to the object's __setstate__, and then the REDUCE
      opcode is followed by code to create setstate's argument, and then a
      BUILD opcode to apply __setstate__ to that argument.

      If not isinstance(callable, type), REDUCE complains unless the
      callable has been registered with the copyreg module's
      safe_constructors dict, or the callable has a magic
      '__safe_for_unpickling__' attribute with a true value. I'm not sure
      why it does this, but I've sure seen this complaint often enough when
      I didn't want to <wink>.
      """),

    I(name='BUILD',
      code='b',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Finish building an object, via __setstate__ or dict update.

      Stack before: ... anyobject argument
      Stack after:  ... anyobject

      where anyobject may have been mutated, as follows:

      If the object has a __setstate__ method,

          anyobject.__setstate__(argument)

      is called.

      Else the argument must be a dict, the object must have a __dict__, and
      the object is updated via

          anyobject.__dict__.update(argument)
      """),

    I(name='INST',
      code='i',
      arg=stringnl_noescape_pair,
      stack_before=[markobject, stackslice],
      stack_after=[anyobject],
      proto=0,
      doc="""Build a class instance.

      This is the protocol 0 version of protocol 1's OBJ opcode.
      INST is followed by two newline-terminated strings, giving a
      module and class name, just as for the GLOBAL opcode (and see
      GLOBAL for more details about that). self.find_class(module, name)
      is used to get a class object.

      In addition, all the objects on the stack following the topmost
      markobject are gathered into a tuple and popped (along with the
      topmost markobject), just as for the TUPLE opcode.

      Now it gets complicated. If all of these are true:

        + The argtuple is empty (markobject was at the top of the stack
          at the start).

        + The class object does not have a __getinitargs__ attribute.

      then we want to create an old-style class instance without invoking
      its __init__() method (pickle has waffled on this over the years; not
      calling __init__() is current wisdom). In this case, an instance of
      an old-style dummy class is created, and then we try to rebind its
      __class__ attribute to the desired class object. If this succeeds,
      the new instance object is pushed on the stack, and we're done.

      Else (the argtuple is not empty, it's not an old-style class object,
      or the class object does have a __getinitargs__ attribute), the code
      first insists that the class object have a __safe_for_unpickling__
      attribute. Unlike the __safe_for_unpickling__ check in REDUCE,
      it doesn't matter whether this attribute has a true or false value, it
      only matters whether it exists (XXX this is a bug). If
      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.

      Else (the class object does have a __safe_for_unpickling__ attr),
      the class object obtained from INST's arguments is applied to the
      argtuple obtained from the stack, and the resulting instance object
      is pushed on the stack.

      NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.
      NOTE: the distinction between old-style and new-style classes does
            not make sense in Python 3.
      """),

    I(name='OBJ',
      code='o',
      arg=None,
      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      proto=1,
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it. The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created. This
      can be much more efficient (in both time and space) than repeatedly
      embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream. Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

      Stack before: ... markobject classobject stackslice
      Stack after:  ... new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this is
      a bug). See INST for the gory details.

      NOTE: In Python 2.3, INST and OBJ are identical except for how they
      get the class object. That was always the intent; the implementations
      had diverged for accidental reasons.
      """),

    I(name='NEWOBJ',
      code='\x81',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=2,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple (the tuple being the stack
      top). Call these cls and args. They are popped off the stack,
      and the value returned by cls.__new__(cls, *args) is pushed back
      onto the stack.
      """),

    I(name='NEWOBJ_EX',
      code='\x92',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[anyobject],
      proto=4,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple and by a keyword argument dict
      (the dict being the stack top). Call these cls, args and kwargs.
      They are popped off the stack, and the value returned by
      cls.__new__(cls, *args, **kwargs) is pushed back onto the stack.
      """),

    # Machine control.

    I(name='PROTO',
      code='\x80',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=2,
      doc="""Protocol version indicator.

      For protocol 2 and above, a pickle must start with this opcode.
      The argument is the protocol version, an int in range(2, 256).
      """),

    I(name='STOP',
      code='.',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode. The object at the top of the stack
      is popped, and that's the result of unpickling. The stack should be
      empty then.
      """),

    # Framing support.

    I(name='FRAME',
      code='\x95',
      arg=uint8,
      stack_before=[],
      stack_after=[],
      proto=4,
      doc="""Indicate the beginning of a new frame.

      The unpickler may use this opcode to safely prefetch data from its
      underlying stream.
      """),

    # Ways to deal with persistent IDs.

    I(name='PERSID',
      code='P',
      arg=stringnl_noescape,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means. PERSID's
      argument is a newline-terminated str-style (no embedded escapes, no
      bracketing quote characters) string, which *is* "the persistent ID".
      The unpickler passes this string to self.persistent_load(). Whatever
      object that returns is pushed on the stack. There is no implementation
      of persistent_load() in Python's unpickler: it must be supplied by an
      unpickler subclass.
      """),

    I(name='BINPERSID',
      code='Q',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=1,
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream). The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack. See PERSID for more detail.
      """),
]
del I

# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

def assure_pickle_consistency(verbose=False):

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print("skipping %r: it doesn't look like an opcode name" % name)
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, bytes) or len(picklecode) != 1:
            if verbose:
                print(("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode)))
            continue
        picklecode = picklecode.decode("latin-1")
        if picklecode in copy:
            if verbose:
                print("checking name %r w/ code %r for consistency" % (
                      name, picklecode))
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one. Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append("    name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
del assure_pickle_consistency

##############################################################################
# A pickle opcode generator.

def _genops(data, yield_end_pos=False):
    # Accept raw bytes as well as file-like objects.
    if isinstance(data, bytes_types):
        data = io.BytesIO(data)

    if hasattr(data, "tell"):
        getpos = data.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = data.read(1)
        opcode = code2op.get(code.decode("latin-1"))
        if opcode is None:
            if code == b"":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 "<unknown>" if pos is None else pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(data)
        if yield_end_pos:
            yield opcode, arg, pos, getpos()
        else:
            yield opcode, arg, pos
        if code == b'.':
            assert opcode.name == 'STOP'
            break

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or bytes object, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object.
    If the opcode doesn't have an argument, arg is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode. If the pickle is a bytes object,
    it's wrapped in a BytesIO object, and the latter's tell() result is
    used. Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """
    return _genops(pickle)

##############################################################################
# A pickle optimizer.

def optimize(p):
    'Optimize a pickle string by removing unused PUT opcodes'
    put = 'PUT'
    get = 'GET'
    oldids = set()          # set of all PUT ids
    newids = {}             # maps each id used by a GET opcode to its new id
    opcodes = []            # (op, idx) or (pos, end_pos)
    proto = 0
    protoheader = b''
    for opcode, arg, pos, end_pos in _genops(p, yield_end_pos=True):
        if 'PUT' in opcode.name:
            oldids.add(arg)
            opcodes.append((put, arg))
        elif opcode.name == 'MEMOIZE':
            idx = len(oldids)
            oldids.add(idx)
            opcodes.append((put, idx))
        elif 'FRAME' in opcode.name:
            pass
        elif 'GET' in opcode.name:
            if opcode.proto > proto:
                proto = opcode.proto
            newids[arg] = None
            opcodes.append((get, arg))
        elif opcode.name == 'PROTO':
            if arg > proto:
                proto = arg
            if pos == 0:
                protoheader = p[pos: end_pos]
            else:
                opcodes.append((pos, end_pos))
        else:
            opcodes.append((pos, end_pos))
    del oldids

    # Copy the opcodes except for PUTS without a corresponding GET
    out = io.BytesIO()
    # Write the PROTO header before any framing
    out.write(protoheader)
    pickler = pickle._Pickler(out, proto)
    if proto >= 4:
        pickler.framer.start_framing()
    idx = 0
    for op, arg in opcodes:
        if op is put:
            if arg not in newids:
                continue
            data = pickler.put(idx)
            newids[arg] = idx
            idx += 1
        elif op is get:
            data = pickler.get(newids[arg])
        else:
            data = p[op:arg]
        pickler.framer.commit_frame()
        pickler.write(data)
    pickler.framer.end_framing()
    return out.getvalue()

##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or bytes object, containing at least
    one pickle. The pickle is disassembled from the current position,
    through the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo. It
    may be mutated by dis(), if the pickle contains PUT or BINPUT opcodes.
    Passing the same memo object to another dis() call then allows disassembly
    to proceed across multiple pickles that were all created by the same
    pickler with the same memo. Ordinarily you don't need to worry about this.

    Optional arg 'indentlevel' is the number of blanks by which to indent
    a new MARK level. It defaults to 4.

    Optional arg 'annotate', if nonzero, instructs dis() to add a short
    description of the opcode on each line of disassembled output.
    The value given to 'annotate' must be an integer and is used as a
    hint for the column where annotation should start. The default
    value is 0, meaning no annotations.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    annocol = annotate  # column hint for annotations
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print("%5d:" % pos, end=' ', file=out)

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"):
            if opcode.name == "MEMOIZE":
                memo_idx = len(memo)
                markmsg = "(as %d)" % memo_idx
            else:
                assert arg is not None
                memo_idx = arg
            if memo_idx in memo:
                # Report memo_idx rather than arg: arg is None for MEMOIZE.
                errormsg = "memo key %r already defined" % memo_idx
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[memo_idx] = stack[-1]
        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        if annotate:
            line += ' ' * (annocol - len(line))
            # make a mild effort to align annotations
            annocol = len(line)
            if annocol > 50:
                annocol = annotate
            line += ' ' + opcode.doc.split('\n', 1)[0]
        print(line, file=out)

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print("highest protocol among opcodes =", maxproto, file=out)
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)

# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

_dis_test = r"""
>>> import pickle
>>> x = [1, 2, (3, 4), {b'abc': "def"}]
>>> pkl0 = pickle.dumps(x, 0)
>>> dis(pkl0)
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: L    LONG       1
    9: a    APPEND
   10: L    LONG       2
   14: a    APPEND
   15: (    MARK
   16: L        LONG       3
   20: L        LONG       4
   24: t        TUPLE      (MARK at 15)
   25: p    PUT        1
   28: a    APPEND
   29: (    MARK
   30: d        DICT       (MARK at 29)
   31: p    PUT        2
   34: c    GLOBAL     '_codecs encode'
   50: p    PUT        3
   53: (    MARK
   54: V        UNICODE    'abc'
   59: p        PUT        4
   62: V        UNICODE    'latin1'
   70: p        PUT        5
   73: t        TUPLE      (MARK at 53)
   74: p    PUT        6
   77: R    REDUCE
   78: p    PUT        7
   81: V    UNICODE    'def'
   86: p    PUT        8
   89: s    SETITEM
   90: a    APPEND
   91: .    STOP
highest protocol among opcodes = 0

Try again with a "binary" pickle.

>>> pkl1 = pickle.dumps(x, 1)
>>> dis(pkl1)
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: K        BININT1    1
    6: K        BININT1    2
    8: (        MARK
    9: K            BININT1    3
   11: K            BININT1    4
   13: t            TUPLE      (MARK at 8)
   14: q        BINPUT     1
   16: }        EMPTY_DICT
   17: q        BINPUT     2
   19: c        GLOBAL     '_codecs encode'
   35: q        BINPUT     3
   37: (        MARK
   38: X            BINUNICODE 'abc'
   46: q            BINPUT     4
   48: X            BINUNICODE 'latin1'
   59: q            BINPUT     5
   61: t            TUPLE      (MARK at 37)
   62: q        BINPUT     6
   64: R        REDUCE
   65: q        BINPUT     7
   67: X        BINUNICODE 'def'
   75: q        BINPUT     8
   77: s        SETITEM
   78: e        APPENDS    (MARK at 3)
   79: .    STOP
highest protocol among opcodes = 1

Exercise the INST/OBJ/BUILD family.

>>> import pickletools
>>> dis(pickle.dumps(pickletools.dis, 0))
    0: c    GLOBAL     'pickletools dis'
   17: p    PUT        0
   20: .    STOP
highest protocol among opcodes = 0

>>> from pickletools import _Example
>>> x = [_Example(42)] * 2
>>> dis(pickle.dumps(x, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: c    GLOBAL     'copy_reg _reconstructor'
   30: p    PUT        1
   33: (    MARK
   34: c        GLOBAL     'pickletools _Example'
   56: p        PUT        2
   59: c        GLOBAL     '__builtin__ object'
   79: p        PUT        3
   82: N        NONE
   83: t        TUPLE      (MARK at 33)
   84: p    PUT        4
   87: R    REDUCE
   88: p    PUT        5
   91: (    MARK
   92: d        DICT       (MARK at 91)
   93: p    PUT        6
   96: V    UNICODE    'value'
  103: p    PUT        7
  106: L    LONG       42
  111: s    SETITEM
  112: b    BUILD
  113: a    APPEND
  114: g    GET        5
  117: a    APPEND
  118: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: c        GLOBAL     'copy_reg _reconstructor'
   29: q        BINPUT     1
   31: (        MARK
   32: c            GLOBAL     'pickletools _Example'
   54: q            BINPUT     2
   56: c            GLOBAL     '__builtin__ object'
   76: q            BINPUT     3
   78: N            NONE
   79: t            TUPLE      (MARK at 31)
   80: q        BINPUT     4
   82: R        REDUCE
   83: q        BINPUT     5
   85: }        EMPTY_DICT
   86: q        BINPUT     6
   88: X        BINUNICODE 'value'
   98: q        BINPUT     7
  100: K        BININT1    42
  102: s        SETITEM
  103: b        BUILD
  104: h        BINGET     5
  106: e        APPENDS    (MARK at 3)
  107: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2

Try protocol 3 with annotations:

>>> dis(pickle.dumps(T, 3), annotate=1)
    0: \x80 PROTO      3 Protocol version indicator.
    2: ]    EMPTY_LIST   Push an empty list.
    3: q    BINPUT     0 Store the stack top into the memo.  The stack is not popped.
    5: h    BINGET     0 Read an object from the memo and push it on the stack.
    7: \x85 TUPLE1       Build a one-tuple out of the topmost item on the stack.
    8: q    BINPUT     1 Store the stack top into the memo.  The stack is not popped.
   10: a    APPEND       Append an object to a list.
   11: 0    POP          Discard the top stack item, shrinking the stack by one item.
   12: h    BINGET     1 Read an object from the memo and push it on the stack.
   14: .    STOP         Stop the unpickling machine.
highest protocol among opcodes = 2

"""

_memo_test = r"""
>>> import pickle
>>> import io
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
0
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(
        description='disassemble one or more pickle files')
    parser.add_argument(
        'pickle_file', type=argparse.FileType('br'),
        nargs='*', help='the pickle file')
    parser.add_argument(
        '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
        help='the file where the output should be written')
    parser.add_argument(
        '-m', '--memo', action='store_true',
        help='preserve memo between disassemblies')
    parser.add_argument(
        '-l', '--indentlevel', default=4, type=int,
        help='the number of blanks by which to indent a new MARK level')
    parser.add_argument(
        '-a', '--annotate', action='store_true',
        help='annotate each line with a short opcode description')
    parser.add_argument(
        '-p', '--preamble', default="==> {name} <==",
        help='if more than one pickle file is specified, print this before'
        ' each disassembly')
    parser.add_argument(
        '-t', '--test', action='store_true',
        help='run self-test suite')
    parser.add_argument(
        '-v', action='store_true',
        help='run verbosely; only affects self-test run')
    args = parser.parse_args()
    if args.test:
        _test()
    else:
        annotate = 30 if args.annotate else 0
        if not args.pickle_file:
            parser.print_help()
        elif len(args.pickle_file) == 1:
            dis(args.pickle_file[0], args.output, None,
                args.indentlevel, annotate)
        else:
            memo = {} if args.memo else None
            for f in args.pickle_file:
                preamble = args.preamble.format(name=f.name)
                args.output.write(preamble + '\n')
                dis(f, args.output, memo, args.indentlevel, annotate)