'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here.  Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

import codecs
import io
import pickle
import re
import sys

__all__ = ['dis', 'genops', 'optimize']

bytes_types = pickle.bytes_types

# Other ideas:
#
# - A pickle verifier:  read a pickle and check it exhaustively for
#   well-formedness.  dis() does a lot of this already.
#
# - A protocol identifier:  examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer:  for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive.  Or lots of times a PUT is generated that's never accessed
#   by a later GET.


# "A pickle" is a program for a virtual pickle machine (PM, but more accurately
# called an unpickling machine).  It's a sequence of opcodes, interpreted by the
# PM, building an arbitrarily complex Python object.
#
# For the most part, the PM is very simple:  there are no looping, testing, or
# conditional instructions, no arithmetic and no function calls.  Opcodes are
# executed once each, from first to last, until a STOP opcode is reached.
#
# The PM has two data areas, "the stack" and "the memo".
#
# Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
# integer object on the stack, whose value is gotten from a decimal string
# literal immediately following the INT opcode in the pickle bytestream.  Other
# opcodes take Python objects off the stack.  The result of unpickling is
# whatever object is left on the stack when the final STOP opcode is executed.
#
# The memo is simply an array of objects, or it can be implemented as a dict
# mapping little integers to objects.  The memo serves as the PM's "long term
# memory", and the little integers indexing the memo are akin to variable
# names.  Some opcodes copy the object on top of the stack into the memo at a
# given index (the stack is not popped), and others push a memo object at a
# given index onto the stack again.
#
# At heart, that's all the PM has.  Subtleties arise for these reasons:
#
# + Object identity.  Objects can be arbitrarily complex, and subobjects
#   may be shared (for example, the list [a, a] refers to the same object a
#   twice).  It can be vital that unpickling recreate an isomorphic object
#   graph, faithfully reproducing sharing.
#
# + Recursive objects.  For example, after "L = []; L.append(L)", L is a
#   list, and L[0] is the same list.  This is related to the object identity
#   point, and some sequences of pickle opcodes are subtle in order to
#   get the right result in all cases.
#
# + Things pickle doesn't know everything about.  Examples of things pickle
#   does know everything about are Python's builtin scalar and container
#   types, like ints and tuples.  They generally have opcodes dedicated to
#   them.  For things like module references and instances of user-defined
#   classes, pickle's knowledge is limited.  Historically, many enhancements
#   have been made to the pickle protocol in order to do a better (faster,
#   and/or more compact) job on those.
#
# + Backward compatibility and micro-optimization.  As explained below,
#   pickle opcodes never go away, not even when better ways to do a thing
#   get invented.  The repertoire of the PM just keeps growing over time.
#   For example, protocol 0 had two opcodes for building Python integers (INT
#   and LONG), protocol 1 added three more for more-efficient pickling of short
#   integers, and protocol 2 added two more for more-efficient pickling of
#   long integers (before protocol 2, the only ways to pickle a Python long
#   took time quadratic in the number of digits, for both pickling and
#   unpickling).  "Opcode bloat" isn't so much a subtlety as a source of
#   wearying complication.
#
#
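# The stack-and-memo description above can be observed on a real pickle.
# This is a minimal sketch, assuming it is run from outside this module
# against the standard library's pickle and pickletools; `shared` is an
# illustrative name, not something defined in this file.

```python
import pickle
import pickletools

# One inner list referenced twice, like the [a, a] case described above.
shared = [1, 2]
data = pickle.dumps([shared, shared], protocol=2)

# genops() yields (opcode, arg, position) triples; keep only the opcode names.
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# The inner list is stored in the memo once (BINPUT) and pushed back from the
# memo for the second reference (BINGET), so sharing survives the round trip.
restored = pickle.loads(data)
print(names)
print(restored[0] is restored[1])
```

# The opcode stream ends with STOP, and the object left on the stack at that
# point is the unpickling result.
#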
# Pickle protocols:
#
# For compatibility, the meaning of a pickle opcode never changes.  Instead new
# pickle opcodes get added, and each version's unpickler can handle all the
# pickle opcodes in all protocol versions to date.  So old pickles continue to
# be readable forever.  The pickler can generally be told to restrict itself to
# the subset of opcodes available under previous protocol versions too, so that
# users can create pickles under the current version readable by older
# versions.  However, a protocol 0 or 1 pickle does not contain its version
# number embedded within it.  If an older unpickler tries to read a pickle
# using a later protocol, the result is most likely an exception due to
# seeing an unknown (in the older unpickler) opcode.
#
# The original pickle used what's now called "protocol 0", and what was called
# "text mode" before Python 2.3.  The entire pickle bytestream is made up of
# printable 7-bit ASCII characters, plus the newline character, in protocol 0.
# That's why it was called text mode.  Protocol 0 is small and elegant, but
# sometimes painfully inefficient.
#
# The second major set of additions is now called "protocol 1", and was called
# "binary mode" before Python 2.3.  This added many opcodes with arguments
# consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
# bytes.  Binary mode pickles can be substantially smaller than equivalent
# text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
# int as 4 bytes following the opcode, which is cheaper to unpickle than the
# (perhaps) 11-character decimal string attached to INT.  Protocol 1 also added
# a number of opcodes that operate on many stack elements at once (like APPENDS
# and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
#
# The third major set of additions came in Python 2.3, and is called "protocol
# 2".  This added:
#
# - A better way to pickle instances of new-style classes (NEWOBJ).
#
# - A way for a pickle to identify its protocol (PROTO).
#
# - Time- and space-efficient pickling of long ints (LONG{1,4}).
#
# - Shortcuts for small tuples (TUPLE{1,2,3}).
#
# - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
#
# - The "extension registry", a vector of popular objects that can be pushed
#   efficiently by index (EXT{1,2,4}).  This is akin to the memo and GET, but
#   the registry contents are predefined (there's nothing akin to the memo's
#   PUT).
#
# Another independent change with Python 2.3 is the abandonment of any
# pretense that it might be safe to load pickles received from untrusted
# parties -- no sufficient security analysis has been done to guarantee
# this and there isn't a use case that warrants the expense of such an
# analysis.
#
# To this end, all tests for __safe_for_unpickling__ or for
# copyreg.safe_constructors are removed from the unpickling code.
# References to these variables in the descriptions below are to be seen
# as describing unpickling in Python 2.2 and before.


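# A short sketch of the protocol differences described above, assuming it is
# run from outside this module.  The sample `value` and the helper name
# `highest_proto` are illustrative only.  It checks that protocol 0 output is
# 7-bit ASCII and that a protocol 2 pickle opens with a PROTO opcode.

```python
import pickle
import pickletools

value = {"n": 255, "items": (1, 2, 3)}

text_pickle = pickle.dumps(value, protocol=0)
binary_pickle = pickle.dumps(value, protocol=2)

# Protocol 0 ("text mode") uses only 7-bit ASCII bytes.
is_ascii = all(byte < 0x80 for byte in text_pickle)

# A protocol 2 pickle starts with PROTO, whose argument is the protocol number.
first_opcode, first_arg, first_pos = next(pickletools.genops(binary_pickle))


def highest_proto(data):
    """Return the highest .proto value among the opcodes in a pickle."""
    return max(opcode.proto for opcode, arg, pos in pickletools.genops(data))


print(is_ascii, first_opcode.name, first_arg)
print(highest_proto(text_pickle), highest_proto(binary_pickle))
```

# highest_proto() is the "protocol identifier" idea mentioned near the top of
# this file: the protocol of a pickle is the highest .proto among its opcodes.
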
# Meta-rule:  Descriptions are stored in instances of descriptor objects,
# with plain constructors.  No meta-language is defined from which
# descriptors could be constructed.  If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream.  An argument is of a specific type, described by an instance
# of ArgumentDescriptor.  These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1  = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4  = -3   # num bytes is 4-byte signed little-endian int
TAKEN_FROM_ARGUMENT4U = -4   # num bytes is 4-byte unsigned little-endian int
TAKEN_FROM_ARGUMENT8U = -5   # num bytes is 8-byte unsigned little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4,4U,8U} are negative values for
        # variable-length cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4,
                                             TAKEN_FROM_ARGUMENT4U,
                                             TAKEN_FROM_ARGUMENT8U))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import io
    >>> read_uint1(io.BytesIO(b'\xff'))
    255
    """

    data = f.read(1)
    if data:
        return data[0]
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
            name='uint1',
            n=1,
            reader=read_uint1,
            doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import io
    >>> read_uint2(io.BytesIO(b'\xff\x00'))
    255
    >>> read_uint2(io.BytesIO(b'\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
            name='uint2',
            n=2,
            reader=read_uint2,
            doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import io
    >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
           name='int4',
           n=4,
           reader=read_int4,
           doc="Four-byte signed integer, little-endian, 2's complement.")


def read_uint4(f):
    r"""
    >>> import io
    >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<I", data)[0]
    raise ValueError("not enough data in stream to read uint4")

uint4 = ArgumentDescriptor(
            name='uint4',
            n=4,
            reader=read_uint4,
            doc="Four-byte unsigned integer, little-endian.")


def read_uint8(f):
    r"""
    >>> import io
    >>> read_uint8(io.BytesIO(b'\xff\x00\x00\x00\x00\x00\x00\x00'))
    255
    >>> read_uint8(io.BytesIO(b'\xff' * 8)) == 2**64-1
    True
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack("<Q", data)[0]
    raise ValueError("not enough data in stream to read uint8")

uint8 = ArgumentDescriptor(
            name='uint8',
            n=8,
            reader=read_uint8,
            doc="Eight-byte unsigned integer, little-endian.")


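# The same four bytes decode differently under the signed and unsigned
# readers above.  A minimal sketch, assuming it is run from outside this
# module and that the installed pickletools exposes read_int4 and read_uint4
# as defined here; int.from_bytes is used only as an independent cross-check.

```python
import io
import pickletools

raw = b"\x00\x00\x00\x80"  # little-endian: only the top bit is set

signed = pickletools.read_int4(io.BytesIO(raw))
unsigned = pickletools.read_uint4(io.BytesIO(raw))

# Cross-check both interpretations against int.from_bytes.
signed_check = int.from_bytes(raw, "little", signed=True)
unsigned_check = int.from_bytes(raw, "little", signed=False)

print(signed, unsigned)
```
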
def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import io
    >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(io.BytesIO(b"\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around b''

    >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
    ''

    >>> read_stringnl(io.BytesIO(b"''\n"))
    ''

    >>> read_stringnl(io.BytesIO(b'"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in (b'"', b"'"):
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    if decode:
        data = codecs.escape_decode(data)[0].decode("ascii")
    return data

stringnl = ArgumentDescriptor(
               name='stringnl',
               n=UP_TO_NEWLINE,
               reader=read_stringnl,
               doc="""A newline-terminated string.

                   This is a repr-style string, with embedded escapes, and
                   bracketing quotes.
                   """)

def read_stringnl_noescape(f):
    return read_stringnl(f, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
                        name='stringnl_noescape',
                        n=UP_TO_NEWLINE,
                        reader=read_stringnl_noescape,
                        doc="""A newline-terminated string.

                        This is a str-style string, without embedded escapes,
                        or bracketing quotes.  It should consist solely of
                        printable ASCII characters.
                        """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import io
    >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
                             name='stringnl_noescape_pair',
                             n=UP_TO_NEWLINE,
                             reader=read_stringnl_noescape_pair,
                             doc="""A pair of newline-terminated strings.

                             These are str-style strings, without embedded
                             escapes, or bracketing quotes.  They should
                             consist solely of printable ASCII characters.
                             The pair is returned as a single string, with
                             a single blank separating the two strings.
                             """)


def read_string1(f):
    r"""
    >>> import io
    >>> read_string1(io.BytesIO(b"\x00"))
    ''
    >>> read_string1(io.BytesIO(b"\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
              name="string1",
              n=TAKEN_FROM_ARGUMENT1,
              reader=read_string1,
              doc="""A counted string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes in the string, and the second argument is that many
              bytes.
              """)


def read_string4(f):
    r"""
    >>> import io
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
              name="string4",
              n=TAKEN_FROM_ARGUMENT4,
              reader=read_string4,
              doc="""A counted string.

              The first argument is a 4-byte little-endian signed int giving
              the number of bytes in the string, and the second argument is
              that many bytes.
              """)


def read_bytes1(f):
    r"""
    >>> import io
    >>> read_bytes1(io.BytesIO(b"\x00"))
    b''
    >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
    b'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
                     (n, len(data)))

bytes1 = ArgumentDescriptor(
              name="bytes1",
              n=TAKEN_FROM_ARGUMENT1,
              reader=read_bytes1,
              doc="""A counted bytes string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes in the string, and the second argument is that many
              bytes.
              """)


def read_bytes4(f):
    r"""
    >>> import io
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    b'abc'
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
    """

    n = read_uint4(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
                     (n, len(data)))

bytes4 = ArgumentDescriptor(
              name="bytes4",
              n=TAKEN_FROM_ARGUMENT4U,
              reader=read_bytes4,
              doc="""A counted bytes string.

              The first argument is a 4-byte little-endian unsigned int giving
              the number of bytes, and the second argument is that many bytes.
              """)


def read_bytes8(f):
    r"""
    >>> import io, struct, sys
    >>> read_bytes8(io.BytesIO(b"\x00\x00\x00\x00\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes8(io.BytesIO(b"\x03\x00\x00\x00\x00\x00\x00\x00abcdef"))
    b'abc'
    >>> bigsize8 = struct.pack("<Q", sys.maxsize//3)
    >>> read_bytes8(io.BytesIO(bigsize8 + b"abcdef"))  #doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError: expected ... bytes in a bytes8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytes8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes8, but only %d remain" %
                     (n, len(data)))

bytes8 = ArgumentDescriptor(
              name="bytes8",
              n=TAKEN_FROM_ARGUMENT8U,
              reader=read_bytes8,
              doc="""A counted bytes string.

              The first argument is an 8-byte little-endian unsigned int giving
              the number of bytes, and the second argument is that many bytes.
              """)

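# The counted formats above are length-prefixed payloads.  A minimal sketch,
# assuming it is run from outside this module and that the installed
# pickletools exposes read_bytes4 as defined here; `payload` and the trailing
# bytes are illustrative.

```python
import io
import struct
import pickletools

payload = b"abc"

# A 4-byte little-endian unsigned count, then the payload, then unrelated data.
stream = io.BytesIO(struct.pack("<I", len(payload)) + payload + b"trailing")

value = pickletools.read_bytes4(stream)

# The reader consumes exactly 4 + len(payload) bytes and leaves the rest.
rest = stream.read()
print(value, rest)
```
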
def read_unicodestringnl(f):
    r"""
    >>> import io
    >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
    True
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return str(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
                      name='unicodestringnl',
                      n=UP_TO_NEWLINE,
                      reader=read_unicodestringnl,
                      doc="""A newline-terminated Unicode string.

                      This is raw-unicode-escape encoded, so consists of
                      printable ASCII characters, and may contain embedded
                      escape sequences.
                      """)


def read_unicodestring1(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc)])  # 1-byte length
    >>> t = read_unicodestring1(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring1(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring1, but only 6 remain
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring1, but only %d "
                     "remain" % (n, len(data)))

unicodestring1 = ArgumentDescriptor(
                    name="unicodestring1",
                    n=TAKEN_FROM_ARGUMENT1,
                    reader=read_unicodestring1,
                    doc="""A counted Unicode string.

                    The first argument is a 1-byte unsigned int giving the
                    number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


def read_unicodestring4(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
    >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_uint4(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
                    name="unicodestring4",
                    n=TAKEN_FROM_ARGUMENT4U,
                    reader=read_unicodestring4,
                    doc="""A counted Unicode string.

                    The first argument is a 4-byte little-endian unsigned int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


def read_unicodestring8(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc)]) + b'\0' * 7  # little-endian 8-byte length
    >>> t = read_unicodestring8(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring8(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("unicodestring8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring8, but only %d "
                     "remain" % (n, len(data)))

unicodestring8 = ArgumentDescriptor(
                    name="unicodestring8",
                    n=TAKEN_FROM_ARGUMENT8U,
                    reader=read_unicodestring8,
                    doc="""A counted Unicode string.

                    The first argument is an 8-byte little-endian unsigned int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


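# The counted Unicode readers above decode with the 'surrogatepass' error
# handler, so a lone surrogate survives where strict UTF-8 would refuse it.
# A minimal sketch, assuming it is run from outside this module and that the
# installed pickletools exposes read_unicodestring1 as defined here; `lone`
# is an illustrative value.

```python
import io
import pickletools

lone = "\ud800"  # a lone surrogate, not encodable with strict UTF-8

# Strict encoding refuses the lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogatepass it encodes, and the reader decodes it back unchanged.
enc = lone.encode("utf-8", "surrogatepass")
stream = io.BytesIO(bytes([len(enc)]) + enc)
decoded = pickletools.read_unicodestring1(stream)
print(strict_ok, enc, decoded == lone)
```
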
def read_decimalnl_short(f):
    r"""
    >>> import io
    >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
    1234

    >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: invalid literal for int() with base 10: b'1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)

    # There's a hack for True and False here.
    if s == b"00":
        return False
    elif s == b"01":
        return True

    return int(s)

def read_decimalnl_long(f):
    r"""
    >>> import io

    >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
    1234

    >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
    123456789012345678901234
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s[-1:] == b'L':
        s = s[:-1]
    return int(s)


decimalnl_short = ArgumentDescriptor(
                      name='decimalnl_short',
                      n=UP_TO_NEWLINE,
                      reader=read_decimalnl_short,
                      doc="""A newline-terminated decimal integer literal.

                          This never has a trailing 'L', and the integer fit
                          in a short Python int on the box where the pickle
                          was written -- but there's no guarantee it will fit
                          in a short Python int on the box where the pickle
                          is read.
                          """)

decimalnl_long = ArgumentDescriptor(
                     name='decimalnl_long',
                     n=UP_TO_NEWLINE,
                     reader=read_decimalnl_long,
                     doc="""A newline-terminated decimal integer literal.

                         This has a trailing 'L', and can represent integers
                         of any size.
                         """)


def read_floatnl(f):
    r"""
    >>> import io
    >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
              name='floatnl',
              n=UP_TO_NEWLINE,
              reader=read_floatnl,
              doc="""A newline-terminated decimal floating literal.

              In general this requires 17 significant digits for roundtrip
              identity, and pickling then unpickling infinities, NaNs, and
              minus zero doesn't work across boxes, or on some boxes even
              on itself (e.g., Windows can't read the strings it produces
              for infinities or NaNs).
              """)

def read_float8(f):
    r"""
    >>> import io, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(io.BytesIO(raw + b"\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
             name='float8',
             n=8,
             reader=read_float8,
             doc="""An 8-byte binary representation of a float, big-endian.

             The format is unique to Python, and shared with the struct
             module (format string '>d') "in theory" (the struct and pickle
             implementations don't share the code -- they should).  It's
             strongly related to the IEEE-754 double format, and, in normal
             cases, is in fact identical to the big-endian 754 double format.
             On other boxes the dynamic range is limited to that of a 754
             double, and "add a half and chop" rounding is used to reduce
             the precision to 53 bits.  However, even on a 754 box,
             infinities, NaNs, and minus zero may not be handled correctly
             (may not survive roundtrip pickling intact).
             """)

863# Protocol 2 formats
864
865from pickle import decode_long
866
867def read_long1(f):
868    r"""
869    >>> import io
870    >>> read_long1(io.BytesIO(b"\x00"))
871    0
872    >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
873    255
874    >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
875    32767
876    >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
877    -256
878    >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
879    -32768
880    """
881
882    n = read_uint1(f)
883    data = f.read(n)
884    if len(data) != n:
885        raise ValueError("not enough data in stream to read long1")
886    return decode_long(data)
887
888long1 = ArgumentDescriptor(
889    name="long1",
890    n=TAKEN_FROM_ARGUMENT1,
891    reader=read_long1,
892    doc="""A binary long, little-endian, using 1-byte size.
893
894    This first reads one byte as an unsigned size, then reads that
895    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the int 0.
897    """)
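
# A minimal sketch (not part of this module) of the long1 decoding rule
# described above, written with int.from_bytes instead of pickle.decode_long.
# The helper name decode_long1 is hypothetical; the last two lines assume
# that protocol 2 writes 2**40 as PROTO, LONG1, size byte, payload, STOP.

```python
import io
import pickle


def decode_long1(stream):
    """Decode a long1 argument from a binary stream (hypothetical helper)."""
    size = stream.read(1)
    if len(size) != 1:
        raise ValueError("missing long1 size byte")
    data = stream.read(size[0])
    if len(data) != size[0]:
        raise ValueError("not enough data in stream to read long1")
    # An empty payload decodes to 0, which matches the size-0 shortcut.
    return int.from_bytes(data, "little", signed=True)


print(decode_long1(io.BytesIO(b"\x02\x00\x80")))
pickled = pickle.dumps(2**40, protocol=2)
# Skip the 2-byte PROTO header and the 1-byte LONG1 opcode.
print(decode_long1(io.BytesIO(pickled[3:])) == 2**40)
```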
898
899def read_long4(f):
900    r"""
901    >>> import io
902    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
903    255
904    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
905    32767
906    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
907    -256
908    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
909    -32768
    >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
911    0
912    """
913
914    n = read_int4(f)
915    if n < 0:
916        raise ValueError("long4 byte count < 0: %d" % n)
917    data = f.read(n)
918    if len(data) != n:
919        raise ValueError("not enough data in stream to read long4")
920    return decode_long(data)
921
922long4 = ArgumentDescriptor(
923    name="long4",
924    n=TAKEN_FROM_ARGUMENT4,
925    reader=read_long4,
926    doc="""A binary representation of a long, little-endian.
927
928    This first reads four bytes as a signed size (but requires the
929    size to be >= 0), then reads that many bytes and interprets them
930    as a little-endian 2's-complement long.  If the size is 0, that's taken
931    as a shortcut for the int 0, although LONG1 should really be used
932    then instead (and in any case where # of bytes < 256).
933    """)
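
# A minimal sketch (not part of this module) of the choice between the two
# binary long encodings: LONG1 when the payload is shorter than 256 bytes,
# LONG4 otherwise.  The helper name encode_long_op is hypothetical; it leans
# on pickle.encode_long for the little-endian 2's-complement payload.

```python
import pickle
import struct


def encode_long_op(value):
    """Return opcode + size + payload for an int (hypothetical helper)."""
    payload = pickle.encode_long(value)
    if len(payload) < 256:
        # LONG1: one unsigned size byte.
        return b"\x8a" + bytes([len(payload)]) + payload
    # LONG4: four-byte little-endian signed size.
    return b"\x8b" + struct.pack("<i", len(payload)) + payload


small = encode_long_op(2**40)
large = encode_long_op(1 << 4000)
# Wrap each encoding in a PROTO 2 header and a STOP opcode, then unpickle it.
print(pickle.loads(b"\x80\x02" + small + b".") == 2**40)
print(pickle.loads(b"\x80\x02" + large + b".") == 1 << 4000)
```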
934
935
936##############################################################################
937# Object descriptors.  The stack used by the pickle machine holds objects,
938# and in the stack_before and stack_after attributes of OpcodeInfo
939# descriptors we need names to describe the various types of objects that can
940# appear on the stack.
941
942class StackObject(object):
943    __slots__ = (
944        # name of descriptor record, for info only
945        'name',
946
947        # type of object, or tuple of type objects (meaning the object can
948        # be of any type in the tuple)
949        'obtype',
950
951        # human-readable docs for this kind of stack object; a string
952        'doc',
953    )
954
955    def __init__(self, name, obtype, doc):
956        assert isinstance(name, str)
957        self.name = name
958
959        assert isinstance(obtype, type) or isinstance(obtype, tuple)
960        if isinstance(obtype, tuple):
961            for contained in obtype:
962                assert isinstance(contained, type)
963        self.obtype = obtype
964
965        assert isinstance(doc, str)
966        self.doc = doc
967
968    def __repr__(self):
969        return self.name
970
971
972pyint = pylong = StackObject(
973    name='int',
974    obtype=int,
975    doc="A Python integer object.")
976
977pyinteger_or_bool = StackObject(
978    name='int_or_bool',
979    obtype=(int, bool),
980    doc="A Python integer or boolean object.")
981
982pybool = StackObject(
983    name='bool',
984    obtype=bool,
985    doc="A Python boolean object.")
986
987pyfloat = StackObject(
988    name='float',
989    obtype=float,
990    doc="A Python float object.")
991
992pybytes_or_str = pystring = StackObject(
993    name='bytes_or_str',
994    obtype=(bytes, str),
995    doc="A Python bytes or (Unicode) string object.")
996
997pybytes = StackObject(
998    name='bytes',
999    obtype=bytes,
1000    doc="A Python bytes object.")
1001
1002pyunicode = StackObject(
1003    name='str',
1004    obtype=str,
1005    doc="A Python (Unicode) string object.")
1006
1007pynone = StackObject(
1008    name="None",
1009    obtype=type(None),
1010    doc="The Python None object.")
1011
1012pytuple = StackObject(
1013    name="tuple",
1014    obtype=tuple,
1015    doc="A Python tuple object.")
1016
1017pylist = StackObject(
1018    name="list",
1019    obtype=list,
1020    doc="A Python list object.")
1021
1022pydict = StackObject(
1023    name="dict",
1024    obtype=dict,
1025    doc="A Python dict object.")
1026
1027pyset = StackObject(
1028    name="set",
1029    obtype=set,
1030    doc="A Python set object.")
1031
1032pyfrozenset = StackObject(
1033    name="frozenset",
    obtype=frozenset,
1035    doc="A Python frozenset object.")
1036
1037anyobject = StackObject(
1038    name='any',
1039    obtype=object,
1040    doc="Any kind of object whatsoever.")
1041
1042markobject = StackObject(
1043    name="mark",
1044    obtype=StackObject,
1045    doc="""'The mark' is a unique object.
1046
1047Opcodes that operate on a variable number of objects
1048generally don't embed the count of objects in the opcode,
1049or pull it off the stack.  Instead the MARK opcode is used
1050to push a special marker object on the stack, and then
1051some other opcodes grab all the objects from the top of
1052the stack down to (but not including) the topmost marker
1053object.
1054""")
1055
1056stackslice = StackObject(
1057    name="stackslice",
1058    obtype=StackObject,
1059    doc="""An object representing a contiguous slice of the stack.
1060
1061This is used in conjunction with markobject, to represent all
1062of the stack following the topmost markobject.  For example,
1063the POP_MARK opcode changes the stack from
1064
1065    [..., markobject, stackslice]
1066to
1067    [...]
1068
No matter how many objects are on the stack after the topmost
1070markobject, POP_MARK gets rid of all of them (including the
1071topmost markobject too).
1072""")
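
# A minimal sketch (not part of this module) of how a mark delimits a stack
# slice.  MARK and pop_to_mark are hypothetical stand-ins for the pickle
# machine's mark object and its "grab everything above the topmost mark"
# step; a plain Python list plays the role of the stack.

```python
MARK = object()  # hypothetical stand-in for the unique mark object


def pop_to_mark(stack):
    """Remove the topmost MARK and everything above it; return the items above."""
    for index in range(len(stack) - 1, -1, -1):
        if stack[index] is MARK:
            items = stack[index + 1:]
            del stack[index:]
            return items
    raise ValueError("no MARK on the stack")


# LIST-like behaviour: the slice above the mark becomes one list object.
stack = ["below", MARK, 1, 2, 3, "abc"]
stack.append(pop_to_mark(stack))
print(stack)

# POP_MARK-like behaviour: the mark and the slice are simply discarded.
stack = ["below", MARK, 1, 2]
pop_to_mark(stack)
print(stack)
```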
1073
1074##############################################################################
1075# Descriptors for pickle opcodes.
1076
1077class OpcodeInfo(object):
1078
1079    __slots__ = (
1080        # symbolic name of opcode; a string
1081        'name',
1082
1083        # the code used in a bytestream to represent the opcode; a
1084        # one-character string
1085        'code',
1086
1087        # If the opcode has an argument embedded in the byte string, an
1088        # instance of ArgumentDescriptor specifying its type.  Note that
1089        # arg.reader(s) can be used to read and decode the argument from
1090        # the bytestream s, and arg.doc documents the format of the raw
1091        # argument bytes.  If the opcode doesn't have an argument embedded
1092        # in the bytestream, arg should be None.
1093        'arg',
1094
1095        # what the stack looks like before this opcode runs; a list
1096        'stack_before',
1097
1098        # what the stack looks like after this opcode runs; a list
1099        'stack_after',
1100
1101        # the protocol number in which this opcode was introduced; an int
1102        'proto',
1103
1104        # human-readable docs for this opcode; a string
1105        'doc',
1106    )
1107
1108    def __init__(self, name, code, arg,
1109                 stack_before, stack_after, proto, doc):
1110        assert isinstance(name, str)
1111        self.name = name
1112
1113        assert isinstance(code, str)
1114        assert len(code) == 1
1115        self.code = code
1116
1117        assert arg is None or isinstance(arg, ArgumentDescriptor)
1118        self.arg = arg
1119
1120        assert isinstance(stack_before, list)
1121        for x in stack_before:
1122            assert isinstance(x, StackObject)
1123        self.stack_before = stack_before
1124
1125        assert isinstance(stack_after, list)
1126        for x in stack_after:
1127            assert isinstance(x, StackObject)
1128        self.stack_after = stack_after
1129
1130        assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
1131        self.proto = proto
1132
1133        assert isinstance(doc, str)
1134        self.doc = doc
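
# A minimal sketch of reading these descriptor fields from outside the
# module, assuming the installed standard-library pickletools exposes
# genops() as described in the module docstring.  The variable names are
# illustrative only.

```python
import pickle
import pickletools

data = pickle.dumps([1, 2], protocol=2)

# genops yields (OpcodeInfo, decoded argument, byte position) triples.
rows = [(pos, op.name, op.proto, arg) for op, arg, pos in pickletools.genops(data)]
for row in rows:
    print(row)

# The pickle's protocol is the highest .proto among the opcodes it uses.
highest = max(proto for _, _, proto, _ in rows)
print("highest opcode protocol:", highest)
```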
1135
1136I = OpcodeInfo
1137opcodes = [
1138
1139    # Ways to spell integers.
1140
1141    I(name='INT',
1142      code='I',
1143      arg=decimalnl_short,
1144      stack_before=[],
1145      stack_after=[pyinteger_or_bool],
1146      proto=0,
1147      doc="""Push an integer or bool.
1148
1149      The argument is a newline-terminated decimal literal string.
1150
1151      The intent may have been that this always fit in a short Python int,
1152      but INT can be generated in pickles written on a 64-bit box that
1153      require a Python long on a 32-bit box.  The difference between this
1154      and LONG then is that INT skips a trailing 'L', and produces a short
1155      int whenever possible.
1156
      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0.  For compatibility in both directions,
      True gets pickled as INT + "01\\n", and False as INT + "00\\n".
1161      Leading zeroes are never produced for a genuine integer.  The 2.3
1162      (and later) unpicklers special-case these and return bool instead;
1163      earlier unpicklers ignore the leading "0" and return the int.
1164      """),
1165
1166    I(name='BININT',
1167      code='J',
1168      arg=int4,
1169      stack_before=[],
1170      stack_after=[pyint],
1171      proto=1,
1172      doc="""Push a four-byte signed integer.
1173
1174      This handles the full range of Python (short) integers on a 32-bit
1175      box, directly as binary bytes (1 for the opcode and 4 for the integer).
1176      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
1177      BININT1 or BININT2 saves space.
1178      """),
1179
1180    I(name='BININT1',
1181      code='K',
1182      arg=uint1,
1183      stack_before=[],
1184      stack_after=[pyint],
1185      proto=1,
1186      doc="""Push a one-byte unsigned integer.
1187
1188      This is a space optimization for pickling very small non-negative ints,
1189      in range(256).
1190      """),
1191
1192    I(name='BININT2',
1193      code='M',
1194      arg=uint2,
1195      stack_before=[],
1196      stack_after=[pyint],
1197      proto=1,
1198      doc="""Push a two-byte unsigned integer.
1199
1200      This is a space optimization for pickling small positive ints, in
1201      range(256, 2**16).  Integers in range(256) can also be pickled via
1202      BININT2, but BININT1 instead saves a byte.
1203      """),
1204
1205    I(name='LONG',
1206      code='L',
1207      arg=decimalnl_long,
1208      stack_before=[],
1209      stack_after=[pyint],
1210      proto=0,
1211      doc="""Push a long integer.
1212
1213      The same as INT, except that the literal ends with 'L', and always
1214      unpickles to a Python long.  There doesn't seem a real purpose to the
1215      trailing 'L'.
1216
1217      Note that LONG takes time quadratic in the number of digits when
1218      unpickling (this is simply due to the nature of decimal->binary
1219      conversion).  Proto 2 added linear-time (in C; still quadratic-time
1220      in Python) LONG1 and LONG4 opcodes.
1221      """),
1222
1223    I(name="LONG1",
1224      code='\x8a',
1225      arg=long1,
1226      stack_before=[],
1227      stack_after=[pyint],
1228      proto=2,
1229      doc="""Long integer using one-byte length.
1230
1231      A more efficient encoding of a Python long; the long1 encoding
1232      says it all."""),
1233
1234    I(name="LONG4",
1235      code='\x8b',
1236      arg=long4,
1237      stack_before=[],
1238      stack_after=[pyint],
1239      proto=2,
      doc="""Long integer using four-byte length.
1241
1242      A more efficient encoding of a Python long; the long4 encoding
1243      says it all."""),
1244
1245    # Ways to spell strings (8-bit, not Unicode).
1246
1247    I(name='STRING',
1248      code='S',
1249      arg=stringnl,
1250      stack_before=[],
1251      stack_after=[pybytes_or_str],
1252      proto=0,
1253      doc="""Push a Python string object.
1254
1255      The argument is a repr-style string, with bracketing quote characters,
1256      and perhaps embedded escapes.  The argument extends until the next
1257      newline character.  These are usually decoded into a str instance
      using the encoding given to the Unpickler constructor, or the default,
      'ASCII'.  If the encoding given was 'bytes', however, they will be
      decoded as a bytes object instead.
1261      """),
1262
1263    I(name='BINSTRING',
1264      code='T',
1265      arg=string4,
1266      stack_before=[],
1267      stack_after=[pybytes_or_str],
1268      proto=1,
1269      doc="""Push a Python string object.
1270
1271      There are two arguments: the first is a 4-byte little-endian
1272      signed int giving the number of bytes in the string, and the
1273      second is that many bytes, which are taken literally as the string
1274      content.  These are usually decoded into a str instance using the
      encoding given to the Unpickler constructor, or the default,
      'ASCII'.  If the encoding given was 'bytes', however, they will be
      decoded as a bytes object instead.
1278      """),
1279
1280    I(name='SHORT_BINSTRING',
1281      code='U',
1282      arg=string1,
1283      stack_before=[],
1284      stack_after=[pybytes_or_str],
1285      proto=1,
1286      doc="""Push a Python string object.
1287
1288      There are two arguments: the first is a 1-byte unsigned int giving
1289      the number of bytes in the string, and the second is that many
1290      bytes, which are taken literally as the string content.  These are
1291      usually decoded into a str instance using the encoding given to
      the Unpickler constructor, or the default, 'ASCII'.  If the
      encoding given was 'bytes', however, they will be decoded as a
      bytes object instead.
1295      """),
1296
    # Bytes (protocol 3 and higher; older protocols don't support bytes at all)
1298
1299    I(name='BINBYTES',
1300      code='B',
1301      arg=bytes4,
1302      stack_before=[],
1303      stack_after=[pybytes],
1304      proto=3,
1305      doc="""Push a Python bytes object.
1306
1307      There are two arguments:  the first is a 4-byte little-endian unsigned int
1308      giving the number of bytes, and the second is that many bytes, which are
1309      taken literally as the bytes content.
1310      """),
1311
1312    I(name='SHORT_BINBYTES',
1313      code='C',
1314      arg=bytes1,
1315      stack_before=[],
1316      stack_after=[pybytes],
1317      proto=3,
1318      doc="""Push a Python bytes object.
1319
1320      There are two arguments:  the first is a 1-byte unsigned int giving
1321      the number of bytes, and the second is that many bytes, which are taken
      literally as the bytes content.
1323      """),
1324
1325    I(name='BINBYTES8',
1326      code='\x8e',
1327      arg=bytes8,
1328      stack_before=[],
1329      stack_after=[pybytes],
1330      proto=4,
1331      doc="""Push a Python bytes object.
1332
      There are two arguments:  the first is an 8-byte little-endian unsigned
      int giving the number of bytes, and the second is that many bytes,
      which are taken literally as the bytes content.
1336      """),
1337
1338    # Ways to spell None.
1339
1340    I(name='NONE',
1341      code='N',
1342      arg=None,
1343      stack_before=[],
1344      stack_after=[pynone],
1345      proto=0,
1346      doc="Push None on the stack."),
1347
1348    # Ways to spell bools, starting with proto 2.  See INT for how this was
1349    # done before proto 2.
1350
1351    I(name='NEWTRUE',
1352      code='\x88',
1353      arg=None,
1354      stack_before=[],
1355      stack_after=[pybool],
1356      proto=2,
1357      doc="""True.
1358
1359      Push True onto the stack."""),
1360
1361    I(name='NEWFALSE',
1362      code='\x89',
1363      arg=None,
1364      stack_before=[],
1365      stack_after=[pybool],
1366      proto=2,
      doc="""False.
1368
1369      Push False onto the stack."""),
1370
1371    # Ways to spell Unicode strings.
1372
1373    I(name='UNICODE',
1374      code='V',
1375      arg=unicodestringnl,
1376      stack_before=[],
1377      stack_after=[pyunicode],
1378      proto=0,  # this may be pure-text, but it's a later addition
1379      doc="""Push a Python Unicode string object.
1380
1381      The argument is a raw-unicode-escape encoding of a Unicode string,
1382      and so may contain embedded escape sequences.  The argument extends
1383      until the next newline character.
1384      """),
1385
1386    I(name='SHORT_BINUNICODE',
1387      code='\x8c',
1388      arg=unicodestring1,
1389      stack_before=[],
1390      stack_after=[pyunicode],
1391      proto=4,
1392      doc="""Push a Python Unicode string object.
1393
      There are two arguments:  the first is a 1-byte unsigned int
1395      giving the number of bytes in the string.  The second is that many
1396      bytes, and is the UTF-8 encoding of the Unicode string.
1397      """),
1398
1399    I(name='BINUNICODE',
1400      code='X',
1401      arg=unicodestring4,
1402      stack_before=[],
1403      stack_after=[pyunicode],
1404      proto=1,
1405      doc="""Push a Python Unicode string object.
1406
1407      There are two arguments:  the first is a 4-byte little-endian unsigned int
1408      giving the number of bytes in the string.  The second is that many
1409      bytes, and is the UTF-8 encoding of the Unicode string.
1410      """),
1411
1412    I(name='BINUNICODE8',
1413      code='\x8d',
1414      arg=unicodestring8,
1415      stack_before=[],
1416      stack_after=[pyunicode],
1417      proto=4,
1418      doc="""Push a Python Unicode string object.
1419
      There are two arguments:  the first is an 8-byte little-endian unsigned int
1421      giving the number of bytes in the string.  The second is that many
1422      bytes, and is the UTF-8 encoding of the Unicode string.
1423      """),
1424
1425    # Ways to spell floats.
1426
1427    I(name='FLOAT',
1428      code='F',
1429      arg=floatnl,
1430      stack_before=[],
1431      stack_after=[pyfloat],
1432      proto=0,
1433      doc="""Newline-terminated decimal float literal.
1434
1435      The argument is repr(a_float), and in general requires 17 significant
1436      digits for roundtrip conversion to be an identity (this is so for
1437      IEEE-754 double precision values, which is what Python float maps to
1438      on most boxes).
1439
1440      In general, FLOAT cannot be used to transport infinities, NaNs, or
1441      minus zero across boxes (or even on a single box, if the platform C
1442      library can't read the strings it produces for such things -- Windows
1443      is like that), but may do less damage than BINFLOAT on boxes with
1444      greater precision or dynamic range than IEEE-754 double.
1445      """),
1446
1447    I(name='BINFLOAT',
1448      code='G',
1449      arg=float8,
1450      stack_before=[],
1451      stack_after=[pyfloat],
1452      proto=1,
1453      doc="""Float stored in binary form, with 8 bytes of data.
1454
1455      This generally requires less than half the space of FLOAT encoding.
1456      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1457      minus zero, raises an exception if the exponent exceeds the range of
1458      an IEEE-754 double, and retains no more than 53 bits of precision (if
1459      there are more than that, "add a half and chop" rounding is used to
1460      cut it back to 53 significant bits).
1461      """),
1462
1463    # Ways to build lists.
1464
1465    I(name='EMPTY_LIST',
1466      code=']',
1467      arg=None,
1468      stack_before=[],
1469      stack_after=[pylist],
1470      proto=1,
1471      doc="Push an empty list."),
1472
1473    I(name='APPEND',
1474      code='a',
1475      arg=None,
1476      stack_before=[pylist, anyobject],
1477      stack_after=[pylist],
1478      proto=0,
1479      doc="""Append an object to a list.
1480
1481      Stack before:  ... pylist anyobject
1482      Stack after:   ... pylist+[anyobject]
1483
1484      although pylist is really extended in-place.
1485      """),
1486
1487    I(name='APPENDS',
1488      code='e',
1489      arg=None,
1490      stack_before=[pylist, markobject, stackslice],
1491      stack_after=[pylist],
1492      proto=1,
1493      doc="""Extend a list by a slice of stack objects.
1494
1495      Stack before:  ... pylist markobject stackslice
1496      Stack after:   ... pylist+stackslice
1497
1498      although pylist is really extended in-place.
1499      """),
1500
1501    I(name='LIST',
1502      code='l',
1503      arg=None,
1504      stack_before=[markobject, stackslice],
1505      stack_after=[pylist],
1506      proto=0,
1507      doc="""Build a list out of the topmost stack slice, after markobject.
1508
1509      All the stack entries following the topmost markobject are placed into
1510      a single Python list, which single list object replaces all of the
1511      stack from the topmost markobject onward.  For example,
1512
1513      Stack before: ... markobject 1 2 3 'abc'
1514      Stack after:  ... [1, 2, 3, 'abc']
1515      """),
1516
1517    # Ways to build tuples.
1518
1519    I(name='EMPTY_TUPLE',
1520      code=')',
1521      arg=None,
1522      stack_before=[],
1523      stack_after=[pytuple],
1524      proto=1,
1525      doc="Push an empty tuple."),
1526
1527    I(name='TUPLE',
1528      code='t',
1529      arg=None,
1530      stack_before=[markobject, stackslice],
1531      stack_after=[pytuple],
1532      proto=0,
1533      doc="""Build a tuple out of the topmost stack slice, after markobject.
1534
1535      All the stack entries following the topmost markobject are placed into
1536      a single Python tuple, which single tuple object replaces all of the
1537      stack from the topmost markobject onward.  For example,
1538
1539      Stack before: ... markobject 1 2 3 'abc'
1540      Stack after:  ... (1, 2, 3, 'abc')
1541      """),
1542
1543    I(name='TUPLE1',
1544      code='\x85',
1545      arg=None,
1546      stack_before=[anyobject],
1547      stack_after=[pytuple],
1548      proto=2,
1549      doc="""Build a one-tuple out of the topmost item on the stack.
1550
1551      This code pops one value off the stack and pushes a tuple of
1552      length 1 whose one item is that value back onto it.  In other
1553      words:
1554
1555          stack[-1] = tuple(stack[-1:])
1556      """),
1557
1558    I(name='TUPLE2',
1559      code='\x86',
1560      arg=None,
1561      stack_before=[anyobject, anyobject],
1562      stack_after=[pytuple],
1563      proto=2,
1564      doc="""Build a two-tuple out of the top two items on the stack.
1565
1566      This code pops two values off the stack and pushes a tuple of
1567      length 2 whose items are those values back onto it.  In other
1568      words:
1569
1570          stack[-2:] = [tuple(stack[-2:])]
1571      """),
1572
1573    I(name='TUPLE3',
1574      code='\x87',
1575      arg=None,
1576      stack_before=[anyobject, anyobject, anyobject],
1577      stack_after=[pytuple],
1578      proto=2,
1579      doc="""Build a three-tuple out of the top three items on the stack.
1580
1581      This code pops three values off the stack and pushes a tuple of
1582      length 3 whose items are those values back onto it.  In other
1583      words:
1584
1585          stack[-3:] = [tuple(stack[-3:])]
1586      """),
1587
1588    # Ways to build dicts.
1589
1590    I(name='EMPTY_DICT',
1591      code='}',
1592      arg=None,
1593      stack_before=[],
1594      stack_after=[pydict],
1595      proto=1,
1596      doc="Push an empty dict."),
1597
1598    I(name='DICT',
1599      code='d',
1600      arg=None,
1601      stack_before=[markobject, stackslice],
1602      stack_after=[pydict],
1603      proto=0,
1604      doc="""Build a dict out of the topmost stack slice, after markobject.
1605
1606      All the stack entries following the topmost markobject are placed into
1607      a single Python dict, which single dict object replaces all of the
1608      stack from the topmost markobject onward.  The stack slice alternates
1609      key, value, key, value, ....  For example,
1610
1611      Stack before: ... markobject 1 2 3 'abc'
1612      Stack after:  ... {1: 2, 3: 'abc'}
1613      """),
1614
1615    I(name='SETITEM',
1616      code='s',
1617      arg=None,
1618      stack_before=[pydict, anyobject, anyobject],
1619      stack_after=[pydict],
1620      proto=0,
1621      doc="""Add a key+value pair to an existing dict.
1622
1623      Stack before:  ... pydict key value
1624      Stack after:   ... pydict
1625
1626      where pydict has been modified via pydict[key] = value.
1627      """),
1628
1629    I(name='SETITEMS',
1630      code='u',
1631      arg=None,
1632      stack_before=[pydict, markobject, stackslice],
1633      stack_after=[pydict],
1634      proto=1,
1635      doc="""Add an arbitrary number of key+value pairs to an existing dict.
1636
1637      The slice of the stack following the topmost markobject is taken as
1638      an alternating sequence of keys and values, added to the dict
1639      immediately under the topmost markobject.  Everything at and after the
1640      topmost markobject is popped, leaving the mutated dict at the top
1641      of the stack.
1642
1643      Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
1644      Stack after:   ... pydict
1645
1646      where pydict has been modified via pydict[key_i] = value_i for i in
1647      1, 2, ..., n, and in that order.
1648      """),
1649
1650    # Ways to build sets
1651
1652    I(name='EMPTY_SET',
1653      code='\x8f',
1654      arg=None,
1655      stack_before=[],
1656      stack_after=[pyset],
1657      proto=4,
1658      doc="Push an empty set."),
1659
1660    I(name='ADDITEMS',
1661      code='\x90',
1662      arg=None,
1663      stack_before=[pyset, markobject, stackslice],
1664      stack_after=[pyset],
1665      proto=4,
1666      doc="""Add an arbitrary number of items to an existing set.
1667
1668      The slice of the stack following the topmost markobject is taken as
1669      a sequence of items, added to the set immediately under the topmost
1670      markobject.  Everything at and after the topmost markobject is popped,
1671      leaving the mutated set at the top of the stack.
1672
1673      Stack before:  ... pyset markobject item_1 ... item_n
1674      Stack after:   ... pyset
1675
      where pyset has been modified via pyset.add(item_i) for i in
1677      1, 2, ..., n, and in that order.
1678      """),
1679
1680    # Way to build frozensets
1681
1682    I(name='FROZENSET',
1683      code='\x91',
1684      arg=None,
1685      stack_before=[markobject, stackslice],
1686      stack_after=[pyfrozenset],
1687      proto=4,
      doc="""Build a frozenset out of the topmost stack slice, after markobject.
1689
1690      All the stack entries following the topmost markobject are placed into
1691      a single Python frozenset, which single frozenset object replaces all
1692      of the stack from the topmost markobject onward.  For example,
1693
1694      Stack before: ... markobject 1 2 3
1695      Stack after:  ... frozenset({1, 2, 3})
1696      """),
1697
1698    # Stack manipulation.
1699
1700    I(name='POP',
1701      code='0',
1702      arg=None,
1703      stack_before=[anyobject],
1704      stack_after=[],
1705      proto=0,
1706      doc="Discard the top stack item, shrinking the stack by one item."),
1707
1708    I(name='DUP',
1709      code='2',
1710      arg=None,
1711      stack_before=[anyobject],
1712      stack_after=[anyobject, anyobject],
1713      proto=0,
1714      doc="Push the top stack item onto the stack again, duplicating it."),
1715
1716    I(name='MARK',
1717      code='(',
1718      arg=None,
1719      stack_before=[],
1720      stack_after=[markobject],
1721      proto=0,
1722      doc="""Push markobject onto the stack.
1723
1724      markobject is a unique object, used by other opcodes to identify a
1725      region of the stack containing a variable number of objects for them
1726      to work on.  See markobject.doc for more detail.
1727      """),
1728
1729    I(name='POP_MARK',
1730      code='1',
1731      arg=None,
1732      stack_before=[markobject, stackslice],
1733      stack_after=[],
1734      proto=1,
1735      doc="""Pop all the stack objects at and above the topmost markobject.
1736
1737      When an opcode using a variable number of stack objects is done,
1738      POP_MARK is used to remove those objects, and to remove the markobject
1739      that delimited their starting position on the stack.
1740      """),
1741
1742    # Memo manipulation.  There are really only two operations (get and put),
1743    # each in all-text, "short binary", and "long binary" flavors.
1744
1745    I(name='GET',
1746      code='g',
1747      arg=decimalnl_short,
1748      stack_before=[],
1749      stack_after=[anyobject],
1750      proto=0,
1751      doc="""Read an object from the memo and push it on the stack.
1752
1753      The index of the memo object to push is given by the newline-terminated
1754      decimal string following.  BINGET and LONG_BINGET are space-optimized
1755      versions.
1756      """),
1757
1758    I(name='BINGET',
1759      code='h',
1760      arg=uint1,
1761      stack_before=[],
1762      stack_after=[anyobject],
1763      proto=1,
1764      doc="""Read an object from the memo and push it on the stack.
1765
1766      The index of the memo object to push is given by the 1-byte unsigned
1767      integer following.
1768      """),
1769
1770    I(name='LONG_BINGET',
1771      code='j',
1772      arg=uint4,
1773      stack_before=[],
1774      stack_after=[anyobject],
1775      proto=1,
1776      doc="""Read an object from the memo and push it on the stack.
1777
1778      The index of the memo object to push is given by the 4-byte unsigned
1779      little-endian integer following.
1780      """),
1781
1782    I(name='PUT',
1783      code='p',
1784      arg=decimalnl_short,
1785      stack_before=[],
1786      stack_after=[],
1787      proto=0,
1788      doc="""Store the stack top into the memo.  The stack is not popped.
1789
1790      The index of the memo location to write into is given by the newline-
1791      terminated decimal string following.  BINPUT and LONG_BINPUT are
1792      space-optimized versions.
1793      """),
1794
1795    I(name='BINPUT',
1796      code='q',
1797      arg=uint1,
1798      stack_before=[],
1799      stack_after=[],
1800      proto=1,
1801      doc="""Store the stack top into the memo.  The stack is not popped.
1802
1803      The index of the memo location to write into is given by the 1-byte
1804      unsigned integer following.
1805      """),
1806
1807    I(name='LONG_BINPUT',
1808      code='r',
1809      arg=uint4,
1810      stack_before=[],
1811      stack_after=[],
1812      proto=1,
1813      doc="""Store the stack top into the memo.  The stack is not popped.
1814
1815      The index of the memo location to write into is given by the 4-byte
1816      unsigned little-endian integer following.
1817      """),
1818
1819    I(name='MEMOIZE',
1820      code='\x94',
1821      arg=None,
1822      stack_before=[anyobject],
1823      stack_after=[anyobject],
1824      proto=4,
1825      doc="""Store the stack top into the memo.  The stack is not popped.
1826
1827      The index of the memo location to write is the number of
1828      elements currently present in the memo.
1829      """),
1830
1831    # Access the extension registry (predefined objects).  Akin to the GET
1832    # family.
1833
1834    I(name='EXT1',
1835      code='\x82',
1836      arg=uint1,
1837      stack_before=[],
1838      stack_after=[anyobject],
1839      proto=2,
1840      doc="""Extension code.
1841
1842      This code and the similar EXT2 and EXT4 allow using a registry
1843      of popular objects that are pickled by name, typically classes.
1844      It is envisioned that through a global negotiation and
1845      registration process, third parties can set up a mapping between
1846      ints and object names.
1847
1848      In order to guarantee pickle interchangeability, the extension
1849      code registry ought to be global, although a range of codes may
1850      be reserved for private use.
1851
1852      EXT1 has a 1-byte integer argument.  This is used to index into the
1853      extension registry, and the object at that index is pushed on the stack.
1854      """),
1855
1856    I(name='EXT2',
1857      code='\x83',
1858      arg=uint2,
1859      stack_before=[],
1860      stack_after=[anyobject],
1861      proto=2,
1862      doc="""Extension code.
1863
1864      See EXT1.  EXT2 has a two-byte integer argument.
1865      """),
1866
1867    I(name='EXT4',
1868      code='\x84',
1869      arg=int4,
1870      stack_before=[],
1871      stack_after=[anyobject],
1872      proto=2,
1873      doc="""Extension code.
1874
1875      See EXT1.  EXT4 has a four-byte integer argument.
1876      """),
1877
1878    # Push a class object, or module function, on the stack, via its module
1879    # and name.
1880
1881    I(name='GLOBAL',
1882      code='c',
1883      arg=stringnl_noescape_pair,
1884      stack_before=[],
1885      stack_after=[anyobject],
1886      proto=0,
1887      doc="""Push a global object (module.attr) on the stack.
1888
1889      Two newline-terminated strings follow the GLOBAL opcode.  The first is
1890      taken as a module name, and the second as a class name.  The class
1891      object module.class is pushed on the stack.  More accurately, the
1892      object returned by self.find_class(module, class) is pushed on the
1893      stack, so unpickling subclasses can override this form of lookup.
1894      """),
1895
1896    I(name='STACK_GLOBAL',
1897      code='\x93',
1898      arg=None,
1899      stack_before=[pyunicode, pyunicode],
1900      stack_after=[anyobject],
1901      proto=4,
1902      doc="""Push a global object (module.attr) on the stack.
1903      """),
1904
1905    # Ways to build objects of classes pickle doesn't know about directly
1906    # (user-defined classes).  I despair of documenting this accurately
1907    # and comprehensibly -- you really have to read the pickle code to
1908    # find all the special cases.
1909
1910    I(name='REDUCE',
1911      code='R',
1912      arg=None,
1913      stack_before=[anyobject, anyobject],
1914      stack_after=[anyobject],
1915      proto=0,
1916      doc="""Push an object built from a callable and an argument tuple.
1917
1918      The opcode is named after the __reduce__() method.
1919
1920      Stack before: ... callable pytuple
1921      Stack after:  ... callable(*pytuple)
1922
1923      The callable and the argument tuple are the first two items returned
1924      by a __reduce__ method.  Applying the callable to the argtuple is
1925      supposed to reproduce the original object, or at least get it started.
1926      If the __reduce__ method returns a 3-tuple, the last component is an
1927      argument to be passed to the object's __setstate__, and then the REDUCE
1928      opcode is followed by code to create setstate's argument, and then a
1929      BUILD opcode to apply __setstate__ to that argument.
1930
1931      If not isinstance(callable, type), REDUCE complains unless the
1932      callable has been registered with the copyreg module's
1933      safe_constructors dict, or the callable has a magic
1934      '__safe_for_unpickling__' attribute with a true value.  I'm not sure
1935      why it does this, but I've sure seen this complaint often enough when
1936      I didn't want to <wink>.
1937      """),
1938
1939    I(name='BUILD',
1940      code='b',
1941      arg=None,
1942      stack_before=[anyobject, anyobject],
1943      stack_after=[anyobject],
1944      proto=0,
1945      doc="""Finish building an object, via __setstate__ or dict update.
1946
1947      Stack before: ... anyobject argument
1948      Stack after:  ... anyobject
1949
1950      where anyobject may have been mutated, as follows:
1951
1952      If the object has a __setstate__ method,
1953
1954          anyobject.__setstate__(argument)
1955
1956      is called.
1957
1958      Else the argument must be a dict, the object must have a __dict__, and
1959      the object is updated via
1960
1961          anyobject.__dict__.update(argument)
1962      """),
1963
1964    I(name='INST',
1965      code='i',
1966      arg=stringnl_noescape_pair,
1967      stack_before=[markobject, stackslice],
1968      stack_after=[anyobject],
1969      proto=0,
1970      doc="""Build a class instance.
1971
1972      This is the protocol 0 version of protocol 1's OBJ opcode.
1973      INST is followed by two newline-terminated strings, giving a
1974      module and class name, just as for the GLOBAL opcode (and see
1975      GLOBAL for more details about that).  self.find_class(module, name)
1976      is used to get a class object.
1977
1978      In addition, all the objects on the stack following the topmost
1979      markobject are gathered into a tuple and popped (along with the
1980      topmost markobject), just as for the TUPLE opcode.
1981
1982      Now it gets complicated.  If all of these are true:
1983
1984        + The argtuple is empty (markobject was at the top of the stack
1985          at the start).
1986
1987        + The class object does not have a __getinitargs__ attribute.
1988
1989      then we want to create an old-style class instance without invoking
1990      its __init__() method (pickle has waffled on this over the years; not
1991      calling __init__() is current wisdom).  In this case, an instance of
1992      an old-style dummy class is created, and then we try to rebind its
1993      __class__ attribute to the desired class object.  If this succeeds,
1994      the new instance object is pushed on the stack, and we're done.
1995
1996      Else (the argtuple is not empty, it's not an old-style class object,
1997      or the class object does have a __getinitargs__ attribute), the code
1998      first insists that the class object have a __safe_for_unpickling__
1999      attribute.  Unlike the __safe_for_unpickling__ check in REDUCE,
2000      it doesn't matter whether this attribute has a true or false value, it
2001      only matters whether it exists (XXX this is a bug).  If
2002      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
2003
2004      Else (the class object does have a __safe_for_unpickling__ attr),
2005      the class object obtained from INST's arguments is applied to the
2006      argtuple obtained from the stack, and the resulting instance object
2007      is pushed on the stack.
2008
2009      NOTE:  checks for __safe_for_unpickling__ went away in Python 2.3.
2010      NOTE:  the distinction between old-style and new-style classes does
2011             not make sense in Python 3.
2012      """),
2013
2014    I(name='OBJ',
2015      code='o',
2016      arg=None,
2017      stack_before=[markobject, anyobject, stackslice],
2018      stack_after=[anyobject],
2019      proto=1,
2020      doc="""Build a class instance.
2021
2022      This is the protocol 1 version of protocol 0's INST opcode, and is
2023      very much like it.  The major difference is that the class object
2024      is taken off the stack, allowing it to be retrieved from the memo
2025      repeatedly if several instances of the same class are created.  This
2026      can be much more efficient (in both time and space) than repeatedly
2027      embedding the module and class names in INST opcodes.
2028
2029      Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
2030      the class object is taken off the stack, immediately above the
2031      topmost markobject:
2032
2033      Stack before: ... markobject classobject stackslice
2034      Stack after:  ... new_instance_object
2035
2036      As for INST, the remainder of the stack above the markobject is
2037      gathered into an argument tuple, and then the logic seems identical,
2038      except that no __safe_for_unpickling__ check is done (XXX this is
2039      a bug).  See INST for the gory details.
2040
2041      NOTE:  In Python 2.3, INST and OBJ are identical except for how they
2042      get the class object.  That was always the intent; the implementations
2043      had diverged for accidental reasons.
2044      """),
2045
2046    I(name='NEWOBJ',
2047      code='\x81',
2048      arg=None,
2049      stack_before=[anyobject, anyobject],
2050      stack_after=[anyobject],
2051      proto=2,
2052      doc="""Build an object instance.
2053
2054      The stack before should be thought of as containing a class
2055      object followed by an argument tuple (the tuple being the stack
2056      top).  Call these cls and args.  They are popped off the stack,
2057      and the value returned by cls.__new__(cls, *args) is pushed back
2058      onto the stack.
2059      """),
2060
2061    I(name='NEWOBJ_EX',
2062      code='\x92',
2063      arg=None,
2064      stack_before=[anyobject, anyobject, anyobject],
2065      stack_after=[anyobject],
2066      proto=4,
2067      doc="""Build an object instance.
2068
2069      The stack before should be thought of as containing a class
2070      object followed by an argument tuple and by a keyword argument dict
2071      (the dict being the stack top).  Call these cls, args and kwargs.
2072      They are popped off the stack, and the value returned by
2073      cls.__new__(cls, *args, **kwargs) is pushed back onto the stack.
2074      """),
2075
2076    # Machine control.
2077
2078    I(name='PROTO',
2079      code='\x80',
2080      arg=uint1,
2081      stack_before=[],
2082      stack_after=[],
2083      proto=2,
2084      doc="""Protocol version indicator.
2085
2086      For protocol 2 and above, a pickle must start with this opcode.
2087      The argument is the protocol version, an int in range(2, 256).
2088      """),
2089
2090    I(name='STOP',
2091      code='.',
2092      arg=None,
2093      stack_before=[anyobject],
2094      stack_after=[],
2095      proto=0,
2096      doc="""Stop the unpickling machine.
2097
2098      Every pickle ends with this opcode.  The object at the top of the stack
2099      is popped, and that's the result of unpickling.  The stack should be
2100      empty then.
2101      """),
2102
2103    # Framing support.
2104
2105    I(name='FRAME',
2106      code='\x95',
2107      arg=uint8,
2108      stack_before=[],
2109      stack_after=[],
2110      proto=4,
2111      doc="""Indicate the beginning of a new frame.
2112
2113      The unpickler may use this opcode to safely prefetch data from its
2114      underlying stream.
2115      """),
2116
2117    # Ways to deal with persistent IDs.
2118
2119    I(name='PERSID',
2120      code='P',
2121      arg=stringnl_noescape,
2122      stack_before=[],
2123      stack_after=[anyobject],
2124      proto=0,
2125      doc="""Push an object identified by a persistent ID.
2126
2127      The pickle module doesn't define what a persistent ID means.  PERSID's
2128      argument is a newline-terminated str-style (no embedded escapes, no
2129      bracketing quote characters) string, which *is* "the persistent ID".
2130      The unpickler passes this string to self.persistent_load().  Whatever
2131      object that returns is pushed on the stack.  There is no implementation
2132      of persistent_load() in Python's unpickler:  it must be supplied by an
2133      unpickler subclass.
2134      """),
2135
2136    I(name='BINPERSID',
2137      code='Q',
2138      arg=None,
2139      stack_before=[anyobject],
2140      stack_after=[anyobject],
2141      proto=1,
2142      doc="""Push an object identified by a persistent ID.
2143
2144      Like PERSID, except the persistent ID is popped off the stack (instead
2145      of being a string embedded in the opcode bytestream).  The persistent
2146      ID is passed to self.persistent_load(), and whatever object that
2147      returns is pushed on the stack.  See PERSID for more detail.
2148      """),
2149]
2150del I
2151
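# Illustrative sketch, not part of the original module: a hand-assembled
# protocol 0 pickle that exercises the GLOBAL, MARK, TUPLE and REDUCE opcodes
# documented in the table above.  The byte string `raw` is made up for this
# example; it assumes the public pickle and pickletools modules are importable.

```python
import pickle
import pickletools

# GLOBAL pushes builtins.complex, MARK ... TUPLE builds the argument tuple
# (1, 2), REDUCE calls complex(1, 2), and STOP returns the result.
raw = b"cbuiltins\ncomplex\n(I1\nI2\ntR."

# Walk the opcode stream without executing it.
names = [op.name for op, arg, pos in pickletools.genops(raw)]

# Actually run the pickle machine.  Only unpickle data you trust.
value = pickle.loads(raw)
print(names)
print(value)
```

# REDUCE applies whatever callable is on the stack, which is why loading an
# untrusted pickle is unsafe.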
2152# Verify uniqueness of .name and .code members.
2153name2i = {}
2154code2i = {}
2155
2156for i, d in enumerate(opcodes):
2157    if d.name in name2i:
2158        raise ValueError("repeated name %r at indices %d and %d" %
2159                         (d.name, name2i[d.name], i))
2160    if d.code in code2i:
2161        raise ValueError("repeated code %r at indices %d and %d" %
2162                         (d.code, code2i[d.code], i))
2163
2164    name2i[d.name] = i
2165    code2i[d.code] = i
2166
2167del name2i, code2i, i, d
2168
2169##############################################################################
2170# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
2171# Also ensure we've got the same stuff as pickle.py, although the
2172# introspection here is dicey.
2173
2174code2op = {}
2175for d in opcodes:
2176    code2op[d.code] = d
2177del d
2178
2179def assure_pickle_consistency(verbose=False):
2180
2181    copy = code2op.copy()
2182    for name in pickle.__all__:
2183        if not re.match("[A-Z][A-Z0-9_]+$", name):
2184            if verbose:
2185                print("skipping %r: it doesn't look like an opcode name" % name)
2186            continue
2187        picklecode = getattr(pickle, name)
2188        if not isinstance(picklecode, bytes) or len(picklecode) != 1:
2189            if verbose:
2190                print(("skipping %r: value %r doesn't look like a pickle "
2191                       "code" % (name, picklecode)))
2192            continue
2193        picklecode = picklecode.decode("latin-1")
2194        if picklecode in copy:
2195            if verbose:
2196                print("checking name %r w/ code %r for consistency" % (
2197                      name, picklecode))
2198            d = copy[picklecode]
2199            if d.name != name:
2200                raise ValueError("for pickle code %r, pickle.py uses name %r "
2201                                 "but we're using name %r" % (picklecode,
2202                                                              name,
2203                                                              d.name))
2204            # Forget this one.  Any left over in copy at the end are a problem
2205            # of a different kind.
2206            del copy[picklecode]
2207        else:
2208            raise ValueError("pickle.py appears to have a pickle opcode with "
2209                             "name %r and code %r, but we don't" %
2210                             (name, picklecode))
2211    if copy:
2212        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
2213        for code, d in copy.items():
2214            msg.append("    name %r with code %r" % (d.name, code))
2215        raise ValueError("\n".join(msg))
2216
2217assure_pickle_consistency()
2218del assure_pickle_consistency
2219
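# Illustrative sketch, not part of the original module: looking up opcode
# records through code2op.  Keys are one-character str values (the opcode byte
# decoded as latin-1), so a raw byte has to be decoded first.  The helper name
# `describe` is hypothetical.

```python
import pickletools

def describe(code: bytes):
    """Return (name, proto, items popped, items pushed) for a 1-byte opcode."""
    info = pickletools.code2op[code.decode("latin-1")]
    return info.name, info.proto, len(info.stack_before), len(info.stack_after)

stop = describe(b".")
memoize = describe(b"\x94")
print(stop)
print(memoize)
```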
2220##############################################################################
2221# A pickle opcode generator.
2222
2223def _genops(data, yield_end_pos=False):
2224    if isinstance(data, bytes_types):
2225        data = io.BytesIO(data)
2226
2227    if hasattr(data, "tell"):
2228        getpos = data.tell
2229    else:
2230        getpos = lambda: None
2231
2232    while True:
2233        pos = getpos()
2234        code = data.read(1)
2235        opcode = code2op.get(code.decode("latin-1"))
2236        if opcode is None:
2237            if code == b"":
2238                raise ValueError("pickle exhausted before seeing STOP")
2239            else:
2240                raise ValueError("at position %s, opcode %r unknown" % (
2241                                 "<unknown>" if pos is None else pos,
2242                                 code))
2243        if opcode.arg is None:
2244            arg = None
2245        else:
2246            arg = opcode.arg.reader(data)
2247        if yield_end_pos:
2248            yield opcode, arg, pos, getpos()
2249        else:
2250            yield opcode, arg, pos
2251        if code == b'.':
2252            assert opcode.name == 'STOP'
2253            break
2254
2255def genops(pickle):
2256    """Generate all the opcodes in a pickle.
2257
2258    'pickle' is a file-like object, or bytes object, containing the pickle.
2259
2260    Each opcode in the pickle is generated, from the current pickle position,
2261    stopping after a STOP opcode is delivered.  A triple is generated for
2262    each opcode:
2263
2264        opcode, arg, pos
2265
2266    opcode is an OpcodeInfo record, describing the current opcode.
2267
2268    If the opcode has an argument embedded in the pickle, arg is its decoded
2269    value, as a Python object.  If the opcode doesn't have an argument, arg
2270    is None.
2271
2272    If the pickle has a tell() method, pos was the value of pickle.tell()
2273    before reading the current opcode.  If the pickle is a bytes object,
2274    it's wrapped in a BytesIO object, and the latter's tell() result is
2275    used.  Else (the pickle doesn't have a tell(), and it's not obvious how
2276    to query its current position) pos is None.
2277    """
2278    return _genops(pickle)
2279
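# Illustrative sketch, not part of the original module: iterating over the
# (opcode, arg, pos) triples that genops() yields for a small protocol 2
# pickle.  The exact opcode sequence depends on the pickler, so the example
# only relies on the first and last opcodes.

```python
import pickle
import pickletools

data = pickle.dumps((1, 2), protocol=2)

# Each triple holds an OpcodeInfo record, the decoded argument (or None),
# and the byte offset at which the opcode starts.
triples = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(data)]

for name, arg, pos in triples:
    print(pos, name, arg)
```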
2280##############################################################################
2281# A pickle optimizer.
2282
2283def optimize(p):
2284    'Optimize a pickle string by removing unused PUT opcodes'
2285    put = 'PUT'
2286    get = 'GET'
2287    oldids = set()          # set of all PUT ids
2288    newids = {}             # maps each id used by a GET opcode to its new id
2289    opcodes = []            # (op, idx) or (pos, end_pos)
2290    proto = 0
2291    protoheader = b''
2292    for opcode, arg, pos, end_pos in _genops(p, yield_end_pos=True):
2293        if 'PUT' in opcode.name:
2294            oldids.add(arg)
2295            opcodes.append((put, arg))
2296        elif opcode.name == 'MEMOIZE':
2297            idx = len(oldids)
2298            oldids.add(idx)
2299            opcodes.append((put, idx))
2300        elif 'FRAME' in opcode.name:
2301            pass
2302        elif 'GET' in opcode.name:
2303            if opcode.proto > proto:
2304                proto = opcode.proto
2305            newids[arg] = None
2306            opcodes.append((get, arg))
2307        elif opcode.name == 'PROTO':
2308            if arg > proto:
2309                proto = arg
2310            if pos == 0:
2311                protoheader = p[pos: end_pos]
2312            else:
2313                opcodes.append((pos, end_pos))
2314        else:
2315            opcodes.append((pos, end_pos))
2316    del oldids
2317
2318    # Copy the opcodes except for PUTS without a corresponding GET
2319    out = io.BytesIO()
2320    # Write the PROTO header before any framing
2321    out.write(protoheader)
2322    pickler = pickle._Pickler(out, proto)
2323    if proto >= 4:
2324        pickler.framer.start_framing()
2325    idx = 0
2326    for op, arg in opcodes:
2327        if op is put:
2328            if arg not in newids:
2329                continue
2330            data = pickler.put(idx)
2331            newids[arg] = idx
2332            idx += 1
2333        elif op is get:
2334            data = pickler.get(newids[arg])
2335        else:
2336            data = p[op:arg]
2337        pickler.framer.commit_frame()
2338        pickler.write(data)
2339    pickler.framer.end_framing()
2340    return out.getvalue()
2341
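# Illustrative sketch, not part of the original module: checking that
# optimize() drops PUT opcodes that no GET refers to, without changing the
# unpickled value.  The helper name `put_count` is hypothetical.

```python
import pickle
import pickletools

def put_count(data):
    """Count PUT-family opcodes (PUT, BINPUT, LONG_BINPUT) in a pickle."""
    return sum(1 for op, arg, pos in pickletools.genops(data)
               if "PUT" in op.name)

obj = [1, 2, (3, 4)]
raw = pickle.dumps(obj, protocol=0)     # memoizes the list and the tuple
small = pickletools.optimize(raw)       # neither memo entry is ever read

print(put_count(raw), put_count(small))
print(len(raw), len(small))
```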
2342##############################################################################
2343# A symbolic pickle disassembler.
2344
2345def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
2346    """Produce a symbolic disassembly of a pickle.
2347
2348    'pickle' is a file-like object, or bytes object, containing a (at least one)
2349    pickle.  The pickle is disassembled from the current position, through
2350    the first STOP opcode encountered.
2351
2352    Optional arg 'out' is a file-like object to which the disassembly is
2353    printed.  It defaults to sys.stdout.
2354
2355    Optional arg 'memo' is a Python dict, used as the pickle's memo.  It
2356    may be mutated by dis(), if the pickle has PUT-family or MEMOIZE opcodes.
2357    Passing the same memo object to another dis() call then allows disassembly
2358    to proceed across multiple pickles that were all created by the same
2359    pickler with the same memo.  Ordinarily you don't need to worry about this.
2360
2361    Optional arg 'indentlevel' is the number of blanks by which to indent
2362    a new MARK level.  It defaults to 4.
2363
2364    Optional arg 'annotate', if nonzero, instructs dis() to add a short
2365    description of the opcode on each line of disassembled output.
2366    The value given to 'annotate' must be an integer and is used as a
2367    hint for the column where annotation should start.  The default
2368    value is 0, meaning no annotations.
2369
2370    In addition to printing the disassembly, some sanity checks are made:
2371
2372    + All embedded opcode arguments "make sense".
2373
2374    + Explicit and implicit pop operations have enough items on the stack.
2375
2376    + When an opcode implicitly refers to a markobject, a markobject is
2377      actually on the stack.
2378
2379    + A memo entry isn't referenced before it's defined.
2380
2381    + The markobject isn't stored in the memo.
2382
2383    + A memo entry isn't redefined.
2384    """
2385
2386    # Most of the hair here is for sanity checks, but most of it is needed
2387    # anyway to detect when a protocol 0 POP takes a MARK off the stack
2388    # (which in turn is needed to indent MARK blocks correctly).
2389
2390    stack = []          # crude emulation of unpickler stack
2391    if memo is None:
2392        memo = {}       # crude emulation of unpickler memo
2393    maxproto = -1       # max protocol number seen
2394    markstack = []      # bytecode positions of MARK opcodes
2395    indentchunk = ' ' * indentlevel
2396    errormsg = None
2397    annocol = annotate  # column hint for annotations
2398    for opcode, arg, pos in genops(pickle):
2399        if pos is not None:
2400            print("%5d:" % pos, end=' ', file=out)
2401
2402        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
2403                              indentchunk * len(markstack),
2404                              opcode.name)
2405
2406        maxproto = max(maxproto, opcode.proto)
2407        before = opcode.stack_before    # don't mutate
2408        after = opcode.stack_after      # don't mutate
2409        numtopop = len(before)
2410
2411        # See whether a MARK should be popped.
2412        markmsg = None
2413        if markobject in before or (opcode.name == "POP" and
2414                                    stack and
2415                                    stack[-1] is markobject):
2416            assert markobject not in after
2417            if __debug__:
2418                if markobject in before:
2419                    assert before[-1] is stackslice
2420            if markstack:
2421                markpos = markstack.pop()
2422                if markpos is None:
2423                    markmsg = "(MARK at unknown opcode offset)"
2424                else:
2425                    markmsg = "(MARK at %d)" % markpos
2426                # Pop everything at and after the topmost markobject.
2427                while stack[-1] is not markobject:
2428                    stack.pop()
2429                stack.pop()
2430                # Stop later code from popping too much.
2431                try:
2432                    numtopop = before.index(markobject)
2433                except ValueError:
2434                    assert opcode.name == "POP"
2435                    numtopop = 0
2436            else:
2437                errormsg = markmsg = "no MARK exists on stack"
2438
2439        # Check for correct memo usage.
2440        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"):
2441            if opcode.name == "MEMOIZE":
2442                memo_idx = len(memo)
2443                markmsg = "(as %d)" % memo_idx
2444            else:
2445                assert arg is not None
2446                memo_idx = arg
2447            if memo_idx in memo:
2448                errormsg = "memo key %r already defined" % memo_idx
2449            elif not stack:
2450                errormsg = "stack is empty -- can't store into memo"
2451            elif stack[-1] is markobject:
2452                errormsg = "can't store markobject in the memo"
2453            else:
2454                memo[memo_idx] = stack[-1]
2455        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
2456            if arg in memo:
2457                assert len(after) == 1
2458                after = [memo[arg]]     # for better stack emulation
2459            else:
2460                errormsg = "memo key %r has never been stored into" % arg
2461
2462        if arg is not None or markmsg:
2463            # make a mild effort to align arguments
2464            line += ' ' * (10 - len(opcode.name))
2465            if arg is not None:
2466                line += ' ' + repr(arg)
2467            if markmsg:
2468                line += ' ' + markmsg
2469        if annotate:
2470            line += ' ' * (annocol - len(line))
2471            # make a mild effort to align annotations
2472            annocol = len(line)
2473            if annocol > 50:
2474                annocol = annotate
2475            line += ' ' + opcode.doc.split('\n', 1)[0]
2476        print(line, file=out)
2477
2478        if errormsg:
2479            # Note that we delayed complaining until the offending opcode
2480            # was printed.
2481            raise ValueError(errormsg)
2482
2483        # Emulate the stack effects.
2484        if len(stack) < numtopop:
2485            raise ValueError("tries to pop %d items from stack with "
2486                             "only %d items" % (numtopop, len(stack)))
2487        if numtopop:
2488            del stack[-numtopop:]
2489        if markobject in after:
2490            assert markobject not in before
2491            markstack.append(pos)
2492
2493        stack.extend(after)
2494
2495    print("highest protocol among opcodes =", maxproto, file=out)
2496    if stack:
2497        raise ValueError("stack not empty after STOP: %r" % stack)
2498
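# Illustrative sketch, not part of the original module: capturing dis()
# output in a StringIO instead of sys.stdout, and triggering one of the
# sanity checks listed in the docstring.  The byte string `bad` is made up
# for this example: it issues a GET for a memo slot that was never stored.

```python
import io
import pickle
import pickletools

# Capture an annotated disassembly as text.
buf = io.StringIO()
pickletools.dis(pickle.dumps([1, 2], protocol=2), out=buf, annotate=30)
listing = buf.getvalue()
print(listing)

# GET 0 followed by STOP: memo key 0 is referenced before it is defined.
bad = b"g0\n."
try:
    pickletools.dis(bad, out=io.StringIO())
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

# dis() prints the offending opcode before raising, so the captured listing
# still shows where the problem occurred.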
2499# For use in the doctest, simply as an example of a class to pickle.
2500class _Example:
2501    def __init__(self, value):
2502        self.value = value
2503
2504_dis_test = r"""
2505>>> import pickle
2506>>> x = [1, 2, (3, 4), {b'abc': "def"}]
2507>>> pkl0 = pickle.dumps(x, 0)
2508>>> dis(pkl0)
2509    0: (    MARK
2510    1: l        LIST       (MARK at 0)
2511    2: p    PUT        0
2512    5: L    LONG       1
2513    9: a    APPEND
2514   10: L    LONG       2
2515   14: a    APPEND
2516   15: (    MARK
2517   16: L        LONG       3
2518   20: L        LONG       4
2519   24: t        TUPLE      (MARK at 15)
2520   25: p    PUT        1
2521   28: a    APPEND
2522   29: (    MARK
2523   30: d        DICT       (MARK at 29)
2524   31: p    PUT        2
2525   34: c    GLOBAL     '_codecs encode'
2526   50: p    PUT        3
2527   53: (    MARK
2528   54: V        UNICODE    'abc'
2529   59: p        PUT        4
2530   62: V        UNICODE    'latin1'
2531   70: p        PUT        5
2532   73: t        TUPLE      (MARK at 53)
2533   74: p    PUT        6
2534   77: R    REDUCE
2535   78: p    PUT        7
2536   81: V    UNICODE    'def'
2537   86: p    PUT        8
2538   89: s    SETITEM
2539   90: a    APPEND
2540   91: .    STOP
2541highest protocol among opcodes = 0
2542
2543Try again with a "binary" pickle.
2544
2545>>> pkl1 = pickle.dumps(x, 1)
2546>>> dis(pkl1)
2547    0: ]    EMPTY_LIST
2548    1: q    BINPUT     0
2549    3: (    MARK
2550    4: K        BININT1    1
2551    6: K        BININT1    2
2552    8: (        MARK
2553    9: K            BININT1    3
2554   11: K            BININT1    4
2555   13: t            TUPLE      (MARK at 8)
2556   14: q        BINPUT     1
2557   16: }        EMPTY_DICT
2558   17: q        BINPUT     2
2559   19: c        GLOBAL     '_codecs encode'
2560   35: q        BINPUT     3
2561   37: (        MARK
2562   38: X            BINUNICODE 'abc'
2563   46: q            BINPUT     4
2564   48: X            BINUNICODE 'latin1'
2565   59: q            BINPUT     5
2566   61: t            TUPLE      (MARK at 37)
2567   62: q        BINPUT     6
2568   64: R        REDUCE
2569   65: q        BINPUT     7
2570   67: X        BINUNICODE 'def'
2571   75: q        BINPUT     8
2572   77: s        SETITEM
2573   78: e        APPENDS    (MARK at 3)
2574   79: .    STOP
2575highest protocol among opcodes = 1
2576
2577Exercise the INST/OBJ/BUILD family.
2578
2579>>> import pickletools
2580>>> dis(pickle.dumps(pickletools.dis, 0))
2581    0: c    GLOBAL     'pickletools dis'
2582   17: p    PUT        0
2583   20: .    STOP
2584highest protocol among opcodes = 0
2585
2586>>> from pickletools import _Example
2587>>> x = [_Example(42)] * 2
2588>>> dis(pickle.dumps(x, 0))
2589    0: (    MARK
2590    1: l        LIST       (MARK at 0)
2591    2: p    PUT        0
2592    5: c    GLOBAL     'copy_reg _reconstructor'
2593   30: p    PUT        1
2594   33: (    MARK
   34: c        GLOBAL     'pickletools _Example'
   56: p        PUT        2
   59: c        GLOBAL     '__builtin__ object'
   79: p        PUT        3
   82: N        NONE
   83: t        TUPLE      (MARK at 33)
   84: p    PUT        4
   87: R    REDUCE
   88: p    PUT        5
   91: (    MARK
   92: d        DICT       (MARK at 91)
   93: p    PUT        6
   96: V    UNICODE    'value'
  103: p    PUT        7
  106: L    LONG       42
  111: s    SETITEM
  112: b    BUILD
  113: a    APPEND
  114: g    GET        5
  117: a    APPEND
  118: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: c        GLOBAL     'copy_reg _reconstructor'
   29: q        BINPUT     1
   31: (        MARK
   32: c            GLOBAL     'pickletools _Example'
   54: q            BINPUT     2
   56: c            GLOBAL     '__builtin__ object'
   76: q            BINPUT     3
   78: N            NONE
   79: t            TUPLE      (MARK at 31)
   80: q        BINPUT     4
   82: R        REDUCE
   83: q        BINPUT     5
   85: }        EMPTY_DICT
   86: q        BINPUT     6
   88: X        BINUNICODE 'value'
   98: q        BINPUT     7
  100: K        BININT1    42
  102: s        SETITEM
  103: b        BUILD
  104: h        BINGET     5
  106: e        APPENDS    (MARK at 3)
  107: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2

Try protocol 3 with annotations:

>>> dis(pickle.dumps(T, 3), annotate=1)
    0: \x80 PROTO      3 Protocol version indicator.
    2: ]    EMPTY_LIST   Push an empty list.
    3: q    BINPUT     0 Store the stack top into the memo.  The stack is not popped.
    5: h    BINGET     0 Read an object from the memo and push it on the stack.
    7: \x85 TUPLE1       Build a one-tuple out of the topmost item on the stack.
    8: q    BINPUT     1 Store the stack top into the memo.  The stack is not popped.
   10: a    APPEND       Append an object to a list.
   11: 0    POP          Discard the top stack item, shrinking the stack by one item.
   12: h    BINGET     1 Read an object from the memo and push it on the stack.
   14: .    STOP         Stop the unpickling machine.
highest protocol among opcodes = 2

"""

_memo_test = r"""
>>> import pickle
>>> import io
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
0
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""
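
# A minimal sketch, not part of pickletools itself: the memo test above can
# also be driven from ordinary code by capturing dis() output in a StringIO
# and passing one memo dict to every dis() call.  The helper name
# _dis_stream_sketch is hypothetical and is used only for illustration.
def _dis_stream_sketch(objs, protocol=2):
    import io
    import pickle
    import pickletools

    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol)
    for obj in objs:
        # One Pickler for all dumps, so a repeated object becomes a memo GET.
        pickler.dump(obj)
    buf.seek(0)

    out = io.StringIO()
    memo = {}  # shared across dis() calls, like the -m/--memo option
    for _ in objs:
        # Each call disassembles one pickle and stops after its STOP opcode.
        pickletools.dis(buf, out=out, memo=memo)
    return out.getvalue()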

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()
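
# A minimal sketch, not part of pickletools itself: the "highest protocol
# among opcodes" line that dis() prints in the tests above can be recomputed
# from genops(), which yields (opcode, arg, position) triples.  The helper
# name _highest_protocol_sketch is hypothetical and only illustrative.
def _highest_protocol_sketch(data):
    import pickletools

    # Each opcode object carries the protocol that introduced it in .proto.
    return max(opcode.proto for opcode, _arg, _pos in pickletools.genops(data))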

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(
        description='disassemble one or more pickle files')
    parser.add_argument(
        'pickle_file', type=argparse.FileType('br'),
        nargs='*', help='the pickle file')
    parser.add_argument(
        '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
        help='the file where the output should be written')
    parser.add_argument(
        '-m', '--memo', action='store_true',
        help='preserve memo between disassemblies')
    parser.add_argument(
        '-l', '--indentlevel', default=4, type=int,
        help='the number of blanks by which to indent a new MARK level')
    parser.add_argument(
        '-a', '--annotate',  action='store_true',
        help='annotate each line with a short opcode description')
    parser.add_argument(
        '-p', '--preamble', default="==> {name} <==",
        help='if more than one pickle file is specified, print this before'
        ' each disassembly')
    parser.add_argument(
        '-t', '--test', action='store_true',
        help='run self-test suite')
    # -v is accepted here so that argparse does not reject it;
    # doctest.testmod() looks for '-v' in sys.argv on its own.
    parser.add_argument(
        '-v', action='store_true',
        help='run verbosely; only affects self-test run')
    args = parser.parse_args()
    if args.test:
        _test()
    else:
        # A nonzero annotate value doubles as the hint for the column
        # at which dis() starts the annotations.
        annotate = 30 if args.annotate else 0
        if not args.pickle_file:
            parser.print_help()
        elif len(args.pickle_file) == 1:
            # A single file: no preamble, and no memo to share.
            dis(args.pickle_file[0], args.output, None,
                args.indentlevel, annotate)
        else:
            # Several files: print a preamble before each one, and reuse
            # one memo across all of them if -m was given.
            memo = {} if args.memo else None
            for f in args.pickle_file:
                preamble = args.preamble.format(name=f.name)
                args.output.write(preamble + '\n')
                dis(f, args.output, memo, args.indentlevel, annotate)