1# Adding or extending a family of adaptive instructions. 2 3## Families of instructions 4 5The core part of PEP 659 (specializing adaptive interpreter) is the families 6of instructions that perform the adaptive specialization. 7 8A family of instructions has the following fundamental properties: 9 10* It corresponds to a single instruction in the code 11 generated by the bytecode compiler. 12* It has a single adaptive instruction that records an execution count and, 13 at regular intervals, attempts to specialize itself. If not specializing, 14 it executes the base implementation. 15* It has at least one specialized form of the instruction that is tailored 16 for a particular value or set of values at runtime. 17* All members of the family must have the same number of inline cache entries, 18 to ensure correct execution. 19 Individual family members do not need to use all of the entries, 20 but must skip over any unused entries when executing. 21 22The current implementation also requires the following, 23although these are not fundamental and may change: 24 25* All families use one or more inline cache entries, 26 the first entry is always the counter. 27* All instruction names should start with the name of the adaptive 28 instruction. 29* Specialized forms should have names describing their specialization. 30 31## Example family 32 33The `LOAD_GLOBAL` instruction (in Python/bytecodes.c) already has an adaptive 34family that serves as a relatively simple example. 35 36The `LOAD_GLOBAL` instruction performs adaptive specialization, 37calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero. 38 39There are two specialized instructions in the family, `LOAD_GLOBAL_MODULE` 40which is specialized for global variables in the module, and 41`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables. 42 43## Performance analysis 44 45The benefit of a specialization can be assessed with the following formula: 46`Tbase/Tadaptive`. 47 48Where `Tbase` is the mean time to execute the base instruction, 49and `Tadaptive` is the mean time to execute the specialized and adaptive forms. 50 51`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)` 52 53`Ti` is the time to execute the `i`th instruction in the family and `Ni` is 54the number of times that instruction is executed. 55`Tmiss` is the time to process a miss, including de-optimzation 56and the time to execute the base instruction. 57 58The ideal situation is where misses are rare and the specialized 59forms are much faster than the base instruction. 60`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`. 61In which case we have `Tadaptive ≈ sum(Ti*Ni)`. 62Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and 63`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction, 64we would expect the specialization of `LOAD_GLOBAL` to be profitable. 65 66## Design considerations 67 68While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and 69`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti` 70low for all specialized instructions and `Nmiss` as low as possible. 71 72Keeping `Nmiss` low means that there should be specializations for almost 73all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means 74keeping `Ti` low which means minimizing branches and dependent memory 75accesses (pointer chasing). These two objectives may be in conflict, 76requiring judgement and experimentation to design the family of instructions. 77 78The size of the inline cache should as small as possible, 79without impairing performance, to reduce the number of 80`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache. 81 82### Gathering data 83 84Before choosing how to specialize an instruction, it is important to gather 85some data. What are the patterns of usage of the base instruction? 86Data can best be gathered by instrumenting the interpreter. Since a 87specialization function and adaptive instruction are going to be required, 88instrumentation can most easily be added in the specialization function. 89 90### Choice of specializations 91 92The performance of the specializing adaptive interpreter relies on the 93quality of specialization and keeping the overhead of specialization low. 94 95Specialized instructions must be fast. In order to be fast, 96specialized instructions should be tailored for a particular 97set of values that allows them to: 981. Verify that incoming value is part of that set with low overhead. 992. Perform the operation quickly. 100 101This requires that the set of values is chosen such that membership can be 102tested quickly and that membership is sufficient to allow the operation to 103performed quickly. 104 105For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()` 106dictionaries that have a keys with the expected version. 107 108This can be tested quickly: 109* `globals->keys->dk_version == expected_version` 110 111and the operation can be performed quickly: 112* `value = entries[cache->index].me_value;`. 113 114Because it is impossible to measure the performance of an instruction without 115also measuring unrelated factors, the assessment of the quality of a 116specialization will require some judgement. 117 118As a general rule, specialized instructions should be much faster than the 119base instruction. 120 121### Implementation of specialized instructions 122 123In general, specialized instructions should be implemented in two parts: 1241. A sequence of guards, each of the form 125 `DEOPT_IF(guard-condition-is-false, BASE_NAME)`. 1262. The operation, which should ideally have no branches and 127 a minimum number of dependent memory accesses. 128 129In practice, the parts may overlap, as data required for guards 130can be re-used in the operation. 131 132If there are branches in the operation, then consider further specialization 133to eliminate the branches. 134 135### Maintaining stats 136 137Finally, take care that stats are gather correctly. 138After the last `DEOPT_IF` has passed, a hit should be recorded with 139`STAT_INC(BASE_INSTRUCTION, hit)`. 140After an optimization has been deferred in the adaptive instruction, 141that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`. 142