The packing stage in gemmlowp
*****************************


Introduction
============

We assume familiarity with doc/design.txt and with the overall three
stages of computation described there: packing, kernel, unpacking.

This page goes into more detail about the first stage: packing.

We also assume familiarity with doc/kernel.txt, as it describes the
packed format requirements that the kernels expect; that forms,
basically, the contract that the packing stage must honor.

Some parts below also assume familiarity with doc/low-precision.txt,
as the packing stage also has to compute the vectors of sums of entries
along the depth dimension, as described there.


The storage order of packed blocks, partly hidden behind sequential access
===========================================================================

As explained in doc/design.txt, the primary purpose of packing is to
ensure that when the kernel traverses Lhs/Rhs matrix data, it can do
so efficiently, thanks to the data being stored in an order that is
as similar as possible to the order in which the compute stage has to
traverse this data.

This traversal order is nontrivial for the reasons outlined in
doc/design.txt: at the innermost level, one tries to work within registers
as much as possible; at the next level, one tries to stay within L1 cache
as much as possible. The packed blocks that we handle are supposed
to fit entirely in L2 cache.

Thus it has become standard in GEMM design to describe complicated
"Z-order" or "fractal order" storage for packed blocks.

However, we should keep in mind that the whole point of the packed storage
order is to be as similar as possible to the order of traversal during
the compute stage. The storage order doesn't matter in itself; the only
thing that matters is simple access patterns during the compute stage.

This suggests the following approach to implementing packing: take the
exact same hierarchy of nested loops as the compute stage, drop the loops
that are not relevant to the side (Lhs or Rhs) being packed, and try to use
mostly sequential access to the destination packed data.

This hierarchy of nested loops can be seen in PackSideBlockImpl
(PackL2, PackL1, PackRun); compare it to the similar hierarchy of loops
in internal/compute.h. A sketch of this structure is given below.

In this way, the more intricate small-scale details of the packed data
format never need to be made explicit (which would be complicated). We
still use some "seeking", but only at larger scales, where the storage
order is less complicated to describe.
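
For illustration, here is a minimal sketch of that loop nesting. All
names and signatures here (PackL2Sketch, PackRunStub, the parameter
names) are hypothetical simplifications, not gemmlowp's actual code;
see PackSideBlockImpl in internal/pack.h for the real thing.

  #include <algorithm>

  // Stub for the innermost loop: would pack one kernel-width-wide run
  // of entries along the depth dimension, writing sequentially.
  void PackRunStub(int start_width, int start_depth, int run_depth) {}

  // Hypothetical, simplified sketch of the PackL2 -> PackL1 -> PackRun
  // loop hierarchy.
  void PackL2Sketch(int l2_width, int l2_depth, int l1_width,
                    int l1_depth, int kernel_width) {
    // PackL2: carve the L2-sized packed block into L1-sized blocks.
    for (int d = 0; d < l2_depth; d += l1_depth) {
      const int run_depth = std::min(l1_depth, l2_depth - d);
      for (int w = 0; w < l2_width; w += l1_width) {
        // PackL1: carve each L1-sized block into kernel-width runs.
        const int w_end = std::min(w + l1_width, l2_width);
        for (int rw = w; rw < w_end; rw += kernel_width) {
          // PackRun: traverse the depth dimension; only here do the
          // small-scale, kernel-format-specific details matter.
          PackRunStub(rw, d, run_depth);
        }
      }
    }
  }
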
Sequential access to PackedSideBlock data
-----------------------------------------

See PackedSideBlock in internal/pack.h, specifically the following data
members:

  // Handle on the buffer backing this packed block. Owned.
  Allocator::Handle data_handle_;

and:

  // pos_ is the current position in the buffer, which we access
  // sequentially, like a file.
  // The idea is that we pack data in the same order as it is
  // going to be traversed during the computation, which for
  // cache-friendliness reasons is complicated to random-access,
  // as the offset calculations would be intricate. So we
  // give up random-access addressing, and instead content ourselves
  // with sequential access.
  //
  // pos_ is mutable because during the computation we will want to
  // be able to iterate on the data in a const PackedSideBlock.
  mutable int pos_;

The methods exposing sequential access are:

  std::uint8_t* current_data() {
    return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
  }

and:

  void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }

  void seek_forward_n_cells(int n) const {
    pos_ += n * KernelSideFormat::Cell::kSize;
  }


Random access to PackedSideBlock data at larger scales
------------------------------------------------------

We still need some random access at larger scales (at a coarse
granularity), which is unavoidable since GEMM is O(n^3) and has to
traverse each of the O(n^2) inputs O(n) times.

The watershed between sequential access and random access is at the level
of a 'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer
to the innermost GEMM loop in the depth dimension. That is the critical
inner loop that must be as fast as possible, and thus the one for which we
absolutely want sequential access to packed data, so that the storage order
is optimal by construction. At larger scales, i.e. between runs, we accept
that the storage order is less optimal; since it is also less intricate,
it is not too hard to implement random access there.

This is done by the seek_run method:

  void seek_run(int start_width, int start_depth) const {
    int kernel_run_depth =
        std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
    pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
  }

We see that the formula involves the l1_depth parameter, which is how the
packed storage order depends on L1 cache size. Again, the whole packed
block is supposed to fit in L2 cache.
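
To make the formula concrete, here is a small worked example with
made-up parameter values (not taken from any real gemmlowp
configuration):

  #include <algorithm>
  #include <cassert>

  // Same position formula as seek_run above, written standalone.
  int SeekRunPos(int start_width, int start_depth, int l1_depth,
                 int l2_width, int l2_depth) {
    // Runs are l1_depth deep, except possibly at the end of the block.
    int kernel_run_depth = std::min(l1_depth, l2_depth - start_depth);
    // All entries at depth levels below start_depth come first
    // (l2_width * start_depth of them), then the runs at this depth
    // level that start at smaller width offsets.
    return l2_width * start_depth + start_width * kernel_run_depth;
  }

  int main() {
    // With l2_width = 96, l1_depth = 16, l2_depth = 64, the run at
    // (start_width = 8, start_depth = 16) starts at position
    // 96 * 16 + 8 * 16 = 1536 + 128 = 1664.
    assert(SeekRunPos(8, 16, 16, 96, 64) == 1664);
    return 0;
  }
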
The innermost loop of the packing stage, PackRun, and PackingRegisterBlock
==========================================================================

In keeping with our consistent usage of the term 'Run' throughout gemmlowp,
the innermost loop is called PackRun().

Here we recall a very important principle that was explained in
doc/kernel.txt: the kernel is free to dictate the precise data format
that it expects; the packing code has to honor it. So there is an asymmetry
here: the kernel is the master, the packing is the slave. That is why the
packing code is templatized in the KernelSideFormat. At larger scales,
the packing is independent of the kernel format, but PackRun is where we
take care of the small-scale details that do depend on it. That is why it
is a good thing that we only need sequential access here, as it would be
very complicated to spell out random access at this scale.

Anyway, PackRun.

Since it is the critical inner loop, it is what we want to allow specializing
for particular CPU architectures. To allow that, we handle blocks of fixed
dimensions at a time, which are intended to be friendly enough to
optimization. These blocks are PackingRegisterBlock's and their dimensions
are:
  width = KernelWidth
  depth = kRegisterSize
See doc/kernel.txt and internal/kernel.h for the former, and
internal/common.h for the latter.

See the comments around PackingRegisterBlock in internal/pack.h:

  // A PackingRegisterBlock is a small fixed-size block of a matrix being
  // packed. This class is the generic non-optimized implementation,
  // it is inherited by the generic implementation of PackingRegisterBlock,
  // which may be overridden by template specialization. Overriding it is
  // how one may provide optimized packing code paths.
  //
  // The packing of a block proceeds in two steps:
  // 1. Ensuring that we have a complete block of source data, i.e. a block
  //    of the compile-time prescribed size. This is where we handle
  //    unaligned boundaries: if we don't have a complete block of source
  //    data, then we copy and zero-extend it into a local temporary
  //    (complete_src_), see MakeCompleteSrc. In the generic case, we do
  //    have a complete block, so we just use it in-place, see
  //    UseCompleteSrcInPlace.
  // 2. Packing a complete block into the destination, see Pack. This is
  //    the most critical part, so it's convenient that unaligned boundaries
  //    have already been handled in step 1.


Other things that the packing stage has to do
=============================================

Besides storing matrix entries in a suitable order, the packing stage also
has two other things to do.

First, packing has to compute the vectors of sums of entries along the depth
dimension. If this sounds mysterious, read doc/low-precision.txt. These will
only be used at the unpacking stage.

Second, if the BitDepthSetting requires less than 8 bits of precision, then
at the packing stage we have to requantize inputs accordingly.
See doc/less-than-8-bit.txt for details. This is the Requantize() function.


Specialized packing paths for specific formats on specific CPU architectures
============================================================================

Please refer to internal/pack_neon.h for examples of such specialized paths.
The piece of code to be specialized is PackingRegisterBlock. However, within
it, typically only the Pack() method needs to be specialized (the rest is
unlikely to be performance-critical). So one typically specializes
PackingRegisterBlock while still inheriting PackingRegisterBlockBase to keep
the generic parts, and then overrides the Pack() method; a skeleton of this
pattern is sketched at the end of this section.

Template specialization for the right template parameters is how one
specifies in which case a given path is to be used in place of the generic
packing code.

It is entirely possible to set the value of kRegisterSize differently based
on the CPU architecture (for example, 32 on x86 with AVX) as long as all the
specialized packing paths used on that CPU architecture are consistent with
it.
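
To make the pattern concrete, here is a toy analogue of such a
specialization; none of these types are gemmlowp's actual classes (for
those, see PackingRegisterBlock and PackingRegisterBlockBase in
internal/pack.h, and the specializations in internal/pack_neon.h):

  #include <cstdint>

  // Generic base: provides the complete, non-optimized packing logic.
  template <typename Format>
  struct ToyPackingBlockBase {
    void MakeCompleteSrc() { /* generic handling of unaligned edges */ }
    void Pack(std::uint8_t* dst) { /* generic reference packing */ }
  };

  // Primary template: just inherit the generic implementation.
  template <typename Format>
  struct ToyPackingBlock : ToyPackingBlockBase<Format> {};

  // A hypothetical format tag for which an optimized path exists.
  struct NeonFriendlyFormat {};

  // Specialization for that format: keep the generic parts through
  // inheritance, override only the performance-critical Pack() step.
  template <>
  struct ToyPackingBlock<NeonFriendlyFormat>
      : ToyPackingBlockBase<NeonFriendlyFormat> {
    void Pack(std::uint8_t* dst) {
      // Optimized inner loop (e.g. NEON intrinsics) would go here.
    }
  };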