# The packing stage in gemmlowp

## Introduction

We assume familiarity with [design.md](design.md) and with the overall three
stages of computation described there: packing, kernel, unpacking.

This page goes into more detail about the first stage: packing.

We also assume familiarity with [kernel.md](kernel.md), as it describes the
packed format requirements that the kernels expect; this basically forms the
contract that the packing stage must honor.

Some parts below also assume familiarity with
[low-precision.md](low-precision.md), as the packing stage also has to compute
the vectors of sums described there.

## The storage order of packed blocks, partly hidden behind sequential access

As explained in [design.md](design.md), the primary purpose of packing is to
ensure that when the kernel traverses Lhs/Rhs matrix data, it can do so
efficiently, thanks to the data being stored in an order that is as similar as
possible to the order in which the compute stage has to traverse it.

This traversal order is nontrivial for the reasons outlined in
[design.md](design.md): at the innermost level, one tries to work within
registers as much as possible; at the next level, one tries to stay within L1
cache as much as possible. The packed blocks that we handle are supposed to fit
entirely in L2 cache.

Thus it has become standard in GEMM design to describe complicated "Z-order" or
"fractal order" storage for packed blocks.

However, we should keep in mind that the whole point of the packed storage
order is to be as similar as possible to the order of traversal during the
compute stage. The storage order doesn't matter in itself; the only thing that
matters is having simple access patterns during the compute stage.

This suggests the following approach to implementing packing: take the exact
same hierarchy of nested loops as the compute stage, drop the loops that are
not relevant to the side (Lhs or Rhs) being packed, and use mostly sequential
access to the destination packed data.

This hierarchy of nested loops can be seen in PackSideBlockImpl (PackL2,
PackL1, PackRun); compare it to the similar hierarchy of loops in
internal/compute.h.

In this way, the more intricate small-scale details of the packed data format
never need to be made explicit (which would be complicated). We still use some
"seeking", but only at larger scales, where the storage order is less
complicated to describe.

### Sequential access to PackedSideBlock data

See PackedSideBlock in internal/pack.h, specifically the following data
members:

```
// Handle on the buffer backing this packed block. Owned.
Allocator::Handle data_handle_;
```

and:

```
// pos_ is the current position in the buffer, which we access
// sequentially, like a file.
// The idea is that we pack data in the same order as it is
// going to be traversed during the computation, which for
// cache-friendliness reasons is complicated to random-access,
// as the offsets calculations would be intricate. So we
// give up random-access addressing, and instead content ourselves
// with sequential access.
//
// pos_ is mutable because during the computation we will want to
// be able to iterate on the data in a const PackedSideBlock.
mutable int pos_;
```

The methods exposing sequential access are:

```
std::uint8_t* current_data() {
  return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
}
```

and:

```
void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }

void seek_forward_n_cells(int n) const {
  pos_ += n * KernelSideFormat::Cell::kSize;
}
```
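
To make the file-like sequential access concrete, here is a minimal standalone
sketch of the same idea. The names here (ToyPackedBuffer, kCellSize,
PackAllCells, ConsumeAllCells) are made up for illustration and are not
gemmlowp identifiers: the packer writes one cell at a time while advancing a
cursor, and the compute stage later replays the exact same sequence of seeks,
so neither side ever needs an explicit offset formula at this scale.

```
// Toy sketch of "sequential access, like a file" (not gemmlowp code).
#include <cstdint>
#include <vector>

constexpr int kCellSize = 16;  // e.g. a 4x4 cell of uint8 entries

class ToyPackedBuffer {
 public:
  explicit ToyPackedBuffer(int num_cells)
      : storage_(num_cells * kCellSize), pos_(0) {}

  std::uint8_t* current_data() { return storage_.data() + pos_; }
  void seek_next_cell() { pos_ += kCellSize; }
  void rewind() { pos_ = 0; }

 private:
  std::vector<std::uint8_t> storage_;
  int pos_;  // current position, advanced sequentially like a file cursor
};

// Packing stage: write cells one by one, in the traversal order that the
// compute stage will use.
void PackAllCells(ToyPackedBuffer* buf, int num_cells) {
  buf->rewind();
  for (int c = 0; c < num_cells; c++) {
    std::uint8_t* cell = buf->current_data();
    for (int i = 0; i < kCellSize; i++) {
      cell[i] = static_cast<std::uint8_t>(c + i);  // stand-in source entry
    }
    buf->seek_next_cell();
  }
}

// Compute stage: replay the exact same sequence of seeks, reading cells in
// the order they were written.
int ConsumeAllCells(ToyPackedBuffer* buf, int num_cells) {
  buf->rewind();
  int sum = 0;
  for (int c = 0; c < num_cells; c++) {
    const std::uint8_t* cell = buf->current_data();
    for (int i = 0; i < kCellSize; i++) sum += cell[i];  // stand-in for kernel work
    buf->seek_next_cell();
  }
  return sum;
}
```
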
### Random access to PackedSideBlock data at larger scales

We still need some random access at larger scales (at a coarse granularity),
which is unavoidable since GEMM is O(n^3) and has to traverse each of the
O(n^2) inputs O(n) times.

The watershed between sequential access and random access is at the level of a
'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer to the
innermost GEMM loop in the depth dimension. That's the critical inner loop
that must be as fast as possible, so it is the one for which we absolutely
want sequential access to packed data, making the storage order optimal by
construction. At larger scales, i.e. between runs, we accept that the storage
order is less optimal; since it's also less intricate, it's not too hard to
implement random access there.

This is done by the seek_run method:

```
void seek_run(int start_width, int start_depth) const {
  int kernel_run_depth =
      std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
  pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
}
```

We see that the formula involves the l1_depth parameter, which is how the
packed storage order depends on the L1 cache size. Again, the whole packed
block is supposed to fit in L2 cache.
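
To sanity-check what this formula implies about the storage order, here is a
small standalone worked example. The block-size numbers are made up for
illustration, and RunOffset is just a hypothetical restatement of seek_run's
arithmetic, not gemmlowp code: all the L1-sized slices preceding start_depth
together hold l2_width * start_depth entries, and within a slice, each unit of
width contributes one run of kernel_run_depth entries.

```
// Standalone restatement of the seek_run offset arithmetic, with made-up
// block sizes, to check the layout it implies.
#include <algorithm>
#include <cassert>

int RunOffset(int l2_width, int l2_depth, int l1_depth,
              int start_width, int start_depth) {
  int kernel_run_depth = std::min(l1_depth, l2_depth - start_depth);
  return l2_width * start_depth + start_width * kernel_run_depth;
}

int main() {
  // Hypothetical packed block: width 12, depth 1024, L1-sized slices of
  // depth 256.
  const int l2_width = 12, l2_depth = 1024, l1_depth = 256;
  // First run of the second slice: all of slice 0 (12 * 256 entries)
  // precedes it.
  assert(RunOffset(l2_width, l2_depth, l1_depth, 0, 256) == 12 * 256);
  // Run at start_width = 4 in that slice: one chunk of 4 * 256 entries
  // further in.
  assert(RunOffset(l2_width, l2_depth, l1_depth, 4, 256) ==
         12 * 256 + 4 * 256);
  return 0;
}
```
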
## The innermost loop of the packing stage: PackRun and PackingRegisterBlock

Keeping with our consistent usage of the term 'Run' throughout gemmlowp, the
innermost loop is called PackRun().

Here we recall a very important principle that was explained in
[kernel.md](kernel.md): the kernel is free to dictate the precise data format
that it expects, and the packing code has to honor it. So there's an asymmetry
here: the kernel dictates the format, and the packing code complies. That's
why the packing code is templatized in the KernelSideFormat. At larger scales,
the packing is independent of kernel format details, but inside PackRun is
where we take care of the small-scale details that do depend on the kernel
format. That's why it's a good thing that we only need sequential access here,
as it would be very complicated to spell out random access at this scale.

Anyway, PackRun.

Since it is the critical inner loop, it is what we want to allow specializing
for particular CPU architectures. To allow that, we handle blocks of fixed
dimensions at a time, which is intended to be friendly enough to optimization.
These blocks are PackingRegisterBlock's and their dimensions are:

```
  width = KernelWidth
  depth = kRegisterSize
```

See [kernel.md](kernel.md) and internal/kernel.h for the former, and
internal/common.h for the latter.

See the comments around PackingRegisterBlock in internal/pack.h:

```
// A PackingRegisterBlock is a small fixed-size block of a matrix being
// packed. This class is the generic non-optimized implementation,
// it is inherited by the generic implementation of PackingRegisterBlock,
// which may be overriden by template specialization. Overriding it is how
// one may provide optimized packing code paths.
//
// The packing of a block proceeds in two steps:
// 1. Ensuring that we have a complete block of source data, i.e. a block of
//    the compile-time prescribed size. This is where we handle unaligned
//    boundaries: if we don't have a complete block of source data, then
//    we copy and zero-extend it into a local temporary (complete_src_),
//    see MakeCompleteSrc. In the generic case, we do have a complete block,
//    so we just use it in-place, see UseCompleteSrcInPlace.
// 2. Packing a complete block into the destination, see Pack. This is the
//    most critical part, so it's convenient that unaligned boundaries have
//    already been handled in step 1.
```

## Other things that the packing stage has to do

Besides storing matrix entries in a suitable order, the packing stage also has
two other things to do.

First, packing has to compute the vectors of sums of entries along the depth
dimension. If this sounds mysterious, read
[low-precision.md](low-precision.md). These sums are only used at the
unpacking stage.

Second, if the BitDepthSetting requires less than 8 bits of precision, then at
the packing stage we have to requantize inputs accordingly. See
[less-than-8-bit.md](less-than-8-bit.md) for details. This is the Requantize()
function.

## Specialized packing paths for specific formats on specific CPU architectures

Please refer to internal/pack_neon.h for examples of doing that. The piece of
code to be specialized is PackingRegisterBlock. However, inside of it, only
the Pack() method typically needs to be specialized (the rest is unlikely to
be critical). So one typically specializes PackingRegisterBlock while still
inheriting from PackingRegisterBlockBase to keep the generic parts, and then
overrides the Pack() method.

Template specialization for the right template parameters is how one specifies
in which cases a given path is to be used in place of the generic packing
code.

It is entirely possible to set the value of kRegisterSize differently based on
the CPU architecture (for example, 32 on x86 with AVX) as long as all the
specialized packing paths used on that CPU architecture are consistent with
it.
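
Schematically, such a specialization has the following shape. This is a hedged
sketch modeled on the structure of the code in internal/pack_neon.h, not a
copy of it: ExampleSideFormat is a made-up format, the SideMap typedef is
assumed to match what the NEON paths use, and the Pack() body is elided to
comments.

```
// Sketch of the shape of a specialized packing path (not the actual code in
// internal/pack_neon.h). The format parameters are illustrative only.
#include "internal/pack.h"  // assumed: gemmlowp's generic packing machinery

namespace gemmlowp {

typedef SideMap<const std::uint8_t, SideMapOrder::WidthMajor>
    WidthMajorUint8SideMap;
typedef KernelSideFormat<CellFormat<4, 2>, 1> ExampleSideFormat;  // made up

template <>
class PackingRegisterBlock<WidthMajorUint8SideMap,
                           PackedSideBlock<ExampleSideFormat> >
    : public PackingRegisterBlockBase<WidthMajorUint8SideMap,
                                      PackedSideBlock<ExampleSideFormat> > {
 public:
  // Only Pack() is overridden; the generic MakeCompleteSrc /
  // UseCompleteSrcInPlace machinery is inherited from the base class.
  void Pack(PackedSideBlock<ExampleSideFormat>* dst, int start_width) {
    // Architecture-specific intrinsics go here: load the complete source
    // block, interleave it into the cell order dictated by the
    // KernelSideFormat, store it at dst->current_data(), and advance the
    // cursor with dst->seek_forward_n_cells(...).
  }
};

}  // namespace gemmlowp
```
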