1 2# HotSort 3 4HotSort is a high-performance GPU-accelerated integer sorting library 5for Vulkan, CUDA and OpenCL compute APIs. 6 7HotSort's advantages include: 8 9* Ultra-fast sorting of 32‑bit or 64‑bit keys 10* Reaches peak throughput on small arrays 11* In-place sorting for low-memory environments 12* Strong scaling with number of multiprocessors 13* Low memory transactions relative to array size 14* A concurrency-friendly dense kernel grid 15* Support for GPU post-processing of sorted results 16 17HotSort is typically significantly faster than other GPU-accelerated 18implementations when sorting arrays of smaller than 500K-2M keys. 19 20## Benchmarks 21 22### Throughput 23 24Here is a throughput plot for HotSort on Vulkan and CUDA sorting 2532-bit and 64-bit keys on a 640-core Quadro M1200: 26 27![](images/hs_nvidia_sm35_u32_mkeys.svg) 28![](images/hs_nvidia_sm35_u64_mkeys.svg) 29 30HotSort throughput on Vulkan on an AMD V1807B APU (similar to a Ryzen 2400G) with DDR4-2666 RAM: 31 32![](images/hs_amd_gcn_mkeys.svg) 33 34HotSort throughput on Vulkan and OpenCL on an Intel HD 630: 35 36![](images/hs_intel_gen8_mkeys.svg) 37 38### Execution time 39 40Note that these sorting rates translate to sub-millisecond to 41multi-millisecond execution times on small GPUs: 42 43![](images/hs_nvidia_sm35_u32_msecs.svg) 44![](images/hs_nvidia_sm35_u64_msecs.svg) 45![](images/hs_amd_gcn_msecs.svg) 46![](images/hs_intel_gen8_msecs.svg) 47 48# Usage 49 50There are HotSort implementations for Vulkan, CUDA and OpenCL. 51 52Note that HotSort is a comparison sort and supports in-place sorting. 53 54There are also benchmarking examples for the 55[Vulkan](vk/bench/main.c), [CUDA](cuda/bench/main.c) and 56[OpenCL](cl/bench/main.c) implementations. 57 58*Not all targeted architectures have been tested.* 59 60## Vulkan 61 62The following architectures are supported: 63 64Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes 65-------|-------------------------------------------|:------------------:|:------------------:|:-----------:|------ 66NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: | Not tested on all architectures 67NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels 68AMD | GCN | :white_check_mark: | :white_check_mark: | :x: | Tested on Linux MESA 18.2 69Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Good but the assumed *best-shaped* kernels aren't being used due to a compiler issue 70Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels 71 72Add an arch-specific HotSort algorithm (aka "target") to your project 73by including a `.c` source and `.h` header file: 74 75Key Size | Source | Header 76---------|------------------------------------------------------------------------|------------------------------------------------------- 7732‑bit | ```hs/vk/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/vk/<vendor>/<arch>/u32/hs_target.h``` 7864‑bit | ```hs/vk/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/vk/<vendor>/<arch>/u64/hs_target.h``` 79 80To sort `count` keys on Vulkan: 81 82```C 83#include "hs/vk/intel/gen8/u32/hs_target.h" 84 85... 86 87// create the Vulkan HotSort target 88struct hs_vk * hs = hs_vk_create(<address of target>,...); 89 90// allocate a descriptor set from a pool 91VkDescriptorSet hs_ds = hs_vk_ds_alloc(hs,descriptor_pool); 92 93... 94 95// command buffer begin 96 97... 98 99// bind buffer(s) to a command buffer 100hs_vk_ds_bind(hs,hs_ds,cb,vin,vout); // or (...,vin,VK_NULL_HANDLE) for in-place sorting 101 102// see how much padding may be required 103hs_vk_pad(hs,count,&count_padded_in,&count_padded_out); 104 105// append compute shaders to command buffer 106hs_vk_sort(hs,cb,...,vin,...,vout,...); // hs_vk_sort() and hs_vk_ds_bind() must have matching vin/vout args 107 108... 109 110// command buffer end and queue submit 111 112... 113 114// release the Vulkan HotSort target 115hs_vk_release(hs,...); 116 117``` 118 119The [`hs_vk.h`](vk/hs_vk.h) header file describes these functions in 120greater detail. 121 122## CUDA 123 124The following architectures are supported: 125 126Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes 127-------|-------------------------------------------------------|:------------------:|:------------------:|:---------:|------ 128NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: | 129NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels 130 131Add an arch-specific HotSort target to your project by including a 132`.cu` CUDA source and `.h` header file: 133 134Key Size | Source | Header 135---------|-------------------------------------------------|------------------------------------------ 13632‑bit | ```hs/cuda/sm_35/u32/hs_cuda_u32.cu``` | ```hs/cuda/sm_35/u32/hs_cuda.h``` 13764‑bit | ```hs/cuda/sm_35/u64/hs_cuda_u64.cu``` | ```hs/cuda/sm_35/u64/hs_cuda.h``` 138 139Usage on CUDA is very simple. 140 141For example, to sort `count` keys: 142 143```C 144#include "hs/cuda/sm_35/u32/hs_cuda.h" 145 146... 147 148uint32_t count_padded_in, count_padded_out; 149 150hs_cuda_pad_u32(count,&count_padded_in,&count_padded_in); 151 152... 153 154hs_cuda_sort_u32(vin,vout, // or (vin,NULL,...) for in-place sorting 155 count,count_padded_in,count_padded_out, 156 true, 157 stream0,stream1,stream2); 158``` 159 160HotSort on CUDA requires two auxilary streams in order to maximize concurrency. 161 162The algorithm is guaranteed to complete on `stream0`. 163 164 165## OpenCL 166 167The following architectures are supported: 168 169Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes 170-------|-----------------------------------------|:------------------:|:------------------:|:---------:|------ 171Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Due to a fragile compiler, the assumed best kernels are not being generated at this time 172Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels 173 174Add an arch-specific HotSort target to your project by including a 175`.c` source and `.h` header file: 176 177Key Size | Source | Header 178---------|------------------------------------------------------------------------|------------------------------------------------------- 17932‑bit | ```hs/cl/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/cl/<vendor>/<arch>/u32/hs_target.h``` 18064‑bit | ```hs/cl/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/cl/<vendor>/<arch>/u64/hs_target.h``` 181 182Note that if the symbol `HS_DUMP_SOURCE` is not defined then the 183pre-compiled GEN9 binaries will be included. These binaries may not 184be compatible with all GEN8+ devices and drivers. 185 186To sort `count` keys on OpenCL: 187 188```C 189// create the OpenCL HotSort target 190struct hs_cl * hs = hs_cl_create(<address of target>,...); 191 192... 193 194// see how much padding may be required 195hs_cl_pad(hs,count,&count_padded_in,&count_padded_out); 196 197// enqueue compute kernels 198hs_cl_sort(hs,cq,...,vin,vout,...); // or (...,vin,NULL,...) for in-place sorting 199 200... 201 202// release the OpenCL HotSort target 203hs_cl_release(hs,...); 204 205``` 206 207The [`hs_cl.h`](cl/hs_cl.h) header file describes these functions in 208greater detail. 209 210# Background 211 212The HotSort sorting algorithm was created in 2012 and generalized in 2132015 to support GPUs that benefit from non-power-of-two workgroups. 214 215The objective of HotSort is to achieve high throughput as *early* as 216possible on small GPUs when sorting modestly-sized arrays ― 1,000s to 217100s of thousands of 64‑bit keys. 218 219HotSort uses both well-known and obscure properties of bitonic 220sequences to create a novel mapping of keys onto data-parallel devices 221like GPUs. 222 223## Overview 224 225The overall flow of the HotSort algorithm is below. Kernel launches 226are in italics. 227 2281. For each workgroup of slabs: 229 1. For each slab in the workgroup: 230 1. *Slab Load* 231 1. *Slab Sort* 232 1. Until all slabs in the workgroup are merged: 233 1. *Multi-Slab Flip Merge* 234 1. *Slab Store* 2351. Until all slabs are merged: 236 1. *Streaming Flip Merge* 237 1. If necessary, *Streaming Half Merge* 238 1. If necessary, *Multi-Slab Half Merge* 239 1. If necessary, *Slab Half Merge* 240 1. If complete: 241 1. Optionally, *Report Key Changes* 242 1. Optionally, *Slab Transpose & Store* 243 1. Otherwise: *Slab Store* 2441. Done 245 246## Sorting 247 248The algorithm begins with a very *dense* per-multiprocessor block 249sorting algorithm that loads a "slab" of keys into a subgroup's 250registers, sorts the slabs, merges all slabs in the workgroup, and 251stores the slabs back to global memory. 252 253In the slab sorting phase, each lane of a subgroup executes a bitonic 254sorting network on its registers and successively merges lanes until 255the slab of registers is sorted in serpentine order. 256 257![](images/hs_sorted_slab.svg) 258 259## Merging 260 261HotSort has several different merging strategies. 262 263The merging kernels leverage the multiprocessor's register file by 264loading, merging and storing a large number of strided slab rows 265without using local memory. 266 267The merging kernels exploit the bitonic sequence property that 268interleaved subsequences of a bitonic sequence are also bitonic 269sequences. This property also holds for non-power-of-two sequences. 270 271As an example, the *Streaming Flip Merge* kernel is illustrated below: 272 273![](images/hs_flip_merge.svg) 274 275# Future Enhancements 276 277## Hybrid improved merging 278 279HotSort's initial sorting and merging phases are performed on bitonic 280sequences. Because of this, throughput decreases as the problem size 281increases. 282 283A hybrid algorithm that combined HotSort's block sorter and several 284rounds of merging with a state-of-the-art GPU merging algorithm would 285probably improve the algorithm's performance on larger arrays. 286 287## Reenable support for devices lacking shuffle functions 288 289The original version of HotSort ran on pre-Kepler GPUs without 290intra-warp/inter-lane shuffling ― reenable this capability. 291 292## Metal support 293 294Modify the HotSort generator to support Metal targets. 295