# HotSort HotSort is a high-performance GPU-accelerated integer sorting library for Vulkan, CUDA and OpenCL compute APIs. HotSort's advantages include: * Ultra-fast sorting of 32‑bit or 64‑bit keys * Reaches peak throughput on small arrays * In-place sorting for low-memory environments * Strong scaling with number of multiprocessors * Low memory transactions relative to array size * A concurrency-friendly dense kernel grid * Support for GPU post-processing of sorted results HotSort is typically significantly faster than other GPU-accelerated implementations when sorting arrays of smaller than 500K-2M keys. ## Benchmarks ### Throughput Here is a throughput plot for HotSort on Vulkan and CUDA sorting 32-bit and 64-bit keys on a 640-core Quadro M1200: ![](images/hs_nvidia_sm35_u32_mkeys.svg) ![](images/hs_nvidia_sm35_u64_mkeys.svg) HotSort throughput on Vulkan on an AMD V1807B APU (similar to a Ryzen 2400G) with DDR4-2666 RAM: ![](images/hs_amd_gcn_mkeys.svg) HotSort throughput on Vulkan and OpenCL on an Intel HD 630: ![](images/hs_intel_gen8_mkeys.svg) ### Execution time Note that these sorting rates translate to sub-millisecond to multi-millisecond execution times on small GPUs: ![](images/hs_nvidia_sm35_u32_msecs.svg) ![](images/hs_nvidia_sm35_u64_msecs.svg) ![](images/hs_amd_gcn_msecs.svg) ![](images/hs_intel_gen8_msecs.svg) # Usage There are HotSort implementations for Vulkan, CUDA and OpenCL. Note that HotSort is a comparison sort and supports in-place sorting. There are also benchmarking examples for the [Vulkan](vk/bench/main.c), [CUDA](cuda/bench/main.c) and [OpenCL](cl/bench/main.c) implementations. *Not all targeted architectures have been tested.* ## Vulkan The following architectures are supported: Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes -------|-------------------------------------------|:------------------:|:------------------:|:-----------:|------ NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: | Not tested on all architectures NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels AMD | GCN | :white_check_mark: | :white_check_mark: | :x: | Tested on Linux MESA 18.2 Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Good but the assumed *best-shaped* kernels aren't being used due to a compiler issue Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels Add an arch-specific HotSort algorithm (aka "target") to your project by including a `.c` source and `.h` header file: Key Size | Source | Header ---------|------------------------------------------------------------------------|------------------------------------------------------- 32‑bit | ```hs/vk///u32/hs___u32.c``` | ```hs/vk///u32/hs_target.h``` 64‑bit | ```hs/vk///u64/hs___u64.c``` | ```hs/vk///u64/hs_target.h``` To sort `count` keys on Vulkan: ```C #include "hs/vk/intel/gen8/u32/hs_target.h" ... // create the Vulkan HotSort target struct hs_vk * hs = hs_vk_create(
,...); // allocate a descriptor set from a pool VkDescriptorSet hs_ds = hs_vk_ds_alloc(hs,descriptor_pool); ... // command buffer begin ... // bind buffer(s) to a command buffer hs_vk_ds_bind(hs,hs_ds,cb,vin,vout); // or (...,vin,VK_NULL_HANDLE) for in-place sorting // see how much padding may be required hs_vk_pad(hs,count,&count_padded_in,&count_padded_out); // append compute shaders to command buffer hs_vk_sort(hs,cb,...,vin,...,vout,...); // hs_vk_sort() and hs_vk_ds_bind() must have matching vin/vout args ... // command buffer end and queue submit ... // release the Vulkan HotSort target hs_vk_release(hs,...); ``` The [`hs_vk.h`](vk/hs_vk.h) header file describes these functions in greater detail. ## CUDA The following architectures are supported: Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes -------|-------------------------------------------------------|:------------------:|:------------------:|:---------:|------ NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: | NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels Add an arch-specific HotSort target to your project by including a `.cu` CUDA source and `.h` header file: Key Size | Source | Header ---------|-------------------------------------------------|------------------------------------------ 32‑bit | ```hs/cuda/sm_35/u32/hs_cuda_u32.cu``` | ```hs/cuda/sm_35/u32/hs_cuda.h``` 64‑bit | ```hs/cuda/sm_35/u64/hs_cuda_u64.cu``` | ```hs/cuda/sm_35/u64/hs_cuda.h``` Usage on CUDA is very simple. For example, to sort `count` keys: ```C #include "hs/cuda/sm_35/u32/hs_cuda.h" ... uint32_t count_padded_in, count_padded_out; hs_cuda_pad_u32(count,&count_padded_in,&count_padded_in); ... hs_cuda_sort_u32(vin,vout, // or (vin,NULL,...) for in-place sorting count,count_padded_in,count_padded_out, true, stream0,stream1,stream2); ``` HotSort on CUDA requires two auxilary streams in order to maximize concurrency. The algorithm is guaranteed to complete on `stream0`. ## OpenCL The following architectures are supported: Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes -------|-----------------------------------------|:------------------:|:------------------:|:---------:|------ Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Due to a fragile compiler, the assumed best kernels are not being generated at this time Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels Add an arch-specific HotSort target to your project by including a `.c` source and `.h` header file: Key Size | Source | Header ---------|------------------------------------------------------------------------|------------------------------------------------------- 32‑bit | ```hs/cl///u32/hs___u32.c``` | ```hs/cl///u32/hs_target.h``` 64‑bit | ```hs/cl///u64/hs___u64.c``` | ```hs/cl///u64/hs_target.h``` Note that if the symbol `HS_DUMP_SOURCE` is not defined then the pre-compiled GEN9 binaries will be included. These binaries may not be compatible with all GEN8+ devices and drivers. To sort `count` keys on OpenCL: ```C // create the OpenCL HotSort target struct hs_cl * hs = hs_cl_create(
,...); ... // see how much padding may be required hs_cl_pad(hs,count,&count_padded_in,&count_padded_out); // enqueue compute kernels hs_cl_sort(hs,cq,...,vin,vout,...); // or (...,vin,NULL,...) for in-place sorting ... // release the OpenCL HotSort target hs_cl_release(hs,...); ``` The [`hs_cl.h`](cl/hs_cl.h) header file describes these functions in greater detail. # Background The HotSort sorting algorithm was created in 2012 and generalized in 2015 to support GPUs that benefit from non-power-of-two workgroups. The objective of HotSort is to achieve high throughput as *early* as possible on small GPUs when sorting modestly-sized arrays ― 1,000s to 100s of thousands of 64‑bit keys. HotSort uses both well-known and obscure properties of bitonic sequences to create a novel mapping of keys onto data-parallel devices like GPUs. ## Overview The overall flow of the HotSort algorithm is below. Kernel launches are in italics. 1. For each workgroup of slabs: 1. For each slab in the workgroup: 1. *Slab Load* 1. *Slab Sort* 1. Until all slabs in the workgroup are merged: 1. *Multi-Slab Flip Merge* 1. *Slab Store* 1. Until all slabs are merged: 1. *Streaming Flip Merge* 1. If necessary, *Streaming Half Merge* 1. If necessary, *Multi-Slab Half Merge* 1. If necessary, *Slab Half Merge* 1. If complete: 1. Optionally, *Report Key Changes* 1. Optionally, *Slab Transpose & Store* 1. Otherwise: *Slab Store* 1. Done ## Sorting The algorithm begins with a very *dense* per-multiprocessor block sorting algorithm that loads a "slab" of keys into a subgroup's registers, sorts the slabs, merges all slabs in the workgroup, and stores the slabs back to global memory. In the slab sorting phase, each lane of a subgroup executes a bitonic sorting network on its registers and successively merges lanes until the slab of registers is sorted in serpentine order. ![](images/hs_sorted_slab.svg) ## Merging HotSort has several different merging strategies. The merging kernels leverage the multiprocessor's register file by loading, merging and storing a large number of strided slab rows without using local memory. The merging kernels exploit the bitonic sequence property that interleaved subsequences of a bitonic sequence are also bitonic sequences. This property also holds for non-power-of-two sequences. As an example, the *Streaming Flip Merge* kernel is illustrated below: ![](images/hs_flip_merge.svg) # Future Enhancements ## Hybrid improved merging HotSort's initial sorting and merging phases are performed on bitonic sequences. Because of this, throughput decreases as the problem size increases. A hybrid algorithm that combined HotSort's block sorter and several rounds of merging with a state-of-the-art GPU merging algorithm would probably improve the algorithm's performance on larger arrays. ## Reenable support for devices lacking shuffle functions The original version of HotSort ran on pre-Kepler GPUs without intra-warp/inter-lane shuffling ― reenable this capability. ## Metal support Modify the HotSort generator to support Metal targets.