• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1xxHash fast digest algorithm
2======================
3
4### Notices
5
6Copyright (c) Yann Collet
7
8Permission is granted to copy and distribute this document
9for any purpose and without charge,
10including translations into other languages
11and incorporation into compilations,
12provided that the copyright notice and this notice are preserved,
13and that any substantive changes or deletions from the original
14are clearly marked.
15Distribution of this document is unlimited.
16
17### Version
18
190.1.1 (10/10/18)
20
21
22Table of Contents
23---------------------
24- [Introduction](#introduction)
25- [XXH32 algorithm description](#xxh32-algorithm-description)
26- [XXH64 algorithm description](#xxh64-algorithm-description)
27- [Performance considerations](#performance-considerations)
28- [Reference Implementation](#reference-implementation)
29
30
31Introduction
32----------------
33
34This document describes the xxHash digest algorithm for both 32-bit and 64-bit variants, named `XXH32` and `XXH64`. The algorithm takes an input a message of arbitrary length and an optional seed value, then produces an output of 32 or 64-bit as "fingerprint" or "digest".
35
36xxHash is primarily designed for speed. It is labeled non-cryptographic, and is not meant to avoid intentional collisions (same digest for 2 different messages), or to prevent producing a message with a predefined digest.
37
38XXH32 is designed to be fast on 32-bit machines.
39XXH64 is designed to be fast on 64-bit machines.
40Both variants produce different output.
41However, a given variant shall produce exactly the same output, irrespective of the cpu / os used. In particular, the result remains identical whatever the endianness and width of the cpu is.
42
43### Operation notations
44
45All operations are performed modulo {32,64} bits. Arithmetic overflows are expected.
46`XXH32` uses 32-bit modular operations. `XXH64` uses 64-bit modular operations.
47
48- `+`: denotes modular addition
49- `*`: denotes modular multiplication
50- `X <<< s`: denotes the value obtained by circularly shifting (rotating) `X` left by `s` bit positions.
51- `X >> s`: denotes the value obtained by shifting `X` right by s bit positions. Upper `s` bits become `0`.
52- `X xor Y`: denotes the bit-wise XOR of `X` and `Y` (same width).
53
54
55XXH32 Algorithm Description
56-------------------------------------
57
58### Overview
59
60We begin by supposing that we have a message of any length `L` as input, and that we wish to find its digest. Here `L` is an arbitrary nonnegative integer; `L` may be zero. The following steps are performed to compute the digest of the message.
61
62The algorithm collect and transform input in _stripes_ of 16 bytes. The transforms are stored inside 4 "accumulators", each one storing an unsigned 32-bit value. Each accumulator can be processed independently in parallel, speeding up processing for cpu with multiple execution units.
63
64The algorithm uses 32-bits addition, multiplication, rotate, shift and xor operations. Many operations require some 32-bits prime number constants, all defined below:
65
66    static const u32 PRIME32_1 = 0x9E3779B1U;  // 0b10011110001101110111100110110001
67    static const u32 PRIME32_2 = 0x85EBCA77U;  // 0b10000101111010111100101001110111
68    static const u32 PRIME32_3 = 0xC2B2AE3DU;  // 0b11000010101100101010111000111101
69    static const u32 PRIME32_4 = 0x27D4EB2FU;  // 0b00100111110101001110101100101111
70    static const u32 PRIME32_5 = 0x165667B1U;  // 0b00010110010101100110011110110001
71
72These constants are prime numbers, and feature a good mix of bits 1 and 0, neither too regular, nor too dissymmetric. These properties help dispersion capabilities.
73
74### Step 1. Initialize internal accumulators
75
76Each accumulator gets an initial value based on optional `seed` input. Since the `seed` is optional, it can be `0`.
77
78        u32 acc1 = seed + PRIME32_1 + PRIME32_2;
79        u32 acc2 = seed + PRIME32_2;
80        u32 acc3 = seed + 0;
81        u32 acc4 = seed - PRIME32_1;
82
83#### Special case: input is less than 16 bytes
84
85When the input is too small (< 16 bytes), the algorithm will not process any stripes. Consequently, it will not make use of parallel accumulators.
86
87In this case, a simplified initialization is performed, using a single accumulator:
88
89      u32 acc  = seed + PRIME32_5;
90
91The algorithm then proceeds directly to step 4.
92
93### Step 2. Process stripes
94
95A stripe is a contiguous segment of 16 bytes.
96It is evenly divided into 4 _lanes_, of 4 bytes each.
97The first lane is used to update accumulator 1, the second lane is used to update accumulator 2, and so on.
98
99Each lane read its associated 32-bit value using __little-endian__ convention.
100
101For each {lane, accumulator}, the update process is called a _round_, and applies the following formula:
102
103    accN = accN + (laneN * PRIME32_2);
104    accN = accN <<< 13;
105    accN = accN * PRIME32_1;
106
107This shuffles the bits so that any bit from input _lane_ impacts several bits in output _accumulator_. All operations are performed modulo 2^32.
108
109Input is consumed one full stripe at a time. Step 2 is looped as many times as necessary to consume the whole input, except for the last remaining bytes which cannot form a stripe (< 16 bytes).
110When that happens, move to step 3.
111
112### Step 3. Accumulator convergence
113
114All 4 lane accumulators from the previous steps are merged to produce a single remaining accumulator of the same width (32-bit). The associated formula is as follows:
115
116    acc = (acc1 <<< 1) + (acc2 <<< 7) + (acc3 <<< 12) + (acc4 <<< 18);
117
118### Step 4. Add input length
119
120The input total length is presumed known at this stage. This step is just about adding the length to accumulator, so that it participates to final mixing.
121
122    acc = acc + (u32)inputLength;
123
124Note that, if input length is so large that it requires more than 32-bits, only the lower 32-bits are added to the accumulator.
125
126### Step 5. Consume remaining input
127
128There may be up to 15 bytes remaining to consume from the input.
129The final stage will digest them according to following pseudo-code:
130
131    while (remainingLength >= 4) {
132        lane = read_32bit_little_endian(input_ptr);
133        acc = acc + lane * PRIME32_3;
134        acc = (acc <<< 17) * PRIME32_4;
135        input_ptr += 4; remainingLength -= 4;
136    }
137
138    while (remainingLength >= 1) {
139        lane = read_byte(input_ptr);
140        acc = acc + lane * PRIME32_5;
141        acc = (acc <<< 11) * PRIME32_1;
142        input_ptr += 1; remainingLength -= 1;
143    }
144
145This process ensures that all input bytes are present in the final mix.
146
147### Step 6. Final mix (avalanche)
148
149The final mix ensures that all input bits have a chance to impact any bit in the output digest, resulting in an unbiased distribution. This is also called avalanche effect.
150
151    acc = acc xor (acc >> 15);
152    acc = acc * PRIME32_2;
153    acc = acc xor (acc >> 13);
154    acc = acc * PRIME32_3;
155    acc = acc xor (acc >> 16);
156
157### Step 7. Output
158
159The `XXH32()` function produces an unsigned 32-bit value as output.
160
161For systems which require to store and/or display the result in binary or hexadecimal format, the canonical format is defined to reproduce the same value as the natural decimal format, hence follows __big-endian__ convention (most significant byte first).
162
163
164XXH64 Algorithm Description
165-------------------------------------
166
167### Overview
168
169`XXH64`'s algorithm structure is very similar to `XXH32` one. The major difference is that `XXH64` uses 64-bit arithmetic, speeding up memory transfer for 64-bit compliant systems, but also relying on cpu capability to efficiently perform 64-bit operations.
170
171The algorithm collects and transforms input in _stripes_ of 32 bytes. The transforms are stored inside 4 "accumulators", each one storing an unsigned 64-bit value. Each accumulator can be processed independently in parallel, speeding up processing for cpu with multiple execution units.
172
173The algorithm uses 64-bit addition, multiplication, rotate, shift and xor operations. Many operations require some 64-bit prime number constants, all defined below:
174
175    static const u64 PRIME64_1 = 0x9E3779B185EBCA87ULL;  // 0b1001111000110111011110011011000110000101111010111100101010000111
176    static const u64 PRIME64_2 = 0xC2B2AE3D27D4EB4FULL;  // 0b1100001010110010101011100011110100100111110101001110101101001111
177    static const u64 PRIME64_3 = 0x165667B19E3779F9ULL;  // 0b0001011001010110011001111011000110011110001101110111100111111001
178    static const u64 PRIME64_4 = 0x85EBCA77C2B2AE63ULL;  // 0b1000010111101011110010100111011111000010101100101010111001100011
179    static const u64 PRIME64_5 = 0x27D4EB2F165667C5ULL;  // 0b0010011111010100111010110010111100010110010101100110011111000101
180
181These constants are prime numbers, and feature a good mix of bits 1 and 0, neither too regular, nor too dissymmetric. These properties help dispersion capabilities.
182
183### Step 1. Initialise internal accumulators
184
185Each accumulator gets an initial value based on optional `seed` input. Since the `seed` is optional, it can be `0`.
186
187        u64 acc1 = seed + PRIME64_1 + PRIME64_2;
188        u64 acc2 = seed + PRIME64_2;
189        u64 acc3 = seed + 0;
190        u64 acc4 = seed - PRIME64_1;
191
192#### Special case: input is less than 32 bytes
193
194When the input is too small (< 32 bytes), the algorithm will not process any stripes. Consequently, it will not make use of parallel accumulators.
195
196In this case, a simplified initialization is performed, using a single accumulator:
197
198      u64 acc  = seed + PRIME64_5;
199
200The algorithm then proceeds directly to step 4.
201
202### Step 2. Process stripes
203
204A stripe is a contiguous segment of 32 bytes.
205It is evenly divided into 4 _lanes_, of 8 bytes each.
206The first lane is used to update accumulator 1, the second lane is used to update accumulator 2, and so on.
207
208Each lane read its associated 64-bit value using __little-endian__ convention.
209
210For each {lane, accumulator}, the update process is called a _round_, and applies the following formula:
211
212    round(accN,laneN):
213    accN = accN + (laneN * PRIME64_2);
214    accN = accN <<< 31;
215    return accN * PRIME64_1;
216
217This shuffles the bits so that any bit from input _lane_ impacts several bits in output _accumulator_. All operations are performed modulo 2^64.
218
219Input is consumed one full stripe at a time. Step 2 is looped as many times as necessary to consume the whole input, except for the last remaining bytes which cannot form a stripe (< 32 bytes).
220When that happens, move to step 3.
221
222### Step 3. Accumulator convergence
223
224All 4 lane accumulators from previous steps are merged to produce a single remaining accumulator of same width (64-bit). The associated formula is as follows.
225
226Note that accumulator convergence is more complex than 32-bit variant, and requires to define another function called _mergeAccumulator()_:
227
228    mergeAccumulator(acc,accN):
229    acc  = acc xor round(0, accN);
230    acc  = acc * PRIME64_1;
231    return acc + PRIME64_4;
232
233which is then used in the convergence formula:
234
235    acc = (acc1 <<< 1) + (acc2 <<< 7) + (acc3 <<< 12) + (acc4 <<< 18);
236    acc = mergeAccumulator(acc, acc1);
237    acc = mergeAccumulator(acc, acc2);
238    acc = mergeAccumulator(acc, acc3);
239    acc = mergeAccumulator(acc, acc4);
240
241### Step 4. Add input length
242
243The input total length is presumed known at this stage. This step is just about adding the length to accumulator, so that it participates to final mixing.
244
245    acc = acc + inputLength;
246
247### Step 5. Consume remaining input
248
249There may be up to 31 bytes remaining to consume from the input.
250The final stage will digest them according to following pseudo-code:
251
252    while (remainingLength >= 8) {
253        lane = read_64bit_little_endian(input_ptr);
254        acc = acc xor round(0, lane);
255        acc = (acc <<< 27) * PRIME64_1;
256        acc = acc + PRIME64_4;
257        input_ptr += 8; remainingLength -= 8;
258    }
259
260    if (remainingLength >= 4) {
261        lane = read_32bit_little_endian(input_ptr);
262        acc = acc xor (lane * PRIME64_1);
263        acc = (acc <<< 23) * PRIME64_2;
264        acc = acc + PRIME64_3;
265        input_ptr += 4; remainingLength -= 4;
266    }
267
268    while (remainingLength >= 1) {
269        lane = read_byte(input_ptr);
270        acc = acc xor (lane * PRIME64_5);
271        acc = (acc <<< 11) * PRIME64_1;
272        input_ptr += 1; remainingLength -= 1;
273    }
274
275This process ensures that all input bytes are present in the final mix.
276
277### Step 6. Final mix (avalanche)
278
279The final mix ensures that all input bits have a chance to impact any bit in the output digest, resulting in an unbiased distribution. This is also called avalanche effect.
280
281    acc = acc xor (acc >> 33);
282    acc = acc * PRIME64_2;
283    acc = acc xor (acc >> 29);
284    acc = acc * PRIME64_3;
285    acc = acc xor (acc >> 32);
286
287### Step 7. Output
288
289The `XXH64()` function produces an unsigned 64-bit value as output.
290
291For systems which require to store and/or display the result in binary or hexadecimal format, the canonical format is defined to reproduce the same value as the natural decimal format, hence follows __big-endian__ convention (most significant byte first).
292
293Performance considerations
294----------------------------------
295
296The xxHash algorithms are simple and compact to implement. They provide a system independent "fingerprint" or digest of a message of arbitrary length.
297
298The algorithm allows input to be streamed and processed in multiple steps. In such case, an internal buffer is needed to ensure data is presented to the algorithm in full stripes.
299
300On 64-bit systems, the 64-bit variant `XXH64` is generally faster to compute, so it is a recommended variant, even when only 32-bit are needed.
301
302On 32-bit systems though, positions are reversed: `XXH64` performance is reduced, due to its usage of 64-bit arithmetic. `XXH32` becomes a faster variant.
303
304
305Reference Implementation
306----------------------------------------
307
308A reference library written in C is available at https://www.xxhash.com.
309The web page also links to multiple other implementations written in many different languages.
310It links to the [github project page](https://github.com/Cyan4973/xxHash) where an [issue board](https://github.com/Cyan4973/xxHash/issues) can be used for further public discussions on the topic.
311
312
313Version changes
314--------------------
315v0.7.3: Minor fixes
316v0.1.1: added a note on rationale for selection of constants
317v0.1.0: initial release
318