1 /* 2 * Copyright (C) 2011 The Guava Authors 3 * 4 * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except 5 * in compliance with the License. You may obtain a copy of the License at 6 * 7 * http://www.apache.org/licenses/LICENSE-2.0 8 * 9 * Unless required by applicable law or agreed to in writing, software distributed under the License 10 * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 11 * or implied. See the License for the specific language governing permissions and limitations under 12 * the License. 13 */ 14 15 package com.google.common.hash; 16 17 import com.google.common.primitives.Ints; 18 import com.google.errorprone.annotations.Immutable; 19 import java.nio.ByteBuffer; 20 import java.nio.charset.Charset; 21 import org.checkerframework.checker.nullness.qual.Nullable; 22 23 /** 24 * A hash function is a collision-averse pure function that maps an arbitrary block of data to a 25 * number called a <i>hash code</i>. 26 * 27 * <h3>Definition</h3> 28 * 29 * <p>Unpacking this definition: 30 * 31 * <ul> 32 * <li><b>block of data:</b> the input for a hash function is always, in concept, an ordered byte 33 * array. This hashing API accepts an arbitrary sequence of byte and multibyte values (via 34 * {@link Hasher}), but this is merely a convenience; these are always translated into raw 35 * byte sequences under the covers. 36 * <li><b>hash code:</b> each hash function always yields hash codes of the same fixed bit length 37 * (given by {@link #bits}). For example, {@link Hashing#sha1} produces a 160-bit number, 38 * while {@link Hashing#murmur3_32()} yields only 32 bits. Because a {@code long} value is 39 * clearly insufficient to hold all hash code values, this API represents a hash code as an 40 * instance of {@link HashCode}. 41 * <li><b>pure function:</b> the value produced must depend only on the input bytes, in the order 42 * they appear. Input data is never modified. {@link HashFunction} instances should always be 43 * stateless, and therefore thread-safe. 44 * <li><b>collision-averse:</b> while it can't be helped that a hash function will sometimes 45 * produce the same hash code for distinct inputs (a "collision"), every hash function strives 46 * to <i>some</i> degree to make this unlikely. (Without this condition, a function that 47 * always returns zero could be called a hash function. It is not.) 48 * </ul> 49 * 50 * <p>Summarizing the last two points: "equal yield equal <i>always</i>; unequal yield unequal 51 * <i>often</i>." This is the most important characteristic of all hash functions. 52 * 53 * <h3>Desirable properties</h3> 54 * 55 * <p>A high-quality hash function strives for some subset of the following virtues: 56 * 57 * <ul> 58 * <li><b>collision-resistant:</b> while the definition above requires making at least <i>some</i> 59 * token attempt, one measure of the quality of a hash function is <i>how well</i> it succeeds 60 * at this goal. Important note: it may be easy to achieve the theoretical minimum collision 61 * rate when using completely <i>random</i> sample input. The true test of a hash function is 62 * how it performs on representative real-world data, which tends to contain many hidden 63 * patterns and clumps. The goal of a good hash function is to stamp these patterns out as 64 * thoroughly as possible. 65 * <li><b>bit-dispersing:</b> masking out any <i>single bit</i> from a hash code should yield only 66 * the expected <i>twofold</i> increase to all collision rates. Informally, the "information" 67 * in the hash code should be as evenly "spread out" through the hash code's bits as possible. 68 * The result is that, for example, when choosing a bucket in a hash table of size 2^8, 69 * <i>any</i> eight bits could be consistently used. 70 * <li><b>cryptographic:</b> certain hash functions such as {@link Hashing#sha512} are designed to 71 * make it as infeasible as possible to reverse-engineer the input that produced a given hash 72 * code, or even to discover <i>any</i> two distinct inputs that yield the same result. These 73 * are called <i>cryptographic hash functions</i>. But, whenever it is learned that either of 74 * these feats has become computationally feasible, the function is deemed "broken" and should 75 * no longer be used for secure purposes. (This is the likely eventual fate of <i>all</i> 76 * cryptographic hashes.) 77 * <li><b>fast:</b> perhaps self-explanatory, but often the most important consideration. 78 * </ul> 79 * 80 * <h3>Providing input to a hash function</h3> 81 * 82 * <p>The primary way to provide the data that your hash function should act on is via a {@link 83 * Hasher}. Obtain a new hasher from the hash function using {@link #newHasher}, "push" the relevant 84 * data into it using methods like {@link Hasher#putBytes(byte[])}, and finally ask for the {@code 85 * HashCode} when finished using {@link Hasher#hash}. (See an {@linkplain #newHasher example} of 86 * this.) 87 * 88 * <p>If all you want to hash is a single byte array, string or {@code long} value, there are 89 * convenient shortcut methods defined directly on {@link HashFunction} to make this easier. 90 * 91 * <p>Hasher accepts primitive data types, but can also accept any Object of type {@code T} provided 92 * that you implement a {@link Funnel}{@code <T>} to specify how to "feed" data from that object 93 * into the function. (See {@linkplain Hasher#putObject an example} of this.) 94 * 95 * <p><b>Compatibility note:</b> Throughout this API, multibyte values are always interpreted in 96 * <i>little-endian</i> order. That is, hashing the byte array {@code {0x01, 0x02, 0x03, 0x04}} is 97 * equivalent to hashing the {@code int} value {@code 0x04030201}. If this isn't what you need, 98 * methods such as {@link Integer#reverseBytes} and {@link Ints#toByteArray} will help. 99 * 100 * <h3>Relationship to {@link Object#hashCode}</h3> 101 * 102 * <p>Java's baked-in concept of hash codes is constrained to 32 bits, and provides no separation 103 * between hash algorithms and the data they act on, so alternate hash algorithms can't be easily 104 * substituted. Also, implementations of {@code hashCode} tend to be poor-quality, in part because 105 * they end up depending on <i>other</i> existing poor-quality {@code hashCode} implementations, 106 * including those in many JDK classes. 107 * 108 * <p>{@code Object.hashCode} implementations tend to be very fast, but have weak collision 109 * prevention and <i>no</i> expectation of bit dispersion. This leaves them perfectly suitable for 110 * use in hash tables, because extra collisions cause only a slight performance hit, while poor bit 111 * dispersion is easily corrected using a secondary hash function (which all reasonable hash table 112 * implementations in Java use). For the many uses of hash functions beyond data structures, 113 * however, {@code Object.hashCode} almost always falls short -- hence this library. 114 * 115 * @author Kevin Bourrillion 116 * @since 11.0 117 */ 118 @Immutable 119 @ElementTypesAreNonnullByDefault 120 public interface HashFunction { 121 /** 122 * Begins a new hash code computation by returning an initialized, stateful {@code Hasher} 123 * instance that is ready to receive data. Example: 124 * 125 * <pre>{@code 126 * HashFunction hf = Hashing.md5(); 127 * HashCode hc = hf.newHasher() 128 * .putLong(id) 129 * .putBoolean(isActive) 130 * .hash(); 131 * }</pre> 132 */ newHasher()133 Hasher newHasher(); 134 135 /** 136 * Begins a new hash code computation as {@link #newHasher()}, but provides a hint of the expected 137 * size of the input (in bytes). This is only important for non-streaming hash functions (hash 138 * functions that need to buffer their whole input before processing any of it). 139 */ newHasher(int expectedInputSize)140 Hasher newHasher(int expectedInputSize); 141 142 /** 143 * Shortcut for {@code newHasher().putInt(input).hash()}; returns the hash code for the given 144 * {@code int} value, interpreted in little-endian byte order. The implementation <i>might</i> 145 * perform better than its longhand equivalent, but should not perform worse. 146 * 147 * @since 12.0 148 */ hashInt(int input)149 HashCode hashInt(int input); 150 151 /** 152 * Shortcut for {@code newHasher().putLong(input).hash()}; returns the hash code for the given 153 * {@code long} value, interpreted in little-endian byte order. The implementation <i>might</i> 154 * perform better than its longhand equivalent, but should not perform worse. 155 */ hashLong(long input)156 HashCode hashLong(long input); 157 158 /** 159 * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation <i>might</i> 160 * perform better than its longhand equivalent, but should not perform worse. 161 */ hashBytes(byte[] input)162 HashCode hashBytes(byte[] input); 163 164 /** 165 * Shortcut for {@code newHasher().putBytes(input, off, len).hash()}. The implementation 166 * <i>might</i> perform better than its longhand equivalent, but should not perform worse. 167 * 168 * @throws IndexOutOfBoundsException if {@code off < 0} or {@code off + len > bytes.length} or 169 * {@code len < 0} 170 */ hashBytes(byte[] input, int off, int len)171 HashCode hashBytes(byte[] input, int off, int len); 172 173 /** 174 * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation <i>might</i> 175 * perform better than its longhand equivalent, but should not perform worse. 176 * 177 * @since 23.0 178 */ hashBytes(ByteBuffer input)179 HashCode hashBytes(ByteBuffer input); 180 181 /** 182 * Shortcut for {@code newHasher().putUnencodedChars(input).hash()}. The implementation 183 * <i>might</i> perform better than its longhand equivalent, but should not perform worse. Note 184 * that no character encoding is performed; the low byte and high byte of each {@code char} are 185 * hashed directly (in that order). 186 * 187 * <p><b>Warning:</b> This method will produce different output than most other languages do when 188 * running the same hash function on the equivalent input. For cross-language compatibility, use 189 * {@link #hashString}, usually with a charset of UTF-8. For other use cases, use {@code 190 * hashUnencodedChars}. 191 * 192 * @since 15.0 (since 11.0 as hashString(CharSequence)). 193 */ hashUnencodedChars(CharSequence input)194 HashCode hashUnencodedChars(CharSequence input); 195 196 /** 197 * Shortcut for {@code newHasher().putString(input, charset).hash()}. Characters are encoded using 198 * the given {@link Charset}. The implementation <i>might</i> perform better than its longhand 199 * equivalent, but should not perform worse. 200 * 201 * <p><b>Warning:</b> This method, which reencodes the input before hashing it, is useful only for 202 * cross-language compatibility. For other use cases, prefer {@link #hashUnencodedChars}, which is 203 * faster, produces the same output across Java releases, and hashes every {@code char} in the 204 * input, even if some are invalid. 205 */ hashString(CharSequence input, Charset charset)206 HashCode hashString(CharSequence input, Charset charset); 207 208 /** 209 * Shortcut for {@code newHasher().putObject(instance, funnel).hash()}. The implementation 210 * <i>might</i> perform better than its longhand equivalent, but should not perform worse. 211 * 212 * @since 14.0 213 */ hashObject( @arametricNullness T instance, Funnel<? super T> funnel)214 <T extends @Nullable Object> HashCode hashObject( 215 @ParametricNullness T instance, Funnel<? super T> funnel); 216 217 /** 218 * Returns the number of bits (a multiple of 32) that each hash code produced by this hash 219 * function has. 220 */ bits()221 int bits(); 222 } 223