• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1 /*
2  * Copyright (C) 2011 The Guava Authors
3  *
4  * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
5  * in compliance with the License. You may obtain a copy of the License at
6  *
7  * http://www.apache.org/licenses/LICENSE-2.0
8  *
9  * Unless required by applicable law or agreed to in writing, software distributed under the License
10  * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
11  * or implied. See the License for the specific language governing permissions and limitations under
12  * the License.
13  */
14 
15 package com.google.common.hash;
16 
17 import com.google.common.primitives.Ints;
18 import com.google.errorprone.annotations.Immutable;
19 import java.nio.ByteBuffer;
20 import java.nio.charset.Charset;
21 import org.checkerframework.checker.nullness.qual.Nullable;
22 
23 /**
24  * A hash function is a collision-averse pure function that maps an arbitrary block of data to a
25  * number called a <i>hash code</i>.
26  *
27  * <h3>Definition</h3>
28  *
29  * <p>Unpacking this definition:
30  *
31  * <ul>
32  *   <li><b>block of data:</b> the input for a hash function is always, in concept, an ordered byte
33  *       array. This hashing API accepts an arbitrary sequence of byte and multibyte values (via
34  *       {@link Hasher}), but this is merely a convenience; these are always translated into raw
35  *       byte sequences under the covers.
36  *   <li><b>hash code:</b> each hash function always yields hash codes of the same fixed bit length
37  *       (given by {@link #bits}). For example, {@link Hashing#sha1} produces a 160-bit number,
38  *       while {@link Hashing#murmur3_32()} yields only 32 bits. Because a {@code long} value is
39  *       clearly insufficient to hold all hash code values, this API represents a hash code as an
40  *       instance of {@link HashCode}.
41  *   <li><b>pure function:</b> the value produced must depend only on the input bytes, in the order
42  *       they appear. Input data is never modified. {@link HashFunction} instances should always be
43  *       stateless, and therefore thread-safe.
44  *   <li><b>collision-averse:</b> while it can't be helped that a hash function will sometimes
45  *       produce the same hash code for distinct inputs (a "collision"), every hash function strives
46  *       to <i>some</i> degree to make this unlikely. (Without this condition, a function that
47  *       always returns zero could be called a hash function. It is not.)
48  * </ul>
49  *
50  * <p>Summarizing the last two points: "equal yield equal <i>always</i>; unequal yield unequal
51  * <i>often</i>." This is the most important characteristic of all hash functions.
52  *
53  * <h3>Desirable properties</h3>
54  *
55  * <p>A high-quality hash function strives for some subset of the following virtues:
56  *
57  * <ul>
58  *   <li><b>collision-resistant:</b> while the definition above requires making at least <i>some</i>
59  *       token attempt, one measure of the quality of a hash function is <i>how well</i> it succeeds
60  *       at this goal. Important note: it may be easy to achieve the theoretical minimum collision
61  *       rate when using completely <i>random</i> sample input. The true test of a hash function is
62  *       how it performs on representative real-world data, which tends to contain many hidden
63  *       patterns and clumps. The goal of a good hash function is to stamp these patterns out as
64  *       thoroughly as possible.
65  *   <li><b>bit-dispersing:</b> masking out any <i>single bit</i> from a hash code should yield only
66  *       the expected <i>twofold</i> increase to all collision rates. Informally, the "information"
67  *       in the hash code should be as evenly "spread out" through the hash code's bits as possible.
68  *       The result is that, for example, when choosing a bucket in a hash table of size 2^8,
69  *       <i>any</i> eight bits could be consistently used.
70  *   <li><b>cryptographic:</b> certain hash functions such as {@link Hashing#sha512} are designed to
71  *       make it as infeasible as possible to reverse-engineer the input that produced a given hash
72  *       code, or even to discover <i>any</i> two distinct inputs that yield the same result. These
73  *       are called <i>cryptographic hash functions</i>. But, whenever it is learned that either of
74  *       these feats has become computationally feasible, the function is deemed "broken" and should
75  *       no longer be used for secure purposes. (This is the likely eventual fate of <i>all</i>
76  *       cryptographic hashes.)
77  *   <li><b>fast:</b> perhaps self-explanatory, but often the most important consideration.
78  * </ul>
79  *
80  * <h3>Providing input to a hash function</h3>
81  *
82  * <p>The primary way to provide the data that your hash function should act on is via a {@link
83  * Hasher}. Obtain a new hasher from the hash function using {@link #newHasher}, "push" the relevant
84  * data into it using methods like {@link Hasher#putBytes(byte[])}, and finally ask for the {@code
85  * HashCode} when finished using {@link Hasher#hash}. (See an {@linkplain #newHasher example} of
86  * this.)
87  *
88  * <p>If all you want to hash is a single byte array, string or {@code long} value, there are
89  * convenient shortcut methods defined directly on {@link HashFunction} to make this easier.
90  *
91  * <p>Hasher accepts primitive data types, but can also accept any Object of type {@code T} provided
92  * that you implement a {@link Funnel}{@code <T>} to specify how to "feed" data from that object
93  * into the function. (See {@linkplain Hasher#putObject an example} of this.)
94  *
95  * <p><b>Compatibility note:</b> Throughout this API, multibyte values are always interpreted in
96  * <i>little-endian</i> order. That is, hashing the byte array {@code {0x01, 0x02, 0x03, 0x04}} is
97  * equivalent to hashing the {@code int} value {@code 0x04030201}. If this isn't what you need,
98  * methods such as {@link Integer#reverseBytes} and {@link Ints#toByteArray} will help.
99  *
100  * <h3>Relationship to {@link Object#hashCode}</h3>
101  *
102  * <p>Java's baked-in concept of hash codes is constrained to 32 bits, and provides no separation
103  * between hash algorithms and the data they act on, so alternate hash algorithms can't be easily
104  * substituted. Also, implementations of {@code hashCode} tend to be poor-quality, in part because
105  * they end up depending on <i>other</i> existing poor-quality {@code hashCode} implementations,
106  * including those in many JDK classes.
107  *
108  * <p>{@code Object.hashCode} implementations tend to be very fast, but have weak collision
109  * prevention and <i>no</i> expectation of bit dispersion. This leaves them perfectly suitable for
110  * use in hash tables, because extra collisions cause only a slight performance hit, while poor bit
111  * dispersion is easily corrected using a secondary hash function (which all reasonable hash table
112  * implementations in Java use). For the many uses of hash functions beyond data structures,
113  * however, {@code Object.hashCode} almost always falls short -- hence this library.
114  *
115  * @author Kevin Bourrillion
116  * @since 11.0
117  */
118 @Immutable
119 @ElementTypesAreNonnullByDefault
120 public interface HashFunction {
121   /**
122    * Begins a new hash code computation by returning an initialized, stateful {@code Hasher}
123    * instance that is ready to receive data. Example:
124    *
125    * <pre>{@code
126    * HashFunction hf = Hashing.md5();
127    * HashCode hc = hf.newHasher()
128    *     .putLong(id)
129    *     .putBoolean(isActive)
130    *     .hash();
131    * }</pre>
132    */
newHasher()133   Hasher newHasher();
134 
135   /**
136    * Begins a new hash code computation as {@link #newHasher()}, but provides a hint of the expected
137    * size of the input (in bytes). This is only important for non-streaming hash functions (hash
138    * functions that need to buffer their whole input before processing any of it).
139    */
newHasher(int expectedInputSize)140   Hasher newHasher(int expectedInputSize);
141 
142   /**
143    * Shortcut for {@code newHasher().putInt(input).hash()}; returns the hash code for the given
144    * {@code int} value, interpreted in little-endian byte order. The implementation <i>might</i>
145    * perform better than its longhand equivalent, but should not perform worse.
146    *
147    * @since 12.0
148    */
hashInt(int input)149   HashCode hashInt(int input);
150 
151   /**
152    * Shortcut for {@code newHasher().putLong(input).hash()}; returns the hash code for the given
153    * {@code long} value, interpreted in little-endian byte order. The implementation <i>might</i>
154    * perform better than its longhand equivalent, but should not perform worse.
155    */
hashLong(long input)156   HashCode hashLong(long input);
157 
158   /**
159    * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation <i>might</i>
160    * perform better than its longhand equivalent, but should not perform worse.
161    */
hashBytes(byte[] input)162   HashCode hashBytes(byte[] input);
163 
164   /**
165    * Shortcut for {@code newHasher().putBytes(input, off, len).hash()}. The implementation
166    * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
167    *
168    * @throws IndexOutOfBoundsException if {@code off < 0} or {@code off + len > bytes.length} or
169    *     {@code len < 0}
170    */
hashBytes(byte[] input, int off, int len)171   HashCode hashBytes(byte[] input, int off, int len);
172 
173   /**
174    * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation <i>might</i>
175    * perform better than its longhand equivalent, but should not perform worse.
176    *
177    * @since 23.0
178    */
hashBytes(ByteBuffer input)179   HashCode hashBytes(ByteBuffer input);
180 
181   /**
182    * Shortcut for {@code newHasher().putUnencodedChars(input).hash()}. The implementation
183    * <i>might</i> perform better than its longhand equivalent, but should not perform worse. Note
184    * that no character encoding is performed; the low byte and high byte of each {@code char} are
185    * hashed directly (in that order).
186    *
187    * <p><b>Warning:</b> This method will produce different output than most other languages do when
188    * running the same hash function on the equivalent input. For cross-language compatibility, use
189    * {@link #hashString}, usually with a charset of UTF-8. For other use cases, use {@code
190    * hashUnencodedChars}.
191    *
192    * @since 15.0 (since 11.0 as hashString(CharSequence)).
193    */
hashUnencodedChars(CharSequence input)194   HashCode hashUnencodedChars(CharSequence input);
195 
196   /**
197    * Shortcut for {@code newHasher().putString(input, charset).hash()}. Characters are encoded using
198    * the given {@link Charset}. The implementation <i>might</i> perform better than its longhand
199    * equivalent, but should not perform worse.
200    *
201    * <p><b>Warning:</b> This method, which reencodes the input before hashing it, is useful only for
202    * cross-language compatibility. For other use cases, prefer {@link #hashUnencodedChars}, which is
203    * faster, produces the same output across Java releases, and hashes every {@code char} in the
204    * input, even if some are invalid.
205    */
hashString(CharSequence input, Charset charset)206   HashCode hashString(CharSequence input, Charset charset);
207 
208   /**
209    * Shortcut for {@code newHasher().putObject(instance, funnel).hash()}. The implementation
210    * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
211    *
212    * @since 14.0
213    */
hashObject( @arametricNullness T instance, Funnel<? super T> funnel)214   <T extends @Nullable Object> HashCode hashObject(
215       @ParametricNullness T instance, Funnel<? super T> funnel);
216 
217   /**
218    * Returns the number of bits (a multiple of 32) that each hash code produced by this hash
219    * function has.
220    */
bits()221   int bits();
222 }
223