# Base38 and FourCC Codes

Both of these encode a four-character string such as `"JPEG"` as a `uint32_t`
value. Computers can compare two integer values faster than they can compare
two arbitrary strings.

Both schemes maintain ordering: if two four-character strings `s` and `t`
satisfy `(s < t)`, and those strings have valid numerical encodings, then the
numerical values also satisfy `(encoding(s) < encoding(t))`.


## FourCC

FourCC codes are not specific to Wuffs. For example, the AVI multimedia
container format can hold various sub-formats, such as "H264" or "YV12",
distinguished in the overall file format by their FourCC code.

The FourCC encoding is the straightforward sequence of each character's ASCII
encoding. The FourCC code for `"JPEG"` is `0x4A504547`, since `'J'` is `0x4A`,
`'P'` is `0x50`, etc. This is essentially 8 bits for each character, 32 bits
overall. The big-endian representation of this number is exactly the ASCII (and
UTF-8) string `"JPEG"`.

Other FourCC documentation sometimes uses a little-endian convention, so that
the `{0x4A, 0x50, 0x45, 0x47}` bytes on the wire for `"JPEG"` correspond to
the number `0x4745504A` (little-endian) instead of `0x4A504547` (big-endian).
Wuffs uses the big-endian interpretation, as it maintains ordering.
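
The two byte-order conventions can be sketched in C as follows. This is only
an illustration: `fourcc_be` and `fourcc_le` are hypothetical helper names, not
Wuffs APIs, and both assume that `s` points to at least four bytes.

```c
#include <stdint.h>

// Hypothetical helper (not a Wuffs API): reads four bytes as a big-endian
// uint32_t, so the first character lands in the most significant byte.
uint32_t fourcc_be(const char* s) {
  return ((uint32_t)(uint8_t)s[0] << 24) | ((uint32_t)(uint8_t)s[1] << 16) |
         ((uint32_t)(uint8_t)s[2] << 8) | ((uint32_t)(uint8_t)s[3] << 0);
}

// Hypothetical helper: the little-endian reading of the same four bytes, so
// the first character lands in the least significant byte.
uint32_t fourcc_le(const char* s) {
  return ((uint32_t)(uint8_t)s[0] << 0) | ((uint32_t)(uint8_t)s[1] << 8) |
         ((uint32_t)(uint8_t)s[2] << 16) | ((uint32_t)(uint8_t)s[3] << 24);
}
```

With these helpers, `fourcc_be("JPEG")` is `0x4A504547` and
`fourcc_le("JPEG")` is `0x4745504A`. The strings `"AZ00"` and `"BA00"` satisfy
`("AZ00" < "BA00")`, and so do their big-endian values, but their
little-endian values (`0x30305A41` and `0x30304142`) compare the other way
around.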


## Base38

Base38 is a tighter encoding than FourCC, fitting four characters into 21 bits
instead of 32 bits. This is achieved by using a smaller alphabet of 38 possible
values (space, 0-9, ?, and a-z), so that it cannot distinguish between e.g. an
upper case 'X' and a lower case 'x'. There's also the happy coincidence that
`38 ** 4` is slightly smaller than `2 ** 21`.
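
The size claim can be checked at compile time. The macro names below are
hypothetical, chosen only for this sketch, and `_Static_assert` assumes a C11
(or later) compiler.

```c
// Illustrative constants (hypothetical names, not Wuffs identifiers).
#define BASE38_RADIX 38u
#define BASE38_COUNT \
  (BASE38_RADIX * BASE38_RADIX * BASE38_RADIX * BASE38_RADIX)
#define BASE38_MAX (BASE38_COUNT - 1u)  // Largest possible base38 value.

// There are 38 ** 4 = 2085136 four-character strings over the alphabet,
// which is just below 2 ** 21 = 2097152.
_Static_assert(BASE38_COUNT == 2085136u, "38 ** 4");
_Static_assert(BASE38_COUNT < (1u << 21), "base38 values fit in 21 bits");

// 20 bits would not be enough, since 2 ** 20 = 1048576.
_Static_assert(BASE38_COUNT > (1u << 20), "base38 values need 21 bits");
```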

The base38 encoding of `"JPEG"` is `0x122FF6`, which is `1191926`, which is
`((21 * (38 ** 3)) + (27 * (38 ** 2)) + (16 * (38 ** 1)) + (18 * (38 ** 0)))`.
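
A minimal encoder sketch in C reproduces that arithmetic. The digit
assignment (space is 0, `'0'` to `'9'` are 1 to 10, `'?'` is 11, `'a'` to `'z'`
are 12 to 37) is an assumption inferred from the alphabet order and from the
`"JPEG"` digits above. The names `base38_digit`, `base38_encode` and
`BASE38_INVALID` are hypothetical, not Wuffs APIs.

```c
#include <stdint.h>

// Hypothetical sentinel for strings with no valid base38 encoding. It cannot
// collide with a real value, since those all fit in 21 bits.
#define BASE38_INVALID UINT32_MAX

// Maps one character to its base38 digit (0 ..= 37), or -1 if the character
// is outside the alphabet. Upper case letters fold to lower case.
int base38_digit(char c) {
  if (c == ' ') return 0;
  if ((c >= '0') && (c <= '9')) return 1 + (c - '0');
  if (c == '?') return 11;
  if ((c >= 'a') && (c <= 'z')) return 12 + (c - 'a');
  if ((c >= 'A') && (c <= 'Z')) return 12 + (c - 'A');
  return -1;
}

// Encodes the first four characters of s, most significant digit first.
// Assumes that s points to at least four bytes.
uint32_t base38_encode(const char* s) {
  uint32_t value = 0;
  for (int i = 0; i < 4; i++) {
    int digit = base38_digit(s[i]);
    if (digit < 0) {
      return BASE38_INVALID;
    }
    // Horner's method: equivalent to summing (digit * (38 ** (3 - i))).
    value = (value * 38u) + (uint32_t)digit;
  }
  return value;
}
```

Under those assumptions, `base38_encode("JPEG")` and `base38_encode("jpeg")`
both return `0x122FF6`.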

Using only 21 bits means that we can use base38 values to partition the set of
possible `uint32_t` values into file-format specific enumerations. Each package
(i.e. Wuffs implementation of a specific file format) can define up to 1024
(`2 ** 10`) different values in its own namespace, without conflicting with
other packages (assuming that there aren't e.g. two `"JPEG"` Wuffs packages in
the same library). The conventional `uint32_t` packing is:

- Bit        31 is reserved (zero).
- Bits 30 .. 10 are the base38 value, shifted by 10.
- Bits  9 ..  0 are the enumeration value.

For example, [quirk values](/doc/note/quirks.md) use this `((base38 << 10) |
enumeration)` scheme.
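
The packing and its inverse can be sketched in C as follows. The function
names are hypothetical, not Wuffs APIs, and masking out-of-range arguments
(rather than rejecting them) is a choice made only to keep the sketch short.

```c
#include <stdint.h>

// Hypothetical helper (not a Wuffs API): packs a 21-bit base38 value and a
// 10-bit enumeration value into a uint32_t, leaving bit 31 as zero.
// Out-of-range arguments are silently truncated by the masks.
uint32_t pack_base38_enum(uint32_t base38, uint32_t enumeration) {
  return ((base38 & 0x1FFFFFu) << 10) | (enumeration & 0x3FFu);
}

// Recovers bits 30 .. 10, the base38 value.
uint32_t unpack_base38(uint32_t packed) {
  return (packed >> 10) & 0x1FFFFFu;
}

// Recovers bits 9 .. 0, the enumeration value.
uint32_t unpack_enum(uint32_t packed) {
  return packed & 0x3FFu;
}
```

For the `"JPEG"` base38 value `0x122FF6` and an arbitrary enumeration value of
5 (chosen for illustration, not a real quirk), `pack_base38_enum(0x122FF6, 5)`
returns `0x48BFD805`.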