---
layout: default
title: Compression
nav_order: 4
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Compression
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview of SCSU

Compressing Unicode text for transmission or storage reduces the bandwidth and
storage space it requires. The compression scheme encodes Unicode text as a
sequence of bytes by exploiting characteristics of Unicode text. The compressed
sequence can be used on its own or as further input to a general purpose file or
disk-block based compression scheme. Note that the combination of the Unicode
compression algorithm plus disk-block based compression produces better results
than either method alone.

Strings in languages using small alphabets contain runs of characters that are
coded close together in Unicode. These runs are typically interrupted only by
punctuation characters, which are themselves coded in proximity to each other in
Unicode (usually in the Basic Latin range).
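
The observation above can be illustrated with a small sketch. The helper
`fitsOneWindow` is hypothetical and not part of ICU. It assumes, for simplicity,
that a "window" is any block of 128 code points aligned to a multiple of 128;
the actual window positions that SCSU allows are defined in UTR #6.

```c++
#include <string>

// Hypothetical helper (not an ICU API): reports whether all code units of
// `text` outside Basic Latin fall into the same 128-aligned block, so that
// a single 128-code-point window could cover them.
// Assumption: windows are aligned to multiples of 128 (a simplification).
bool fitsOneWindow(const std::u16string& text) {
    long block = -1;                      // no non-ASCII block seen yet
    for (char16_t unit : text) {
        if (unit < 0x80) {
            continue;                     // Basic Latin: spaces, digits, punctuation
        }
        const long current = unit / 0x80; // index of the 128-aligned block
        if (block < 0) {
            block = current;              // remember the first block
        } else if (block != current) {
            return false;                 // a second block would be needed
        }
    }
    return true;
}
```

For example, the Russian text `Москва, 2020!` uses only letters from
U+0400..U+047F plus Basic Latin, so the helper returns `true`, while a string
that mixes Cyrillic and Greek letters returns `false`.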

For additional detail about the compression algorithm, which has been approved
by the Unicode Consortium, please refer to [Unicode Technical Report #6 (A
Standard Compression Scheme for
Unicode)](https://www.unicode.org/reports/tr6/).

The Standard Compression Scheme for Unicode (SCSU) is designed to:

*   express all code points in Unicode

*   approximate the storage size of traditional character sets

*   facilitate the use of short strings

*   provide transparency for characters in the range `U+0020` to `U+00FF`, as
    well as `CR`, `LF` and `TAB`

*   support very simple decoders

*   support simple as well as sophisticated encoders
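
The transparency goal can be sketched as follows. The helper
`encodeTransparent` is hypothetical and not part of ICU. It assumes an encoder
in its initial state and handles only the pass-through subset named in the list
above; it is an illustration, not a full SCSU encoder.

```c++
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an ICU API). If every code unit of `text` is in
// the transparent set (U+0020..U+00FF, TAB, LF, CR), writes one byte per
// code unit to `bytes` and returns true. Otherwise returns false and leaves
// `bytes` empty.
bool encodeTransparent(const std::u16string& text,
                       std::vector<std::uint8_t>& bytes) {
    bytes.clear();
    for (char16_t unit : text) {
        const bool transparent =
            (unit >= 0x0020 && unit <= 0x00FF) ||
            unit == 0x0009 || unit == 0x000A || unit == 0x000D;
        if (!transparent) {
            bytes.clear();
            return false;   // a real encoder would emit SCSU tag bytes here
        }
        // The byte value equals the code point value.
        bytes.push_back(static_cast<std::uint8_t>(unit));
    }
    return true;
}
```

Applied to the German string `Öl fließt`, this yields the bytes
`d6 6c 20 66 6c 69 65 df 74`, the same values as the `germanSCSU` array in the
Usage section. A string containing `U+20AC` is rejected because encoding it
requires SCSU tag bytes.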

SCSU does not attempt to avoid the use of control bytes (including `NUL`) in the
compressed stream.

The compression scheme is mainly intended for use with short to medium length
Unicode strings. The resulting compressed format is intended for storage or
transmission in bandwidth-limited environments. It can be used stand-alone or as
input to traditional general purpose data compression schemes. It is not
intended as a processing format or as a general purpose interchange format.

## BOCU-1

A MIME-compatible encoding called BOCU-1 is also available in ICU. Details about
this encoding can be found in the [Unicode Technical Note
#6](https://www.unicode.org/notes/tn6/). Both SCSU and BOCU-1 are
IANA-registered names.

## Usage

The compression service in ICU is part of the conversion framework and follows
the semantics of converters. For more information on how to use ICU's conversion
service, please refer to the Usage Model section in the [Using
Converters](converters.md) chapter.

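```c++
#include <assert.h>
#include "unicode/ucnv.h"

int main() {
    /* "Öl fließt" in UTF-16 */
    const UChar germanUTF16[] = {
        0x00d6, 0x006c, 0x0020, 0x0066, 0x006c, 0x0069, 0x0065, 0x00df, 0x0074
    };
    /* the same text in SCSU */
    const uint8_t germanSCSU[] = {
        0xd6, 0x6c, 0x20, 0x66, 0x6c, 0x69, 0x65, 0xdf, 0x74
    };
    /* the arrays are not NUL-terminated, so explicit lengths are passed below */
    const int32_t utf16Length = (int32_t)(sizeof(germanUTF16) / sizeof(germanUTF16[0]));
    const int32_t scsuLength = (int32_t)sizeof(germanSCSU);
    char target[100];
    UChar uTarget[100];
    UErrorCode status = U_ZERO_ERROR;
    int32_t len;

    /* set up the SCSU converter */
    UConverter *conv = ucnv_open("SCSU", &status);
    assert(U_SUCCESS(status));

    /* compress the string using SCSU; len is the number of bytes written */
    len = ucnv_fromUChars(conv, target, 100, germanUTF16, utf16Length, &status);
    assert(U_SUCCESS(status));

    /* decompress the SCSU bytes; len is the number of UChars written */
    len = ucnv_toUChars(conv, uTarget, 100, (const char *)germanSCSU, scsuLength, &status);
    assert(U_SUCCESS(status));

    /* close the converter */
    ucnv_close(conv);
    return 0;
}
```
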