---
layout: default
title: Compression
nav_order: 4
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Compression
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview of SCSU

Compressing Unicode text for transmission or storage reduces bandwidth usage
and storage requirements. The compression scheme encodes Unicode text as a
sequence of bytes by exploiting characteristics of Unicode text. The compressed
sequence can be used on its own or as further input to a general-purpose file
or disk-block based compression scheme. Note that the combination of the
Unicode compression algorithm plus disk-block based compression produces better
results than either method alone.

Strings in languages using small alphabets contain runs of characters that are
coded close together in Unicode. These runs are typically interrupted only by
punctuation characters, which are themselves coded in proximity to each other in
Unicode (usually in the Basic Latin range).

For additional detail about the compression algorithm, which has been approved
by the Unicode Consortium, please refer to [Unicode Technical Report #6 (A
Standard Compression Scheme for
Unicode)](https://www.unicode.org/reports/tr6/).

The Standard Compression Scheme for Unicode (SCSU) is designed to:

* express all code points in Unicode

* approximate the storage size of traditional character sets

* facilitate the use of short strings

* provide transparency for characters in the range `U+0020` to `U+00FF`, as
  well as `CR`, `LF` and `TAB`

* support very simple decoders

* support simple as well as sophisticated encoders

It does not attempt to avoid the use of control bytes (including `NUL`) in the
compressed stream.
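
The transparency goal listed above can be illustrated with a minimal sketch of
a decoder that handles only the initial SCSU state described in Unicode
Technical Report #6: single-byte mode with the default dynamic window at
`U+0080` active. This is not ICU's implementation. The function name
`decodeInitialStateScsu` is hypothetical, and the sketch rejects every tag byte
instead of implementing window switching, quoting, or Unicode mode.

```c++
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of ICU. Decodes SCSU bytes that never leave
// the initial state (single-byte mode, dynamic window 0 at U+0080 active).
// Returns false on the first tag byte, because this sketch does not implement
// window switching, quoting, or Unicode mode.
bool decodeInitialStateScsu(const std::vector<uint8_t>& bytes,
                            std::u16string& out) {
    out.clear();
    for (uint8_t b : bytes) {
        if (b >= 0x80) {
            // Upper half: offset into the active window, which starts at U+0080.
            out.push_back(static_cast<char16_t>(0x0080 + (b - 0x80)));
        } else if (b >= 0x20 || b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D) {
            // Basic Latin plus NUL, TAB, LF and CR pass through unchanged.
            out.push_back(static_cast<char16_t>(b));
        } else {
            // Every other byte below 0x20 is an SCSU tag.
            return false;
        }
    }
    return true;
}
```

Under these assumptions, text limited to `U+0020` through `U+00FF` decodes byte
for byte to the code point with the same value, which is why such text takes
one byte per character in SCSU, the same as in Latin-1.
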

The compression scheme is mainly intended for use with short to medium-length
Unicode strings. The resulting compressed format is intended for storage or
transmission in bandwidth-limited environments. It can be used stand-alone or as
input to traditional general-purpose data compression schemes. It is not
intended as a processing format or as a general-purpose interchange format.

## BOCU-1

A MIME-compatible encoding called BOCU-1 is also available in ICU. Details about
this encoding can be found in [Unicode Technical Note
#6](https://www.unicode.org/notes/tn6/). Both SCSU and BOCU-1 are IANA
registered names.

## Usage

The compression service in ICU is part of the conversion framework and follows
the semantics of converters. For more information on how to use ICU's conversion
service, please refer to the Usage Model section in the [Using
Converters](converters.md) chapter.

```c++
#include <cassert>
#include <cstdint>
#include <unicode/ucnv.h>

int main() {
    /* "Öl fließt" in UTF-16; the array is not NUL-terminated,
       so its length is passed explicitly below */
    const UChar germanUTF16[] = {
        0x00d6, 0x006c, 0x0020, 0x0066, 0x006c, 0x0069, 0x0065, 0x00df, 0x0074
    };
    const int32_t utf16Length = sizeof(germanUTF16) / sizeof(germanUTF16[0]);

    /* the same text in SCSU; also not NUL-terminated */
    const uint8_t germanSCSU[] = {
        0xd6, 0x6c, 0x20, 0x66, 0x6c, 0x69, 0x65, 0xdf, 0x74
    };
    const int32_t scsuLength = sizeof(germanSCSU);

    char target[100];
    UChar uTarget[100];
    UErrorCode status = U_ZERO_ERROR;
    UConverter *conv;
    int32_t len;

    /* set up the SCSU converter */
    conv = ucnv_open("SCSU", &status);
    assert(U_SUCCESS(status));

    /* compress the string using SCSU */
    len = ucnv_fromUChars(conv, target, 100, germanUTF16, utf16Length, &status);
    assert(U_SUCCESS(status));

    /* decompress the SCSU bytes back to UTF-16 */
    len = ucnv_toUChars(conv, uTarget, 100,
                        reinterpret_cast<const char *>(germanSCSU), scsuLength,
                        &status);
    assert(U_SUCCESS(status));

    /* close the converter */
    ucnv_close(conv);
    return 0;
}
```