---
layout: default
title: Charset Detection
nav_order: 3
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Character Set Detection
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

Character set detection is the process of determining the character set, or
encoding, of character data in an unknown format. This is, at best, an imprecise
operation using statistics and heuristics. Because of this, detection works best
if you supply at least a few hundred bytes of character data that is mostly in a
single language. In some cases, the language can be determined along with the
encoding.

Several different techniques are used for character set detection. For
multi-byte encodings, the sequence of bytes is checked for legal patterns. The
detected characters are also checked against a list of frequently used characters
in that encoding. For single-byte encodings, the data is checked against a list
of the most commonly occurring three-letter groups for each language that can be
written using that encoding. The detection process can be configured to
optionally ignore HTML or XML style markup, which can interfere with the
detection process by changing the statistics.

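The "legal patterns" check for a multi-byte encoding can be illustrated with
UTF-8, whose lead and trail bytes follow a fixed bit layout. The following is a
minimal sketch that uses only the JDK; it is not ICU's implementation. The names
`Utf8PatternCheck` and `looksLikeUtf8` are hypothetical. The check covers byte
structure only: it does not reject overlong forms or surrogate code points, and
it assumes that a sequence cut off at the end of the sample counts as illegal.

```java
import java.nio.charset.StandardCharsets;

final class Utf8PatternCheck {
    /** Returns true if every byte belongs to a structurally legal UTF-8 sequence. */
    static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int lead = data[i] & 0xFF;
            int trailCount;
            if (lead < 0x80) {
                trailCount = 0;                 // 0xxxxxxx: ASCII
            } else if ((lead & 0xE0) == 0xC0) {
                trailCount = 1;                 // 110xxxxx: two-byte sequence
            } else if ((lead & 0xF0) == 0xE0) {
                trailCount = 2;                 // 1110xxxx: three-byte sequence
            } else if ((lead & 0xF8) == 0xF0) {
                trailCount = 3;                 // 11110xxx: four-byte sequence
            } else {
                return false;                   // stray trail byte or illegal lead byte
            }
            if (i + trailCount >= data.length) {
                return false;                   // sequence is truncated (assumption: illegal)
            }
            for (int k = 1; k <= trailCount; k++) {
                if ((data[i + k] & 0xC0) != 0x80) {
                    return false;               // trail bytes must look like 10xxxxxx
                }
            }
            i += trailCount + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo";
        System.out.println(looksLikeUtf8(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(looksLikeUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```
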
The input data can either be a Java input stream, or an array of bytes. The
output of the detection process is a list of possible character sets, with the
most likely one first. For simplicity, you can also ask for a Java Reader that
will read the data in the detected encoding.

There is another character set detection C++ library, the [Compact Encoding
Detector](https://github.com/google/compact_enc_det), that may have a lower
error rate, particularly when working with short samples of text.

## CharsetMatch

The `CharsetMatch` class holds the result of comparing the input data to a
particular encoding. You can use an instance of this class to get the name of
the character set, the language, and how good the match is. You can also use
this class to decode the input data.

To find out how good the match is, you use the `getConfidence()` method to get a
*confidence value*. This is an integer from 0 to 100. The higher the value, the
more confidence there is in the match. For example:

```java
CharsetMatch match = ...;
int confidence;
confidence = match.getConfidence();
if (confidence < 50) {
    // handle a poor match...
} else {
    // handle a good match...
}
```

In C, you can use the `ucsdet_getConfidence(const UCharsetMatch *ucsm, UErrorCode *status)`
function to get a confidence value.

```c
const UCharsetMatch *ucm = ...; // a match returned by the detector
UErrorCode status = U_ZERO_ERROR;
int32_t confidence = ucsdet_getConfidence(ucm, &status);
if (confidence < 50) {
    // handle a poor match...
} else {
    // handle a good match...
}
```

To get the name of the character set, which can be used as an encoding name in
Java, you use the `getName()` method:

```java
CharsetMatch match = ...;
byte[] characterData = ...;
String charsetName;
String unicodeData;
charsetName = match.getName();
unicodeData = new String(characterData, charsetName);
```
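
A JVM is only required to support a small set of standard charsets, so a
detected name is not guaranteed to be usable as a Java encoding name on every
installation. The following is a minimal sketch of a guarded decode that uses
only the JDK. The class `SafeDecode`, its `decode` method, and the choice of
fallback charset are illustrative assumptions, not part of the ICU API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SafeDecode {
    /** Decodes with the named charset if this JVM supports it, otherwise with the fallback. */
    static String decode(byte[] data, String charsetName, Charset fallback) {
        Charset charset = fallback;
        try {
            if (charsetName != null && Charset.isSupported(charsetName)) {
                charset = Charset.forName(charsetName);
            }
        } catch (IllegalArgumentException e) {
            // The name is syntactically illegal; keep the fallback.
        }
        return new String(data, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        // In real code the name would come from CharsetMatch.getName().
        System.out.println(decode(latin1, "ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

Whether a silent fallback is acceptable depends on the application; reporting
the unsupported name to the caller is an equally reasonable design.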

To get the name of the character set in C:

```c
const UCharsetMatch *ucm = ...; // a match returned by the detector
UErrorCode status = U_ZERO_ERROR;
const char *name = ucsdet_getName(ucm, &status);
```

To get the three-letter ISO code for the detected language, you use the
`getLanguage()` method. If the language could not be determined, `getLanguage()`
will return `null`. Note that language detection does not work with all charsets,
and includes only a very small set of possible languages. It should not be used if
robust, reliable language detection is required.

```java
CharsetMatch match = ...;
String languageCode;
languageCode = match.getLanguage();
if (languageCode != null) {
    // handle the language code...
}
```

The `ucsdet_getLanguage(const UCharsetMatch *ucsm, UErrorCode *status)` function
can be used in C to get the language code. If the language could not be
determined, the function will return an empty string.

```c
const UCharsetMatch *ucm = ...; // a match returned by the detector
UErrorCode status = U_ZERO_ERROR;
const char *language = ucsdet_getLanguage(ucm, &status);
```

If you want to get a Java String containing the converted data, you can use the
`getString()` method:

```java
CharsetMatch match = ...;
String unicodeData;
unicodeData = match.getString();
```

If you want to limit the number of characters in the string, pass the maximum
number of characters you want to the `getString()` method:

```java
CharsetMatch match = ...;
String unicodeData;
unicodeData = match.getString(1024);
```

To get a `java.io.Reader` to read the converted data, use the `getReader()` method:

```java
CharsetMatch match = ...;
Reader reader;
StringBuilder sb = new StringBuilder();
char[] buffer = new char[1024];
int charsRead = 0;
reader = match.getReader();
// read() returns the number of chars read, or -1 at the end of the data
while ((charsRead = reader.read(buffer, 0, buffer.length)) >= 0) {
    sb.append(buffer, 0, charsRead);
}
reader.close();
```

## CharsetDetector

The `CharsetDetector` class does the actual detection. It matches the input data
against all character sets, and computes a list of `CharsetMatch` objects to hold
the results. The input data can be supplied as an array of bytes, or as a
`java.io.InputStream`.

To use a `CharsetDetector` object, first you construct it, and then you set the
input data, using the `setText()` method. Because setting the input data is
separate from the construction, it is easy to reuse a `CharsetDetector` object:

```java
CharsetDetector detector;
byte[] byteData = ...;
InputStream streamData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
// use detector with byte data...
detector.setText(streamData);
// use detector with stream data...
```

If you want to know which character set matches your input data with the highest
confidence, you can use the `detect()` method, which will return a `CharsetMatch`
object for the match with the highest confidence:

```java
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
```

If you want to know which character set matches your input data in C, you can
use the `ucsdet_detect(UCharsetDetector *csd, UErrorCode *status)` function.

```c
UErrorCode status = U_ZERO_ERROR;
UCharsetDetector *csd = ucsdet_open(&status);
const UCharsetMatch *ucm;
static char buffer[BUFFER_SIZE] = {....};
int32_t inputLength = ...; // length of the input text
ucsdet_setText(csd, buffer, inputLength, &status);
ucm = ucsdet_detect(csd, &status);
// ucm is owned by csd; call ucsdet_close(csd) once you are done with it
```

If you want to know all of the character sets that could match your input data
with a non-zero confidence, you can use the `detectAll()` method, which will
return an array of `CharsetMatch` objects sorted by confidence, from highest to
lowest:

```java
CharsetDetector detector;
CharsetMatch[] matches;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
matches = detector.detectAll();
for (int m = 0; m < matches.length; m++) {
    // process this match...
}
```
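
One way to use the sorted list is to walk it until you reach a character set
that your application expects. The following is a minimal sketch of that idea
that runs without ICU4J: the `Candidate` class is a hypothetical stand-in for
the name and confidence of a `CharsetMatch`, and `MatchSelection`, `pick`, the
allow-list, and the threshold are all illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MatchSelection {
    /** Hypothetical stand-in for CharsetMatch: a charset name and a confidence. */
    static final class Candidate {
        final String name;
        final int confidence;

        Candidate(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the name of the first allowed candidate whose confidence is at
     * least minConfidence, or null if there is none. The list must be sorted
     * by confidence from highest to lowest.
     */
    static String pick(List<Candidate> sorted, Set<String> allowed, int minConfidence) {
        for (Candidate c : sorted) {
            if (c.confidence < minConfidence) {
                break; // every later candidate has an even lower confidence
            }
            if (allowed.contains(c.name)) {
                return c.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Candidate> sorted = Arrays.asList(
                new Candidate("ISO-8859-2", 62),
                new Candidate("windows-1252", 58),
                new Candidate("UTF-8", 15));
        Set<String> allowed = new HashSet<>(Arrays.asList("UTF-8", "windows-1252"));
        System.out.println(pick(sorted, allowed, 50));
    }
}
```

With real data, the list would be built from the array that `detectAll()`
returns, using `getName()` and `getConfidence()` on each element.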

> :point_right: **Note**: The `ucsdet_detectAll(UCharsetDetector *csd, int32_t *matchesFound, UErrorCode *status)`
> function can be used in C to detect all of the character sets. `matchesFound` is a pointer
> to a variable that will be set to the number of charsets identified that are consistent with the input data.

The `CharsetDetector` class also implements a crude *input filter* that can strip
out HTML and XML style tags. If you want to enable the input filter, which is
disabled when you construct a `CharsetDetector`, you use the `enableInputFilter()`
method, which takes a boolean. Pass in `true` if you want to enable the input
filter, and `false` if you want to disable it:

```java
CharsetDetector detector;
CharsetMatch match;
byte[] byteDataWithTags = ...;
detector = new CharsetDetector();
detector.setText(byteDataWithTags);
detector.enableInputFilter(true);
match = detector.detect();
```

To enable the input filter in C, use the
`ucsdet_enableInputFilter(UCharsetDetector *csd, UBool filter)` function.

```c
UErrorCode status = U_ZERO_ERROR;
UCharsetDetector *csd = ucsdet_open(&status);
const UCharsetMatch *ucm;
static char buffer[BUFFER_SIZE] = {....};
int32_t inputLength = ...; // length of the input text
ucsdet_setText(csd, buffer, inputLength, &status);
ucsdet_enableInputFilter(csd, TRUE);
ucm = ucsdet_detect(csd, &status);
```

If you have more detailed knowledge about the structure of the input data, it is
better to filter the data yourself before you pass it to `CharsetDetector`. For
example, you might know that the data is from an HTML page that contains CSS
styles, which will not be stripped by the input filter.

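Filtering the data yourself could look like the following minimal sketch, which
removes `<style>` blocks at the byte level before detection. It uses only the
JDK, and `StyleBlockFilter` and `stripStyleBlocks` are hypothetical names. It
assumes an ASCII-compatible encoding (it will not find the tags in UTF-16 or
UTF-32 data), matches tag names case-insensitively, and drops the rest of the
input if a block is never closed. The result can then be passed to `setText()`,
with the input filter enabled to strip the remaining tags.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class StyleBlockFilter {
    private static final byte[] OPEN = "<style".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "</style>".getBytes(StandardCharsets.US_ASCII);

    /** Returns a copy of data with every <style ...>...</style> block removed. */
    static byte[] stripStyleBlocks(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
        int i = 0;
        while (i < data.length) {
            if (matchesAt(data, i, OPEN)) {
                int end = indexOf(data, i + OPEN.length, CLOSE);
                if (end < 0) {
                    break; // unterminated block: drop the rest (assumption)
                }
                i = end + CLOSE.length;
            } else {
                out.write(data[i]);
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Compares data at pos with a lowercase ASCII pattern, ignoring ASCII case. */
    private static boolean matchesAt(byte[] data, int pos, byte[] pattern) {
        if (pos + pattern.length > data.length) {
            return false;
        }
        for (int k = 0; k < pattern.length; k++) {
            int b = data[pos + k];
            if (b >= 'A' && b <= 'Z') {
                b += 'a' - 'A';
            }
            if (b != pattern[k]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the first index at or after from where pattern occurs, or -1. */
    private static int indexOf(byte[] data, int from, byte[] pattern) {
        for (int i = from; i + pattern.length <= data.length; i++) {
            if (matchesAt(data, i, pattern)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] page = "<p>a</p><STYLE>p{color:red}</style><p>b</p>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(stripStyleBlocks(page), StandardCharsets.US_ASCII));
    }
}
```
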
You can use the `inputFilterEnabled()` method to see if the input filter is
enabled:

```java
CharsetDetector detector;
detector = new CharsetDetector();
// do a bunch of stuff with detector
// which may or may not enable the input filter...
if (detector.inputFilterEnabled()) {
    // handle enabled input filter
} else {
    // handle disabled input filter
}
```

> :point_right: **Note**: The ICU4C API provides the `ucsdet_isInputFilterEnabled(const UCharsetDetector *csd)` function
> to check whether the input filter is enabled.

The `CharsetDetector` class also has two convenience methods that let you detect
and convert the input data in one step: the `getReader()` and `getString()` methods:

```java
CharsetDetector detector;
byte[] byteData = ...;
InputStream streamData = ...;
String unicodeData;
Reader unicodeReader;
detector = new CharsetDetector();
unicodeData = detector.getString(byteData, null);
unicodeReader = detector.getReader(streamData, null);
```

> :point_right: **Note**: The second argument to the `getReader()` and `getString()` methods
> is a String called `declaredEncoding`, which is not currently used. There is also a
> `setDeclaredEncoding()` method, which is also not currently used.

The following code is equivalent to using the convenience methods:

```java
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
InputStream streamData = ...;
String unicodeData;
Reader unicodeReader;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
unicodeData = match.getString();
detector.setText(streamData);
match = detector.detect();
unicodeReader = match.getReader();
```

## Detected Encodings

The following table shows all the encodings that can be detected. You can get
this list (without the languages) by calling the `getAllDetectableCharsets()`
method:

| **Character Set** | **Languages** |
| ----------------- | ------------- |
| UTF-8             | &nbsp;        |
| UTF-16BE          | &nbsp;        |
| UTF-16LE          | &nbsp;        |
| UTF-32BE          | &nbsp;        |
| UTF-32LE          | &nbsp;        |
| Shift_JIS         | Japanese      |
| ISO-2022-JP       | Japanese      |
| ISO-2022-CN       | Simplified Chinese |
| ISO-2022-KR       | Korean        |
| GB18030           | Chinese       |
| Big5              | Traditional Chinese |
| EUC-JP            | Japanese      |
| EUC-KR            | Korean        |
| ISO-8859-1        | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
| ISO-8859-2        | Czech, Hungarian, Polish, Romanian |
| ISO-8859-5        | Russian       |
| ISO-8859-6        | Arabic        |
| ISO-8859-7        | Greek         |
| ISO-8859-8        | Hebrew        |
| ISO-8859-9        | Turkish       |
| windows-1250      | Czech, Hungarian, Polish, Romanian |
| windows-1251      | Russian       |
| windows-1252      | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
| windows-1253      | Greek         |
| windows-1254      | Turkish       |
| windows-1255      | Hebrew        |
| windows-1256      | Arabic        |
| KOI8-R            | Russian       |
| IBM420            | Arabic        |
| IBM424            | Hebrew        |
360