---
layout: default
title: Charset Detection
nav_order: 3
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Character Set Detection
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

Character set detection is the process of determining the character set, or
encoding, of character data in an unknown format. This is, at best, an imprecise
operation using statistics and heuristics. Because of this, detection works best
if you supply at least a few hundred bytes of character data that's mostly in a
single language. In some cases, the language can be determined along with the
encoding.

Several different techniques are used for character set detection. For
multi-byte encodings, the sequence of bytes is checked for legal patterns. The
detected characters are also checked against a list of frequently used characters
in that encoding. For single-byte encodings, the data is checked against a list
of the most commonly occurring three-letter groups for each language that can be
written using that encoding. The detection process can be configured to
optionally ignore HTML or XML style markup, which can interfere with the
detection process by changing the statistics.

The input data can either be a Java input stream, or an array of bytes. The
output of the detection process is a list of possible character sets, with the
most likely one first. For simplicity, you can also ask for a Java Reader that
will read the data in the detected encoding.

There is another C++ character set detection library, the [Compact Encoding
Detector](https://github.com/google/compact_enc_det), that may have a lower
error rate, particularly when working with short samples of text.
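
One of the techniques mentioned in this overview is checking a byte sequence for
legal multi-byte patterns. The following is a minimal, self-contained sketch of
that idea for UTF-8 only. It is not ICU's implementation, and the names
`Utf8PatternSketch` and `looksLikeUtf8` are hypothetical. It checks only the
lead/trail byte structure, so it does not reject every ill-formed sequence (for
example overlong three-byte forms or encoded surrogates), and it applies none of
the statistical checks a real detector relies on.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; this is not ICU's detection code.
final class Utf8PatternSketch {

    /** Returns true if every sequence in the sample has a legal UTF-8 lead/trail byte structure. */
    static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int lead = data[i] & 0xFF;
            int trailCount;
            if (lead < 0x80) {
                trailCount = 0;              // ASCII
            } else if (lead >= 0xC2 && lead <= 0xDF) {
                trailCount = 1;              // two-byte sequence
            } else if (lead >= 0xE0 && lead <= 0xEF) {
                trailCount = 2;              // three-byte sequence
            } else if (lead >= 0xF0 && lead <= 0xF4) {
                trailCount = 3;              // four-byte sequence
            } else {
                return false;                // stray trail byte or illegal lead byte
            }
            for (int k = 1; k <= trailCount; k++) {
                if (i + k >= data.length) {
                    return true;             // sample was cut mid-character; tolerate it
                }
                if ((data[i + k] & 0xC0) != 0x80) {
                    return false;            // trail bytes must look like 10xxxxxx
                }
            }
            i += trailCount + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));   // true
        System.out.println(looksLikeUtf8(latin1)); // false
    }
}
```

Pure ASCII input passes this check as well, which is one reason a legality test
alone cannot decide between ASCII-compatible encodings.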

## CharsetMatch

The `CharsetMatch` class holds the result of comparing the input data to a
particular encoding. You can use an instance of this class to get the name of
the character set, the language, and how good the match is. You can also use
this class to decode the input data.

To find out how good the match is, you use the `getConfidence()` method to get a
*confidence value*. This is an integer from 0 to 100. The higher the value, the
more confidence there is in the match. For example:

```java
CharsetMatch match = ...;
int confidence;
confidence = match.getConfidence();
if (confidence < 50) {
    // handle a poor match...
} else {
    // handle a good match...
}
```

In C, you can use the `ucsdet_getConfidence(const UCharsetMatch *ucsm, UErrorCode *status)`
function to get a confidence value.

```c
const UCharsetMatch *ucm = ...; // e.g. the result of ucsdet_detect()
UErrorCode status = U_ZERO_ERROR;
int32_t confidence = ucsdet_getConfidence(ucm, &status);
if (confidence < 50) {
    // handle a poor match...
} else {
    // handle a good match...
}
```

To get the name of the character set, which can be used as an encoding name in
Java, you use the `getName()` method:

```java
CharsetMatch match = ...;
byte characterData[] = ...;
String charsetName;
String unicodeData;
charsetName = match.getName();
unicodeData = new String(characterData, charsetName);
```

To get the name of the character set in C:

```c
const UCharsetMatch *ucm = ...; // e.g. the result of ucsdet_detect()
UErrorCode status = U_ZERO_ERROR;
const char *name = ucsdet_getName(ucm, &status);
```

To get the three-letter ISO code for the detected language, you use the
`getLanguage()` method. If the language could not be determined, `getLanguage()`
will return `null`. Note that language detection does not work with all charsets,
and includes only a very small set of possible languages. It should not be used if
robust, reliable language detection is required.

```java
CharsetMatch match = ...;
String languageCode;
languageCode = match.getLanguage();
if (languageCode != null) {
    // handle the language code...
}
```

The `ucsdet_getLanguage(const UCharsetMatch *ucsm, UErrorCode *status)` function
can be used in C to get the language code. If the language could not be
determined, the function will return an empty string.

```c
const UCharsetMatch *ucm = ...; // e.g. the result of ucsdet_detect()
UErrorCode status = U_ZERO_ERROR;
const char *language = ucsdet_getLanguage(ucm, &status);
```

If you want to get a Java String containing the converted data, you can use the
`getString()` method:

```java
CharsetMatch match = ...;
String unicodeData;
unicodeData = match.getString();
```

If you want to limit the number of characters in the string, pass the maximum
number of characters you want to the `getString()` method:

```java
CharsetMatch match = ...;
String unicodeData;
unicodeData = match.getString(1024);
```

To get a `java.io.Reader` to read the converted data, use the `getReader()` method:

```java
CharsetMatch match = ...;
Reader reader;
StringBuilder sb = new StringBuilder();
char[] buffer = new char[1024];
int charsRead = 0;
reader = match.getReader();
// read() returns the number of chars read, or -1 at end of stream
while ((charsRead = reader.read(buffer, 0, buffer.length)) >= 0) {
    sb.append(buffer, 0, charsRead);
}
reader.close();
```

## CharsetDetector

The `CharsetDetector` class does the actual detection. It matches the input data
against all character sets, and computes a list of `CharsetMatch` objects to hold
the results. The input data can be supplied as an array of bytes, or as a
`java.io.InputStream`.

To use a `CharsetDetector` object, first you construct it, and then you set the
input data, using the `setText()` method.
Because setting the input data is
separate from the construction, it is easy to reuse a `CharsetDetector` object:

```java
CharsetDetector detector;
byte[] byteData = ...;
InputStream streamData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
// use detector with byte data...
detector.setText(streamData);
// use detector with stream data...
```

If you want to know which character set matches your input data with the highest
confidence, you can use the `detect()` method, which will return a `CharsetMatch`
object for the match with the highest confidence:

```java
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
```

If you want to know which character set matches your input data in C, you can
use the `ucsdet_detect(UCharsetDetector *csd, UErrorCode *status)` function.

```c
UErrorCode status = U_ZERO_ERROR;
UCharsetDetector *csd = ucsdet_open(&status);
const UCharsetMatch *ucm;
static char buffer[BUFFER_SIZE] = {....};
int32_t inputLength = ...; // length of the input text
ucsdet_setText(csd, buffer, inputLength, &status);
ucm = ucsdet_detect(csd, &status);
// ... use ucm, then release the detector when done
ucsdet_close(csd);
```

If you want to know all of the character sets that could match your input data
with a non-zero confidence, you can use the `detectAll()` method, which will
return an array of `CharsetMatch` objects sorted by confidence, from highest to
lowest:

```java
CharsetDetector detector;
CharsetMatch[] matches;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
matches = detector.detectAll();
for (int m = 0; m < matches.length; m += 1) {
    // process this match...
}
```

> :point_right: **Note**: The `ucsdet_detectAll(UCharsetDetector *csd, int32_t *matchesFound, UErrorCode *status)`
> function can be used in C to detect all of the matching character sets. `matchesFound` is a pointer
> to a variable that will be set to the number of charsets identified that are consistent with the input data.

The `CharsetDetector` class also implements a crude *input filter* that can strip
out HTML and XML style tags. If you want to enable the input filter, which is
disabled when you construct a `CharsetDetector`, you use the `enableInputFilter()`
method, which takes a boolean. Pass in `true` if you want to enable the input
filter, and `false` if you want to disable it:

```java
CharsetDetector detector;
CharsetMatch match;
byte[] byteDataWithTags = ...;
detector = new CharsetDetector();
detector.setText(byteDataWithTags);
detector.enableInputFilter(true);
match = detector.detect();
```

To enable the input filter in C, you can use the
`ucsdet_enableInputFilter(UCharsetDetector *csd, UBool filter)` function.

```c
UErrorCode status = U_ZERO_ERROR;
UCharsetDetector *csd = ucsdet_open(&status);
const UCharsetMatch *ucm;
static char buffer[BUFFER_SIZE] = {....};
int32_t inputLength = ...; // length of the input text
ucsdet_setText(csd, buffer, inputLength, &status);
ucsdet_enableInputFilter(csd, TRUE);
ucm = ucsdet_detect(csd, &status);
// ... use ucm, then release the detector when done
ucsdet_close(csd);
```

If you have more detailed knowledge about the structure of the input data, it is
better to filter the data yourself before you pass it to `CharsetDetector`. For
example, you might know that the data is from an HTML page that contains CSS
styles, which will not be stripped by the input filter.
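
As a rough illustration of filtering the data yourself, the sketch below removes
`<style>...</style>` blocks from raw bytes before detection. It is an assumption-laden
example, not part of ICU: the class `StyleBlockStripper` and the method
`stripStyleBlocks` are hypothetical names, and the approach assumes an
ASCII-compatible encoding, so it would not find tags in UTF-16 or UTF-32 data.
The bytes are viewed through ISO-8859-1 only because that charset maps every byte
to exactly one char and back, which leaves the non-markup bytes unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical pre-filter; assumes the markup itself is ASCII-compatible.
final class StyleBlockStripper {

    // Matches <style ...> ... </style>, case-insensitively and across line breaks.
    private static final Pattern STYLE_BLOCK = Pattern.compile(
            "<style\\b[^>]*>.*?</style\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns a copy of the input with each style block replaced by a single space. */
    static byte[] stripStyleBlocks(byte[] raw) {
        // ISO-8859-1 round-trips all 256 byte values, so other bytes are preserved as-is.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        String stripped = STYLE_BLOCK.matcher(text).replaceAll(" ");
        return stripped.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] page = "<html><style>p{color:red}</style><p>caf\u00e9</p></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] filtered = stripStyleBlocks(page);
        System.out.println(page.length + " bytes before, " + filtered.length + " bytes after");
    }
}
```

The filtered array can then be passed to `detector.setText()`, optionally with the
built-in input filter enabled to strip the remaining tags.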

You can use the `inputFilterEnabled()` method to see if the input filter is
enabled:

```java
CharsetDetector detector;
detector = new CharsetDetector();
// do a bunch of stuff with detector
// which may or may not enable the input filter...
if (detector.inputFilterEnabled()) {
    // handle enabled input filter
} else {
    // handle disabled input filter
}
```

> :point_right: **Note**: The ICU4C API provides the `ucsdet_isInputFilterEnabled(const UCharsetDetector *csd)` function
> to check whether the input filter is enabled.

The `CharsetDetector` class also has two convenience methods that let you detect
and convert the input data in one step: the `getReader()` and `getString()` methods:

```java
CharsetDetector detector;
byte[] byteData = ...;
InputStream streamData = ...;
String unicodeData;
Reader unicodeReader;
detector = new CharsetDetector();
unicodeData = detector.getString(byteData, null);
unicodeReader = detector.getReader(streamData, null);
```

> :point_right: **Note**: The second argument to the `getReader()` and `getString()` methods
> is a String called `declaredEncoding`, which is not currently used. There is also a
> `setDeclaredEncoding()` method, which is also not currently used.

The following code is equivalent to using the convenience methods:

```java
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
InputStream streamData = ...;
String unicodeData;
Reader unicodeReader;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
unicodeData = match.getString();
detector.setText(streamData);
match = detector.detect();
unicodeReader = match.getReader();
```

## Detected Encodings

The following table shows all the encodings that can be detected.
You can get
this list (without the languages) by calling the `getAllDetectableCharsets()`
method:

| **Character Set** | **Languages** |
| ----------------- | ------------- |
| UTF-8 | |
| UTF-16BE | |
| UTF-16LE | |
| UTF-32BE | |
| UTF-32LE | |
| Shift_JIS | Japanese |
| ISO-2022-JP | Japanese |
| ISO-2022-CN | Simplified Chinese |
| ISO-2022-KR | Korean |
| GB18030 | Chinese |
| Big5 | Traditional Chinese |
| EUC-JP | Japanese |
| EUC-KR | Korean |
| ISO-8859-1 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
| ISO-8859-2 | Czech, Hungarian, Polish, Romanian |
| ISO-8859-5 | Russian |
| ISO-8859-6 | Arabic |
| ISO-8859-7 | Greek |
| ISO-8859-8 | Hebrew |
| ISO-8859-9 | Turkish |
| windows-1250 | Czech, Hungarian, Polish, Romanian |
| windows-1251 | Russian |
| windows-1252 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
| windows-1253 | Greek |
| windows-1254 | Turkish |
| windows-1255 | Hebrew |
| windows-1256 | Arabic |
| KOI8-R | Russian |
| IBM420 | Arabic |
| IBM424 | Hebrew |
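
If you plan to decode detected data with the JDK's own charset support (for
example `new String(bytes, charsetName)`), it may be worth checking which of the
names in the table your Java runtime recognizes, since the set of available
charsets can vary between runtimes. The sketch below uses only
`java.nio.charset.Charset.isSupported()`; the class `SupportedCharsetCheck` and
the method `unsupported` are hypothetical names, and the name list is simply
copied from the table above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; reports which charset names this JVM cannot decode.
final class SupportedCharsetCheck {

    /** Returns the subset of the given names that the running JVM does not support. */
    static List<String> unsupported(String... names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            boolean supported;
            try {
                supported = Charset.isSupported(name);
            } catch (IllegalCharsetNameException e) {
                supported = false; // the name is not even syntactically legal
            }
            if (!supported) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Names copied from the table of detectable encodings.
        String[] detectable = {
            "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE",
            "Shift_JIS", "ISO-2022-JP", "ISO-2022-CN", "ISO-2022-KR",
            "GB18030", "Big5", "EUC-JP", "EUC-KR",
            "ISO-8859-1", "ISO-8859-2", "ISO-8859-5", "ISO-8859-6",
            "ISO-8859-7", "ISO-8859-8", "ISO-8859-9",
            "windows-1250", "windows-1251", "windows-1252", "windows-1253",
            "windows-1254", "windows-1255", "windows-1256",
            "KOI8-R", "IBM420", "IBM424"
        };
        System.out.println("Not supported by this JVM: " + unsupported(detectable));
    }
}
```

For names the runtime does not support, `CharsetMatch.getString()` or
`CharsetMatch.getReader()` can be used to decode the data instead.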