third_party/icu/README.google

This directory contains the source code of ICU 4.2.1 for C/C++

1. It was obtained with the following:

    $ svn export --native-eol LF http://source.icu-project.org/repos/icu/icu/tags/release-4-2-1 icu42

2. The following directories were removed because they're not used by Chromium
   at the moment:
   as_is
   packaging
   source/extra
   source/sample
   source/layout
   source/layoutex

3. Platform header files for Linux and Mac OS X:
   On Linux and Mac OS X, 'runConfigureICU Linux' and 'runConfigureICU MacOSX'
   are run to generate source/common/unicode/platform.h.
   Rename it to 'plinux.h' and 'pmac.h' on Linux and Mac and check them in.

   The Mac 'pmac.h' file needs to have patches/pmac.h.patch applied.

   Change source/common/unicode/umachine.h to refer to plinux.h and pmac.h
   on Linux and Mac, respetively.

4. To avoid name collisions (two different versions of StringPiece
   are in Chrome's base and ICU), make the use of 'icu::' namespace
   qualifier required by setting U_USING_ICU_NAMESPACE to 0 in
   source/common/unicode/uversion.h

   In addition, the patches for ICU ticket 6935
   (http://icu-project.org/trac/ticket/6935) are applied.

   The combined patch is patches/namespace.patch.txt

5. The word breaking for Chinese and Japanese were modified to use a word
   frequency list with the following patch and cjdict.txt.

   In addition, the word breaking rule for ASCII and full-width full stop(period)
   surrounded by letters has been modified to fit our need for segmenting
   a host name into its components  (e.g. treating 'www.google.com' not as
   a single word but as 5 words). It's what ICU 3.8 did before UTR 29
   changed the rule (WB #6, #7).  This also let us pass
   LayoutTests/css1/text_properties/text_transform.html without rebaselining.

   These patches alone will not work without build-related changes mentioned
   in #10 below.

   - patches/segmentation.patch.txt :
       Adds a dictionary (word-frequency)-based word breaking for CJK
       (Korean is supported in the code, but it does not do anything
        because we don't have a Korean word-list.)

   - source/data/brkitr/cjdict.txt :
       Chinese and Japanese word frequency list.
       See the file for license/copyright notice

   - source/data/brkitr/cc_edict.txt :
       the list of words derived from CC-Edict.)

   The following two files were removed (because Japanese breaking rules
   are now the same as that of other langauges).

   - source/data/brkitr/word_ja.txt
   - source/data/brkitr/ja.txt

   If you want to run ICU tests, you have to copy source/data/brkitr/cjdict.txt
   to source/test/testdata/cjdict-truncated.txt to pass TestTrieWithValue test.

6. A minor break iterator change

   - patches/brkitr.patch.txt

7. Converter changes : converters.patch.txt
  - Include what we really need. See source/data/mappings/ucmlocal.txt
  - Alias and mapping changes : source/data/mappings/convrtrs.txt
  - Changes several tables and add six new tables, three of which
    are 'fake' tables for ISO-2022-CN(-Ext).
  - ucnv2022.c is modified to use 3 'fake' tables added above for
    ISO-2022-CN(-Ext).

8. Locale changes
  - patches/locale1.patch.txt :
      Filipino locale, exemplar character set changes for CJK + 9 Indian
      locales with minor fixes for Danish, Hungarian, Turkish, Korean
      and Catalan.

  - patches/locale2.patch.txt :
      The minimum locale data Chrome needs for 35 languages Chrome is
      not localized to. Each locale data file has ExemplarCharacters,
      LocaleScript, layout, and the name of the language for a locale
      in its native language.

  - patches/locale3.patch.txt : Locale build configuration files

9. Removal of unihan collation tables from data/coll/{zh,ja,ko}.txt

  - patches/unihan.patch.txt:
    unihan collation tables are never used in Chrome/Webkit, but it takes
    about 1MB in the uncompressed ICU data file in ICU 4.2.1.

10. Build-related changes

  - patches/wpo.patch
  - patches/windows.patch
  - patches/data.build.patch :
      To remove some data files we don't use and cut down the data size.
  - patches/data.build.win.patch :
      Windows-only data build patch. Add a new target DATALIB to makedata.mak
  - add an empty file (stubdatabuilt.txt) to source/stubdata

11. Pre-built data libraries are checked in.

    - source/data/in/icudt42l.dat : Built on Linux with all the patches
      above applied,
    - icudt42.dll : With icudt42l.dat in place, all the patches applied
      and header files moved (#11 below), generated in bin/ by building
      icudt_build project of build/icudt_build.sln on Windows.
      It's made in bin/ and moved to the top and checked in.
    - {mac,linux}/icudt42l_dat.s : Built on Mac and Linux with all the
      patches above applied and checked in.
      linux needs the '@' in the preamble changed to '%'. See
      http://codereview.chromium.org/215026.
      mac/icudt42l_dat.s needs one line added after it is generated.  A
      .private_extern directive needs to be added so that the top of the
      file looks like:

.globl _icudt42_dat
        .private_extern _icudt42_dat
        .data

12. The header files were moved as shown below:

   source/common/unicode ==> public/common/unicode
   source/i18n/unicode   ==> public/i18n/unicode

13. The patch for a memory leak in i18n/timezone.cpp (Windows only):
    see http://bugs.icu-project.org/trac/ticket/7135

    - patches/tzmemory.patch

14. The patch for a crash in common/putil.c (Linux only):
    see http://bugs.icu-project.org/trac/ticket/7177

    - patches/linuxtz.patch

15. The patch for Linux locale detection

    - patches/locdet.patch