---
title: Updating Script Metadata
---

# Updating Script Metadata

### New Unicode scripts

We should work on script metadata early for a Unicode version, so that it is available for tools (such as Mark's "UCA" tools).

- Unicode 9/CLDR 29: New scripts in CLDR but not yet in ICU caused trouble.
- Unicode 10: Working on a pre\-CLDR\-31 branch, plan to merge into CLDR trunk after CLDR 31 is done.
- Should the script metadata code live in the Unicode Tools, so that we don't need a CLDR branch during early Unicode next\-version work?

If the new Unicode version's PropertyValueAliases.txt does not have lines for Block and Script properties yet, then create a preliminary version. Diff the Blocks.txt file and UnicodeData.txt to find new scripts. Get the script codes from <http://www.unicode.org/iso15924/codelists.html>. Follow existing patterns for block and script names, especially for abbreviations. Do not add abbreviations (which differ from the long forms) unless there is a well\-established pattern in the existing data.

Aside from the instructions below for all script metadata changes, new script codes need English names (common/main/en.xml) and need to be added to common/supplemental/coverageLevels.xml, under key %script100, so that the new script names will show up in the survey tool. For example, see the [changes for new Unicode 8 scripts](https://unicode-org.atlassian.net/browse/CLDR-8109).

Can we add new scripts in CLDR *trunk* before or only after adding them to CLDR's copy of ICU4J? We did add new Unicode 9 scripts in CLDR 29 before adding them to ICU4J. The CLDR unit tests do not fail any more for scripts that are newer than the Unicode version in CLDR's copy of ICU.

### Sample characters

We need sample characters for the "UCA" tools for generating FractionalUCA.txt.

Look for patterns of what kinds of characters we have picked for other scripts, for example the script's letter "KA". We basically want a character where people say "that looks Greek", and the same shape should not be used in multiple scripts. So for Latin we use "L", not "A". We usually prefer consonants, if applicable, but it is more important that a character look unique across scripts. It should be a *letter*, and if possible should not be a combining mark. If a script is used for multiple languages, it would be nice if the letter were commonly used in the majority language. Compare with the [charts for existing scripts](http://www.unicode.org/charts/), especially related ones.

### Editing the spreadsheet

Google Spreadsheet: [Script Metadata](https://docs.google.com/spreadsheets/d/1Y90M0Ie3MUJ6UVCRDOypOtijlMDLNNyyLk36T6iMu0o/edit#gid=0)

Use and copy cell formulas rather than duplicating contents, if possible. Look for which cells have formulas in existing data, especially for Unicode 1\.1 and 7\.0 scripts.

For example,

- Script names should only be entered on the LikelyLanguage sheet. Other sheets should use a formula to map from the script code.
- On the Samples sheet, use a formula to map from the code point to the actual character. This is especially important for avoiding mistakes since almost no one will have font support for the new scripts, which means that most people will see "Tofu" glyphs for the sample characters.

### Script Metadata properties file

1. Go to the spreadsheet [Script Metadata](https://docs.google.com/spreadsheets/d/1Y90M0Ie3MUJ6UVCRDOypOtijlMDLNNyyLk36T6iMu0o/edit#gid=0)
    1. File\>Download as\>Comma Separated Values
    2. Location/Name \= {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/Script\_Metadata.csv
    3. Refresh files (eclipse), then compare with the previous version as a sanity check. If there are no new scripts for the target Unicode version of the CLDR release you're working on, then skip the rest of the steps below. For example, script "Toto" is ignored for CLDR 39 because the target Unicode release of CLDR 39 is Unicode 13 and "Toto" will be added in Unicode 14\.
2. **Note: VM arguments**
    1. Each tool (and test) needs \-DCLDR\_DIR\=/usr/local/google/home/mscherer/cldr/uni/src (or wherever your repo root is)
    2. It is easiest to set this once in the global Preferences, rather than in the Run Configuration for each tool.
    3. Most of these tools also need \-DSCRIPT\_UNICODE\_VERSION\=14 (set to the upcoming Unicode version), but it is easier to edit the ScriptMetadata.java line that sets the UNICODE\_VERSION variable.
    4. Run {cldr}/tools/cldr\-code/src/test/java/org/unicode/cldr/unittest/TestScriptMetadata.java
    5. A common error is that some of the data from the spreadsheet is missing or has incorrect values.
3. Run GenerateScriptMetadata, which will produce a modified [common/properties/scriptMetadata.txt](https://github.com/unicode-org/cldr/blob/main/common/properties/scriptMetadata.txt) file.
    1. If this ignores the new scripts: Check the \-DSCRIPT\_UNICODE\_VERSION or the ScriptMetadata.java UNICODE\_VERSION.
    2. Add the English script names (from the script metadata spreadsheet) to common/main/en.xml.
    3. Add the French script names from [ISO 15924](https://www.unicode.org/iso15924/iso15924-codes.html) to common/main/fr.xml, but mark them as draft\="provisional".
    4. Add the script codes to common/supplemental/coverageLevels.xml (under key %script100\) so that the new script names will show up in the CLDR survey tool.
        1. See [\#8109\#comment:4](https://unicode-org.atlassian.net/browse/CLDR-8109#comment:4) [r11491](https://github.com/unicode-org/cldr/commit/1d6f2a4db84cc449983c7a01e5a2679dc1827598)
        2. See changes for Unicode 10: <http://unicode.org/cldr/trac/review/9882>
        3. See changes for Unicode 12: [CLDR\-11478](https://unicode-org.atlassian.net/browse/CLDR-11478) [commit/647ce01](https://github.com/unicode-org/cldr/commit/be3000629ca3af2ae77de6304480abefe647ce01)
    5. Maybe add the script codes to TestCoverageLevel.java variable script100\.
        1. Starting with [cldr/pull/1296](https://github.com/unicode-org/cldr/pull/1296) we should not need to list a script here explicitly unless it is Identifier\_Type\=Recommended.
    6. Remove new script codes from $scriptNonUnicode in common/supplemental/attributeValueValidity.xml if needed.
    7. For the following step to work as expected, the CLDR copy of the IANA BCP 47 language subtag registry must be updated (at least with the new script codes).
        1. Copy the latest version of https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry to {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/language\-subtag\-registry
        2. Consider copying only the new script subtags (and making a note near the top of the CLDR file, or lines like "Comments: Unicode 14 script manually added 2021\-06\-01") to avoid having to update other parts of CLDR.
    8. Run GenerateValidityXML.java like this:
        1. See [Update Validity XML](https://cldr.unicode.org/development/updating-codes/update-validity-xml)
        2. This needs the previous version of CLDR in a sibling folder.
            1. See [Creating the Archive](https://cldr.unicode.org/development/creating-the-archive) for details on running the CheckoutArchive tool.
        3. Now run GenerateValidityXML.java
        4. If this crashes with a NullPointerException trying to create a Validity object, check that ToolConstants.LAST\_RELEASE\_VERSION is set to the actual last release.
            1. Currently, the CHART\_VERSION must be a simple integer, no ".1" suffix.
    9. At least script.xml should show the new scripts. The generator overwrites the source data file; use ```git diff``` or ```git difftool``` to make sure the new scripts have been added.
    10. Run GenerateMaximalLocales, [as described on the likelysubtags page](https://cldr.unicode.org/development/updating-codes/likelysubtags-and-default-content), which generates another two files.
    11. Compare the latest git master files with the generated ones: meld common/supplemental ../Generated/cldr/supplemental
        1. Copy likelySubtags.xml and supplementalMetadata.xml to the latest git master if they have changes.
    12. Compare the generated files with previous versions as a sanity check.
    13. Run the CLDR unit tests.
        1. Project cldr\-core: Debug As \> Maven test
    14. These tests have sometimes failed:
        1. LikelySubtagsTest
        2. TestInheritance
        3. They may need special adjustments, for example in GenerateMaximalLocales.java adding an extra entry to its MAX\_ADDITIONS or LANGUAGE\_OVERRIDES.
4. Check in the updated files.

Problems typically arise because a non\-standard name is used for a territory name. That can be fixed and the process rerun.
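For the sanity check after downloading Script\_Metadata.csv (comparing it with the previous version), it can help to list which script codes were added or removed. The following is a minimal sketch and not part of the CLDR tools. The helper names `script_codes` and `diff_scripts` are hypothetical, and the position of the script code column is an assumption: pass whatever column index matches the actual CSV layout.

```python
import csv
import io


def script_codes(csv_text, code_column):
    """Collect values from one CSV column that look like ISO 15924 codes.

    code_column is an assumption about the CSV layout; adjust it to the
    column that actually holds the four-letter script code.
    """
    codes = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= code_column:
            continue  # skip short or empty rows
        value = row[code_column].strip()
        # Script codes are four ASCII letters in title case, e.g. "Latn".
        if (len(value) == 4 and value.isascii() and value.isalpha()
                and value[0].isupper() and value[1:].islower()):
            codes.add(value)
    return codes


def diff_scripts(old_text, new_text, code_column):
    """Return (added, removed) script codes as two sorted lists."""
    old = script_codes(old_text, code_column)
    new = script_codes(new_text, code_column)
    return sorted(new - old), sorted(old - new)
```

To use it, read the previous and the newly downloaded CSV files as UTF-8 text and pass both strings to `diff_scripts`. An empty "added" list suggests there are no new scripts in the download. A non-empty "removed" list is worth investigating before continuing.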