---
title: Updating DTDs
---

# Updating DTDs

## Introduction

CLDR makes special use of XML because of the way it is structured. In particular, the XML is designed so that you can read in a CLDR XML file and interpret it as an unordered list of \<path,value> pairs, called a CLDRFile internally. These path/value pairs can be added to or deleted, and then the CLDRFile can be written back out to disk, resulting in a valid XML file. That is a very powerful mechanism, and also allows for the CLDR inheritance model.

Sounds simple, right? But it isn't quite that easy.

## Summary

In summary, when you add an element, attribute, or new kind of attribute value, there are some important steps you must also take. Note that running our unit tests and ConsoleCheck will catch most of these, but you should understand what is going on. Make sure that you don't break any of the invariants below (read through once to make sure you get them)! There is more detailed information further down on the page.

### New Alt Values

If you are only adding new alt values, it is much easier. You still need to change related information, otherwise your strings won't show up properly in the Survey Tool, or the right default values won't be set. So go to [Root Aliases](https://cldr.unicode.org/development/updating-dtds).

## Changing DTDs

We augment the DTD structure in various ways.

1. Annotations, included below the !ELEMENT or !ATTLIST line
    - \<!--@VALUE--> to indicate that an attribute is not distinguishing, and is treated like an element value.
    - \<!--@METADATA--> to indicate that an attribute is a "comment" on the data, like the draft status.
    - \<!--@ORDERED--> to indicate that an element's children are ordered.
    - \<!--@DEPRECATED--> to indicate that an attribute or element is deprecated.
    - \<!--@DEPRECATED:attribute-value--> to indicate that an attribute value is deprecated.
2. attributeValueValidity.xml
    - For additional validity checks
3. Check\* tests and unit tests
    - There are many consistency tests that are performed on the data that can't be expressed with the above.

### Removing Structure

1. We never explicitly remove structure except in very unusual cases, so be sure that the committee is in full agreement before doing that.
2. Normally, we just deprecate it, by adding annotations in the DTD file
    1. \<!--@DEPRECATED --> below an !ELEMENT or !ATTLIST item
    2. \<!--@DEPRECATED: comma-separated-attribute-value-list --> for specific attribute values

### Adding structure (elements, attributes, attribute-values)

1. For each element
    1. add @ORDERED if it must be ordered.
    2. read more details below.
2. For each attribute
    1. add @VALUE or @METADATA to an !ATTLIST if the attribute is non-distinguishing. (See the spec for what this means)
        1. **@VALUE should never occur except on leaf nodes!** (There are some cases before we realized this was a mistake.)
    2. If the attribute values are a closed set, you can add them explicitly, like:
        - \<!ATTLIST version draft (approved | contributed | provisional | unconfirmed) #IMPLIED>
    3. Otherwise
        1. Make it NMTOKEN where only single values are allowed, or NMTOKENS otherwise (CDATA in rare cases, but clear with the committee first)
        2. Add validity information to attributeValueValidity.xml
        3. **Never introduce any default DTD attribute values.** (There are some cases before we realized this was a mistake.)
    4. For each attribute
        1. add @VALUE or @METADATA to an !ATTLIST if the attribute is non-distinguishing. (See the spec for what this means)
        2. add @ORDERED to an !ELEMENT.

Add the annotations.

### ldml.dtd

1. **Attribute Value.**
    - Certain values have special sorting behavior. These are listed in **CLDRFile.getAttributeValueComparator**. They look like:
        - attribute.equals("day")
        - || attribute.equals("type") &&
        - element.endsWith("FormatLength")
        - || element.endsWith("Width")
        - ...
    - Those need to be updated, or an exception will be thrown when the items are processed. *Note that this is different than the sort order used in PathHeader for the survey tool.*
    - To fix them, look at the code and find the right comparator, then modify. Example:
        - widthOrder = (MapComparator) new MapComparator().add(new String\[\] {"abbreviated", "narrow", "short", "wide"}).freeze();
2. **Survey Tool Data.** Add information so that the Survey Tool can display these properly to translators
    1. PathHeader.txt (tools/java/org/unicode/cldr/util/data/) - provides the information for what section of the Survey Tool this item shows up in, and how it sorts.
        1. Edit as described in [PathHeader](https://cldr.unicode.org/development/updating-dtds).
    2. PathDescription.txt (tools/java/org/unicode/cldr/util/data/) - provides a description of what the field is, for translators.
        1. If it needs more explanation, add a section (or perhaps a whole page) to the translation guide, e.g. http://cldr.org/translation/plurals.
        2. For an example, see [8479](https://cldr.unicode.org/index/bug-reports#TOC-Filing-a-Ticket)
    3. Placeholders.txt - provides information about the placeholders, if there can be any.
        1. If the value has placeholders ({0}, {1},...) then edit this file as described in [Placeholders](https://cldr.unicode.org/development/updating-dtds).
    4. The coverageLevels.xml (common/supplemental/coverageLevels) - sets the coverage level for the path.
        1. **\[TBD - John\]**
    5. *Making sure paths are visible.*
        1. There are 3 ways for paths to show up in ST even though there are no values in root. See Visible Paths below
    6. **Examples:** For any value that has placeholders, or is used in other values that have placeholders, add handling code to the **test/ExampleGenerator** so that survey tool users see examples of your structure in place.
    7. **Cleaning up input.** If there are things you can do to fix the user data on entry, add to **test/DisplayAndInputProcessor**
3. **Survey Tool Tests.** Add those needed to CheckCLDR
    1. In particular, add to CheckNew so that people see it **\[TBD, fix this advice\]**
        1. If the user's input could be bad, add a survey test to one or more of the tests subclassed from CheckCLDR, to check for bad user input.
            1. Look at test/**CheckDates** to see how this is done.
            2. Run test/**ConsoleCheckCLDR** with various types of invalid input to make sure that they fail.
    2. To update the casing files used by CheckConsistentCasing, run org.unicode.cldr.test.CasingInfo -l \<locale\_regex> which will update the casing files in common/casing. When you check this in, sanity check the values, because in some cases we have had different rules than just what the heuristics generate.
    3. TEST out the **SurveyTool** to verify that you can see/edit the new items. If users should be able to input data and are not able to, the item has not been properly added to CLDR. See [Running the Survey Tool in Eclipse](https://cldr.unicode.org/development/running-survey-tool).
4. **Data.**
    1. Add necessary data to root and English.
    2. (Optional) add additional data for locales (if part of main). If the data is just seed data (that you aren't sure of), make sure that you have draft="unconfirmed" on the leaf nodes.

### supplementalData.dtd

1. Add code to util/SupplementalDataInfo to fetch the data.
2. You should develop a chart program that shows your data in http://www.unicode.org/cldr/data/charts/supplemental/index.html

### Structure Requirements

The following are required for elements, attributes, and attribute values.

#### Elements

We never have "mixed" content. That is, no element values can occur in anything but leaf nodes. You can never have \<x>abcd\<y>def\</y>\</x>. You must instead introduce another element, such as: \<x>\<z>abcd\</z>\<y>def\</y>\</x>

There is a strong distinction between *rule elements and structure elements*. Example: in collations you have \<p>x\</p>\<p>y\</p> representing x < y. Clearly changing the order would cause problems! There are restrictions on this, however:

1. Rule elements must be written in the same order they are read.
2. They can't inherit.
3. You can't (easily) add to them programmatically.
4. You can't mix rule and structure elements under the same parent element. That is, if you can have \<x>\<y>...\</y>\<z>...\</z>\</x>, then either y and z must *both* be rule or *both* be structure elements.
5. In our code, rule elements have their ordering preserved by a fake attribute added when reading, \_q="nnn".
6. The CLDRFile code has a list of these, in the right order, as **orderedElements**. If you ever add a rule element to a DTD, you MUST add it there. Be careful to preserve the above invariants.
    - Note: we should change the name *orderedElements* for clarity.

In order to write out an XML file correctly, we also have to know the valid ordering of paths for elements that are not ordered. This ordering is generated automatically from the DTD, constructed by merging. ***If there are any cycles in the ordering, then the CLDR tools will throw an exception, and you have to fix it.*** That also means that we cannot have complicated DTDs; each non-leaf node **MUST** be of the form:
- \<!ELEMENT foo (alias | (*first?*, *second*\*, *third*?, ... special\*))>.

The subelements of an element will vary between \* and ?. Note however that all leaf nodes MUST allow for the attributes alt=... draft=... and references=.... So that the alt can work, the leaf nodes MUST occur in their parent as \*, not ?, even if logically there can be only one. For example, even though logically there is only a single quotationStart, we see:
- \<!ELEMENT delimiters (alias | (quotationStart\*, ...

#### Attributes

The attribute order is much more flexible, since it doesn't affect the validity of the file. That is, in XML the following are equal:
- \<info iso4217="ADP" digits="0" rounding="0"/>
- \<info digits="0" rounding="0" iso4217="ADP"/>

However, when this is turned into a path, the order does matter. That is, as *strings* the following are *not* equal:

- //supplementalData/currencyData/fractions/info\[@iso4217="ADP"\]\[@digits="0"\]\[@rounding="0"\]
- //supplementalData/currencyData/fractions/info\[@digits="0"\]\[@rounding="0"\]\[@iso4217="ADP"\]

The ordering of attributes in the string path and in the output file is controlled by the ordering in the DTD. Certain attributes always come first (like \_q and type), and certain others always come last (like draft and references). Normally you add new attributes to the middle somewhere.

When computing the file ordering, we compare paths using CLDRFile.ldmlComparator. Here is the basic ordering algorithm:

Walk through the elements in the path. For each element and its attributes:

1. compare the corresponding elements at that level in the respective paths; if unequal, return their ordering
    - If they are orderedElements, treat them as equal (the \_q attributes will distinguish them).
    - Otherwise the "less than" ordering is given by elementOrdering.
2. otherwise compare the respective attributes and attribute values, one by one:
    1. if the attributes are unequal, return their ordering (according to attributeOrdering)
    2. if the attribute values are unequal, return their ordering

While attribute value orderings are mostly alphabetic, we do have a number of tweaks in getAttributeValueComparator so that values come in a reasonable order, such as "sun" < "mon" < "tues" < ...

There is an important distinction for attributes. The **distinguishing** attributes are relevant to the identity of the path and for inheritance. For example, in \<language type="en"...> the type is a distinguishing attribute. The **non-distinguishing** attributes instead carry information, and aren't relevant to the identity of the path, nor are they used in the ordering above. ***Non-distinguishing attributes in the ldml DTD cause problems: try to design all future DTD structure to avoid them; put data in element values, not attribute values.*** It is ok to have data in attributes in the other DTDs. The distinction between the distinguishing and non-distinguishing attributes is captured in the distinguishingData in CLDRFile. So by default, always put new ldml attributes in this array.

- *(Note: we should change this to be exclusive instead of inclusive, to reduce the possibility for error.)*

#### Attribute Values

We use some default attribute values in our DTD, such as

- \<!ATTLIST decimalFormat type NMTOKEN **"standard"** >

This was a mistake, since it makes the interpretation of the file depend on the DTD; we might fix it some day, maybe if we go to Relax, but for now just don't introduce any more of these. It also means that we have a table in CLDRFile with these values: defaultSuppressionMap.

When you make a draft attribute on a new element, don't copy the old ones like this:

\<!ATTLIST xxx draft ( approved | contributed | provisional | unconfirmed | true | false ) #IMPLIED >\<!-- true and false are deprecated. -->

That is, we *don't* want the deprecated values on new elements. Just make it:

\<!ATTLIST xxx draft ( approved | contributed | provisional | unconfirmed ) #IMPLIED >

The DTD cannot do anything like the level of testing for legitimate values that we need, so supplemental data also has a set of attributeValueValidity.xml data for checking attribute values. For example, we see:

- \<attributeValues dtds='supplementalData' elements='calendarPreference' attributes='ordering' type='list'>$\_bcp47\_calendar\</attributeValues>

This means that whenever you see any matching dtd/element/attribute combination, it can be tested for a list of values that are contained in the variable \$\_bcp47\_calendar. Some of these variables are lists, and some are regex, and some (those with $\_) are generated internally from other information. When you add a new attribute to ldml, you must add a \<validity> element unless it is a closed set.

#### No default attribute values

The ones we have in CLDR were (in hindsight) a mistake, since it makes the interpretation of the file depend on the DTD; we might fix it some day, maybe if we go to Relax, but for now just don't introduce any more of these. It also means that for writing out the files we have a table in CLDRFile with these values: defaultSuppressionMap and in supplementalMetadata *\<suppress>*.

#### Don't Reuse

For many reasons, you never reuse an element name or attribute name unless you mean precisely the same thing, and the item is used in the same way. So to="2009-05-21" is always an attribute that means an end date. Be very careful about new elements with the same name as old ones. You can't have \<territory> be an orderedElement in one place, and a non-orderedElement in another. The attribute type=... is always used as an id. For historical reasons, sometimes it is distinguishing and sometimes not (this is very painful, don't add to it!). It is also not used as the id in numberingSystems.

## Root Aliases

If your new structure should have aliases, such as when the "narrow" values should default to the "short" values, which should default to the regular values, then you need to add aliases in root.xml. Look at examples there for how to do this.

## PathHeader

PathHeader.txt determines the placement and ordering in SurveyTool. It consists of a sequence of regex lines of the following form:

\<regex> ; \<section> ; \<page> ; \<header> ; \<code>

Here's an example:

//ldml/dates/timeZoneNames/metazone\[@type="%A"\]/%E/%E ; Timezones ; &metazone($1) ; $1 ; $3-$2

### Key Features

These are also in the header of PathHeader.txt:

- \# Be careful, order matters. It is used to determine the order on the page and in menus. Also, be sure to put longer matches first, unless terminated with $.
- \# The quoting of \\\[ is handled automatically, as is alt=X
- \# If you add new paths, change @type="..." => @type="%A"
- \# The syntax &function(data) means that a function generates both the string and the ordering. The functions MUST be supported in PathHeader.java
- \# The only function that can be in Page right now are &metazone and &calendar, and NO functions can be in Section
- \# A \* at the front (like \*$1) means to not change the sorting group.

There is a set of variables at the top of the file. These all are in parens, so the %A, %E, and %E correspond to the $1, $2, and $3 in the \<section> ; \<page> ; \<header> ; \<code> fields.

The order of the section and page is determined by the enums in the PathHeader.java file. So the \<section> and \<page> must correspond to those enum values.

### Uniqueness is Vital

The results from PathHeader must be unique: that is, if the source paths are different, then at least one of \<section> ; \<page> ; \<header> ; \<code> must be different.
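
The uniqueness requirement can be illustrated with a small, self-contained sketch. It is not the real PathHeader.java: the class and method names are hypothetical, the expansions used for %A and %E are assumptions (the real definitions are at the top of PathHeader.txt), functions such as &metazone are not modeled, and the page name Metazones is made up for the example. It resolves each path against a rule line, substitutes $1, $2, ... into the four fields, and reports any result shared by more than one path.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; hypothetical names, not the CLDR implementation. */
final class PathHeaderUniquenessSketch {

    /** One rule line: a compiled path regex plus the four output templates. */
    static final class Rule {
        final Pattern pattern;
        final String[] templates;

        Rule(Pattern pattern, String[] templates) {
            this.pattern = pattern;
            this.templates = templates;
        }
    }

    /** Parses "<regex> ; <section> ; <page> ; <header> ; <code>". */
    static Rule parseRule(String line) {
        String[] parts = line.trim().split("\\s*;\\s*");
        if (parts.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields: " + line);
        }
        // Quote the literal brackets first, then expand the variables.
        // ASSUMPTION: %A and %E stand for these capture groups.
        String regex = parts[0]
                .replace("[", "\\[")
                .replace("]", "\\]")
                .replace("%A", "([^\"]*+)")
                .replace("%E", "([-a-zA-Z_]*)");
        String[] templates = {parts[1], parts[2], parts[3], parts[4]};
        return new Rule(Pattern.compile(regex), templates);
    }

    /** Returns "section ; page ; header ; code" from the first matching rule, or null. */
    static String resolve(List<Rule> rules, String path) {
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(path);
            if (!m.matches()) {
                continue;
            }
            StringBuilder result = new StringBuilder();
            for (String template : rule.templates) {
                String value = template;
                // Highest group first, so "$1" never clobbers the start of "$10".
                for (int i = m.groupCount(); i >= 1; i--) {
                    String group = m.group(i);
                    value = value.replace("$" + i, group == null ? "" : group);
                }
                if (result.length() > 0) {
                    result.append(" ; ");
                }
                result.append(value);
            }
            return result.toString();
        }
        return null;
    }

    /** Maps each result produced by two or more distinct paths to those paths. */
    static Map<String, List<String>> findCollisions(List<Rule> rules, List<String> paths) {
        Map<String, List<String>> byResult = new LinkedHashMap<>();
        for (String path : paths) {
            String result = resolve(rules, path);
            if (result != null) {
                byResult.computeIfAbsent(result, k -> new ArrayList<>()).add(path);
            }
        }
        byResult.values().removeIf(list -> list.size() < 2);
        return byResult;
    }

    public static void main(String[] args) {
        List<String> paths = List.of(
                "//ldml/dates/timeZoneNames/metazone[@type=\"Europe_Central\"]/long/standard",
                "//ldml/dates/timeZoneNames/metazone[@type=\"Europe_Central\"]/short/standard");
        String prefix = "//ldml/dates/timeZoneNames/metazone[@type=\"%A\"]/%E/%E ; Timezones ; Metazones ; $1 ; ";
        // The code "$3-$2" keeps long and short apart; "$3" alone does not.
        System.out.println(findCollisions(List.of(parseRule(prefix + "$3-$2")), paths));
        System.out.println(findCollisions(List.of(parseRule(prefix + "$3")), paths));
    }
}
```

In this sketch, dropping $2 from the code field makes the long and short paths resolve to the same four fields, which is exactly the situation the rule above forbids.
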

### Changing Order

If you need to change the order of the header or code or the appearance programmatically, then you need to create a function (call it xyz), and use it in the PathHeader.txt file (e.g. &xyz($1)). In PathHeader.java, search for *functionMap* to see examples of these.

The order of the header and then of the code within the same header is normally determined by the ordering in the file. To override this, set the order field in your function. For example, the following gets integer values and changes them into real ints for comparison.

**int** m = Integer.*parseInt*(source);

*order* = m;

There is also a "suborder" used in a few cases for the code. You probably don't need to worry about this, but here is an example. Ask for help on the cldr-dev list if you need this.

*suborder* = **new** SubstringOrder(source, 1);

The return value is the appearance to the user. For example, the following changes integer months into strings for display:

**static** String\[\] *months* = { "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Und" };

...

**return** *months*\[m - 1\];

## Placeholders

If a value has placeholders, edit Placeholders.txt:

1. Add 1 item per placeholder, with the form
    - \<regex> ; {0}=\<message\_name> \<example> ; {1}=\<message\_name> \<example> ...
    - ^//ldml/units/unit\\\[@type="day%A"\]/unitPattern ; {0}=NUMBER\_OF\_DAYS 3
2. There is a variable %A that will match attribute value syntax (or substrings).
3. \<example> may contain spaces, but \<message\_name> must not.
4. For an example, see [8484](https://cldr.unicode.org/index/bug-reports#TOC-Filing-a-Ticket)
5. Check that the ConsoleCheckCLDR **CheckForExemplars** fails if there are no placeholders in the value
6. Note: we should switch methods so that we don't need to quote \\\[, etc, but we haven't yet.
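
The line format described in the list above can be sketched as a small parser. This is a minimal illustration, not the code CLDR uses to read Placeholders.txt: the class, field, and method names are hypothetical, and it assumes that fields are separated by semicolons and that an example never contains a semicolon. It does not compile or expand the regex field.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; hypothetical names, not the CLDR implementation. */
final class PlaceholdersLineSketch {

    /** The message name and example recorded for one placeholder. */
    static final class Placeholder {
        final String name;
        final String example;

        Placeholder(String name, String example) {
            this.name = name;
            this.example = example;
        }
    }

    // {n}=<message_name> <example>: the name has no spaces, the example may.
    private static final Pattern FIELD = Pattern.compile("\\{(\\d+)\\}=(\\S+)\\s+(.+)");

    /** Returns the path regex, which is the first field of the line. */
    static String pathRegex(String line) {
        return line.trim().split("\\s*;\\s*")[0];
    }

    /** Parses every "{n}=NAME example" field after the regex, keyed by n. */
    static Map<Integer, Placeholder> parse(String line) {
        String[] fields = line.trim().split("\\s*;\\s*");
        Map<Integer, Placeholder> result = new TreeMap<>();
        for (int i = 1; i < fields.length; i++) {
            Matcher m = FIELD.matcher(fields[i]);
            if (!m.matches()) {
                throw new IllegalArgumentException("Bad placeholder field: " + fields[i]);
            }
            result.put(Integer.parseInt(m.group(1)), new Placeholder(m.group(2), m.group(3)));
        }
        return result;
    }

    /** True if the value mentions every declared placeholder, e.g. "{0}". */
    static boolean usesAllPlaceholders(String value, Map<Integer, Placeholder> declared) {
        for (Integer index : declared.keySet()) {
            if (!value.contains("{" + index + "}")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String line = "^//ldml/units/unit\\[@type=\"day%A\"]/unitPattern ; {0}=NUMBER_OF_DAYS 3";
        Map<Integer, Placeholder> declared = parse(line);
        for (Map.Entry<Integer, Placeholder> e : declared.entrySet()) {
            System.out.println("{" + e.getKey() + "} name=" + e.getValue().name
                    + " example=" + e.getValue().example);
        }
        System.out.println(usesAllPlaceholders("{0} days", declared));
    }
}
```

The usesAllPlaceholders helper mirrors, in a very reduced form, the kind of check mentioned in step 5: a value that declares {0} but never uses it should be flagged.
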

## PathDescription

This file provides a description of each kind of path, and a link to a section of https://cldr.unicode.org/translation. The easiest approach is to take an existing description and modify it.

## Coverage

Coverage determines the minimum coverage level at which a given item will appear in the survey tool. If a given field is not in coverage, then the item will not appear in the survey tool at all. This data is required for the elements in /main/.

The file **common/supplemental/coverageLevels.xml** is a series of regular expressions describing the paths and the coverage levels associated with each. The file also gives you the ability to define a "coverage variable", which can then be used as a placeholder in the regular expressions used for matching. Always try to be as exact as possible and avoid using wildcards in the regular expressions, as they can impact lookup performance.

Coverage values are currently numeric, although we may change them to be words in the near future in order to make them easier to understand. The coverage level values are:

10 = Core data, 20 = POSIX, 30 = Minimal, 40 = Basic, 60 = Moderate, 80 = Modern, 100 = Comprehensive

Example: The following two lines define the coverage for the exemplar characters items. Note that "//ldml" is automatically prepended to the path names, in order to make the paths in this file smaller.

\<coverageVariable key="%exemplarTypes" value="(auxiliary|index|punctuation)"/>

\<coverageLevel value="10" match="characters/exemplarCharacters\[@type='%exemplarTypes'\]"/>

## LDML2ICU

Modify the following files as described in [ldml2icu\_readme.txt](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/tools/java/org/unicode/cldr/icu/ldml2icu_readme.txt). This will allow NewLdml2IcuConverter.java to work properly so that the data can be read into ICU and tested there.

1. ldml2icu\_locale.txt and/or
2. ldml2icu\_supplemental.txt

Unfortunately, you have to change input parameters to get the different kinds of generated files. Here's an example:

\-s {workspace-cldr}/common/supplemental

\-d {workspace-temp}/cldr/icu/

\-t supplementalData

\-k

Use -k to build into a single file, which is helpful for checking the supplemental data. There are a few other useful parameters if you look at the top of NewLdml2IcuConverter.

### Warning

If you add a new kind of file or directory, you may have to adjust the tool to make sure it is seen and built. For example, if you add a new kind of supplemental file, you also have to modify SupplementalMapper.fillFromCldr(...).

## Visible Paths

There are three ways for paths to show up in the Survey Tool (and in other tooling!) even if the value is null for a given locale. These are important, since they determine what users will be able to enter.

1. **root.** This is the simplest, and should always be used whenever there is a 'real' fallback value for the path, and the path is not part of an algorithmically computed set. It also has the aliases for paths that get special inheritance.
2. **code\_fallback.** This is used for all algorithmically computed paths *that **don't** depend on the locale*. For example, the paths for language codes, currency codes, region codes, etc. are here.
    - To modify, go to XMLSource.java (tools/java/org/unicode/cldr/util/) and update constructedItems to add special paths for items that should appear in locales even though there is no corresponding item in root (e.g. for localeDisplayNames including standard language codes and regional variants, and for all alt="short" or alt="variant" forms).
    - Check to make sure that all of the special alt values in en.xml are there.
3. **extraPaths.** This is used for algorithmically computed paths *that **do** depend on the locale*. For example, we generate count values based on the plural rules. The 'other' form must be in root, but all other forms are calculated here. This should not be overused, since it is recalculated dynamically, whereas root and code\_fallback are constant over the life of the ST.
    - To modify, look at CLDRFile.getRawExtraPaths().

### Gotchas

- Even if root, code\_fallback, or extraPaths are set up right, the data may not be visible in ST. If it should show up but doesn't, look at:
    - **PathHeader:** Special items are suppressed (they all have HIDE on them). This is used for all paths that don't vary by locale. Paths can also be marked as having unmodifiable values.
    - **Coverage:** If a path has too high a coverage level, then it will be hidden.
    - **Other stuff?** \[Steven to fill out\].

### OK if Missing

Certain paths don't have to be present in locales. They are not counted as Missing in the Dashboard and shouldn't have an effect on coverage. To handle these, modify the file [missingOk.txt](https://cldr.unicode.org/index/bug-reports#TOC-Filing-a-Ticket) to provide a regex that captures those paths. Be careful, however, to not be overly inclusive: you want all and only those paths that are ok to skip. Typically those are paths for which root values are perfectly fine.

## Examples of DTD modifications

The following is an example of the different files that may need to be modified. It has both count= and a placeholder, so it hits most of the kinds of changes.
- https://cldr.unicode.org/index/bug-reports#TOC-Filing-a-Ticket

## Modifying English/Root

Whenever you modify values in English or Root, be sure to run GenerateBirth as described on [Updating English/Root](https://cldr.unicode.org/development/cldr-development-site/updating-englishroot) and check in the results. That ensures that CheckNew works properly. This must be done before the Survey Tool starts or is in the Submission Phase.

## Validation

- **Do the steps on** [**Running Tests**](https://cldr.unicode.org/development/running-tests)

## Debugging Regexes

- Moved to [**Running Tests**](https://cldr.unicode.org/development/running-tests)