Unicode CLDR Technical Committee Process
1. Introduction
This document describes the Unicode CLDR Technical
Committee, and its process for data collection, resolution, public
feedback and release. The process is designed to be light-weight: in
particular, the meetings are frequent, short, and informal. Most of the
work is by email or phone, with a database recording requested changes
in data.
When gathering data for a region and language, it
is important to have multiple sources for that data to produce the most
widely acceptable data. Initial versions of data were based on the best
available sources, but CLDR data will be modified and improved, in
successive versions, by more input from the contributors inside and
outside of the Unicode Consortium.
It is important to note that CLDR is a Repository,
not a Registration. That is, contributors should not expect that their
contributions will simply be adopted into the repository; instead, they
will be vetted against the best available information.
All inputs are open, and gathered via the CLDR Survey Tool or
recorded in a bug/feature request database (CLDR Bug Reports).
Changes in response to requests in the database may be entered into the
repository snapshot over time by the maintainers of the repository, but
final approval of the release of any version of CLDR rests with the
CLDR Technical Committee.
For more information on the formal procedures for
the Unicode CLDR Technical Committee, see the Technical
Committee Procedures for the Unicode Consortium.
2. Specification Changes
The UTS #35: Locale
Data Markup Language (LDML) specification may be changed to
add structure for new kinds of data or other features. Requests for
changes are entered in the bug/feature request database (CLDR Bug Reports).
Structural changes are always
backwards-compatible. That is, previous files will continue to work.
Deprecated elements remain and can be used, although their usage is
strongly discouraged.
There is a standing policy for structural changes
that require non-trivial code for proper implementation, such as time
zone fallback or alias mechanisms. These require the existence of at
least a prototype implementation that demonstrates correct function
according to the proposed specification.
3. Data Submission and Vetting Process
Once data for a country and language has been
received, the data from the different sources will be compared to show
agreements and differences. Initial data contributions are normally
marked as draft; this may be changed once the data is
vetted.
Note that there are two types of data in the
repository:
- Common Data: The contents
are decided upon by the CLDR Technical Committee, following its
procedures and this process.
- Comparison data: The
contributor can be an individual or an organization. Data is normally
gathered by calling public APIs, to ensure that the data matches what
is actually in use. The data is only for comparison, and will not be
changed except where necessary to update the data to match the external
source. The only requirement is that all changed data be versioned, and
the Version Numbering Scheme be used.
Contributors are encouraged to use local language
and country contacts, inside and outside their organization, to help
vet current common data and any new proposals for addition or amendment
of common data. In particular, national standards organizations are
encouraged to be involved in the data vetting process.
Adding a new language to CLDR requires only that
the proposer commit to providing at least the minimal localization
(exemplar characters, months, days, date/time formats, translations for
a few countries, languages, currencies, etc.). The exemplar characters,
however, are required before the new locale can be added: see also Exemplar
Character Sources. The new locale then becomes
available for additional translations and vetting during the next
review cycle.
The following procedure is used when resolving
differences in submitted data. At the end, for each field a single
value will be chosen as optimal, while the others will have an alt=proposed
attribute. The draft attribute on all the values
will be set to one of 4 states:
- unconfirmed
- provisional
- contributed (= minimally approved)
- approved (equivalent to an absent draft attribute)
Implementations may choose the level at which they
wish to accept data. They may choose to accept even unconfirmed
data, especially if there is no translated alternative. Approved data
is approved by the Technical Committee, as described by the resolution
process below. This does not mean that the data is guaranteed to be
error-free -- this is simply the best judgment of the committee
according to the process.
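As a rough illustration of how an implementation might apply its chosen acceptance level, the sketch below orders the four draft states and treats an absent draft attribute as approved, as described above. The names `DraftStatus`, `parse_draft`, and `accepts` are hypothetical and are not part of CLDR or LDML.

```python
from enum import IntEnum
from typing import Optional


class DraftStatus(IntEnum):
    """The four draft states, ordered from least to most vetted."""
    UNCONFIRMED = 0
    PROVISIONAL = 1
    CONTRIBUTED = 2
    APPROVED = 3


def parse_draft(attr: Optional[str]) -> DraftStatus:
    """Map a draft attribute value to a status; absent means approved."""
    if attr is None:
        return DraftStatus.APPROVED
    return DraftStatus[attr.upper()]


def accepts(attr: Optional[str], minimum: DraftStatus) -> bool:
    """Hypothetical helper: does a value meet the implementation's chosen level?"""
    return parse_draft(attr) >= minimum
```

An implementation that wants only reasonably vetted data might call `accepts(attr, DraftStatus.CONTRIBUTED)`, while one that prefers any translation over none could pass `DraftStatus.UNCONFIRMED`.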
User Levels
There are multiple levels of access and control:
Vetter Level | Vote | Description |
Committee Vetters | 8 | Technical Committee (TC) members. Can vet and submit data for all locales; can manage users in their organization; can see the email addresses for all vetters. |
Expert Vetters | 8 | Can vet and submit data for a particular set of locales; cannot manage other users; can see the email addresses for submitted data in their locales. |
Regular Vetters | 4 | Can vet and submit data for a particular set of locales; cannot manage other users; can see the email addresses for submitted data in their locales. |
Guest Vetters | 1 | Can vet and submit data for a particular set of locales; cannot manage other users; cannot see email addresses. |
Locked Vetters | 0 | If a user is locked or removed, their vote is zero. |
These levels are decided by the Technical Committee and the TC representative
for the respective organizations.
- Unicode TC members (full/institutional/supporting)
can assign their users to the Regular or Guest level and, with approval of the TC, to the Expert level.
- Liaison or associate members can assign users to the Guest level, or to other levels with approval of the TC.
- The liaison/associate member
gets TC status in order to manage users, but gets Guest status in
terms of voting, unless the committee approves a higher level.
- Users assigned to "unicode.org" are
normally assigned as Guest, but the committee can assign a different
level.
Voting Process
- Each user gets one vote on each value, but the strength
of the vote varies with the user level, as shown in the table above.
- All values with survey tool errors get zero votes.
- For example, a date pattern of "14.
september" instead of "dd MMM".
- They cannot be voted for, and show a
visible error.
- Collision errors are an exception. They get normal votes, but are handled below.
- For each value, each organization gets a vote
based on the maximum (not cumulative) strength of
the votes of its users who voted on that item.
- That is, even if an organization has 10
vetters voting for a value, if the highest level among them is Regular Vetter, then the vote count attributed to the
organization as a whole is 4.
- If there is a dispute (votes for different
values) within an organization, then the majority vote for that
organization is chosen. If there is a tie, then no vote is counted for
the organization.
- Batch data (marked with x999, for example) gets
a status based on committee decision.
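The per-organization counting rules above can be sketched as follows. This is only an illustration under stated assumptions: the ballot layout, the level keys in `STRENGTH`, and the function name `organization_votes` are hypothetical; "majority" within an organization is assumed to mean a head count of its voting users; and zero-strength (locked) ballots are assumed to be ignored entirely. Survey tool errors and batch data are not modeled.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Vote strengths taken from the User Levels table.
STRENGTH = {"committee": 8, "expert": 8, "regular": 4, "guest": 1, "locked": 0}

# One ballot: (organization, user level, value voted for).
Ballot = Tuple[str, str, str]


def organization_votes(ballots: List[Ballot]) -> Dict[str, int]:
    """Return the total vote per value, counting each organization once."""
    by_org: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for org, level, value in ballots:
        if STRENGTH[level] == 0:
            continue  # assumption: locked users are ignored entirely
        by_org[org].append((level, value))

    totals: Dict[str, int] = defaultdict(int)
    for org_ballots in by_org.values():
        # Assumption: "majority" is a head count of the organization's users.
        counts = Counter(value for _, value in org_ballots).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            continue  # internal tie: no vote is counted for this organization
        winner = counts[0][0]
        # Maximum (not cumulative) strength among the users backing the winner.
        totals[winner] += max(
            STRENGTH[level] for level, value in org_ballots if value == winner
        )
    return dict(totals)
```

With ten Regular Vetters from one organization all voting for the same value, this sketch attributes a vote of 4 to that value, matching the example in the list above.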
All fields are then assessed as follows:
Optimal Field Value
For each release, there is one optimal field value
determined by the following:
- Add up the votes for each value from each
organization.
- Out of all the possible alternative values for
a given field, pick the one with the most votes, the optimal value.
- If there was a tie, pick the least one (in UCA
order).
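A minimal sketch of this selection step is shown below. It assumes the per-value vote totals have already been summed across organizations, and it uses plain code point order as a stand-in for UCA order, since the Python standard library has no UCA collator. The function name `optimal_value` is hypothetical.

```python
from typing import Dict, Optional


def optimal_value(totals: Dict[str, int]) -> Optional[str]:
    """Pick the value with the most votes; break ties by taking the least value."""
    if not totals:
        return None
    best = max(totals.values())
    tied = [value for value, votes in totals.items() if votes == best]
    # Assumption: code point order stands in for UCA order here.
    return min(tied)
```

A production implementation would replace `min(tied)` with a comparison based on a real UCA collator.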
Draft Status of Optimal Field Value
- Let O be the optimal value's vote, N be the vote of the next best value, and G be the number of
organizations that voted for the optimal value.
- Assign the draft status according to the first
of the conditions below that applies:
Resulting Draft Status | Condition |
approved | O ≥ 8 and O ≥ 2×N |
contributed | (O ≥ 4 and O > N) or (O ≥ 2 and O > N and G ≥ 2) |
provisional | O ≥ 2 and O ≥ N |
unconfirmed | otherwise |
- If the draft status of the previously released
value is better than the new draft status, then no change is made.
Otherwise, the optimal value and its draft status are made part of the
new release.
- For example, if the new optimal value does
not have the status of approved, and the previous
release had an approved value
(one that does not have an error and is not a fallback), then that
previously-released value stays approved and
replaces the optimal value in the following steps.
- If there was an alt=proposed
on the optimal value, the alt=proposed is removed.
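The draft status conditions and the rule about previously released values can be expressed as a short sketch. The function names `draft_status` and `resolve`, and the representation of a value as a `(value, status)` pair, are assumptions made for illustration; the check that a previous value has no error and is not a fallback is left out.

```python
from typing import Optional, Tuple

# Rank of each draft status, from worst to best.
RANK = {"unconfirmed": 0, "provisional": 1, "contributed": 2, "approved": 3}


def draft_status(o: int, n: int, g: int) -> str:
    """Return the first status whose condition holds.

    o: vote of the optimal value, n: vote of the next best value,
    g: number of organizations that voted for the optimal value.
    """
    if o >= 8 and o >= 2 * n:
        return "approved"
    if (o >= 4 and o > n) or (o >= 2 and o > n and g >= 2):
        return "contributed"
    if o >= 2 and o >= n:
        return "provisional"
    return "unconfirmed"


def resolve(previous: Optional[Tuple[str, str]],
            new: Tuple[str, str]) -> Tuple[str, str]:
    """Keep the previously released (value, status) if its status is better."""
    if previous is not None and RANK[previous[1]] > RANK[new[1]]:
        return previous
    return new
```

For instance, `draft_status(8, 5, 1)` falls through the approved condition (8 is less than 2×5) and lands on contributed, and `resolve(("old", "approved"), ("new", "contributed"))` keeps the previously released value.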
Further Processing
After the optimal value is chosen:
- Collision errors are resolved by retaining one of the values and
removing the other(s).
- The choice is based
on the judgment of the committee, typically according to which field is most commonly used.
When an item is removed, an alternate may then become the new optimal
value.
- All other values with errors are removed.
- Non-optimal values are handled as follows:
- Those with no votes are removed.
- Those with votes are marked with alt=proposed
and given the draft status unconfirmed.
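The handling of non-optimal values might look like the sketch below. It assumes per-value vote totals and a set of values with (non-collision) errors are already available; collision resolution depends on committee judgment and is not modeled. The function name `process_alternates` and the dictionary layout of the result are hypothetical.

```python
from typing import Dict, List, Set


def process_alternates(totals: Dict[str, int], optimal: str,
                       errors: Set[str]) -> List[Dict[str, str]]:
    """Return the surviving non-optimal values, marked as proposed alternates."""
    kept = []
    for value, votes in totals.items():
        if value == optimal:
            continue  # the optimal value is handled separately
        if value in errors or votes == 0:
            continue  # values with errors or without votes are removed
        kept.append({"value": value, "alt": "proposed", "draft": "unconfirmed"})
    return kept
```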
If a locale does not have minimal data (at least
at a provisional level), then it may be excluded from the release.
Where this is done, it may be restored to the repository for the next
submission cycle.
Note: Starting with CLDR 1.7, we are planning to save votes across releases, for any
active (unlocked) voters. However, where there are English changes, old
votes will be discarded.
This process can be fine-tuned by the Technical
Committee as needed, to resolve any problems that turn up. A committee
decision can also override any of the above process for any specific
values.
For more information see the key links in CLDR Survey Tool (especially the Vetting Phase).
Notes:
- If data has a formal problem, it can be fixed
directly (in CVS) without going through the above process. Examples
include:
- syntactic problems in pattern, extra
trailing spaces, inconsistent decimals, mechanical sweeps to change
attributes, translatable characters not quoted in patterns, changing '
(punctuation mark) to curly apostrophe or s-cedilla to s-comma-below,
removing disallowed exemplar characters (non-letter, number, mark,
uppercase when there is a lowercase).
- These are changed in-place, without
changing the draft status.
- Linguistically-sensitive data should always go
through the survey tool. Examples include:
- names of months, territories, number
formats, changing ASCII apostrophe to U+02BC modifier letter apostrophe
or U+02BB modifier letter turned comma, or U+02BD modifier letter
reversed comma, adding/removing normal exemplar characters.
- The Technical Committee can authorize bulk submissions
of new data directly (in CVS), with all new data marked
draft="unconfirmed" (or other status decided by the committee), but
only where the data passes the CheckCLDR console tests.
- The survey tool does not currently handle all
CLDR data. For data it doesn't cover, the regular bug system is used to
submit new data or ask for revisions of this data. In particular:
- Collation, transforms, or text
segmentation, which are more complex.
- Non-linguistic locale data:
There may be conflicting common practices or
standards for a given country and language. Thus LDML provides keyword
variants to reflect the different practices. For example, for German it
allows the distinction between PHONEBOOK and DICTIONARY collation.
When there is an existing national standard for a
country that is widely accepted in practice, the goal is to follow that
standard as much as possible. Where the common practice in the country
deviates from the national standard, or if there are multiple
conflicting common practices, or options in conforming to the national
standard, or conflicting national standards, multiple variants may be
entered into the CLDR, distinguished by keyword variants or variant
locale identifiers.
Where a data value is identified as following a
particular national standard (or other reference), the goal is to keep
that data aligned with that standard. There is, however, no guarantee
that data will be tagged with any or all of the national standards that
it follows.
Dot-dot releases, such as 1.4.1, are issued
whenever the standard identifiers change (that is, BCP 47 identifiers,
Time zone identifiers, or ISO 4217 Currency identifiers).
Updates to identifiers will also mean updating the English names for
those identifiers.
Corrigenda may also be included in dot-dot
releases. Dot-dot releases may also be issued if there are substantive
changes to supplemental (non-language) data. An example of supplemental
data additions would be adding more transforms, or adding more
script-language info.
Normally there are no dot-dot releases for language data, but the
committee may decide to issue one if the situation warrants. Normally
there are no major changes in the specification.
The structure and DTD may change, but except for
additions or for small bug fixes, data will not be changed in a way
that would affect the content of resolved data.
The public can supply formal feedback into CLDR
via the Survey Tool or by filing a Bug Report or Feature Request.
There is also a public forum for
questions at CLDR Mailing List (details on archives are found there).
Anyone can also ask to be added to a list that
will receive notification of new CLDR bugs, so they can track issues if
they want. Anyone can also reply to any bug report to add comments
or questions.
- To subscribe, send a note to
"ecartis+unicode.org" (use an at sign instead of
the +) and put "subscribe cldr-bugrfe" in the subject line.
- To unsubscribe, put "unsubscribe cldr-bugrfe"
in the subject line instead.
There is also a members-only CLDR
mailing list for members of the CLDR Technical Committee.
Public Review Issues may be posted in cases where broader public
feedback is desired on a particular issue.
Be aware that changes and updates to CLDR will
only be made in response to information entered in the Survey Tool
or filed as a Bug Report or Feature Request. Discussion on public mailing lists
is not monitored; no actions will be taken in
response to such discussion -- only in response to filed bugs. The
process of checking and entering data takes time and effort; so even
when bugs/feature requests are accepted, it may take some time before
they are in a release of CLDR.
5. Data Release Process
The locale data is frozen per version. Once a
version is released, it is never modified. Any changes, however minor,
will mean a newer version of the locale data being released. The
versioning scheme is x.y.z, where z is incremented for bug fixes, y is
incremented for any additions (such as new locale data or LDML
elements), and x is incremented for any major changes in format.
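The x.y.z scheme can be illustrated with a small sketch. The function names and the change labels ("bugfix", "addition", "format") are hypothetical, and resetting the lower fields to zero when a higher field is incremented is an assumption; the text above does not state it.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse 'x', 'x.y', or 'x.y.z'; missing fields default to 0 (an assumption)."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))
    x, y, z = parts
    return (x, y, z)


def next_version(current: Version, change: str) -> Version:
    """Return the version that a given kind of change would produce."""
    x, y, z = current
    if change == "format":      # major change in format
        return (x + 1, 0, 0)
    if change == "addition":    # new locale data or LDML elements
        return (x, y + 1, 0)
    if change == "bugfix":      # bug fixes
        return (x, y, z + 1)
    raise ValueError(f"unknown change type: {change}")
```

Because a released version is never modified, even a minor correction to 1.4 would be published under a new number such as 1.4.1 in this sketch.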
Early releases of a version of the common locale
data will be issued as either alpha or beta releases, available for
public feedback. The dates for the next scheduled release will be on CLDR Project.
The schedule milestones are:
Design (p1) | All the proposed design changes in structure and tools have been accepted. All the DTD and specification changes are made according to the proposed design. The tools are updated to support the new structure, including the survey tool for displaying, collecting, and vetting data. |
Survey Tool Beta (p2) | Users can try out the survey tool and supply feedback. |
Data Submission | Users can add data and vet (vote for) data. |
Data Vetting | Users can vet (vote for) data, and can add data in certain disputed cases. |
Data Resolution | Data resolution, data/structure verification and correction by the committee. |
Final Candidate | Final Candidate available for testing. Only showstoppers fixed. |
Release | Released, stable, referenceable version. |
Each phase ends at 24:00 (midnight) on the day in question.
6. Meetings and Communication
The currently-scheduled meetings are listed on the
Unicode Calendar. Meetings are held by phone, every week at 8:00
Pacific Time (GMT-08:00 in winter, GMT-07:00 in summer). Some
meetings may be skipped if they conflict with holidays or other Unicode
meetings.
There is an internal email list for the Unicode
CLDR Technical Committee, open to Unicode members and invited experts.
All national standards bodies who are interested in locale data are
also invited to become involved by establishing a Liaison
membership in the Unicode Consortium, to gain access to this
list.
The telephone numbers, passcode, agenda, and any changes in schedule
are sent out on this email list.
7. Officers
The current Technical Committee Officers are:
- Chair: Mark Davis (Google)
- Vice-Chair: John Emmons (IBM)