CLDR-17566 Converting Cldr Spec (#4009)
chpy04 authored Sep 3, 2024
1 parent eb4b003 commit ec5cadd
Showing 5 changed files with 923 additions and 0 deletions.
33 changes: 33 additions & 0 deletions docs/site/index/cldr-spec/core-data-for-new-locales.md
@@ -0,0 +1,33 @@
---
title: Core Data for New Locales
---

# Core Data for New Locales

This document describes the minimal data needed for a new locale. There are two kinds of data that are relevant for new locales:

1. **Core Data** \- This is data that the CLDR committee needs from the proposer ***before*** a new locale is added. The proposer is expected to also get a Survey Tool account, and contribute towards the Basic Data.
2. **Basic Data** \- The Core Data is just the first step. It is only created under the expectation that people will engage in supplying data at a [Basic Coverage Level](https://cldr.unicode.org/index/cldr-spec/coverage-levels#h.yi1eiryx7yl4). **If the locale does not meet the [Basic Coverage Level](https://cldr.unicode.org/index/cldr-spec/coverage-levels#h.yi1eiryx7yl4) in the next Survey Tool cycle, the committee may remove the locale.**

## Core Data

Collect and submit the following data, using the [Core Data Submission Form](https://docs.google.com/forms/d/e/1FAIpQLSfSyz0VUSXD93IJQQdjzUCnbQwC2nwz6eiLjTaFjASQZzpoSg/viewform). *Note to translators: If you have difficulties with or questions about the following data, please contact us: [file a new bug](https://cldr.unicode.org/index/bug-reports#TOC-Filing-a-Ticket), or post a follow\-up comment on your existing bug.*

1. The correct language code according to [Picking the Right Language Identifier](https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code).
2. The four exemplar sets: main, auxiliary, numbers, punctuation. 
- These must reflect the Unicode model. For more information, see [tr35\-general.html\#Character\_Elements](http://www.unicode.org/reports/tr35/tr35-general.html#Character_Elements).
3. Verified country data (i.e., the population of speakers in the regions (countries) in which the language is commonly used).
- At least one country must be listed, and enough others should be included that together they cover approximately 75% or more of the users of the language.
- "Users of the language" includes people who use it as either a first or second language. The main focus is on written language.
4. Default content script and region (normally the region is the country with the largest population using that language, and the script is the one customarily used for that language in that country).
- **\[[supplemental/supplementalMetadata.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/supplementalMetadata.xml#LC1654:~:text=%3CdefaultContent)]**
- *See*: [http://cldr.unicode.org/translation/translation\-guide\-general/default\-content](https://cldr.unicode.org/translation/translation-guide-general/default-content)
5. The correct time cycle used with the language in the default content region
- In common/supplemental/supplementalData.xml, this is the "timeData" element
- The value should be h (1\-12\), H (0\-23\), k (1\-24\), or K (0\-11\); as defined in [https://www.unicode.org/reports/tr35/tr35\-dates.html\#Date\_Field\_Symbol\_Table](https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table)
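
The 75% guideline in item 3 is simple arithmetic, and a quick check before submitting can catch gaps. The following is a minimal sketch, not part of any CLDR tooling: the function names, the region codes `AA` and `BB`, and all population figures are hypothetical, and because the text says "approximately", the 0.75 threshold is a guide rather than a hard cutoff.

```python
from typing import Dict


def coverage_share(speakers_by_region: Dict[str, int], total_speakers: int) -> float:
    """Fraction of all users of the language that the listed regions account for."""
    if total_speakers <= 0:
        raise ValueError("total_speakers must be positive")
    return sum(speakers_by_region.values()) / total_speakers


def meets_guideline(speakers_by_region: Dict[str, int],
                    total_speakers: int,
                    threshold: float = 0.75) -> bool:
    """True if at least one region is listed and the listed regions reach the threshold."""
    if not speakers_by_region:
        return False  # at least one country is required
    return coverage_share(speakers_by_region, total_speakers) >= threshold


# Hypothetical figures: first- and second-language users per region.
listed = {"AA": 6_000_000, "BB": 2_000_000}
print(meets_guideline(listed, total_speakers=10_000_000))
```

If the check fails, the remedy described above is to add further regions until the listed ones account for roughly three quarters of the language's users.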

***When requesting that a new locale be added, you must commit to supplying [the data required for the new locale to reach Basic level](https://cldr.unicode.org/index/cldr-spec/core-data-for-new-locales#h.yaraq3qjxnns) during the next open CLDR submission.***

For more information on the other coverage levels, refer to [Coverage Levels](https://cldr.unicode.org/index/cldr-spec/coverage-levels).

![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
108 changes: 108 additions & 0 deletions docs/site/index/cldr-spec/coverage-levels.md
@@ -0,0 +1,108 @@
---
title: Coverage Levels
---

# Coverage Levels

There are four main coverage levels as defined in the [UTS \#35: Unicode Locale Data Markup Language (LDML) Part 6: Supplemental: 8 Coverage Levels](https://www.unicode.org/reports/tr35/tr35-info.html#Coverage_Levels). They are described more fully below.

## Usage

Implementations can use the file **common/properties/coverageLevels.txt** (added in v41\) from a given release to filter the locales that they support. For example, see [coverageLevels.txt](https://github.com/unicode-org/cldr/blob/main/common/properties/coverageLevels.txt). (This and other links to data files point to the development versions; use the version that matches the release you are working with.) For a detailed chart of the coverage levels, see the [locale\_coverage.html](https://unicode-org.github.io/cldr-staging/charts/43/supplemental/locale_coverage.html) file for the respective release.

The file format is semicolon delimited, with 3 fields per line.


```Locale ID ; Coverage Level ; Name```
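
As a rough illustration of reading this format, the sketch below splits one line into its three fields. The names `CoverageEntry` and `parse_coverage_line` are hypothetical, and treating lines that start with `#` as comments is an assumption about the file rather than something stated here.

```python
from typing import NamedTuple, Optional


class CoverageEntry(NamedTuple):
    locale_id: str
    level: str
    name: str


def parse_coverage_line(line: str) -> Optional[CoverageEntry]:
    """Parse one 'Locale ID ; Coverage Level ; Name' line.

    Returns None for blank lines and for lines starting with '#'
    (assumed here to be comments).
    """
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    fields = [field.strip() for field in text.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return CoverageEntry(*fields)


print(parse_coverage_line("fr ; modern ; French"))
```

Stripping whitespace around each field keeps the parser tolerant of the spacing and tab alignment that such files often use.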

Each locale ID also covers all the locales that inherit from it. So to get locales at a desired coverage level or above, the following process is used.

1. Always include the root locale file, **root.xml**
2. Include all of the locale files listed in **coverageLevels.txt** at that level or above.
3. Recursively include all other files that inherit from the files in \#2\.
- **Warning**: Inheritance is not simple truncation; the **parentLocale** information in [supplementalData.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/supplementalData.xml) needs to be applied also. See [Parent\_Locales](https://www.unicode.org/reports/tr35/tr35.html#Parent_Locales).
- For example, if you include fr.xml in \#2, you would also include fr\_CA.xml; if you include no.xml in \#2 you would also include nn.xml.
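
The three steps above can be sketched as follows. This is a simplified illustration, not CLDR's implementation: `parent_of` and `select_locales` are hypothetical names, the explicit parent map stands in for the **parentLocale** data (only the `nn` to `no` relationship from the example above is used), and the truncation fallback ignores the finer points of the inheritance rules in the specification.

```python
from typing import Dict, Iterable, Set


def parent_of(locale: str, explicit_parents: Dict[str, str]) -> str:
    """Explicit parentLocale mapping first, then truncation, then root."""
    if locale in explicit_parents:
        return explicit_parents[locale]
    if "_" in locale:
        return locale.rsplit("_", 1)[0]
    return "root"


def select_locales(listed: Iterable[str],
                   available: Iterable[str],
                   explicit_parents: Dict[str, str]) -> Set[str]:
    """Return root, the listed locales, and every available locale that inherits from one."""
    listed_set = set(listed)
    selected = {"root"} | listed_set          # steps 1 and 2
    for locale in available:                  # step 3
        current, seen = locale, set()
        # Walk up the parent chain; stop at root or if a cycle is detected.
        while current != "root" and current not in seen:
            if current in listed_set:
                selected.add(locale)
                break
            seen.add(current)
            current = parent_of(current, explicit_parents)
    return selected


available = ["root", "fr", "fr_CA", "no", "nn", "de", "de_AT"]
print(sorted(select_locales({"fr", "no"}, available, {"nn": "no"})))
```

Walking each locale's parent chain, rather than matching on string prefixes, is what lets `nn` be picked up through `no` even though the two IDs share no prefix.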

### Filtering

To filter "at that level or above", you use the fact that basic ⊂ moderate ⊂ modern, so:

1. to filter for basic and above, filter for basic\|moderate\|modern
2. to filter for moderate and above, filter for moderate\|modern
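
One way to express "at that level or above" is to keep the three levels in an ordered sequence and slice from the requested minimum. The sketch below assumes entries are already available as `(locale_id, level)` pairs; the function names and the locale IDs `xx`, `yy`, and `zz` are hypothetical placeholders.

```python
from typing import Iterable, List, Set, Tuple

# Ordered from lowest to highest, per basic ⊂ moderate ⊂ modern.
LEVELS = ("basic", "moderate", "modern")


def levels_at_or_above(minimum: str) -> Set[str]:
    """Return the set of level names at or above the given minimum."""
    return set(LEVELS[LEVELS.index(minimum):])


def filter_locales(entries: Iterable[Tuple[str, str]], minimum: str) -> List[str]:
    """Keep the locale IDs whose level is at or above the minimum."""
    wanted = levels_at_or_above(minimum)
    return [locale_id for locale_id, level in entries if level in wanted]


entries = [("xx", "basic"), ("yy", "moderate"), ("zz", "modern")]
print(filter_locales(entries, "moderate"))
```

An unknown level name passed as `minimum` raises `ValueError` from `index`, which is usually preferable to silently returning an empty result.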

### Migration

As of v43, the files in **/seed/** have been moved to **/common/**. Older versions of CLDR separated some locale files into a 'seed' directory. Some implementations used that separation for filtering, but the criteria for moving from seed to common were not rigorous. To maintain compatibility with the set of locales used from previous versions, an implementation may use the above process for Basic and above, but then also add locales that were previously included. For more information, see the [CLDR 43 Release Note](https://cldr.unicode.org/index/downloads/cldr-43).

## Core Data

**The data needed for a new locale to be added. See [Core Data for New Locales](https://cldr.unicode.org/index/cldr-spec/core-data-for-new-locales) for details on Core Data and how to submit for new locales.**

**It is expected that during the next Survey Tool cycle after a new locale is added, the data for the Basic Coverage Level will be supplied.**

## Basic Data

**Suitable for locale selection and minimal support, e.g., choice of language on a mobile phone**

This includes very minimal data for support of the language: basic dates, times, autonyms:

1. Delimiter Data —Quotation start/end, including alternates
2. Numbering system — default numbering system \+ native numbering system (if default \= Latin and native ≠ Latin)
3. Locale Pattern Info — Locale pattern and separator, and code pattern
4. Language Names — in the native language for the native language and for English
5. Script Name(s) — Scripts customarily used to write the language
6. Country Name(s) — For countries where commonly used (see "Core XML Data")
7. Measurement System — metric vs UK vs US
8. Full Month and Day of Week names
9. AM/PM period names
10. Date and Time formats
11. Date/Time interval patterns — fallback
12. Timezone baseline formats — region, gmt, gmt\-zero, hour, fallback
13. Number symbols — decimal and grouping separators; plus, minus, percent sign (for Latin number system, plus native if different)
14. Number patterns — decimal, currency, percent, scientific

## Moderate Data

**Suitable for “document content” internationalization, e.g., content in a spreadsheet**

Before submitting data above the Basic Level, the following must be in place:

1. Plural and Ordinal rules
- As in \[supplemental/plurals.xml] and \[supplemental/ordinals.xml]
- Must also include minimal pairs
- For more information, see [cldr\-spec/plural\-rules](https://cldr.unicode.org/index/cldr-spec/plural-rules).
2. Casing information (only where the language uses a cased script according to [scriptMetadata.txt](https://github.com/unicode-org/cldr/blob/main/common/properties/scriptMetadata.txt))
- This will go into [common/casing](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/casing/)
3. Collation rules \[non\-Survey Tool]
- This can be supplied as a list of characters, or as a rule file.
- The list is a space\-delimited list of the characters used by the language (in the given script). The list may include multiple\-character strings, where those are treated specially. For example, if "ch" is sorted after "h" one might see "a b c d .. g h ch i j ..."
- More sophisticated users can do a better job, supplying a file of rules as in [cldr\-spec/collation\-guidelines](https://cldr.unicode.org/index/cldr-spec/collation-guidelines).
- The result will be a file like: [common/collation/ar.xml](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/collation/ar.xml) or [common/collation/da.xml](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/collation/da.xml).
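
To show how a space\-delimited character list with a multi\-character string such as "ch" implies an ordering, here is a minimal sketch. It is a single\-level, longest\-match comparison and not the collation algorithm CLDR actually uses; the name `make_sort_key` and the sample alphabet and words are hypothetical.

```python
from typing import Callable, List


def make_sort_key(alphabet: str) -> Callable[[str], List[int]]:
    """Build a sort key from a space-delimited list of letters in the desired order."""
    letters = alphabet.split()
    rank = {letter: position for position, letter in enumerate(letters)}
    longest = max(len(letter) for letter in letters)

    def key(word: str) -> List[int]:
        ranks: List[int] = []
        i = 0
        while i < len(word):
            # Prefer the longest listed string at this position, so "ch" wins over "c".
            for size in range(min(longest, len(word) - i), 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += size
                    break
            else:
                # Characters not in the list sort after all listed ones.
                ranks.append(len(rank) + ord(word[i]))
                i += 1
        return ranks

    return key


key = make_sort_key("a b c d e f g h ch i j")
print(sorted(["ia", "cha", "ha", "ca"], key=key))
```

With this list, "cha" sorts after "ha" because "ch" is treated as one unit ranked after "h", which is the behaviour the list format is meant to convey.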

The data for the Moderate Level includes subsets of the Modern data, both in depth and breadth.

## Modern Data

**Suitable for full UI internationalization**

Before submitting data above the Moderate Level, the following must be in place:

1. Grammatical Features
- The grammatical cases and other information, as in [supplemental/grammaticalFeatures.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/grammaticalFeatures.xml)
- Must include minimal pair values.
2. Romanization table (non\-Latin scripts only)
- This can be supplied as a spreadsheet or as a rule file.
- If a spreadsheet, give the corresponding Latin letter (or sequence) for each letter (or sequence) in the exemplars.
- More sophisticated users can do a better job, supplying a file of rules like [transforms/Arabic\-Latin\-BGN.xml](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/transforms/Arabic-Latin-BGN.xml).

The data for the Modern Level includes:

**\#\#\# TBD**

## References

For the coverage in the latest released version of CLDR, see [Locale Coverage Chart](https://unicode-org.github.io/cldr-staging/charts/latest/supplemental/locale_coverage.html).

To see the development version of the rules used to determine coverage, see [coverageLevels.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/coverageLevels.xml). For a list of the locales at a given level, see [coverageLevels.txt](https://github.com/unicode-org/cldr/blob/main/common/properties/coverageLevels.txt).

![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)