---
layout: default
title: Customization
nav_order: 3
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation Customization
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU uses the [CLDR root collation
order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
as a default starting point for ordering. (The CLDR root collation is based on
the [UCA
DUCET](http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table).)
Not all languages have sorting sequences that correspond with the root collation
order because no single sort order can simultaneously encompass the specifics of
all the languages. In particular, languages that share a script may sort the
same letters differently.

Therefore, ICU provides a data-driven, flexible, and run-time-customizable
mechanism called "tailoring". Tailoring overrides the default order of code
points and the values of the ICU Collation Service attributes.

## Collation Rule

A `RuleBasedCollator` is built from a rule string that changes the sort order of
some characters and strings relative to the default order. An empty string (or
one with only white space and comments) results in a collator that behaves like
the root collator.

A tailoring is specified via a string containing a set of rules. ICU implements
the (CLDR) [LDML collation rule
syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules); see that
specification for more details.

Each rule contains a string of ordered characters that starts with an **anchor
point** or a **reset value**. For example, `"&a < g"` places "g"
after "a" and before "b", and the "a" does not change place. This rule has the
following sorting consequences:

Without rule | With rule
------------ | ---------
Abernathy    | Abernathy
apple        | apple
bird         | green
Boston       | bird
Graham       | Boston
green        | Graham

Note that only the word that starts with "g" has changed place. All the words
that sorted after "a" and "A" now sort after "g".
This includes "Graham"; "G" would have to be tailored separately, such as with
`"&a < g <<< G"`.

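
The effect of `&a < g` on the word list can be imitated with a toy model. The sketch below is not how ICU computes collation elements: it assumes each ASCII letter has just a primary weight (its alphabet position) and a tertiary weight (case), and the helper names `default_weights`, `tailor_after`, and `sort_key` are made up for this illustration. It only shows why "green" moves while "Graham" stays put.

```python
import string

# Toy model (assumption): primary weight = alphabet position,
# tertiary weight = 0 for lowercase, 1 for uppercase.
def default_weights():
    weights = {}
    for index, letter in enumerate(string.ascii_lowercase):
        weights[letter] = (float(index), 0)
        weights[letter.upper()] = (float(index), 1)
    return weights

def tailor_after(weights, anchor, target):
    """Mimic `&anchor < target`: give target a new primary just after anchor."""
    tailored = dict(weights)
    anchor_primary = weights[anchor][0]
    # Only the tailored character moves; its uppercase form keeps its weight.
    tailored[target] = (anchor_primary + 0.5, weights[target][1])
    return tailored

def sort_key(word, weights):
    primaries = tuple(weights[ch][0] for ch in word)
    tertiaries = tuple(weights[ch][1] for ch in word)
    # Primary differences anywhere in the string outrank tertiary ones.
    return (primaries, tertiaries)

WORDS = ["green", "Graham", "Boston", "bird", "apple", "Abernathy"]
root = default_weights()
tailored = tailor_after(root, "a", "g")

without_rule = sorted(WORDS, key=lambda w: sort_key(w, root))
with_rule = sorted(WORDS, key=lambda w: sort_key(w, tailored))
```
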
This is a simple example of a tailoring rule. A tailoring consists of
zero or more rules and zero or more options. There must be at least one rule or
at least one option. The rule syntax is discussed in more detail in the
following sections.

Note that the tailoring rules override the UCA ordering. In addition, if a
character is reordered, it automatically reorders any other equivalent
characters. For example, if the rule `&e<a` is used to reorder "a" in the list,
"á" is also greater than "é".

## Syntax

The following table summarizes the basic syntax necessary for most usages:

Symbol | Example&nbsp; | Description
------ | ------------- | ----------------------------------
`<`    | `a < b`       | Identifies a primary (base letter) difference between "a" and "b"
`<<`   | `a << ä`      | Signifies a secondary (accent) difference between "a" and "ä"
`<<<`  | `a<<<A`       | Identifies a tertiary difference between "a" and "A"
`<<<<` | `か<<<<カ`     | Identifies a quaternary difference between "か" and "カ". (New in ICU 53.)
`=`    | `x = y`       | Signifies no difference between "x" and "y".
`&`    | `&Z`          | Instructs ICU to reset at this letter. These rules will be relative to this letter from here on, but will not affect the position of Z itself.

> :point_right: **Note**: ICU permits up to three quaternary relations in a row
> (except for intervening "=" identity relations).

> :point_right: **Note**: In releases prior to 1.8,
> ICU used the notation `;` to represent secondary relations and `,` to represent tertiary relations.
> Starting in release 1.8, use `<<` symbols to represent secondary relations and
> `<<<` symbols to represent tertiary relations.
> Rules that use the `;` and `,` notations are still processed by ICU for compatibility;
> also, some of the data used for tailoring to particular locales
> has not yet been updated to the new syntax.
> However, one should consider these symbols deprecated.

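
As a rough illustration of the correspondence between the legacy and current notations, the following sketch rewrites `;` as `<<` and `,` as `<<<`. The function `modernize_rule` is hypothetical and not part of ICU. It assumes that `;` and `,` inside single-quoted strings or after a backslash are literal text, and it makes no attempt to treat bracketed options such as `[suppressContractions ...]` specially, so check its output before using it on real tailoring data.

```python
def modernize_rule(rule: str) -> str:
    """Replace legacy ';' and ',' relations with '<<' and '<<<'.

    Assumption: text inside '...' and a character following a backslash
    (outside quotes) is literal and must be copied unchanged.
    """
    out = []
    i = 0
    in_quotes = False
    while i < len(rule):
        ch = rule[i]
        if ch == "\\" and not in_quotes and i + 1 < len(rule):
            out.append(rule[i:i + 2])  # escaped character: copy both
            i += 2
            continue
        if ch == "'":
            # A doubled apostrophe toggles twice, leaving the state as it was.
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == ";":
            out.append("<<")
        elif not in_quotes and ch == ",":
            out.append("<<<")
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```
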
> :point_right: **Note**: See the [LDML collation rule syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules)
> and [Properties and ICU Rule Syntax](../../strings/properties.md) for
> information regarding syntax characters.

Repeated use of the same relation can be abbreviated, for example
`&a <* bcd-gp-s` for `&a < b < c < d < e < f < g < p < q < r < s`.
For details see the
[LDML collation spec, section
Orderings](http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings).

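
To make the abbreviation concrete, here is a small sketch that expands a starred character list back into the long form. The helpers `expand_starred` and `expand_rule` are illustrative names only, and the code ignores escaping and the validity checks that the LDML specification requires for ranges.

```python
def expand_starred(chars: str) -> list[str]:
    """Expand a `<*`-style character list such as 'bcd-gp-s' into characters."""
    result = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch == "-" and result and i + 1 < len(chars):
            # A range: continue from the previous character up to the next one.
            start, end = ord(result[-1]) + 1, ord(chars[i + 1])
            result.extend(chr(cp) for cp in range(start, end + 1))
            i += 2
        else:
            result.append(ch)
            i += 1
    return result

def expand_rule(anchor: str, relation: str, chars: str) -> str:
    """Spell out `&anchor relation* chars` using the plain relation."""
    parts = [f"&{anchor}"] + expand_starred(chars)
    return f" {relation} ".join(parts)
```
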
### Escaping Rules

Most characters can be used as parts of rules. However, whitespace
characters are skipped, and all ASCII characters that are not digits or
letters are considered to be part of the syntax. To use these characters
literally in rules, they need to be escaped. Escaping can be done in several ways:

*   Single characters can be escaped using backslash **\\** (U+005C).

*   Strings can be escaped by putting them between single quotes **'like
    this'**.

*   The single quote (ASCII apostrophe) can be quoted using two single quotes
    **''**, both inside and outside single-quote-escaped strings.

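
When rule strings are assembled programmatically, a helper along the following lines can apply the quoting conventions listed above. This is only a sketch: `needs_escaping` and `quote_for_rule` are hypothetical names, and the check simply follows the description in this section (whitespace, or ASCII characters that are neither letters nor digits).

```python
def needs_escaping(text: str) -> bool:
    """True if text contains whitespace or ASCII punctuation/symbols."""
    for ch in text:
        if ch.isspace():
            return True
        if ord(ch) < 0x80 and not ch.isalnum():
            return True
    return False

def quote_for_rule(text: str) -> str:
    """Return text in a form usable as a literal inside a rule string."""
    if not needs_escaping(text):
        return text
    # Inside single quotes, an apostrophe is written as two apostrophes.
    return "'" + text.replace("'", "''") + "'"
```
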
### Simple Tailoring Examples

Serbian (Latin) or Croatian: `& C < č <<< Č < ć <<< Ć`

This rule is needed because the root collation order usually treats accented
letters as secondary differences from their base character. This rule ensures
that 'č' and 'ć' are treated as base letters.

UCA             | Tailoring: `& C < č <<< Č < ć <<< Ć`
--------------- | --------------
CUKIĆ RADOJICA  | CUKIĆ RADOJICA
ČUKIĆ SLOBODAN  | CUKIĆ SVETOZAR
CUKIĆ SVETOZAR  | CURIĆ MILOŠ
ČUKIĆ ZORAN     | CVRKALJ ÐURO
CURIĆ MILOŠ     | ČUKIĆ SLOBODAN
ĆURIĆ MILOŠ     | ČUKIĆ ZORAN
CVRKALJ ÐURO    | ĆURIĆ MILOŠ

Serbian (Latin) or Croatian: `& Ð < dž <<< Dž <<< DŽ`

This rule is an example of a contraction. "D" alone is sorted after "C" and "Ž"
is sorted after "Z", but "DŽ", due to the tailoring rule, is treated as a single
letter that gets sorted after "Đ" and before "E" ("Đ" sorts as a base letter
after "D" in the UCA). Another thing to note in this example is the capitalization
of the letter "DŽ". There are three versions, since all three can legally appear
in text. The fourth version "dŽ" is omitted since it does not occur.

UCA      | Tailoring: `& Ð < dž <<< Dž <<< DŽ`
-------- | ---------
dan      | dan
dubok    | dubok
džabe    | đak
džin     | džabe
Džin     | džin
DŽIN     | Džin
đak      | DŽIN
Evropa   | Evropa

Danish: `&V <<< w <<< W`

The letter 'W' is sorted after 'V', but is treated as a tertiary difference
similar to the difference between 'v' and 'V'.

UCA | `&V <<< w <<< W`
--- | ----------------
va  | va
Va  | Va
VA  | VA
vb  | wa
Vb  | Wa
VB  | WA
vz  | vb
Vz  | Vb
VZ  | VB
wa  | wb
Wa  | Wb
WA  | WB
wb  | vz
Wb  | Vz
WB  | VZ
wz  | wz
Wz  | Wz
WZ  | WZ

### Default Options

ICU implements the [LDML collation
options/settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options);
see that specification for more information.

The tailoring inherits all the attribute values from the root collator unless
they are explicitly redefined in the tailoring. The following summarizes
the option settings. Default values are shown **in bold**.

#### alternate
- **`[alternate non-ignorable]`**
- `[alternate shifted]`

Sets the default value of the `UCOL_ALTERNATE_HANDLING` attribute. If
set to shifted, variable code points will be ignored on the primary level.
For details see the [“Ignore Punctuation” Options](ignorepunct.md) page.

#### maxVariable
- **`[maxVariable punct]`**
- `[maxVariable space]`

Sets the variable top to the top of the specified
reordering group. (New in ICU 53.) All code points with primary weights less
than or equal to the variable top will be considered variable, and thus affected
by the alternate handling.

#### variable top
(deprecated)
- `& X < [variable top]`

Sets the default value for the variable top. All the code points with primary
weights less than or equal to the variable top will be considered variable.
*Changing the variable top via this rule syntax is deprecated since ICU 53.*
It has been replaced by the maxVariable option.

#### normalization
- **`[normalization off]`**
- `[normalization on]`

Turns on or off the `UCOL_NORMALIZATION_MODE` attribute.
If set to on, a quick check and necessary normalization will be performed.

#### strength
- `[strength 1]`
- `[strength 2]`
- **`[strength 3]`**
- `[strength 4]`
- `[strength I]`

Sets the default strength for the collator.

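
The effect of the strength setting can be pictured with a toy multi-level key. The sketch below is an assumption-laden model, not ICU's algorithm: it derives "primary" weights from base letters, "secondary" weights from combining accents (via NFD decomposition), and "tertiary" weights from case, then keeps only as many levels as the strength allows. The names `levels` and `sort_key` are made up for this example.

```python
import unicodedata

def levels(word: str):
    """Split a word into toy primary, secondary, and tertiary weights."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            # Accent: record it on the preceding base letter.
            secondary[-1] += (ord(ch),)
        else:
            primary.append(ch.lower())
            secondary.append(())
            tertiary.append(1 if ch.isupper() else 0)
    return tuple(primary), tuple(secondary), tuple(tertiary)

def sort_key(word: str, strength: int = 3):
    """Keep only the first `strength` levels, like `[strength N]` for N <= 3."""
    return levels(word)[:strength]
```
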
#### backwards
- `[backwards 2]`

Sets the default value of the `UCOL_FRENCH_COLLATION` attribute. If set to on,
weights on the secondary level will be reversed.

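
A classic illustration of backwards secondary ordering uses the words "cote", "coté", "côte", and "côté". The following sketch models it with a toy key that reverses the list of accent weights. It is a simplification under stated assumptions (accents found through NFD decomposition, one weight tuple per base letter), and `base_and_accents` and `sort_key` are hypothetical helper names.

```python
import unicodedata

def base_and_accents(word: str):
    """Return (base letters, accents per base letter) for a word."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += (ord(ch),)  # attach accent to preceding letter
        else:
            primary.append(ch.lower())
            secondary.append(())
    return tuple(primary), tuple(secondary)

def sort_key(word: str, backwards: bool = False):
    primary, secondary = base_and_accents(word)
    if backwards:
        # Compare accents starting from the end of the word.
        secondary = secondary[::-1]
    return (primary, secondary)

WORDS = ["côté", "coté", "côte", "cote"]
forward = sorted(WORDS, key=sort_key)
backward = sorted(WORDS, key=lambda w: sort_key(w, backwards=True))
```
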
#### caseLevel
- **`[caseLevel off]`**
- `[caseLevel on]`

Turns on or off the `UCOL_CASE_LEVEL` attribute. If set to on, a
level consisting only of case characteristics will be inserted in front of the
tertiary level. To ignore accents but take case into account, set strength to
primary and case level to on.

#### caseFirst
- **`[caseFirst off]`**
- `[caseFirst upper]`
- `[caseFirst lower]`

Sets the value for the `UCOL_CASE_FIRST` attribute. If set to
upper, upper case sorts before lower case. If set to lower, lower case
sorts before upper case. Useful for locales that have an already supported
ordering but require a different order of cases. Affects case and tertiary levels.

#### numericOrdering
- **`[numericOrdering off]`**
- `[numericOrdering on]`

Turns on or off the `UCOL_NUMERIC_COLLATION` attribute. If
set to on, then sequences of decimal digits (gc=Nd) sort by their numeric value.

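
The idea behind numeric ordering can be sketched with a key function that treats each run of decimal digits as one number. This is a toy stand-in rather than ICU's implementation; `numeric_key` is a hypothetical name, and the choice to place digit runs before other text is an assumption made for the example.

```python
import re

def numeric_key(text: str):
    """Split text into digit runs and other text; digit runs compare by value."""
    # In Python 3, \d on str matches Unicode decimal digits (gc=Nd).
    parts = [p for p in re.split(r"(\d+)", text) if p]
    # Tag each part so that numbers and text are never compared directly.
    return [(0, int(p), "") if p.isdecimal() else (1, 0, p) for p in parts]

NAMES = ["file10", "file2", "file1"]
plain = sorted(NAMES)
numeric = sorted(NAMES, key=numeric_key)
```
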
#### hiraganaQ
(deprecated)
- **`[hiraganaQ off]`**
- `[hiraganaQ on]`

Controls special treatment of Hiragana code points on the
quaternary level. If turned on, Hiragana code points will get lower values than
all the other non-variable code points. Strength must be greater than or equal to
quaternary for this attribute to take effect.
*hiraganaQ is deprecated since ICU 50.* It was an implementation detail of the
Japanese tailoring. In CLDR 25/ICU 53, the Japanese tailoring expresses the
differences between Hiragana and Katakana via explicit quaternary (`<<<<`)
relations.

#### suppressContractions
- `[suppressContractions [Љ-ґ]]`

Removes context-sensitive mappings (contractions and prefix/context-before mappings)
associated with each of the code points in the given UnicodeSet. It works on the
current set of rules: It removes mappings from the root collation as well as
from previous rules.

This is the only way to *remove* mappings: The rule syntax otherwise only adds
and overrides mappings. This special command is used in CLDR tailoring data to
remove Cyrillic root collation contractions that are not necessary in several
languages.

#### optimize
- `[optimize [Ά-ώ]]`

Performance optimization for the code points in the UnicodeSet.
In ICU, where tailoring data only contains the
mappings that are different from the root collation (otherwise the data would be
too large), falling back to root collation mappings for the rest of Unicode is
slightly slower. The optimize command copies mappings for additional characters
into the tailoring data.

#### reorder
followed by one or more reorder codes
- `[reorder Grek Hani space]`

Reorders scripts relative to each other and relative to a special set of
non-script blocks (space, punctuation, symbol, currency, and digit). The default
order is the same as in the DUCET and in the CLDR root collator.

----

A tailoring that consists only of options is also valid and has the same basic
ordering as the root collation. For example, the Greek tailoring has option
settings only: `[normalization on][reorder Grek]`

(The examples in this chapter might refer to older versions of data for
particular languages. Check CLDR or ICU for actual, current tailorings.)

The following tailoring example reorders uppercase and lowercase and uses
backwards-secondary ordering:

```
[caseFirst upper]
[backwards 2]
& C < č , Č
& G < ģ , Ģ
& I < y, Y
& K < ķ , Ķ
& L < ļ , Ļ
& N < ņ , Ņ
& S < š , Š
& Z < ž , Ž
```

#### Values for Reorder Codes

Reordering Group                         | Rule Value
---------------------------------------- | ----------
Unicode white space characters           | space
Unicode punctuation                      | punct
Unicode symbols except currency symbols  | symbol
Unicode currency symbols                 | currency
Unicode decimal digits                   | digit
Unicode scripts not mentioned ("others") | Zzzz (= Unknown script)

In addition, ISO **4-letter script codes** can be used. Codes for scripts that
do not have Unicode characters (according to the Unicode Script property values)
are ignored.

Limitations of ICU 4.8-52 (`Kore` remains unusable in later versions as well,
because it refers to multiple scripts that do not sort primary-equal):

*   For Chinese, use script code `Hani`, *not* `Hans` or `Hant`.
*   For Japanese, use both `Kana` and `Hani` (*not* `Hira`).
*   For Korean, use both `Hang` and `Hani` (*not* `Kore`).

#### Semantics of a List of Reorder Codes

This section is relevant for both the `[reorder ...]` rule syntax and the
`Collator.setReorderCodes()` API.

For an introduction and examples see the section “Script Reordering” in the
[Collation Concepts chapter](../concepts.md).

On the API, the special groups are represented with `Collator.ReorderCodes`
(`UColReorderCode`) values rather than `UScript` (`UScriptCode`) values.

In ICU 4.8-54, not every script could be reordered independently. CLDR and ICU
supported reordering of groups of scripts, each of which started with one of the
[Recommended
Scripts](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). A
script that is not Recommended always moved together with the Recommended Script
that precedes it in DUCET order. (Hiragana sorts together with Katakana, Coptic
with Greek, etc.) ICU allowed any one script of a (Recommended Script +
DUCET-following) group in the `[reorder]` list, moving the whole set of scripts
together. However, it was strongly recommended that only Recommended Scripts be
used.

Beginning with ICU 55, scripts only reorder together if they are primary-equal,
for example Hiragana and Katakana.

Zyyy=Common and Zinh=Inherited cannot be reordered.

The special code Zzzz (= Unknown script = `UScript.UNKNOWN` =
`Collator.ReorderCodes.OTHERS` = "others") stands for any script that is not
explicitly mentioned in the list of reordering codes. If Zzzz is mentioned in
the list, then any groups and scripts mentioned later in the list will go at the
very end of the reordering, in the order given. If Zzzz is not mentioned, then
all scripts that are not explicitly listed follow at the end in DUCET order.

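
The role of Zzzz in a reorder list can be sketched as follows. This is a simplified model built only from the description above: `apply_reorder` is a hypothetical function, `SAMPLE_ORDER` is a made-up subset of scripts standing in for the default order, and the special groups (space, punct, and so on) as well as Zyyy and Zinh are left out entirely.

```python
OTHERS = "Zzzz"

# Hypothetical, heavily abbreviated stand-in for the default script order.
SAMPLE_ORDER = ["Latn", "Grek", "Cyrl", "Arab", "Hani"]

def apply_reorder(default_order, reorder_codes):
    """Return the resulting script order for a list of reorder codes."""
    before, after = [], []
    target = before
    for code in reorder_codes:
        if code == OTHERS:
            # Codes listed after Zzzz go to the very end, in the order given.
            target = after
        else:
            target.append(code)
    listed = set(before) | set(after)
    # Everything not listed keeps its default relative order.
    rest = [code for code in default_order if code not in listed]
    return before + rest + after
```
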
The special reorder code `Collator.ReorderCodes.NONE` (= `UScript.UNKNOWN`), when
used alone (same as `[reorder Zzzz]` or not specifying a `[reorder]` rule in a
tailoring), will remove any reordering for this collator. The result of setting
no reordering will be to use the DUCET/CLDR order.

On the API (not applicable to rule syntax), the special reorder code
`Collator.ReorderCodes.DEFAULT` (= `UScript.INHERITED`) will reset the reordering
for the collator to its default order. The default reordering may be the
DUCET/CLDR order or may be a reordering that was specified when this collator
was created from resource data or from rules. The DEFAULT code must be the sole
code supplied when it is used.

For details see the [section “Collation Reordering” in the LDML collation
spec](http://www.unicode.org/reports/tr35/tr35-collation.html#Script_Reordering).

### Advanced Syntactical Elements

Several other syntactical elements are needed in more specific situations.

#### Order before

- Syntax: `[before 1|2|3]`
- Example: `&[before 2]a<ā<á<ǎ<à`

Enables users to order characters **before** a given character. In UCA 3.0, the
example is equivalent to `& ㍡<ā<á<ǎ<à` (㍡ = \\u3361, ideographic telegraph symbol
for hour nine) and makes accented 'a' letters sort before 'a'. Accents are often
used to indicate the tones in Pinyin. In this case, the non-accented
letters sort after the accented letters.

#### Expansion

- Syntax: `/`
- Example: `æ/e`

Adds the collation element for 'e' to the collation element for 'æ'.
The rule `&ae << æ` is equivalent to `&a << æ/e`. See the Expansion example
below.

#### Prefix processing

- Syntax: `|`
- Example: `a|b`

If 'b' is encountered and it follows 'a',
output the collation element specified by the rule. If 'b' follows any other letter,
output the normal collation element for 'b'.
The collation element for 'a' is not affected.

This element is used to speed up sorting under JIS X 4061. See the
Prefix example below.

#### Reset to top

- Syntax: `[top]`
- Example: `&[top] < a < b < c …`

**Deprecated, use indirect positioning instead**
(`&[last regular]`, see section below).

Reorders a set of characters 'above' the UCA. `[top]` is a virtual code point having the
biggest primary weight value that will ever be assigned in the UCA. Above top,
there is a large number of unassigned primary weights that can be used for a
'large' tailoring, such as the reordering of the CJK characters according to a
Far Eastern code page. The first difference after the top is always primary.

2e5b6d6dSopenharmony_ci### Indirect Positioning of Collation Elements
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciSince ICU version 2.0, ICU allows for indirect positioning of collation elements
2e5b6d6dSopenharmony_ci(CE). Similar to the reset anchor `top`, these reset anchors allow for positioning of the
2e5b6d6dSopenharmony_citailoring relative to significant sections of the UCA table. You can use the
2e5b6d6dSopenharmony_ci`[before]` reset option to position before these sections.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciName                      | Example CE value  | Note
2e5b6d6dSopenharmony_ci------------------------- | ----------------- | ------------
2e5b6d6dSopenharmony_cifirst tertiary ignorable  | `[,,]`            | Start of the UCA table. This value will never change unless CEs are extended with higher level values.
2e5b6d6dSopenharmony_cilast tertiary ignorable   | `[,,]`            | This value will never change unless CEs are extended with higher level values.
2e5b6d6dSopenharmony_cifirst secondary ignorable | `[,, 05]`         | Currently there are no secondary ignorables in the UCA table.
2e5b6d6dSopenharmony_cilast secondary ignorable  | `[,, 05]`         | Currently there are no secondary ignorables in the UCA table.
2e5b6d6dSopenharmony_cifirst primary ignorable   | `[, 87, 05]`      | Mostly for non-spacing combining marks.
2e5b6d6dSopenharmony_cilast primary ignorable    | `[, E1 B1, 05]`   | Currently this value points to a non-existing code point, used to facilitate sorting of compatibility characters.
2e5b6d6dSopenharmony_cifirst variable            | `[05 07, 05, 05]` | The lowest CE that is not primary-ignorable. (see below)
2e5b6d6dSopenharmony_cilast variable             | `[17 9B, 05, 05]` | End of variable section.
2e5b6d6dSopenharmony_cifirst regular             | `[1A 20, 05, 05]` | This is the first regular CE (not primary ignorable and not variable). The majority of code points have regular CEs.
2e5b6d6dSopenharmony_cilast regular              | `[78 AA B2, 05, 05]` | Use `&[last regular]` instead of `&[top]`. (see below)
2e5b6d6dSopenharmony_cifirst implicit            | `[E0 03 03, 05, 05]` | Section of implicitly generated collation elements. (see below)
2e5b6d6dSopenharmony_cilast implicit             | `[E3 DC 70 C0, 05, 05]` | End of implicit section. This is the CE of the last unassigned code point (U+10FFFD). (see below)
2e5b6d6dSopenharmony_cifirst trailing            | `[E5, 05, 05]`    | Start of trailing section. (see below)
2e5b6d6dSopenharmony_cilast trailing             | `[FF FF, 05, 05]` | End of trailing collation elements section. This is the highest possible CE, and is the CE for U+FFFF. Not available for tailoring, see `[first trailing]`.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"first variable": The current code point is TAB=U+0009. This is the start of the variable section. "Variable" characters will be ignored on primary/secondary/tertiary levels when the "shifted" option is on.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciTailoring after "last regular" will effectively position characters
2e5b6d6dSopenharmony_cibetween regular code points and "implicit" CEs (the next section).
2e5b6d6dSopenharmony_ciThis should be used (only) for tailoring Han characters
2e5b6d6dSopenharmony_ciwhich tends to affect thousands of characters.
2e5b6d6dSopenharmony_ciThe script reordering implementation assumes that CEs in this section
2e5b6d6dSopenharmony_ciare for "Hani" script characters.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"Implicit" means that the UCA default ordering table (DUCET)
2e5b6d6dSopenharmony_cidoes not explicitly specify CEs for CJK ideographs and unassigned code points;
2e5b6d6dSopenharmony_ciinstead, their CEs are computed at runtime.
2e5b6d6dSopenharmony_ci
Beginning with ICU 53, tailoring to any unassigned code point,
including "last implicit", is no longer supported.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"trailing": Tailoring characters after `[first trailing]`
2e5b6d6dSopenharmony_cimakes them sort after all other non-tailored code points except for U+FFFD and U+FFFF.
2e5b6d6dSopenharmony_ci
The "trailing" section is reserved for future use, such as for non-starting Jamos. See
2e5b6d6dSopenharmony_ci<http://www.unicode.org/reports/tr10/#Trailing_Weights>.
2e5b6d6dSopenharmony_ciCLDR 1.9/ICU 4.6 and later map U+FFFF to the very end of the trailing section.
2e5b6d6dSopenharmony_ciUCA 6.3/CLDR 24/ICU 52 and later map U+FFFD to just before U+FFFF.
2e5b6d6dSopenharmony_ciU+FFFD..U+FFFF are not tailorable, and nothing can tailor to them.
2e5b6d6dSopenharmony_ci<http://www.unicode.org/reports/tr35/tr35-collation.html#tailored_noncharacter_weights>
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciBefore ICU 4.6, U+FFFF mapped to a completely ignorable CE, and `[last trailing]`
2e5b6d6dSopenharmony_ciwas the same as `[first trailing]`.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciNot all of the indirect-positioning anchors are useful. Most of the 'first'
2e5b6d6dSopenharmony_cielements should be used with the `[before]` directive, in order to make sure
2e5b6d6dSopenharmony_cithat your tailoring will sort before an interesting section.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci### Complex Tailoring Examples
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe following are several fragments of real tailorings, illustrating some of the
2e5b6d6dSopenharmony_ciadvanced syntactical elements:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci#### Expansion Example:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci**Swedish:**
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&t<<<þ/h
2e5b6d6dSopenharmony_ci&T<<<Þ/H
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe letter 'þ' (THORN) is normally treated by UCA/root collation as a separate
2e5b6d6dSopenharmony_ciletter that has primary-level sorting after 'z'. However, in Swedish and some
2e5b6d6dSopenharmony_ciother Scandinavian languages, 'þ' and 'Þ' should be treated as just a
2e5b6d6dSopenharmony_citertiary-level difference from the letters "th" and "TH" respectively. This is
2e5b6d6dSopenharmony_cian example of an expansion.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciUCA | `&t<<<þ/h, &T<<<Þ/H`
2e5b6d6dSopenharmony_ci--- | --------------------
2e5b6d6dSopenharmony_ciaz  | az
2e5b6d6dSopenharmony_ciAz  | Az
2e5b6d6dSopenharmony_citha | tha
2e5b6d6dSopenharmony_ciTha | þa
2e5b6d6dSopenharmony_ciTHa | Tha
2e5b6d6dSopenharmony_cithz | THa
2e5b6d6dSopenharmony_ciza  | Þa
2e5b6d6dSopenharmony_ciZa  | thz
2e5b6d6dSopenharmony_cizz  | þz
2e5b6d6dSopenharmony_ciþa  | za
2e5b6d6dSopenharmony_ciÞa  | Za
2e5b6d6dSopenharmony_ciþz  | zz
2e5b6d6dSopenharmony_ci
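The two columns of the table can be reproduced with a toy model of the expansion. The sketch below is not ICU code: it assumes made-up primary weights (a..z as 1..26, thorn as 27), ignores the secondary level, and uses four hypothetical tertiary weights so that a tailored letter lands just after its reset letter. With the tailoring switched on, each thorn becomes two collation elements, one derived from "t" or "T" and one copied from "h" or "H".

```java
import java.util.ArrayList;
import java.util.List;

final class ThornExpansionSketch {
    // Toy tertiary weights: lower < tailored-after-lower < upper < tailored-after-upper.
    static final int LOWER = 1, AFTER_LOWER = 2, UPPER = 3, AFTER_UPPER = 4;

    /** Toy primary weight: a..z map to 1..26; thorn gets 27 so that it sorts after 'z'. */
    static int primary(char c) {
        char lower = Character.toLowerCase(c);
        return lower == '\u00FE' ? 27 : lower - 'a' + 1;
    }

    /** Returns {primary, tertiary} pairs; 'tailored' switches the two expansion rules on. */
    static List<int[]> elements(String s, boolean tailored) {
        List<int[]> ces = new ArrayList<>();
        for (char c : s.toCharArray()) {
            if (tailored && c == '\u00FE') {          // &t<<<thorn/h
                ces.add(new int[] {primary('t'), AFTER_LOWER});
                ces.add(new int[] {primary('h'), LOWER});
            } else if (tailored && c == '\u00DE') {   // &T<<<THORN/H
                ces.add(new int[] {primary('T'), AFTER_UPPER});
                ces.add(new int[] {primary('H'), UPPER});
            } else {
                ces.add(new int[] {primary(c), Character.isUpperCase(c) ? UPPER : LOWER});
            }
        }
        return ces;
    }

    /** Compares all primary weights first, then all tertiary weights. */
    static int compare(String a, String b, boolean tailored) {
        List<int[]> x = elements(a, tailored), y = elements(b, tailored);
        for (int level = 0; level < 2; level++) {
            for (int i = 0; i < Math.min(x.size(), y.size()); i++) {
                int diff = Integer.compare(x.get(i)[level], y.get(i)[level]);
                if (diff != 0) return diff;
            }
            if (x.size() != y.size()) return Integer.compare(x.size(), y.size());
        }
        return 0;
    }

    /** Returns a sorted copy of the words under the root-like or the tailored toy order. */
    static List<String> sort(List<String> words, boolean tailored) {
        List<String> out = new ArrayList<>(words);
        out.sort((a, b) -> compare(a, b, tailored));
        return out;
    }
}
```

Calling `sort(words, false)` on the twelve strings gives the left column, and `sort(words, true)` gives the right column.
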
2e5b6d6dSopenharmony_ci#### Prefix Example:
2e5b6d6dSopenharmony_ci
Prefixes are used in Japanese tailorings to reduce the number of contractions. A
large number of contractions is a performance burden on the commonly used base
characters, because processing a contraction is much more complicated than
processing a regular element.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciA prefix rule conditionally changes the CE of the character or string (e.g., ー)
2e5b6d6dSopenharmony_ciafter the | symbol; unlike a contraction, it does not affect the CE of the
2e5b6d6dSopenharmony_cipreceding text (e.g., ァ). (By contrast, a contraction like ァー consumes both
2e5b6d6dSopenharmony_cicharacters and can assign them a CE or expansion unrelated to ァ's CE.) A prefix
2e5b6d6dSopenharmony_cirule is especially useful if the character or string (ー) after the | symbol
2e5b6d6dSopenharmony_cioccurs significantly less often than the first character of the prefix (ァ).
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&[before 3]ァ <<< ァ|ー = ｧ|ー = ぁ|ー
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThis could have been written as a series of contractions followed by expansion:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&[before 3]ァー <<< ァー = ｧー = ぁー
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciHowever, in that case ァ, ｧ and ぁ would start contractions. Since the prolonged
2e5b6d6dSopenharmony_cisound mark (ー) occurs much less frequently than the other letters of Japanese
2e5b6d6dSopenharmony_ciKatakana and Hiragana, it is much more prudent to put the extra processing on it
2e5b6d6dSopenharmony_ciby using prefixes.
2e5b6d6dSopenharmony_ci
#### Reset Example:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciA "reset" always uses only the base character as the insertion point even if
2e5b6d6dSopenharmony_cithere is an expansion. So the following rule,
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci& J <<< K / B & K <<< M
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciis equivalent to
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci& J <<< K / B <<< M
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
which produces the following sort order:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"JA"
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"MA"
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"KA"
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"KC"
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"JC"
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci"MC"
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci> :point_right: **Note**: Assuming the letters "J", "K" and "M" have equal primary weights, the second
2e5b6d6dSopenharmony_ci> letter contains the differences among these strings. However, the letter "K" is
2e5b6d6dSopenharmony_ci> treated as if it always has a letter "B" following it while the letters "J" and
2e5b6d6dSopenharmony_ci> "M" do not.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe following is an example of collation elements for these strings resulting
2e5b6d6dSopenharmony_cifrom the specified rules:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciStrings | Collation Elements | &nbsp;         | &nbsp;
2e5b6d6dSopenharmony_ci------- | ------------------ | -------------- | ------
2e5b6d6dSopenharmony_ci"JA"    | `[005C.00.01]`     | `[0052.00.01]` |
2e5b6d6dSopenharmony_ci"MA"    | `[005C.00.03]`     | `[0052.00.01]` |
2e5b6d6dSopenharmony_ci"KA"    | `[005C.00.02]`     | `[0053.00.01]` | `[0052.00.01]`
2e5b6d6dSopenharmony_ci"KC"    | `[005C.00.02]`     | `[0053.00.01]` | `[0054.00.01]`
2e5b6d6dSopenharmony_ci"JC"    | `[005C.00.01]`     | `[0054.00.01]` |
2e5b6d6dSopenharmony_ci"MC"    | `[005C.00.03]`     | `[0054.00.01]` |
2e5b6d6dSopenharmony_ci
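To see how the collation elements in the table lead to the sort order shown earlier, the following sketch builds a simplified sort key for each string: all primary weights, a separator, all secondary weights, a separator, then all tertiary weights. The class and method names are hypothetical, and the key layout is a simplification rather than ICU's actual sort key format. The weights are copied from the table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SortKeySketch {
    /** CEs from the table, each written as {primary, secondary, tertiary}. */
    static final Map<String, int[][]> CES = new LinkedHashMap<>();
    static {
        CES.put("JA", new int[][] {{0x5C, 0x00, 0x01}, {0x52, 0x00, 0x01}});
        CES.put("JC", new int[][] {{0x5C, 0x00, 0x01}, {0x54, 0x00, 0x01}});
        CES.put("KA", new int[][] {{0x5C, 0x00, 0x02}, {0x53, 0x00, 0x01}, {0x52, 0x00, 0x01}});
        CES.put("KC", new int[][] {{0x5C, 0x00, 0x02}, {0x53, 0x00, 0x01}, {0x54, 0x00, 0x01}});
        CES.put("MA", new int[][] {{0x5C, 0x00, 0x03}, {0x52, 0x00, 0x01}});
        CES.put("MC", new int[][] {{0x5C, 0x00, 0x03}, {0x54, 0x00, 0x01}});
    }

    /** Builds a simplified key: one run of non-zero weights per level, separated by 0. */
    static String sortKey(int[][] ces) {
        StringBuilder key = new StringBuilder();
        for (int level = 0; level < 3; level++) {
            if (level > 0) key.append((char) 0);   // separator, lower than any weight
            for (int[] ce : ces) {
                if (ce[level] != 0) key.append((char) ce[level]);   // zero weights are skipped
            }
        }
        return key.toString();
    }

    /** Sorts the six strings by comparing their keys code unit by code unit. */
    static List<String> sortedStrings() {
        List<String> names = new ArrayList<>(CES.keySet());
        names.sort((a, b) -> sortKey(CES.get(a)).compareTo(sortKey(CES.get(b))));
        return names;
    }
}
```

The extra `[0053.00.01]` element contributed by the expansion is what places "KA" and "KC" between the strings ending in "A" and those ending in "C".
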
2e5b6d6dSopenharmony_ci## Tailoring Issues
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciICU uses canonical closure. This means that for each code point in Unicode, if
2e5b6d6dSopenharmony_cithe canonically composed form of a tailored string produces different collation
2e5b6d6dSopenharmony_cielements than the canonically decomposed form, then the canonically composed
2e5b6d6dSopenharmony_ciform is effectively added to the ordering. If 'a' is tailored, for example, all
2e5b6d6dSopenharmony_ciof the accented 'a' characters are also tailored. Canonical closure allows
2e5b6d6dSopenharmony_cicollators to process Unicode strings in the FCD form as well as in NFD. (Note:
2e5b6d6dSopenharmony_ciMost but not all NFC strings are also in FCD. See
2e5b6d6dSopenharmony_ci<http://www.unicode.org/notes/tn5/#FCD>)
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciHowever, *compatibility* equivalents are NOT automatically added. If the rule
2e5b6d6dSopenharmony_ci"&b < a" is in tailoring, and the order of **ⓐ (circled a)** is important, it
2e5b6d6dSopenharmony_cineeds to be tailored **explicitly**.
2e5b6d6dSopenharmony_ci
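The difference between the two kinds of equivalence can be checked with the JDK's `java.text.Normalizer`, independently of ICU. In this small sketch (the class and method names are hypothetical), a precomposed accented "a" exposes its base letter under canonical decomposition (NFD), while the circled "a" does so only under compatibility decomposition (NFKD), which is why closure does not cover it.

```java
import java.text.Normalizer;

final class EquivalenceSketch {
    /** True if the canonical decomposition (NFD) of s starts with the given base letter. */
    static boolean canonicalBaseIs(String s, String base) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).startsWith(base);
    }

    /** True if the compatibility decomposition (NFKD) of s starts with the given base letter. */
    static boolean compatibilityBaseIs(String s, String base) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD).startsWith(base);
    }

    public static void main(String[] args) {
        // U+00E0 is "a with grave"; U+24D0 is "circled latin small letter a".
        System.out.println(canonicalBaseIs("\u00E0", "a"));      // canonical equivalent of a + grave
        System.out.println(canonicalBaseIs("\u24D0", "a"));      // not canonically related to a
        System.out.println(compatibilityBaseIs("\u24D0", "a"));  // only a compatibility equivalent
    }
}
```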
Redundant tailoring rules are removed, with later rules "winning". The strengths
of the relations around a removed rule are adjusted accordingly.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci### Example:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe following table summarizes effects of different redundant rules.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci&nbsp; | Original                                                  | Equivalent
2e5b6d6dSopenharmony_ci------ | --------------------------------------------------------- | ----------
2e5b6d6dSopenharmony_ci1      | `& a < b < c < d` `& r < c`                               | `& a < b < d` `& r < c`
2e5b6d6dSopenharmony_ci2      | `& a < b < c < d` `& c < m`                               | `& a < b < c < m < d`
2e5b6d6dSopenharmony_ci3      | `& a < b < c < d` `& a < m`                               | `& a < m < b < c < d`
2e5b6d6dSopenharmony_ci4      | `& a <<< b << c < d` `& a < m`                            | `& a <<< b << c < m < d`
2e5b6d6dSopenharmony_ci5      | `& a < b < c < d` `& [before 1] c < m`                    | `& a < b < m < c < d`
2e5b6d6dSopenharmony_ci6      | `& a < b <<< c << d <<< e` `& [before 3] e <<< x`         | `& a < b <<< c << d <<< x <<< e`
2e5b6d6dSopenharmony_ci7      | `& a < b <<< c << d <<< e` `& [before 2] e <<< x`         | `& a < b <<< c <<< x << d <<< e`
2e5b6d6dSopenharmony_ci8      | `& a < b <<< c << d <<< e` `& [before 1] e <<< x`         | `& a <<< x < b <<< c << d <<< e`
2e5b6d6dSopenharmony_ci9      | `& a < b <<< c << d <<< e <<< f < g` `& [before 1] g < x` | `& a < b <<< c << d <<< e <<< f < x < g`
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciIf two different reset lists tailor the same character, then it is removed from the first
2e5b6d6dSopenharmony_cione (see 1 in the table above).
2e5b6d6dSopenharmony_ciIf the second list resets to a character tailored in the first list, then the second
2e5b6d6dSopenharmony_cilist is inserted in the first (see 2).
2e5b6d6dSopenharmony_ciIf both lists reset to the same character, then the same thing
2e5b6d6dSopenharmony_cihappens (see 3). Whenever such an insertion occurs, the second strength
2e5b6d6dSopenharmony_ci"postpones" the position (see 4).
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciIf there is a `[before N]` on the reset, then the reset character is
2e5b6d6dSopenharmony_cieffectively replaced by the item that would be before it, either in a previous
2e5b6d6dSopenharmony_citailoring (if the letter occurs in one - see 5) or in the UCA. The N determines
2e5b6d6dSopenharmony_cithe 'distance' before, based on the strength of the difference (see 6-8).
2e5b6d6dSopenharmony_ciHowever, this is subject to postponement (see 9), so be careful!
2e5b6d6dSopenharmony_ci
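Rows 5 to 9 of the table can be reproduced with a small model of how `[before N]` picks its effective reset point. This is a sketch of the behavior described above, not ICU's implementation. It handles only a single space-separated chain in which the target has already been tailored, and the class and method names are hypothetical. The fallback to the UCA order is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

final class BeforeAnchorSketch {
    /** One tailored item: its text and the strength (1..3) of its relation to the previous item. */
    static final class Item {
        final String text;
        final int strength;
        Item(String text, int strength) { this.text = text; this.strength = strength; }
    }

    /** Parses a single chain such as "a < b <<< c << d <<< e" (tokens separated by spaces). */
    static List<Item> parse(String chain) {
        String[] tok = chain.trim().split("\\s+");
        List<Item> items = new ArrayList<>();
        items.add(new Item(tok[0], 0));                        // the reset anchor itself
        for (int i = 1; i + 1 < tok.length; i += 2) {
            items.add(new Item(tok[i + 1], tok[i].length())); // "<"=1, "<<"=2, "<<<"=3
        }
        return items;
    }

    /** Returns the item that "& [before n] target" effectively resets to. */
    static String beforeAnchor(String chain, String target, int n) {
        List<Item> items = parse(chain);
        int i = 0;
        while (!items.get(i).text.equals(target)) i++;
        // Walk back to the first item of the group that differs from target only below level n.
        while (items.get(i).strength > n) i--;
        // The effective reset point is whatever immediately precedes that group.
        return items.get(i - 1).text;
    }
}
```

For example, `beforeAnchor("a < b <<< c << d <<< e", "e", 2)` returns `"c"`, so the new item is inserted right after `c`, as in row 7.
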
2e5b6d6dSopenharmony_ci### Reset semantics
2e5b6d6dSopenharmony_ci
The reset semantics in ICU 1.8 and later differ from those of previous ICU
releases. Prior to version 1.8, the expansion implied by a multi-character reset
applied only to the entry immediately following the reset entry. Starting with
version 1.8, it applies to all entries that occur until the next reset or
primary relation.
2e5b6d6dSopenharmony_ciFor example,
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&xyz << e <<< f
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciwas equivalent to
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&x << e/yz <<< f
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciprior to ICU version 1.8.
2e5b6d6dSopenharmony_ci
Starting with ICU version 1.8, the same rule is equivalent to
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&x << e/yz <<< f/yz
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe new semantic produces more intuitive results, especially when the character
2e5b6d6dSopenharmony_ciafter the reset is decomposable. Since all rules are converted to NFD before
2e5b6d6dSopenharmony_cithey are interpreted, this can result in contractions that the rule-writer might
2e5b6d6dSopenharmony_cinot be aware of. Expansion propagates only until the next reset or primary
2e5b6d6dSopenharmony_cirelation occurs.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciFor example, the following rule:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&ab = c <<< d << e <<< f < g <<< h
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciwas equivalent to the following prior to ICU 1.8 and in Java:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&a = c/b <<< d << e <<< f < g <<< h
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciStarting with 1.8, it is equivalent to
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&a = c / b <<< d / b << e / b <<< f / b < g <<< h
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
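The propagation described above can be written as a small string rewrite. The sketch below is illustrative only and is not ICU's rule parser. It assumes that tokens are separated by spaces, that the reset is written as `&` immediately followed by its text, and that no NFD conversion is needed. The class and method names are hypothetical.

```java
final class ResetExpansionSketch {
    /**
     * Rewrites a rule whose reset has several characters into the explicit
     * "/ expansion" form that ICU 1.8 and later treat it as.
     */
    static String makeExplicit(String rule) {
        StringBuilder out = new StringBuilder();
        String expansion = "";                        // text propagated to the following items
        for (String tok : rule.trim().split("\\s+")) {
            if (tok.startsWith("&")) {
                String reset = tok.substring(1);
                int first = reset.offsetByCodePoints(0, 1);   // keep only the base character
                expansion = reset.substring(first);
                if (out.length() > 0) out.append(' ');
                out.append('&').append(reset, 0, first);
            } else if (tok.equals("<")) {
                expansion = "";                       // a primary relation stops the propagation
                out.append(" <");
            } else if (tok.equals("<<") || tok.equals("<<<") || tok.equals("=")) {
                out.append(' ').append(tok);
            } else {
                out.append(' ').append(tok);          // a tailored item
                if (!expansion.isEmpty()) out.append(" / ").append(expansion);
            }
        }
        return out.toString();
    }
}
```

Applied to `&ab = c <<< d << e <<< f < g <<< h`, this returns the expanded rule shown in the last code block above.
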
2e5b6d6dSopenharmony_ci## Known Limitations
2e5b6d6dSopenharmony_ci
The following are known limitations of the ICU collation implementation. They
are theoretical limitations, since there are no known languages for which they
are an issue. For completeness, however, they should be fixed in a future
version after 1.8.1. The examples given are designed for simplicity in testing,
and do not match any real languages.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci### Expansion
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe goal of expansion is to sort as if the expansion text were inserted right
2e5b6d6dSopenharmony_ciafter the character. For example, with the rule
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&a <<< c / e
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe text "...**c**..." should sort as if it were right after "...**ae**..." with
2e5b6d6dSopenharmony_cia tertiary difference. There are a few cases where this is not currently true.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci#### Recursive Expansion
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciGiven the rules
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&a <<< c / e
2e5b6d6dSopenharmony_ci&g <<< e / I
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
Expansion should sort the text "...**c**..." as if it were just after
"...**ae**...", and that should also sort as if it were just after
"...**agI**...". This requires that the compilation of expansions be recursive
(and check for loops as well!). ICU currently does not do this.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciRules         | Desired Order | Current Order
2e5b6d6dSopenharmony_ci------------- | ------------- | -------------
2e5b6d6dSopenharmony_ci`& a = b / c` | add           | b
2e5b6d6dSopenharmony_ci`& d = c / e` | b             | add
2e5b6d6dSopenharmony_ci&nbsp;        | adf           | adf
2e5b6d6dSopenharmony_ci
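What "recursive, with a loop check" would involve can be sketched with a plain string substitution. This is a toy model, not ICU code: it ignores strengths entirely, treats each expansion rule as "this character sorts like that text" (`c` like `ae`, `e` like `gI`, following the two rules at the start of this section), and uses hypothetical class and method names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RecursiveExpansionSketch {
    /** The two rules from this section, with strengths ignored. */
    static Map<Character, String> exampleRules() {
        Map<Character, String> rules = new HashMap<>();
        rules.put('c', "ae");   // &a <<< c / e
        rules.put('e', "gI");   // &g <<< e / I
        return rules;
    }

    /** Replaces each tailored character once, without looking inside the replacement. */
    static String expandOnce(String text, Map<Character, String> rules) {
        StringBuilder out = new StringBuilder();
        for (char ch : text.toCharArray()) {
            out.append(rules.containsKey(ch) ? rules.get(ch) : String.valueOf(ch));
        }
        return out.toString();
    }

    /** Expands replacements recursively and rejects rule sets that loop. */
    static String expandFully(String text, Map<Character, String> rules) {
        return expand(text, rules, new HashSet<Character>());
    }

    private static String expand(String text, Map<Character, String> rules, Set<Character> active) {
        StringBuilder out = new StringBuilder();
        for (char ch : text.toCharArray()) {
            String replacement = rules.get(ch);
            if (replacement == null) {
                out.append(ch);                       // untailored character
            } else if (!active.add(ch)) {
                throw new IllegalStateException("expansion loop through '" + ch + "'");
            } else {
                out.append(expand(replacement, rules, active));
                active.remove(ch);                    // done with this character's expansion
            }
        }
        return out.toString();
    }
}
```

In this model `expandOnce("c", exampleRules())` stops at `"ae"`, which corresponds to the non-recursive behavior, while `expandFully` continues to `"agI"`.
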
2e5b6d6dSopenharmony_ci#### Contractions Spanning Expansions
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciICU currently always pre-compiles the expansion into an internal format (a list
2e5b6d6dSopenharmony_ciof one or more collation elements) when the rule is compiled. If there is a
2e5b6d6dSopenharmony_cicontraction that spans the end of the expanded text and the start of the
original text, however, that contraction will not match. A test case that
illustrates this is:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciRules           | Desired Order | Current Order
2e5b6d6dSopenharmony_ci--------------- | ------------- | -------------
2e5b6d6dSopenharmony_ci`& a <<< c / e` | ad            | ad
2e5b6d6dSopenharmony_ci`& g <<< eh`    | c             | c
2e5b6d6dSopenharmony_ci&nbsp;          | af            | ch
2e5b6d6dSopenharmony_ci&nbsp;          | g             | af
2e5b6d6dSopenharmony_ci&nbsp;          | ch            | g
2e5b6d6dSopenharmony_ci&nbsp;          | h             | h
2e5b6d6dSopenharmony_ci
Since the pre-compiled expansions are a huge performance gain, we will probably
keep the implementation the way it is, but in the future allow additional syntax
to indicate those few expansions that need to behave as if the text were
inserted because of the existence of another contraction. Note that such
expansions need to be recursively expanded (as described under Recursive
Expansion above), but at runtime rather than at pre-compile time.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciWhile it is possible to automatically detect these cases, it would be better to
2e5b6d6dSopenharmony_ciallow explicit control in case spanning is not desired. An example of such
2e5b6d6dSopenharmony_cisyntax might be something like:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&a <<< c // e
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci**Notes:** ICU does handle the case where there is a contraction that is
2e5b6d6dSopenharmony_cicompletely inside the expansion.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciSuppose that someone had the rules:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci&a = c / e
2e5b6d6dSopenharmony_ci&x = ae
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThese do not cause **c** to sort as if it were **ae**, nor should they.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci### Normalization
2e5b6d6dSopenharmony_ci
The Unicode Collation Algorithm specifies that all text sort as if it were first
normalized into NFD. For performance reasons, ICU collation data is
pre-processed so that there is no need to perform normalization on strings that
are in [FCD](http://www.unicode.org/notes/tn5/#FCD) and do not contain any composite
combining marks. The vast majority of strings are in this form.

The composite combining marks are { U+0344, U+0F73, U+0F75, U+0F81 }
([`[[:^lccc=0:]&[:toNFD=/../:]]`](http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5Elccc%3D0%3A%5D%26%5B%3AtoNFD%3D%2F..%2F%3A%5D&abb=on&g=)).
These characters must be decomposed for discontiguous contractions to work
properly. Use of these characters is discouraged by the Unicode Standard.
2e5b6d6dSopenharmony_ci
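What makes the four composite combining marks special is that each one is a combining mark whose canonical decomposition consists of two other combining marks. This can be checked with the JDK's `java.text.Normalizer`. The sketch below uses hypothetical class and method names and only inspects NFD output. It does not involve the ICU collator.

```java
import java.text.Normalizer;

final class CompositeMarkSketch {
    /** The composite combining marks listed above. */
    static final int[] COMPOSITE_MARKS = {0x0344, 0x0F73, 0x0F75, 0x0F81};

    /** Returns the NFD form of one code point, written as "U+XXXX U+YYYY ...". */
    static String nfdCodePoints(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            if (out.length() > 0) out.append(' ');
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (int mark : COMPOSITE_MARKS) {
            System.out.println(String.format("U+%04X", mark) + " -> " + nfdCodePoints(mark));
        }
    }
}
```
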
2e5b6d6dSopenharmony_ci#### Nulls in Contractions
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciNulls should not be used in contractions that could invoke normalization.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciRules                | Desired Order | Current Order
2e5b6d6dSopenharmony_ci-------------------- | ------------- | -------------
2e5b6d6dSopenharmony_ci`& a <<< '\u0000'^`  | a             | '\\u0000'^
2e5b6d6dSopenharmony_ci&nbsp;               | '\\u0000'^    | a
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci#### Contractions Spanning Normalization
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe following rule specifies that a grave accent followed by a **b** is a
2e5b6d6dSopenharmony_cicontraction, and sorts as if it were an **e**.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci& e <<< ` b
2e5b6d6dSopenharmony_ci```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciOn this basis, "...àb..." should sort as if it were just after "...ae...".
2e5b6d6dSopenharmony_ciBecause of the preprocessing, however, the contraction will not match if this
2e5b6d6dSopenharmony_citext is represented with the pre-composed character à, but **will** match if
2e5b6d6dSopenharmony_cigiven the decomposed sequence **a + grave accent**. The same thing happens if
2e5b6d6dSopenharmony_cithe contraction spans the start of a normalized sequence.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciRules        | Desired Order | Current Order
2e5b6d6dSopenharmony_ci------------ | ------------- | -------------
2e5b6d6dSopenharmony_ci& e <<< \` b | à             | à
2e5b6d6dSopenharmony_ci&nbsp;       | ad            | àb
2e5b6d6dSopenharmony_ci&nbsp;       | àb            | ad
2e5b6d6dSopenharmony_ci&nbsp;       | af            | af
2e5b6d6dSopenharmony_ci&nbsp;       | &nbsp;        |
2e5b6d6dSopenharmony_ci`& g <<< ca` | f             | cà
2e5b6d6dSopenharmony_ci&nbsp;       | ca            | f
2e5b6d6dSopenharmony_ci&nbsp;       | cà            | ca
2e5b6d6dSopenharmony_ci&nbsp;       | h             | h
2e5b6d6dSopenharmony_ci
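The first half of the table comes down to whether the contraction is visible in the text as stored. The following sketch is a plain substring check, not the ICU matching code. It assumes that the grave accent in the rule stands for the combining grave accent U+0300, and its class and method names are hypothetical.

```java
import java.text.Normalizer;

final class SpanningContractionSketch {
    /** Assumed contraction from the rule: combining grave accent (U+0300) followed by 'b'. */
    static final String CONTRACTION = "\u0300b";

    /** Looks for the contraction in the text exactly as given, without normalizing it. */
    static boolean matchesWithoutNormalizing(String text) {
        return text.contains(CONTRACTION);
    }

    /** Looks for the contraction after decomposing the text to NFD. */
    static boolean matchesAfterNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD).contains(CONTRACTION);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0b";    // a-grave (one code point) + b
        String decomposed = "a\u0300b";    // a + combining grave + b
        System.out.println(matchesWithoutNormalizing(precomposed));  // contraction is hidden
        System.out.println(matchesWithoutNormalizing(decomposed));   // contraction is visible
        System.out.println(matchesAfterNfd(precomposed));            // visible once decomposed
    }
}
```
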
2e5b6d6dSopenharmony_ci### Variable Top
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciICU lets you set the top of the variable range. This can be done, for example,
2e5b6d6dSopenharmony_cito allow you to ignore just SPACES, and not punctuation.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci#### Variable Top Exclusion
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThere is currently a limitation that causes variable top to (perhaps) exclude
2e5b6d6dSopenharmony_cimore characters than it should. This happens if you not only set variable top,
2e5b6d6dSopenharmony_cibut also tailor a number of characters around it with primary differences. The
2e5b6d6dSopenharmony_ciexact number that you can tailor depends on the internal "gaps" between the
2e5b6d6dSopenharmony_cicharacters in the pre-compiled UCA table. Normally there is a gap of one. There
2e5b6d6dSopenharmony_ciare larger gaps between scripts (such as between Latin and Greek), and after
2e5b6d6dSopenharmony_cicertain other special characters. For example, if variable top is set to be at
2e5b6d6dSopenharmony_ciSPACE ('\\u0020'), then it works correctly with up to 70 characters also
2e5b6d6dSopenharmony_citailored after space. However, if variable top is set to be equal to HYPHEN
2e5b6d6dSopenharmony_ci('\\u2010'), only one other value can be accommodated.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciIn the following, the goal is for x to be ignored and z not to be ignored.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciRules              | Desired Order SHIFTED = ON | Current Order
2e5b6d6dSopenharmony_ci------------------ | -------------------------- | -------------
2e5b6d6dSopenharmony_ci`& \u2010`         | -                          | -
2e5b6d6dSopenharmony_ci`< x`              | z                          | z
2e5b6d6dSopenharmony_ci`< [variable top]` | zb                         | zb
2e5b6d6dSopenharmony_ci`< z`              | a                          | xb
2e5b6d6dSopenharmony_ci&nbsp;             | b                          | a
2e5b6d6dSopenharmony_ci&nbsp;             | -b                         | b
2e5b6d6dSopenharmony_ci&nbsp;             | xb                         | -b
2e5b6d6dSopenharmony_ci&nbsp;             | c                          | c
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci> :point_right: **Note**: With ICU 1.8.1, the
2e5b6d6dSopenharmony_ci> user is advised not to tailor the variable top to customize more than two
2e5b6d6dSopenharmony_ci> primary relations (for example, `"& x < y < [variable top]"`). Starting in ICU
2e5b6d6dSopenharmony_ci> 2.0, setVariableTop() allows the user to set the variable top programmatically
2e5b6d6dSopenharmony_ci> to a legal single character or a valid contracting sequence. In addition, the
2e5b6d6dSopenharmony_ci> string that variable top is set to should not be treated as either inclusive or
2e5b6d6dSopenharmony_ci> exclusive in the rules.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci### Case Level/First/Second
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciIn ICU, it is possible to override the tertiary settings programmatically. This
2e5b6d6dSopenharmony_ciis used to change the default case behavior to be all upper first or all lower
2e5b6d6dSopenharmony_cifirst. It can also be used for a separate case level, or to ignore all other
2e5b6d6dSopenharmony_citertiary differences (such as between circled and non-circled letters, or
2e5b6d6dSopenharmony_cibetween half-width and full-width katakana). The case values are derived
2e5b6d6dSopenharmony_cidirectly from the Unicode character properties, and not set by the rules.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci#### Mixed Case Contractions
2e5b6d6dSopenharmony_ci
Currently, a contraction of multiple characters can have only one of three
special case values: upper, lower, or mixed. All mixed-case contractions are
grouped together, and are not affected by the upper first vs. lower first flag.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciRules      | Desired Order UPPER_FIRST | Current Order
2e5b6d6dSopenharmony_ci---------- | ------------------------- | -------------
2e5b6d6dSopenharmony_ci`& c < ch` | C                         | c
2e5b6d6dSopenharmony_ci`<<< cH`   | CH                        | CH
2e5b6d6dSopenharmony_ci`<<< Ch`   | Ch                        | cH
2e5b6d6dSopenharmony_ci`<<< CH`   | cH                        | Ch
2e5b6d6dSopenharmony_ci&nbsp;     | ch                        | ch
2e5b6d6dSopenharmony_ci
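The reason "cH" and "Ch" cannot be ordered relative to each other by the case flag becomes clear once each contraction is reduced to one of the three case values. The classification below is a sketch with hypothetical names. Treating caseless characters as lower is an assumption made here for simplicity, not a statement about ICU's data.

```java
final class ContractionCaseSketch {
    enum CaseClass { UPPER, LOWER, MIXED }

    /** Reduces a contraction to a single case value, as described above. */
    static CaseClass classify(String contraction) {
        boolean anyUpper = false;
        boolean anyLower = false;
        for (int i = 0; i < contraction.length(); ) {
            int cp = contraction.codePointAt(i);
            anyUpper |= Character.isUpperCase(cp);
            anyLower |= Character.isLowerCase(cp);
            i += Character.charCount(cp);
        }
        if (anyUpper && anyLower) return CaseClass.MIXED;
        return anyUpper ? CaseClass.UPPER : CaseClass.LOWER;   // caseless text counts as LOWER here
    }
}
```

Both `classify("cH")` and `classify("Ch")` return `MIXED`, so a flag that only swaps upper and lower has nothing to distinguish them by.
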
2e5b6d6dSopenharmony_ci## Building on Existing Locales
2e5b6d6dSopenharmony_ci
All of the collation rules are additive; that is, a later rule overrides what any
previous rule expressed. This means that you can build on the existing rules for
a given locale. The following example fetches the rules for a particular locale
(Danish), then overrides some part (sorting '%' after 'm'). The syntax is Java,
but C/C++ has similar features.
2e5b6d6dSopenharmony_ci
```java
// Imports needed by this fragment (ICU4J and java.util):
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Body of a test method; assertEquals comes from the test framework.
ULocale myLocale = new ULocale("da");
try {
    // Fetch the Danish tailoring rules and append our own rule to them.
    RuleBasedCollator col = (RuleBasedCollator) Collator.getInstance(myLocale);
    String rules = col.getRules();
    String myRules = "& m < '%'";
    RuleBasedCollator col2 = new RuleBasedCollator(rules + myRules);

    // check the values: '%' now sorts after 'm', and the Danish rules still apply
    List<String> expected = Arrays.asList("a;m;%;z;aa".split(";"));
    TreeSet<String> sorted = new TreeSet<String>(col2);
    sorted.addAll(expected);
    ArrayList<String> actual = new ArrayList<String>(sorted);
    assertEquals("Customized rules with %", expected, actual);
} catch (Exception e) {
    throw new IllegalArgumentException("Failed to create customized rules", e);
}
```
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe root collator has an empty rules string (`getRules()` returns `""`): Any
2e5b6d6dSopenharmony_cicollator's tailoring rules string defines how a collator *differs* from the root
2e5b6d6dSopenharmony_cicollator, and the tailoring rules string was the input for building the
2e5b6d6dSopenharmony_citailoring collator. By contrast, the root collator itself is built from a file
2e5b6d6dSopenharmony_ciwith explicit mappings (ICU4C source/data/unidata/FractionalUCA.txt)
2e5b6d6dSopenharmony_cifrom characters/contractions to collation elements. This file represents the
2e5b6d6dSopenharmony_ci[DUCET](http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table)
2e5b6d6dSopenharmony_cias [modified by
2e5b6d6dSopenharmony_ciCLDR](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation).
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThere are "extended" versions of `getRules()` which, when called with
2e5b6d6dSopenharmony_ci`delta=UCOL_FULL_RULES` (C/C++) or `fullrules=true` (Java), return "full rules"
2e5b6d6dSopenharmony_ciwhich are a concatenation of the "UCA rules" and the collator's tailoring. The
2e5b6d6dSopenharmony_ci"UCA rules" are published as UCA_Rules.txt in every [UCA
2e5b6d6dSopenharmony_cirelease](http://www.unicode.org/Public/UCA/).
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci*   "UCA rules" is a historical misnomer. The UCA specifies an Algorithm which
2e5b6d6dSopenharmony_ci    applies to all collators, and provides the DUCET as its Default table.
2e5b6d6dSopenharmony_ci*   ICU's root collator implements the CLDR-modified collation element table.
2e5b6d6dSopenharmony_ci    The "UCA rules" returned from ICU functions are equivalently modified rules
2e5b6d6dSopenharmony_ci    compared with those for the DUCET.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe "UCA rules" are an *approximation* of the root collator's sort order, but
2e5b6d6dSopenharmony_cithere are some differences because not all of the details of the root collator
2e5b6d6dSopenharmony_cimappings can be expressed in rule syntax. In particular, a collator built from
2e5b6d6dSopenharmony_ciICU4C source/data/unidata/UCARules.txt
2e5b6d6dSopenharmony_cihas at least the following issues compared with the real root collator:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci*   inefficient (long) collation element weights
2e5b6d6dSopenharmony_ci*   CODAN (numeric collation) will not work (the 0 digit's primary weight is
2e5b6d6dSopenharmony_ci    hardcoded, or specified in FractionalUCA.txt)
2e5b6d6dSopenharmony_ci*   script reordering will not work
2e5b6d6dSopenharmony_ci*   alternate=shifted will not work
2e5b6d6dSopenharmony_ci*   the sort order has some differences from the regular root collator,
2e5b6d6dSopenharmony_ci    including additional tertiary differences
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciThe "full rules" are almost never used, or useful, at runtime. They are included
2e5b6d6dSopenharmony_ciin ICU for historical reasons and for UCA consistency tests. They might be
2e5b6d6dSopenharmony_ciusable for emulating the CLDR/ICU sort order with a collation implementation not
2e5b6d6dSopenharmony_cibased on CLDR/ICU.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciCollation rule strings in general are not commonly used but are a significant
2e5b6d6dSopenharmony_ciportion of the data size in ICU collation resource bundles, especially for CJK
2e5b6d6dSopenharmony_cilanguages. The rule strings can be omitted from those resource bundles by adding
2e5b6d6dSopenharmony_cithe `--omitCollationRules` option to the relevant `genrb` invocations
2e5b6d6dSopenharmony_ci(for ICU 53..63, in icu4c/source/data/Makefile.in)
2e5b6d6dSopenharmony_cior, since ICU 64, with a [data filter config file](../../icu_data/buildtool.md).
2e5b6d6dSopenharmony_ci(See for example the relevant
2e5b6d6dSopenharmony_ci[ICU integration test instructions](https://icu.unicode.org/processes/release/tasks/integration#TOC-Verify-that-ICU4C-tests-pass-without-collation-rule-strings).)
2e5b6d6dSopenharmony_ci
If the tailoring rules are needed but the 150 kB or so of "UCA rules" are not,
then the line

```
UCARules:process(uca_rules){"../unidata/UCARules.txt"}
```

in
[source/data/coll/root.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/coll/root.txt)
can be commented out or deleted.

## Cautions

The following are cautions rather than known limitations of the rules.

### Resets

Because a reset always operates on the state built up by the preceding rules,
the user must make sure that the rule entries are in the proper order.

Rules     | Order | Comment
--------- | ----- | -------
`& a < b` | a     | The rules mean: put **b** after **a**, then put **c** after **a** (inserting **before** the **b**).
`& a < c` | c     |
&nbsp;    | b     |

### Postpone Insertion

When using a reset to insert a value X with a certain strength difference after
a value Y, X is actually inserted just before the next item of the same
strength or higher following Y. Thus, the following are equivalent:

```
... m < a = c <<< d << e <<< f < g <<< h & a << x
... m < a = c <<< d << x << e <<< f < g <<< h
```

> :point_right: **Note**: This is different from the Java semantics.
> In Java, the value is inserted immediately after the reset character.

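The postponed-insertion behavior can be sketched with a small toy model in plain Python. This is only an illustration of the rule as stated above, not ICU's implementation: the helper names (`parse_chain`, `insert_after`, `format_chain`) and the numeric `LEVEL` table are assumptions made for this sketch, and the model ignores contractions, expansions, and everything else a real collator handles.

```python
# Toy model of postponed insertion; all names here are hypothetical.
# A smaller number means a stronger difference ("<" is primary).
LEVEL = {"<": 1, "<<": 2, "<<<": 3, "=": 4}

def parse_chain(text):
    """Parse 'm < a = c' into [(None, 'm'), ('<', 'a'), ('=', 'c')]."""
    tokens = text.split()
    chain = [(None, tokens[0])]
    for op, item in zip(tokens[1::2], tokens[2::2]):
        chain.append((op, item))
    return chain

def insert_after(chain, anchor, op, item):
    """Model '& anchor op item': skip the items following the anchor that
    differ from their predecessor more weakly than op, then insert."""
    idx = next(i for i, (_, name) in enumerate(chain) if name == anchor)
    pos = idx + 1
    while pos < len(chain) and LEVEL[chain[pos][0]] > LEVEL[op]:
        pos += 1
    return chain[:pos] + [(op, item)] + chain[pos:]

def format_chain(chain):
    """Turn the list of (operator, item) pairs back into a rule-like string."""
    parts = [chain[0][1]]
    for op, item in chain[1:]:
        parts += [op, item]
    return " ".join(parts)

before = parse_chain("m < a = c <<< d << e <<< f < g <<< h")
after = insert_after(before, "a", "<<", "x")
print(format_chain(after))
```

In this model, the identical-level **c** and the tertiary-level **d** are skipped because they differ more weakly than the secondary insertion, so **x** lands just before **e**.
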
### Jamo Tailoring

Tailoring Jamo characters causes the code to go through a slow path,
which will have a significant effect on performance.

### Compatibility Decompositions

When tailoring a letter, the customization affects all of its canonical
equivalents. That is, if a tailoring rule sorts **a** after **e**, for
example, then **à**, **á**, ... are also sorted after **e**. This is not true
for compatibility equivalents. If the desired sorting order is for a
**superscript-a** ("ª") to be after **e**, it is necessary to specify a rule
for that.

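To check which kind of decomposition a character has before writing a tailoring, the Unicode Character Database can be queried directly. The following is a minimal sketch using Python's standard `unicodedata` module rather than ICU; the helper name `decomposition_kind` is an assumption made for this example.

```python
import unicodedata

def decomposition_kind(ch):
    """Report whether ch has a canonical, compatibility, or no decomposition."""
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return "none"
    # Compatibility mappings carry a tag such as "<super>" in front.
    return "compatibility" if decomp.startswith("<") else "canonical"

# U+00E0 (a with grave), U+00E1 (a with acute), U+00AA (feminine ordinal indicator)
for ch in ["\u00e0", "\u00e1", "\u00aa"]:
    print(f"U+{ord(ch):04X}", decomposition_kind(ch), unicodedata.decomposition(ch))
```

Characters reported as `canonical` here follow a tailoring of their base letter, as described above, while those reported as `compatibility` need their own rule.
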
### Case Differences

Similarly, when tailoring **a** to be sorted after **e**, a specific rule is
required if **A** should be sorted after **e** as well.

### Automatic Expansions

ICU will automatically form expansions whenever a reset is to a multi-character
value that is not a contraction. For example, `& ab <<< c` is equivalent to
`& a <<< c / b`. The user may be unaware of this happening, since it may not be
obvious that the reset is to a multi-character value. For example, `& à <<< d` is
equivalent to `& a <<< d / \u0300`, because **à** canonically decomposes into
**a** followed by U+0300 (combining grave accent).
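
The rewriting described above can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not ICU's rule parser: the function name `expand_reset` is hypothetical, the reset string is assumed not to be a contraction, and canonical decomposition is done with the standard `unicodedata` module.

```python
import unicodedata

def expand_reset(reset, relation, target):
    """Rewrite '& reset relation target' so that the reset is a single code
    point and the remainder becomes an explicit expansion after '/'.
    Assumes the reset string is not a contraction."""
    nfd = unicodedata.normalize("NFD", reset)
    head, tail = nfd[0], nfd[1:]
    rule = f"& {head} {relation} {target}"
    if tail:
        # Show non-ASCII code points (such as combining marks) as escapes.
        shown = "".join(c if c.isascii() else f"\\u{ord(c):04X}" for c in tail)
        rule += f" / {shown}"
    return rule

print(expand_reset("ab", "<<<", "c"))
print(expand_reset("\u00e0", "<<<", "d"))
```

Running a reset string through a check like this before writing a rule makes a hidden multi-character reset visible.
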