---
layout: default
title: Customization
nav_order: 3
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation Customization
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU uses the [CLDR root collation
order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
as a default starting point for ordering. (The CLDR root collation is based on
the [UCA
DUCET](http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table).)
Not all languages have sorting sequences that correspond with the root collation
order because no single sort order can simultaneously encompass the specifics of
all the languages. In particular, languages that share a script may sort the
same letters differently.

Therefore, ICU provides a data-driven, flexible, and run-time-customizable
mechanism called "tailoring". Tailoring overrides the default order of code
points and the values of the ICU Collation Service attributes.

## Collation Rule

A `RuleBasedCollator` is built from a rule string which changes the sort order of
some characters and strings relative to the default order. An empty string (or
one with only white space and comments) results in a collator that behaves like
the root collator.

A tailoring is specified via a string containing a set of rules. ICU implements
the (CLDR) [LDML collation rule
syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules); see
that specification for more details.

Each rule contains a string of ordered characters that starts with an **anchor
point** or a **reset value**. For example, `"&a < g"` places "g"
after "a" and before "b", and the "a" does not change place. This rule has the
following sorting consequences:

Without rule | With rule
------------ | ---------
Abernathy    | Abernathy
apple        | apple
bird         | green
Boston       | bird
Graham       | Boston
green        | Graham

Note that only the word that starts with "g" has changed place. All the words
that sorted after "a" and "A" now sort after "g".
This includes "Graham"; "G" would have to be tailored separately, such as with
`"&a < g <<< G"`.

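The effect of `&a < g` on the word list can be reproduced with a small toy model. The sketch below is not ICU code and does not use ICU data: the function names, the numeric weights, and the restriction to ASCII letters are all assumptions made for illustration. It only models a primary level (letter identity) and a tertiary level (case).

```python
import string


def build_weights(tailor_g_after_a=False):
    """Map each ASCII letter to a (primary, tertiary) pair (toy weights)."""
    weights = {}
    for index, letter in enumerate(string.ascii_lowercase):
        weights[letter] = (float(index), 0)          # lowercase: tertiary 0
        weights[letter.upper()] = (float(index), 1)  # uppercase: tertiary 1
    if tailor_g_after_a:
        # "&a < g": lowercase "g" gets a primary weight between "a" and "b".
        # "G" is not mentioned in the rule, so it keeps its old weight.
        weights["g"] = (0.5, 0)
    return weights


def sort_key(word, weights):
    """Compare all primary weights first, then all tertiary weights."""
    primaries = tuple(weights[ch][0] for ch in word)
    tertiaries = tuple(weights[ch][1] for ch in word)
    return (primaries, tertiaries)


words = ["green", "Graham", "Boston", "bird", "apple", "Abernathy"]

root_weights = build_weights()
tailored_weights = build_weights(tailor_g_after_a=True)

without_rule = sorted(words, key=lambda w: sort_key(w, root_weights))
with_rule = sorted(words, key=lambda w: sort_key(w, tailored_weights))

print(without_rule)
print(with_rule)
```

Because only lowercase "g" receives a new weight, "green" moves up while "Graham" stays in place, matching the table.
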
This is a simple example of a tailoring rule. A tailoring consists of
zero or more rules and zero or more options. There must be at least one rule or
at least one option. The rule syntax is discussed in more detail in the
following sections.

Note that the tailoring rules override the UCA ordering. In addition, if a
character is reordered, any other equivalent characters are automatically
reordered with it. For example, if the rule `&e < a` is used to reorder "a",
then "á" also sorts after "é".

## Syntax

The following table summarizes the basic syntax necessary for most usages:

Symbol | Example&nbsp; | Description
------ | ------------- | ----------------------------------
`<`    | `a < b`       | Identifies a primary (base letter) difference between "a" and "b"
`<<`   | `a << ä`      | Signifies a secondary (accent) difference between "a" and "ä"
`<<<`  | `a<<<A`       | Identifies a tertiary difference between "a" and "A"
`<<<<` | `か<<<<カ`     | Identifies a quaternary difference between "か" and "カ". (New in ICU 53.)
`=`    | `x = y`       | Signifies no difference between "x" and "y".
`&`    | `&Z`          | Instructs ICU to reset at this letter. These rules will be relative to this letter from here on, but will not affect the position of Z itself.

> :point_right: **Note**: ICU permits up to three quaternary relations in a row
> (except for intervening "=" identity relations).

> :point_right: **Note**: In releases prior to 1.8,
> ICU used the notations `;` to represent secondary relations and `,` to represent tertiary relations.
> Starting in release 1.8, use `<<` symbols to represent secondary relations and
> `<<<` symbols to represent tertiary relations.
> Rules that use the `;` and `,` notations are still processed by ICU for compatibility;
> also, some of the data used for tailoring to particular locales
> has not yet been updated to the new syntax.
> However, one should consider these symbols deprecated.

> :point_right: **Note**: See the [LDML collation rule syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules)
> and [Properties and ICU Rule Syntax](../../strings/properties.md) for
> information regarding syntax characters.

Repeated use of the same relation can be abbreviated, for example
`&a <* bcd-gp-s` for `&a < b < c < d < e < f < g < p < q < r < s`.
For details see the
[LDML collation spec, section
Orderings](http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings).

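To make the abbreviation concrete, the following sketch expands a compact `<*` list into the long form. The helper names `expand_star_list` and `expand_star_rule` are hypothetical, and the sketch assumes a simplified reading in which a hyphen between two characters stands for the code points between them; it does not handle escaping or the other details covered by the LDML spec.

```python
def expand_star_list(compact):
    """Expand a compact list such as 'bcd-gp-s' into individual characters."""
    chars = []
    i = 0
    while i < len(compact):
        ch = compact[i]
        if ch == "-" and chars and i + 1 < len(compact):
            # A hyphen between two characters: add everything after the
            # previous character up to and including the next one.
            start, end = ord(chars[-1]), ord(compact[i + 1])
            chars.extend(chr(cp) for cp in range(start + 1, end + 1))
            i += 2
        else:
            chars.append(ch)
            i += 1
    return chars


def expand_star_rule(reset, relation, compact):
    """Build the long-form rule for '&reset relation* compact'."""
    items = expand_star_list(compact)
    return "&" + reset + "".join(f" {relation} {item}" for item in items)


print(expand_star_rule("a", "<", "bcd-gp-s"))
```
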
### Escaping Rules

Most characters can be used as parts of rules. However, whitespace
characters are skipped, and all ASCII characters that are not digits or
letters are considered to be part of the syntax. To use these characters in
rules, they need to be escaped. Escaping can be done in several ways:

*   Single characters can be escaped using backslash **\\** (U+005C).

*   Strings can be escaped by putting them between single quotes **'like
    this'**.

*   The single quote (ASCII apostrophe) can be quoted using two single quotes
    **''**, both inside and outside single-quote-escaped strings.

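When rule strings are assembled programmatically, the quoting conventions above can be wrapped in a small helper. This is a minimal sketch, not an ICU API: the function names are hypothetical, and whitespace is approximated with Python's `str.isspace()`, which may not match ICU's own definition of white space exactly.

```python
def needs_escaping(ch):
    """True for whitespace and for ASCII characters that are not letters or digits."""
    return ch.isspace() or (ch.isascii() and not ch.isalnum())


def quote_for_rule(text):
    """Wrap text in single quotes, doubling any apostrophes inside it."""
    return "'" + text.replace("'", "''") + "'"


def quote_if_needed(text):
    """Quote text only if it contains a character that would be read as syntax."""
    if any(needs_escaping(ch) for ch in text):
        return quote_for_rule(text)
    return text


# Build a rule that tailors the literal string "a-b" after "a".
rule = "&a < " + quote_if_needed("a-b")
print(rule)
```
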
### Simple Tailoring Examples

Serbian (Latin) or Croatian: `& C < č <<< Č < ć <<< Ć`

This rule is needed because the root collation order usually treats accented
letters as secondary differences from their base character. This rule ensures
that 'č' and 'ć' are treated as base letters.

UCA             | Tailoring: `& C < č <<< Č < ć <<< Ć`
--------------- | --------------
CUKIĆ RADOJICA  | CUKIĆ RADOJICA
ČUKIĆ SLOBODAN  | CUKIĆ SVETOZAR
CUKIĆ SVETOZAR  | CURIĆ MILOŠ
ČUKIĆ ZORAN     | CVRKALJ ÐURO
CURIĆ MILOŠ     | ČUKIĆ SLOBODAN
ĆURIĆ MILOŠ     | ČUKIĆ ZORAN
CVRKALJ ÐURO    | ĆURIĆ MILOŠ

Serbian (Latin) or Croatian: `& Ð < dž <<< Dž <<< DŽ`

This rule is an example of a contraction. "D" alone is sorted after "C" and "Ž"
is sorted after "Z", but "DŽ", due to the tailoring rule, is treated as a single
letter that gets sorted after "Đ" and before "E" ("Đ" sorts as a base letter
after "D" in the UCA). Another thing to note in this example is capitalization
of the letter "DŽ". There are three versions, since all three can legally appear
in text. The fourth version "dŽ" is omitted since it does not occur.

UCA      | Tailoring: `& Ð < dž <<< Dž <<< DŽ`
-------- | ---------
dan      | dan
dubok    | dubok
džabe    | đak
džin     | džabe
Džin     | džin
DŽIN     | Džin
đak      | DŽIN
Evropa   | Evropa

Danish: `&V <<< w <<< W`

The letter 'W' is sorted after 'V', but is treated as a tertiary difference
similar to the difference between 'v' and 'V'.

UCA | `&V <<< w <<< W`
--- | ----------------
va  | va
Va  | Va
VA  | VA
vb  | wa
Vb  | Wa
VB  | WA
vz  | vb
Vz  | Vb
VZ  | VB
wa  | wb
Wa  | Wb
WA  | WB
wb  | vz
Wb  | Vz
WB  | VZ
wz  | wz
Wz  | Wz
WZ  | WZ

### Default Options

ICU implements the [LDML collation
options/settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options);
see that specification for more information.

The tailoring inherits all the attribute values from the root collator unless
they are explicitly redefined in the tailoring. The following summarizes
the option settings. Default values are shown **in bold**.

#### alternate
- **`[alternate non-ignorable]`**
- `[alternate shifted]`

Sets the default value of the UCOL_ALTERNATE_HANDLING attribute. If
set to shifted, variable code points will be ignored on the primary level.
For details see the [“Ignore Punctuation” Options](ignorepunct.md) page.

#### maxVariable
- **`[maxVariable punct]`**
- `[maxVariable space]`

Sets the variable top to the top of the specified
reordering group. (New in ICU 53.) All code points with primary weights less
than or equal to the variable top will be considered variable, and thus affected
by the alternate handling.

#### variable top
(deprecated)
- `& X < [variable top]`

Sets the default value for the variable top. All the code points with primary
weights less than or equal to the variable top will be considered variable.
*Changing the variable top via this rule syntax is deprecated since ICU 53.*
It has been replaced by the maxVariable option.

#### normalization
- **`[normalization off]`**
- `[normalization on]`

Turns on or off the UCOL_NORMALIZATION_MODE attribute.
If set to on, a quick check and necessary normalization will be performed.

#### strength
- `[strength 1]`
- `[strength 2]`
- **`[strength 3]`**
- `[strength 4]`
- `[strength I]`

Sets the default strength for the collator.

#### backwards
- `[backwards 2]`

Sets the default value of the UCOL_FRENCH_COLLATION attribute. If set to on,
weights on the secondary level are compared in reverse order, from the end of
the string towards the start.

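The effect of backwards-secondary ordering can be illustrated with a toy sort key. The sketch below is not how ICU computes weights: it assumes a simplified model in which each base letter carries at most one accent, uses the combining mark's code point as a made-up secondary weight, and ignores case. The function name is hypothetical.

```python
import unicodedata


def accent_sort_key(text, backwards_secondary=False):
    """Toy two-level key: base letters first, then accents."""
    primaries, secondaries = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and secondaries:
            # Toy secondary weight: code point of the accent on this letter.
            secondaries[-1] = ord(ch)
        else:
            primaries.append(ch.lower())
            secondaries.append(0)  # 0 means "no accent"
    if backwards_secondary:
        # [backwards 2]: compare accents starting from the end of the string.
        secondaries.reverse()
    return (tuple(primaries), tuple(secondaries))


words = ["côté", "cote", "coté", "côte"]

forward = sorted(words, key=accent_sort_key)
backward = sorted(words, key=lambda w: accent_sort_key(w, backwards_secondary=True))

print(forward)
print(backward)
```

With forward comparison the first accent difference decides; with backwards comparison the last one does, so "côte" moves ahead of "coté".
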
#### caseLevel
- **`[caseLevel off]`**
- `[caseLevel on]`

Turns on or off the UCOL_CASE_LEVEL attribute. If set to on, a
level consisting only of case characteristics will be inserted in front of
the tertiary level. To ignore accents but take case into account, set strength to
primary and case level to on.

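The combination of primary strength with the case level turned on can be pictured as a key that keeps base letters and case but drops accents. This is a toy sketch under simplifying assumptions (the function name is hypothetical, the case level is reduced to a per-letter upper/lower flag, and lowercase is placed before uppercase); it is not ICU's sort key format.

```python
import unicodedata


def primary_plus_case_key(text):
    """Toy key for strength 1 with caseLevel on: base letters, then case."""
    primaries, cases = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # accents are secondary differences and are ignored
        primaries.append(ch.lower())
        cases.append(1 if ch.isupper() else 0)  # lowercase before uppercase
    return (tuple(primaries), tuple(cases))


# Accents do not matter, case does.
print(primary_plus_case_key("rôle") == primary_plus_case_key("role"))
print(primary_plus_case_key("role") < primary_plus_case_key("Role"))
```
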
#### caseFirst
- **`[caseFirst off]`**
- `[caseFirst upper]`
- `[caseFirst lower]`

Sets the value for the UCOL_CASE_FIRST attribute. If set to
upper, causes upper case to sort before lower case. If set to lower, lower case
will sort before upper case. Useful for locales that have an already supported
ordering but require a different order of cases. Affects case and tertiary levels.

#### numericOrdering
- **`[numericOrdering off]`**
- `[numericOrdering on]`

Turns on or off the UCOL_NUMERIC_COLLATION attribute. If
set to on, then sequences of decimal digits (gc=Nd) sort by their numeric value.

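The idea behind numeric ordering can be sketched as a key that splits a string into digit runs and non-digit runs. This is only an illustration, not ICU's algorithm: the function name is hypothetical, non-digit text is compared by raw code point rather than by collation weights, and digit runs are placed before other text as an assumption.

```python
import re


def numeric_ordering_key(text):
    """Toy key: runs of decimal digits compare by value, other runs as text."""
    parts = []
    # In Python 3, \d matches Unicode decimal digits (general category Nd).
    for token in re.findall(r"\d+|\D+", text):
        if token[0].isdecimal():
            parts.append((0, int(token), ""))
        else:
            parts.append((1, 0, token))
    return tuple(parts)


names = ["file10", "file2", "file1"]

plain = sorted(names)
numeric = sorted(names, key=numeric_ordering_key)

print(plain)
print(numeric)
```
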
#### hiraganaQ
(deprecated)
- **`[hiraganaQ off]`**
- `[hiraganaQ on]`

Controls special treatment of Hiragana code points on the
quaternary level. If turned on, Hiragana code points will get lower values than
all the other non-variable code points. Strength must be greater than or equal
to quaternary if you want this attribute to take effect.
*hiraganaQ is deprecated since ICU 50.* It was an implementation detail of the
Japanese tailoring. In CLDR 25/ICU 53, the Japanese tailoring expresses the
differences between Hiragana and Katakana via explicit quaternary (`<<<<`)
relations.

#### suppressContractions
- `[suppressContractions [Љ-ґ]]`

Removes context-sensitive mappings (contractions and prefix/context-before mappings)
associated with each of the code points in the given UnicodeSet. It works on the
current set of rules: It removes mappings from the root collation as well as
from previous rules.

This is the only way to *remove* mappings: The rule syntax otherwise only adds
and overrides mappings. This special command is used in CLDR tailoring data to
remove Cyrillic root collation contractions that are not necessary in several
languages.

#### optimize
- `[optimize [Ά-ώ]]`

Performance optimization for the code points in the UnicodeSet.
In ICU, where tailoring data only contains the
mappings that are different from the root collation (otherwise the data would be
too large), falling back to root collation mappings for the rest of Unicode is
slightly slower. The optimize command copies mappings for additional characters
into the tailoring data.

#### reorder
followed by one or more reorder codes
- `[reorder Grek Hani space]`

Reorders scripts relative to each other and relative to a special set of
non-script blocks (space, punctuation, symbol, currency, and digit). The default
order is the same as in the DUCET and in the CLDR root collator.

----

A tailoring that consists only of options is also valid and has the same basic
ordering as the root collation. For example, the Greek tailoring has option
settings only: `[normalization on][reorder Grek]`

(The examples in this chapter might refer to older versions of data for
particular languages. Check CLDR or ICU for actual, current tailorings.)

The following tailoring example reorders uppercase and lowercase and uses
backwards-secondary ordering:

```
[caseFirst upper]
[backwards 2]
& C < č , Č
& G < ģ , Ģ
& I < y, Y
& K < ķ , Ķ
& L < ļ , Ļ
& N < ņ , Ņ
& S < š , Š
& Z < ž , Ž
```

#### Values for Reorder Codes

Reordering Group                         | Rule Value
---------------------------------------- | ----------
Unicode white space characters           | space
Unicode punctuation                      | punct
Unicode symbols except currency symbols  | symbol
Unicode currency symbols                 | currency
Unicode decimal digits                   | digit
Unicode scripts not mentioned ("others") | Zzzz (= Unknown script)

In addition, ISO **4-letter script codes** can be used. Codes for scripts that
do not have Unicode characters (according to the Unicode Script property values)
are ignored.

Limitations of ICU 4.8-52 (except that `Kore` is still not usable, because it
refers to multiple scripts that do not sort primary-equal):

*   For Chinese, use script code `Hani`, *not* `Hans` or `Hant`.
*   For Japanese, use both `Kana` and `Hani` (*not* `Hira`).
*   For Korean, use both `Hang` and `Hani` (*not* `Kore`).

#### Semantics of a List of Reorder Codes

This section is relevant for both the `[reorder ...]` rule syntax and the
`Collator.setReorderCodes()` API.

For an introduction and examples see the section “Script Reordering” in the
[Collation Concepts chapter](../concepts.md).

On the API, the special groups are represented with `Collator.ReorderCodes`
(`UColReorderCode`) values rather than `UScript` (`UScriptCode`) values.

In ICU 4.8-54, not every script could be reordered independently. CLDR and ICU
supported reordering of groups of scripts, each of which started with one of the
[Recommended
Scripts](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). A
script that is not Recommended always moved together with the Recommended Script
that precedes it in DUCET order. (Hiragana sorts together with Katakana, Coptic
with Greek, etc.) ICU allowed any one script of a (Recommended Script +
DUCET-following) group in the `[reorder]` list, moving the whole set of scripts
together. However, it was strongly recommended that only Recommended Scripts be
used.

Beginning with ICU 55, scripts only reorder together if they are primary-equal,
for example Hiragana and Katakana.

Zyyy=Common and Zinh=Inherited cannot be reordered.

The special code Zzzz (= Unknown script = `UScript.UNKNOWN` =
`Collator.ReorderCodes.OTHERS` = "others") stands for any script that is not
explicitly mentioned in the list of reordering codes. If Zzzz is mentioned in
the list, then any groups and scripts mentioned later in the list will go at the
very end of the reordering, in the order given. If Zzzz is not mentioned, then
all scripts that are not explicitly listed follow at the end in DUCET order.

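The handling of Zzzz can be sketched as a function that turns a reorder list into a final group order. This is an illustrative model only: the function name and the short `DEFAULT_ORDER` list are hypothetical stand-ins for the real DUCET order, and the sketch additionally assumes that special groups which are not listed stay at the front in their default order, which is based on the ICU API documentation for reordering rather than on this section.

```python
SPECIAL_GROUPS = ["space", "punct", "symbol", "currency", "digit"]

# Hypothetical, heavily shortened stand-in for the DUCET order.
DEFAULT_ORDER = SPECIAL_GROUPS + ["Latn", "Grek", "Cyrl", "Hani"]


def resolve_reorder(codes, default_order=DEFAULT_ORDER):
    """Return the resulting group order for a list of reorder codes."""
    listed = [code for code in codes if code != "Zzzz"]
    # Assumption: special groups that are not listed stay first.
    unlisted_special = [g for g in default_order
                        if g in SPECIAL_GROUPS and g not in listed]
    # Scripts that are not listed keep their default relative order.
    unlisted_scripts = [g for g in default_order
                        if g not in SPECIAL_GROUPS and g not in listed]
    if "Zzzz" in codes:
        split = codes.index("Zzzz")
        before, after = codes[:split], codes[split + 1:]
    else:
        before, after = list(codes), []
    # Codes listed after Zzzz go to the very end, in the order given.
    return unlisted_special + before + unlisted_scripts + after


print(resolve_reorder(["Grek", "Hani", "space"]))
print(resolve_reorder(["Zzzz", "Latn"]))
```

In this model, `resolve_reorder(["Zzzz"])` returns the default order unchanged, since every script falls under "others".
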
The special reorder code `Collator.ReorderCodes.NONE` (= `UScript.UNKNOWN`), when
used alone (same as `[reorder Zzzz]` or not specifying a `[reorder]` rule in a
tailoring), will remove any reordering for this collator. The result of setting
no reordering will be to use the DUCET/CLDR order.

On the API (not applicable to rule syntax), the special reorder code
`Collator.ReorderCodes.DEFAULT` (= `UScript.INHERITED`) will reset the reordering
for the collator to its default order. The default reordering may be the
DUCET/CLDR order or may be a reordering that was specified when this collator
was created from resource data or from rules. The DEFAULT code must be the sole
code supplied when it is used.

For details see the [section “Collation Reordering” in the LDML collation
spec](http://www.unicode.org/reports/tr35/tr35-collation.html#Script_Reordering).

### Advanced Syntactical Elements

Several other syntactical elements are needed in more specific situations.

#### Order before

- Syntax: `[before 1|2|3]`
- Example: `&[before 2]a<ā<á<ǎ<à`

Enables users to order characters **before** a given character. In UCA 3.0, the
example is equivalent to & ㍡<ā<á<ǎ<à (㍡= \\u3361, ideographic telegraph symbol
for hour nine) and makes accented 'a' letters sort before 'a'. Accents are often
used to indicate the intonations in Pinyin. In this case, the non-accented
letters sort after the accented letters.

#### Expansion

- Syntax: `/`
- Example: `æ/e`

Adds the collation element for 'e' to the collation element for 'æ'.
For example, `&ae << æ` is equivalent to `&a << æ/e`. See the Expansion example
below.

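An expansion can be pictured as one character mapping to more than one collation element. The sketch below is a toy model, not ICU's data structures: the table layout, the use of code points as primary weights, and the function names are assumptions made for illustration.

```python
def build_table():
    """Map each character to a list of (primary, secondary) collation elements."""
    table = {ch: [(ord(ch), 0)] for ch in "abcdefghijklmnopqrstuvwxyz"}
    a_primary = table["a"][0][0]
    # "&a << æ/e": æ gets an element that differs from "a" on the secondary
    # level, followed by the element(s) of "e".
    table["æ"] = [(a_primary, 1)] + table["e"]
    return table


def expansion_sort_key(text, table):
    """Compare all primary weights first, then all secondary weights."""
    elements = [ce for ch in text for ce in table[ch]]
    primaries = tuple(primary for primary, _ in elements)
    secondaries = tuple(secondary for _, secondary in elements)
    return (primaries, secondaries)


table = build_table()
words = ["af", "æ", "ad", "ae"]
ordered = sorted(words, key=lambda w: expansion_sort_key(w, table))
print(ordered)
```

In this model "æ" is primary-equal to "ae" and differs from it only on the secondary level, so it sorts directly after "ae" and before "af".
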
#### Prefix processing

- Syntax: `|`
- Example: `a|b`

If 'b' is encountered and it follows 'a',
output the appropriate collation element. If 'b' follows any other letter,
output the normal collation element for 'b'.
The collation element for 'a' is not affected.

This element is used to speed up sorting under JIS X 4061. See the
Prefix example below.

#### Reset to top

- Syntax: `[top]`
- Example: `&[top] < a < b < c …`

**Deprecated, use indirect positioning instead**
(`&[last regular]`, see section below).

Reorders a set of characters 'above' the UCA. `[top]` is a virtual code point having the
biggest primary weight value that will ever be assigned in the UCA. Above top,
there is a large number of unassigned primary weights that can be used for a
'large' tailoring, such as the reordering of the CJK characters according to a
Far Eastern code page. The first difference after the top is always primary.

### Indirect Positioning of Collation Elements

Since ICU version 2.0, ICU allows for indirect positioning of collation elements
(CE). Similar to the reset anchor `top`, these reset anchors allow for positioning of the
tailoring relative to significant sections of the UCA table. You can use the
`[before]` reset option to position before these sections.

Name                      | Example CE value  | Note
------------------------- | ----------------- | ------------
first tertiary ignorable  | `[,,]`            | Start of the UCA table. This value will never change unless CEs are extended with higher level values.
last tertiary ignorable   | `[,,]`            | This value will never change unless CEs are extended with higher level values.
first secondary ignorable | `[,, 05]`         | Currently there are no secondary ignorables in the UCA table.
last secondary ignorable  | `[,, 05]`         | Currently there are no secondary ignorables in the UCA table.
first primary ignorable   | `[, 87, 05]`      | Mostly for non-spacing combining marks.
last primary ignorable    | `[, E1 B1, 05]`   | Currently this value points to a non-existing code point, used to facilitate sorting of compatibility characters.
first variable            | `[05 07, 05, 05]` | The lowest CE that is not primary-ignorable. (see below)
last variable             | `[17 9B, 05, 05]` | End of variable section.
first regular             | `[1A 20, 05, 05]` | This is the first regular CE (not primary ignorable and not variable). The majority of code points have regular CEs.
last regular              | `[78 AA B2, 05, 05]` | Use `&[last regular]` instead of `&[top]`. (see below)
first implicit            | `[E0 03 03, 05, 05]` | Section of implicitly generated collation elements. (see below)
last implicit             | `[E3 DC 70 C0, 05, 05]` | End of implicit section. This is the CE of the last unassigned code point (U+10FFFD). (see below)
first trailing            | `[E5, 05, 05]`    | Start of trailing section. (see below)
last trailing             | `[FF FF, 05, 05]` | End of trailing collation elements section. This is the highest possible CE, and is the CE for U+FFFF. Not available for tailoring, see `[first trailing]`.

4932e5b6d6dSopenharmony_ci"first variable": The current code point is TAB=U+0009. This is the start of the variable section. "Variable" characters will be ignored on primary/secondary/tertiary levels when the "shifted" option is on.
4942e5b6d6dSopenharmony_ci
Tailoring after "last regular" effectively positions characters
between the regular code points and the "implicit" CEs (the next section).
Use this only for tailoring Han characters,
which tends to involve thousands of characters.
The script reordering implementation assumes that CEs in this section
belong to "Hani" script characters.
5012e5b6d6dSopenharmony_ci
5022e5b6d6dSopenharmony_ci"Implicit" means that the UCA default ordering table (DUCET)
5032e5b6d6dSopenharmony_cidoes not explicitly specify CEs for CJK ideographs and unassigned code points;
5042e5b6d6dSopenharmony_ciinstead, their CEs are computed at runtime.
5052e5b6d6dSopenharmony_ci
Beginning with ICU 53, tailoring to any unassigned code point,
including "last implicit", is no longer supported.
5082e5b6d6dSopenharmony_ci
5092e5b6d6dSopenharmony_ci"trailing": Tailoring characters after `[first trailing]`
5102e5b6d6dSopenharmony_cimakes them sort after all other non-tailored code points except for U+FFFD and U+FFFF.
5112e5b6d6dSopenharmony_ci
The "trailing" section is reserved for future use, such as for non-starting jamos. See
<http://www.unicode.org/reports/tr10/#Trailing_Weights>.
CLDR 1.9/ICU 4.6 and later map U+FFFF to the very end of the trailing section.
UCA 6.3/CLDR 24/ICU 52 and later map U+FFFD to just before U+FFFF.
U+FFFD..U+FFFF are not tailorable, and nothing can tailor to them. See
<http://www.unicode.org/reports/tr35/tr35-collation.html#tailored_noncharacter_weights>.
5182e5b6d6dSopenharmony_ci
5192e5b6d6dSopenharmony_ciBefore ICU 4.6, U+FFFF mapped to a completely ignorable CE, and `[last trailing]`
5202e5b6d6dSopenharmony_ciwas the same as `[first trailing]`.
5212e5b6d6dSopenharmony_ci
5222e5b6d6dSopenharmony_ciNot all of the indirect-positioning anchors are useful. Most of the 'first'
5232e5b6d6dSopenharmony_cielements should be used with the `[before]` directive, in order to make sure
5242e5b6d6dSopenharmony_cithat your tailoring will sort before an interesting section.
5252e5b6d6dSopenharmony_ci
5262e5b6d6dSopenharmony_ci### Complex Tailoring Examples
5272e5b6d6dSopenharmony_ci
5282e5b6d6dSopenharmony_ciThe following are several fragments of real tailorings, illustrating some of the
5292e5b6d6dSopenharmony_ciadvanced syntactical elements:
5302e5b6d6dSopenharmony_ci
5312e5b6d6dSopenharmony_ci#### Expansion Example:
5322e5b6d6dSopenharmony_ci
5332e5b6d6dSopenharmony_ci**Swedish:**
5342e5b6d6dSopenharmony_ci```
5352e5b6d6dSopenharmony_ci&t<<<þ/h
5362e5b6d6dSopenharmony_ci&T<<<Þ/H
5372e5b6d6dSopenharmony_ci```
5382e5b6d6dSopenharmony_ci
5392e5b6d6dSopenharmony_ciThe letter 'þ' (THORN) is normally treated by UCA/root collation as a separate
5402e5b6d6dSopenharmony_ciletter that has primary-level sorting after 'z'. However, in Swedish and some
5412e5b6d6dSopenharmony_ciother Scandinavian languages, 'þ' and 'Þ' should be treated as just a
5422e5b6d6dSopenharmony_citertiary-level difference from the letters "th" and "TH" respectively. This is
5432e5b6d6dSopenharmony_cian example of an expansion.
5442e5b6d6dSopenharmony_ci
5452e5b6d6dSopenharmony_ciUCA | `&t<<<þ/h, &T<<<Þ/H`
5462e5b6d6dSopenharmony_ci--- | --------------------
5472e5b6d6dSopenharmony_ciaz  | az
5482e5b6d6dSopenharmony_ciAz  | Az
5492e5b6d6dSopenharmony_citha | tha
5502e5b6d6dSopenharmony_ciTha | þa
5512e5b6d6dSopenharmony_ciTHa | Tha
5522e5b6d6dSopenharmony_cithz | THa
5532e5b6d6dSopenharmony_ciza  | Þa
5542e5b6d6dSopenharmony_ciZa  | thz
5552e5b6d6dSopenharmony_cizz  | þz
5562e5b6d6dSopenharmony_ciþa  | za
5572e5b6d6dSopenharmony_ciÞa  | Za
5582e5b6d6dSopenharmony_ciþz  | zz
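
The tailored column above can be reproduced with a toy model. The following is a minimal sketch, not ICU's implementation: it assumes each collation element is just a (primary, tertiary) pair, that primary weights follow the order a..z, and that the four tertiary values are hypothetical stand-ins. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model, not ICU code: a collation element (CE) is {primary, tertiary}.
class ThornExpansionSketch {

    // Hypothetical tertiary weights: thorn sorts tertiary-after 't', before 'T'.
    static final int LOWER = 0, THORN_LOWER = 1, UPPER = 2, THORN_UPPER = 3;

    // Maps a string of Latin letters and thorns to its list of CEs.
    static List<int[]> elements(String s) {
        List<int[]> ces = new ArrayList<>();
        for (char c : s.toCharArray()) {
            if (c == '\u00FE') {          // &t<<<thorn/h: a variant of 't', then the CE of 'h'
                ces.add(new int[] {'t' - 'a', THORN_LOWER});
                ces.add(new int[] {'h' - 'a', LOWER});
            } else if (c == '\u00DE') {   // &T<<<THORN/H: a variant of 'T', then the CE of 'H'
                ces.add(new int[] {'t' - 'a', THORN_UPPER});
                ces.add(new int[] {'h' - 'a', UPPER});
            } else if (Character.isUpperCase(c)) {
                ces.add(new int[] {Character.toLowerCase(c) - 'a', UPPER});
            } else {
                ces.add(new int[] {c - 'a', LOWER});
            }
        }
        return ces;
    }

    // Compares one level (0 = primary, 1 = tertiary) across the whole CE list.
    static int compareLevel(List<int[]> a, List<int[]> b, int level) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int diff = Integer.compare(a.get(i)[level], b.get(i)[level]);
            if (diff != 0) {
                return diff;
            }
        }
        return Integer.compare(a.size(), b.size());
    }

    // All primary weights are compared before any tertiary weight is considered.
    static final Comparator<String> TAILORED = (x, y) -> {
        List<int[]> a = elements(x);
        List<int[]> b = elements(y);
        int diff = compareLevel(a, b, 0);
        return diff != 0 ? diff : compareLevel(a, b, 1);
    };

    static List<String> sorted(String... words) {
        List<String> out = new ArrayList<>(Arrays.asList(words));
        out.sort(TAILORED);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sorted("za", "\u00DEa", "thz", "tha", "\u00FEa", "Tha", "THa"));
    }
}
```

Because the thorn contributes the same primary weights as "th", it interleaves with "tha", "Tha" and "THa" and differs from them only at the tertiary level.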
5592e5b6d6dSopenharmony_ci
5602e5b6d6dSopenharmony_ci#### Prefix Example:
5612e5b6d6dSopenharmony_ci
Prefixes are used in Japanese tailorings to reduce the number of contractions. A
large number of contractions is a performance burden on the commonly used base
characters, because processing contractions is much more complicated than processing
regular elements.
5662e5b6d6dSopenharmony_ci
5672e5b6d6dSopenharmony_ciA prefix rule conditionally changes the CE of the character or string (e.g., ー)
5682e5b6d6dSopenharmony_ciafter the | symbol; unlike a contraction, it does not affect the CE of the
5692e5b6d6dSopenharmony_cipreceding text (e.g., ァ). (By contrast, a contraction like ァー consumes both
5702e5b6d6dSopenharmony_cicharacters and can assign them a CE or expansion unrelated to ァ's CE.) A prefix
5712e5b6d6dSopenharmony_cirule is especially useful if the character or string (ー) after the | symbol
5722e5b6d6dSopenharmony_cioccurs significantly less often than the first character of the prefix (ァ).
5732e5b6d6dSopenharmony_ci
```
&[before 3]ァ <<< ァ|ー = ｧ|ー = ぁ|ー
```

This could have been written as a series of contractions followed by expansion:

```
&[before 3]ァー <<< ァー = ｧー = ぁー
```

However, in that case ァ, ｧ and ぁ would start contractions. Since the prolonged
sound mark (ー) occurs much less frequently than the other letters of Japanese
Katakana and Hiragana, it is more prudent to put the extra processing on it
by using prefixes.
5882e5b6d6dSopenharmony_ci
5892e5b6d6dSopenharmony_ci#### Reset example:
5902e5b6d6dSopenharmony_ci
5912e5b6d6dSopenharmony_ciA "reset" always uses only the base character as the insertion point even if
5922e5b6d6dSopenharmony_cithere is an expansion. So the following rule,
5932e5b6d6dSopenharmony_ci
5942e5b6d6dSopenharmony_ci```
5952e5b6d6dSopenharmony_ci& J <<< K / B & K <<< M
5962e5b6d6dSopenharmony_ci```
5972e5b6d6dSopenharmony_ci
5982e5b6d6dSopenharmony_ciis equivalent to
5992e5b6d6dSopenharmony_ci
6002e5b6d6dSopenharmony_ci```
6012e5b6d6dSopenharmony_ci& J <<< K / B <<< M
6022e5b6d6dSopenharmony_ci```
6032e5b6d6dSopenharmony_ci
which produces the following sort order:
6052e5b6d6dSopenharmony_ci
6062e5b6d6dSopenharmony_ci"JA"
6072e5b6d6dSopenharmony_ci
6082e5b6d6dSopenharmony_ci"MA"
6092e5b6d6dSopenharmony_ci
6102e5b6d6dSopenharmony_ci"KA"
6112e5b6d6dSopenharmony_ci
6122e5b6d6dSopenharmony_ci"KC"
6132e5b6d6dSopenharmony_ci
6142e5b6d6dSopenharmony_ci"JC"
6152e5b6d6dSopenharmony_ci
6162e5b6d6dSopenharmony_ci"MC"
6172e5b6d6dSopenharmony_ci
6182e5b6d6dSopenharmony_ci> :point_right: **Note**: Assuming the letters "J", "K" and "M" have equal primary weights, the second
6192e5b6d6dSopenharmony_ci> letter contains the differences among these strings. However, the letter "K" is
6202e5b6d6dSopenharmony_ci> treated as if it always has a letter "B" following it while the letters "J" and
6212e5b6d6dSopenharmony_ci> "M" do not.
6222e5b6d6dSopenharmony_ci
6232e5b6d6dSopenharmony_ciThe following is an example of collation elements for these strings resulting
6242e5b6d6dSopenharmony_cifrom the specified rules:
6252e5b6d6dSopenharmony_ci
6262e5b6d6dSopenharmony_ciStrings | Collation Elements | &nbsp;         | &nbsp;
6272e5b6d6dSopenharmony_ci------- | ------------------ | -------------- | ------
6282e5b6d6dSopenharmony_ci"JA"    | `[005C.00.01]`     | `[0052.00.01]` |
6292e5b6d6dSopenharmony_ci"MA"    | `[005C.00.03]`     | `[0052.00.01]` |
6302e5b6d6dSopenharmony_ci"KA"    | `[005C.00.02]`     | `[0053.00.01]` | `[0052.00.01]`
6312e5b6d6dSopenharmony_ci"KC"    | `[005C.00.02]`     | `[0053.00.01]` | `[0054.00.01]`
6322e5b6d6dSopenharmony_ci"JC"    | `[005C.00.01]`     | `[0054.00.01]` |
6332e5b6d6dSopenharmony_ci"MC"    | `[005C.00.03]`     | `[0054.00.01]` |
6342e5b6d6dSopenharmony_ci
6352e5b6d6dSopenharmony_ci## Tailoring Issues
6362e5b6d6dSopenharmony_ci
6372e5b6d6dSopenharmony_ciICU uses canonical closure. This means that for each code point in Unicode, if
6382e5b6d6dSopenharmony_cithe canonically composed form of a tailored string produces different collation
6392e5b6d6dSopenharmony_cielements than the canonically decomposed form, then the canonically composed
6402e5b6d6dSopenharmony_ciform is effectively added to the ordering. If 'a' is tailored, for example, all
6412e5b6d6dSopenharmony_ciof the accented 'a' characters are also tailored. Canonical closure allows
6422e5b6d6dSopenharmony_cicollators to process Unicode strings in the FCD form as well as in NFD. (Note:
6432e5b6d6dSopenharmony_ciMost but not all NFC strings are also in FCD. See
6442e5b6d6dSopenharmony_ci<http://www.unicode.org/notes/tn5/#FCD>)
6452e5b6d6dSopenharmony_ci
6462e5b6d6dSopenharmony_ciHowever, *compatibility* equivalents are NOT automatically added. If the rule
6472e5b6d6dSopenharmony_ci"&b < a" is in tailoring, and the order of **ⓐ (circled a)** is important, it
6482e5b6d6dSopenharmony_cineeds to be tailored **explicitly**.
6492e5b6d6dSopenharmony_ci
6502e5b6d6dSopenharmony_ciRedundant tailoring rules are removed, with later rules "winning". The strengths
6512e5b6d6dSopenharmony_ciaround the removed rules are also fixed.
6522e5b6d6dSopenharmony_ci
6532e5b6d6dSopenharmony_ci### Example:
6542e5b6d6dSopenharmony_ci
6552e5b6d6dSopenharmony_ciThe following table summarizes effects of different redundant rules.
6562e5b6d6dSopenharmony_ci
6572e5b6d6dSopenharmony_ci&nbsp; | Original                                                  | Equivalent
6582e5b6d6dSopenharmony_ci------ | --------------------------------------------------------- | ----------
6592e5b6d6dSopenharmony_ci1      | `& a < b < c < d` `& r < c`                               | `& a < b < d` `& r < c`
6602e5b6d6dSopenharmony_ci2      | `& a < b < c < d` `& c < m`                               | `& a < b < c < m < d`
6612e5b6d6dSopenharmony_ci3      | `& a < b < c < d` `& a < m`                               | `& a < m < b < c < d`
6622e5b6d6dSopenharmony_ci4      | `& a <<< b << c < d` `& a < m`                            | `& a <<< b << c < m < d`
6632e5b6d6dSopenharmony_ci5      | `& a < b < c < d` `& [before 1] c < m`                    | `& a < b < m < c < d`
6642e5b6d6dSopenharmony_ci6      | `& a < b <<< c << d <<< e` `& [before 3] e <<< x`         | `& a < b <<< c << d <<< x <<< e`
6652e5b6d6dSopenharmony_ci7      | `& a < b <<< c << d <<< e` `& [before 2] e <<< x`         | `& a < b <<< c <<< x << d <<< e`
6662e5b6d6dSopenharmony_ci8      | `& a < b <<< c << d <<< e` `& [before 1] e <<< x`         | `& a <<< x < b <<< c << d <<< e`
6672e5b6d6dSopenharmony_ci9      | `& a < b <<< c << d <<< e <<< f < g` `& [before 1] g < x` | `& a < b <<< c << d <<< e <<< f < x < g`
6682e5b6d6dSopenharmony_ci
6692e5b6d6dSopenharmony_ciIf two different reset lists tailor the same character, then it is removed from the first
6702e5b6d6dSopenharmony_cione (see 1 in the table above).
6712e5b6d6dSopenharmony_ciIf the second list resets to a character tailored in the first list, then the second
6722e5b6d6dSopenharmony_cilist is inserted in the first (see 2).
6732e5b6d6dSopenharmony_ciIf both lists reset to the same character, then the same thing
6742e5b6d6dSopenharmony_cihappens (see 3). Whenever such an insertion occurs, the second strength
6752e5b6d6dSopenharmony_ci"postpones" the position (see 4).
6762e5b6d6dSopenharmony_ci
6772e5b6d6dSopenharmony_ciIf there is a `[before N]` on the reset, then the reset character is
6782e5b6d6dSopenharmony_cieffectively replaced by the item that would be before it, either in a previous
6792e5b6d6dSopenharmony_citailoring (if the letter occurs in one - see 5) or in the UCA. The N determines
6802e5b6d6dSopenharmony_cithe 'distance' before, based on the strength of the difference (see 6-8).
6812e5b6d6dSopenharmony_ciHowever, this is subject to postponement (see 9), so be careful!
6822e5b6d6dSopenharmony_ci
6832e5b6d6dSopenharmony_ci### Reset semantics
6842e5b6d6dSopenharmony_ci
The reset semantics in ICU 1.8 and above differ from those of earlier ICU
releases. Prior to version 1.8, the expansion implied by a multi-character reset
applied only to the entry immediately following the reset. Since version 1.8, it
applies to all entries up to the next reset or primary relation.
6892e5b6d6dSopenharmony_ci
6902e5b6d6dSopenharmony_ciFor example,
6912e5b6d6dSopenharmony_ci
6922e5b6d6dSopenharmony_ci```
6932e5b6d6dSopenharmony_ci&xyz << e <<< f
6942e5b6d6dSopenharmony_ci```
6952e5b6d6dSopenharmony_ci
6962e5b6d6dSopenharmony_ciwas equivalent to
6972e5b6d6dSopenharmony_ci
6982e5b6d6dSopenharmony_ci```
6992e5b6d6dSopenharmony_ci&x << e/yz <<< f
7002e5b6d6dSopenharmony_ci```
7012e5b6d6dSopenharmony_ci
7022e5b6d6dSopenharmony_ciprior to ICU version 1.8.
7032e5b6d6dSopenharmony_ci
Starting with ICU version 1.8, the rule is equivalent to
7052e5b6d6dSopenharmony_ci
7062e5b6d6dSopenharmony_ci```
7072e5b6d6dSopenharmony_ci&x << e/yz <<< f/yz
7082e5b6d6dSopenharmony_ci```
7092e5b6d6dSopenharmony_ci
7102e5b6d6dSopenharmony_ciThe new semantic produces more intuitive results, especially when the character
7112e5b6d6dSopenharmony_ciafter the reset is decomposable. Since all rules are converted to NFD before
7122e5b6d6dSopenharmony_cithey are interpreted, this can result in contractions that the rule-writer might
7132e5b6d6dSopenharmony_cinot be aware of. Expansion propagates only until the next reset or primary
7142e5b6d6dSopenharmony_cirelation occurs.
7152e5b6d6dSopenharmony_ci
7162e5b6d6dSopenharmony_ciFor example, the following rule:
7172e5b6d6dSopenharmony_ci
7182e5b6d6dSopenharmony_ci```
7192e5b6d6dSopenharmony_ci&ab = c <<< d << e <<< f < g <<< h
7202e5b6d6dSopenharmony_ci```
7212e5b6d6dSopenharmony_ci
7222e5b6d6dSopenharmony_ciwas equivalent to the following prior to ICU 1.8 and in Java:
7232e5b6d6dSopenharmony_ci
7242e5b6d6dSopenharmony_ci```
7252e5b6d6dSopenharmony_ci&a = c/b <<< d << e <<< f < g <<< h
7262e5b6d6dSopenharmony_ci```
7272e5b6d6dSopenharmony_ci
7282e5b6d6dSopenharmony_ciStarting with 1.8, it is equivalent to
7292e5b6d6dSopenharmony_ci
7302e5b6d6dSopenharmony_ci```
7312e5b6d6dSopenharmony_ci&a = c / b <<< d / b << e / b <<< f / b < g <<< h
7322e5b6d6dSopenharmony_ci```
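
The propagation described above can be sketched as a small rule rewriter. This is an illustration only, not ICU's rule parser: it assumes the reset string is already in NFD and forms no contraction, so its first code point is the base and the remainder is the expansion. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: rewrites a rule chain into the explicit form used since ICU 1.8.
class ResetExpansionSketch {

    /** relations and texts are parallel lists, e.g. ["=", "<<<"] and ["c", "d"]. */
    static String makeExplicit(String reset, List<String> relations, List<String> texts) {
        // Split the reset string into its base (first code point) and the expansion.
        int baseEnd = reset.offsetByCodePoints(0, 1);
        String base = reset.substring(0, baseEnd);
        String expansion = reset.substring(baseEnd);

        StringBuilder out = new StringBuilder("&").append(base);
        boolean propagate = !expansion.isEmpty();
        for (int i = 0; i < relations.size(); i++) {
            String relation = relations.get(i);
            if (relation.equals("<")) {
                propagate = false;   // a primary relation ends the propagation
            }
            out.append(' ').append(relation).append(' ').append(texts.get(i));
            if (propagate) {
                out.append(" / ").append(expansion);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: &a = c / b <<< d / b << e / b <<< f / b < g <<< h
        System.out.println(makeExplicit("ab",
                Arrays.asList("=", "<<<", "<<", "<<<", "<", "<<<"),
                Arrays.asList("c", "d", "e", "f", "g", "h")));
    }
}
```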
7332e5b6d6dSopenharmony_ci
7342e5b6d6dSopenharmony_ci## Known Limitations
7352e5b6d6dSopenharmony_ci
The following are known limitations of the ICU collation implementation. They
are largely theoretical, since there are no known languages for
which they are an issue. For completeness, however, they should be
fixed in a future version after 1.8.1. The examples given are designed for
simplicity in testing, and do not match any real languages.
7412e5b6d6dSopenharmony_ci
7422e5b6d6dSopenharmony_ci### Expansion
7432e5b6d6dSopenharmony_ci
7442e5b6d6dSopenharmony_ciThe goal of expansion is to sort as if the expansion text were inserted right
7452e5b6d6dSopenharmony_ciafter the character. For example, with the rule
7462e5b6d6dSopenharmony_ci
7472e5b6d6dSopenharmony_ci```
7482e5b6d6dSopenharmony_ci&a <<< c / e
7492e5b6d6dSopenharmony_ci```
7502e5b6d6dSopenharmony_ci
the text "...**c**..." should sort as if it were right after "...**ae**..." with
a tertiary difference. There are a few cases where this is not currently true.
7532e5b6d6dSopenharmony_ci
7542e5b6d6dSopenharmony_ci#### Recursive Expansion
7552e5b6d6dSopenharmony_ci
7562e5b6d6dSopenharmony_ciGiven the rules
7572e5b6d6dSopenharmony_ci
7582e5b6d6dSopenharmony_ci```
7592e5b6d6dSopenharmony_ci&a <<< c / e
&g <<< e / i
7612e5b6d6dSopenharmony_ci```
7622e5b6d6dSopenharmony_ci
7632e5b6d6dSopenharmony_ciExpansion should sort the text "...**c**..." as if it were just after
7642e5b6d6dSopenharmony_ci"...**ae**...", and that should also sort as if it were just after
7652e5b6d6dSopenharmony_ci"...**agi**...". This requires that the compilation of expansions be recursive
7662e5b6d6dSopenharmony_ci(and check for loops as well!). ICU currently does not do this.
7672e5b6d6dSopenharmony_ci
7682e5b6d6dSopenharmony_ciRules         | Desired Order | Current Order
7692e5b6d6dSopenharmony_ci------------- | ------------- | -------------
7702e5b6d6dSopenharmony_ci`& a = b / c` | add           | b
7712e5b6d6dSopenharmony_ci`& d = c / e` | b             | add
7722e5b6d6dSopenharmony_ci&nbsp;        | adf           | adf
7732e5b6d6dSopenharmony_ci
7742e5b6d6dSopenharmony_ci#### Contractions Spanning Expansions
7752e5b6d6dSopenharmony_ci
ICU currently always pre-compiles the expansion into an internal format (a list
of one or more collation elements) when the rule is compiled. If there is a
contraction that spans the end of the expanded text and the start of the
original text, however, that contraction will not match. A test case that
illustrates this is:
7812e5b6d6dSopenharmony_ci
7822e5b6d6dSopenharmony_ciRules           | Desired Order | Current Order
7832e5b6d6dSopenharmony_ci--------------- | ------------- | -------------
7842e5b6d6dSopenharmony_ci`& a <<< c / e` | ad            | ad
7852e5b6d6dSopenharmony_ci`& g <<< eh`    | c             | c
7862e5b6d6dSopenharmony_ci&nbsp;          | af            | ch
7872e5b6d6dSopenharmony_ci&nbsp;          | g             | af
7882e5b6d6dSopenharmony_ci&nbsp;          | ch            | g
7892e5b6d6dSopenharmony_ci&nbsp;          | h             | h
7902e5b6d6dSopenharmony_ci
Since the pre-compiled expansions are a huge performance gain, we will probably
keep the implementation the way it is, but in the future allow additional syntax
to indicate those few expansions that need to behave as if the text were
inserted because of the existence of another contraction. Note that such
expansions need to be recursively expanded (as described under Recursive
Expansion above), but at runtime rather than at pre-compile time.
7972e5b6d6dSopenharmony_ci
7982e5b6d6dSopenharmony_ciWhile it is possible to automatically detect these cases, it would be better to
7992e5b6d6dSopenharmony_ciallow explicit control in case spanning is not desired. An example of such
8002e5b6d6dSopenharmony_cisyntax might be something like:
8012e5b6d6dSopenharmony_ci
8022e5b6d6dSopenharmony_ci```
8032e5b6d6dSopenharmony_ci&a <<< c // e
8042e5b6d6dSopenharmony_ci```
8052e5b6d6dSopenharmony_ci
**Note:** ICU does handle the case where there is a contraction that is
completely inside the expansion.
8082e5b6d6dSopenharmony_ci
8092e5b6d6dSopenharmony_ciSuppose that someone had the rules:
8102e5b6d6dSopenharmony_ci
8112e5b6d6dSopenharmony_ci```
8122e5b6d6dSopenharmony_ci&a = c / e
8132e5b6d6dSopenharmony_ci&x = ae
8142e5b6d6dSopenharmony_ci```
8152e5b6d6dSopenharmony_ci
8162e5b6d6dSopenharmony_ciThese do not cause **c** to sort as if it were **ae**, nor should they.
8172e5b6d6dSopenharmony_ci
8182e5b6d6dSopenharmony_ci### Normalization
8192e5b6d6dSopenharmony_ci
The Unicode Collation Algorithm specifies that all text sort as if it were first
normalized into NFD. For performance reasons, ICU collation data is
pre-processed so that there is no need to perform normalization on strings that
are in [FCD](http://www.unicode.org/notes/tn5/#FCD) and do not contain any composite
combining marks. The vast majority of strings are in this form.
The composite combining marks are { U+0344, U+0F73, U+0F75, U+0F81 }
([`[[:^lccc=0:]&[:toNFD=/../:]]`](http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5Elccc%3D0%3A%5D%26%5B%3AtoNFD%3D%2F..%2F%3A%5D&abb=on&g=)).
These characters must be decomposed for discontiguous contractions to work
properly, and their use is discouraged by the Unicode Standard.
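
To see why the four composite combining marks are special, it can help to look at their canonical decompositions. The following sketch uses the JDK's `java.text.Normalizer` rather than ICU4J, purely so that it has no dependencies; the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Prints the NFD form of each composite combining mark listed above.
class CompositeMarkSketch {

    static final int[] COMPOSITE_MARKS = {0x0344, 0x0F73, 0x0F75, 0x0F81};

    /** Returns the NFD form of one code point as space-separated hex digits. */
    static String nfdHex(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        nfd.codePoints().forEach(cp ->
                out.append(out.length() == 0 ? "" : " ").append(String.format("%04X", cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        for (int mark : COMPOSITE_MARKS) {
            System.out.println(String.format("U+%04X -> %s", mark, nfdHex(mark)));
        }
    }
}
```

Each of these marks decomposes into two combining marks, which is why a string containing one of them cannot be handled by the FCD shortcut alone.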
8302e5b6d6dSopenharmony_ci
8312e5b6d6dSopenharmony_ci#### Nulls in Contractions
8322e5b6d6dSopenharmony_ci
8332e5b6d6dSopenharmony_ciNulls should not be used in contractions that could invoke normalization.
8342e5b6d6dSopenharmony_ci
8352e5b6d6dSopenharmony_ciRules                | Desired Order | Current Order
8362e5b6d6dSopenharmony_ci-------------------- | ------------- | -------------
8372e5b6d6dSopenharmony_ci`& a <<< '\u0000'^`  | a             | '\\u0000'^
8382e5b6d6dSopenharmony_ci&nbsp;               | '\\u0000'^    | a
8392e5b6d6dSopenharmony_ci
8402e5b6d6dSopenharmony_ci#### Contractions Spanning Normalization
8412e5b6d6dSopenharmony_ci
8422e5b6d6dSopenharmony_ciThe following rule specifies that a grave accent followed by a **b** is a
8432e5b6d6dSopenharmony_cicontraction, and sorts as if it were an **e**.
8442e5b6d6dSopenharmony_ci
8452e5b6d6dSopenharmony_ci```
8462e5b6d6dSopenharmony_ci& e <<< ` b
8472e5b6d6dSopenharmony_ci```
8482e5b6d6dSopenharmony_ci
8492e5b6d6dSopenharmony_ciOn this basis, "...àb..." should sort as if it were just after "...ae...".
8502e5b6d6dSopenharmony_ciBecause of the preprocessing, however, the contraction will not match if this
8512e5b6d6dSopenharmony_citext is represented with the pre-composed character à, but **will** match if
8522e5b6d6dSopenharmony_cigiven the decomposed sequence **a + grave accent**. The same thing happens if
8532e5b6d6dSopenharmony_cithe contraction spans the start of a normalized sequence.
8542e5b6d6dSopenharmony_ci
8552e5b6d6dSopenharmony_ciRules        | Desired Order | Current Order
8562e5b6d6dSopenharmony_ci------------ | ------------- | -------------
8572e5b6d6dSopenharmony_ci& e <<< \` b | à             | à
8582e5b6d6dSopenharmony_ci&nbsp;       | ad            | àb
8592e5b6d6dSopenharmony_ci&nbsp;       | àb            | ad
8602e5b6d6dSopenharmony_ci&nbsp;       | af            | af
8612e5b6d6dSopenharmony_ci&nbsp;       | &nbsp;        |
8622e5b6d6dSopenharmony_ci`& g <<< ca` | f             | cà
8632e5b6d6dSopenharmony_ci&nbsp;       | ca            | f
8642e5b6d6dSopenharmony_ci&nbsp;       | cà            | ca
8652e5b6d6dSopenharmony_ci&nbsp;       | h             | h
8662e5b6d6dSopenharmony_ci
8672e5b6d6dSopenharmony_ci### Variable Top
8682e5b6d6dSopenharmony_ci
8692e5b6d6dSopenharmony_ciICU lets you set the top of the variable range. This can be done, for example,
8702e5b6d6dSopenharmony_cito allow you to ignore just SPACES, and not punctuation.
8712e5b6d6dSopenharmony_ci
8722e5b6d6dSopenharmony_ci#### Variable Top Exclusion
8732e5b6d6dSopenharmony_ci
8742e5b6d6dSopenharmony_ciThere is currently a limitation that causes variable top to (perhaps) exclude
8752e5b6d6dSopenharmony_cimore characters than it should. This happens if you not only set variable top,
8762e5b6d6dSopenharmony_cibut also tailor a number of characters around it with primary differences. The
8772e5b6d6dSopenharmony_ciexact number that you can tailor depends on the internal "gaps" between the
8782e5b6d6dSopenharmony_cicharacters in the pre-compiled UCA table. Normally there is a gap of one. There
8792e5b6d6dSopenharmony_ciare larger gaps between scripts (such as between Latin and Greek), and after
8802e5b6d6dSopenharmony_cicertain other special characters. For example, if variable top is set to be at
8812e5b6d6dSopenharmony_ciSPACE ('\\u0020'), then it works correctly with up to 70 characters also
8822e5b6d6dSopenharmony_citailored after space. However, if variable top is set to be equal to HYPHEN
8832e5b6d6dSopenharmony_ci('\\u2010'), only one other value can be accommodated.
8842e5b6d6dSopenharmony_ci
8852e5b6d6dSopenharmony_ciIn the following, the goal is for x to be ignored and z not to be ignored.
8862e5b6d6dSopenharmony_ci
8872e5b6d6dSopenharmony_ciRules              | Desired Order SHIFTED = ON | Current Order
8882e5b6d6dSopenharmony_ci------------------ | -------------------------- | -------------
8892e5b6d6dSopenharmony_ci`& \u2010`         | -                          | -
8902e5b6d6dSopenharmony_ci`< x`              | z                          | z
8912e5b6d6dSopenharmony_ci`< [variable top]` | zb                         | zb
8922e5b6d6dSopenharmony_ci`< z`              | a                          | xb
8932e5b6d6dSopenharmony_ci&nbsp;             | b                          | a
8942e5b6d6dSopenharmony_ci&nbsp;             | -b                         | b
8952e5b6d6dSopenharmony_ci&nbsp;             | xb                         | -b
8962e5b6d6dSopenharmony_ci&nbsp;             | c                          | c
8972e5b6d6dSopenharmony_ci
8982e5b6d6dSopenharmony_ci> :point_right: **Note**: With ICU 1.8.1, the
8992e5b6d6dSopenharmony_ci> user is advised not to tailor the variable top to customize more than two
9002e5b6d6dSopenharmony_ci> primary relations (for example, `"& x < y < [variable top]"`). Starting in ICU
9012e5b6d6dSopenharmony_ci> 2.0, setVariableTop() allows the user to set the variable top programmatically
9022e5b6d6dSopenharmony_ci> to a legal single character or a valid contracting sequence. In addition, the
9032e5b6d6dSopenharmony_ci> string that variable top is set to should not be treated as either inclusive or
9042e5b6d6dSopenharmony_ci> exclusive in the rules.
9052e5b6d6dSopenharmony_ci
9062e5b6d6dSopenharmony_ci### Case Level/First/Second
9072e5b6d6dSopenharmony_ci
9082e5b6d6dSopenharmony_ciIn ICU, it is possible to override the tertiary settings programmatically. This
9092e5b6d6dSopenharmony_ciis used to change the default case behavior to be all upper first or all lower
9102e5b6d6dSopenharmony_cifirst. It can also be used for a separate case level, or to ignore all other
9112e5b6d6dSopenharmony_citertiary differences (such as between circled and non-circled letters, or
9122e5b6d6dSopenharmony_cibetween half-width and full-width katakana). The case values are derived
9132e5b6d6dSopenharmony_cidirectly from the Unicode character properties, and not set by the rules.
9142e5b6d6dSopenharmony_ci
9152e5b6d6dSopenharmony_ci#### Mixed Case Contractions
9162e5b6d6dSopenharmony_ci
9172e5b6d6dSopenharmony_ciThere is currently a limitation that all contractions of multiple characters can
9182e5b6d6dSopenharmony_cionly have three special case values: upper, lower, and mixed. All mixed-case
9192e5b6d6dSopenharmony_cicontractions are grouped together, and are not affected by the upper first vs.
9202e5b6d6dSopenharmony_cilower first flag.
9212e5b6d6dSopenharmony_ci
9222e5b6d6dSopenharmony_ciRules      | Desired Order UPPER_FIRST | Current Order
9232e5b6d6dSopenharmony_ci---------- | ------------------------- | -------------
9242e5b6d6dSopenharmony_ci`& c < ch` | C                         | c
9252e5b6d6dSopenharmony_ci`<<< cH`   | CH                        | CH
9262e5b6d6dSopenharmony_ci`<<< Ch`   | Ch                        | cH
9272e5b6d6dSopenharmony_ci`<<< CH`   | cH                        | Ch
9282e5b6d6dSopenharmony_ci&nbsp;     | ch                        | ch
9292e5b6d6dSopenharmony_ci
9302e5b6d6dSopenharmony_ci## Building on Existing Locales
9312e5b6d6dSopenharmony_ci
9322e5b6d6dSopenharmony_ciAll of the collation rules are additive; that is, they override what any
9332e5b6d6dSopenharmony_ciprevious rule expressed. That means that you can build on existing rules for
9342e5b6d6dSopenharmony_cigiven locales. Here is an example of this, which fetches the rules for a
9352e5b6d6dSopenharmony_ciparticular locale (Danish), then overrides some part (sorting '%' after 'm').
9362e5b6d6dSopenharmony_ciThe syntax is Java, but C/C++ has similar features.
9372e5b6d6dSopenharmony_ci
```java
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import static org.junit.Assert.assertEquals;  // assumes JUnit 4 is on the classpath

// The statements below belong inside a test method.
ULocale myLocale = new ULocale("da");
try {
    // Fetch the Danish tailoring rules and append one more rule.
    RuleBasedCollator col = (RuleBasedCollator) Collator.getInstance(myLocale);
    String rules = col.getRules();
    String myRules = "& m < '%'";
    RuleBasedCollator col2 = new RuleBasedCollator(rules + myRules);

    // Check the values: '%' now sorts after 'm', and "aa" still sorts
    // after 'z' as in Danish.
    List<String> expected = Arrays.asList("a;m;%;z;aa".split(";"));
    TreeSet<String> sorted = new TreeSet<String>(col2);
    sorted.addAll(expected);
    ArrayList<String> actual = new ArrayList<String>(sorted);
    assertEquals("Customized rules with %", expected, actual);

} catch (Exception e) {
    throw new IllegalArgumentException("Failed to create customized rules", e);
}
```
9592e5b6d6dSopenharmony_ci
The root collator has an empty rules string (`getRules()` returns `""`). Any
collator's tailoring rules string defines how that collator *differs* from the
root collator, and it was the input for building the tailored collator. By
contrast, the root collator itself is built from a file with explicit mappings
(ICU4C source/data/unidata/FractionalUCA.txt)
from characters/contractions to collation elements. This file represents the
[DUCET](http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table)
as [modified by
CLDR](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation).

There are "extended" versions of `getRules()` which, when called with
`delta=UCOL_FULL_RULES` (C/C++) or `fullrules=true` (Java), return "full rules",
which are a concatenation of the "UCA rules" and the collator's tailoring. The
"UCA rules" are published as UCA_Rules.txt in every [UCA
release](http://www.unicode.org/Public/UCA/).

*   "UCA rules" is a historical misnomer. The UCA specifies an Algorithm which
    applies to all collators, and provides the DUCET as its Default table.
*   ICU's root collator implements the CLDR-modified collation element table.
    The "UCA rules" returned from ICU functions are equivalently modified rules
    compared with those for the DUCET.

The "UCA rules" are an *approximation* of the root collator's sort order, but
there are some differences because not all of the details of the root collator
mappings can be expressed in rule syntax. In particular, a collator built from
ICU4C source/data/unidata/UCARules.txt
has at least the following issues compared with the real root collator:

*   inefficient (long) collation element weights
*   CODAN (numeric collation) will not work (the 0 digit's primary weight is
    hardcoded, or specified in FractionalUCA.txt)
*   script reordering will not work
*   alternate=shifted will not work
*   the sort order has some differences from the regular root collator,
    including additional tertiary differences

The "full rules" are almost never used, or useful, at runtime. They are included
in ICU for historical reasons and for UCA consistency tests. They might be
usable for emulating the CLDR/ICU sort order with a collation implementation not
based on CLDR/ICU.

Collation rule strings in general are not commonly used but are a significant
portion of the data size in ICU collation resource bundles, especially for CJK
languages. The rule strings can be omitted from those resource bundles by adding
the `--omitCollationRules` option to the relevant `genrb` invocations
(for ICU 53..63, in icu4c/source/data/Makefile.in)
or, since ICU 64, with a [data filter config file](../../icu_data/buildtool.md).
(See for example the relevant
[ICU integration test instructions](https://icu.unicode.org/processes/release/tasks/integration#TOC-Verify-that-ICU4C-tests-pass-without-collation-rule-strings).)

If the tailoring rules are needed but the 150 kB or so of "UCA rules" are not,
then the line

```
UCARules:process(uca_rules){"../unidata/UCARules.txt"}
```

in
[source/data/coll/root.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/coll/root.txt)
can be commented out or deleted.

## Cautions

The following are not known rule limitations, but rather cautions.

### Resets

Since a reset always works on the existing state, the user must make sure that
the rule entries are in the proper order.

Rules     | Order | Comment
--------- | ----- | -------
`& a < b` | a     | The rules mean: put **b** after **a**, then put **c** after **a** (inserting **before** the **b**).
`& a < c` | c     |
&nbsp;    | b     |

### Postpone Insertion

When a reset is used to insert a value X with a certain strength difference
after a value Y, X is actually inserted just before the next item following Y
that has the same or a higher strength. Thus, the following are equivalent:

```
... m < a = c <<< d << e <<< f < g <<< h & a << x
... m < a = c <<< d << x << e <<< f < g <<< h
```

> :point_right: **Note**: This is different from the Java semantics.
> In Java, the value is inserted immediately after the reset character.
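
The insertion rule described above can be modeled in a few lines of plain Java.
This is only a toy sketch of the ordering behavior, not ICU's rule builder: it
uses neither ICU4J nor `java.text.RuleBasedCollator`, and the class
`ResetInsertionSketch` and its methods are hypothetical names made up for this
illustration. It assumes that every item is a single string and that `=` is the
weakest relation, followed by `<<<`, `<<`, and `<`.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of reset insertion order; this is not ICU's rule builder. */
final class ResetInsertionSketch {
    // Relation operators, strongest first. Level 0 marks the first entry,
    // which has no relation to a predecessor.
    private static final String[] OPERATORS = {"", "<", "<<", "<<<", "="};

    /** One item plus the level of the relation linking it to its predecessor. */
    private static final class Entry {
        final String text;
        final int level;

        Entry(String text, int level) {
            this.text = text;
            this.level = level;
        }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    ResetInsertionSketch(String first) {
        entries.add(new Entry(first, 0));
    }

    /** Appends an item at the end of the chain, e.g. append("<", "a"). */
    ResetInsertionSketch append(String operator, String text) {
        entries.add(new Entry(text, levelOf(operator)));
        return this;
    }

    /** Models the rule "& reset operator text". */
    ResetInsertionSketch reset(String reset, String operator, String text) {
        int level = levelOf(operator);
        int i = indexOf(reset) + 1;
        // Skip followers attached by a weaker relation (a larger level), so
        // the new item lands just before the next item of the same or a
        // higher strength.
        while (i < entries.size() && entries.get(i).level > level) {
            i++;
        }
        entries.add(i, new Entry(text, level));
        return this;
    }

    private static int levelOf(String operator) {
        for (int level = 1; level < OPERATORS.length; level++) {
            if (OPERATORS[level].equals(operator)) {
                return level;
            }
        }
        throw new IllegalArgumentException("Unknown operator: " + operator);
    }

    private int indexOf(String text) {
        for (int i = 0; i < entries.size(); i++) {
            if (entries.get(i).text.equals(text)) {
                return i;
            }
        }
        throw new IllegalArgumentException("Reset target not found: " + text);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Entry e : entries) {
            if (e.level > 0) {
                sb.append(' ').append(OPERATORS[e.level]).append(' ');
            }
            sb.append(e.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ResetInsertionSketch chain = new ResetInsertionSketch("m")
                .append("<", "a").append("=", "c").append("<<<", "d")
                .append("<<", "e").append("<<<", "f")
                .append("<", "g").append("<<<", "h");
        chain.reset("a", "<<", "x");
        System.out.println(chain);
    }
}
```

The same model reproduces the table under Resets: starting from a single item
`a`, calling `reset("a", "<", "b")` and then `reset("a", "<", "c")` gives the
order a, c, b.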

### Jamo Tailoring

Tailoring Jamo characters causes the code to go through a slow path, which has
a significant effect on performance.

### Compatibility Decompositions

When tailoring a letter, the customization affects all of its canonical
equivalents. That is, if a tailoring rule sorts "**a**" after "**e**", for
example, then "**à**", "**á**", ... are also sorted after "**e**". This is not
true for compatibility equivalents. If the desired sorting order is for a
superscript-a ("**ª**") to be after "**e**", it is necessary to specify a rule
for that.
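
The difference between the two kinds of equivalence can be checked with the
JDK's `java.text.Normalizer`. The sketch below does not call ICU at all; it
only shows that "à" and "á" have a canonical decomposition starting with "a",
while "ª" (U+00AA) reaches "a" only through its compatibility decomposition.
The class and method names are hypothetical, and "starts with the base letter"
is a simplification chosen for this illustration.

```java
import java.text.Normalizer;

/** Checks how a character relates to a base letter; illustration only. */
final class EquivalenceSketch {
    /** True if the canonical decomposition (NFD) of s starts with base. */
    static boolean hasCanonicalBase(String s, String base) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).startsWith(base);
    }

    /** True if only the compatibility decomposition (NFKD) of s starts with base. */
    static boolean hasCompatibilityBaseOnly(String s, String base) {
        return !hasCanonicalBase(s, base)
                && Normalizer.normalize(s, Normalizer.Form.NFKD).startsWith(base);
    }

    public static void main(String[] args) {
        // a with grave, a with acute, and U+00AA (the superscript a from the text)
        String[] samples = {"\u00E0", "\u00E1", "\u00AA"};
        for (String s : samples) {
            System.out.printf("U+%04X canonical=%b compatibilityOnly=%b%n",
                    (int) s.charAt(0),
                    hasCanonicalBase(s, "a"),
                    hasCompatibilityBaseOnly(s, "a"));
        }
    }
}
```

Characters for which `hasCompatibilityBaseOnly` returns true are the ones that
need their own rule.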

### Case Differences

Similarly, when tailoring "**a**" to be sorted after "**e**", a specific rule
is required if "**A**" is to be sorted after "**e**" as well.

### Automatic Expansions

ICU will automatically form expansions whenever a reset is to a multi-character
value that is not a contraction. For example, `& ab <<< c` is equivalent to
`& a <<< c / b`. The user may be unaware of this happening, since it may not be
obvious that the reset is to a multi-character value. For example, `& à <<< d`
is equivalent to `& a <<< d / \u0300`, because "à" is canonically equivalent to
"a" followed by U+0300 (combining grave accent).
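
To make the hidden multi-character case visible, the following sketch rewrites
a reset into the explicit expansion form. It is a simplified model rather than
ICU's behavior in general: it assumes that no contractions are defined, so the
reset anchor is always the first code point of the canonically decomposed
reset string, and everything after it becomes the expansion. The class and
method names are hypothetical.

```java
import java.text.Normalizer;

/** Rewrites "& reset op text" into "& anchor op text / extension"; toy model. */
final class ExpansionSketch {
    static String rewriteReset(String reset, String operator, String text) {
        if (reset.isEmpty()) {
            throw new IllegalArgumentException("Reset string must not be empty");
        }
        // Decompose first, so that a precomposed letter shows its parts.
        String nfd = Normalizer.normalize(reset, Normalizer.Form.NFD);
        int split = nfd.offsetByCodePoints(0, 1); // end of the first code point
        String anchor = nfd.substring(0, split);
        String extension = nfd.substring(split);
        String rule = "& " + escape(anchor) + " " + operator + " " + escape(text);
        return extension.isEmpty() ? rule : rule + " / " + escape(extension);
    }

    /** Prints non-ASCII chars as escapes so combining marks stay visible. */
    private static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewriteReset("ab", "<<<", "c"));
        System.out.println(rewriteReset("\u00E0", "<<<", "d"));
    }
}
```

A reset to a single unaccented letter has no expansion part, so the rule is
returned unchanged.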