---
layout: default
title: Customization
nav_order: 3
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation Customization
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU uses the [CLDR root collation
order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
as a default starting point for ordering. (The CLDR root collation is based on
the [UCA
DUCET](http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table).)
Not all languages have sorting sequences that correspond with the root collation
order because no single sort order can simultaneously encompass the specifics of
all the languages. In particular, languages that share a script may sort the
same letters differently.

Therefore, ICU provides a data-driven, flexible, and run-time-customizable
mechanism called "tailoring". Tailoring overrides the default order of code
points and the values of the ICU Collation Service attributes.

## Collation Rule

A `RuleBasedCollator` is built from a rule string which changes the sort order of
some characters and strings relative to the default order. An empty string (or
one with only white space and comments) results in a collator that behaves like
the root collator.

A tailoring is specified via a string containing a set of rules. ICU implements
the (CLDR) [LDML collation rule
syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules). For more
details see there.

Each rule contains a string of ordered characters that starts with an **anchor
point** or a **reset value**. For example, `"&a < g"` places "g"
after "a" and before "b", and the "a" does not change place. This rule has the
following sorting consequences:

Without rule | With rule
------------ | ---------
Abernathy | Abernathy
apple | apple
bird | green
Boston | bird
Graham | Boston
green | Graham

Note that only the word that starts with "g" has changed place. All the words
sorted after "a" and "A" are sorted after "g".
This includes "Graham"; "G" would have to be tailored separately, such as with
`"&a < g <<< G"`.

This is a simple example of a tailoring rule.
Tailoring rules consist of
zero or more rules and zero or more options. There must be at least one rule or
at least one option. The rule syntax is discussed in more detail in the
following sections.

Note that the tailoring rules override the UCA ordering. In addition, if a
character is reordered, it automatically reorders any other equivalent
characters. For example, if the rule "&e<a" is used to reorder "a" in the list,
"á" is also greater than "é".

## Syntax

The following table summarizes the basic syntax necessary for most usages:

Symbol | Example | Description
------ | ------------- | ----------------------------------
`<` | `a < b` | Identifies a primary (base letter) difference between "a" and "b"
`<<` | `a << ä` | Signifies a secondary (accent) difference between "a" and "ä"
`<<<` | `a<<<A` | Identifies a tertiary difference between "a" and "A"
`<<<<` | `か<<<<カ` | Identifies a quaternary difference between "か" and "カ". (New in ICU 53.)
`=` | `x = y` | Signifies no difference between "x" and "y".
`&` | `&Z` | Instructs ICU to reset at this letter. These rules will be relative to this letter from here on, but will not affect the position of Z itself.

> :point_right: **Note**: ICU permits up to three quaternary relations in a row
> (except for intervening "=" identity relations).

> :point_right: **Note**: In releases prior to 1.8,
> ICU used the notations `;` to represent secondary relations and `,` to represent tertiary relations.
> Starting in release 1.8, use `<<` symbols to represent secondary relations and
> `<<<` symbols to represent tertiary relations.
> Rules that use the `;` and `,` notations are still processed by ICU for compatibility;
> also, some of the data used for tailoring to particular locales
> has not yet been updated to the new syntax.
> However, one should consider these symbols deprecated.

> :point_right: **Note**: See the [LDML collation rule syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules)
> and [Properties and ICU Rule Syntax](../../strings/properties.md) for
> information regarding syntax characters.

Repeated use of the same relation can be abbreviated, for example
`&a <* bcd-gp-s` for `&a < b < c < d < e < f < g < p < q < r < s`.
For details see the
[LDML collation spec, section
Orderings](http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings).

### Escaping Rules

Most of the characters can be used as parts of rules. However, whitespace
characters will be skipped over, and all ASCII characters that are not digits or
letters are considered to be part of syntax.
In order to use these characters in
rules, they need to be escaped. Escaping can be done in several ways:

* Single characters can be escaped using backslash **\\** (U+005C).

* Strings can be escaped by putting them between single quotes **'like
  this'**.

* The single quote (ASCII apostrophe) can be quoted using two single quotes
  **''**, both inside and outside single-quote-escaped strings.

### Simple Tailoring Examples

Serbian (Latin) or Croatian: `& C < č <<< Č < ć <<< Ć`

This rule is needed because the root collation order usually treats an accent
as only a secondary difference from the base character. This rule ensures that 'ć'
and 'č' are treated as base letters.

UCA | Tailoring: `& C < č <<< Č < ć <<< Ć`
--------------- | --------------
CUKIĆ RADOJICA | CUKIĆ RADOJICA
ČUKIĆ SLOBODAN | CUKIĆ SVETOZAR
CUKIĆ SVETOZAR | CURIĆ MILOŠ
ČUKIĆ ZORAN | CVRKALJ ÐURO
CURIĆ MILOŠ | ČUKIĆ SLOBODAN
ĆURIĆ MILOŠ | ČUKIĆ ZORAN
CVRKALJ ÐURO | ĆURIĆ MILOŠ

Serbian (Latin) or Croatian: `& Ð < dž <<< Dž <<< DŽ`

This rule is an example of a contraction. "D" alone is sorted after "C" and "Ž"
is sorted after "Z", but "DŽ", due to the tailoring rule, is treated as a single
letter that gets sorted after "Đ" and before "E" ("Đ" sorts as a base letter
after "D" in the UCA). Another thing to note in this example is capitalization
of the letter "DŽ". There are three versions, since all three can legally appear
in text. The fourth version "dŽ" is omitted since it does not occur.

UCA | Tailoring: `& Ð < dž <<< Dž <<< DŽ`
-------- | ---------
dan | dan
dubok | dubok
džabe | đak
džin | džabe
Džin | džin
DŽIN | Džin
đak | DŽIN
Evropa | Evropa

Danish: `&V <<< w <<< W`

The letter 'W' is sorted after 'V', but is treated as a tertiary difference
similar to the difference between 'v' and 'V'.

UCA | `&V <<< w <<< W`
--- | ----------------
va | va
Va | Va
VA | VA
vb | wa
Vb | Wa
VB | WA
vz | vb
Vz | Vb
VZ | VB
wa | wb
Wa | Wb
WA | WB
wb | vz
Wb | Vz
WB | VZ
wz | wz
Wz | Wz
WZ | WZ

### Default Options

ICU implements the [LDML collation
options/settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options).
For more information see there.

The tailoring inherits all the attribute values from the root collator unless
they are explicitly redefined in the tailoring. The following summarizes
the option settings. Default options are **in emphasis**.

#### alternate
- **`[alternate non-ignorable]`**
- `[alternate shifted]`

Sets the default value of the UCOL_ALTERNATE_HANDLING attribute. If
set to shifted, variable code points will be ignored on the primary level.
For details see the [“Ignore Punctuation” Options](ignorepunct.md) page.

#### maxVariable
- **`[maxVariable punct]`**
- `[maxVariable space]`

Sets the variable top to the top of the specified
reordering group. (New in ICU 53.) All code points with primary weights less
than or equal to the variable top will be considered variable, and thus affected
by the alternate handling.

#### variable top
(deprecated)
- `& X < [variable top]`

Sets the default value for the variable top. All the code points with primary
weights less than the variable top will be considered variable.
*Changing the variable top via this rule syntax is deprecated since ICU 53.*
It has been replaced by the maxVariable option.

#### normalization
- **`[normalization off]`**
- `[normalization on]`

Turns on or off the UCOL_NORMALIZATION_MODE attribute.
If set to on, a quick check and necessary normalization will be performed.

#### strength
- `[strength 1]`
- `[strength 2]`
- **`[strength 3]`**
- `[strength 4]`
- `[strength I]`

Sets the default strength for the collator.

#### backwards
- `[backwards 2]`

Sets the default value of the UCOL_FRENCH_COLLATION attribute. If set to on,
weights on the secondary level will be reversed.

#### caseLevel
- **`[caseLevel off]`**
- `[caseLevel on]`

Turns on or off the UCOL_CASE_LEVEL attribute. If set to on, a
level consisting only of case characteristics will be inserted in front of
the tertiary level. To ignore accents but take cases into account, set strength to
primary and case level to on.

#### caseFirst
- **`[caseFirst off]`**
- `[caseFirst upper]`
- `[caseFirst lower]`

Sets the value for the UCOL_CASE_FIRST attribute. If set to
upper, causes upper case to sort before lower case. If set to lower, lower case
will sort before upper case. Useful for locales that have an already supported
ordering but require a different order of cases. Affects case and tertiary levels.

#### numericOrdering
- **`[numericOrdering off]`**
- `[numericOrdering on]`

Turns on or off the UCOL_NUMERIC_COLLATION attribute. If
set to on, then sequences of decimal digits (gc=Nd) sort by their numeric value.
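
The effect of `[numericOrdering on]` can be illustrated with a small toy model. The sketch below is plain Python and does not call ICU; `numeric_key` is a hypothetical helper that only mimics the idea that each run of decimal digits is compared by its numeric value instead of digit by digit. Everything else about real collation (weights, levels, tailorings) is left out.

```python
import re

# Toy model only: this is NOT ICU and it ignores real collation weights.
# For str patterns, \d matches Unicode decimal digits (general category Nd).
_DIGIT_RUN = re.compile(r"(\d+)")


def numeric_key(text):
    """Return a sort key in which each run of decimal digits compares by value."""
    key = []
    for part in _DIGIT_RUN.split(text):
        if not part:
            continue  # re.split yields empty strings at the edges
        if part.isdecimal():
            # Digit run: tag 0, compared as an integer.
            key.append((0, int(part), ""))
        else:
            # Other text: tag 1, compared as a plain string.
            key.append((1, 0, part))
    return key


names = ["item10", "item9", "item100", "item1"]
plain_order = sorted(names)                      # digit-by-digit comparison
numeric_order = sorted(names, key=numeric_key)   # digit runs compared by value

print(plain_order)
print(numeric_order)
```

The tags `0` and `1` keep integers and strings from ever being compared with each other, which would raise a `TypeError` in Python 3.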

#### hiraganaQ
(deprecated)
- **`[hiraganaQ off]`**
- `[hiraganaQ on]`

Controls special treatment of Hiragana code points on
the quaternary level. If turned on, Hiragana code points will get lower values than
all the other non-variable code points. Strength must be greater than or equal to
quaternary if you want this attribute to take effect.
*hiraganaQ is deprecated since ICU 50.* It was an implementation detail of the
Japanese tailoring. In CLDR 25/ICU 53, the Japanese tailoring expresses the
differences between Hiragana and Katakana via explicit quaternary (`<<<<`)
relations.

#### suppressContractions
- `[suppressContractions [Љ-ґ]]`

Removes context-sensitive mappings (contractions and prefix/context-before mappings)
associated with each of the code points in the given UnicodeSet. It works on the
current set of rules: It removes mappings from the root collation as well as
from previous rules.

This is the only way to *remove* mappings: The rule syntax otherwise only adds
and overrides mappings. This special command is used in CLDR tailoring data to
remove Cyrillic root collation contractions that are not necessary in several
languages.

#### optimize
- `[optimize [Ά-ώ]]`

Performance optimization for the code points in the UnicodeSet.
In ICU, where tailoring data only contains the
mappings that are different from the root collation (otherwise the data would be
too large), falling back to root collation mappings for the rest of Unicode is
slightly slower. The optimize command copies mappings for additional characters
into the tailoring data.

#### reorder
followed by one or more reorder codes
- `[reorder Grek Hani space]`

Reorders scripts relative to each other and relative to a special set of
non-script blocks (space, punctuation, symbol, currency, and digit). The default
order is the same as in the DUCET and in the CLDR root collator.

----

A tailoring that consists only of options is also valid and has the same basic
ordering as the root collation. For example, the Greek tailoring has option
settings only: `[normalization on][reorder Grek]`

(The examples in this chapter might refer to older versions of data for
particular languages. Check CLDR or ICU for actual, current tailorings.)

The following tailoring example reorders uppercase and lowercase and uses
backwards-secondary ordering:

```
[caseFirst upper]
[backwards 2]
& C < č , Č
& G < ģ , Ģ
& I < y, Y
& K < ķ , Ķ
& L < ļ , Ļ
& N < ņ , Ņ
& S < š , Š
& Z < ž , Ž
```

#### Values for Reorder Codes

Reordering Group | Rule Value
---------------------------------------- | ----------
Unicode white space characters | space
Unicode punctuation | punct
Unicode symbols except currency symbols | symbol
Unicode currency symbols | currency
Unicode decimal digits | digit
Unicode scripts not mentioned ("others") | Zzzz (= Unknown script)

In addition, ISO **4-letter script codes** can be used. Codes for scripts that
do not have Unicode characters (according to the Unicode Script property values)
are ignored.

Limitations of ICU 4.8-52: (Except `Kore` is still not usable because it refers
to multiple scripts that do not sort primary-equal.)

* For Chinese, use script code `Hani`, *not* `Hans` or `Hant`.
* For Japanese, use both `Kana` and `Hani` (*not* `Hira`).
* For Korean, use both `Hang` and `Hani` (*not* `Kore`).

#### Semantics of a List of Reorder Codes

This section is relevant for both the `[reorder ...]` rule syntax and the
`Collator.setReorderCodes()` API.

For an introduction and examples see the section “Script Reordering” in the
[Collation Concepts chapter](../concepts.md).

On the API, the special groups are represented with `Collator.ReorderCode`s
(`UColReorderCode`) values rather than `UScript` (`UScriptCode`) values.

In ICU 4.8-54, not every script could be reordered independently. CLDR and ICU
supported reordering of groups of scripts, each of which started with one of the
[Recommended
Scripts](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). A
script that is not Recommended always moved together with the Recommended Script
that precedes it in DUCET order. (Hiragana sorts together with Katakana, Coptic
with Greek, etc.) ICU allowed any one script of a (Recommended Script +
DUCET-following) group in the `[reorder]` list, moving the whole set of scripts
together. However, it was strongly recommended that only Recommended Scripts be
used.

Beginning with ICU 55, scripts only reorder together if they are primary-equal,
for example Hiragana and Katakana.

Zyyy=Common and Zinh=Inherited cannot be reordered.

The special code Zzzz (= Unknown script = `UScript.UNKNOWN` =
`Collator.ReorderCodes.OTHERS` = "others") stands for any script that is not
explicitly mentioned in the list of reordering codes. If Zzzz is mentioned in
the list, then any groups and scripts mentioned later in the list will go at the
very end of the reordering, in the order given. If Zzzz is not mentioned, then
all scripts that are not explicitly listed follow at the end in DUCET order.

The special reorder code `Collator.ReorderCodes.NONE` (= `UScript.UNKNOWN`), when
used alone (same as `[reorder Zzzz]` or not specifying a `[reorder]` rule in a
tailoring), will remove any reordering for this collator. The result of setting
no reordering will be to use the DUCET/CLDR order.

On the API (not applicable to rule syntax), the special reorder code
`Collator.ReorderCodes.DEFAULT` (= `UScript.INHERITED`) will reset the reordering
for the collator to its default order. The default reordering may be the
DUCET/CLDR order or may be a reordering that was specified when this collator
was created from resource data or from rules. The DEFAULT code must be the sole
code supplied when it is used.

For details see the [section “Collation Reordering” in the LDML collation
spec](http://www.unicode.org/reports/tr35/tr35-collation.html#Script_Reordering).
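
The role of Zzzz in a list of reorder codes can be sketched with a toy model. The code below is plain Python, not ICU's implementation. `apply_reorder` is a hypothetical helper, and `SAMPLE_SCRIPTS` is an assumed, heavily abbreviated stand-in for the default script order. The special groups (space, punct, symbol, currency, digit) are deliberately left out, so the sketch covers scripts only.

```python
# Toy model only: NOT ICU's implementation. Scripts only; special groups
# (space, punct, symbol, currency, digit) are intentionally not modeled.
# SAMPLE_SCRIPTS is an assumed, abbreviated sample of the default order.
SAMPLE_SCRIPTS = ["Latn", "Grek", "Cyrl", "Arab", "Hani"]


def apply_reorder(default_order, reorder_codes):
    """Return the resulting script order for a list of reorder codes."""
    codes = list(reorder_codes)
    listed = [c for c in codes if c != "Zzzz"]
    # "Others": every script that is not explicitly listed, in default order.
    others = [c for c in default_order if c not in listed]
    if "Zzzz" in codes:
        i = codes.index("Zzzz")
        # Codes before Zzzz go first, codes after Zzzz go at the very end.
        return codes[:i] + others + codes[i + 1:]
    # Zzzz not mentioned: unlisted scripts follow at the end.
    return listed + others


greek_first = apply_reorder(SAMPLE_SCRIPTS, ["Grek"])
greek_last = apply_reorder(SAMPLE_SCRIPTS, ["Hani", "Zzzz", "Grek"])
no_reordering = apply_reorder(SAMPLE_SCRIPTS, ["Zzzz"])

print(greek_first)
print(greek_last)
print(no_reordering)
```

In this model `["Zzzz"]` alone reproduces the default order, which matches the statement above that `[reorder Zzzz]` is the same as specifying no reordering.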

### Advanced Syntactical Elements

Several other syntactical elements are needed in more specific situations.

#### Order before

- Syntax: `[before 1|2|3]`
- Example: `&[before 2]a<ā<á<ǎ<à`

Enables users to order characters **before** a given character. In UCA 3.0, the
example is equivalent to & ㍡<ā<á<ǎ<à (㍡= \\u3361, ideographic telegraph symbol
for hour nine) and makes accented 'a' letters sort before 'a'. Accents are often
used to indicate the intonations in Pinyin. In this case, the non-accented
letters sort after the accented letters.

#### Expansion

- Syntax: `/`
- Example: `æ/e`

Adds the collation element for 'e' to the collation element for æ.
After a reset, `&ae << æ` is equivalent to `&a << æ/e`. See the Expansion example
below.

#### Prefix processing

- Syntax: `|`
- Example: `a|b`

If 'b' is encountered and it follows 'a',
output the appropriate collation element. If 'b' follows any other letter,
output the normal collation element for 'b'.
The collation element for 'a' is not affected.

This element is used to speed up sorting under JIS X 4061. See the
Prefix example below.

#### Reset to top

- Syntax: `[top]`
- Example: `&[top] < a < b < c …`

**Deprecated, use indirect positioning instead**
(`&[last regular]`, see section below)
Reorders a set of characters 'above' the UCA. `[top]` is a virtual code point having the
biggest primary weight value that will ever be assigned in the UCA. Above top,
there is a large number of unassigned primary weights that can be used for a
'large' tailoring, such as the reordering of the CJK characters according to a
Far Eastern code page. The first difference after the top is always primary.

### Indirect Positioning of Collation Elements

Since ICU version 2.0, ICU allows for indirect positioning of collation elements
(CE). Similar to the reset anchor `top`, these reset anchors allow for positioning of the
tailoring relative to significant sections of the UCA table. You can use the
`[before]` reset option to position before these sections.

Name | Example CE value | Note
------------------------- | ----------------- | ------------
first tertiary ignorable | `[,,]` | Start of the UCA table. This value will never change unless CEs are extended with higher level values.
last tertiary ignorable | `[,,]` | This value will never change unless CEs are extended with higher level values.
first secondary ignorable | `[,, 05]` | Currently there are no secondary ignorables in the UCA table.
last secondary ignorable | `[,, 05]` | Currently there are no secondary ignorables in the UCA table.
first primary ignorable | `[, 87, 05]` | Mostly for non-spacing combining marks.
last primary ignorable | `[, E1 B1, 05]` | Currently this value points to a non-existing code point, used to facilitate sorting of compatibility characters.
first variable | `[05 07, 05, 05]` | The lowest CE that is not primary-ignorable. (see below)
last variable | `[17 9B, 05, 05]` | End of variable section.
first regular | `[1A 20, 05, 05]` | This is the first regular CE (not primary ignorable and not variable). The majority of code points have regular CEs.
last regular | `[78 AA B2, 05, 05]` | Use `&[last regular]` instead of `&[top]`. (see below)
first implicit | `[E0 03 03, 05, 05]` | Section of implicitly generated collation elements. (see below)
last implicit | `[E3 DC 70 C0, 05, 05]` | End of implicit section. This is the CE of the last unassigned code point (U+10FFFD). (see below)
first trailing | `[E5, 05, 05]` | Start of trailing section. (see below)
last trailing | `[FF FF, 05, 05]` | End of trailing collation elements section. This is the highest possible CE, and is the CE for U+FFFF. Not available for tailoring, see `[first trailing]`.

"first variable": The current code point is TAB=U+0009. This is the start of the variable section. "Variable" characters will be ignored on primary/secondary/tertiary levels when the "shifted" option is on.

Tailoring after "last regular" will effectively position characters
between regular code points and "implicit" CEs (the next section).
This should be used (only) for tailoring Han characters
which tends to affect thousands of characters.
The script reordering implementation assumes that CEs in this section
are for "Hani" script characters.

"Implicit" means that the UCA default ordering table (DUCET)
does not explicitly specify CEs for CJK ideographs and unassigned code points;
instead, their CEs are computed at runtime.

Beginning with ICU 53, tailoring to any unassigned code point,
including "last implicit", is not supported any more.

"trailing": Tailoring characters after `[first trailing]`
makes them sort after all other non-tailored code points except for U+FFFD and U+FFFF.

The "trailing" section is reserved for future use, such as for non starting Jamos. See
<http://www.unicode.org/reports/tr10/#Trailing_Weights>.
CLDR 1.9/ICU 4.6 and later map U+FFFF to the very end of the trailing section.
UCA 6.3/CLDR 24/ICU 52 and later map U+FFFD to just before U+FFFF.
U+FFFD..U+FFFF are not tailorable, and nothing can tailor to them.
<http://www.unicode.org/reports/tr35/tr35-collation.html#tailored_noncharacter_weights>

Before ICU 4.6, U+FFFF mapped to a completely ignorable CE, and `[last trailing]`
was the same as `[first trailing]`.

Not all of the indirect-positioning anchors are useful. Most of the 'first'
elements should be used with the `[before]` directive, in order to make sure
that your tailoring will sort before an interesting section.

### Complex Tailoring Examples

The following are several fragments of real tailorings, illustrating some of the
advanced syntactical elements:

#### Expansion Example:

**Swedish:**
```
&t<<<þ/h
&T<<<Þ/H
```

The letter 'þ' (THORN) is normally treated by UCA/root collation as a separate
letter that has primary-level sorting after 'z'. However, in Swedish and some
other Scandinavian languages, 'þ' and 'Þ' should be treated as just a
tertiary-level difference from the letters "th" and "TH" respectively. This is
an example of an expansion.

UCA | `&t<<<þ/h, &T<<<Þ/H`
--- | --------------------
az  | az
Az  | Az
tha | tha
Tha | þa
THa | Tha
thz | THa
za  | Þa
Za  | thz
zz  | þz
þa  | za
Þa  | Za
þz  | zz

#### Prefix Example:

Prefixes are used in Japanese tailorings to reduce the number of contractions. A
big number of contractions is a performance burden on the commonly-used base
characters, as their processing is much more complicated than the processing of
regular elements.

A prefix rule conditionally changes the CE of the character or string (e.g., ー)
after the | symbol; unlike a contraction, it does not affect the CE of the
preceding text (e.g., ァ). (By contrast, a contraction like ァー consumes both
characters and can assign them a CE or expansion unrelated to ァ's CE.) A prefix
rule is especially useful if the character or string (ー) after the | symbol
occurs significantly less often than the first character of the prefix (ァ).

```
&[before 3]ァ <<< ァ|ー = ァ|ー = ぁ|ー
```

This could have been written as a series of contractions followed by expansion:

```
&[before 3]ァー <<< ァー = ァー = ぁー
```

However, in that case ァ, ァ and ぁ would start contractions. Since the prolonged
sound mark (ー) occurs much less frequently than the other letters of Japanese
Katakana and Hiragana, it is much more prudent to put the extra processing on it
by using prefixes.

#### Reset example:

A "reset" always uses only the base character as the insertion point even if
there is an expansion.
So the following rule,

```
& J <<< K / B & K <<< M
```

is equivalent to

```
& J <<< K / B <<< M
```

which produces the following sort order:

"JA"

"MA"

"KA"

"KC"

"JC"

"MC"

> :point_right: **Note**: Assuming the letters "J", "K" and "M" have equal primary weights, the second
> letter contains the differences among these strings. However, the letter "K" is
> treated as if it always has a letter "B" following it while the letters "J" and
> "M" do not.
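
As a rough illustration of why this order results, the sketch below compares the six strings level by level (all primary weights first, then secondary, then tertiary), using the collation element values that this section gives for them. The class `CeOrderSketch` and its methods are hypothetical names, and the comparison is a simplified model of multi-level comparison rather than ICU's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CeOrderSketch {
    // Each collation element is {primary, secondary, tertiary}.
    // Values are the ones listed in this section; entries are deliberately unsorted.
    static final Map<String, int[][]> CES = new LinkedHashMap<>();
    static {
        CES.put("MC", new int[][] {{0x5C, 0x00, 0x03}, {0x54, 0x00, 0x01}});
        CES.put("KC", new int[][] {{0x5C, 0x00, 0x02}, {0x53, 0x00, 0x01}, {0x54, 0x00, 0x01}});
        CES.put("JA", new int[][] {{0x5C, 0x00, 0x01}, {0x52, 0x00, 0x01}});
        CES.put("JC", new int[][] {{0x5C, 0x00, 0x01}, {0x54, 0x00, 0x01}});
        CES.put("KA", new int[][] {{0x5C, 0x00, 0x02}, {0x53, 0x00, 0x01}, {0x52, 0x00, 0x01}});
        CES.put("MA", new int[][] {{0x5C, 0x00, 0x03}, {0x52, 0x00, 0x01}});
    }

    // Compare the non-zero weights of a single level from left to right.
    // If one sequence runs out first, it sorts before the longer one.
    static int compareLevel(int[][] a, int[][] b, int level) {
        int i = 0;
        int j = 0;
        while (true) {
            while (i < a.length && a[i][level] == 0) i++;
            while (j < b.length && b[j][level] == 0) j++;
            if (i == a.length || j == b.length) {
                return Boolean.compare(i < a.length, j < b.length);
            }
            if (a[i][level] != b[j][level]) {
                return Integer.compare(a[i][level], b[j][level]);
            }
            i++;
            j++;
        }
    }

    // A lower level is consulted only if all higher levels are equal.
    static int compare(String x, String y) {
        for (int level = 0; level < 3; level++) {
            int c = compareLevel(CES.get(x), CES.get(y), level);
            if (c != 0) return c;
        }
        return 0;
    }

    static List<String> sorted() {
        List<String> keys = new ArrayList<>(CES.keySet());
        keys.sort(CeOrderSketch::compare);
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(sorted()); // [JA, MA, KA, KC, JC, MC]
    }
}
```

"JA" and "MA" tie on the primary level and are separated only by the tertiary weight of the first element, while the extra element contributed by "K / B" puts "KA" and "KC" between the "…A" and "…C" pairs.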

The following is an example of collation elements for these strings resulting
from the specified rules:

Strings | Collation Elements |                |
------- | ------------------ | -------------- | ------
"JA"    | `[005C.00.01]`     | `[0052.00.01]` |
"MA"    | `[005C.00.03]`     | `[0052.00.01]` |
"KA"    | `[005C.00.02]`     | `[0053.00.01]` | `[0052.00.01]`
"KC"    | `[005C.00.02]`     | `[0053.00.01]` | `[0054.00.01]`
"JC"    | `[005C.00.01]`     | `[0054.00.01]` |
"MC"    | `[005C.00.03]`     | `[0054.00.01]` |

## Tailoring Issues

ICU uses canonical closure. This means that for each code point in Unicode, if
the canonically composed form of a tailored string produces different collation
elements than the canonically decomposed form, then the canonically composed
form is effectively added to the ordering. If 'a' is tailored, for example, all
of the accented 'a' characters are also tailored. Canonical closure allows
collators to process Unicode strings in the FCD form as well as in NFD. (Note:
Most but not all NFC strings are also in FCD. See
<http://www.unicode.org/notes/tn5/#FCD>)

However, *compatibility* equivalents are NOT automatically added. If the rule
"&b < a" is in tailoring, and the order of **ⓐ (circled a)** is important, it
needs to be tailored **explicitly**.

Redundant tailoring rules are removed, with later rules "winning". The strengths
around the removed rules are also fixed.

### Example:

The following table summarizes effects of different redundant rules.

       | Original                                                  | Equivalent
------ | --------------------------------------------------------- | ----------
1      | `& a < b < c < d` `& r < c`                               | `& a < b < d` `& r < c`
2      | `& a < b < c < d` `& c < m`                               | `& a < b < c < m < d`
3      | `& a < b < c < d` `& a < m`                               | `& a < m < b < c < d`
4      | `& a <<< b << c < d` `& a < m`                            | `& a <<< b << c < m < d`
5      | `& a < b < c < d` `& [before 1] c < m`                    | `& a < b < m < c < d`
6      | `& a < b <<< c << d <<< e` `& [before 3] e <<< x`         | `& a < b <<< c << d <<< x <<< e`
7      | `& a < b <<< c << d <<< e` `& [before 2] e <<< x`         | `& a < b <<< c <<< x << d <<< e`
8      | `& a < b <<< c << d <<< e` `& [before 1] e <<< x`         | `& a <<< x < b <<< c << d <<< e`
9      | `& a < b <<< c << d <<< e <<< f < g` `& [before 1] g < x` | `& a < b <<< c << d <<< e <<< f < x < g`

If two different reset lists tailor the same character, then it is removed from the first
one (see 1 in the table above).
If the second list resets to a character tailored in the first list, then the second
list is inserted in the first (see 2).
If both lists reset to the same character, then the same thing
happens (see 3).
Whenever such an insertion occurs, the second strength
"postpones" the position (see 4).

If there is a `[before N]` on the reset, then the reset character is
effectively replaced by the item that would be before it, either in a previous
tailoring (if the letter occurs in one - see 5) or in the UCA. The N determines
the 'distance' before, based on the strength of the difference (see 6-8).
However, this is subject to postponement (see 9), so be careful!

### Reset semantics

The reset semantic in ICU 1.8 and above is different from the previous ICU
releases. Prior to version 1.8, the reset relation modifier was applicable only
to the entry immediately following the reset entry. In version 1.8 and above, the
relation modifier applies to all entries that occur until the next reset or
primary relation.

For example,

```
&xyz << e <<< f
```

was equivalent to

```
&x << e/yz <<< f
```

prior to ICU version 1.8.

Starting with ICU version 1.8, the rule is equivalent to

```
&x << e/yz <<< f/yz
```

The new semantic produces more intuitive results, especially when the character
after the reset is decomposable. Since all rules are converted to NFD before
they are interpreted, this can result in contractions that the rule-writer might
not be aware of. Expansion propagates only until the next reset or primary
relation occurs.

For example, the following rule:

```
&ab = c <<< d << e <<< f < g <<< h
```

was equivalent to the following prior to ICU 1.8 and in Java:

```
&a = c/b <<< d << e <<< f < g <<< h
```

Starting with 1.8, it is equivalent to

```
&a = c / b <<< d / b << e / b <<< f / b < g <<< h
```

## Known Limitations

The following are known limitations of the ICU collation implementation. These
limitations are theoretical, since there are no known languages for
which they are an issue. For completeness, however, they should be
fixed in a future version after 1.8.1.
The examples given are designed for
simplicity in testing, and do not match any real languages.

### Expansion

The goal of expansion is to sort as if the expansion text were inserted right
after the character. For example, with the rule

```
&a <<< c / e
```

The text "...**c**..." should sort as if it were right after "...**ae**..." with
a tertiary difference. There are a few cases where this is not currently true.

#### Recursive Expansion

Given the rules

```
&a <<< c / e
&g <<< e / I
```

Expansion should sort the text "...**c**..." as if it were just after
"...**ae**...", and that should also sort as if it were just after
"...**agi**...". This requires that the compilation of expansions be recursive
(and check for loops as well!). ICU currently does not do this.
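
To make the "recursive, with a loop check" requirement concrete, here is a minimal sketch of the desired behavior that, per the text above, ICU does not implement. It is a toy model over plain strings, not collation elements: each tailored character maps to the text it should sort like, so `c` maps to `ae` and `e` maps to `gi` (a lowercase `i` is assumed here so that the result matches the "...agi..." in the prose). The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RecursiveExpansionSketch {
    // Replace every tailored character by its expansion text, recursively.
    static String expand(Map<Character, String> rules, String text) {
        return expand(rules, text, new HashSet<>());
    }

    private static String expand(Map<Character, String> rules, String text,
                                 Set<Character> inProgress) {
        StringBuilder out = new StringBuilder();
        for (char ch : text.toCharArray()) {
            String replacement = rules.get(ch);
            if (replacement == null) {
                // Untailored character: keep as is.
                out.append(ch);
            } else if (!inProgress.add(ch)) {
                // We are already expanding this character further up the call chain.
                throw new IllegalStateException("expansion loop through '" + ch + "'");
            } else {
                out.append(expand(rules, replacement, inProgress));
                inProgress.remove(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> rules = new HashMap<>();
        rules.put('c', "ae"); // models  &a <<< c / e
        rules.put('e', "gi"); // models  &g <<< e / i
        System.out.println(expand(rules, "c")); // agi
    }
}
```

A pair of rules that expand into each other (for example `c` to `ae` and `e` to `c`) makes `expand` throw instead of recursing forever, which is the loop check the paragraph alludes to.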

Rules         | Desired Order | Current Order
------------- | ------------- | -------------
`& a = b / c` | add           | b
`& d = c / e` | b             | add
              | adf           | adf

#### Contractions Spanning Expansions

ICU currently always pre-compiles the expansion into an internal format (a list
of one or more collation elements) when the rule is compiled. If there is a
contraction that spans the end of the expanded text and the start of the
original text, however, that contraction will not match. A test case that
illustrates this is:

Rules           | Desired Order | Current Order
--------------- | ------------- | -------------
`& a <<< c / e` | ad            | ad
`& g <<< eh`    | c             | c
                | af            | ch
                | g             | af
                | ch            | g
                | h             | h

Since the pre-compiled expansions are a huge performance gain, we will probably
keep the implementation the way it is, but in the future allow additional syntax
to indicate those few expansions that need to behave as if the text were
inserted because of the existence of another contraction. Note that such
expansions need to be recursively expanded (as in Recursive Expansion above), but
rather than at pre-compile time, these need to be done at runtime.

While it is possible to automatically detect these cases, it would be better to
allow explicit control in case spanning is not desired. An example of such
syntax might be something like:

```
&a <<< c // e
```

**Notes:** ICU does handle the case where there is a contraction that is
completely inside the expansion.

Suppose that someone had the rules:

```
&a = c / e
&x = ae
```

These do not cause **c** to sort as if it were **ae**, nor should they.

### Normalization

The Unicode Collation Algorithm specifies that all text sort as if it were first
normalized into NFD. For performance reasons, ICU collation data is
pre-processed so that there is no need to perform normalization on strings that
are in [FCD](http://www.unicode.org/notes/tn5/#FCD) and do not contain any composite
combining marks. Composite combining marks are: { U+0344, U+0F73, U+0F75, U+0F81
}
[`[[:^lccc=0:]&[:toNFD=/../:]]`](http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5Elccc%3D0%3A%5D%26%5B%3AtoNFD%3D%2F..%2F%3A%5D&abb=on&g=)
(These characters must be decomposed for discontiguous contractions to work
properly.
Use of these characters is discouraged by the Unicode Standard.) The
vast majority of strings are in this form.

#### Nulls in Contractions

Nulls should not be used in contractions that could invoke normalization.

Rules                | Desired Order | Current Order
-------------------- | ------------- | -------------
`& a <<< '\u0000'^`  | a             | '\\u0000'^
                     | '\\u0000'^    | a

#### Contractions Spanning Normalization

The following rule specifies that a grave accent followed by a **b** is a
contraction, and sorts as if it were an **e**.

```
& e <<< ` b
```

On this basis, "...àb..." should sort as if it were just after "...ae...".
Because of the preprocessing, however, the contraction will not match if this
text is represented with the pre-composed character à, but **will** match if
given the decomposed sequence **a + grave accent**. The same thing happens if
the contraction spans the start of a normalized sequence.
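
The difference between the two representations can be seen with the JDK's own `java.text.Normalizer`, independent of ICU. The sketch below assumes that the grave accent in the rule stands for the combining grave accent U+0300, and it models "the contraction matches" as a plain substring search, which is a simplification for illustration only. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class SpanningNormalizationSketch {
    // Assumed reading of the contraction "` b": combining grave accent (U+0300) followed by 'b'.
    static final String CONTRACTION = "\u0300b";

    // Looks for the contraction in the text exactly as given.
    static boolean matchesRaw(String text) {
        return text.contains(CONTRACTION);
    }

    // Looks for the contraction after converting the text to NFD.
    static boolean matchesAfterNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD).contains(CONTRACTION);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0b"; // U+00E0 (a with grave) followed by 'b'
        String decomposed = "a\u0300b"; // 'a', U+0300, 'b'

        System.out.println(matchesRaw(precomposed));      // false: the accent is hidden inside U+00E0
        System.out.println(matchesRaw(decomposed));       // true
        System.out.println(matchesAfterNfd(precomposed)); // true: NFD exposes the accent
    }
}
```

In the pre-composed text the accent only becomes visible once U+00E0 is decomposed, which is the step the pre-processed data is designed to skip for FCD input.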

Rules        | Desired Order | Current Order
------------ | ------------- | -------------
& e <<< \` b | à             | à
             | ad            | àb
             | àb            | ad
             | af            | af
             |               |
`& g <<< ca` | f             | cà
             | ca            | f
             | cà            | ca
             | h             | h

### Variable Top

ICU lets you set the top of the variable range. This can be done, for example,
to allow you to ignore just SPACES, and not punctuation.

#### Variable Top Exclusion

There is currently a limitation that causes variable top to (perhaps) exclude
more characters than it should. This happens if you not only set variable top,
but also tailor a number of characters around it with primary differences. The
exact number that you can tailor depends on the internal "gaps" between the
characters in the pre-compiled UCA table. Normally there is a gap of one. There
are larger gaps between scripts (such as between Latin and Greek), and after
certain other special characters. For example, if variable top is set to be at
SPACE ('\\u0020'), then it works correctly with up to 70 characters also
tailored after space. However, if variable top is set to be equal to HYPHEN
('\\u2010'), only one other value can be accommodated.

In the following, the goal is for x to be ignored and z not to be ignored.

Rules              | Desired Order SHIFTED = ON | Current Order
------------------ | -------------------------- | -------------
`& \u2010`         | -                          | -
`< x`              | z                          | z
`< [variable top]` | zb                         | zb
`< z`              | a                          | xb
                   | b                          | a
                   | -b                         | b
                   | xb                         | -b
                   | c                          | c

> :point_right: **Note**: With ICU 1.8.1, the
> user is advised not to tailor the variable top to customize more than two
> primary relations (for example, `"& x < y < [variable top]"`). Starting in ICU
> 2.0, setVariableTop() allows the user to set the variable top programmatically
> to a legal single character or a valid contracting sequence. In addition, the
> string that variable top is set to should not be treated as either inclusive or
> exclusive in the rules.

### Case Level/First/Second

In ICU, it is possible to override the tertiary settings programmatically. This
is used to change the default case behavior to be all upper first or all lower
first. It can also be used for a separate case level, or to ignore all other
tertiary differences (such as between circled and non-circled letters, or
between half-width and full-width katakana).
The case values are derived
directly from the Unicode character properties, and not set by the rules.

#### Mixed Case Contractions

There is currently a limitation that all contractions of multiple characters can
only have three special case values: upper, lower, and mixed. All mixed-case
contractions are grouped together, and are not affected by the upper first vs.
lower first flag.

Rules      | Desired Order UPPER_FIRST | Current Order
---------- | ------------------------- | -------------
`& c < ch` | C                         | c
`<<< cH`   | CH                        | CH
`<<< Ch`   | Ch                        | cH
`<<< CH`   | cH                        | Ch
           | ch                        | ch

## Building on Existing Locales

All of the collation rules are additive; that is, they override what any
previous rule expressed. That means that you can build on existing rules for
given locales. Here is an example of this, which fetches the rules for a
particular locale (Danish), then overrides some part (sorting '%' after 'm').
The syntax is Java, but C/C++ has similar features.

```java
ULocale myLocale = new ULocale("da");
try {

    RuleBasedCollator col = (RuleBasedCollator) Collator.getInstance(myLocale);
    String rules = col.getRules();
    String myRules = "& m < '%'";
    RuleBasedCollator col2 = new RuleBasedCollator(rules + myRules);

    // check the values

    List<String> expected = Arrays.asList("a;m;%;z;aa".split(";"));
    TreeSet<String> sorted = new TreeSet<String>(col2);
    sorted.addAll(expected);
    ArrayList<String> actual = new ArrayList<String>(sorted);
    assertEquals("Customized rules with %", expected, actual);

} catch (Exception e) {
    throw new IllegalArgumentException("Failed to create customized rules", e);
}
```

The root collator has an empty rules string (`getRules()` returns `""`): Any
collator's tailoring rules string defines how a collator *differs* from the root
collator, and the tailoring rules string was the input for building the
tailoring collator. By contrast, the root collator itself is built from a file
with explicit mappings (ICU4C source/data/unidata/FractionalUCA.txt)
from characters/contractions to collation elements.
This file represents the
[DUCET](http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table)
as [modified by
CLDR](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation).

There are "extended" versions of `getRules()` which, when called with
`delta=UCOL_FULL_RULES` (C/C++) or `fullrules=true` (Java), return "full rules"
which are a concatenation of the "UCA rules" and the collator's tailoring. The
"UCA rules" are published as UCA_Rules.txt in every [UCA
release](http://www.unicode.org/Public/UCA/).

* "UCA rules" is a historical misnomer. The UCA specifies an Algorithm which
  applies to all collators, and provides the DUCET as its Default table.
* ICU's root collator implements the CLDR-modified collation element table.
  The "UCA rules" returned from ICU functions are equivalently modified rules
  compared with those for the DUCET.

The "UCA rules" are an *approximation* of the root collator's sort order, but
there are some differences because not all of the details of the root collator
mappings can be expressed in rule syntax.
In particular, a collator built from
ICU4C source/data/unidata/UCARules.txt
has at least the following issues compared with the real root collator:

* inefficient (long) collation element weights
* CODAN (numeric collation) will not work (the 0 digit's primary weight is
  hardcoded, or specified in FractionalUCA.txt)
* script reordering will not work
* alternate=shifted will not work
* the sort order has some differences from the regular root collator,
  including additional tertiary differences

The "full rules" are almost never used, or useful, at runtime. They are included
in ICU for historical reasons and for UCA consistency tests. They might be
usable for emulating the CLDR/ICU sort order with a collation implementation not
based on CLDR/ICU.

Collation rule strings in general are not commonly used but are a significant
portion of the data size in ICU collation resource bundles, especially for CJK
languages. The rule strings can be omitted from those resource bundles by adding
the `--omitCollationRules` option to the relevant `genrb` invocations
(for ICU 53..63, in icu4c/source/data/Makefile.in)
or, since ICU 64, with a [data filter config file](../../icu_data/buildtool.md).
(See for example the relevant
[ICU integration test instructions](https://icu.unicode.org/processes/release/tasks/integration#TOC-Verify-that-ICU4C-tests-pass-without-collation-rule-strings).)

If the tailoring rules are needed but the 150kB or so of "UCA rules" are not,
then the line

```
UCARules:process(uca_rules){"../unidata/UCARules.txt"}
```

in
[source/data/coll/root.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/coll/root.txt)
can be commented out or deleted.

## Cautions

The following are not known rule limitations, but rather cautions.

### Resets

Since resets always work on the existing state, the user is required to make
sure that the rule entries are in the proper order.

Rules     | Order | Comment
--------- | ----- | -------
`& a < b` | a     | The rules mean: put **b** after **a**, then put **c** after **a** (inserting **before** the **b**).
`& a < c` | c     |
          | b     |

### Postpone Insertion

When using a reset to insert a value X with a certain strength difference after
a value Y, it is actually inserted just before the next item of the same
strength or higher following Y. Thus, the following are equivalent:

```
... m < a = c <<< d << e <<< f < g <<< h & a << x
... m < a = c <<< d << x << e <<< f < g <<< h
```

> :point_right: **Note**: This is different from the Java semantics.
> In Java, the value is inserted immediately after the reset character.

### Jamo Tailoring

Tailoring Jamo characters forces the code through a slow path,
which has a significant effect on performance.

### Compatibility Decompositions

When tailoring a letter, the customization affects all of its canonical
equivalents. That is, if a tailoring rule sorts **a** after **e**, for
example, then "à", "á", ... are also sorted after **e**. This is not true
for compatibility equivalents. If the desired sorting order is for a
**superscript-a** ("ª") to be after **e**, it is necessary to specify a rule
for that.

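
The distinction between canonical and compatibility equivalents can be inspected in the Unicode character data itself. The following is a minimal sketch that uses Python's standard `unicodedata` module rather than ICU; the helper name `decomposition_kind` is hypothetical and exists only for this illustration. It shows that "à" and "á" have canonical decompositions starting with "a", while "ª" has only a compatibility (`<super>`) mapping, which is consistent with "ª" needing its own rule.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Hypothetical helper: classify a character's decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as <super>; canonical ones do not.
    return "compatibility" if mapping.startswith("<") else "canonical"


for ch in ("\u00e0", "\u00e1", "\u00aa"):  # à, á, ª
    print(f"U+{ord(ch):04X}", decomposition_kind(ch), unicodedata.decomposition(ch))
# U+00E0 canonical 0061 0300
# U+00E1 canonical 0061 0301
# U+00AA compatibility <super> 0061
```

Canonical normalization (NFD) leaves "ª" unchanged, whereas compatibility normalization (NFKD) maps it to "a".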

### Case Differences

Similarly, when tailoring **a** to be sorted after **e**, a specific rule is
required if **A** should be sorted after **e** as well.

### Automatic Expansions

ICU will automatically form expansions whenever a reset is to a multi-character
value that is not a contraction. For example, `& ab <<< c` is equivalent to
`& a <<< c / b`. The user may be unaware of this happening, since it may not be
obvious that the reset is to a multi-character value. For example, `& à <<< d` is
equivalent to `` & a <<< d / ` ``, where the final character stands for the
combining grave accent (U+0300) from the canonical decomposition of "à".
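
As a rough illustration of the rewriting described above, the following Python sketch turns a reset on a multi-code-point value into a reset plus an expansion. It is not ICU code: `sketch_expansion` is a hypothetical helper, and it assumes the reset always splits after the first code point of the canonically decomposed (NFD) value. It ignores contractions entirely, so it is a simplification of the behavior described here.

```python
import unicodedata


def sketch_expansion(reset: str, relation: str, target: str) -> str:
    """Hypothetical helper, not an ICU API.

    Rewrites a reset on a multi-code-point value as a reset on its first
    code point plus an expansion ("/") of the remaining code points.
    Assumption: no contractions are involved.
    """
    decomposed = unicodedata.normalize("NFD", reset)
    head, tail = decomposed[0], decomposed[1:]
    if not tail:
        # Single code point: nothing to expand.
        return f"& {head} {relation} {target}"
    return f"& {head} {relation} {target} / {tail}"


print(sketch_expansion("ab", "<<<", "c"))  # & a <<< c / b

# A reset on precomposed "à" (U+00E0) is a hidden multi-character reset:
rule = sketch_expansion("\u00e0", "<<<", "d")
print(rule == "& a <<< d / \u0300")  # True
```

Checking the NFD form of a reset value in this way is one option for spotting resets that look like a single character but expand into several code points.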