---
layout: default
title: Concepts
nav_order: 1
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation Concepts
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

The previous section demonstrated many of the requirements imposed on string
comparison routines that try to correctly collate strings according to
conventions of more than a hundred different languages, written in many
different scripts. This section describes the principles and architecture behind
the ICU Collation Service.

## Sortkeys vs Comparison

Sort keys are most useful in databases, where the overhead of calling a function
for each comparison is very large.

Generating a sort key from a Collator is many times more expensive than doing a
compare with the Collator (for common use cases). That's if the two functions
are called from Java or C.
So for those languages, unless there is a very large
number of comparisons, it is better to call the compare function.

Here is an example, with a little back-of-the-envelope calculation. Let's
suppose that with a given language on a given platform, the compare performance
(CP) is 100 times faster than sort key performance (SP), and that you are doing a
binary search of a list with 1,000 elements. The binary comparison performance
is BP. Treating CP, SP and BP as the cost of one call each, we'd do about 10
comparisons, getting:

compare: 10 \* CP

sortkey: 1 \* SP + 10 \* BP

Even if BP is free, compare would be better. One has to get up to where log2(n)
= 100 before they break even.

But even this calculation is only a rough guide. First, the binary comparison is
not completely free. Secondly, the performance of the compare function varies
radically with the source data. We optimized for maximizing performance of
collation in sorting and binary search, so comparing strings that are "close" is
optimized to be much faster than comparing strings that are "far away". That
optimization is important because normal sort/lookup operations compare close
strings far more often -- think of binary search, where the last few comparisons
are always with the closest strings. So even the above calculation is not very
accurate.
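
The back-of-the-envelope estimate above can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic: the function names
`compare_cost` and `sortkey_cost` are made up for illustration, and the cost
units (CP = 1, SP = 100, BP = 0) are the assumed values from the example, not
measurements of ICU.

```python
import math


def compare_cost(n, cp):
    """Cost of a binary search over n items using collator compares."""
    return math.ceil(math.log2(n)) * cp


def sortkey_cost(n, sp, bp):
    """Cost of one sort key generation plus binary comparisons of keys."""
    return sp + math.ceil(math.log2(n)) * bp


# Assumed relative costs from the example: a sort key costs 100 compares,
# and the binary comparison of two sort keys is treated as free.
CP, SP, BP = 1.0, 100.0, 0.0

print(compare_cost(1000, CP))      # 10.0
print(sortkey_cost(1000, SP, BP))  # 100.0

# With BP = 0 the two costs are equal only when log2(n) reaches SP / CP.
print(SP / CP)                     # 100.0
```

As the text notes, real compare costs depend heavily on how close the strings
are, so numbers like these are a rough guide at best.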

## Comparison Levels

In general, when comparing and sorting objects, some properties can take
precedence over others. For example, in geometry, you might consider first the
number of sides a shape has, followed by the number of sides of equal length.
This causes triangles to be sorted together, then rectangles, then pentagons,
etc. Within each category, the shapes would be ordered according to whether they
had 0, 2, 3 or more sides of the same length. However, this is not the only way
the shapes can be sorted. For example, it might be preferable to sort shapes by
color first, so that all red shapes are grouped together, then blue, etc.
Another approach would be to sort the shapes by the amount of area they enclose.

Similarly, character strings have properties, some of which can take precedence
over others. There is more than one way to prioritize the properties.

For example, a common approach is to distinguish characters first by their
unadorned base letter (for example, without accents, vowels or tone marks), then
by accents, and then by the case of the letter (upper vs. lower). Ideographic
characters might be sorted by their component radicals and then by the number of
strokes it takes to draw the character.
An alternative ordering would be to sort these characters by strokes first and
then by their radicals.
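
The "base letter, then accents, then case" prioritization can be sketched with
a toy sort key in Python. This is not how ICU computes collation weights; it
is a minimal illustration that assumes accents are Unicode combining marks and
uses the hypothetical helper name `toy_sort_key`.

```python
import unicodedata


def toy_sort_key(text):
    """Build a three-part key: base letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the accent to the base letter that precedes it.
            if secondary:
                secondary[-1] += ch
            continue
        primary.append(ch.lower())
        secondary.append("")
        tertiary.append(1 if ch.isupper() else 0)
    # Tuples compare part by part, so a later part only matters
    # when all earlier parts are equal.
    return (primary, secondary, tertiary)


words = ["at", "\u00e0s", "As", "as"]
print(sorted(words, key=toy_sort_key))  # ['as', 'As', 'às', 'at']
```

Because the key is a tuple, the accent difference in "às" is ignored as soon as
the base letters differ ("as" vs. "at"), and the case difference in "As" is
only consulted when base letters and accents are the same.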

The ICU Collation Service supports many levels of comparison (named "Levels",
but also known as "Strengths"). Having these categories enables ICU to sort
strings precisely according to local conventions. In addition, because the
levels can be employed selectively, searching for a string in text can be
performed with various matching conditions.

Performance optimizations have been made for ICU collation with the default
level settings. Performance-specific impacts are discussed in the Performance
section below.

Following is a list of the names for each level and an example usage:

1. Primary Level: Typically, this is used to denote differences between base
   characters (for example, "a" < "b"). It is the strongest difference. For
   example, dictionaries are divided into different sections by base character.
   This is also called the level-1 strength.

2. Secondary Level: Accents in the characters are considered secondary
   differences (for example, "as" < "às" < "at"). Other differences between
   letters can also be considered secondary differences, depending on the
   language. A secondary difference is ignored when there is a primary
   difference anywhere in the strings. This is also called the level-2
   strength.
   Note: In some languages (such as Danish), certain accented letters are
   considered to be separate base characters. In most languages, however, an
   accented letter only has a secondary difference from the unaccented version
   of that letter.

3. Tertiary Level: Upper and lower case differences in characters are
   distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In
   addition, a variant of a letter differs from the base form on the tertiary
   level (such as "A" and "Ⓐ"). Another example is the difference between large
   and small Kana. A tertiary difference is ignored when there is a primary or
   secondary difference anywhere in the strings. This is also called the
   level-3 strength.

4. Quaternary Level: When punctuation is ignored (see
   [Ignoring Punctuation](#ignoring-punctuation)) at levels 1-3, an additional
   level can be used to distinguish words with
   and without punctuation (for example, "ab" < "a-b" < "aB"). This difference
   is ignored when there is a primary, secondary or tertiary difference. This
   is also known as the level-4 strength. The quaternary level should only be
   used if ignoring punctuation is required or when processing Japanese text
   (see [Hiragana Processing](#hiragana-processing)).

5. Identical Level: When all other levels are equal, the identical level is
   used as a tiebreaker.
   The Unicode code point values of the NFD form of each
   string are compared at this level, just in case there is no difference at
   levels 1-4. For example, Hebrew cantillation marks are only distinguished
   at this level. This level should be used sparingly, because strings that
   differ only in their code point values are extremely rare.
   Using this level substantially decreases the performance for
   both incremental comparison and sort key generation (as well as increasing
   the sort key length). It is also known as the level-5 strength.

## Backward Secondary Sorting

Some languages require words to be ordered on the secondary level according to
the *last* accent difference, as opposed to the *first* accent difference. This
was previously the default for all French locales, based on some French
dictionary ordering traditions, but is currently only applicable to Canadian
French (locale **fr_CA**), for conformance with the [Canadian sorting
standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
ordering is only noticeable for a small number of pairs of real words. For more
information see [UCA: Contextual
Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).
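
A toy model of forward versus backward secondary ordering is sketched below in
Python. It assumes accents are combining marks and compares them by code point,
which is a simplification of real collation weights; the helper names
`split_levels`, `forward_key` and `backward_key` are hypothetical.

```python
import unicodedata


def split_levels(word):
    """Split a word into base letters and the accents attached to each."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += ch
        else:
            bases.append(ch)
            accents.append("")
    return bases, accents


def forward_key(word):
    """Secondary level decided by the first accent difference."""
    bases, accents = split_levels(word)
    return (bases, accents)


def backward_key(word):
    """Secondary level decided by the last accent difference."""
    bases, accents = split_levels(word)
    return (bases, accents[::-1])


words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
print(sorted(words, key=forward_key))   # ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=backward_key))  # ['cote', 'côte', 'coté', 'côté']
```

Reversing only the accent list leaves the primary (base letter) comparison
untouched, which is why the two orders differ only among words with the same
base letters.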

Example:

Forward secondary | Backward secondary
----------------- | ------------------
cote              | cote
coté              | côte
côte              | coté
côté              | côté

## Contractions

A contraction is a sequence consisting of two or more letters. It is considered
a single letter in sorting.

For example, in the traditional Spanish sorting order, "ch" is considered a
single letter. All words that begin with "ch" sort after all other words
beginning with "c", but before words starting with "d".

Other examples of contractions are "ch" in Czech, which sorts after "h", and
"lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
respectively.

Example:

Order without contraction | Order with contraction "lj" sorting after letter "l"
------------------------- | ----------------------------------------------------
la                        | la
li                        | li
lj                        | lk
lja                       | lz
ljz                       | lj
lk                        | lja
lz                        | ljz
ma                        | ma

Contracting sequences such as the above are not very common in most languages.
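
The "lj" table can be reproduced with a small Python sketch that treats a
contraction as a single collation unit. This is an illustrative toy, not ICU's
data structure: the alphabet is plain lowercase ASCII plus "lj", and `tokenize`
and `make_key` are hypothetical helper names.

```python
import string


def tokenize(word, contractions):
    """Split a word into units, preferring the longest contraction."""
    units, i = [], 0
    while i < len(word):
        for seq in contractions:
            if word.startswith(seq, i):
                units.append(seq)
                i += len(seq)
                break
        else:
            units.append(word[i])
            i += 1
    return units


def make_key(alphabet):
    """Return a sort key function for an ordered list of units."""
    rank = {unit: n for n, unit in enumerate(alphabet)}
    contractions = sorted((u for u in alphabet if len(u) > 1),
                          key=len, reverse=True)

    def key(word):
        return [rank[u] for u in tokenize(word, contractions)]

    return key


plain = list(string.ascii_lowercase)
# Insert "lj" as its own letter immediately after "l".
after_l = plain.index("l") + 1
with_lj = plain[:after_l] + ["lj"] + plain[after_l:]

words = ["ma", "lz", "lk", "ljz", "lja", "lj", "li", "la"]
print(sorted(words, key=make_key(plain)))
# ['la', 'li', 'lj', 'lja', 'ljz', 'lk', 'lz', 'ma']
print(sorted(words, key=make_key(with_lj)))
# ['la', 'li', 'lk', 'lz', 'lj', 'lja', 'ljz', 'ma']
```

The sketch only handles characters that are in the toy alphabet; anything else
would raise a `KeyError`.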

> :point_right: **Note** Since ICU 2.2, and as required by the UCA,
> if a completely ignorable code point
> appears in text in the middle of a contraction, it will not break the contraction.
> For example, in Czech sorting, cU+0000h will sort as if it were ch.

## Expansions

If a letter sorts as if it were a sequence of more than one letter, it is called
an expansion.

For example, in German phonebook sorting (de@collation=phonebook or BCP 47
de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae".
All words starting with "ä" will sort between words starting with "ad" and words
starting with "af".

In the case of Unicode encoding, characters can often be represented either as
pre-composed characters or in decomposed form. For example, the letter "à" can
be represented in its decomposed (a+\`) and pre-composed (à) form. Most
applications do not want to distinguish text by the way it is encoded. A search
for "à" should find all instances of the letter, regardless of whether the
instance is in pre-composed or decomposed form. Therefore, either form of the
letter must result in the same sort ordering. The architecture of the ICU
Collation Service supports this.
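
The German phonebook example can be imitated with a toy expansion table in
Python. This sketch only covers the "ä" to "ae" mapping mentioned above, folds
case for simplicity, and uses the hypothetical names `EXPANSIONS` and
`phonebook_key`; it is not the actual phonebook tailoring.

```python
import unicodedata

# Toy expansion table: one letter sorts as a sequence of two letters.
EXPANSIONS = {"\u00e4": "ae"}  # ä -> ae


def phonebook_key(word):
    """Expand mapped letters before comparing (case is folded away)."""
    text = unicodedata.normalize("NFC", word).lower()
    return "".join(EXPANSIONS.get(ch, ch) for ch in text)


words = ["Affe", "\u00c4hre", "Adler"]
print(sorted(words, key=phonebook_key))  # ['Adler', 'Ähre', 'Affe']
print(sorted(words))                     # ['Adler', 'Affe', 'Ähre']
```

Normalizing to NFC first means a decomposed "ä" (a followed by a combining
diaeresis) hits the same table entry as the pre-composed letter.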

## Contractions Producing Expansions

It is possible to have contractions that produce expansions.

One example occurs in Japanese, where the vowel with a prolonged sound mark is
treated as equivalent to the long vowel version:

カアー<<< カイー and\
キイー<<< キイー

> :point_right: **Note** Since ICU 2.0 Japanese tailoring uses
> [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
> instead of contractions producing expansions.

## Normalization

In the section on expansions, we discussed that text in Unicode can often be
represented in either pre-composed or decomposed forms. There are other types of
equivalences possible with Unicode, including Canonical and Compatibility. The
process of
Normalization ensures that text is written in a predictable way so that searches
are not made unnecessarily complicated by having to match on equivalences. Not
all text is normalized, however, so it is useful to have a collation service
that can address text that is not normalized, but do so with efficiency.

The ICU Collation Service handles un-normalized text properly, producing the
same results as if the text were normalized.

In practice, most data that is encountered is in normalized or semi-normalized
form already. The ICU Collation Service is designed so that it can process a
wide range of normalized or un-normalized text without a need for normalization
processing. When a case is encountered that requires normalization, the ICU
Collation Service drops into code specific to this purpose. This maximizes
performance for the majority of text that does not require normalization.

In addition, if the text is known with certainty not to contain un-normalized
text, then even the overhead of checking for normalization can be eliminated.
The ICU Collation Service has the ability to turn Normalization Checking either
on or off. If Normalization Checking is turned off, it is the user's
responsibility to ensure that all text is already in the appropriate form. This
is true in a great majority of the world's languages, so normalization checking is
turned off by default for most locales.

If the text requires normalization processing, Normalization Checking should be
on. Any language that uses multiple combining characters such as Arabic, ancient
Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking
to be on, or the text to go through a normalization process before collation.
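
The equivalences that normalization resolves can be seen with Python's
standard `unicodedata` module. This is only an illustration of canonical
equivalence and of a "check first, normalize only if needed" pattern; it is not
ICU's internal check, and `needs_normalization` is a hypothetical helper name.
`unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

precomposed = "\u00e0"   # à as a single code point
decomposed = "a\u0300"   # a followed by a combining grave accent

# The raw strings differ, but their NFD forms are identical.
print(precomposed == decomposed)                    # False
print(unicodedata.normalize("NFD", precomposed)
      == unicodedata.normalize("NFD", decomposed))  # True

# With several combining marks, normalization also fixes their order:
# the dot below (U+0323) is placed before the acute accent (U+0301).
print(unicodedata.normalize("NFD", "a\u0301\u0323") == "a\u0323\u0301")  # True


def needs_normalization(text):
    """Cheap pre-check so already-normalized text skips the full pass."""
    return not unicodedata.is_normalized("NFD", text)


print(needs_normalization(decomposed))   # False
print(needs_normalization(precomposed))  # True
```

The last two lines mirror the trade-off described above: text that is already
in the expected form only pays for the check, not for normalization itself.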

For more information about Normalization related reordering please see
[Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
[UAX #15](http://www.unicode.org/reports/tr15/).

> :point_right: **Note** ICU supports two modes of normalization: on and off.
> The java.text.\* classes offer compatibility decomposition mode, which is not supported in ICU.

## Ignoring Punctuation

In some cases, punctuation can be ignored while searching or sorting data. For
example, this enables a search for "biweekly" to also return instances of
"bi-weekly". In other cases, it is desirable for punctuated text to be
distinguished from text without punctuation, but to have the text sort close
together.

These two behaviors can be accomplished if there is a way for a character to be
ignored on all levels except for the quaternary level. If this is the case, then
two strings which compare as identical on the first three levels (base letter,
accents, and case) are then distinguished at the fourth level based on their
punctuation (if any). If the comparison function ignores differences at the
fourth level, then strings that differ by punctuation only are compared as
equal.

The following table shows the results of sorting a list of terms in 3 different
ways.
In the first column, punctuation characters (space " ", and hyphen "-")
are not ignored (" " < "-" < "b"). In the second column, punctuation characters
are ignored in the first 3 levels and compared only in the fourth level. In the
third column, punctuation characters are ignored in the first 3 levels and the
fourth level is not considered. In the last column, punctuated terms are
equivalent to the identical terms without punctuation.

For more options and details see the [“Ignore Punctuation”
Options](customization/ignorepunct.md) page.

Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
------------- | --------------------------------- | -------------------------------
black bird    | black bird                        | **black bird**
black Bird    | black-bird                        | **black-bird**
black birds   | blackbird                         | **blackbird**
black-bird    | black Bird                        | black Bird
black-Bird    | black-Bird                        | black-Bird
black-birds   | blackBird                         | blackBird
blackbird     | black birds                       | black birds
blackBird     | black-birds                       | black-birds
blackbirds    | blackbirds                        | blackbirds

> :point_right: **Note** The strings with the same font format in the last column are
> compared as equal by ICU Collator.\
> Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that
> follow shifted code points will be completely ignored.
> This means that an accent
> following a space will compare as if it was a space alone.

## Case Ordering

The tertiary level is used to distinguish text by case, by small versus large
Kana, and other letter variants as noted above.

Some applications prefer to emphasize case differences so that words starting
with the same case sort together. Some Japanese applications require the
difference between small and large Kana be emphasized over other tertiary
differences.

The UCA does not provide means to separate out either case or Kana differences
from the remaining tertiary differences. However, the ICU Collation Service has
two options that help customize case and/or Kana differences. Both options
are turned off by default.

### CaseFirst

The Case-first option makes case the most significant part of the tertiary
level. Primary and secondary levels are unaffected. With this option, words
starting with the same case sort together. The Case-first option can be set to
make either lowercase sort before
uppercase or uppercase sort before lowercase.

Note: The case-first option does not constitute a separate level; it is simply a
reordering of the tertiary level.

ICU makes use of the following three case categories for sorting:

1. uppercase: "ABC"

2. mixed case: "Abc", "aBc"

3. normal (lowercase or no case): "abc", "123"

Mixed case is always sorted between uppercase and normal case when the
"case-first" option is set.

### CaseLevel

The Case Level option makes a separate level for case differences. This is an
extra level positioned between secondary and tertiary. The case level is used in
Japanese to make the difference between small and large Kana more important than
the other tertiary differences. It also can be used to ignore other tertiary
differences, or even secondary differences. This is especially useful in
matching. For example, if the strength is set to primary only (level-1) and the
case level is turned on, the comparison ignores accents and tertiary differences
except for case. The contents of the case level are affected by the case-first
option.

The case level is independent from the strength of comparison. It is possible to
have a collator set to primary strength with the case level turned on. This
provides for comparison that takes into account the case differences, while at
the same time ignoring accents and tertiary differences other than case.
This may be used in searching.

Example:

**Case-first off, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Lowercase-first, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
ähnlich\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit

**Uppercase-first, Case level off**

Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit\
apple\
ⓐⓟⓟⓛⓔ\
ähnlich

**Lowercase-first, Case level on**

apple\
Abernathy\
ⓐⓟⓟⓛⓔ\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Uppercase-first, Case level on**

Abernathy\
apple\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ⓐⓟⓟⓛⓔ\
Ähnlichkeit\
ähnlich

## Script Reordering

Script reordering allows scripts and some other groups of characters to be moved
relative to each other. This reordering is done on top of the DUCET/CLDR
standard collation order. Reordering can specify groups to be placed at the
start and/or the end of the collation order.

By default, reordering codes specified for the start of the order are placed in
the order given after several special non-script blocks. These special groups of
characters are space, punctuation, symbol, currency, and digit. Script groups
can be intermingled with these special non-script groups if those special groups
are explicitly specified in the reordering.

The special code `others` stands for any script that is not explicitly mentioned
in the list. Anything that is after `others` will go at the very end of the list
in the order given. For example, `[Grek, others, Latn]` will result in an
ordering that puts all scripts other than Greek and Latin between them.

### Examples:

Note: All examples below use the string equivalents for the scripts and reorder
codes that would be used in collator rules. The script and reorder code
constants that would be used in API calls will be different.

**Example 1:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

**Example 2:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

followed by:\
set reorder code - `[Hani]`\
result - `[space, punctuation, symbol, currency, digit, Hani, others]`

That is, setting a reordering always modifies
the DUCET/CLDR order, replacing whatever was previously set, rather than adding
on to it. In order to cumulatively modify an ordering, you have to retrieve the
existing ordering, modify it, and then set it.
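
The way a list of reorder codes turns into a resulting group order can be
modeled with a short Python sketch. This is a toy reading of the rules and
examples in this section, not ICU's implementation or API: it works on the
string codes used in the examples, does not model `NONE`, `DEFAULT`, the empty
list, or error cases, and `resolve_reordering` is a hypothetical name.

```python
# The special non-script groups, in their default order.
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]


def resolve_reordering(codes):
    """Return the effective group order for a list of reorder codes."""
    if len(set(codes)) != len(codes):
        raise ValueError("a reorder code may appear only once")
    # Special groups that are not mentioned keep their place at the start.
    result = [g for g in SPECIAL_GROUPS if g not in codes]
    # The requested codes follow in the order given.
    result += list(codes)
    # Unmentioned scripts go last unless "others" was placed explicitly.
    if "others" not in codes:
        result.append("others")
    return result


print(resolve_reordering(["Grek"]))
# ['space', 'punctuation', 'symbol', 'currency', 'digit', 'Grek', 'others']
print(resolve_reordering(["space", "Grek", "punctuation"]))
# ['symbol', 'currency', 'digit', 'space', 'Grek', 'punctuation', 'others']

# Setting replaces the previous reordering, so a cumulative change
# means extending the existing code list and setting it again.
current = ["Grek"]
current = current + ["Hani"]
print(resolve_reordering(current))
# ['space', 'punctuation', 'symbol', 'currency', 'digit', 'Grek', 'Hani', 'others']
```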

**Example 3:**\
set reorder code - `[others, digit]`\
result - `[space, punctuation, symbol, currency, others, digit]`

**Example 4:**\
set reorder code - `[space, Grek, punctuation]`\
result - `[symbol, currency, digit, space, Grek, punctuation, others]`

**Example 5:**\
set reorder code - `[Grek, others, Hani]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`

**Example 6:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[NONE]`\
result - DUCET/CLDR

**Example 7:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[DEFAULT]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 8:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code -
`[]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 9:**\
set reorder code - `[Hebr, Phnx]`\
result - error

Beginning with ICU 55, scripts only reorder together if they are primary-equal,
for example Hiragana and Katakana.

ICU 4.8-54:

*   Scripts were reordered in groups, each normally starting with a [Recommended
    Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
*   Reorder codes moved as a group (were “equivalent”) if their scripts shared a
    primary-weight lead byte.
*   For example, Hebr and Phnx were “equivalent” reordering codes and were
    reordered together. Their order relative to each other could not be changed.
*   Only one code from any group could be reordered, not several codes from the
    same group.

## Sorting of Japanese Text (JIS X 4061)

The Japanese standard JIS X 4061 requires two changes to the collation procedures:
special processing of Hiragana characters and (for performance reasons) prefix
analysis of text.

### Hiragana Processing

The JIS X 4061 standard requires more levels than provided by the UCA.
To offer a
conformant sorting order, ICU uses the quaternary level to distinguish between
Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana
symbols on the quaternary level, thus causing Hiragana sequences to sort before
the corresponding Katakana sequences.

### Prefix Analysis

Another characteristic of sorting according to JIS X 4061 is a large number
of contractions followed by expansions (see
[Contractions Producing Expansions](#contractions-producing-expansions)).
This causes all the Hiragana and Katakana code points to be treated as
contractions, which reduces performance. The solution we adopted introduces the
prefix concept, which allows us to improve the performance of Japanese sorting.
More about this can be found in the [customization
chapter](customization/index.md).

## Thai/Lao reordering

The UCA requires that certain Thai and Lao prevowels be reordered with a code point
following them. This option is always on in the ICU implementation, as
prescribed by the UCA.

This rule takes effect when:

1. A Thai vowel of the range \\U0E40-\\U0E44 precedes a Thai consonant of the
   range \\U0E01-\\U0E2E
   or

2.
   A Lao vowel of the range \\U0EC0-\\U0EC4 precedes a Lao consonant of the
   range \\U0E81-\\U0EAE.

In these cases the vowel is placed after the consonant for collation purposes.

> :point_right: **Note** There is a difference between the java.text.\* classes and ICU in regard to Thai
> reordering. The java.text.\* classes allow tailorings to turn off reordering by
> using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai
> prevowels.

## Space Padding

In many database products, fields are padded with null. To get correct results,
the input to a Collator should omit any superfluous trailing padding spaces. The
problem arises with contractions, expansions, or normalization. Suppose that
there are two fields, one containing "aed" and the other containing "äd". German
phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will
compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" <
"aed". But if both fields are padded with spaces to a length of 4, then this
will reverse the order, since the padded "äd" will compare as if it were one
character longer.
In other words, when you start with strings 1 and 2:

1  | a  | e  | d         | \<space\>
-- | -- | -- | --------- | ---------
2  | ä  | d  | \<space\> | \<space\>

they end up being compared on a primary level as if they were 1' and 2':

1' | a  | e  | d  | \<space\> |
-- | -- | -- | -- | --------- | ---------
2' | a  | e  | d  | \<space\> | \<space\>

Since 2' has an extra character (the extra space), it counts as having a primary
difference when it shouldn't. The correct result occurs when the trailing
padding spaces are removed, as in 1" and 2":

1" | a  | e  | d
-- | -- | -- | --
2" | a  | e  | d

## Collator naming scheme

***Starting with ICU 54, the following naming scheme and its API functions are deprecated.***
Use `ucol_open()` with language tag collation keywords instead
(see [Collation API Details](api.md)). For example, use
`ucol_open("de-u-co-phonebk-ka-shifted", &errorCode)` for German phonebook order
with "ignore punctuation" mode.

When collating or matching text, a number of attributes can be used to affect
the desired result. The following describes the attributes, their values, their
effects, their normal usage, and the string comparison performance and sort key
length implications.
It also includes single-letter abbreviations for both the
attributes and their values. These abbreviations allow a 'short-form'
specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which
can be used to specify that the desired options are: UCA version 4.0.0; ignore
spaces, punctuation and symbols; use Swedish linguistic conventions; compare
case-insensitively.

A number of attribute values are common across different attributes; these
include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
otherwise stated, the examples use the UCA alone with default settings.

> :point_right: **Note** In order to achieve uniqueness, a collator name always
> has the attribute abbreviations sorted.

### Main References

1. For a full list of supported locales in ICU, see [Locale
   Explorer](https://icu4c-demos.unicode.org/icu-bin/locexp), which also contains
   an on-line demo showing sorting for each locale. The demo allows you to try
   different attribute values, to see how they affect sorting.

2. To see tabular results for the UCA table itself, see the [Unicode Collation
   Charts](http://www.unicode.org/charts/collation/).

3. For the UCA specification, see [UTS #10: Unicode Collation
   Algorithm](http://www.unicode.org/reports/tr10/).

4.
   For more detail on the precise effects of these options, see [Collation
   Customization](customization/index.md).

#### Collator Naming Attributes

Attribute              | Abbreviation | Possible Values
---------------------- | ------------ | ---------------
Locale                 | L            | \<language\>
Script                 | Z            | \<script\>
Region                 | R            | \<region\>
Variant                | V            | \<variant\>
Keyword                | K            | \<keyword\>
                       |              |
Strength               | S            | 1, 2, 3, 4, I, D
Case_Level             | E            | X, O, D
Case_First             | C            | X, L, U, D
Alternate              | A            | N, S, D
Variable_Top           | T            | \<hex digits\>
Normalization Checking | N            | X, O, D
French                 | F            | X, O, D
Hiragana               | H            | X, O, D

#### Collator Naming Attribute Descriptions

The **Locale** attribute is typically the most
important attribute for correct sorting and matching, according to the user
expectations in different countries and regions. The default UCA ordering will
only sort a few languages such as Dutch and Portuguese correctly ("correctly"
meaning according to the normal expectations for users of the languages).
Otherwise, you need to supply the locale to UCA in order to properly collate
text for a given language.
Thus a locale needs to be supplied so as to choose a
collator that is correctly **tailored** for that locale. The choice of a locale
will automatically preset the values for all of the attributes to something that
is reasonable for that locale. Thus most of the time the other attributes do not
need to be explicitly set. In some cases, the choice of locale will make a
difference in string comparison performance and/or sort key length.

In short attribute names,
`<language>_<script>_<region>_<variant>@collation=<keyword>` is
represented by: `L<language>_Z<script>_R<region>_V<variant>_K<keyword>`. Not
all the elements are required. Valid values for the locale elements are the
generally valid values for RFC 3066 locale naming.

**Example:**\
**Locale="sv" (Swedish)** "Kypper" < "Köpfe"\
**Locale="de" (German)** "Köpfe" < "Kypper"

The **Strength** attribute determines whether accents or
case are taken into account when collating or matching text. (In writing
systems without case or accents, it controls similarly important features.) The
default strength setting usually does not need to be changed for collating
(sorting), but often needs to be changed when **matching** (e.g. SELECT). The
possible values include Default (D), Primary (1), Secondary (2), Tertiary (3),
Quaternary (4), and Identical (I).
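
The effect of the first three strength levels can be tried out with a short
program. The following is only a sketch under stated assumptions: it uses the
JDK's `java.text.Collator` as a stand-in because that runs without extra
libraries. ICU4J's `com.ibm.icu.text.Collator` has a `setStrength()` method and
`PRIMARY`/`SECONDARY`/`TERTIARY` constants with the same names, but its results
may differ in detail from the JDK's. The class name `StrengthSketch` and the
helper `relation` are made up for this illustration.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of ICU or the JDK.
class StrengthSketch {
    // Returns "<", "=" or ">" for the comparison of a and b at the given strength.
    static String relation(int strength, String a, String b) {
        // Assumption: the JDK collator for English stands in for an ICU collator.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        // Decompose precomposed letters such as U+00F4 so the accent is weighed separately.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        int result = collator.compare(a, b);
        return result < 0 ? "<" : (result > 0 ? ">" : "=");
    }

    public static void main(String[] args) {
        int[] strengths = {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        String[] labels = {"primary", "secondary", "tertiary"};
        String accented = "r\u00f4le"; // "role" with a circumflex on the o
        for (int i = 0; i < strengths.length; i++) {
            System.out.println(labels[i] + ": role "
                    + relation(strengths[i], "role", "Role") + " Role "
                    + relation(strengths[i], "Role", accented) + " " + accented);
        }
    }
}
```

Creating a new collator on every call keeps the sketch short; real code would
normally configure one collator and reuse it.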

For example, people may choose to ignore accents or ignore accents and case when
searching for text.

Almost all characters are distinguished by the first three levels, and in most
locales the default value is thus Tertiary. However, if Alternate is set to be
Shifted, then the Quaternary strength (4) can be used to break ties among
whitespace, punctuation, and symbols that would otherwise be ignored. If very
fine distinctions among characters are required, then the Identical strength (I)
can be used. For example, the Identical strength distinguishes between
**Mathematical Bold Small A** and **Mathematical Italic Small A**. For more
examples, look at the cells with white backgrounds in the collation charts.
However, using levels higher than Tertiary, such as the Identical strength,
results in significantly longer sort keys, and slower string comparison
performance for equal strings.

**Example:**\
**S=1** role = Role = rôle\
**S=2** role = Role < rôle\
**S=3** role < Role < rôle

The **Case_Level** attribute is used when ignoring accents
**but not** case. In such a situation, set Strength to be Primary, and
Case_Level to be On. In most locales, this setting is Off by default. There is a
small string comparison performance and sort key impact if this attribute is set
to be On.

**Example:**\
**S=1, E=X** role = Role = rôle\
**S=1, E=O** role = rôle < Role

The **Case_First** attribute is used to control whether
uppercase letters come before lowercase letters or vice versa, in the absence of
other differences in the strings. The possible values are Uppercase_First (U)
and Lowercase_First (L), plus the standard Default and Off. There is almost no
difference between the Off and Lowercase_First options in terms of results, so
typically users will not use Lowercase_First: only Off or Uppercase_First.
(People interested in the detailed differences between X and L should consult
the [Collation Customization](customization/index.md) chapter.)
Specifying either L or U won't affect string comparison performance, but will
affect the sort key length.

**Example:**\
**C=X or C=L** "china" < "China" < "denmark" < "Denmark"\
**C=U** "China" < "china" < "Denmark" < "denmark"

The **Alternate** attribute is used to control the handling of
the so-called **variable** characters in the UCA: whitespace, punctuation and
symbols. If Alternate is set to Non-Ignorable (N), then differences among these
characters are of the same importance as differences among letters. If Alternate
is set to Shifted (S), then these characters are of only minor importance.
The
Shifted value is often used in combination with Strength set to Quaternary. In
such a case, whitespace, punctuation, and symbols are considered when comparing
strings, but only if all other aspects of the strings (base letters, accents,
and case) are identical. If Alternate is not set to Shifted, then there is no
difference between a Strength of 3 and a Strength of 4.

For more information and examples, see
[Variable_Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) in
the UCA.

The reason the Alternate values are not simply On and Off is that
additional Alternate values may be added in the future.

The UCA option
**Blanked** is expressed with Strength set to 3, and Alternate set to Shifted.

The default for most locales is Non-Ignorable. If Shifted is selected, string
comparison may be slower if there are many strings that are the same except for
punctuation; sort key length will not be affected unless the strength level is
also increased.

**Example:**\
**S=3, A=N** di Silva < Di Silva < diSilva < U.S.A. < USA\
**S=3, A=S** di Silva = diSilva < Di Silva < U.S.A. = USA\
**S=4, A=S** di Silva < diSilva < Di Silva < U.S.A. < USA

The **Variable_Top** attribute is only meaningful if the
Alternate attribute is not set to Non-Ignorable.
In such a case, it controls
which characters count as ignorable. The \<hex digits\> value specifies the
"highest" character sequence, by weight in UCA order, that is to be considered
ignorable.

Thus, for example, if a user wanted whitespace to be ignorable, but not any
visible characters, then they would use the value Variable_Top=0020 (space). The
hex digits should denote only a single character. All characters of the same
primary weight are equivalent, so Variable_Top=3000 (ideographic space) has the
same effect as Variable_Top=0020.

This setting (alone) has little impact on string comparison performance; setting
it lower or higher will make sort keys slightly shorter or longer respectively.

**Example:**\
**S=3, A=S** di Silva = diSilva < U.S.A. = USA\
**S=3, A=S, T=0020** di Silva = diSilva < U.S.A. < USA

The **Normalization** setting determines whether
text is thoroughly normalized or not in comparison. Even if the setting is off
(which is the default for many locales), text as represented in common usage
will compare correctly (for details, see [UTN
#5](http://www.unicode.org/notes/tn5/)). Only if the accent marks are in
non-canonical order will there be a problem.
If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string
comparison performance cost if this attribute is On, depending on the frequency
of sequences that require normalization. There is no significant effect on sort
key length. If the input text is known to be in NFD or NFKD normalization forms,
there is no need to enable this Normalization option.

**Example:**\
**N=X** ä = a + ◌̈ < ä + ◌̣ < ạ + ◌̈\
**N=O** ä = a + ◌̈ < ä + ◌̣ = ạ + ◌̈

Some **French** dictionary ordering traditions sort strings with
different accents from the back of the string. This attribute is automatically
set to On for the Canadian French locale (fr_CA). Users normally would not need
to explicitly set this attribute. There is a string comparison performance cost
when it is set On, but sort key length is unaffected.

**Example:**\
**F=X** cote < coté < côte < côté\
**F=O** cote < côte < coté < côté

Compatibility with JIS X 4061 requires the introduction of an
additional level to distinguish **Hiragana** and Katakana characters. If
compatibility with that standard is required, then this attribute should be set
to On, and the strength should be set to at least Quaternary.

This attribute is an implementation detail of the CLDR Japanese tailoring.
The
implementation might change to use a different mechanism to achieve the same
Japanese sort order. Since ICU 50, this attribute is not settable any more.

**Example:**\
**H=X, S=4** きゅう = キュウ < きゆう = キユウ\
**H=O, S=4** きゅう < キュウ < きゆう < キユウ

> :point_right: **Note** If attributes in a collator name are not overridden,
> it is assumed that they are the same as for the given locale.
> For example, a collator opened with an empty
> string has the same attribute settings as **AN_CX_EX_FX_HX_KX_NX_S3_T0000**.

### Summary of Value Abbreviations

Value         | Abbreviation
------------- | ------------
Default       | D
On            | O
Off           | X
Primary       | 1
Secondary     | 2
Tertiary      | 3
Quaternary    | 4
Identical     | I
Shifted       | S
Non-Ignorable | N
Lower-First   | L
Upper-First   | U
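
As a reading aid for the abbreviations in this section, the following sketch
splits a short-form collator name into attribute/value pairs. It is purely
illustrative and is not an ICU API: the class `ShortNameSketch`, its `decode`
method, and the decision to pass locale elements, keywords, and Variable_Top
digits through unchanged are assumptions made here. It relies only on the two
abbreviation tables in this section and does not handle a leading UCA version
element.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of ICU.
class ShortNameSketch {
    // Attribute abbreviations, taken from the "Collator Naming Attributes" table.
    static final Map<Character, String> ATTRIBUTES = new LinkedHashMap<>();
    // Value abbreviations, taken from the "Summary of Value Abbreviations" table.
    static final Map<String, String> VALUES = new LinkedHashMap<>();
    // Attributes whose value is free text (locale elements, keyword, hex digits).
    static final String FREE_FORM = "LZRVKT";

    static {
        String[][] attributes = {
            {"L", "Locale"}, {"Z", "Script"}, {"R", "Region"}, {"V", "Variant"},
            {"K", "Keyword"}, {"S", "Strength"}, {"E", "Case_Level"},
            {"C", "Case_First"}, {"A", "Alternate"}, {"T", "Variable_Top"},
            {"N", "Normalization Checking"}, {"F", "French"}, {"H", "Hiragana"}
        };
        for (String[] a : attributes) {
            ATTRIBUTES.put(a[0].charAt(0), a[1]);
        }
        String[][] values = {
            {"D", "Default"}, {"O", "On"}, {"X", "Off"}, {"1", "Primary"},
            {"2", "Secondary"}, {"3", "Tertiary"}, {"4", "Quaternary"},
            {"I", "Identical"}, {"S", "Shifted"}, {"N", "Non-Ignorable"},
            {"L", "Lower-First"}, {"U", "Upper-First"}
        };
        for (String[] v : values) {
            VALUES.put(v[0], v[1]);
        }
    }

    // Splits a name such as "AN_CX_S3" on '_' and expands each element.
    static Map<String, String> decode(String shortName) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String part : shortName.split("_")) {
            if (part.isEmpty()) {
                continue;
            }
            char key = part.charAt(0);
            String attribute = ATTRIBUTES.get(key);
            if (attribute == null) {
                throw new IllegalArgumentException("Unknown attribute in: " + part);
            }
            String raw = part.substring(1);
            // The first letter names the attribute; the remainder is its value.
            String value = FREE_FORM.indexOf(key) >= 0 ? raw : VALUES.getOrDefault(raw, raw);
            result.put(attribute, value);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(decode("AN_CX_EX_FX_HX_KX_NX_S3_T0000"));
    }
}
```

Because the first letter of each element always names the attribute, letters
that appear in both tables (S, N, L) are not ambiguous: in `AS` the `A` selects
Alternate and the `S` is read as the value Shifted, whereas in `S3` the `S`
selects Strength.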