---
layout: default
title: Concepts
nav_order: 1
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation Concepts
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

The previous section demonstrated many of the requirements imposed on string
comparison routines that try to correctly collate strings according to
conventions of more than a hundred different languages, written in many
different scripts. This section describes the principles and architecture behind
the ICU Collation Service.

## Sortkeys vs Comparison

Sort keys are most useful in databases, where the overhead of calling a function
for each comparison is very large.

Generating a sort key from a Collator is many times more expensive than doing a
compare with the Collator (for common use cases). That's if the two functions
are called from Java or C. So for those languages, unless there is a very large
number of comparisons, it is better to call the compare function.

Here is an example, with a little back-of-the-envelope calculation. Let's
suppose that with a given language on a given platform, a compare (cost CP) is
100 times faster than generating a sort key (cost SP), and that you are doing a
binary search of a list with 1,000 elements. A binary comparison of two sort
keys costs BP. We'd do about 10 comparisons, getting:

compare: 10 \* CP

sortkey: 1 \* SP + 10 \* BP

Even if BP is free, compare would be better, because 10 \* CP is only a tenth
of SP. One has to get up to where log2(n) = 100 before they break even.

But even this calculation is only a rough guide. First, the binary comparison is
not completely free. Secondly, the performance of the compare function varies
radically with the source data. We optimized for maximizing performance of
collation in sorting and binary search, so comparing strings that are "close" is
optimized to be much faster than comparing strings that are "far away". That
optimization is important because normal sort/lookup operations compare close
strings far more often -- think of binary search, where the last few comparisons
are always with the closest strings. So even the above calculation is not very
accurate.

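The back-of-the-envelope estimate above can be reproduced with a small cost model. The following Python sketch is illustrative only: the function names and the unit costs (CP = 1, SP = 100, BP = 0) are assumptions chosen to match the example, not measurements of ICU, and it assumes the sort keys of the list elements are already stored.

```python
import math


def binary_search_steps(n: int) -> int:
    """Approximate number of comparisons for a binary search over n items."""
    return math.ceil(math.log2(n))


def compare_cost(n: int, cp: float) -> float:
    """Total cost when every step calls the collator's compare function."""
    return binary_search_steps(n) * cp


def sortkey_cost(n: int, sp: float, bp: float) -> float:
    """Total cost when one sort key is generated for the search term and
    every step is a binary comparison against a stored sort key."""
    return sp + binary_search_steps(n) * bp


def break_even_steps(cp: float, sp: float, bp: float) -> float:
    """Number of comparisons at which both strategies cost the same."""
    return sp / (cp - bp)


# Assumed unit costs: compare is 100 times cheaper than sort key generation.
CP, SP, BP = 1.0, 100.0, 0.0

print(compare_cost(1000, CP))        # 10 * CP
print(sortkey_cost(1000, SP, BP))    # 1 * SP + 10 * BP
print(break_even_steps(CP, SP, BP))  # comparisons needed to break even
```

With these assumed numbers the compare strategy costs 10 units and the sort key strategy costs 100 units, which is why a single lookup favors compare.
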
## Comparison Levels

In general, when comparing and sorting objects, some properties can take
precedence over others. For example, in geometry, you might consider first the
number of sides a shape has, followed by the number of sides of equal length.
This causes triangles to be sorted together, then rectangles, then pentagons,
etc. Within each category, the shapes would be ordered according to whether they
had 0, 2, 3 or more sides of the same length. However, this is not the only way
the shapes can be sorted. For example, it might be preferable to sort shapes by
color first, so that all red shapes are grouped together, then blue, etc.
Another approach would be to sort the shapes by the amount of area they enclose.

Similarly, character strings have properties, some of which can take precedence
over others. There is more than one way to prioritize the properties.

For example, a common approach is to distinguish characters first by their
unadorned base letter (for example, without accents, vowels or tone marks), then
by accents, and then by the case of the letter (upper vs. lower). Ideographic
characters might be sorted by their component radicals and then by the number of
strokes it takes to draw the character. An alternative ordering would be to sort
these characters by strokes first and then by their radicals.

The ICU Collation Service supports many levels of comparison (named "Levels",
but also known as "Strengths"). Having these categories enables ICU to sort
strings precisely according to local conventions. In addition, because the
levels can be selectively employed, searching for a string in text can be
performed with various matching conditions.

Performance optimizations have been made for ICU collation with the default
level settings. Performance-specific impacts are discussed in the Performance
section below.

Following is a list of the names for each level and an example usage:

1.  Primary Level: Typically, this is used to denote differences between base
    characters (for example, "a" < "b"). It is the strongest difference. For
    example, dictionaries are divided into different sections by base character.
    This is also called the level-1 strength.

2.  Secondary Level: Accents in the characters are considered secondary
    differences (for example, "as" < "às" < "at"). Other differences between
    letters can also be considered secondary differences, depending on the
    language. A secondary difference is ignored when there is a primary
    difference anywhere in the strings. This is also called the level-2
    strength.
    Note: In some languages (such as Danish), certain accented letters are
    considered to be separate base characters. In most languages, however, an
    accented letter only has a secondary difference from the unaccented version
    of that letter.

3.  Tertiary Level: Upper and lower case differences in characters are
    distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In
    addition, a variant of a letter differs from the base form on the tertiary
    level (such as "A" and "Ⓐ"). Another example is the difference between large
    and small Kana. A tertiary difference is ignored when there is a primary or
    secondary difference anywhere in the strings. This is also called the
    level-3 strength.

4.  Quaternary Level: When punctuation is ignored (see
    [Ignoring Punctuation](#ignoring-punctuation)) at levels 1-3, an additional
    level can be used to distinguish words with and without punctuation (for
    example, "ab" < "a-b" < "aB"). This difference is ignored when there is a
    primary, secondary or tertiary difference. This is also known as the
    level-4 strength. The quaternary level should only be used if ignoring
    punctuation is required or when processing Japanese text (see Hiragana
    processing).

5.  Identical Level: When all other levels are equal, the identical level is
    used as a tiebreaker. The Unicode code point values of the NFD form of each
    string are compared at this level, just in case there is no difference at
    levels 1-4. For example, Hebrew cantillation marks are only distinguished
    at this level. This level should be used sparingly, because it is extremely
    rare for two strings to differ only in their code point values. Using this
    level substantially decreases the performance of both incremental
    comparison and sort key generation (as well as increasing the sort key
    length). It is also known as the level-5 strength.

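The way a weaker level is consulted only when all stronger levels are equal can be sketched with a toy multi-level key. This is a minimal Python model, not ICU's implementation or data: `toy_key` is a hypothetical helper that takes the lowercased base letter as the primary weight, the attached combining marks as the secondary weight, and the case as the tertiary weight.

```python
import unicodedata


def toy_key(text: str, strength: int = 3) -> tuple:
    """Build a (primary, secondary, tertiary) key truncated to `strength` levels."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # An accent adds a secondary weight to the preceding base letter.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())                  # base letter
        secondary.append(())                        # no accent seen yet
        tertiary.append(1 if ch.isupper() else 0)   # lowercase before uppercase
    levels = (tuple(primary), tuple(secondary), tuple(tertiary))
    # Tuples compare level by level, so a secondary difference only matters
    # when the whole primary level is equal, and so on.
    return levels[:strength]


print(sorted(["at", "às", "as"], key=toy_key))   # ['as', 'às', 'at']
print(sorted(["aò", "Ao", "ao"], key=toy_key))   # ['ao', 'Ao', 'aò']
print(toy_key("Às", strength=1) == toy_key("as", strength=1))  # True
```

Lowering `strength` drops the weaker levels from the key, which is how a primary-strength comparison ends up treating "Às" and "as" as equal in this model.
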
## Backward Secondary Sorting

Some languages require words to be ordered on the secondary level according to
the *last* accent difference, as opposed to the *first* accent difference. This
was previously the default for all French locales, based on some French
dictionary ordering traditions, but is currently only applicable to Canadian
French (locale **fr_CA**), for conformance with the [Canadian sorting
standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
ordering is only noticeable for a small number of pairs of real words. For more
information see [UCA: Contextual
Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).

Example:

Forward secondary | Backward secondary
----------------- | ------------------
cote              | cote
coté              | côte
côte              | coté
côté              | côté

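The two columns of the table can be reproduced with a toy model in which the secondary weights are compared either from the start or from the end of the word. This Python sketch is an illustration under that assumption; `accent_weights` and `toy_key` are hypothetical helpers, not ICU functions.

```python
import unicodedata


def accent_weights(word: str) -> list:
    """One entry per base letter: the tuple of combining marks attached to it."""
    weights = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if weights:
                weights[-1] += (ord(ch),)
        else:
            weights.append(())
    return weights


def toy_key(word: str, backward_secondary: bool = False) -> tuple:
    """Primary level: base letters. Secondary level: accents, optionally reversed."""
    decomposed = unicodedata.normalize("NFD", word)
    primary = tuple(ch for ch in decomposed if not unicodedata.combining(ch))
    secondary = accent_weights(word)
    if backward_secondary:
        # Compare accents starting from the end of the word.
        secondary.reverse()
    return primary, tuple(secondary)


words = ["côté", "côte", "coté", "cote"]
forward = sorted(words, key=toy_key)
backward = sorted(words, key=lambda w: toy_key(w, backward_secondary=True))
print(forward)   # cote, coté, côte, côté
print(backward)  # cote, côte, coté, côté
```
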
## Contractions

A contraction is a sequence consisting of two or more letters. It is considered
a single letter in sorting.

For example, in the traditional Spanish sorting order, "ch" is considered a
single letter. All words that begin with "ch" sort after all other words
beginning with "c", but before words starting with "d".

Other examples of contractions are "ch" in Czech, which sorts after "h", and
"lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
respectively.

Example:

Order without contraction | Order with contraction "lj" sorting after letter "l"
------------------------- | ----------------------------------------------------
la                        | la
li                        | li
lj                        | lk
lja                       | lz
ljz                       | lj
lk                        | lja
lz                        | ljz
ma                        | ma

Contracting sequences such as the above are not very common in most languages.

> :point_right: **Note** Since ICU 2.2, and as required by the UCA,
> if a completely ignorable code point
> appears in text in the middle of a contraction, it will not break the contraction.
> For example, in Czech sorting, cU+0000h will sort as if it were ch.

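The effect of the "lj" contraction in the table can be sketched by treating the contraction as one extra letter of the alphabet and matching the longest unit first. The Python code below is a toy model over lowercase ASCII; `build_alphabet` and `toy_key` are hypothetical helpers, and it does not handle the ignorable code points mentioned in the note.

```python
import string


def build_alphabet(contractions: dict) -> list:
    """Return a-z with each contraction placed right after its anchor letter.

    `contractions` maps a contraction (e.g. "lj") to the letter it follows.
    """
    alphabet = []
    for letter in string.ascii_lowercase:
        alphabet.append(letter)
        for contraction, after in contractions.items():
            if after == letter:
                alphabet.append(contraction)
    return alphabet


def toy_key(word: str, alphabet: list) -> list:
    """Map a word to the ranks of its sorting units, longest match first."""
    rank = {unit: i for i, unit in enumerate(alphabet)}
    longest = max(len(unit) for unit in alphabet)
    key, pos = [], 0
    while pos < len(word):
        for size in range(longest, 0, -1):
            unit = word[pos:pos + size]
            if unit in rank:
                key.append(rank[unit])
                pos += len(unit)
                break
        else:
            raise ValueError(f"no weight for {word[pos]!r}")
    return key


words = ["ma", "ljz", "lz", "lja", "lk", "lj", "li", "la"]
plain = build_alphabet({})
croatian_like = build_alphabet({"lj": "l"})
print(sorted(words, key=lambda w: toy_key(w, plain)))
print(sorted(words, key=lambda w: toy_key(w, croatian_like)))
```
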
## Expansions

If a letter sorts as if it were a sequence of more than one letter, it is called
an expansion.

For example, in German phonebook sorting (de@collation=phonebook or BCP 47
de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae".
All words starting with "ä" will sort between words starting with "ad" and words
starting with "af".

In the case of Unicode encoding, characters can often be represented either as
pre-composed characters or in decomposed form. For example, the letter "à" can
be represented in its decomposed (a+\`) and pre-composed (à) form. Most
applications do not want to distinguish text by the way it is encoded. A search
for "à" should find all instances of the letter, regardless of whether the
instance is in pre-composed or decomposed form. Therefore, either form of the
letter must result in the same sort ordering. The architecture of the ICU
Collation Service supports this.

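A rough sketch of the phonebook expansion, and of the requirement that pre-composed and decomposed input sort identically, might look as follows in Python. It models the primary level only; the `EXPANSIONS` table and `phonebook_primary` are assumptions made for illustration, not ICU data or API.

```python
import unicodedata

# Assumed toy expansion table: one letter sorts as a sequence of two letters.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def phonebook_primary(word: str) -> str:
    """Primary-level key: compose first so both encodings of 'ä' match the
    table, then replace each expanding letter by its sequence."""
    composed = unicodedata.normalize("NFC", word.lower())
    return "".join(EXPANSIONS.get(ch, ch) for ch in composed)


names = ["Affe", "Ärger", "Adler"]
print(sorted(names, key=phonebook_primary))  # "Ärger" lands between "Adler" and "Affe"

# Decomposed input (A + U+0308) produces the same key as pre-composed "Ä".
print(phonebook_primary("A\u0308rger") == phonebook_primary("\u00c4rger"))
```
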
## Contractions Producing Expansions

It is possible to have contractions that produce expansions.

One example occurs in Japanese, where the vowel with a prolonged sound mark is
treated as equivalent to the long vowel version:

カアー<<< カイー and\
キイー<<< キイー

> :point_right: **Note** Since ICU 2.0 Japanese tailoring uses
> [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
> instead of contractions producing expansions.

## Normalization

In the section on expansions, we discussed that text in Unicode can often be
represented in either pre-composed or decomposed forms. There are other types of
equivalences possible with Unicode, including Canonical and Compatibility. The
process of Normalization ensures that text is written in a predictable way so
that searches are not made unnecessarily complicated by having to match on
equivalences. Not all text is normalized, however, so it is useful to have a
collation service that can address text that is not normalized, but do so with
efficiency.

The ICU Collation Service handles un-normalized text properly, producing the
same results as if the text were normalized.

In practice, most data that is encountered is in normalized or semi-normalized
form already. The ICU Collation Service is designed so that it can process a
wide range of normalized or un-normalized text without a need for normalization
processing. When a case is encountered that requires normalization, the ICU
Collation Service drops into code specific to this purpose. This maximizes
performance for the majority of text that does not require normalization.

In addition, if the text is known with certainty not to contain un-normalized
text, then even the overhead of checking for normalization can be eliminated.
The ICU Collation Service has the ability to turn Normalization Checking either
on or off. If Normalization Checking is turned off, it is the user's
responsibility to ensure that all text is already in the appropriate form. This
is the case for the great majority of the world's languages, so normalization
checking is turned off by default for most locales.

If the text requires normalization processing, Normalization Checking should be
on. Any language that uses multiple combining characters such as Arabic, ancient
Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking
to be on, or the text to go through a normalization process before collation.

For more information about Normalization related reordering please see
[Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
[UAX #15](http://www.unicode.org/reports/tr15/).

> :point_right: **Note** ICU supports two modes of normalization: on and off.
> The java.text.\* classes offer compatibility decomposition mode, which is not supported in ICU.

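The idea of checking first and normalizing only when needed can be illustrated with Python's standard `unicodedata` module (`is_normalized` requires Python 3.8 or later). This is a sketch of the general technique, not of ICU's internal check; `canonical_form` is a hypothetical helper.

```python
import unicodedata


def canonical_form(text: str) -> str:
    """Return the NFD form, skipping the work when the text is already NFD."""
    if unicodedata.is_normalized("NFD", text):
        return text  # fast path: nothing to do
    return unicodedata.normalize("NFD", text)


# Pre-composed and decomposed spellings of the same letter.
print(canonical_form("\u00e0") == canonical_form("a\u0300"))

# Two combining marks typed in different orders: circumflex (U+0302) and
# dot below (U+0323). Canonical reordering makes both sequences identical.
print(canonical_form("a\u0302\u0323") == canonical_form("a\u0323\u0302"))
```

The second comparison shows why languages that stack several combining marks on one letter need the check: the same text can arrive with the marks in different orders.
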
## Ignoring Punctuation

In some cases, punctuation can be ignored while searching or sorting data. For
example, this enables a search for "biweekly" to also return instances of
"bi-weekly". In other cases, it is desirable for punctuated text to be
distinguished from text without punctuation, but to have the text sort close
together.

These two behaviors can be accomplished if there is a way for a character to be
ignored on all levels except for the quaternary level. If this is the case, then
two strings which compare as identical on the first three levels (base letter,
accents, and case) are then distinguished at the fourth level based on their
punctuation (if any). If the comparison function ignores differences at the
fourth level, then strings that differ by punctuation only are compared as
equal.

The following table shows the results of sorting a list of terms in 3 different
ways. In the first column, punctuation characters (space " ", and hyphen "-")
are not ignored (" " < "-" < "b"). In the second column, punctuation characters
are ignored in the first 3 levels and compared only in the fourth level. In the
third column, punctuation characters are ignored in the first 3 levels and the
fourth level is not considered. In the last column, punctuated terms are
equivalent to the identical terms without punctuation.

For more options and details see the [“Ignore Punctuation”
Options](customization/ignorepunct.md) page.

Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
------------- | --------------------------------- | -------------------------------
black bird    | black bird                        | **black bird**
black Bird    | black-bird                        | **black-bird**
black birds   | blackbird                         | **blackbird**
black-bird    | black Bird                        | black Bird
black-Bird    | black-Bird                        | black-Bird
black-birds   | blackBird                         | blackBird
blackbird     | black birds                       | black birds
blackBird     | black-birds                       | black-birds
blackbirds    | blackbirds                        | blackbirds

> :point_right: **Note** The strings shown in bold in the last column are
> compared as equal by the ICU Collator.\
> Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that
> follow shifted code points will be completely ignored. This means that an accent
> following a space will compare as if it were a space alone.

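The three columns of the table can be reproduced with a toy key in which space and hyphen either carry a primary weight or are moved to a fourth level. The Python sketch below is a simplified model (it omits the secondary level because the terms have no accents); `toy_key`, `VARIABLE`, and the weights are assumptions for illustration, not ICU data.

```python
VARIABLE = {" ": 0, "-": 1}   # toy weights for punctuation: space < hyphen
NOT_VARIABLE = 2              # fourth-level weight of every other character


def toy_key(text: str, shifted: bool, quaternary: bool = False) -> tuple:
    """Key of (primary, tertiary) plus an optional fourth level."""
    primary, tertiary, fourth = [], [], []
    for ch in text:
        if shifted and ch in VARIABLE:
            # Ignored on the first levels, visible only on the fourth.
            fourth.append(VARIABLE[ch])
            continue
        primary.append(VARIABLE.get(ch, NOT_VARIABLE + ord(ch.lower())))
        tertiary.append(1 if ch.isupper() else 0)
        fourth.append(NOT_VARIABLE)
    key = (tuple(primary), tuple(tertiary))
    if quaternary:
        key += (tuple(fourth),)
    return key


TERMS = ["blackbirds", "black-Bird", "black bird", "blackBird", "black-birds",
         "blackbird", "black Bird", "black birds", "black-bird"]

non_ignorable = sorted(TERMS, key=lambda t: toy_key(t, shifted=False))
shifted_q = sorted(TERMS, key=lambda t: toy_key(t, shifted=True, quaternary=True))
print(non_ignorable)
print(shifted_q)
# At tertiary strength the punctuated variants compare as equal:
print(toy_key("black-bird", shifted=True) == toy_key("blackbird", shifted=True))
```
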
## Case Ordering

The tertiary level is used to distinguish text by case, by small versus large
Kana, and other letter variants as noted above.

Some applications prefer to emphasize case differences so that words starting
with the same case sort together. Some Japanese applications require the
difference between small and large Kana be emphasized over other tertiary
differences.

The UCA does not provide means to separate out either case or Kana differences
from the remaining tertiary differences. However, the ICU Collation Service has
two options that help customize case and/or Kana differences. Both options
are turned off by default.

### CaseFirst

The Case-first option makes case the most significant part of the tertiary
level. Primary and secondary levels are unaffected. With this option, words
starting with the same case sort together. The Case-first option can be set to
make either lowercase sort before uppercase or uppercase sort before lowercase.

Note: The case-first option does not constitute a separate level; it is simply a
reordering of the tertiary level.

ICU makes use of the following three case categories for sorting:

1.  uppercase: "ABC"

2.  mixed case: "Abc", "aBc"

3.  normal (lowercase or no case): "abc", "123"

Mixed case is always sorted between uppercase and normal case when the
"case-first" option is set.

### CaseLevel

The Case Level option makes a separate level for case differences. This is an
extra level positioned between secondary and tertiary. The case level is used in
Japanese to make the difference between small and large Kana more important than
the other tertiary differences. It also can be used to ignore other tertiary
differences, or even secondary differences. This is especially useful in
matching. For example, if the strength is set to primary only (level-1) and the
case level is turned on, the comparison ignores accents and tertiary differences
except for case. The contents of the case level are affected by the case-first
option.

The case level is independent from the strength of comparison. It is possible to
have a collator set to primary strength with the case level turned on. This
provides for comparison that takes into account the case differences, while at
the same time ignoring accents and tertiary differences other than case. This
may be used in searching.

Example:

**Case-first off, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Lowercase-first, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
ähnlich\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit

**Uppercase-first, Case level off**

Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit\
apple\
ⓐⓟⓟⓛⓔ\
ähnlich

**Lowercase-first, Case level on**

apple\
Abernathy\
ⓐⓟⓟⓛⓔ\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Uppercase-first, Case level on**

Abernathy\
apple\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ⓐⓟⓟⓛⓔ\
Ähnlichkeit\
ähnlich

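How the case-first and case-level options described above interact with the strength setting can be sketched with a simplified per-character model. The Python code below is a toy: it does not model ICU's mixed-case category or letter variants such as circled letters, it does not try to reproduce the word lists above, and `toy_key` with its parameters is a hypothetical helper rather than an ICU API.

```python
import unicodedata


def toy_key(text: str, strength: int = 3, case_first: str = "lower",
            case_level: bool = False) -> tuple:
    """Build a key of primary [, secondary] [, case level] [, tertiary]."""
    decomposed = unicodedata.normalize("NFD", text)
    bases = [ch for ch in decomposed if not unicodedata.combining(ch)]
    primary = tuple(ch.lower() for ch in bases)
    # Simplification: accents are collected without their position.
    secondary = tuple(ord(ch) for ch in decomposed if unicodedata.combining(ch))
    # case_first decides which case gets the smaller weight.
    upper_weight = 0 if case_first == "upper" else 1
    case = tuple(upper_weight if ch.isupper() else 1 - upper_weight
                 for ch in bases)

    key = [primary]
    if strength >= 2:
        key.append(secondary)
    if case_level:
        key.append(case)      # extra level between secondary and tertiary
    if strength >= 3:
        key.append(case)      # in this toy, case is the only tertiary difference
    return tuple(key)


words = ["ABC", "abc", "Abc"]
print(sorted(words, key=lambda w: toy_key(w, case_first="lower")))
print(sorted(words, key=lambda w: toy_key(w, case_first="upper")))

# Primary strength with the case level on: accents ignored, case respected.
print(toy_key("role", 1, case_level=True) == toy_key("rôle", 1, case_level=True))
print(toy_key("Role", 1, case_level=True) == toy_key("role", 1, case_level=True))
```
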
## Script Reordering

Script reordering allows scripts and some other groups of characters to be moved
relative to each other. This reordering is done on top of the DUCET/CLDR
standard collation order. Reordering can specify groups to be placed at the
start and/or the end of the collation order.

By default, reordering codes specified for the start of the order are placed in
the order given after several special non-script blocks. These special groups of
characters are space, punctuation, symbol, currency, and digit. Script groups
can be intermingled with these special non-script groups if those special groups
are explicitly specified in the reordering.

The special code `others` stands for any script that is not explicitly mentioned
in the list. Anything that is after others will go at the very end of the list
in the order given. For example, `[Grek, others, Latn]` will result in an
ordering that puts all scripts other than Greek and Latin between them.

### Examples:

Note: All examples below use the string equivalents for the scripts and reorder
codes that would be used in collator rules. The script and reorder code
constants that would be used in API calls will be different.

**Example 1:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

**Example 2:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

followed by: set reorder code - `[Hani]`\
result - `[space, punctuation, symbol, currency, digit, Hani, others]`

That is, setting a reordering always modifies
the DUCET/CLDR order, replacing whatever was previously set, rather than adding
on to it. In order to cumulatively modify an ordering, you have to retrieve the
existing ordering, modify it, and then set it.

**Example 3:**\
set reorder code - `[others, digit]`\
result - `[space, punctuation, symbol, currency, others, digit]`

**Example 4:**\
set reorder code - `[space, Grek, punctuation]`\
result - `[symbol, currency, digit, space, Grek, punctuation, others]`

**Example 5:**\
set reorder code - `[Grek, others, Hani]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`

**Example 6:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[NONE]`\
result - DUCET/CLDR

**Example 7:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[DEFAULT]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 8:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 9:**\
set reorder code - `[Hebr, Phnx]`\
result - error

4962e5b6d6dSopenharmony_ciBeginning with ICU 55, scripts only reorder together if they are primary-equal,
4972e5b6d6dSopenharmony_cifor example Hiragana and Katakana.
4982e5b6d6dSopenharmony_ci
4992e5b6d6dSopenharmony_ciICU 4.8-54:
5002e5b6d6dSopenharmony_ci
5012e5b6d6dSopenharmony_ci*   Scripts were reordered in groups, each normally starting with a [Recommended
5022e5b6d6dSopenharmony_ci    Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
5032e5b6d6dSopenharmony_ci*   Reorder codes moved as a group (were “equivalent”) if their scripts shared a
5042e5b6d6dSopenharmony_ci    primary-weight lead byte.
5052e5b6d6dSopenharmony_ci*   For example, Hebr and Phnx were “equivalent” reordering codes and were
5062e5b6d6dSopenharmony_ci    reordered together. Their order relative to each other could not be changed.
*   Only one code from any given group could be reordered, not several codes
    from the same group.
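
The results of Examples 1 and 3 through 6 above follow a simple pattern, sketched below as a toy model. The function name `resulting_order` is an assumption made for illustration, and the logic is inferred only from those examples. It is not ICU's implementation, and it does not model `NONE`, `DEFAULT`, the empty list, or the error case.

```python
# Default order of the special groups, as shown in the examples above.
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]


def resulting_order(reorder_codes):
    """Toy model of the group order produced by a list of reorder codes."""
    listed = list(reorder_codes)
    # Special groups that are not mentioned keep their default leading position.
    leading = [g for g in SPECIAL_GROUPS if g not in listed]
    # "others" (all unlisted scripts) goes last unless it is placed explicitly.
    trailing = [] if "others" in listed else ["others"]
    return leading + listed + trailing


example_4 = resulting_order(["space", "Grek", "punctuation"])
```

The model makes the rule visible: listed codes keep the order given, unlisted special groups stay in front, and unlisted scripts follow as `others`.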
5092e5b6d6dSopenharmony_ci
5102e5b6d6dSopenharmony_ci## Sorting of Japanese Text (JIS X 4061)
5112e5b6d6dSopenharmony_ci
5122e5b6d6dSopenharmony_ciJapanese standard JIS X 4061 requires two changes to the collation procedures:
5132e5b6d6dSopenharmony_cispecial processing of Hiragana characters and (for performance reasons) prefix
5142e5b6d6dSopenharmony_cianalysis of text.
5152e5b6d6dSopenharmony_ci
5162e5b6d6dSopenharmony_ci### Hiragana Processing
5172e5b6d6dSopenharmony_ci
The JIS X 4061 standard requires more levels than the UCA provides. To offer a
conformant sorting order, ICU uses the quaternary level to distinguish between
Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana
symbols on the quaternary level, which causes Hiragana sequences to sort before
the corresponding Katakana sequences.
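
A toy sort key can illustrate the effect of the extra level. The sketch below is not how ICU builds keys. It assumes that Katakana folds onto Hiragana at the primary level, that small kana differ from full-size kana on a lower level (called tertiary here), and that the quaternary weight is 0 for Hiragana and 1 for Katakana. `toy_kana_key` is a hypothetical name.

```python
SMALL_KANA = set("ぁぃぅぇぉっゃゅょゎ")


def toy_kana_key(text):
    """Toy sort key: (primary, tertiary, quaternary) for kana-only text."""
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        is_katakana = "\u30a1" <= ch <= "\u30f6"
        # Fold Katakana onto Hiragana; the two blocks are 0x60 code points apart.
        hira = chr(ord(ch) - 0x60) if is_katakana else ch
        is_small = hira in SMALL_KANA
        # Each small kana sits one code point before its full-size form.
        primary.append(chr(ord(hira) + 1) if is_small else hira)
        tertiary.append(0 if is_small else 1)
        # Hiragana gets the smaller quaternary weight.
        quaternary.append(1 if is_katakana else 0)
    return (primary, tertiary, quaternary)


words = ["キユウ", "きゆう", "キュウ", "きゅう"]
ordered = sorted(words, key=toy_kana_key)
```

With these assumptions, each Hiragana word sorts directly before its Katakana counterpart.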
5232e5b6d6dSopenharmony_ci
5242e5b6d6dSopenharmony_ci### Prefix Analysis
5252e5b6d6dSopenharmony_ci
Another characteristic of sorting according to JIS X 4061 is a large number
of contractions followed by expansions (see
[Contractions Producing Expansions](#contractions-producing-expansions)).
This causes all the Hiragana and Katakana code points to be treated as
contractions, which reduces performance. The solution we adopted introduces the
prefix concept, which allows us to improve the performance of Japanese sorting.
More about this can be found in the [customization
chapter](customization/index.md).
5342e5b6d6dSopenharmony_ci
5352e5b6d6dSopenharmony_ci## Thai/Lao reordering
5362e5b6d6dSopenharmony_ci
5372e5b6d6dSopenharmony_ciUCA requires that certain Thai and Lao prevowels be reordered with a code point
5382e5b6d6dSopenharmony_cifollowing them. This option is always on in the ICU implementation, as
5392e5b6d6dSopenharmony_ciprescribed by the UCA.
5402e5b6d6dSopenharmony_ci
5412e5b6d6dSopenharmony_ciThis rule takes effect when:
5422e5b6d6dSopenharmony_ci
1.  A Thai vowel in the range U+0E40..U+0E44 precedes a Thai consonant in the
    range U+0E01..U+0E2E, or

2.  A Lao vowel in the range U+0EC0..U+0EC4 precedes a Lao consonant in the
    range U+0E81..U+0EAE.

In these cases the vowel is placed after the consonant for collation purposes.
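
The two conditions above can be sketched as a small preprocessing step. This is only an illustration of the rule as stated here. The function name `reorder_prevowels` is hypothetical, and the sketch makes no claim about how ICU implements the reordering internally.

```python
def reorder_prevowels(text):
    """Swap each Thai/Lao prevowel with the consonant that follows it."""
    thai_vowel, thai_cons = ("\u0e40", "\u0e44"), ("\u0e01", "\u0e2e")
    lao_vowel, lao_cons = ("\u0ec0", "\u0ec4"), ("\u0e81", "\u0eae")

    def in_range(ch, bounds):
        return bounds[0] <= ch <= bounds[1]

    out = list(text)
    i = 0
    while i + 1 < len(out):
        a, b = out[i], out[i + 1]
        thai = in_range(a, thai_vowel) and in_range(b, thai_cons)
        lao = in_range(a, lao_vowel) and in_range(b, lao_cons)
        if thai or lao:
            out[i], out[i + 1] = b, a
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(out)


# THAI CHARACTER SARA E followed by THAI CHARACTER KO KAI
swapped = reorder_prevowels("\u0e40\u0e01")
```

After the call, `swapped` holds the consonant first and the vowel second, which is the order used for comparison.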
5502e5b6d6dSopenharmony_ci
5512e5b6d6dSopenharmony_ci> :point_right: **Note** There is a difference between java.text.\* classes and ICU in regard to Thai
5522e5b6d6dSopenharmony_ci> reordering. Java.text.\* classes allow tailorings to turn off reordering by
5532e5b6d6dSopenharmony_ci> using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai
5542e5b6d6dSopenharmony_ci> prevowels.
5552e5b6d6dSopenharmony_ci
5562e5b6d6dSopenharmony_ci## Space Padding
5572e5b6d6dSopenharmony_ci
In many database products, fields are padded with nulls or spaces. To get correct results,
the input to a Collator should omit any superfluous trailing padding spaces. The
problem arises with contractions, expansions, or normalization. Suppose that
there are two fields, one containing "aed" and the other "äd". German
phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will
compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" <
"aed". But if both fields are padded with spaces to a length of 4, then this
will reverse the order, since the second will compare as if it were one character
longer. In other words, when you start with strings 1 and 2
5672e5b6d6dSopenharmony_ci
5682e5b6d6dSopenharmony_ci1  | a  | e  | d         | \<space\>
5692e5b6d6dSopenharmony_ci-- | -- | -- | --------- | ---------
5702e5b6d6dSopenharmony_ci2  | ä  | d  | \<space\> | \<space\>
5712e5b6d6dSopenharmony_ci
5722e5b6d6dSopenharmony_cithey end up being compared on a primary level as if they were 1' and 2'
5732e5b6d6dSopenharmony_ci
5742e5b6d6dSopenharmony_ci1' | a  | e  | d  | \<space\> | &nbsp;
5752e5b6d6dSopenharmony_ci-- | -- | -- | -- | --------- | ---------
5762e5b6d6dSopenharmony_ci2' | a  | e  | d  | \<space\> | \<space\>
5772e5b6d6dSopenharmony_ci
5782e5b6d6dSopenharmony_ciSince 2' has an extra character (the extra space), it counts as having a primary
5792e5b6d6dSopenharmony_cidifference when it shouldn't. The correct result occurs when the trailing
5802e5b6d6dSopenharmony_cipadding spaces are removed, as in 1" and 2"
5812e5b6d6dSopenharmony_ci
5822e5b6d6dSopenharmony_ci1" | a  | e  | d
5832e5b6d6dSopenharmony_ci-- | -- | -- | --
5842e5b6d6dSopenharmony_ci2" | a  | e  | d
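
The tables above can be reproduced with a toy primary key. The sketch assumes only what the text states: that "ä" expands to "ae" on the primary level and that a space counts as one element. `toy_primary_key` is a hypothetical helper, not an ICU function.

```python
def toy_primary_key(text):
    """Toy primary key: 'ä' expands to 'a' + 'e'; other characters map to themselves."""
    elements = []
    for ch in text:
        elements.extend("ae" if ch == "\u00e4" else ch)
    return elements


padded_1 = "aed "          # string 1, padded to four characters
padded_2 = "\u00e4d  "     # string 2 ("äd"), padded to four characters

# With padding, the second key carries one more space element than the first.
padded_equal = toy_primary_key(padded_1) == toy_primary_key(padded_2)

# Stripping the trailing padding first makes the primary keys identical.
trimmed_equal = (toy_primary_key(padded_1.rstrip(" "))
                 == toy_primary_key(padded_2.rstrip(" ")))
```

In a real application the same idea applies: strip trailing padding before handing the field to the collator.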
5852e5b6d6dSopenharmony_ci
5862e5b6d6dSopenharmony_ci## Collator naming scheme
5872e5b6d6dSopenharmony_ci
5882e5b6d6dSopenharmony_ci***Starting with ICU 54, the following naming scheme and its API functions are deprecated.***
5892e5b6d6dSopenharmony_ciUse `ucol_open()` with language tag collation keywords instead
5902e5b6d6dSopenharmony_ci(see [Collation API Details](api.md)). For example,
5912e5b6d6dSopenharmony_ci`ucol_open("de-u-co-phonebk-ka-shifted", &errorCode)` for German Phonebook order
5922e5b6d6dSopenharmony_ciwith "ignore punctuation" mode.
5932e5b6d6dSopenharmony_ci
5942e5b6d6dSopenharmony_ciWhen collating or matching text, a number of attributes can be used to affect
5952e5b6d6dSopenharmony_cithe desired result. The following describes the attributes, their values, their
5962e5b6d6dSopenharmony_cieffects, their normal usage, and the string comparison performance and sort key
5972e5b6d6dSopenharmony_cilength implications. It also includes single-letter abbreviations for both the
5982e5b6d6dSopenharmony_ciattributes and their values. These abbreviations allow a 'short-form'
5992e5b6d6dSopenharmony_cispecification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which
can be used to specify that the desired options are: UCA version 4.0.0; ignore
6012e5b6d6dSopenharmony_cispaces, punctuation and symbols; use Swedish linguistic conventions; compare
6022e5b6d6dSopenharmony_cicase-insensitively.
6032e5b6d6dSopenharmony_ci
6042e5b6d6dSopenharmony_ciA number of attribute values are common across different attributes; these
6052e5b6d6dSopenharmony_ciinclude **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
6062e5b6d6dSopenharmony_ciotherwise stated, the examples use the UCA alone with default settings.
6072e5b6d6dSopenharmony_ci
6082e5b6d6dSopenharmony_ci> :point_right: **Note** In order to achieve uniqueness, a collator name always
6092e5b6d6dSopenharmony_ci> has the attribute abbreviations sorted.
6102e5b6d6dSopenharmony_ci
6112e5b6d6dSopenharmony_ci### Main References
6122e5b6d6dSopenharmony_ci
1.  For a full list of supported locales in ICU, see [Locale
    Explorer](https://icu4c-demos.unicode.org/icu-bin/locexp), which also contains
    an on-line demo showing sorting for each locale. The demo allows you to try
    different attribute values, to see how they affect sorting.

2.  To see tabular results for the UCA table itself, see the [Unicode Collation
    Charts](http://www.unicode.org/charts/collation/).

3.  For the UCA specification, see [UTS #10: Unicode Collation
    Algorithm](http://www.unicode.org/reports/tr10/).

4.  For more detail on the precise effects of these options, see [Collation
    Customization](customization/index.md).
6262e5b6d6dSopenharmony_ci
6272e5b6d6dSopenharmony_ci#### Collator Naming Attributes
6282e5b6d6dSopenharmony_ci
6292e5b6d6dSopenharmony_ciAttribute              | Abbreviation | Possible Values
6302e5b6d6dSopenharmony_ci---------------------- | ------------ | ---------------
6312e5b6d6dSopenharmony_ciLocale                 | L            | \<language\>
6322e5b6d6dSopenharmony_ciScript                 | Z            | \<script\>
6332e5b6d6dSopenharmony_ciRegion                 | R            | \<region\>
6342e5b6d6dSopenharmony_ciVariant                | V            | \<variant\>
6352e5b6d6dSopenharmony_ciKeyword                | K            | \<keyword\>
6362e5b6d6dSopenharmony_ci&nbsp;                 | &nbsp;       | &nbsp;
6372e5b6d6dSopenharmony_ciStrength               | S            | 1, 2, 3, 4, I, D
6382e5b6d6dSopenharmony_ciCase_Level             | E            | X, O, D
6392e5b6d6dSopenharmony_ciCase_First             | C            | X, L, U, D
6402e5b6d6dSopenharmony_ciAlternate              | A            | N, S, D
6412e5b6d6dSopenharmony_ciVariable_Top           | T            | \<hex digits\>
6422e5b6d6dSopenharmony_ciNormalization Checking | N            | X, O, D
6432e5b6d6dSopenharmony_ciFrench                 | F            | X, O, D
6442e5b6d6dSopenharmony_ciHiragana               | H            | X, O, D
6452e5b6d6dSopenharmony_ci
6462e5b6d6dSopenharmony_ci#### Collator Naming Attribute Descriptions
6472e5b6d6dSopenharmony_ci
The **Locale** attribute is typically the most
important attribute for correct sorting and matching, according to user
expectations in different countries and regions. The default UCA ordering
sorts only a few languages, such as Dutch and Portuguese, correctly ("correctly"
meaning according to the normal expectations of users of those languages).
For any other language, a locale needs to be supplied so as to choose a
collator that is correctly **tailored** for that locale. The choice of a locale
will automatically preset the values for all of the attributes to something that
is reasonable for that locale. Thus most of the time the other attributes do not
need to be explicitly set. In some cases, the choice of locale will make a
difference in string comparison performance and/or sort key length.
6602e5b6d6dSopenharmony_ci
6612e5b6d6dSopenharmony_ciIn short attribute names,
6622e5b6d6dSopenharmony_ci`<language>_<script>_<region>_<variant>@collation=<keyword>` is
represented by: `L<language>_Z<script>_R<region>_V<variant>_K<keyword>`. Not
all the elements are required. Valid values for the locale elements are
generally those that are valid for RFC 3066 locale naming.
6662e5b6d6dSopenharmony_ci
6672e5b6d6dSopenharmony_ci**Example:**\
6682e5b6d6dSopenharmony_ci**Locale="sv" (Swedish)** "Kypper" < "Köpfe"\
6692e5b6d6dSopenharmony_ci**Locale="de" (German)** "Köpfe" < "Kypper"
6702e5b6d6dSopenharmony_ci
The **Strength** attribute determines whether accents or
case are taken into account when collating or matching text. (In writing
systems without case or accents, it controls similarly important features.) The
default strength setting usually does not need to be changed for collating
(sorting), but often needs to be changed when **matching** (e.g. SELECT). The
possible values include Default (D), Primary (1), Secondary (2), Tertiary (3),
Quaternary (4), and Identical (I).
6782e5b6d6dSopenharmony_ci
6792e5b6d6dSopenharmony_ciFor example, people may choose to ignore accents or ignore accents and case when
6802e5b6d6dSopenharmony_cisearching for text.
6812e5b6d6dSopenharmony_ci
6822e5b6d6dSopenharmony_ciAlmost all characters are distinguished by the first three levels, and in most
6832e5b6d6dSopenharmony_cilocales the default value is thus Tertiary. However, if Alternate is set to be
6842e5b6d6dSopenharmony_ciShifted, then the Quaternary strength (4) can be used to break ties among
whitespace, punctuation, and symbols that would otherwise be ignored. If very
fine distinctions among characters are required, then the Identical strength (I)
can be used. For example, Identical strength distinguishes between the
**Mathematical Bold Small A** and the **Mathematical Italic Small A**. (For more
examples, look at the cells with white backgrounds in the collation charts.)
However, using a level higher than Tertiary, such as the Identical strength,
results in significantly longer sort keys and slower string comparison
performance for equal strings.
6932e5b6d6dSopenharmony_ci
6942e5b6d6dSopenharmony_ci**Example:**\
6952e5b6d6dSopenharmony_ci**S=1** role = Role = rôle\
6962e5b6d6dSopenharmony_ci**S=2** role = Role < rôle\
6972e5b6d6dSopenharmony_ci**S=3** role < Role < rôle
6982e5b6d6dSopenharmony_ci
6992e5b6d6dSopenharmony_ciThe **Case_Level** attribute is used when ignoring accents
7002e5b6d6dSopenharmony_ci**but not** case. In such a situation, set Strength to be Primary, and
7012e5b6d6dSopenharmony_ciCase_Level to be On. In most locales, this setting is Off by default. There is a
7022e5b6d6dSopenharmony_cismall string comparison performance and sort key impact if this attribute is set
7032e5b6d6dSopenharmony_cito be On.
7042e5b6d6dSopenharmony_ci
7052e5b6d6dSopenharmony_ci**Example:**\
7062e5b6d6dSopenharmony_ci**S=1, E=X** role = Role = rôle\
7072e5b6d6dSopenharmony_ci**S=1, E=O** role = rôle < Role
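
The Strength and Case_Level examples above can be reproduced with a toy multi-level key. This is a minimal sketch that uses Python's `unicodedata` module instead of ICU. It assumes Latin text, treats any decomposable letter as "accented", and lets the tertiary level carry only case. `toy_levels` and `toy_key` are hypothetical names.

```python
import unicodedata


def toy_levels(text):
    """Return (primary, secondary, case) weight lists for Latin text."""
    primary, secondary, case = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        primary.append(decomposed[0].lower())                # base letter only
        secondary.append(1 if len(decomposed) > 1 else 0)    # accented or not
        case.append(1 if ch.isupper() else 0)                # lowercase first
    return primary, secondary, case


def toy_key(text, strength=3, case_level=False):
    """Toy sort key honoring Strength (1-3) and Case_Level."""
    primary, secondary, case = toy_levels(text)
    key = [primary]
    if strength >= 2:
        key.append(secondary)
    if case_level:
        key.append(case)   # the case level sits before the tertiary level
    if strength >= 3:
        key.append(case)   # in this toy, tertiary weights carry only case
    return key


# Strength 1 with Case_Level on: accents ignored, case still significant.
accent_blind = toy_key("role", 1, True) == toy_key("r\u00f4le", 1, True)
```

Comparing the keys with `<` and `==` at each strength gives the same relations as the example lines.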
7082e5b6d6dSopenharmony_ci
7092e5b6d6dSopenharmony_ciThe **Case_First** attribute is used to control whether
7102e5b6d6dSopenharmony_ciuppercase letters come before lowercase letters or vice versa, in the absence of
7112e5b6d6dSopenharmony_ciother differences in the strings. The possible values are Uppercase_First (U)
7122e5b6d6dSopenharmony_ciand Lowercase_First (L), plus the standard Default and Off. There is almost no
7132e5b6d6dSopenharmony_cidifference between the Off and Lowercase_First options in terms of results, so
7142e5b6d6dSopenharmony_citypically users will not use Lowercase_First: only Off or Uppercase_First.
(People interested in the detailed differences between X and L should consult
the [Collation Customization](customization/index.md) chapter.)
7172e5b6d6dSopenharmony_ciSpecifying either L or U won't affect string comparison performance, but will
7182e5b6d6dSopenharmony_ciaffect the sort key length.
7192e5b6d6dSopenharmony_ci
7202e5b6d6dSopenharmony_ci**Example:**\
7212e5b6d6dSopenharmony_ci**C=X or C=L** "china" < "China" < "denmark" < "Denmark"\
7222e5b6d6dSopenharmony_ci**C=U** "China" < "china" < "Denmark" < "denmark"
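
A toy key shows how flipping the case weights produces the two orderings in the example. The sketch assumes plain ASCII words and a hypothetical helper `toy_case_first_key`. It does not model the finer differences between Off and Lowercase_First.

```python
def toy_case_first_key(text, upper_first=False):
    """Toy key: letters compared case-insensitively, then by case."""
    primary = [ch.lower() for ch in text]
    if upper_first:
        case = [0 if ch.isupper() else 1 for ch in text]   # uppercase sorts first
    else:
        case = [1 if ch.isupper() else 0 for ch in text]   # lowercase sorts first
    return primary, case


words = ["Denmark", "china", "denmark", "China"]
lower_first = sorted(words, key=toy_case_first_key)
upper_first = sorted(words, key=lambda w: toy_case_first_key(w, upper_first=True))
```

In both results the words stay grouped by their letters. Only the order within each case pair changes.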
7232e5b6d6dSopenharmony_ci
The **Alternate** attribute is used to control the handling of
the so-called **variable** characters in the UCA: whitespace, punctuation and
7262e5b6d6dSopenharmony_cisymbols. If Alternate is set to Non-Ignorable (N), then differences among these
7272e5b6d6dSopenharmony_cicharacters are of the same importance as differences among letters. If Alternate
7282e5b6d6dSopenharmony_ciis set to Shifted (S), then these characters are of only minor importance. The
7292e5b6d6dSopenharmony_ciShifted value is often used in combination with Strength set to Quaternary. In
7302e5b6d6dSopenharmony_cisuch a case, white-space, punctuation, and symbols are considered when comparing
7312e5b6d6dSopenharmony_cistrings, but only if all other aspects of the strings (base letters, accents,
7322e5b6d6dSopenharmony_ciand case) are identical. If Alternate is not set to Shifted, then there is no
7332e5b6d6dSopenharmony_cidifference between a Strength of 3 and a Strength of 4.
7342e5b6d6dSopenharmony_ci
7352e5b6d6dSopenharmony_ciFor more information and examples, see
7362e5b6d6dSopenharmony_ci[Variable_Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) in
7372e5b6d6dSopenharmony_cithe UCA.
7382e5b6d6dSopenharmony_ci
7392e5b6d6dSopenharmony_ciThe reason the Alternate values are not simply On and Off is that
7402e5b6d6dSopenharmony_ciadditional Alternate values may be added in the future.
7412e5b6d6dSopenharmony_ci
7422e5b6d6dSopenharmony_ciThe UCA option
7432e5b6d6dSopenharmony_ci**Blanked** is expressed with Strength set to 3, and Alternate set to Shifted.
7442e5b6d6dSopenharmony_ci
7452e5b6d6dSopenharmony_ciThe default for most locales is Non-Ignorable. If Shifted is selected, it may be
7462e5b6d6dSopenharmony_cislower if there are many strings that are the same except for punctuation; sort
7472e5b6d6dSopenharmony_cikey length will not be affected unless the strength level is also increased.
7482e5b6d6dSopenharmony_ci
7492e5b6d6dSopenharmony_ci**Example:**\
7502e5b6d6dSopenharmony_ci**S=3, A=N** di Silva < Di Silva < diSilva < U.S.A. < USA\
7512e5b6d6dSopenharmony_ci**S=3, A=S** di Silva = diSilva < Di Silva < U.S.A. = USA\
7522e5b6d6dSopenharmony_ci**S=4, A=S** di Silva < diSilva < Di Silva < U.S.A. < USA
7532e5b6d6dSopenharmony_ci
7542e5b6d6dSopenharmony_ciThe **Variable_Top** attribute is only meaningful if the
7552e5b6d6dSopenharmony_ciAlternate attribute is not set to Non-Ignorable. In such a case, it controls
which characters count as ignorable. The \<hex digits\> value specifies the
"highest" character (in UCA order) whose weight is to be considered ignorable.

Thus, for example, if a user wanted white-space to be ignorable, but not any
visible characters, then they would use the value Variable_Top=0020 (space). The
hex digits should denote only a single character. All characters of the same
primary weight are equivalent, so Variable_Top=3000 (ideographic space) has the
same effect as Variable_Top=0020.
7642e5b6d6dSopenharmony_ci
7652e5b6d6dSopenharmony_ciThis setting (alone) has little impact on string comparison performance; setting
7662e5b6d6dSopenharmony_ciit lower or higher will make sort keys slightly shorter or longer respectively.
7672e5b6d6dSopenharmony_ci
7682e5b6d6dSopenharmony_ci**Example:**\
7692e5b6d6dSopenharmony_ci**S=3, A=S** di Silva = diSilva < U.S.A. = USA\
7702e5b6d6dSopenharmony_ci**S=3, A=S, T=0020** di Silva = diSilva < U.S.A. < USA
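
The Alternate and Variable_Top examples can be modeled with a toy key in which shifted characters vanish from the first three levels and reappear on the fourth. This sketch is an approximation under stated assumptions: ASCII input only, code points standing in for collation weights, every non-alphanumeric character treated as variable, and no secondary level. `toy_alternate_key` is a hypothetical name.

```python
HIGH = 0x110000  # larger than any code point; quaternary weight of regular characters


def toy_alternate_key(text, strength=3, shifted=False, variable_top=None):
    """Toy key for Alternate and Variable_Top on ASCII text."""
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        is_variable = not ch.isalnum() and (
            variable_top is None or ord(ch) <= variable_top)
        if shifted and is_variable:
            # Shifted: ignored on levels 1-3, visible only on level 4.
            quaternary.append(ord(ch))
            continue
        # Non-alphanumeric characters get primaries below all letters and digits.
        primary.append((1, ch.lower()) if ch.isalnum() else (0, ch))
        tertiary.append(1 if ch.isupper() else 0)
        quaternary.append(HIGH)
    key = [primary, tertiary]
    if strength >= 4:
        key.append(quaternary)
    return key


# Shifted at strength 3: the space in "di Silva" no longer matters.
same = toy_alternate_key("di Silva", 3, True) == toy_alternate_key("diSilva", 3, True)
```

Passing `variable_top=0x20` keeps the space ignorable but makes the periods in "U.S.A." significant again, as in the Variable_Top example.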
7712e5b6d6dSopenharmony_ci
7722e5b6d6dSopenharmony_ciThe **Normalization** setting determines whether
7732e5b6d6dSopenharmony_citext is thoroughly normalized or not in comparison. Even if the setting is off
7742e5b6d6dSopenharmony_ci(which is the default for many locales), text as represented in common usage
7752e5b6d6dSopenharmony_ciwill compare correctly (for details, see [UTN
7762e5b6d6dSopenharmony_ci#5](http://www.unicode.org/notes/tn5/)). Only if the accent marks are in
7772e5b6d6dSopenharmony_cinon-canonical order will there be a problem. If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string
comparison performance cost if this attribute is On, depending on the frequency
of sequences that require normalization. There is no significant effect on sort
key length. If the input text is known to be in NFD or NFKD normalization forms,
there is no need to enable this Normalization option.
7832e5b6d6dSopenharmony_ci
7842e5b6d6dSopenharmony_ci**Example:**\
7852e5b6d6dSopenharmony_ci**N=X** ä = a + ◌̈ < ä + ◌̣ < ạ + ◌̈\
7862e5b6d6dSopenharmony_ci**N=O** ä = a + ◌̈ < ä + ◌̣ = ạ + ◌̈
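
The canonical equivalence behind this example can be checked with Python's standard `unicodedata` module. The snippet does not use ICU. It only shows that the two sequences in the last column differ as raw code points but become identical once their combining marks are put into canonical order.

```python
import unicodedata

precomposed = "\u00e4"             # ä
decomposed = "a\u0308"             # a + combining diaeresis
umlaut_then_dot = "\u00e4\u0323"   # ä + combining dot below
dot_then_umlaut = "\u1ea1\u0308"   # ạ + combining diaeresis


def nfd(text):
    """Canonical decomposition, including reordering of combining marks."""
    return unicodedata.normalize("NFD", text)


# The raw code point sequences differ...
raw_equal = umlaut_then_dot == dot_then_umlaut
# ...but NFD orders the marks canonically, giving a + U+0323 + U+0308 for both.
normalized_equal = nfd(umlaut_then_dot) == nfd(dot_then_umlaut)
```

A collator with normalization On treats such inputs as equal, while one with normalization Off may see the mark order as a difference.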
7872e5b6d6dSopenharmony_ci
7882e5b6d6dSopenharmony_ciSome **French** dictionary ordering traditions sort strings with
7892e5b6d6dSopenharmony_cidifferent accents from the back of the string. This attribute is automatically
7902e5b6d6dSopenharmony_ciset to On for the Canadian French locale (fr_CA). Users normally would not need
7912e5b6d6dSopenharmony_cito explicitly set this attribute. There is a string comparison performance cost
7922e5b6d6dSopenharmony_ciwhen it is set On, but sort key length is unaffected.
7932e5b6d6dSopenharmony_ci
7942e5b6d6dSopenharmony_ci**Example:**\
7952e5b6d6dSopenharmony_ci**F=X** cote < coté < côte < côté\
7962e5b6d6dSopenharmony_ci**F=O** cote < côte < coté < côté
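
Backwards accent ordering can be sketched by reversing the secondary weights before comparison. The toy key below assumes Latin text and one accent flag per letter, derived with Python's `unicodedata` module. `toy_accent_key` is a hypothetical helper and not the ICU mechanism.

```python
import unicodedata


def toy_accent_key(text, french=False):
    """Toy key: base letters first, then accent flags (reversed if french)."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        primary.append(decomposed[0])
        secondary.append(1 if len(decomposed) > 1 else 0)
    if french:
        secondary.reverse()   # compare accents starting from the end of the string
    return primary, secondary


# côté, coté, côte, cote
words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
forward = sorted(words, key=toy_accent_key)
backward = sorted(words, key=lambda w: toy_accent_key(w, french=True))
```

With the flags reversed, the accent nearest the end of the word decides first, which moves "côte" ahead of "coté".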
7972e5b6d6dSopenharmony_ci
Compatibility with JIS X 4061 requires the introduction of an
additional level to distinguish **Hiragana** and Katakana characters. If
compatibility with that standard is required, then this attribute is set On, and
the strength should be set to at least Quaternary.

This attribute is an implementation detail of the CLDR Japanese tailoring. The
implementation might change to use a different mechanism to achieve the same
Japanese sort order. Since ICU 50, this attribute is no longer settable.
8062e5b6d6dSopenharmony_ci
8072e5b6d6dSopenharmony_ci**Example:**\
8082e5b6d6dSopenharmony_ci**H=X, S=4** きゅう = キュウ < きゆう = キユウ\
8092e5b6d6dSopenharmony_ci**H=O, S=4** きゅう < キュウ < きゆう < キユウ
8102e5b6d6dSopenharmony_ci
8112e5b6d6dSopenharmony_ci> :point_right: **Note** If attributes in collator name are not overridden,
8122e5b6d6dSopenharmony_ci> it is assumed that they are the same as for the given locale.
8132e5b6d6dSopenharmony_ci> For example, a collator opened with an empty
> string has the same attribute settings as **AN_CX_EX_FX_HX_KX_NX_S3_T0000**.
8152e5b6d6dSopenharmony_ci
8162e5b6d6dSopenharmony_ci### Summary of Value Abbreviations
8172e5b6d6dSopenharmony_ci
8182e5b6d6dSopenharmony_ciValue         | Abbreviation
8192e5b6d6dSopenharmony_ci------------- | ------------
8202e5b6d6dSopenharmony_ciDefault       | D
8212e5b6d6dSopenharmony_ciOn            | O
8222e5b6d6dSopenharmony_ciOff           | X
8232e5b6d6dSopenharmony_ciPrimary       | 1
8242e5b6d6dSopenharmony_ciSecondary     | 2
8252e5b6d6dSopenharmony_ciTertiary      | 3
8262e5b6d6dSopenharmony_ciQuaternary    | 4
8272e5b6d6dSopenharmony_ciIdentical     | I
8282e5b6d6dSopenharmony_ciShifted       | S
8292e5b6d6dSopenharmony_ciNon-Ignorable | N
8302e5b6d6dSopenharmony_ciLower-First   | L
8312e5b6d6dSopenharmony_ciUpper-First   | U