---
layout: default
title: Concepts
nav_order: 1
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation Concepts
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

The previous section demonstrated many of the requirements imposed on string
comparison routines that try to correctly collate strings according to
conventions of more than a hundred different languages, written in many
different scripts. This section describes the principles and architecture behind
the ICU Collation Service.

## Sortkeys vs Comparison

Sort keys are most useful in databases, where the overhead of calling a function
for each comparison is very large.

Generating a sort key from a Collator is many times more expensive than doing a
compare with the Collator (for common use cases), at least when the two functions
are called from Java or C. So for those languages, unless there is a very large
number of comparisons, it is better to call the compare function.

Here is an example, with a little back-of-the-envelope calculation. Let's
suppose that with a given language on a given platform, the compare performance
(CP) is 100 times faster than sortKey performance (SP), and that you are doing a
binary search of a list with 1,000 elements. The binary comparison performance
is BP. We'd do about 10 comparisons, getting:

compare: 10 \* CP

sortkey: 1 \* SP + 10 \* BP

Even if BP is free, compare would be better. One has to get up to where log2(n)
= 100 before the two approaches break even.

But even this calculation is only a rough guide. First, the binary comparison is
not completely free. Second, the performance of the compare function varies
radically with the source data. We optimized for maximizing performance of
collation in sorting and binary search, so comparing strings that are "close" is
optimized to be much faster than comparing strings that are "far away". That
optimization is important because normal sort/lookup operations compare close
strings far more often -- think of binary search, where the last few comparisons
are always with the closest strings. So even the above calculation is not very
accurate.
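The arithmetic above can be sketched in a few lines of Python. The function names and the cost units (CP = 1, SP = 100, BP = 0) are assumptions taken from the example, not measurements of ICU.

```python
import math

def compare_cost(n, cp=1.0):
    """Cost of a binary search over n items using collator compare calls."""
    return math.ceil(math.log2(n)) * cp

def sortkey_cost(n, sp=100.0, bp=0.0):
    """Cost of one sort key for the search term plus binary key comparisons."""
    return sp + math.ceil(math.log2(n)) * bp

n = 1000
print(compare_cost(n))   # about 10 comparisons, each costing CP
print(sortkey_cost(n))   # one sort key costing SP; binary compares assumed free
```

With these assumed costs, the compare approach stays cheaper until the list is so large that log2(n) reaches 100.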

## Comparison Levels

In general, when comparing and sorting objects, some properties can take
precedence over others. For example, in geometry, you might consider first the
number of sides a shape has, followed by the number of sides of equal length.
This causes triangles to be sorted together, then rectangles, then pentagons,
etc. Within each category, the shapes would be ordered according to whether they
had 0, 2, 3 or more sides of the same length. However, this is not the only way
the shapes can be sorted. For example, it might be preferable to sort shapes by
color first, so that all red shapes are grouped together, then blue, etc.
Another approach would be to sort the shapes by the amount of area they enclose.

Similarly, character strings have properties, some of which can take precedence
over others. There is more than one way to prioritize the properties.

For example, a common approach is to distinguish characters first by their
unadorned base letter (for example, without accents, vowels or tone marks), then
by accents, and then by the case of the letter (upper vs. lower). Ideographic
characters might be sorted by their component radicals and then by the number of
strokes it takes to draw the character.
An alternative ordering would be to sort these characters by strokes first and
then by their radicals.

The ICU Collation Service supports many levels of comparison (named "Levels",
but also known as "Strengths"). Having these categories enables ICU to sort
strings precisely according to local conventions. In addition, because the
levels can be selectively employed, searching for a string in text can be
performed with various matching conditions.

Performance optimizations have been made for ICU collation with the default
level settings. Performance specific impacts are discussed in the Performance
section below.

The following list gives the name of each level and an example of its use:

1.  Primary Level: Typically, this is used to denote differences between base
    characters (for example, "a" < "b"). It is the strongest difference. For
    example, dictionaries are divided into different sections by base character.
    This is also called the level-1 strength.

2.  Secondary Level: Accents in the characters are considered secondary
    differences (for example, "as" < "às" < "at"). Other differences between
    letters can also be considered secondary differences, depending on the
    language. A secondary difference is ignored when there is a primary
    difference anywhere in the strings. This is also called the level-2
    strength.
    Note: In some languages (such as Danish), certain accented letters are
    considered to be separate base characters. In most languages, however, an
    accented letter only has a secondary difference from the unaccented version
    of that letter.

3.  Tertiary Level: Upper and lower case differences in characters are
    distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In
    addition, a variant of a letter differs from the base form on the tertiary
    level (such as "A" and "Ⓐ"). Another example is the difference between large
    and small Kana. A tertiary difference is ignored when there is a primary or
    secondary difference anywhere in the strings. This is also called the
    level-3 strength.

4.  Quaternary Level: When punctuation is ignored (see Ignoring Punctuation
    (§)) at level 1-3, an additional level can be used to distinguish words with
    and without punctuation (for example, "a-b" < "ab" < "aB"). This difference
    is ignored when there is a primary, secondary or tertiary difference. This
    is also known as the level-4 strength. The quaternary level should only be
    used if ignoring punctuation is required or when processing Japanese text
    (see Hiragana processing (§)).

5.  Identical Level: When all other levels are equal, the identical level is
    used as a tiebreaker. The Unicode code point values of the NFD form of each
    string are compared at this level, only when there is no difference at
    levels 1-4. For example, Hebrew cantillation marks are only distinguished
    at this level. This level should be used sparingly, because strings that
    differ only in their code point values are extremely rare.
    Using this level substantially decreases the performance for
    both incremental comparison and sort key generation (as well as increasing
    the sort key length). It is also known as level 5 strength.
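The first three levels can be illustrated with a small Python model. This is only a sketch under simplifying assumptions (base letters as primary weights, combining marks as secondary weights, uppercase as the only tertiary difference); the `sort_key` function is hypothetical and is not how ICU assigns weights.

```python
import unicodedata

def sort_key(text, strength=3):
    """Toy multi-level key: (primary, secondary, tertiary) cut off at `strength`."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # An accent adds a secondary weight to the preceding base letter.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())
        secondary.append(())
        tertiary.append(1 if ch.isupper() else 0)
    return (primary, secondary, tertiary)[:strength]

print(sorted(["at", "às", "as"], key=sort_key))
print(sorted(["aò", "Ao", "ao"], key=sort_key))
# Strength 1 ignores accents and case entirely.
print(sort_key("as", strength=1) == sort_key("às", strength=1))
```

Because the levels are compared in order, a difference at a weaker level only matters when all stronger levels are equal, which is the behavior the list above describes.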

## Backward Secondary Sorting

Some languages require words to be ordered on the secondary level according to
the *last* accent difference, as opposed to the *first* accent difference. This
was previously the default for all French locales, based on some French
dictionary ordering traditions, but is currently only applicable to Canadian
French (locale **fr_CA**), for conformance with the [Canadian sorting
standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
ordering is only noticeable for a small number of pairs of real words. For more
information see [UCA: Contextual
Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).

Example:

Forward secondary | Backward secondary
----------------- | ------------------
cote              | cote
coté              | côte
côte              | coté
côté              | côté
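The two columns of the table can be reproduced with a toy model in which the secondary weights are simply compared in reverse order. The helper names below are hypothetical, and the weights (combining-mark code points) are an assumption made for illustration, not ICU's actual weights.

```python
import unicodedata

def base_letters(word):
    """Primary level: the base letters with accents removed."""
    return [ch for ch in unicodedata.normalize("NFD", word.lower())
            if not unicodedata.combining(ch)]

def accent_weights(word):
    """Secondary level: one tuple of combining marks per base letter."""
    weights = []
    for ch in unicodedata.normalize("NFD", word):
        if not unicodedata.combining(ch):
            weights.append(())
        elif weights:
            weights[-1] += (ord(ch),)
    return weights

def sort_key(word, backward_secondary=False):
    secondary = accent_weights(word)
    if backward_secondary:
        secondary.reverse()  # the last accent difference now decides first
    return (base_letters(word), secondary)

words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=sort_key)
backward = sorted(words, key=lambda w: sort_key(w, backward_secondary=True))
print(forward)
print(backward)
```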

## Contractions

A contraction is a sequence consisting of two or more letters. It is considered
a single letter in sorting.

For example, in the traditional Spanish sorting order, "ch" is considered a
single letter. All words that begin with "ch" sort after all other words
beginning with "c", but before words starting with "d".

Other examples of contractions are "ch" in Czech, which sorts after "h", and
"lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
respectively.

Example:

Order without contraction | Order with contraction "lj" sorting after letter "l"
------------------------- | ----------------------------------------------------
la                        | la
li                        | li
lj                        | lk
lja                       | lz
ljz                       | lj
lk                        | lja
lz                        | ljz
ma                        | ma

Contracting sequences such as the above are not very common in most languages.

> :point_right: **Note** Since ICU 2.2, and as required by the UCA,
> if a completely ignorable code point
> appears in text in the middle of a contraction, it will not break the contraction.
> For example, in Czech sorting, cU+0000h will sort as if it were ch.
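A minimal Python sketch of the "lj" example: the string is split into collation elements, and the pair "lj" is mapped to a single element that sorts directly after "l". The `ALPHABET` and `CONTRACTIONS` tables and the `collation_elements` function are hypothetical and only cover lowercase ASCII letters.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Hypothetical toy tailoring: "lj" is one element placed right after "l".
CONTRACTIONS = {"lj": (ALPHABET.index("l"), 1)}

def collation_elements(word, contractions=CONTRACTIONS):
    """Map a word to a list of (letter position, sub-position) elements."""
    elements, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in contractions:
            elements.append(contractions[pair])  # two letters, one element
            i += 2
        else:
            elements.append((ALPHABET.index(word[i]), 0))
            i += 1
    return elements

words = ["la", "li", "lj", "lja", "ljz", "lk", "lz", "ma"]
without = sorted(words, key=lambda w: collation_elements(w, {}))
with_lj = sorted(words, key=collation_elements)
print(without)
print(with_lj)
```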

## Expansions

If a letter sorts as if it were a sequence of more than one letter, it is called
an expansion.

For example, in German phonebook sorting (de@collation=phonebook or BCP 47
de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae".
All words starting with "ä" will sort between words starting with "ad" and words
starting with "af".

In the case of Unicode encoding, characters can often be represented either as
pre-composed characters or in decomposed form. For example, the letter "à" can
be represented in its decomposed (a+\`) and pre-composed (à) form. Most
applications do not want to distinguish text by the way it is encoded. A search
for "à" should find all instances of the letter, regardless of whether the
instance is in pre-composed or decomposed form. Therefore, either form of the
letter must result in the same sort ordering. The architecture of the ICU
Collation Service supports this.
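The following Python sketch illustrates both points at the primary level only: an expansion table in the spirit of German phonebook sorting, and the requirement that pre-composed and decomposed input produce the same key. The `EXPANSIONS` table and `primary_key` function are assumptions for illustration; real phonebook tailoring also keeps weaker-level differences that this toy drops.

```python
import unicodedata

# Hypothetical toy expansion table: each umlaut sorts as two letters.
EXPANSIONS = {
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
}

def primary_key(word):
    """Compose first so both encodings hit the table, then expand."""
    composed = unicodedata.normalize("NFC", word.lower())
    return "".join(EXPANSIONS.get(ch, ch) for ch in composed)

print(sorted(["afrika", "\u00e4rger", "adler"], key=primary_key))
# Decomposed input (a + U+0308) yields the same key as pre-composed ä.
print(primary_key("a\u0308rger") == primary_key("\u00e4rger"))
```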

## Contractions Producing Expansions

It is possible to have contractions that produce expansions.

One example occurs in Japanese, where the vowel with a prolonged sound mark is
treated as equivalent to the long vowel version:

カアー<<< カイー and\
キイー<<< キイー

> :point_right: **Note** Since ICU 2.0 Japanese tailoring uses
> [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
> instead of contractions producing expansions.

## Normalization

In the section on expansions, we discussed that text in Unicode can often be
represented in either pre-composed or decomposed forms. There are other types of
equivalences possible with Unicode, including Canonical and Compatibility. The
process of Normalization ensures that text is written in a predictable way so
that searches are not made unnecessarily complicated by having to match on
equivalences. Not all text is normalized, however, so it is useful to have a
collation service that can address text that is not normalized, but do so with
efficiency.

The ICU Collation Service handles un-normalized text properly, producing the
same results as if the text were normalized.

In practice, most data that is encountered is in normalized or semi-normalized
form already. The ICU Collation Service is designed so that it can process a
wide range of normalized or un-normalized text without a need for normalization
processing. When a case is encountered that requires normalization, the ICU
Collation Service drops into code specific to this purpose. This maximizes
performance for the majority of text that does not require normalization.

In addition, if the text is known with certainty not to contain un-normalized
text, then even the overhead of checking for normalization can be eliminated.
The ICU Collation Service has the ability to turn Normalization Checking either
on or off. If Normalization Checking is turned off, it is the user's
responsibility to ensure that all text is already in the appropriate form. This
is the case for the great majority of the world's languages, so normalization
checking is turned off by default for most locales.

If the text requires normalization processing, Normalization Checking should be
on. Any language that uses multiple combining characters such as Arabic, ancient
Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking
to be on, or the text to go through a normalization process before collation.

For more information about Normalization related reordering please see
[Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
[UAX #15](http://www.unicode.org/reports/tr15/).

> :point_right: **Note** ICU supports two modes of normalization: on and off.
> java.text.\* classes offer compatibility decomposition mode, which is not supported in ICU.
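The idea of a check followed by normalization only when needed can be sketched with Python's standard `unicodedata` module (`is_normalized` requires Python 3.8 or later). The `compare_canonical` function is hypothetical, and a plain code point comparison stands in for a real collation comparison.

```python
import unicodedata

def compare_canonical(a, b, check_normalization=True):
    """Return -1, 0 or 1; normalize only when the quick check says it is needed."""
    if check_normalization:
        if not unicodedata.is_normalized("NFD", a):
            a = unicodedata.normalize("NFD", a)
        if not unicodedata.is_normalized("NFD", b):
            b = unicodedata.normalize("NFD", b)
    # Stand-in for collation: compare the code point sequences.
    return (a > b) - (a < b)

# Two combining marks in different orders: canonically equivalent text.
x = "a\u0323\u0302"  # dot below, then circumflex (canonical order)
y = "a\u0302\u0323"  # circumflex, then dot below
print(compare_canonical(x, y))                             # checking on
print(compare_canonical(x, y, check_normalization=False))  # checking off
```

With checking off, the caller is responsible for supplying text that is already in the expected form, which mirrors the trade-off described above.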

## Ignoring Punctuation

In some cases, punctuation can be ignored while searching or sorting data. For
example, this enables a search for "biweekly" to also return instances of
"bi-weekly". In other cases, it is desirable for punctuated text to be
distinguished from text without punctuation, but to have the text sort close
together.

These two behaviors can be accomplished if there is a way for a character to be
ignored on all levels except for the quaternary level. If this is the case, then
two strings which compare as identical on the first three levels (base letter,
accents, and case) are then distinguished at the fourth level based on their
punctuation (if any). If the comparison function ignores differences at the
fourth level, then strings that differ by punctuation only are compared as
equal.

The following table shows the results of sorting a list of terms in 3 different
ways. In the first column, punctuation characters (space " ", and hyphen "-")
are not ignored (" " < "-" < "b"). In the second column, punctuation characters
are ignored in the first 3 levels and compared only in the fourth level. In the
third column, punctuation characters are ignored in the first 3 levels and the
fourth level is not considered. In the last column, punctuated terms are
equivalent to the identical terms without punctuation.

For more options and details see the [“Ignore Punctuation”
Options](customization/ignorepunct.md) page.

Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
------------- | --------------------------------- | -------------------------------
black bird    | black bird                        | **black bird**
black Bird    | black-bird                        | **black-bird**
black birds   | blackbird                         | **blackbird**
black-bird    | black Bird                        | *black Bird*
black-Bird    | black-Bird                        | *black-Bird*
black-birds   | blackBird                         | *blackBird*
blackbird     | black birds                       | black birds
blackBird     | black-birds                       | black-birds
blackbirds    | blackbirds                        | blackbirds

> :point_right: **Note** Strings shown in the same font style (bold, italic, or
> regular) in the last column are compared as equal by the ICU Collator.\
> Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that
> follow shifted code points will be completely ignored. This means that an accent
> following a space will compare as if it were a space alone.
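The three columns of the table can be reproduced with a toy Python model. It is a sketch under stated assumptions: only space and hyphen are treated as punctuation, the weights in `PUNCTUATION` and the `HIGH` quaternary weight are made up for illustration, and accents are not modeled. The `sort_key` function is hypothetical.

```python
PUNCTUATION = {" ": 1, "-": 2}  # assumed weights: space < hyphen
HIGH = 0xFFFF                   # quaternary weight of any non-punctuation character

def sort_key(text, shifted=True, strength=4):
    """Toy key of (primary, tertiary[, quaternary]) levels."""
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if shifted and ch in PUNCTUATION:
            # Ignored on levels 1-3; only visible on the quaternary level.
            quaternary.append(PUNCTUATION[ch])
            continue
        if ch in PUNCTUATION:
            primary.append((0, PUNCTUATION[ch]))   # sorts before every letter
        else:
            primary.append((1, ord(ch.lower())))
        tertiary.append(1 if ch.isupper() else 0)
        quaternary.append(HIGH)
    key = (primary, tertiary)
    return key + (quaternary,) if strength >= 4 else key

terms = ["blackbirds", "blackBird", "blackbird", "black-birds", "black-Bird",
         "black-bird", "black birds", "black Bird", "black bird"]
non_ignorable = sorted(terms, key=lambda t: sort_key(t, shifted=False))
quaternary_order = sorted(terms, key=sort_key)
print(non_ignorable)
print(quaternary_order)
# At tertiary strength the punctuation variants compare as equal.
print(sort_key("black bird", strength=3) == sort_key("blackbird", strength=3))
```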

## Case Ordering

The tertiary level is used to distinguish text by case, by small versus large
Kana, and other letter variants as noted above.

Some applications prefer to emphasize case differences so that words starting
with the same case sort together. Some Japanese applications require the
difference between small and large Kana be emphasized over other tertiary
differences.

The UCA does not provide means to separate out either case or Kana differences
from the remaining tertiary differences. However, the ICU Collation Service has
two options that help customize case and/or Kana differences. Both options
are turned off by default.

### CaseFirst

The Case-first option makes case the most significant part of the tertiary
level. Primary and secondary levels are unaffected. With this option, words
starting with the same case sort together. The Case-first option can be set to
make either lowercase sort before uppercase or uppercase sort before lowercase.

Note: The case-first option does not constitute a separate level; it is simply a
reordering of the tertiary level.

ICU makes use of the following three case categories for sorting:

1.  uppercase: "ABC"

2.  mixed case: "Abc", "aBc"

3.  normal (lowercase or no case): "abc", "123"

Mixed case is always sorted between uppercase and normal case when the
"case-first" option is set.

### CaseLevel

The Case Level option makes a separate level for case differences. This is an
extra level positioned between secondary and tertiary. The case level is used in
Japanese to make the difference between small and large Kana more important than
the other tertiary differences. It also can be used to ignore other tertiary
differences, or even secondary differences. This is especially useful in
matching. For example, if the strength is set to primary only (level-1) and the
case level is turned on, the comparison ignores accents and tertiary differences
except for case. The contents of the case level are affected by the case-first
option.

The case level is independent of the strength of comparison. It is possible to
have a collator set to primary strength with the case level turned on. This
provides for comparison that takes into account the case differences, while at
the same time ignoring accents and tertiary differences other than case. This
may be used in searching.

Example:

**Case-first off, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Lowercase-first, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
ähnlich\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit

**Uppercase-first, Case level off**

Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit\
apple\
ⓐⓟⓟⓛⓔ\
ähnlich

**Lowercase-first, Case level on**

apple\
Abernathy\
ⓐⓟⓟⓛⓔ\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Uppercase-first, Case level on**

Abernathy\
apple\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ⓐⓟⓟⓛⓔ\
Ähnlichkeit\
ähnlich
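The combination of primary strength with the case level turned on, as described in the CaseLevel section, can be modeled in Python as follows. This is a simplified sketch: the `sort_key` function and its parameters are hypothetical, only uppercase versus everything else is distinguished, and it does not attempt to reproduce the example lists above.

```python
import unicodedata

def sort_key(text, case_level=False, upper_first=False):
    """Primary-strength key with an optional case level; accents are ignored."""
    primary, case = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # secondary differences are dropped at primary strength
        primary.append(ch.lower())
        # Weight 0 sorts first: lowercase by default, uppercase if upper_first.
        case.append(0 if ch.isupper() == upper_first else 1)
    return (primary, case) if case_level else (primary,)

# Accents are ignored, but case is still significant.
print(sort_key("role", case_level=True) == sort_key("rôle", case_level=True))
print(sort_key("role", case_level=True) == sort_key("Role", case_level=True))
print(sorted(["Role", "role"], key=lambda w: sort_key(w, True, upper_first=True)))
```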

## Script Reordering

Script reordering allows scripts and some other groups of characters to be moved
relative to each other. This reordering is done on top of the DUCET/CLDR
standard collation order. Reordering can specify groups to be placed at the
start and/or the end of the collation order.

By default, reordering codes specified for the start of the order are placed in
the order given after several special non-script blocks. These special groups of
characters are space, punctuation, symbol, currency, and digit. Script groups
can be intermingled with these special non-script groups if those special groups
are explicitly specified in the reordering.

The special code `others` stands for any script that is not explicitly mentioned
in the list. Anything that is after `others` will go at the very end of the list
in the order given. For example, `[Grek, others, Latn]` will result in an
ordering that puts all scripts other than Greek and Latin between them.

### Examples:

Note: All examples below use the string equivalents for the scripts and reorder
codes that would be used in collator rules. The script and reorder code
constants that would be used in API calls will be different.

**Example 1:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

**Example 2:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

followed by: set reorder code - `[Hani]`\
result - `[space, punctuation, symbol, currency, digit, Hani, others]`

That is, setting a reordering always modifies
the DUCET/CLDR order, replacing whatever was previously set, rather than adding
on to it. In order to cumulatively modify an ordering, you have to retrieve the
existing ordering, modify it, and then set it.

**Example 3:**\
set reorder code - `[others, digit]`\
result - `[space, punctuation, symbol, currency, others, digit]`

**Example 4:**\
set reorder code - `[space, Grek, punctuation]`\
result - `[symbol, currency, digit, space, Grek, punctuation, others]`

**Example 5:**\
set reorder code - `[Grek, others, Hani]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`

**Example 6:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[NONE]`\
result - DUCET/CLDR

**Example 7:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[DEFAULT]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 8:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 9:**\
set reorder code - `[Hebr, Phnx]`\
result - error
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciBeginning with ICU 55, scripts only reorder together if they are primary-equal,
2e5b6d6dSopenharmony_cifor example Hiragana and Katakana.
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ciICU 4.8-54:
2e5b6d6dSopenharmony_ci
2e5b6d6dSopenharmony_ci*   Scripts were reordered in groups, each normally starting with a [Recommended
2e5b6d6dSopenharmony_ci    Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
2e5b6d6dSopenharmony_ci*   Reorder codes moved as a group (were “equivalent”) if their scripts shared a
2e5b6d6dSopenharmony_ci    primary-weight lead byte.
2e5b6d6dSopenharmony_ci*   For example, Hebr and Phnx were “equivalent” reordering codes and were
2e5b6d6dSopenharmony_ci    reordered together. Their order relative to each other could not be changed.
2e5b6d6dSopenharmony_ci*   Only any one code out of any group could be reordered, not multiple of the
2e5b6d6dSopenharmony_ci    same group.
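The replace-rather-than-append behavior and Examples 3 to 6 above can be reproduced with a small model. This is only a sketch in plain Python, not ICU's implementation: `SPECIAL_GROUPS`, `effective_order` and `ReorderState` are hypothetical names, and the model does not cover `NONE`, `DEFAULT`, the empty list, or the error case of Example 9. In ICU4C the real calls are `ucol_getReorderCodes()` and `ucol_setReorderCodes()`.

```python
# Simplified model of the reordering examples; not ICU's implementation.
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]


def effective_order(codes):
    """Return the group order that results from setting `codes`.

    Unlisted special groups keep their default order at the front, the
    listed codes follow in the order given, and "others" (all scripts not
    named explicitly) goes last unless it was listed.
    """
    leading = [g for g in SPECIAL_GROUPS if g not in codes]
    trailing = [] if "others" in codes else ["others"]
    return leading + list(codes) + trailing


class ReorderState:
    """Hypothetical stand-in for a collator's reorder-code setting."""

    def __init__(self):
        self.codes = []

    def set_codes(self, codes):
        # Setting replaces the previous list; it never appends to it.
        self.codes = list(codes)

    def get_codes(self):
        return list(self.codes)


state = ReorderState()
state.set_codes(["Grek"])
state.set_codes(["Hani"])                       # Grek is gone: list replaced
replaced = state.get_codes()

state.set_codes(["Grek"])
state.set_codes(state.get_codes() + ["Hani"])   # retrieve, modify, set
cumulative = state.get_codes()
```

The second half shows the retrieve, modify, set sequence that is needed to extend an existing ordering.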

## Sorting of Japanese Text (JIS X 4061)

Japanese standard JIS X 4061 requires two changes to the collation procedures:
special processing of Hiragana characters and (for performance reasons) prefix
analysis of text.

### Hiragana Processing

The JIS X 4061 standard requires more levels than the UCA provides. To offer a
conformant sorting order, ICU uses the quaternary level to distinguish between
Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana
symbols on the quaternary level, thus causing Hiragana sequences to sort before
corresponding Katakana sequences.

### Prefix Analysis

Another characteristic of sorting according to JIS X 4061 is a large number
of contractions followed by expansions (see
[Contractions Producing Expansions](#contractions-producing-expansions)).
This causes all the Hiragana and Katakana code points to be treated as
contractions, which reduces performance. The solution we adopted introduces the
prefix concept, which allows us to improve the performance of Japanese sorting.
More about this can be found in the [customization
chapter](customization/index.md).

## Thai/Lao reordering

UCA requires that certain Thai and Lao prevowels be reordered with a code point
following them. This option is always on in the ICU implementation, as
prescribed by the UCA.

This rule takes effect when:

1.  A Thai vowel in the range U+0E40..U+0E44 precedes a Thai consonant in the
    range U+0E01..U+0E2E, or

2.  A Lao vowel in the range U+0EC0..U+0EC4 precedes a Lao consonant in the
    range U+0E81..U+0EAE.

In these cases the vowel is placed after the consonant for collation purposes.
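The effect of this rule can be illustrated by rewriting a string so that each prevowel follows its consonant. This is a minimal sketch in plain Python: ICU applies the rule inside the collator and does not rewrite the input, and `logical_order` is a hypothetical helper name.

```python
import re

# Thai prevowels U+0E40..U+0E44 before consonants U+0E01..U+0E2E, and
# Lao prevowels U+0EC0..U+0EC4 before consonants U+0E81..U+0EAE.
_PREVOWEL_PAIR = re.compile(
    "([\u0E40-\u0E44])([\u0E01-\u0E2E])|([\u0EC0-\u0EC4])([\u0E81-\u0EAE])"
)


def logical_order(text):
    """Move each Thai or Lao prevowel after the consonant that follows it."""
    def swap(match):
        if match.group(1) is not None:          # Thai pair matched
            return match.group(2) + match.group(1)
        return match.group(4) + match.group(3)  # Lao pair matched
    return _PREVOWEL_PAIR.sub(swap, text)
```

A prevowel that is not followed by a consonant of the same script is left where it is.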

> :point_right: **Note** There is a difference between java.text.\* classes and ICU in regard to Thai
> reordering. The java.text.\* classes allow tailorings to turn off reordering by
> using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai
> prevowels.

## Space Padding

In many database products, fields are padded to a fixed width, for example with
spaces. To get correct results, the input to a Collator should omit any
superfluous trailing padding spaces. The
problem arises with contractions, expansions, or normalization. Suppose that
there are two fields, one containing "aed" and the other with "äd". German
phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will
compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" <
"aed". But if both fields are padded with spaces to a length of 4, then this
will reverse the order, since the second will compare as if it were one character
longer. In other words, when you start with strings 1 and 2

1  | a  | e  | d         | \<space\>
-- | -- | -- | --------- | ---------
2  | ä  | d  | \<space\> | \<space\>

they end up being compared on a primary level as if they were 1' and 2'

1' | a  | e  | d  | \<space\> | &nbsp;
-- | -- | -- | -- | --------- | ---------
2' | a  | e  | d  | \<space\> | \<space\>

Since 2' has an extra character (the extra space), it counts as having a primary
difference when it shouldn't. The correct result occurs when the trailing
padding spaces are removed, as in 1" and 2"

1" | a  | e  | d
-- | -- | -- | --
2" | a  | e  | d
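The three tables above can be mimicked with a toy primary key. This sketch makes two assumptions: the only expansion is "ä" to "ae", and every remaining character (including the space) counts as one primary element. `primary_key` and `strip_padding` are hypothetical helpers, not ICU functions.

```python
A_UMLAUT = "\u00e4"  # precomposed "ä"


def primary_key(text):
    """Toy primary key: expand 'ä' to 'ae', as German phonebook order does.

    A real collator computes far more than this; the point here is only
    that a padding space takes part in the comparison.
    """
    return tuple(text.replace(A_UMLAUT, "ae"))


def strip_padding(field):
    """Remove trailing padding spaces before the field reaches a collator."""
    return field.rstrip(" ")


padded_1 = "aed "               # string 1, padded to length 4
padded_2 = A_UMLAUT + "d  "     # string 2, padded to length 4

# Padded: the keys are "aed " and "aed  ", so a spurious difference appears.
padded_equal = primary_key(padded_1) == primary_key(padded_2)

# Stripped: both keys reduce to "aed" on the primary level.
stripped_equal = (primary_key(strip_padding(padded_1))
                  == primary_key(strip_padding(padded_2)))
```

Stripping is done on the field value before any collation call, so it applies equally to comparison and to sort key generation.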

## Collator naming scheme

***Starting with ICU 54, the following naming scheme and its API functions are deprecated.***
Use `ucol_open()` with language tag collation keywords instead
(see [Collation API Details](api.md)). For example, use
`ucol_open("de-u-co-phonebk-ka-shifted", &errorCode)` for German Phonebook order
with "ignore punctuation" mode.

When collating or matching text, a number of attributes can be used to affect
the desired result. The following describes the attributes, their values, their
effects, their normal usage, and the string comparison performance and sort key
length implications. It also includes single-letter abbreviations for both the
attributes and their values. These abbreviations allow a 'short-form'
specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which
can be used to specify that the desired options are: UCA version 4.0.0; ignore
spaces, punctuation and symbols; use Swedish linguistic conventions; compare
case-insensitively.

A number of attribute values are common across different attributes; these
include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
otherwise stated, the examples use the UCA alone with default settings.

> :point_right: **Note** In order to achieve uniqueness, a collator name always
> has the attribute abbreviations sorted.

### Main References

1.  For a full list of supported locales in ICU, see [Locale
    Explorer](https://icu4c-demos.unicode.org/icu-bin/locexp), which also contains
    an on-line demo showing sorting for each locale. The demo allows you to try
    different attribute values, to see how they affect sorting.

2.  To see tabular results for the UCA table itself, see the [Unicode Collation
    Charts](http://www.unicode.org/charts/collation/).

3.  For the UCA specification, see [UTS #10: Unicode Collation
    Algorithm](http://www.unicode.org/reports/tr10/).

4.  For more detail on the precise effects of these options, see [Collation
    Customization](customization/index.md).

#### Collator Naming Attributes

Attribute              | Abbreviation | Possible Values
---------------------- | ------------ | ---------------
Locale                 | L            | \<language\>
Script                 | Z            | \<script\>
Region                 | R            | \<region\>
Variant                | V            | \<variant\>
Keyword                | K            | \<keyword\>
&nbsp;                 | &nbsp;       | &nbsp;
Strength               | S            | 1, 2, 3, 4, I, D
Case_Level             | E            | X, O, D
Case_First             | C            | X, L, U, D
Alternate              | A            | N, S, D
Variable_Top           | T            | \<hex digits\>
Normalization Checking | N            | X, O, D
French                 | F            | X, O, D
Hiragana               | H            | X, O, D

#### Collator Naming Attribute Descriptions

The **Locale** attribute is typically the most
important attribute for correct sorting and matching, according to the user
expectations in different countries and regions. The default UCA ordering will
only sort a few languages such as Dutch and Portuguese correctly ("correctly"
meaning according to the normal expectations for users of the languages).
For other languages, a locale needs to be supplied so as to choose a
collator that is correctly **tailored** for that locale. The choice of a locale
will automatically preset the values for all of the attributes to something that
is reasonable for that locale. Thus most of the time the other attributes do not
need to be explicitly set. In some cases, the choice of locale will make a
difference in string comparison performance and/or sort key length.

In short attribute names,
`<language>_<script>_<region>_<variant>@collation=<keyword>` is
represented by: `L<language>_Z<script>_R<region>_V<variant>_K<keyword>`. Not
all the elements are required. Valid values for the locale elements are
generally the values that are valid for RFC 3066 locale naming.

**Example:**\
**Locale="sv" (Swedish)** "Kypper" < "Köpfe"\
**Locale="de" (German)** "Köpfe" < "Kypper"

The **Strength** attribute determines whether accents or
case are taken into account when collating or matching text. (In writing
systems without case or accents, it controls similarly important features.) The
default strength setting usually does not need to be changed for collating
(sorting), but often needs to be changed when **matching** (e.g. SELECT). The
possible values include Default (D), Primary (1), Secondary (2), Tertiary (3),
Quaternary (4), and Identical (I).

For example, people may choose to ignore accents or ignore accents and case when
searching for text.

Almost all characters are distinguished by the first three levels, and in most
locales the default value is thus Tertiary. However, if Alternate is set to be
Shifted, then the Quaternary strength (4) can be used to break ties among
whitespace, punctuation, and symbols that would otherwise be ignored. If very
fine distinctions among characters are required, then the Identical strength (I)
can be used (for example, Identical strength distinguishes between the
**Mathematical Bold Small A** and the **Mathematical Italic Small A**; for more
examples, look at the cells with white backgrounds in the collation charts).
However, using a strength higher than Tertiary, in particular the Identical
strength, results in significantly longer sort keys and slower string
comparison performance for equal strings.

**Example:**\
**S=1** role = Role = rôle\
**S=2** role = Role < rôle\
**S=3** role < Role < rôle
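The three lines of this example can be reproduced with a toy multi-level key. The sketch below is a teaching model in plain Python, not the UCA or ICU algorithm. It assumes precomposed Latin input, treats the base letter as the primary weight, the accents as the secondary weight and the case as the tertiary weight. `level_keys` and `sort_key` are hypothetical names.

```python
import unicodedata


def level_keys(text):
    """Return toy (primary, secondary, tertiary) keys for `text`."""
    primary, secondary, tertiary = [], [], []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(base.lower())        # base letter only
        secondary.append(marks)             # accents
        tertiary.append(base.isupper())     # case, lowercase first
    return tuple(primary), tuple(secondary), tuple(tertiary)


def sort_key(text, strength=3):
    """Keep only the first `strength` levels (1, 2 or 3)."""
    return level_keys(text)[:strength]


words = ["r\u00f4le", "Role", "role"]       # rôle, Role, role
tertiary_order = sorted(words, key=sort_key)
```

Dropping a level from the key is what makes two strings compare as equal at a lower strength.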

The **Case_Level** attribute is used when ignoring accents
**but not** case. In such a situation, set Strength to be Primary, and
Case_Level to be On. In most locales, this setting is Off by default. There is a
small impact on string comparison performance and sort key length if this
attribute is set to be On.

**Example:**\
**S=1, E=X** role = Role = rôle\
**S=1, E=O** role = rôle < Role

The **Case_First** attribute is used to control whether
uppercase letters come before lowercase letters or vice versa, in the absence of
other differences in the strings. The possible values are Uppercase_First (U)
and Lowercase_First (L), plus the standard Default and Off. There is almost no
difference between the Off and Lowercase_First options in terms of results, so
typically users will not use Lowercase_First: only Off or Uppercase_First.
(People interested in the detailed differences between X and L should consult
the [Collation Customization](customization/index.md) chapter.)
Specifying either L or U won't affect string comparison performance, but will
affect the sort key length.

**Example:**\
**C=X or C=L** "china" < "China" < "denmark" < "Denmark"\
**C=U** "China" < "china" < "Denmark" < "denmark"

The **Alternate** attribute is used to control the handling of
the so-called **variable** characters in the UCA: whitespace, punctuation and
symbols. If Alternate is set to Non-Ignorable (N), then differences among these
characters are of the same importance as differences among letters. If Alternate
is set to Shifted (S), then these characters are of only minor importance. The
Shifted value is often used in combination with Strength set to Quaternary. In
such a case, white-space, punctuation, and symbols are considered when comparing
strings, but only if all other aspects of the strings (base letters, accents,
and case) are identical. If Alternate is not set to Shifted, then there is no
difference between a Strength of 3 and a Strength of 4.

For more information and examples, see
[Variable_Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) in
the UCA.

The reason the Alternate values are not simply On and Off is that
additional Alternate values may be added in the future.

The UCA option **Blanked** is expressed with Strength set to 3, and Alternate
set to Shifted.

The default for most locales is Non-Ignorable. If Shifted is selected,
comparison may be slower if there are many strings that are the same except for
punctuation; sort key length will not be affected unless the strength level is
also increased.

**Example:**\
**S=3, A=N** di Silva < Di Silva < diSilva < U.S.A. < USA\
**S=3, A=S** di Silva = diSilva < Di Silva < U.S.A. = USA\
**S=4, A=S** di Silva < diSilva < Di Silva < U.S.A. < USA
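The three orderings in this example can be reproduced with a toy key that treats spaces, punctuation and symbols as variable characters. This is a simplified model in plain Python, not the UCA variable-weighting algorithm: accents are left out, the weights are made up, and `is_variable` and `sort_key` are hypothetical names.

```python
import unicodedata

HIGH = 0x110000  # toy level-4 weight, above any variable character


def is_variable(ch):
    """Toy test for 'variable' characters: spaces, punctuation, symbols."""
    return unicodedata.category(ch)[0] in "ZPS"


def sort_key(text, strength=3, shifted=False):
    """Toy sort key with levels 1 to 4; level 2 (accents) is left empty."""
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if is_variable(ch):
            if shifted:
                quaternary.append(ord(ch))  # visible on level 4 only
                continue
            primary.append((0, ch))         # sorts below every letter
            tertiary.append(False)
        else:
            primary.append((1, ch.lower()))
            tertiary.append(ch.isupper())
        quaternary.append(HIGH)
    levels = [tuple(primary), (), tuple(tertiary), tuple(quaternary)]
    return tuple(levels[:strength])


names = ["USA", "U.S.A.", "diSilva", "Di Silva", "di Silva"]
non_ignorable = sorted(names, key=lambda s: sort_key(s, 3, shifted=False))
shifted_s4 = sorted(names, key=lambda s: sort_key(s, 4, shifted=True))
```

With `shifted=True` and `strength=3` the variable characters drop out of the key entirely, which is why "di Silva" and "diSilva" then compare as equal.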

The **Variable_Top** attribute is only meaningful if the
Alternate attribute is not set to Non-Ignorable. In such a case, it controls
which characters count as ignorable. The \<hex digits\> value specifies the
character whose weight (in UCA order) is the highest that is to be considered
ignorable.

Thus, for example, if a user wanted white-space to be ignorable, but not any
visible characters, then they would use the value Variable_Top=0020 (space). The
hex digits should specify only a single character. All characters of the same
primary weight are equivalent, so Variable_Top=3000 (ideographic space) has the
same effect as Variable_Top=0020.

This setting (alone) has little impact on string comparison performance; setting
it lower or higher will make sort keys slightly shorter or longer respectively.

**Example:**\
**S=3, A=S** di Silva = diSilva < U.S.A. = USA\
**S=3, A=S, T=0020** di Silva = diSilva < U.S.A. < USA

The **Normalization** setting determines whether
text is thoroughly normalized or not in comparison. Even if the setting is off
(which is the default for many locales), text as represented in common usage
will compare correctly (for details, see [UTN
#5](http://www.unicode.org/notes/tn5/)). Only if the accent marks are in
non-canonical order will there be a problem. If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string
comparison performance cost if this attribute is On, depending on the frequency
of sequences that require normalization. There is no significant effect on sort
key length. If the input text is known to be in NFD or NFKD normalization forms,
there is no need to enable this Normalization option.

**Example:**\
**N=X** ä = a + ◌̈ < ä + ◌̣ < ạ + ◌̈\
**N=O** ä = a + ◌̈ < ä + ◌̣ = ạ + ◌̈
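The two sequences that differ only under N=X are canonically equivalent, which can be checked with Python's standard `unicodedata` module. This sketch only demonstrates the equivalence; it does not perform collation, and the constant names are chosen for this illustration.

```python
import unicodedata

A_DIAERESIS = "\u00e4"   # ä, precomposed
A_DOT_BELOW = "\u1ea1"   # ạ, precomposed
DIAERESIS = "\u0308"     # combining diaeresis, combining class 230
DOT_BELOW = "\u0323"     # combining dot below, combining class 220

first = A_DIAERESIS + DOT_BELOW     # ä followed by a combining dot below
second = A_DOT_BELOW + DIAERESIS    # ạ followed by a combining diaeresis


def nfd(text):
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", text)


same_code_points = first == second
canonically_equal = nfd(first) == nfd(second)
```

In `first`, plain decomposition would leave the diaeresis before the dot below; NFD reorders the marks by combining class, which is the step that a comparison without normalization skips.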

Some **French** dictionary ordering traditions sort strings with
different accents from the back of the string. This attribute is automatically
set to On for the Canadian French locale (fr_CA). Users normally would not need
to explicitly set this attribute. There is a string comparison performance cost
when it is set On, but sort key length is unaffected.

**Example:**\
**F=X** cote < coté < côte < côté\
**F=O** cote < côte < coté < côté
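Both orderings in this example follow from comparing the accents either from the front or from the back of the string. The sketch below is a toy two-level key in plain Python that assumes precomposed input; it is not how ICU implements the attribute, and `accent_key` is a hypothetical name.

```python
import unicodedata


def accent_key(word, backwards=False):
    """Toy two-level key: base letters first, then the accents per letter."""
    primary, secondary = [], []
    for ch in word:
        decomposed = unicodedata.normalize("NFD", ch)
        primary.append(decomposed[0])       # base letter
        secondary.append(decomposed[1:])    # combining accents, if any
    if backwards:
        secondary.reverse()                 # compare accents from the end
    return tuple(primary), tuple(secondary)


# cote, coté, côte, côté
words = ["cote", "cot\u00e9", "c\u00f4te", "c\u00f4t\u00e9"]
forward = sorted(words, key=accent_key)
backward = sorted(words, key=lambda w: accent_key(w, backwards=True))
```

Only the secondary level is reversed; the base letters are still compared from the front.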

Compatibility with JIS X 4061 requires the introduction of an
additional level to distinguish **Hiragana** and Katakana characters. If
compatibility with that standard is required, then this attribute is set On, and
the strength should be set to at least Quaternary.

This attribute is an implementation detail of the CLDR Japanese tailoring. The
implementation might change to use a different mechanism to achieve the same
Japanese sort order. Since ICU 50, this attribute is not settable any more.

**Example:**\
**H=X, S=4** きゅう = キュウ < きゆう = キユウ\
**H=O, S=4** きゅう < キュウ < きゆう < キユウ

> :point_right: **Note** If attributes in a collator name are not overridden,
> they are assumed to be the same as for the given locale.
> For example, a collator opened with an empty
> string has the same attribute settings as **AN_CX_EX_FX_HX_KX_NX_S3_T0000**.

### Summary of Value Abbreviations

Value         | Abbreviation
------------- | ------------
Default       | D
On            | O
Off           | X
Primary       | 1
Secondary     | 2
Tertiary      | 3
Quaternary    | 4
Identical     | I
Shifted       | S
Non-Ignorable | N
Lower-First   | L
Upper-First   | U