12e5b6d6dSopenharmony_ci---
22e5b6d6dSopenharmony_cilayout: default
32e5b6d6dSopenharmony_cititle: Architecture
42e5b6d6dSopenharmony_cinav_order: 2
52e5b6d6dSopenharmony_ciparent: Collation
62e5b6d6dSopenharmony_ci---
72e5b6d6dSopenharmony_ci<!--
82e5b6d6dSopenharmony_ci© 2020 and later: Unicode, Inc. and others.
92e5b6d6dSopenharmony_ciLicense & terms of use: http://www.unicode.org/copyright.html
102e5b6d6dSopenharmony_ci-->
112e5b6d6dSopenharmony_ci
122e5b6d6dSopenharmony_ci# Collation Service Architecture
132e5b6d6dSopenharmony_ci{: .no_toc }
142e5b6d6dSopenharmony_ci
152e5b6d6dSopenharmony_ci## Contents
162e5b6d6dSopenharmony_ci{: .no_toc .text-delta }
172e5b6d6dSopenharmony_ci
182e5b6d6dSopenharmony_ci1. TOC
192e5b6d6dSopenharmony_ci{:toc}
202e5b6d6dSopenharmony_ci
212e5b6d6dSopenharmony_ci---
222e5b6d6dSopenharmony_ci
232e5b6d6dSopenharmony_ci## Overview
242e5b6d6dSopenharmony_ci
252e5b6d6dSopenharmony_ciThis section describes the design principles, architecture and coding
262e5b6d6dSopenharmony_ciconventions of the ICU Collation Service.
272e5b6d6dSopenharmony_ci
282e5b6d6dSopenharmony_ci## Collator
292e5b6d6dSopenharmony_ci
302e5b6d6dSopenharmony_ciTo use the Collation Service, a Collator must first be instantiated. An
312e5b6d6dSopenharmony_ciCollator is a data structure or object that maintains all of the property
322e5b6d6dSopenharmony_ciand state information necessary to define and support the specific collation
332e5b6d6dSopenharmony_cibehavior provided. Examples of properties described in the Collator are the
342e5b6d6dSopenharmony_cilocale, whether normalization is to be performed, and how many levels of
352e5b6d6dSopenharmony_cicollation are to be evaluated. Examples of the state information described in
362e5b6d6dSopenharmony_cithe Collator include the direction of a Collation Element Iterator (forward
372e5b6d6dSopenharmony_cior backward) and the status of the last API executed.
382e5b6d6dSopenharmony_ci
392e5b6d6dSopenharmony_ciThe Collator is instantiated either by referencing a locale or by defining a
402e5b6d6dSopenharmony_cicustom set of rules (a tailoring).
412e5b6d6dSopenharmony_ci
422e5b6d6dSopenharmony_ciThe Collation Service uses the paradigm:
432e5b6d6dSopenharmony_ci
442e5b6d6dSopenharmony_ci1.  Open a Collator,
452e5b6d6dSopenharmony_ci
462e5b6d6dSopenharmony_ci2.  Use while necessary,
472e5b6d6dSopenharmony_ci
482e5b6d6dSopenharmony_ci3.  Close the Collator.
492e5b6d6dSopenharmony_ci
502e5b6d6dSopenharmony_ciCollator instances cannot be shared among threads. You should open them
512e5b6d6dSopenharmony_ciinstead, and use a different collator for each separate thread. The safe clone
522e5b6d6dSopenharmony_cifunction is supported for cloning collators in a thread-safe fashion.
532e5b6d6dSopenharmony_ci
542e5b6d6dSopenharmony_ciThe Collation Service follows the ICU conventions for locale designation
552e5b6d6dSopenharmony_ciwhen opening collators:
562e5b6d6dSopenharmony_ci
572e5b6d6dSopenharmony_ci1.  NULL means the default locale.
582e5b6d6dSopenharmony_ci
592e5b6d6dSopenharmony_ci2.  The empty locale name ("") means the root locale.
602e5b6d6dSopenharmony_ci    The Collation Service adheres to the ICU conventions described in the
612e5b6d6dSopenharmony_ci    "[ICU Architectural Design](../design.md) " section of the users guide.
622e5b6d6dSopenharmony_ci    In particular:
632e5b6d6dSopenharmony_ci
642e5b6d6dSopenharmony_ci3.  The standard error code convention is usually followed. (Functions that do
652e5b6d6dSopenharmony_ci    not take an error code parameter do so for backward compatibility.)
662e5b6d6dSopenharmony_ci
672e5b6d6dSopenharmony_ci4.  The string length convention is followed: when passing a `UChar *`, the
682e5b6d6dSopenharmony_ci    length is required in a separate argument. If -1 is passed for the length,
692e5b6d6dSopenharmony_ci    it is assumed that the string is zero terminated.
702e5b6d6dSopenharmony_ci
712e5b6d6dSopenharmony_ci### Collation locale and keyword handling
722e5b6d6dSopenharmony_ci
732e5b6d6dSopenharmony_ciWhen a collator is created from a locale, the collation service (like all ICU
742e5b6d6dSopenharmony_ciservices) must map the requested locale to the localized collation data
752e5b6d6dSopenharmony_ciavailable to ICU at the time. It does so using the standard ICU locale fallback
762e5b6d6dSopenharmony_cimechanism. See the fallback section of the [locale
772e5b6d6dSopenharmony_cichapter](../locale/index.md) for more details.
782e5b6d6dSopenharmony_ci
792e5b6d6dSopenharmony_ciIf you pass a regular locale in, like "en_US", the collation service first
802e5b6d6dSopenharmony_cisearches with fallback for "collations/default" key. The first such key it finds
812e5b6d6dSopenharmony_ciwill have an associated string value; this is the keyword name for the collation
822e5b6d6dSopenharmony_cithat is default for this locale. If the search falls all the way back to the
832e5b6d6dSopenharmony_ciroot locale, the collation service will us the "collations/default" key there,
842e5b6d6dSopenharmony_ciwhich has the value "standard".
852e5b6d6dSopenharmony_ci
862e5b6d6dSopenharmony_ciIf there is a locale with a keyword, like "de-u-co-phonebk" or "de@collation=phonebook", the
872e5b6d6dSopenharmony_cicollation service searches with fallback for "collations/phonebook". If the
882e5b6d6dSopenharmony_cisearch is successful, the collation service uses the string value it finds to
892e5b6d6dSopenharmony_ciinstantiate a Collator. If the search fails because no such key is present in
902e5b6d6dSopenharmony_ciany of ICU's locale data (e.g., "de@collation=funky"), the service returns a
912e5b6d6dSopenharmony_cicollator implementing the default tailoring of the locale.
922e5b6d6dSopenharmony_ciIf the fallback is all the way to the root locale, then
932e5b6d6dSopenharmony_cithe return `UErrorCode` is `U_USING_DEFAULT_WARNING`.
942e5b6d6dSopenharmony_ci
952e5b6d6dSopenharmony_ci## Input values for collation
962e5b6d6dSopenharmony_ci
972e5b6d6dSopenharmony_ciCollation deals with processing strings. ICU generally requires that all the
982e5b6d6dSopenharmony_cistrings should be in UTF-16 format, and that all the required conversion should
992e5b6d6dSopenharmony_cidone before ICU functions are used. In the case of collation, there are APIs
1002e5b6d6dSopenharmony_cithat can also take instances of character iterators (`UCharIterator`)
1012e5b6d6dSopenharmony_cior UTF-8 directly.
1022e5b6d6dSopenharmony_ci
1032e5b6d6dSopenharmony_ciTheoretically, character iterators can iterate strings
1042e5b6d6dSopenharmony_ciin any encoding. ICU currently provides character iterator implementations for
1052e5b6d6dSopenharmony_ciUTF-8 and UTF-16BE (useful when processing data from a big endian platform on an
1062e5b6d6dSopenharmony_cilittle endian machine). It should be noted, however, that using iterators for
1072e5b6d6dSopenharmony_cicollation APIs has a performance impact. It should be used in situations when it
1082e5b6d6dSopenharmony_ciis not desirable to convert whole strings before the operation - such as when
1092e5b6d6dSopenharmony_ciusing a string compare function.
1102e5b6d6dSopenharmony_ci
1112e5b6d6dSopenharmony_ci## Collation Elements
1122e5b6d6dSopenharmony_ci
1132e5b6d6dSopenharmony_ciAs discussed in the introduction, there are many possible orderings for sorted
1142e5b6d6dSopenharmony_citext, depending on language and other factors. Ideally, there is a way to
1152e5b6d6dSopenharmony_cidescribe each ordering as a set of rules for calculating numeric values for each
1162e5b6d6dSopenharmony_cistring of text. The collation process then becomes one of simply comparing these
1172e5b6d6dSopenharmony_cinumeric values.
1182e5b6d6dSopenharmony_ci
1192e5b6d6dSopenharmony_ciThis essentially describes the way the Collation Service works. To implement
1202e5b6d6dSopenharmony_cia particular sort ordering, first the relationship between each character or
1212e5b6d6dSopenharmony_cicharacter sequence is derived. For example, a Spanish ordering defines the
1222e5b6d6dSopenharmony_ciletter sequence "CH" to be between the letters "C" and "D". As also discussed in
1232e5b6d6dSopenharmony_cithe introduction, to order strings properly requires that comparison of base
1242e5b6d6dSopenharmony_ciletters must be considered separately from comparison of accents. Letter case
1252e5b6d6dSopenharmony_cimust also be considered separately from either base letters or accents. Any
1262e5b6d6dSopenharmony_ciordering specification language must provide a way to define the relationships
1272e5b6d6dSopenharmony_cibetween characters or character sequences on multiple levels. ICU supports this
1282e5b6d6dSopenharmony_ciby using "<" to describe a relationship at the primary level, using "<<" to
1292e5b6d6dSopenharmony_cidescribe a relationship at the secondary level, and using "<<<" to describe a
1302e5b6d6dSopenharmony_cirelationship at the tertiary level. Here are some example usages:
1312e5b6d6dSopenharmony_ci
1322e5b6d6dSopenharmony_ciSymbol | Example  | Description
1332e5b6d6dSopenharmony_ci------ | -------- | -----------
1342e5b6d6dSopenharmony_ci`<`    | `c < ch` | Make a primary (base letter) difference between "c" and the character sequence "ch"
1352e5b6d6dSopenharmony_ci`<<`   | `a << ä` | Make a secondary (accent) difference between "a" and "ä"
1362e5b6d6dSopenharmony_ci`<<<`  | `a<<<A`  | Make a tertiary difference between "a" and "A"
1372e5b6d6dSopenharmony_ci
1382e5b6d6dSopenharmony_ciA more complete description of the ordering specification symbols and their
1392e5b6d6dSopenharmony_cimeanings is provided in the section on Collation Tailoring.
1402e5b6d6dSopenharmony_ci
1412e5b6d6dSopenharmony_ciOnce a sort ordering is defined by specifying the desired relationships between
1422e5b6d6dSopenharmony_cicharacters and character sequences, ICU can convert these relationships to a
1432e5b6d6dSopenharmony_ciseries of numerical values (one for each level) that satisfy these same
1442e5b6d6dSopenharmony_cirelationships.
1452e5b6d6dSopenharmony_ci
1462e5b6d6dSopenharmony_ciThis series of numeric values, representing the relative weighting of a
1472e5b6d6dSopenharmony_cicharacter or character sequence, is called a Collation Element (CE).
1482e5b6d6dSopenharmony_ciOne possible encoding of a Collation Element is a 32-bit value consisting of
1492e5b6d6dSopenharmony_cia 16-bit primary weight, a 8-bit secondary weight,
1502e5b6d6dSopenharmony_ci2 case bits, and a 6-bit tertiary weight.
1512e5b6d6dSopenharmony_ci
1522e5b6d6dSopenharmony_ciThe sort weight of a string is represented by the collation elements of its
1532e5b6d6dSopenharmony_cicomponent characters and character sequences. For example, the sort weight of
1542e5b6d6dSopenharmony_cithe string "apple" would consist of its component Collation Elements, as shown
1552e5b6d6dSopenharmony_cihere:
1562e5b6d6dSopenharmony_ci
1572e5b6d6dSopenharmony_ci"Apple" | "Apple" Collation Elements
1582e5b6d6dSopenharmony_ci------- | --------------------------
1592e5b6d6dSopenharmony_cia       | `[1900.05.05]`
1602e5b6d6dSopenharmony_cip       | `[3700.05.05]`
1612e5b6d6dSopenharmony_cip       | `[3700.05.05]`
1622e5b6d6dSopenharmony_cil       | `[2F00.05.05]`
1632e5b6d6dSopenharmony_cie       | `[2100.05.05]`
1642e5b6d6dSopenharmony_ci
1652e5b6d6dSopenharmony_ciIn this example, the letter "a" has a 16-bit primary weight of 1900 (hex), an
1662e5b6d6dSopenharmony_ci8-bit secondary weight of 05 (hex), and a combined 8-bit case-tertiary weight of
1672e5b6d6dSopenharmony_ci05 (hex).
1682e5b6d6dSopenharmony_ci
1692e5b6d6dSopenharmony_ciString comparison is performed by comparing the collation elements of each
1702e5b6d6dSopenharmony_cistring. Each of the primary weights are compared. If a difference is found, that
1712e5b6d6dSopenharmony_cidifference determines the relationship between the two strings. If no
1722e5b6d6dSopenharmony_cidifferences are found, the secondary weights are compared and so forth.
1732e5b6d6dSopenharmony_ci
1742e5b6d6dSopenharmony_ciWith ICU it is possible to specify how many levels should be compared. For some
1752e5b6d6dSopenharmony_ciapplications, it can be desirable to compare only primary levels or to compare
1762e5b6d6dSopenharmony_cionly primary and secondary levels.
1772e5b6d6dSopenharmony_ci
1782e5b6d6dSopenharmony_ci## Sort Keys
1792e5b6d6dSopenharmony_ci
1802e5b6d6dSopenharmony_ciIf a string is to be compared thousands or millions of times,
1812e5b6d6dSopenharmony_ciit can be more efficient to use sort keys.
1822e5b6d6dSopenharmony_ciSort keys are useful in situations where a large amount of data is indexed
1832e5b6d6dSopenharmony_ciand frequently searched. The sort key is generated once and used in subsequent
1842e5b6d6dSopenharmony_cicomparisons, rather than repeatedly generating the string's Collation Elements.
1852e5b6d6dSopenharmony_ciThe comparison of sort keys is a very efficient and simple binary compare of strings of
1862e5b6d6dSopenharmony_ciunsigned bytes.
1872e5b6d6dSopenharmony_ci
1882e5b6d6dSopenharmony_ciAn important property of ICU sort keys is that you can obtain the same results
1892e5b6d6dSopenharmony_ciby comparing 2 strings as you do by comparing the sort keys of the 2 strings
1902e5b6d6dSopenharmony_ci(provided that the same ordering and related collation attributes are used).
1912e5b6d6dSopenharmony_ci
1922e5b6d6dSopenharmony_ciAn ICU sort key is a pre-processed sequence of bytes generated from a Unicode
1932e5b6d6dSopenharmony_cistring. The weights for each comparison level are concatenated, separated by a
1942e5b6d6dSopenharmony_ci"0x01" byte between levels.
1952e5b6d6dSopenharmony_ciThe entire sequence is terminated with a 0x00 byte for convenience in C APIs.
1962e5b6d6dSopenharmony_ci(This 0x00 terminator is counted in the sort key length —
1972e5b6d6dSopenharmony_ciunlike regular strings where the NUL terminator is excluded from the string length.)
1982e5b6d6dSopenharmony_ci
1992e5b6d6dSopenharmony_ciICU actually compresses the sort keys so that they take the
2002e5b6d6dSopenharmony_ciminimum storage in memory and in databases.
2012e5b6d6dSopenharmony_ci
2022e5b6d6dSopenharmony_ci<!-- TODO: (diagram was missing in Google Sites already)
2032e5b6d6dSopenharmony_ci    The diagram below represents an uncompressed sort key in ICU for ease of understanding.  -->
2042e5b6d6dSopenharmony_ci
2052e5b6d6dSopenharmony_ci### Sort key size
2062e5b6d6dSopenharmony_ci
2072e5b6d6dSopenharmony_ciOne of the more important issues when considering using sort keys is the sort
2082e5b6d6dSopenharmony_cikey size. Unfortunately, it is very hard to give a fast exact answer to the
2092e5b6d6dSopenharmony_cifollowing question: "What is the maximum size for sort keys generated for
2102e5b6d6dSopenharmony_cistrings of size X". This problem is twofold:
2112e5b6d6dSopenharmony_ci
2122e5b6d6dSopenharmony_ci1.  The maximum size of the sort key depends on the size of the collation
2132e5b6d6dSopenharmony_ci    elements that are used to build it. Size of collation elements vary greatly
2142e5b6d6dSopenharmony_ci    and depends both on the alphabet in question and on the locale used.
2152e5b6d6dSopenharmony_ci
2162e5b6d6dSopenharmony_ci2.  Compression is used in building sort keys. Most 'regular' sequences of
2172e5b6d6dSopenharmony_ci    characters produce very compact sort keys.
2182e5b6d6dSopenharmony_ci
2192e5b6d6dSopenharmony_ciIf one is to assume the worst case and use too-big buffers, a lot of space will
2202e5b6d6dSopenharmony_cibe wasted. However, if you use too-small buffers, you will lose performance if
2212e5b6d6dSopenharmony_cigenerated sort keys are longer than supplied buffers too often
2222e5b6d6dSopenharmony_ci(and you have to reallocate for each of those).
2232e5b6d6dSopenharmony_ciA good strategy
2242e5b6d6dSopenharmony_cifor this problem would be to manually manage a large buffer for storing sortkeys
2252e5b6d6dSopenharmony_ciand keep a list of indices to sort keys in this buffer (see the "large buffers"
2262e5b6d6dSopenharmony_ci[Collation Example](examples#using-large-buffers-to-manage-sort-keys)
2272e5b6d6dSopenharmony_cifor more details).
2282e5b6d6dSopenharmony_ci
2292e5b6d6dSopenharmony_ciHere are some rules of a thumb, please do not rely on them. If you are looking
2302e5b6d6dSopenharmony_ciat the East Asian locales, you probably want to go with 5 bytes per code point.
2312e5b6d6dSopenharmony_ciFor Thai, 3 bytes per code point should be sufficient. For all the other locales
2322e5b6d6dSopenharmony_ci(mostly Latin and Cyrillic), you should be fine with 2 bytes per code point.
2332e5b6d6dSopenharmony_ciThese values are based on average lengths of sort keys generated with tertiary
2342e5b6d6dSopenharmony_cistrength. If you need quaternary and identical strength (you should not), add 3
2352e5b6d6dSopenharmony_cibytes per code point to each of these.
2362e5b6d6dSopenharmony_ci
2372e5b6d6dSopenharmony_ci### Partial sort keys
2382e5b6d6dSopenharmony_ci
2392e5b6d6dSopenharmony_ciIn some cases, most notably when implementing [radix
2402e5b6d6dSopenharmony_cisorting](http://en.wikipedia.org/wiki/Radix_sort), it is useful to produce only
2412e5b6d6dSopenharmony_ciparts of sort keys at a time. ICU4C 2.6+ provides an API that allows producing
2422e5b6d6dSopenharmony_ciparts of sort keys (`ucol_nextSortKeyPart` API). These sort keys may or may not be
2432e5b6d6dSopenharmony_cicompressed; that is, they may or may not be compatible with regular sort keys.
2442e5b6d6dSopenharmony_ci
2452e5b6d6dSopenharmony_ci### Merging sort keys
2462e5b6d6dSopenharmony_ci
2472e5b6d6dSopenharmony_ciSometimes, it is useful to be able to merge sort keys. One example is having
2482e5b6d6dSopenharmony_ciseparate sort keys for first and last names. If you need to perform an operation
2492e5b6d6dSopenharmony_cithat requires a sort key generated on the whole name, instead of concatenating
2502e5b6d6dSopenharmony_cistrings and regenerating sort keys, you should merge the sort keys. The merging
2512e5b6d6dSopenharmony_ciis done by merging the corresponding levels while inserting a terminator between
2522e5b6d6dSopenharmony_cimerged parts. The reserved sort key byte value for the merge terminator is 0x02.
2532e5b6d6dSopenharmony_ciFor more details see [UCA section 1.6, Merging Sort
2542e5b6d6dSopenharmony_ciKeys](http://www.unicode.org/reports/tr10/#Interleaved_Levels).
2552e5b6d6dSopenharmony_ci
2562e5b6d6dSopenharmony_ci*   C API: unicode/ucol.h `ucol_mergeSortkeys()`
2572e5b6d6dSopenharmony_ci*   Java API: `com.ibm.icu.text.CollationKey merge(CollationKey source)`
2582e5b6d6dSopenharmony_ci
2592e5b6d6dSopenharmony_ciCLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is
2602e5b6d6dSopenharmony_ciintended to allow concatenating strings like firstName+\\uFFFE+lastName to yield
2612e5b6d6dSopenharmony_cithe same results as merging their individual sort keys.
2622e5b6d6dSopenharmony_ciThis has been fully implemented in ICU since version 53.
2632e5b6d6dSopenharmony_ci
2642e5b6d6dSopenharmony_ci### Generating bounds for a sort key (prefix matching)
2652e5b6d6dSopenharmony_ci
2662e5b6d6dSopenharmony_ciHaving sort keys for strings allows for easy creation of bounds - sort keys that
2672e5b6d6dSopenharmony_ciare guaranteed to be smaller or larger than any sort key from a give range. For
2682e5b6d6dSopenharmony_ciexample, if bounds are produced for a sortkey of string "smith", strings between
2692e5b6d6dSopenharmony_ciupper and lower bounds with one level would include "Smith", "SMITH", "sMiTh".
2702e5b6d6dSopenharmony_ciTwo kinds of upper bounds can be generated - the first one will match only
2712e5b6d6dSopenharmony_cistrings of equal length, while the second one will match all the strings with
2722e5b6d6dSopenharmony_cithe same initial prefix.
2732e5b6d6dSopenharmony_ci
2742e5b6d6dSopenharmony_ciCLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum
2752e5b6d6dSopenharmony_ciprimary weight, so that for example the string "smith\\uFFFF" can be used as the
2762e5b6d6dSopenharmony_ciupper bound rather than modifying the sort key for "smith".
2772e5b6d6dSopenharmony_ci
2782e5b6d6dSopenharmony_ci## Collation Element Iterator
2792e5b6d6dSopenharmony_ci
2802e5b6d6dSopenharmony_ciThe collation element iterator is used for traversing Unicode string collation
2812e5b6d6dSopenharmony_cielements one at a time. It can be used to implement language-sensitive text
2822e5b6d6dSopenharmony_cisearch algorithms like Boyer-Moore.
2832e5b6d6dSopenharmony_ci
2842e5b6d6dSopenharmony_ciFor most applications, the two API categories, compare and sort key, are
2852e5b6d6dSopenharmony_cisufficient. Most people do not need to manipulate collation elements directly.
2862e5b6d6dSopenharmony_ci
2872e5b6d6dSopenharmony_ciExample:
2882e5b6d6dSopenharmony_ci
2892e5b6d6dSopenharmony_ciConsider iterating over "apple" and "äpple". Here are sequences of collation
2902e5b6d6dSopenharmony_cielements:
2912e5b6d6dSopenharmony_ci
2922e5b6d6dSopenharmony_ciString 1 | String 1 Collation Elements
2932e5b6d6dSopenharmony_ci-------- | ---------------------------
2942e5b6d6dSopenharmony_cia        | `[1900.05.05]`
2952e5b6d6dSopenharmony_cip        | `[3700.05.05]`
2962e5b6d6dSopenharmony_cip        | `[3700.05.05]`
2972e5b6d6dSopenharmony_cil        | `[2F00.05.05]`
2982e5b6d6dSopenharmony_cie        | `[2100.05.05]`
2992e5b6d6dSopenharmony_ci
3002e5b6d6dSopenharmony_ciString 2 | String 2 Collation Elements
3012e5b6d6dSopenharmony_ci-------- | ---------------------------
3022e5b6d6dSopenharmony_cia        | `[1900.05.05]`
3032e5b6d6dSopenharmony_ci\\u0308  | `[0000.9D.05]`
3042e5b6d6dSopenharmony_cip        | `[3700.05.05]`
3052e5b6d6dSopenharmony_cip        | `[3700.05.05]`
3062e5b6d6dSopenharmony_cil        | `[2F00.05.05]`
3072e5b6d6dSopenharmony_cie        | `[2100.05.05]`
3082e5b6d6dSopenharmony_ci
3092e5b6d6dSopenharmony_ciThe resulting CEs are typically masked according to the desired strength, and
3102e5b6d6dSopenharmony_cizero CEs are discarded. In the above example, masking with 0xFFFF0000 (for primary strength)
3112e5b6d6dSopenharmony_ciproduces the results of NULL secondary and tertiary differences. The collator then
3122e5b6d6dSopenharmony_ciignores the NULL differences and declares a match. For more details see the
3132e5b6d6dSopenharmony_cipaper "Efficient text searching in Java™: Finding the right string in any
3142e5b6d6dSopenharmony_cilanguage" by Laura Werner (
3152e5b6d6dSopenharmony_ci<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>).
3162e5b6d6dSopenharmony_ci
3172e5b6d6dSopenharmony_ci## Collation Attributes
3182e5b6d6dSopenharmony_ci
3192e5b6d6dSopenharmony_ciThe Collation Service has a number of attributes whose values can be changed
3202e5b6d6dSopenharmony_ciduring run time. These attributes affect both the functionality and the
3212e5b6d6dSopenharmony_ciperformance of the Collation Service. This section describes these
3222e5b6d6dSopenharmony_ciattributes and, where possible, their performance impact. Performance
3232e5b6d6dSopenharmony_ciindications are only approximate and timings may vary significantly depending on
3242e5b6d6dSopenharmony_cithe CPU, compiler, etc.
3252e5b6d6dSopenharmony_ci
3262e5b6d6dSopenharmony_ciAlthough string comparison by ICU and comparison of each string's sort key give
3272e5b6d6dSopenharmony_cithe same results, attribute settings can impact the execution time of each
3282e5b6d6dSopenharmony_cimethod differently. To be precise in the discussion of performance, this section
3292e5b6d6dSopenharmony_cirefers to the API employed in the measurement. The `ucol_strcoll` function is the
3302e5b6d6dSopenharmony_ciAPI for string comparison. The `ucol_getSortKey` function is used to create sort
3312e5b6d6dSopenharmony_cikeys.
3322e5b6d6dSopenharmony_ci
3332e5b6d6dSopenharmony_ci> :point_right: **Note** There is a special attribute value, `UCOL_DEFAULT`,
3342e5b6d6dSopenharmony_ci> that can be used to set any attribute to its default value
3352e5b6d6dSopenharmony_ci> (which is inherited from the UCA and the tailoring).
3362e5b6d6dSopenharmony_ci
3372e5b6d6dSopenharmony_ci### Attribute Types
3382e5b6d6dSopenharmony_ci
3392e5b6d6dSopenharmony_ci#### Strength level
3402e5b6d6dSopenharmony_ci
3412e5b6d6dSopenharmony_ciCollation strength, or the maximum collation level used for comparison, is set
3422e5b6d6dSopenharmony_ciby using the `UCOL_STRENGTH` attribute. Valid values are:
3432e5b6d6dSopenharmony_ci
3442e5b6d6dSopenharmony_ci1.  `UCOL_PRIMARY`
3452e5b6d6dSopenharmony_ci
3462e5b6d6dSopenharmony_ci2.  `UCOL_SECONDARY`
3472e5b6d6dSopenharmony_ci
3482e5b6d6dSopenharmony_ci3.  `UCOL_TERTIARY` (default)
3492e5b6d6dSopenharmony_ci
3502e5b6d6dSopenharmony_ci4.  `UCOL_QUATERNARY`
3512e5b6d6dSopenharmony_ci
3522e5b6d6dSopenharmony_ci5.  `UCOL_IDENTICAL`
3532e5b6d6dSopenharmony_ci
3542e5b6d6dSopenharmony_ci#### French collation
3552e5b6d6dSopenharmony_ci
3562e5b6d6dSopenharmony_ciThe `UCOL_FRENCH_COLLATION` attribute determines whether to sort the secondary
3572e5b6d6dSopenharmony_cidifferences in reverse order. Valid values are:
3582e5b6d6dSopenharmony_ci
3592e5b6d6dSopenharmony_ci1.  `UCOL_OFF` (default): compares secondary differences in the order they appear
3602e5b6d6dSopenharmony_ci    in the string.
3612e5b6d6dSopenharmony_ci
3622e5b6d6dSopenharmony_ci2.  `UCOL_ON`: causes secondary differences to be considered in reverse order, as
3632e5b6d6dSopenharmony_ci    it is done in the French language.
3642e5b6d6dSopenharmony_ci
3652e5b6d6dSopenharmony_ci#### Normalization mode
3662e5b6d6dSopenharmony_ci
3672e5b6d6dSopenharmony_ciThe `UCOL_NORMALIZATION_MODE` attribute, or its alias `UCOL_DECOMPOSITION_MODE`,
3682e5b6d6dSopenharmony_cicontrols whether text normalization is performed on the input strings. Valid
3692e5b6d6dSopenharmony_civalues are:
3702e5b6d6dSopenharmony_ci
3712e5b6d6dSopenharmony_ci1.  `UCOL_OFF` (default): turns off normalization check
3722e5b6d6dSopenharmony_ci
3732e5b6d6dSopenharmony_ci2.  `UCOL_ON` : normalization is checked and the collator performs normalization
3742e5b6d6dSopenharmony_ci    if it is needed.
3752e5b6d6dSopenharmony_ci
3762e5b6d6dSopenharmony_ciX                     | FCD | NFC | NFD
3772e5b6d6dSopenharmony_ci--------------------- | --- | --- | ---
3782e5b6d6dSopenharmony_ciA-ring                | Y   | Y   |
3792e5b6d6dSopenharmony_ciAngstrom              | Y   |     |
3802e5b6d6dSopenharmony_ciA + ring              | Y   |     | Y
3812e5b6d6dSopenharmony_ciA + grave             | Y   | Y   |
3822e5b6d6dSopenharmony_ciA-ring + grave        | Y   |     |
3832e5b6d6dSopenharmony_ciA + cedilla + ring    | Y   |     | Y
3842e5b6d6dSopenharmony_ciA + ring + cedilla    |     |     |
3852e5b6d6dSopenharmony_ciA-ring + cedilla      |     | Y   |
3862e5b6d6dSopenharmony_ci
3872e5b6d6dSopenharmony_ciWith normalization mode turned on, the `ucol_strcoll` function slows down by 10%.
3882e5b6d6dSopenharmony_ciIn addition, the time to generate a sort key also increases by about 25%.
3892e5b6d6dSopenharmony_ci
3902e5b6d6dSopenharmony_ci#### Alternate handling
3912e5b6d6dSopenharmony_ci
3922e5b6d6dSopenharmony_ciThis attribute allows shifting of the variable characters (usually spaces and
3932e5b6d6dSopenharmony_cipunctuation, in the UCA also most symbols) from the primary to the quaternary
3942e5b6d6dSopenharmony_cistrength level. This is set by using the `UCOL_ALTERNATE_HANDLING` attribute. For
3952e5b6d6dSopenharmony_cidetails see [UCA: Variable
3962e5b6d6dSopenharmony_ciWeighting](http://www.unicode.org/reports/tr10/#Variable_Weighting), [LDML:
3972e5b6d6dSopenharmony_ciCollation
3982e5b6d6dSopenharmony_ciSettings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings),
3992e5b6d6dSopenharmony_ciand [“Ignore Punctuation” Options](customization/ignorepunct.md).
4002e5b6d6dSopenharmony_ci
4012e5b6d6dSopenharmony_ci1.  `UCOL_NON_IGNORABLE` (CLDR/ICU default): variable characters are treated as
4022e5b6d6dSopenharmony_ci    all the other characters
4032e5b6d6dSopenharmony_ci
4042e5b6d6dSopenharmony_ci2.  `UCOL_SHIFTED` (UCA default): all the variable characters will be ignored at
4052e5b6d6dSopenharmony_ci    the primary, secondary and tertiary levels and their primary strengths will
4062e5b6d6dSopenharmony_ci    be shifted to the quaternary level.
4072e5b6d6dSopenharmony_ci
4082e5b6d6dSopenharmony_ci#### Case Ordering
4092e5b6d6dSopenharmony_ci
4102e5b6d6dSopenharmony_ciSome conventions require uppercase letters to sort before lowercase ones, while
4112e5b6d6dSopenharmony_ciothers require the opposite. This attribute is controlled by the value of the
4122e5b6d6dSopenharmony_ci`UCOL_CASE_FIRST`. The case difference in the UCA is contained in the tertiary
4132e5b6d6dSopenharmony_ciweights along with other appearance characteristics (like circling of letters).
4142e5b6d6dSopenharmony_ciThe case-first attribute allows for emphasizing of the case property of the
4152e5b6d6dSopenharmony_ciletters by reordering the tertiary weights with either upper-first, and/or
4162e5b6d6dSopenharmony_cilowercase-first. This difference gets the most significant bit in the weight.
4172e5b6d6dSopenharmony_ciValid values for this attribute are:
4182e5b6d6dSopenharmony_ci
4192e5b6d6dSopenharmony_ci1.  `UCOL_OFF` (default): leave tertiary weights unaffected
4202e5b6d6dSopenharmony_ci
4212e5b6d6dSopenharmony_ci2.  `UCOL_LOWER_FIRST`: causes lowercase letters and uncased characters to sort
4222e5b6d6dSopenharmony_ci    before uppercase
4232e5b6d6dSopenharmony_ci
4242e5b6d6dSopenharmony_ci3.  `UCOL_UPPER_FIRST` : causes uppercase letters to sort first
4252e5b6d6dSopenharmony_ci
4262e5b6d6dSopenharmony_ciThe case-first attribute does not affect the performance substantially.
4272e5b6d6dSopenharmony_ci
4282e5b6d6dSopenharmony_ci#### Case level
4292e5b6d6dSopenharmony_ci
4302e5b6d6dSopenharmony_ciWhen this attribute is set, an additional level is formed between the secondary
4312e5b6d6dSopenharmony_ciand tertiary levels, known as the Case Level. The case level is used to
4322e5b6d6dSopenharmony_cidistinguish large and small Japanese Kana characters. Case level could also be
4332e5b6d6dSopenharmony_ciused in other situations. for example to distinguish certain Pinyin characters.
4342e5b6d6dSopenharmony_ciCase level is controlled by `UCOL_CASE_LEVEL` attribute. Valid values for this
4352e5b6d6dSopenharmony_ciattribute are
4362e5b6d6dSopenharmony_ci
4372e5b6d6dSopenharmony_ci1.  `UCOL_OFF` (default): no additional case level
4382e5b6d6dSopenharmony_ci
4392e5b6d6dSopenharmony_ci2.  `UCOL_ON` : adds a case level
4402e5b6d6dSopenharmony_ci
4412e5b6d6dSopenharmony_ci#### Hiragana Quaternary
4422e5b6d6dSopenharmony_ci
4432e5b6d6dSopenharmony_ci*This setting is deprecated and ignored in recent versions of ICU.*
4442e5b6d6dSopenharmony_ci
4452e5b6d6dSopenharmony_ciHiragana Quaternary can be set to `UCOL_ON`, in which case Hiragana code points
4462e5b6d6dSopenharmony_ciwill sort before everything else on the quaternary level. If set to `UCOL_OFF`
4472e5b6d6dSopenharmony_ciHiragana letters are treated the same as all the other code points. This setting
4482e5b6d6dSopenharmony_cican be changed on run-time using the `UCOL_HIRAGANA_QUATERNARY_MODE` attribute.
4492e5b6d6dSopenharmony_ciYou probably won't need to use it.
4502e5b6d6dSopenharmony_ci
4512e5b6d6dSopenharmony_ci#### Variable Top
4522e5b6d6dSopenharmony_ci
4532e5b6d6dSopenharmony_ciVariable Top is a boundary which decides whether the code points will be treated
4542e5b6d6dSopenharmony_cias variable (shifted to quaternary level in the **shifted** mode) or
4552e5b6d6dSopenharmony_cinon-ignorable. Special APIs are used for setting of variable top. It can
4562e5b6d6dSopenharmony_cibasically be set either to a codepoint or a primary strength value.
4572e5b6d6dSopenharmony_ci
4582e5b6d6dSopenharmony_ci## Performance
4592e5b6d6dSopenharmony_ci
4602e5b6d6dSopenharmony_ciICU collation is designed to be fast, small and customizable. Several techniques
4612e5b6d6dSopenharmony_ciare used to enhance the performance:
4622e5b6d6dSopenharmony_ci
4632e5b6d6dSopenharmony_ci1.  Providing optimized processing for Latin characters.
4642e5b6d6dSopenharmony_ci
4652e5b6d6dSopenharmony_ci2.  Comparing strings incrementally and stopping at the first significant
4662e5b6d6dSopenharmony_ci    difference.
4672e5b6d6dSopenharmony_ci
4682e5b6d6dSopenharmony_ci3.  Tuning to eliminate unnecessary file access or memory allocation.
4692e5b6d6dSopenharmony_ci
4702e5b6d6dSopenharmony_ci4.  Providing efficient preflight functions that allows fast sort key size
4712e5b6d6dSopenharmony_ci    generation.
4722e5b6d6dSopenharmony_ci
4732e5b6d6dSopenharmony_ci5.  Using a single, shared copy of UCA in memory for the read-only default sort
4742e5b6d6dSopenharmony_ci    order. Only small tailoring tables are kept in memory for locale-specific
4752e5b6d6dSopenharmony_ci    customization.
4762e5b6d6dSopenharmony_ci
4772e5b6d6dSopenharmony_ci6.  Compressing sort keys efficiently.
4782e5b6d6dSopenharmony_ci
4792e5b6d6dSopenharmony_ci7.  Making the sort order be data-driven.
4802e5b6d6dSopenharmony_ci
4812e5b6d6dSopenharmony_ciIn general, the best performance from the Collation Service is expected by
4822e5b6d6dSopenharmony_cidoing the following:
4832e5b6d6dSopenharmony_ci
4842e5b6d6dSopenharmony_ci1.  After opening a collator, keep and reuse it until done. Do not open new
4852e5b6d6dSopenharmony_ci    collators for the same sort order. (Note the restriction on
4862e5b6d6dSopenharmony_ci    multi-threading.)
4872e5b6d6dSopenharmony_ci
4882e5b6d6dSopenharmony_ci2.  Use `ucol_strcoll` etc. when comparing strings. If it is necessary to
4892e5b6d6dSopenharmony_ci    compare strings thousands or millions of times,
4902e5b6d6dSopenharmony_ci    create the sort keys first and compare the sort keys instead.
4912e5b6d6dSopenharmony_ci    Generating the sort keys of two strings is about 5-10
4922e5b6d6dSopenharmony_ci    times slower than just comparing them directly.
4932e5b6d6dSopenharmony_ci
4942e5b6d6dSopenharmony_ci3.  Follow the best practice guidelines for generating sort keys. Do not call
4952e5b6d6dSopenharmony_ci    `ucol_getSortKey` twice to first size the key and then allocate the sort key
4962e5b6d6dSopenharmony_ci    buffer and repeat the call to the function to fill in the buffer.
4972e5b6d6dSopenharmony_ci
4982e5b6d6dSopenharmony_ci### Performance and Storage Implications of Attributes
4992e5b6d6dSopenharmony_ci
5002e5b6d6dSopenharmony_ciMost people use the default attributes when comparing strings or when creating
5012e5b6d6dSopenharmony_cisort keys. When they do want to customize the ordering, the most common options
5022e5b6d6dSopenharmony_ciare the following :
5032e5b6d6dSopenharmony_ci
5042e5b6d6dSopenharmony_ci`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED`\
5052e5b6d6dSopenharmony_ciUsed to ignore space and punctuation characters
5062e5b6d6dSopenharmony_ci
5072e5b6d6dSopenharmony_ci`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED` **and** `UCOL_STRENGTH == UCOL_QUATERNARY`\
5082e5b6d6dSopenharmony_ciUsed to ignore the space and punctuation characters except when there are no previous letter, accent, or case/variable differences.
5092e5b6d6dSopenharmony_ci
5102e5b6d6dSopenharmony_ci`UCOL_CASE_FIRST == UCOL_LOWER_FIRST` **or** `UCOL_CASE_FIRST == UCOL_UPPER_FIRST`\
5112e5b6d6dSopenharmony_ciUsed to change the ordering of upper vs. lower case letters (as
5122e5b6d6dSopenharmony_ciwell as small vs. large kana)
5132e5b6d6dSopenharmony_ci
5142e5b6d6dSopenharmony_ci`UCOL_CASE_LEVEL == UCOL_ON` **and** `UCOL_STRENGTH == UCOL_PRIMARY`\
5152e5b6d6dSopenharmony_ciUsed to ignore only the accent differences.
5162e5b6d6dSopenharmony_ci
5172e5b6d6dSopenharmony_ci`UCOL_NORMALIZATION_MODE == UCOL_ON`\
5182e5b6d6dSopenharmony_ciForce to always check for normalization. This
5192e5b6d6dSopenharmony_ciis used if the input text may not be in FCD form.
5202e5b6d6dSopenharmony_ci
5212e5b6d6dSopenharmony_ci`UCOL_FRENCH_COLLATION == UCOL_OFF`\
5222e5b6d6dSopenharmony_ciThis is only useful for languages like French and Catalan that may turn this attribute on.
5232e5b6d6dSopenharmony_ci(It is the default only for Canadian French ("fr-CA").)
5242e5b6d6dSopenharmony_ci
5252e5b6d6dSopenharmony_ciIn String Comparison, most of these options have little or no effect on
5262e5b6d6dSopenharmony_ciperformance. The only noticeable one is normalization, which can cost 10%-40% in
5272e5b6d6dSopenharmony_ciperformance.
5282e5b6d6dSopenharmony_ci
5292e5b6d6dSopenharmony_ciFor Sort Keys, most of these options either leave the storage alone or reduce
5302e5b6d6dSopenharmony_ciit. Shifting can reduce the storage by about 10%-20%; case level + primary-only
5312e5b6d6dSopenharmony_cican decrease it about 20% to 40%. Using no French accents can reduce the storage
5322e5b6d6dSopenharmony_ciby about 38% , but only for languages like French and Catalan that turn it on by
5332e5b6d6dSopenharmony_cidefault. On the other hand, using Shifted + Quaternary can increase the storage by
5342e5b6d6dSopenharmony_ci10%-15%. (The Identical Level also increases the length, but this option is not
5352e5b6d6dSopenharmony_cirecommended).
5362e5b6d6dSopenharmony_ci
5372e5b6d6dSopenharmony_ci> :point_right: **Note** All of the above numbers are based on
5382e5b6d6dSopenharmony_ci> tests run on a particular machine, with a particular set of data.
5392e5b6d6dSopenharmony_ci> (The data for each language is a large number of names
5402e5b6d6dSopenharmony_ci> in that language in the format <first_name>, <last name>.)
5412e5b6d6dSopenharmony_ci> The performance and storage may vary, depending on the particular computer,
5422e5b6d6dSopenharmony_ci> operating system, and data.
5432e5b6d6dSopenharmony_ci
5442e5b6d6dSopenharmony_ci## Versioning
5452e5b6d6dSopenharmony_ci
5462e5b6d6dSopenharmony_ciSort keys are often stored on disk for later reuse. A common example is the use
5472e5b6d6dSopenharmony_ciof keys to build indexes in databases. When comparing keys, it is important to
5482e5b6d6dSopenharmony_ciknow that both keys were generated by the same algorithms and weightings.
5492e5b6d6dSopenharmony_ciOtherwise, identical strings with keys generated on two different dates, for
5502e5b6d6dSopenharmony_ciexample, might compare as unequal. Sort keys can be affected by new versions of
5512e5b6d6dSopenharmony_ciICU or its data tables, new sort key formats, or changes to the Collator.
5522e5b6d6dSopenharmony_ciStarting with release 1.8.1, ICU provides a versioning mechanism to identify the
5532e5b6d6dSopenharmony_civersion information of the following (but not limited to),
5542e5b6d6dSopenharmony_ci
5552e5b6d6dSopenharmony_ci1.  The run-time executable
5562e5b6d6dSopenharmony_ci
5572e5b6d6dSopenharmony_ci2.  The collation element content
5582e5b6d6dSopenharmony_ci
5592e5b6d6dSopenharmony_ci3.  The Unicode/UCA database
5602e5b6d6dSopenharmony_ci
5612e5b6d6dSopenharmony_ci4.  The tailoring table
5622e5b6d6dSopenharmony_ci
5632e5b6d6dSopenharmony_ciThe version information of Collator is a 32-bit integer. If a new version of ICU
5642e5b6d6dSopenharmony_cihas changes affecting the content of collation elements, the version information
5652e5b6d6dSopenharmony_ciwill be changed. In that case, to use the new version of ICU collator will
5662e5b6d6dSopenharmony_cirequire regenerating any saved or stored sort keys.
5672e5b6d6dSopenharmony_ci
5682e5b6d6dSopenharmony_ciHowever, it is possible to modify ICU code or data without changing relevant version numbers,
5692e5b6d6dSopenharmony_ciso it is safer to regenerate sort keys any time after any part of ICU has been updated.
5702e5b6d6dSopenharmony_ci
5712e5b6d6dSopenharmony_ciSince ICU4C 1.8.1.
5722e5b6d6dSopenharmony_ciit is possible to build your program so that it uses more than one version of
5732e5b6d6dSopenharmony_ciICU (only in C/C++, not in Java). Therefore, you could use the current version
5742e5b6d6dSopenharmony_cifor the features you need and use the older version for collation.
5752e5b6d6dSopenharmony_ci
5762e5b6d6dSopenharmony_ci## Programming Examples
5772e5b6d6dSopenharmony_ci
5782e5b6d6dSopenharmony_ciSee the [Collation Examples](examples.md) chapter for an example of how to
5792e5b6d6dSopenharmony_cicompare and create sort keys with the default locale in C, C++ and Java.
580