12e5b6d6dSopenharmony_ci--- 22e5b6d6dSopenharmony_cilayout: default 32e5b6d6dSopenharmony_cititle: Architecture 42e5b6d6dSopenharmony_cinav_order: 2 52e5b6d6dSopenharmony_ciparent: Collation 62e5b6d6dSopenharmony_ci--- 72e5b6d6dSopenharmony_ci<!-- 82e5b6d6dSopenharmony_ci© 2020 and later: Unicode, Inc. and others. 92e5b6d6dSopenharmony_ciLicense & terms of use: http://www.unicode.org/copyright.html 102e5b6d6dSopenharmony_ci--> 112e5b6d6dSopenharmony_ci 122e5b6d6dSopenharmony_ci# Collation Service Architecture 132e5b6d6dSopenharmony_ci{: .no_toc } 142e5b6d6dSopenharmony_ci 152e5b6d6dSopenharmony_ci## Contents 162e5b6d6dSopenharmony_ci{: .no_toc .text-delta } 172e5b6d6dSopenharmony_ci 182e5b6d6dSopenharmony_ci1. TOC 192e5b6d6dSopenharmony_ci{:toc} 202e5b6d6dSopenharmony_ci 212e5b6d6dSopenharmony_ci--- 222e5b6d6dSopenharmony_ci 232e5b6d6dSopenharmony_ci## Overview 242e5b6d6dSopenharmony_ci 252e5b6d6dSopenharmony_ciThis section describes the design principles, architecture and coding 262e5b6d6dSopenharmony_ciconventions of the ICU Collation Service. 272e5b6d6dSopenharmony_ci 282e5b6d6dSopenharmony_ci## Collator 292e5b6d6dSopenharmony_ci 302e5b6d6dSopenharmony_ciTo use the Collation Service, a Collator must first be instantiated. An 312e5b6d6dSopenharmony_ciCollator is a data structure or object that maintains all of the property 322e5b6d6dSopenharmony_ciand state information necessary to define and support the specific collation 332e5b6d6dSopenharmony_cibehavior provided. Examples of properties described in the Collator are the 342e5b6d6dSopenharmony_cilocale, whether normalization is to be performed, and how many levels of 352e5b6d6dSopenharmony_cicollation are to be evaluated. Examples of the state information described in 362e5b6d6dSopenharmony_cithe Collator include the direction of a Collation Element Iterator (forward 372e5b6d6dSopenharmony_cior backward) and the status of the last API executed. 382e5b6d6dSopenharmony_ci 392e5b6d6dSopenharmony_ciThe Collator is instantiated either by referencing a locale or by defining a 402e5b6d6dSopenharmony_cicustom set of rules (a tailoring). 412e5b6d6dSopenharmony_ci 422e5b6d6dSopenharmony_ciThe Collation Service uses the paradigm: 432e5b6d6dSopenharmony_ci 442e5b6d6dSopenharmony_ci1. Open a Collator, 452e5b6d6dSopenharmony_ci 462e5b6d6dSopenharmony_ci2. Use while necessary, 472e5b6d6dSopenharmony_ci 482e5b6d6dSopenharmony_ci3. Close the Collator. 492e5b6d6dSopenharmony_ci 502e5b6d6dSopenharmony_ciCollator instances cannot be shared among threads. You should open them 512e5b6d6dSopenharmony_ciinstead, and use a different collator for each separate thread. The safe clone 522e5b6d6dSopenharmony_cifunction is supported for cloning collators in a thread-safe fashion. 532e5b6d6dSopenharmony_ci 542e5b6d6dSopenharmony_ciThe Collation Service follows the ICU conventions for locale designation 552e5b6d6dSopenharmony_ciwhen opening collators: 562e5b6d6dSopenharmony_ci 572e5b6d6dSopenharmony_ci1. NULL means the default locale. 582e5b6d6dSopenharmony_ci 592e5b6d6dSopenharmony_ci2. The empty locale name ("") means the root locale. 602e5b6d6dSopenharmony_ci The Collation Service adheres to the ICU conventions described in the 612e5b6d6dSopenharmony_ci "[ICU Architectural Design](../design.md) " section of the users guide. 622e5b6d6dSopenharmony_ci In particular: 632e5b6d6dSopenharmony_ci 642e5b6d6dSopenharmony_ci3. The standard error code convention is usually followed. (Functions that do 652e5b6d6dSopenharmony_ci not take an error code parameter do so for backward compatibility.) 662e5b6d6dSopenharmony_ci 672e5b6d6dSopenharmony_ci4. The string length convention is followed: when passing a `UChar *`, the 682e5b6d6dSopenharmony_ci length is required in a separate argument. If -1 is passed for the length, 692e5b6d6dSopenharmony_ci it is assumed that the string is zero terminated. 702e5b6d6dSopenharmony_ci 712e5b6d6dSopenharmony_ci### Collation locale and keyword handling 722e5b6d6dSopenharmony_ci 732e5b6d6dSopenharmony_ciWhen a collator is created from a locale, the collation service (like all ICU 742e5b6d6dSopenharmony_ciservices) must map the requested locale to the localized collation data 752e5b6d6dSopenharmony_ciavailable to ICU at the time. It does so using the standard ICU locale fallback 762e5b6d6dSopenharmony_cimechanism. See the fallback section of the [locale 772e5b6d6dSopenharmony_cichapter](../locale/index.md) for more details. 782e5b6d6dSopenharmony_ci 792e5b6d6dSopenharmony_ciIf you pass a regular locale in, like "en_US", the collation service first 802e5b6d6dSopenharmony_cisearches with fallback for "collations/default" key. The first such key it finds 812e5b6d6dSopenharmony_ciwill have an associated string value; this is the keyword name for the collation 822e5b6d6dSopenharmony_cithat is default for this locale. If the search falls all the way back to the 832e5b6d6dSopenharmony_ciroot locale, the collation service will us the "collations/default" key there, 842e5b6d6dSopenharmony_ciwhich has the value "standard". 852e5b6d6dSopenharmony_ci 862e5b6d6dSopenharmony_ciIf there is a locale with a keyword, like "de-u-co-phonebk" or "de@collation=phonebook", the 872e5b6d6dSopenharmony_cicollation service searches with fallback for "collations/phonebook". If the 882e5b6d6dSopenharmony_cisearch is successful, the collation service uses the string value it finds to 892e5b6d6dSopenharmony_ciinstantiate a Collator. If the search fails because no such key is present in 902e5b6d6dSopenharmony_ciany of ICU's locale data (e.g., "de@collation=funky"), the service returns a 912e5b6d6dSopenharmony_cicollator implementing the default tailoring of the locale. 922e5b6d6dSopenharmony_ciIf the fallback is all the way to the root locale, then 932e5b6d6dSopenharmony_cithe return `UErrorCode` is `U_USING_DEFAULT_WARNING`. 942e5b6d6dSopenharmony_ci 952e5b6d6dSopenharmony_ci## Input values for collation 962e5b6d6dSopenharmony_ci 972e5b6d6dSopenharmony_ciCollation deals with processing strings. ICU generally requires that all the 982e5b6d6dSopenharmony_cistrings should be in UTF-16 format, and that all the required conversion should 992e5b6d6dSopenharmony_cidone before ICU functions are used. In the case of collation, there are APIs 1002e5b6d6dSopenharmony_cithat can also take instances of character iterators (`UCharIterator`) 1012e5b6d6dSopenharmony_cior UTF-8 directly. 1022e5b6d6dSopenharmony_ci 1032e5b6d6dSopenharmony_ciTheoretically, character iterators can iterate strings 1042e5b6d6dSopenharmony_ciin any encoding. ICU currently provides character iterator implementations for 1052e5b6d6dSopenharmony_ciUTF-8 and UTF-16BE (useful when processing data from a big endian platform on an 1062e5b6d6dSopenharmony_cilittle endian machine). It should be noted, however, that using iterators for 1072e5b6d6dSopenharmony_cicollation APIs has a performance impact. It should be used in situations when it 1082e5b6d6dSopenharmony_ciis not desirable to convert whole strings before the operation - such as when 1092e5b6d6dSopenharmony_ciusing a string compare function. 1102e5b6d6dSopenharmony_ci 1112e5b6d6dSopenharmony_ci## Collation Elements 1122e5b6d6dSopenharmony_ci 1132e5b6d6dSopenharmony_ciAs discussed in the introduction, there are many possible orderings for sorted 1142e5b6d6dSopenharmony_citext, depending on language and other factors. Ideally, there is a way to 1152e5b6d6dSopenharmony_cidescribe each ordering as a set of rules for calculating numeric values for each 1162e5b6d6dSopenharmony_cistring of text. The collation process then becomes one of simply comparing these 1172e5b6d6dSopenharmony_cinumeric values. 1182e5b6d6dSopenharmony_ci 1192e5b6d6dSopenharmony_ciThis essentially describes the way the Collation Service works. To implement 1202e5b6d6dSopenharmony_cia particular sort ordering, first the relationship between each character or 1212e5b6d6dSopenharmony_cicharacter sequence is derived. For example, a Spanish ordering defines the 1222e5b6d6dSopenharmony_ciletter sequence "CH" to be between the letters "C" and "D". As also discussed in 1232e5b6d6dSopenharmony_cithe introduction, to order strings properly requires that comparison of base 1242e5b6d6dSopenharmony_ciletters must be considered separately from comparison of accents. Letter case 1252e5b6d6dSopenharmony_cimust also be considered separately from either base letters or accents. Any 1262e5b6d6dSopenharmony_ciordering specification language must provide a way to define the relationships 1272e5b6d6dSopenharmony_cibetween characters or character sequences on multiple levels. ICU supports this 1282e5b6d6dSopenharmony_ciby using "<" to describe a relationship at the primary level, using "<<" to 1292e5b6d6dSopenharmony_cidescribe a relationship at the secondary level, and using "<<<" to describe a 1302e5b6d6dSopenharmony_cirelationship at the tertiary level. Here are some example usages: 1312e5b6d6dSopenharmony_ci 1322e5b6d6dSopenharmony_ciSymbol | Example | Description 1332e5b6d6dSopenharmony_ci------ | -------- | ----------- 1342e5b6d6dSopenharmony_ci`<` | `c < ch` | Make a primary (base letter) difference between "c" and the character sequence "ch" 1352e5b6d6dSopenharmony_ci`<<` | `a << ä` | Make a secondary (accent) difference between "a" and "ä" 1362e5b6d6dSopenharmony_ci`<<<` | `a<<<A` | Make a tertiary difference between "a" and "A" 1372e5b6d6dSopenharmony_ci 1382e5b6d6dSopenharmony_ciA more complete description of the ordering specification symbols and their 1392e5b6d6dSopenharmony_cimeanings is provided in the section on Collation Tailoring. 1402e5b6d6dSopenharmony_ci 1412e5b6d6dSopenharmony_ciOnce a sort ordering is defined by specifying the desired relationships between 1422e5b6d6dSopenharmony_cicharacters and character sequences, ICU can convert these relationships to a 1432e5b6d6dSopenharmony_ciseries of numerical values (one for each level) that satisfy these same 1442e5b6d6dSopenharmony_cirelationships. 1452e5b6d6dSopenharmony_ci 1462e5b6d6dSopenharmony_ciThis series of numeric values, representing the relative weighting of a 1472e5b6d6dSopenharmony_cicharacter or character sequence, is called a Collation Element (CE). 1482e5b6d6dSopenharmony_ciOne possible encoding of a Collation Element is a 32-bit value consisting of 1492e5b6d6dSopenharmony_cia 16-bit primary weight, a 8-bit secondary weight, 1502e5b6d6dSopenharmony_ci2 case bits, and a 6-bit tertiary weight. 1512e5b6d6dSopenharmony_ci 1522e5b6d6dSopenharmony_ciThe sort weight of a string is represented by the collation elements of its 1532e5b6d6dSopenharmony_cicomponent characters and character sequences. For example, the sort weight of 1542e5b6d6dSopenharmony_cithe string "apple" would consist of its component Collation Elements, as shown 1552e5b6d6dSopenharmony_cihere: 1562e5b6d6dSopenharmony_ci 1572e5b6d6dSopenharmony_ci"Apple" | "Apple" Collation Elements 1582e5b6d6dSopenharmony_ci------- | -------------------------- 1592e5b6d6dSopenharmony_cia | `[1900.05.05]` 1602e5b6d6dSopenharmony_cip | `[3700.05.05]` 1612e5b6d6dSopenharmony_cip | `[3700.05.05]` 1622e5b6d6dSopenharmony_cil | `[2F00.05.05]` 1632e5b6d6dSopenharmony_cie | `[2100.05.05]` 1642e5b6d6dSopenharmony_ci 1652e5b6d6dSopenharmony_ciIn this example, the letter "a" has a 16-bit primary weight of 1900 (hex), an 1662e5b6d6dSopenharmony_ci8-bit secondary weight of 05 (hex), and a combined 8-bit case-tertiary weight of 1672e5b6d6dSopenharmony_ci05 (hex). 1682e5b6d6dSopenharmony_ci 1692e5b6d6dSopenharmony_ciString comparison is performed by comparing the collation elements of each 1702e5b6d6dSopenharmony_cistring. Each of the primary weights are compared. If a difference is found, that 1712e5b6d6dSopenharmony_cidifference determines the relationship between the two strings. If no 1722e5b6d6dSopenharmony_cidifferences are found, the secondary weights are compared and so forth. 1732e5b6d6dSopenharmony_ci 1742e5b6d6dSopenharmony_ciWith ICU it is possible to specify how many levels should be compared. For some 1752e5b6d6dSopenharmony_ciapplications, it can be desirable to compare only primary levels or to compare 1762e5b6d6dSopenharmony_cionly primary and secondary levels. 1772e5b6d6dSopenharmony_ci 1782e5b6d6dSopenharmony_ci## Sort Keys 1792e5b6d6dSopenharmony_ci 1802e5b6d6dSopenharmony_ciIf a string is to be compared thousands or millions of times, 1812e5b6d6dSopenharmony_ciit can be more efficient to use sort keys. 1822e5b6d6dSopenharmony_ciSort keys are useful in situations where a large amount of data is indexed 1832e5b6d6dSopenharmony_ciand frequently searched. The sort key is generated once and used in subsequent 1842e5b6d6dSopenharmony_cicomparisons, rather than repeatedly generating the string's Collation Elements. 1852e5b6d6dSopenharmony_ciThe comparison of sort keys is a very efficient and simple binary compare of strings of 1862e5b6d6dSopenharmony_ciunsigned bytes. 1872e5b6d6dSopenharmony_ci 1882e5b6d6dSopenharmony_ciAn important property of ICU sort keys is that you can obtain the same results 1892e5b6d6dSopenharmony_ciby comparing 2 strings as you do by comparing the sort keys of the 2 strings 1902e5b6d6dSopenharmony_ci(provided that the same ordering and related collation attributes are used). 1912e5b6d6dSopenharmony_ci 1922e5b6d6dSopenharmony_ciAn ICU sort key is a pre-processed sequence of bytes generated from a Unicode 1932e5b6d6dSopenharmony_cistring. The weights for each comparison level are concatenated, separated by a 1942e5b6d6dSopenharmony_ci"0x01" byte between levels. 1952e5b6d6dSopenharmony_ciThe entire sequence is terminated with a 0x00 byte for convenience in C APIs. 1962e5b6d6dSopenharmony_ci(This 0x00 terminator is counted in the sort key length — 1972e5b6d6dSopenharmony_ciunlike regular strings where the NUL terminator is excluded from the string length.) 1982e5b6d6dSopenharmony_ci 1992e5b6d6dSopenharmony_ciICU actually compresses the sort keys so that they take the 2002e5b6d6dSopenharmony_ciminimum storage in memory and in databases. 2012e5b6d6dSopenharmony_ci 2022e5b6d6dSopenharmony_ci<!-- TODO: (diagram was missing in Google Sites already) 2032e5b6d6dSopenharmony_ci The diagram below represents an uncompressed sort key in ICU for ease of understanding. --> 2042e5b6d6dSopenharmony_ci 2052e5b6d6dSopenharmony_ci### Sort key size 2062e5b6d6dSopenharmony_ci 2072e5b6d6dSopenharmony_ciOne of the more important issues when considering using sort keys is the sort 2082e5b6d6dSopenharmony_cikey size. Unfortunately, it is very hard to give a fast exact answer to the 2092e5b6d6dSopenharmony_cifollowing question: "What is the maximum size for sort keys generated for 2102e5b6d6dSopenharmony_cistrings of size X". This problem is twofold: 2112e5b6d6dSopenharmony_ci 2122e5b6d6dSopenharmony_ci1. The maximum size of the sort key depends on the size of the collation 2132e5b6d6dSopenharmony_ci elements that are used to build it. Size of collation elements vary greatly 2142e5b6d6dSopenharmony_ci and depends both on the alphabet in question and on the locale used. 2152e5b6d6dSopenharmony_ci 2162e5b6d6dSopenharmony_ci2. Compression is used in building sort keys. Most 'regular' sequences of 2172e5b6d6dSopenharmony_ci characters produce very compact sort keys. 2182e5b6d6dSopenharmony_ci 2192e5b6d6dSopenharmony_ciIf one is to assume the worst case and use too-big buffers, a lot of space will 2202e5b6d6dSopenharmony_cibe wasted. However, if you use too-small buffers, you will lose performance if 2212e5b6d6dSopenharmony_cigenerated sort keys are longer than supplied buffers too often 2222e5b6d6dSopenharmony_ci(and you have to reallocate for each of those). 2232e5b6d6dSopenharmony_ciA good strategy 2242e5b6d6dSopenharmony_cifor this problem would be to manually manage a large buffer for storing sortkeys 2252e5b6d6dSopenharmony_ciand keep a list of indices to sort keys in this buffer (see the "large buffers" 2262e5b6d6dSopenharmony_ci[Collation Example](examples#using-large-buffers-to-manage-sort-keys) 2272e5b6d6dSopenharmony_cifor more details). 2282e5b6d6dSopenharmony_ci 2292e5b6d6dSopenharmony_ciHere are some rules of a thumb, please do not rely on them. If you are looking 2302e5b6d6dSopenharmony_ciat the East Asian locales, you probably want to go with 5 bytes per code point. 2312e5b6d6dSopenharmony_ciFor Thai, 3 bytes per code point should be sufficient. For all the other locales 2322e5b6d6dSopenharmony_ci(mostly Latin and Cyrillic), you should be fine with 2 bytes per code point. 2332e5b6d6dSopenharmony_ciThese values are based on average lengths of sort keys generated with tertiary 2342e5b6d6dSopenharmony_cistrength. If you need quaternary and identical strength (you should not), add 3 2352e5b6d6dSopenharmony_cibytes per code point to each of these. 2362e5b6d6dSopenharmony_ci 2372e5b6d6dSopenharmony_ci### Partial sort keys 2382e5b6d6dSopenharmony_ci 2392e5b6d6dSopenharmony_ciIn some cases, most notably when implementing [radix 2402e5b6d6dSopenharmony_cisorting](http://en.wikipedia.org/wiki/Radix_sort), it is useful to produce only 2412e5b6d6dSopenharmony_ciparts of sort keys at a time. ICU4C 2.6+ provides an API that allows producing 2422e5b6d6dSopenharmony_ciparts of sort keys (`ucol_nextSortKeyPart` API). These sort keys may or may not be 2432e5b6d6dSopenharmony_cicompressed; that is, they may or may not be compatible with regular sort keys. 2442e5b6d6dSopenharmony_ci 2452e5b6d6dSopenharmony_ci### Merging sort keys 2462e5b6d6dSopenharmony_ci 2472e5b6d6dSopenharmony_ciSometimes, it is useful to be able to merge sort keys. One example is having 2482e5b6d6dSopenharmony_ciseparate sort keys for first and last names. If you need to perform an operation 2492e5b6d6dSopenharmony_cithat requires a sort key generated on the whole name, instead of concatenating 2502e5b6d6dSopenharmony_cistrings and regenerating sort keys, you should merge the sort keys. The merging 2512e5b6d6dSopenharmony_ciis done by merging the corresponding levels while inserting a terminator between 2522e5b6d6dSopenharmony_cimerged parts. The reserved sort key byte value for the merge terminator is 0x02. 2532e5b6d6dSopenharmony_ciFor more details see [UCA section 1.6, Merging Sort 2542e5b6d6dSopenharmony_ciKeys](http://www.unicode.org/reports/tr10/#Interleaved_Levels). 2552e5b6d6dSopenharmony_ci 2562e5b6d6dSopenharmony_ci* C API: unicode/ucol.h `ucol_mergeSortkeys()` 2572e5b6d6dSopenharmony_ci* Java API: `com.ibm.icu.text.CollationKey merge(CollationKey source)` 2582e5b6d6dSopenharmony_ci 2592e5b6d6dSopenharmony_ciCLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is 2602e5b6d6dSopenharmony_ciintended to allow concatenating strings like firstName+\\uFFFE+lastName to yield 2612e5b6d6dSopenharmony_cithe same results as merging their individual sort keys. 2622e5b6d6dSopenharmony_ciThis has been fully implemented in ICU since version 53. 2632e5b6d6dSopenharmony_ci 2642e5b6d6dSopenharmony_ci### Generating bounds for a sort key (prefix matching) 2652e5b6d6dSopenharmony_ci 2662e5b6d6dSopenharmony_ciHaving sort keys for strings allows for easy creation of bounds - sort keys that 2672e5b6d6dSopenharmony_ciare guaranteed to be smaller or larger than any sort key from a give range. For 2682e5b6d6dSopenharmony_ciexample, if bounds are produced for a sortkey of string "smith", strings between 2692e5b6d6dSopenharmony_ciupper and lower bounds with one level would include "Smith", "SMITH", "sMiTh". 2702e5b6d6dSopenharmony_ciTwo kinds of upper bounds can be generated - the first one will match only 2712e5b6d6dSopenharmony_cistrings of equal length, while the second one will match all the strings with 2722e5b6d6dSopenharmony_cithe same initial prefix. 2732e5b6d6dSopenharmony_ci 2742e5b6d6dSopenharmony_ciCLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum 2752e5b6d6dSopenharmony_ciprimary weight, so that for example the string "smith\\uFFFF" can be used as the 2762e5b6d6dSopenharmony_ciupper bound rather than modifying the sort key for "smith". 2772e5b6d6dSopenharmony_ci 2782e5b6d6dSopenharmony_ci## Collation Element Iterator 2792e5b6d6dSopenharmony_ci 2802e5b6d6dSopenharmony_ciThe collation element iterator is used for traversing Unicode string collation 2812e5b6d6dSopenharmony_cielements one at a time. It can be used to implement language-sensitive text 2822e5b6d6dSopenharmony_cisearch algorithms like Boyer-Moore. 2832e5b6d6dSopenharmony_ci 2842e5b6d6dSopenharmony_ciFor most applications, the two API categories, compare and sort key, are 2852e5b6d6dSopenharmony_cisufficient. Most people do not need to manipulate collation elements directly. 2862e5b6d6dSopenharmony_ci 2872e5b6d6dSopenharmony_ciExample: 2882e5b6d6dSopenharmony_ci 2892e5b6d6dSopenharmony_ciConsider iterating over "apple" and "äpple". Here are sequences of collation 2902e5b6d6dSopenharmony_cielements: 2912e5b6d6dSopenharmony_ci 2922e5b6d6dSopenharmony_ciString 1 | String 1 Collation Elements 2932e5b6d6dSopenharmony_ci-------- | --------------------------- 2942e5b6d6dSopenharmony_cia | `[1900.05.05]` 2952e5b6d6dSopenharmony_cip | `[3700.05.05]` 2962e5b6d6dSopenharmony_cip | `[3700.05.05]` 2972e5b6d6dSopenharmony_cil | `[2F00.05.05]` 2982e5b6d6dSopenharmony_cie | `[2100.05.05]` 2992e5b6d6dSopenharmony_ci 3002e5b6d6dSopenharmony_ciString 2 | String 2 Collation Elements 3012e5b6d6dSopenharmony_ci-------- | --------------------------- 3022e5b6d6dSopenharmony_cia | `[1900.05.05]` 3032e5b6d6dSopenharmony_ci\\u0308 | `[0000.9D.05]` 3042e5b6d6dSopenharmony_cip | `[3700.05.05]` 3052e5b6d6dSopenharmony_cip | `[3700.05.05]` 3062e5b6d6dSopenharmony_cil | `[2F00.05.05]` 3072e5b6d6dSopenharmony_cie | `[2100.05.05]` 3082e5b6d6dSopenharmony_ci 3092e5b6d6dSopenharmony_ciThe resulting CEs are typically masked according to the desired strength, and 3102e5b6d6dSopenharmony_cizero CEs are discarded. In the above example, masking with 0xFFFF0000 (for primary strength) 3112e5b6d6dSopenharmony_ciproduces the results of NULL secondary and tertiary differences. The collator then 3122e5b6d6dSopenharmony_ciignores the NULL differences and declares a match. For more details see the 3132e5b6d6dSopenharmony_cipaper "Efficient text searching in Java™: Finding the right string in any 3142e5b6d6dSopenharmony_cilanguage" by Laura Werner ( 3152e5b6d6dSopenharmony_ci<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>). 3162e5b6d6dSopenharmony_ci 3172e5b6d6dSopenharmony_ci## Collation Attributes 3182e5b6d6dSopenharmony_ci 3192e5b6d6dSopenharmony_ciThe Collation Service has a number of attributes whose values can be changed 3202e5b6d6dSopenharmony_ciduring run time. These attributes affect both the functionality and the 3212e5b6d6dSopenharmony_ciperformance of the Collation Service. This section describes these 3222e5b6d6dSopenharmony_ciattributes and, where possible, their performance impact. Performance 3232e5b6d6dSopenharmony_ciindications are only approximate and timings may vary significantly depending on 3242e5b6d6dSopenharmony_cithe CPU, compiler, etc. 3252e5b6d6dSopenharmony_ci 3262e5b6d6dSopenharmony_ciAlthough string comparison by ICU and comparison of each string's sort key give 3272e5b6d6dSopenharmony_cithe same results, attribute settings can impact the execution time of each 3282e5b6d6dSopenharmony_cimethod differently. To be precise in the discussion of performance, this section 3292e5b6d6dSopenharmony_cirefers to the API employed in the measurement. The `ucol_strcoll` function is the 3302e5b6d6dSopenharmony_ciAPI for string comparison. The `ucol_getSortKey` function is used to create sort 3312e5b6d6dSopenharmony_cikeys. 3322e5b6d6dSopenharmony_ci 3332e5b6d6dSopenharmony_ci> :point_right: **Note** There is a special attribute value, `UCOL_DEFAULT`, 3342e5b6d6dSopenharmony_ci> that can be used to set any attribute to its default value 3352e5b6d6dSopenharmony_ci> (which is inherited from the UCA and the tailoring). 3362e5b6d6dSopenharmony_ci 3372e5b6d6dSopenharmony_ci### Attribute Types 3382e5b6d6dSopenharmony_ci 3392e5b6d6dSopenharmony_ci#### Strength level 3402e5b6d6dSopenharmony_ci 3412e5b6d6dSopenharmony_ciCollation strength, or the maximum collation level used for comparison, is set 3422e5b6d6dSopenharmony_ciby using the `UCOL_STRENGTH` attribute. Valid values are: 3432e5b6d6dSopenharmony_ci 3442e5b6d6dSopenharmony_ci1. `UCOL_PRIMARY` 3452e5b6d6dSopenharmony_ci 3462e5b6d6dSopenharmony_ci2. `UCOL_SECONDARY` 3472e5b6d6dSopenharmony_ci 3482e5b6d6dSopenharmony_ci3. `UCOL_TERTIARY` (default) 3492e5b6d6dSopenharmony_ci 3502e5b6d6dSopenharmony_ci4. `UCOL_QUATERNARY` 3512e5b6d6dSopenharmony_ci 3522e5b6d6dSopenharmony_ci5. `UCOL_IDENTICAL` 3532e5b6d6dSopenharmony_ci 3542e5b6d6dSopenharmony_ci#### French collation 3552e5b6d6dSopenharmony_ci 3562e5b6d6dSopenharmony_ciThe `UCOL_FRENCH_COLLATION` attribute determines whether to sort the secondary 3572e5b6d6dSopenharmony_cidifferences in reverse order. Valid values are: 3582e5b6d6dSopenharmony_ci 3592e5b6d6dSopenharmony_ci1. `UCOL_OFF` (default): compares secondary differences in the order they appear 3602e5b6d6dSopenharmony_ci in the string. 3612e5b6d6dSopenharmony_ci 3622e5b6d6dSopenharmony_ci2. `UCOL_ON`: causes secondary differences to be considered in reverse order, as 3632e5b6d6dSopenharmony_ci it is done in the French language. 3642e5b6d6dSopenharmony_ci 3652e5b6d6dSopenharmony_ci#### Normalization mode 3662e5b6d6dSopenharmony_ci 3672e5b6d6dSopenharmony_ciThe `UCOL_NORMALIZATION_MODE` attribute, or its alias `UCOL_DECOMPOSITION_MODE`, 3682e5b6d6dSopenharmony_cicontrols whether text normalization is performed on the input strings. Valid 3692e5b6d6dSopenharmony_civalues are: 3702e5b6d6dSopenharmony_ci 3712e5b6d6dSopenharmony_ci1. `UCOL_OFF` (default): turns off normalization check 3722e5b6d6dSopenharmony_ci 3732e5b6d6dSopenharmony_ci2. `UCOL_ON` : normalization is checked and the collator performs normalization 3742e5b6d6dSopenharmony_ci if it is needed. 3752e5b6d6dSopenharmony_ci 3762e5b6d6dSopenharmony_ciX | FCD | NFC | NFD 3772e5b6d6dSopenharmony_ci--------------------- | --- | --- | --- 3782e5b6d6dSopenharmony_ciA-ring | Y | Y | 3792e5b6d6dSopenharmony_ciAngstrom | Y | | 3802e5b6d6dSopenharmony_ciA + ring | Y | | Y 3812e5b6d6dSopenharmony_ciA + grave | Y | Y | 3822e5b6d6dSopenharmony_ciA-ring + grave | Y | | 3832e5b6d6dSopenharmony_ciA + cedilla + ring | Y | | Y 3842e5b6d6dSopenharmony_ciA + ring + cedilla | | | 3852e5b6d6dSopenharmony_ciA-ring + cedilla | | Y | 3862e5b6d6dSopenharmony_ci 3872e5b6d6dSopenharmony_ciWith normalization mode turned on, the `ucol_strcoll` function slows down by 10%. 3882e5b6d6dSopenharmony_ciIn addition, the time to generate a sort key also increases by about 25%. 3892e5b6d6dSopenharmony_ci 3902e5b6d6dSopenharmony_ci#### Alternate handling 3912e5b6d6dSopenharmony_ci 3922e5b6d6dSopenharmony_ciThis attribute allows shifting of the variable characters (usually spaces and 3932e5b6d6dSopenharmony_cipunctuation, in the UCA also most symbols) from the primary to the quaternary 3942e5b6d6dSopenharmony_cistrength level. This is set by using the `UCOL_ALTERNATE_HANDLING` attribute. For 3952e5b6d6dSopenharmony_cidetails see [UCA: Variable 3962e5b6d6dSopenharmony_ciWeighting](http://www.unicode.org/reports/tr10/#Variable_Weighting), [LDML: 3972e5b6d6dSopenharmony_ciCollation 3982e5b6d6dSopenharmony_ciSettings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings), 3992e5b6d6dSopenharmony_ciand [“Ignore Punctuation” Options](customization/ignorepunct.md). 4002e5b6d6dSopenharmony_ci 4012e5b6d6dSopenharmony_ci1. `UCOL_NON_IGNORABLE` (CLDR/ICU default): variable characters are treated as 4022e5b6d6dSopenharmony_ci all the other characters 4032e5b6d6dSopenharmony_ci 4042e5b6d6dSopenharmony_ci2. `UCOL_SHIFTED` (UCA default): all the variable characters will be ignored at 4052e5b6d6dSopenharmony_ci the primary, secondary and tertiary levels and their primary strengths will 4062e5b6d6dSopenharmony_ci be shifted to the quaternary level. 4072e5b6d6dSopenharmony_ci 4082e5b6d6dSopenharmony_ci#### Case Ordering 4092e5b6d6dSopenharmony_ci 4102e5b6d6dSopenharmony_ciSome conventions require uppercase letters to sort before lowercase ones, while 4112e5b6d6dSopenharmony_ciothers require the opposite. This attribute is controlled by the value of the 4122e5b6d6dSopenharmony_ci`UCOL_CASE_FIRST`. The case difference in the UCA is contained in the tertiary 4132e5b6d6dSopenharmony_ciweights along with other appearance characteristics (like circling of letters). 4142e5b6d6dSopenharmony_ciThe case-first attribute allows for emphasizing of the case property of the 4152e5b6d6dSopenharmony_ciletters by reordering the tertiary weights with either upper-first, and/or 4162e5b6d6dSopenharmony_cilowercase-first. This difference gets the most significant bit in the weight. 4172e5b6d6dSopenharmony_ciValid values for this attribute are: 4182e5b6d6dSopenharmony_ci 4192e5b6d6dSopenharmony_ci1. `UCOL_OFF` (default): leave tertiary weights unaffected 4202e5b6d6dSopenharmony_ci 4212e5b6d6dSopenharmony_ci2. `UCOL_LOWER_FIRST`: causes lowercase letters and uncased characters to sort 4222e5b6d6dSopenharmony_ci before uppercase 4232e5b6d6dSopenharmony_ci 4242e5b6d6dSopenharmony_ci3. `UCOL_UPPER_FIRST` : causes uppercase letters to sort first 4252e5b6d6dSopenharmony_ci 4262e5b6d6dSopenharmony_ciThe case-first attribute does not affect the performance substantially. 4272e5b6d6dSopenharmony_ci 4282e5b6d6dSopenharmony_ci#### Case level 4292e5b6d6dSopenharmony_ci 4302e5b6d6dSopenharmony_ciWhen this attribute is set, an additional level is formed between the secondary 4312e5b6d6dSopenharmony_ciand tertiary levels, known as the Case Level. The case level is used to 4322e5b6d6dSopenharmony_cidistinguish large and small Japanese Kana characters. Case level could also be 4332e5b6d6dSopenharmony_ciused in other situations. for example to distinguish certain Pinyin characters. 4342e5b6d6dSopenharmony_ciCase level is controlled by `UCOL_CASE_LEVEL` attribute. Valid values for this 4352e5b6d6dSopenharmony_ciattribute are 4362e5b6d6dSopenharmony_ci 4372e5b6d6dSopenharmony_ci1. `UCOL_OFF` (default): no additional case level 4382e5b6d6dSopenharmony_ci 4392e5b6d6dSopenharmony_ci2. `UCOL_ON` : adds a case level 4402e5b6d6dSopenharmony_ci 4412e5b6d6dSopenharmony_ci#### Hiragana Quaternary 4422e5b6d6dSopenharmony_ci 4432e5b6d6dSopenharmony_ci*This setting is deprecated and ignored in recent versions of ICU.* 4442e5b6d6dSopenharmony_ci 4452e5b6d6dSopenharmony_ciHiragana Quaternary can be set to `UCOL_ON`, in which case Hiragana code points 4462e5b6d6dSopenharmony_ciwill sort before everything else on the quaternary level. If set to `UCOL_OFF` 4472e5b6d6dSopenharmony_ciHiragana letters are treated the same as all the other code points. This setting 4482e5b6d6dSopenharmony_cican be changed on run-time using the `UCOL_HIRAGANA_QUATERNARY_MODE` attribute. 4492e5b6d6dSopenharmony_ciYou probably won't need to use it. 4502e5b6d6dSopenharmony_ci 4512e5b6d6dSopenharmony_ci#### Variable Top 4522e5b6d6dSopenharmony_ci 4532e5b6d6dSopenharmony_ciVariable Top is a boundary which decides whether the code points will be treated 4542e5b6d6dSopenharmony_cias variable (shifted to quaternary level in the **shifted** mode) or 4552e5b6d6dSopenharmony_cinon-ignorable. Special APIs are used for setting of variable top. It can 4562e5b6d6dSopenharmony_cibasically be set either to a codepoint or a primary strength value. 4572e5b6d6dSopenharmony_ci 4582e5b6d6dSopenharmony_ci## Performance 4592e5b6d6dSopenharmony_ci 4602e5b6d6dSopenharmony_ciICU collation is designed to be fast, small and customizable. Several techniques 4612e5b6d6dSopenharmony_ciare used to enhance the performance: 4622e5b6d6dSopenharmony_ci 4632e5b6d6dSopenharmony_ci1. Providing optimized processing for Latin characters. 4642e5b6d6dSopenharmony_ci 4652e5b6d6dSopenharmony_ci2. Comparing strings incrementally and stopping at the first significant 4662e5b6d6dSopenharmony_ci difference. 4672e5b6d6dSopenharmony_ci 4682e5b6d6dSopenharmony_ci3. Tuning to eliminate unnecessary file access or memory allocation. 4692e5b6d6dSopenharmony_ci 4702e5b6d6dSopenharmony_ci4. Providing efficient preflight functions that allows fast sort key size 4712e5b6d6dSopenharmony_ci generation. 4722e5b6d6dSopenharmony_ci 4732e5b6d6dSopenharmony_ci5. Using a single, shared copy of UCA in memory for the read-only default sort 4742e5b6d6dSopenharmony_ci order. Only small tailoring tables are kept in memory for locale-specific 4752e5b6d6dSopenharmony_ci customization. 4762e5b6d6dSopenharmony_ci 4772e5b6d6dSopenharmony_ci6. Compressing sort keys efficiently. 4782e5b6d6dSopenharmony_ci 4792e5b6d6dSopenharmony_ci7. Making the sort order be data-driven. 4802e5b6d6dSopenharmony_ci 4812e5b6d6dSopenharmony_ciIn general, the best performance from the Collation Service is expected by 4822e5b6d6dSopenharmony_cidoing the following: 4832e5b6d6dSopenharmony_ci 4842e5b6d6dSopenharmony_ci1. After opening a collator, keep and reuse it until done. Do not open new 4852e5b6d6dSopenharmony_ci collators for the same sort order. (Note the restriction on 4862e5b6d6dSopenharmony_ci multi-threading.) 4872e5b6d6dSopenharmony_ci 4882e5b6d6dSopenharmony_ci2. Use `ucol_strcoll` etc. when comparing strings. If it is necessary to 4892e5b6d6dSopenharmony_ci compare strings thousands or millions of times, 4902e5b6d6dSopenharmony_ci create the sort keys first and compare the sort keys instead. 4912e5b6d6dSopenharmony_ci Generating the sort keys of two strings is about 5-10 4922e5b6d6dSopenharmony_ci times slower than just comparing them directly. 4932e5b6d6dSopenharmony_ci 4942e5b6d6dSopenharmony_ci3. Follow the best practice guidelines for generating sort keys. Do not call 4952e5b6d6dSopenharmony_ci `ucol_getSortKey` twice to first size the key and then allocate the sort key 4962e5b6d6dSopenharmony_ci buffer and repeat the call to the function to fill in the buffer. 4972e5b6d6dSopenharmony_ci 4982e5b6d6dSopenharmony_ci### Performance and Storage Implications of Attributes 4992e5b6d6dSopenharmony_ci 5002e5b6d6dSopenharmony_ciMost people use the default attributes when comparing strings or when creating 5012e5b6d6dSopenharmony_cisort keys. When they do want to customize the ordering, the most common options 5022e5b6d6dSopenharmony_ciare the following : 5032e5b6d6dSopenharmony_ci 5042e5b6d6dSopenharmony_ci`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED`\ 5052e5b6d6dSopenharmony_ciUsed to ignore space and punctuation characters 5062e5b6d6dSopenharmony_ci 5072e5b6d6dSopenharmony_ci`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED` **and** `UCOL_STRENGTH == UCOL_QUATERNARY`\ 5082e5b6d6dSopenharmony_ciUsed to ignore the space and punctuation characters except when there are no previous letter, accent, or case/variable differences. 5092e5b6d6dSopenharmony_ci 5102e5b6d6dSopenharmony_ci`UCOL_CASE_FIRST == UCOL_LOWER_FIRST` **or** `UCOL_CASE_FIRST == UCOL_UPPER_FIRST`\ 5112e5b6d6dSopenharmony_ciUsed to change the ordering of upper vs. lower case letters (as 5122e5b6d6dSopenharmony_ciwell as small vs. large kana) 5132e5b6d6dSopenharmony_ci 5142e5b6d6dSopenharmony_ci`UCOL_CASE_LEVEL == UCOL_ON` **and** `UCOL_STRENGTH == UCOL_PRIMARY`\ 5152e5b6d6dSopenharmony_ciUsed to ignore only the accent differences. 5162e5b6d6dSopenharmony_ci 5172e5b6d6dSopenharmony_ci`UCOL_NORMALIZATION_MODE == UCOL_ON`\ 5182e5b6d6dSopenharmony_ciForce to always check for normalization. This 5192e5b6d6dSopenharmony_ciis used if the input text may not be in FCD form. 5202e5b6d6dSopenharmony_ci 5212e5b6d6dSopenharmony_ci`UCOL_FRENCH_COLLATION == UCOL_OFF`\ 5222e5b6d6dSopenharmony_ciThis is only useful for languages like French and Catalan that may turn this attribute on. 5232e5b6d6dSopenharmony_ci(It is the default only for Canadian French ("fr-CA").) 5242e5b6d6dSopenharmony_ci 5252e5b6d6dSopenharmony_ciIn String Comparison, most of these options have little or no effect on 5262e5b6d6dSopenharmony_ciperformance. The only noticeable one is normalization, which can cost 10%-40% in 5272e5b6d6dSopenharmony_ciperformance. 5282e5b6d6dSopenharmony_ci 5292e5b6d6dSopenharmony_ciFor Sort Keys, most of these options either leave the storage alone or reduce 5302e5b6d6dSopenharmony_ciit. Shifting can reduce the storage by about 10%-20%; case level + primary-only 5312e5b6d6dSopenharmony_cican decrease it about 20% to 40%. Using no French accents can reduce the storage 5322e5b6d6dSopenharmony_ciby about 38% , but only for languages like French and Catalan that turn it on by 5332e5b6d6dSopenharmony_cidefault. On the other hand, using Shifted + Quaternary can increase the storage by 5342e5b6d6dSopenharmony_ci10%-15%. (The Identical Level also increases the length, but this option is not 5352e5b6d6dSopenharmony_cirecommended). 5362e5b6d6dSopenharmony_ci 5372e5b6d6dSopenharmony_ci> :point_right: **Note** All of the above numbers are based on 5382e5b6d6dSopenharmony_ci> tests run on a particular machine, with a particular set of data. 5392e5b6d6dSopenharmony_ci> (The data for each language is a large number of names 5402e5b6d6dSopenharmony_ci> in that language in the format <first_name>, <last name>.) 5412e5b6d6dSopenharmony_ci> The performance and storage may vary, depending on the particular computer, 5422e5b6d6dSopenharmony_ci> operating system, and data. 5432e5b6d6dSopenharmony_ci 5442e5b6d6dSopenharmony_ci## Versioning 5452e5b6d6dSopenharmony_ci 5462e5b6d6dSopenharmony_ciSort keys are often stored on disk for later reuse. A common example is the use 5472e5b6d6dSopenharmony_ciof keys to build indexes in databases. When comparing keys, it is important to 5482e5b6d6dSopenharmony_ciknow that both keys were generated by the same algorithms and weightings. 5492e5b6d6dSopenharmony_ciOtherwise, identical strings with keys generated on two different dates, for 5502e5b6d6dSopenharmony_ciexample, might compare as unequal. Sort keys can be affected by new versions of 5512e5b6d6dSopenharmony_ciICU or its data tables, new sort key formats, or changes to the Collator. 5522e5b6d6dSopenharmony_ciStarting with release 1.8.1, ICU provides a versioning mechanism to identify the 5532e5b6d6dSopenharmony_civersion information of the following (but not limited to), 5542e5b6d6dSopenharmony_ci 5552e5b6d6dSopenharmony_ci1. The run-time executable 5562e5b6d6dSopenharmony_ci 5572e5b6d6dSopenharmony_ci2. The collation element content 5582e5b6d6dSopenharmony_ci 5592e5b6d6dSopenharmony_ci3. The Unicode/UCA database 5602e5b6d6dSopenharmony_ci 5612e5b6d6dSopenharmony_ci4. The tailoring table 5622e5b6d6dSopenharmony_ci 5632e5b6d6dSopenharmony_ciThe version information of Collator is a 32-bit integer. If a new version of ICU 5642e5b6d6dSopenharmony_cihas changes affecting the content of collation elements, the version information 5652e5b6d6dSopenharmony_ciwill be changed. In that case, to use the new version of ICU collator will 5662e5b6d6dSopenharmony_cirequire regenerating any saved or stored sort keys. 5672e5b6d6dSopenharmony_ci 5682e5b6d6dSopenharmony_ciHowever, it is possible to modify ICU code or data without changing relevant version numbers, 5692e5b6d6dSopenharmony_ciso it is safer to regenerate sort keys any time after any part of ICU has been updated. 5702e5b6d6dSopenharmony_ci 5712e5b6d6dSopenharmony_ciSince ICU4C 1.8.1. 5722e5b6d6dSopenharmony_ciit is possible to build your program so that it uses more than one version of 5732e5b6d6dSopenharmony_ciICU (only in C/C++, not in Java). Therefore, you could use the current version 5742e5b6d6dSopenharmony_cifor the features you need and use the older version for collation. 5752e5b6d6dSopenharmony_ci 5762e5b6d6dSopenharmony_ci## Programming Examples 5772e5b6d6dSopenharmony_ci 5782e5b6d6dSopenharmony_ciSee the [Collation Examples](examples.md) chapter for an example of how to 5792e5b6d6dSopenharmony_cicompare and create sort keys with the default locale in C, C++ and Java. 580