---
layout: default
title: Collation
nav_order: 1200
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation

## Overview

Information is displayed in sorted order to enable users to easily find the
items they are looking for. However, users of different languages might have
very different expectations of what a "sorted" list should look like. Not only
does the alphabetical order vary from one language to another, but it also can
vary from document to document within the same language. For example, phonebook
ordering might be different from dictionary ordering. String comparison is one
of the basic functions most applications require, and yet implementations often
do not match local conventions. The ICU Collation Service provides string
comparison capability with support for appropriate sort orderings for each of
the locales you need. In the event that you have a very unusual requirement, you
are also provided the facilities to customize orderings.

Starting in release 1.8, the ICU Collation Service is compliant with the Unicode
Collation Algorithm (UCA) ([Unicode Technical Standard
#10](http://www.unicode.org/reports/tr10/)) and based on the Default
Unicode Collation Element Table (DUCET), which defines the same sort order as ISO
14651.

The ICU Collation Service also contains several enhancements that are not
available in UCA. These have been adopted into the [CLDR Collation
Algorithm](http://www.unicode.org/reports/tr35/tr35-collation.html#CLDR_Collation_Algorithm).
For example:

* Additional case handling (as specified by CLDR): ICU allows case differences
  to be ignored or flipped. Uppercase letters can be sorted before lowercase
  letters, or vice-versa.
* Easy customization (as specified by CLDR): Services can be easily tailored
  to address a wide range of collation requirements.
* The [default (root) sort
  order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
  has been tailored slightly for improved functionality and performance.

In other words, ICU implements the CLDR Collation Algorithm, which is an
extension of the Unicode Collation Algorithm (UCA), which is an extension of ISO
14651.

There are several benefits to using the collation algorithms defined in these
standards, including:

* The algorithms have been designed and reviewed by experts in multilingual
  collation, and therefore are robust and comprehensive.

* Applications that share sorted data but do not agree on how the data should
  be ordered fail to perform correctly. By conforming to the CLDR/UCA/14651
  standards for collation and using CLDR language-specific collation data,
  independently developed applications sort data identically and perform
  properly.

In addition, Unicode contains a large set of characters. This can make it
difficult for collation to be a fast operation or require collation to use
significant memory or disk resources. The ICU collation implementation is
designed to be fast, have a small memory footprint and be highly customizable.

There are many challenges when accommodating the world's languages and writing
systems and the different orderings that are used. However, the ICU Collation
Service provides an excellent means for comparing strings in a locale-sensitive
fashion.

For example, here are some of the ways languages vary in ordering strings:

* The letters A-Z can be sorted in a different order than in English. For
  example, in Lithuanian, "y" is sorted between "i" and "k".

* Combinations of letters can be treated as if they were one letter. For
  example, in traditional Spanish "ch" is treated as a single letter, and
  sorted between "c" and "d".

* Accented letters can be treated as minor variants of the unaccented letter.
  For example, "é" can be treated as equivalent to "e".

* Accented letters can be treated as distinct letters. For example, "Å" in
  Danish is treated as a separate letter that sorts just after "Z".

* Unaccented letters that are considered distinct in one language can be
  indistinct in another. For example, the letters "v" and "w" are two
  different letters according to English. However, "v" and "w" are
  traditionally considered variant forms of the same letter in Swedish.

* A letter can be treated as if it were two letters. For example, in German
  phonebook (or "lists of names") order "ä" is compared as if it were "ae".

* Thai requires that the order of certain letters be reversed.

* Some French dictionary ordering traditions sort accents in backwards order,
  from the end of the string. For example, the word "côte" sorts before "coté"
  because the acute accent on the final "e" is more significant than the
  circumflex on the "o".

* Sometimes lowercase letters sort before uppercase letters.
  The reverse is
  required in other situations. For example, lowercase letters are usually
  sorted before uppercase letters in English. In Danish, the opposite is
  true.

* Even in the same language, different applications might require different
  sorting orders. For example, in German dictionaries, "of" would come before
  "öf". In phone books the situation is the exact opposite, because "ö" is
  compared as if it were "oe".

* Sorting orders can change over time due to government regulations or new
  characters/scripts in Unicode.

To accommodate the many languages and differing requirements, ICU collation
supports customizing sort orderings - also known as **tailoring**. More details
regarding tailoring are discussed in the [Customization
chapter](customization/index.md).

The basic ICU Collation Service is provided by two main categories of APIs:

* String comparison - most commonly used: APIs return the result of comparing
  two strings (greater than, equal or less than). This is used as a comparator
  when sorting lists, building tree maps, etc.

* Sort key generation - used when a very large set of strings is
  compared/sorted repeatedly: APIs return a zero-terminated array of bytes per
  string known as a sort key.
  The keys can be compared directly using the strcmp
  or memcmp standard library functions, saving repeated lookup and computation
  of each string's collation properties. For example, database applications
  use index tables of sort keys to index strings quickly. Note, however, that
  this only improves performance for large numbers of strings because sorting
  via the comparison functions is very fast. For more information, see
  [Sortkeys vs Comparison](concepts#sortkeys-vs-comparison).

ICU provides an AlphabeticIndex API for generating language-appropriate
sorted-section labels like those in dictionaries and phone books.

ICU also provides a higher-level [string search](string-search)
API which can be used, for example, for case-insensitive or accent-insensitive
search in an editor or in a web page. ICU string search is based on the
low-level [collation element iteration](architecture).

## Programming Examples

Here are some [API usage conventions](api.md) for the ICU Collation Service
APIs.
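
The two API categories described in the overview (string comparison and sort
key generation) can be illustrated with a minimal sketch. To keep the example
self-contained, it uses the JDK's `java.text.Collator` as a stand-in; this is
an assumption for illustration, not the ICU API itself, although ICU4J's
`com.ibm.icu.text.Collator` exposes similarly named methods. The class name
`CollationSketch` and its helper methods are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Sketch only: java.text.Collator (JDK) is used here as a stand-in for the
// ICU Collation Service. All names in this class are hypothetical.
final class CollationSketch {

    // String comparison: the collator itself serves as the comparator.
    static List<String> sortWithCollator(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(input);
        sorted.sort(collator);
        return sorted;
    }

    // Sort key generation: compute one key per string up front, then sort by
    // comparing keys only, without consulting the collation rules again.
    static List<String> sortWithKeys(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<CollationKey> keys = new ArrayList<>();
        for (String s : input) {
            keys.add(collator.getCollationKey(s));
        }
        Collections.sort(keys); // CollationKey is Comparable
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    // Lowering the strength to SECONDARY makes the comparison ignore case
    // differences while still distinguishing different base letters.
    static boolean equalIgnoringCase(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.SECONDARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "Banana", "apple");
        System.out.println(sortWithCollator(words, Locale.ENGLISH)); // [apple, Banana, cherry]
        System.out.println(sortWithKeys(words, Locale.ENGLISH));     // [apple, Banana, cherry]
        System.out.println(equalIgnoringCase("abc", "ABC", Locale.ENGLISH)); // true
    }
}
```

Both sorting helpers produce the same order. The sort key variant pays the
cost of key generation once per string, which is the trade-off the overview
describes for large sets of strings that are compared repeatedly.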