---
layout: default
title: Collation
nav_order: 1200
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation

## Overview

Information is displayed in sorted order to enable users to easily find the
items they are looking for. However, users of different languages might have
very different expectations of what a "sorted" list should look like. Not only
does the alphabetical order vary from one language to another, but it can also
vary from document to document within the same language. For example, phonebook
ordering might be different from dictionary ordering. String comparison is one
of the basic functions most applications require, and yet implementations often
do not match local conventions. The ICU Collation Service provides string
comparison capability with support for appropriate sort orderings for each of
the locales you need. If you have a very unusual requirement, ICU also provides
facilities to customize orderings.

Starting in release 1.8, the ICU Collation Service complies with the Unicode
Collation Algorithm (UCA)
([Unicode Technical Standard #10](http://www.unicode.org/reports/tr10/)) and is
based on the Default Unicode Collation Element Table (DUCET), which defines the
same sort order as ISO 14651.

The ICU Collation Service also contains several enhancements that are not
available in UCA. These have been adopted into the [CLDR Collation
Algorithm](http://www.unicode.org/reports/tr35/tr35-collation.html#CLDR_Collation_Algorithm).
For example:

*   Additional case handling (as specified by CLDR): ICU allows case differences
    to be ignored or flipped. Uppercase letters can be sorted before lowercase
    letters, or vice versa.
*   Easy customization (as specified by CLDR): Services can be easily tailored
    to address a wide range of collation requirements.
*   The [default (root) sort
    order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
    has been tailored slightly for improved functionality and performance.

In other words, ICU implements the CLDR Collation Algorithm, which is an
extension of the Unicode Collation Algorithm (UCA), which is in turn an
extension of ISO 14651.

There are several benefits to using the collation algorithms defined in these
standards, including:

*   The algorithms have been designed and reviewed by experts in multilingual
    collation, and therefore are robust and comprehensive.

*   Applications that share sorted data but do not agree on how the data should
    be ordered fail to perform correctly. By conforming to the CLDR/UCA/14651
    standards for collation and using CLDR language-specific collation data,
    independently developed applications sort data identically and perform
    properly.

In addition, Unicode contains a large set of characters. This can make it
difficult for collation to be fast, or can require collation to use
significant memory or disk resources. The ICU collation implementation is
designed to be fast, have a small memory footprint, and be highly customizable.

There are many challenges when accommodating the world's languages and writing
systems and the different orderings that are used. However, the ICU Collation
Service provides an excellent means for comparing strings in a locale-sensitive
fashion.

For example, here are some of the ways languages vary in ordering strings:

*   The letters A-Z can be sorted in a different order than in English. For
    example, in Lithuanian, "y" is sorted between "i" and "k".

*   Combinations of letters can be treated as if they were one letter. For
    example, in traditional Spanish "ch" is treated as a single letter, and
    sorted between "c" and "d".

*   Accented letters can be treated as minor variants of the unaccented letter.
    For example, "é" can be treated as equivalent to "e".

*   Accented letters can be treated as distinct letters. For example, "Å" in
    Danish is treated as a separate letter that sorts just after "Z".

*   Unaccented letters that are considered distinct in one language can be
    indistinct in another. For example, the letters "v" and "w" are two
    different letters according to English. However, "v" and "w" are
    traditionally considered variant forms of the same letter in Swedish.

*   A letter can be treated as if it were two letters. For example, in German
    phonebook (or "lists of names") order "ä" is compared as if it were "ae".

*   Thai requires that the order of certain letters be reversed.

*   Some French dictionary ordering traditions sort accents in backwards order,
    from the end of the string. For example, the word "côte" sorts before "coté"
    because the acute accent on the final "e" is more significant than the
    circumflex on the "o".

*   Sometimes lowercase letters sort before uppercase letters. The reverse is
    required in other situations. For example, lowercase letters are usually
    sorted before uppercase letters in English. In Danish, the order is the
    exact opposite.

*   Even in the same language, different applications might require different
    sorting orders. For example, in German dictionaries, "of" would come before
    "öf". In phone books, where "ö" is compared as if it were "oe", the order
    is the exact opposite.

*   Sorting orders can change over time due to government regulations or new
    characters/scripts in Unicode.
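Several of the differences above can be understood as differences at separate comparison levels: base letters first, then accents, then case. The following Python sketch is a toy model of that idea and is not ICU's algorithm or data. The function name `toy_key` and its `french` flag are hypothetical, introduced only for illustration, and the sketch uses nothing beyond the standard `unicodedata` module.

```python
import unicodedata

def toy_key(text, french=False):
    """Toy three-level key: base letters, then accents, then case.

    Illustration only; this is not ICU's algorithm or data.
    """
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the accent to the most recent base letter.
            if secondary:
                secondary[-1] += ch
            continue
        primary.append(ch.lower())
        secondary.append("")           # no accent seen yet for this letter
        tertiary.append(ch.isupper())  # lowercase (False) sorts first
    if french:
        # Backwards accent ordering: compare accents from the end.
        secondary.reverse()
    return (primary, secondary, tertiary)

words = ["rose", "résumé", "Resume", "resume"]
print(sorted(words, key=toy_key))
# ['resume', 'Resume', 'résumé', 'rose']

pair = ["coté", "côte"]
print(sorted(pair, key=toy_key))
# ['coté', 'côte']
print(sorted(pair, key=lambda w: toy_key(w, french=True)))
# ['côte', 'coté']
```

In this model an accent or case difference only matters when all base letters are equal, which is why "résumé" still sorts before "rose". Reversing the accent level reproduces the French "côte" before "coté" ordering described above.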

To accommodate the many languages and differing requirements, ICU collation
supports customizing sort orderings, also known as **tailoring**. More details
regarding tailoring are discussed in the [Customization
chapter](customization/index.md).
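To make the idea of tailoring concrete, here is a minimal Python sketch. It does not use ICU or the ICU rule syntax; `make_key` and the alphabet lists are hypothetical names chosen for this illustration. It assumes lowercase input made up only of the listed units, and shows how adding the contraction "ch" between "c" and "d" changes the result.

```python
def make_key(alphabet):
    """Build a sort-key function from an ordered list of collation units.

    Units may be multi-letter contractions such as "ch". Toy model only:
    input is assumed to be lowercase and to use only units in `alphabet`.
    """
    rank = {unit: i for i, unit in enumerate(alphabet)}
    longest = max(len(unit) for unit in alphabet)

    def key(word):
        weights, i = [], 0
        while i < len(word):
            # Prefer the longest unit that matches at this position.
            for size in range(longest, 0, -1):
                unit = word[i:i + size]
                if unit in rank:
                    weights.append(rank[unit])
                    i += len(unit)
                    break
            else:
                raise ValueError(f"no collation unit for {word[i]!r}")
        return weights

    return key

base = list("abcdefghijklmnopqrstuvwxyz")
# Tailoring: insert the contraction "ch" between "c" and "d".
traditional_spanish = base[:3] + ["ch"] + base[3:]

words = ["dama", "chico", "cuna"]
print(sorted(words, key=make_key(base)))                 # ['chico', 'cuna', 'dama']
print(sorted(words, key=make_key(traditional_spanish)))  # ['cuna', 'chico', 'dama']
```

The base ordering is left untouched and the tailored ordering is derived from it, which mirrors the idea of customizing an existing sort order rather than defining one from scratch.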

The basic ICU Collation Service is provided by two main categories of APIs:

*   String comparison (most commonly used): the APIs return the result of
    comparing two strings (greater than, equal, or less than). This is used as
    a comparator when sorting lists, building tree maps, etc.

*   Sort key generation (used when a very large set of strings is compared or
    sorted repeatedly): the APIs return a zero-terminated array of bytes per
    string, known as a sort key. The keys can be compared directly using the
    strcmp or memcmp standard library functions, saving repeated lookup and
    computation of each string's collation properties. For example, database
    applications use index tables of sort keys to index strings quickly. Note,
    however, that this only improves performance for large numbers of strings
    because sorting via the comparison functions is very fast. For more
    information, see [Sortkeys vs Comparison](concepts#sortkeys-vs-comparison).
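The relationship between the two categories can be sketched with a toy collation in Python. This is not the ICU API or ICU's sort key format. The functions `compare` and `sort_key` are hypothetical, handle ASCII text only, and use an invented byte layout in which 0x01 separates the levels and 0x00 terminates the key.

```python
from functools import cmp_to_key

def compare(left, right):
    """Toy comparison: letters first, then case. Returns -1, 0, or 1."""
    lower_l, lower_r = left.lower(), right.lower()
    if lower_l != lower_r:
        return -1 if lower_l < lower_r else 1
    case_l = [ch.isupper() for ch in left]
    case_r = [ch.isupper() for ch in right]
    return (case_l > case_r) - (case_l < case_r)

def sort_key(text):
    """Toy sort key for ASCII text, comparable byte by byte.

    Invented layout: letter bytes, 0x01, case bytes, 0x00 terminator.
    """
    primary = bytes(ord(ch.lower()) for ch in text)
    tertiary = bytes(3 if ch.isupper() else 2 for ch in text)
    return primary + b"\x01" + tertiary + b"\x00"

names = ["bob", "Alice", "alice", "Bob", "carol"]

# Category 1: sort with the comparison function.
by_comparison = sorted(names, key=cmp_to_key(compare))

# Category 2: compute each key once, then compare plain bytes.
keys = {name: sort_key(name) for name in names}
by_sort_key = sorted(names, key=keys.__getitem__)

print(by_comparison)                 # ['alice', 'Alice', 'bob', 'Bob', 'carol']
print(by_comparison == by_sort_key)  # True
```

Both paths give the same order. The keys cost extra work and storage up front, so they pay off only when the same strings are compared many times, for example in a database index.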

ICU provides an AlphabeticIndex API for generating language-appropriate
sorted-section labels, such as those found in dictionaries and phone books.
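As a rough picture of what such section labels look like, the following Python sketch groups names under first-letter labels. It is a simplified stand-in and not the AlphabeticIndex API: `index_label` and `build_index` are hypothetical names, the labels are not language-specific, and names are assumed to be non-empty.

```python
import unicodedata
from collections import defaultdict

def index_label(name):
    """Toy bucket label: first letter without its accent, uppercased."""
    first = unicodedata.normalize("NFD", name)[0]
    return first.upper() if first.isalpha() else "#"

def build_index(names):
    """Group names under labels, with labels and entries in sorted order."""
    def order(name):
        # Simplified ordering: decomposed, lowercased code points.
        return unicodedata.normalize("NFD", name).lower()

    buckets = defaultdict(list)
    for name in sorted(names, key=order):
        buckets[index_label(name)].append(name)
    return dict(sorted(buckets.items()))

index = build_index(["Émile", "Zoë", "adam", "Eva", "42nd Street"])
for label, entries in index.items():
    print(label, entries)
```

A real index would choose its labels and ordering per language, which is the part that this sketch leaves out.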

ICU also provides a higher-level [string search](string-search)
API, which can be used, for example, for case-insensitive or accent-insensitive
search in an editor or in a web page. ICU string search is based on the
low-level [collation element iteration](architecture).
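The effect of a case-insensitive and accent-insensitive search can be approximated in a few lines of Python. This sketch does not use ICU string search or collation elements. It folds text with the standard `unicodedata` module instead, and `fold_with_offsets` and `find_all` are hypothetical helper names.

```python
import unicodedata

def fold_with_offsets(text):
    """Drop accents and case; remember which input index each char came from."""
    folded, offsets = [], []
    for index, ch in enumerate(text):
        for part in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(part):
                continue  # ignore accents
            for out in part.casefold():  # casefold may expand, e.g. "ß" to "ss"
                folded.append(out)
                offsets.append(index)
    return "".join(folded), offsets

def find_all(pattern, text):
    """Return the start index in `text` of every folded match of `pattern`."""
    needle, _ = fold_with_offsets(pattern)
    haystack, offsets = fold_with_offsets(text)
    hits = []
    start = haystack.find(needle)
    while needle and start != -1:
        hits.append(offsets[start])
        start = haystack.find(needle, start + 1)
    return hits

text = "R\u00e9sum\u00e9, resume, RESUME"  # "Résumé, resume, RESUME"
print(find_all("resume", text))  # [0, 8, 16]
```

Folding treats every accent as ignorable in every language. A collation-based search can instead follow the rules of a specific locale, where an accented letter may count as a distinct letter.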

## Programming Examples

Here are some [API usage conventions](api.md) for the ICU Collation Service
APIs.