---
layout: default
title: Case Mappings
nav_order: 1
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Case Mappings
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

Case mapping converts characters between their uppercase, lowercase, and title
case forms for a given language. Case is a normative property of characters
in specific alphabets (e.g. Latin, Greek, Cyrillic, Armenian, and Georgian)
whereby characters are considered to be variants of a single letter. ICU refers
to these variants, which may differ markedly in shape and size, as uppercase
letters (also known as capital or majuscule) and lowercase letters (also known
as small or minuscule). Alphabets with case differences are called bicameral and
alphabets without case differences are called unicameral.

Due to the inclusion of certain composite characters for compatibility, such as
the Latin capital letter 'DZ' (\\u01F1 'Ǳ'), there is a third case called title
case.
Title case is used to capitalize the first character of a word, as in the
Latin capital letter 'D' with small letter 'z' (\\u01F2 'ǲ'). The term "title
case" can also be used to refer to words whose first letter is an uppercase or
title case letter and the rest are lowercase letters. However, not all words in
the title of a document or first words in a sentence will be title case. The use
of title case words is language dependent. For example, in English, "Taming of
the Shrew" would be the appropriate capitalization and not "Taming Of The
Shrew".

> :point_right: **Note**: *As of Unicode 11, Georgian now has Mkhedruli (lowercase) and Mtavruli
(uppercase) which form case pairs, but are not used in title case.*

Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/ustring/ustring.cpp).

Please refer to the following sections in [The Unicode Standard](http://www.unicode.org/versions/latest/)
for more information about case mapping:

* 3.13 Default Case Algorithms
* 4.2 Case
* 5.18 Case Mappings

## Simple (Single-Character) Case Mapping

The general case mapping in ICU is language-independent and maps each character
one-to-one to a single character.
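
As a rough illustration of one-to-one mappings, the sketch below walks through the three case forms of the DZ digraph from the Overview. It uses Python's built-in `str` methods rather than ICU's functions. Those methods apply full Unicode mappings, which are assumed here to coincide with the simple mappings for these particular characters. The constant and variable names are illustrative only.

```python
# Three case forms of the DZ digraph; each is a single code point.
DZ_UPPER = "\u01F1"  # LATIN CAPITAL LETTER DZ
DZ_TITLE = "\u01F2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
DZ_LOWER = "\u01F3"  # LATIN SMALL LETTER DZ

# Each mapping yields exactly one character for these inputs.
upper_of_lower = DZ_LOWER.upper()
title_of_lower = DZ_LOWER.title()
lower_of_upper = DZ_UPPER.lower()

# A character without a case mapping comes back unchanged.
digit_upper = "7".upper()
```

Note that the title case form is a distinct code point, not the uppercase form followed by a lowercase letter.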

A character is considered to have a lowercase, uppercase, or title case
equivalent if there is a respective "simple" case mapping specified for the
character in the [Unicode Character Database](http://www.unicode.org/ucd/) (UnicodeData.txt).
If a character has no mapping equivalent, the result is the character itself.

The APIs provided for the general case mapping, located in the `uchar.h` file, handle
only single characters of type `UChar32` and return only single characters. To
convert a string without language-specific behavior, use the APIs in either
the `unistr.h` or `ustring.h` files with a `NULL` locale argument.

## Full (Language-Specific) Case Mapping

There are different case mappings for different locales. For instance, unlike in
English, the uppercase equivalent of the Latin small letter 'i' in Turkish is the
Latin capital letter 'I' with dot above (\\u0130 'İ').

Similar to the simple case mapping API, a character is considered to have a
lowercase, uppercase, or title case equivalent if there is a respective mapping
specified for the character in the Unicode Character Database (UnicodeData.txt).
If a character has no mapping equivalent, the result is the character itself.

To convert a string using language-specific case mappings, use the APIs in
`ustring.h` and `unistr.h` with the intended locale argument.
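
A minimal sketch of why the locale argument matters, written with Python's built-in `str` methods rather than ICU. Python's case methods are locale-independent, so the Turkic behavior is approximated by two hypothetical helpers, `upper_turkic` and `lower_turkic`. They are not part of any library and handle only the precomposed dotted and dotless 'i' letters. The last line also shows a full mapping that changes the string length.

```python
def upper_turkic(text: str) -> str:
    """Uppercase with the Turkish/Azerbaijani 'i' rule (hypothetical helper)."""
    # In Turkic locales, dotted 'i' uppercases to U+0130 (I with dot above).
    # Dotless U+0131 already uppercases to 'I' under the default mapping.
    return text.replace("i", "\u0130").upper()


def lower_turkic(text: str) -> str:
    """Lowercase with the Turkish/Azerbaijani 'I' rule (hypothetical helper)."""
    # U+0130 lowercases to 'i', and 'I' lowercases to dotless U+0131.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


default_upper = "istanbul".upper()       # locale-independent mapping
turkic_upper = upper_turkic("istanbul")  # Turkic tailoring
turkic_lower = lower_turkic("ISTANBUL")

# Full mappings can change string length: U+00DF (sharp s) uppercases to "SS".
german_upper = "stra\u00DFe".upper()
```

The same input therefore produces different results depending on the language rules applied, which is why the string-level APIs take a locale.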

ICU implements full Unicode string case mappings.

**In general:**

* **case mapping can change the number of code points and/or code units of a
  string,**
* **is language-sensitive (results may differ depending on language), and**
* **is context-sensitive (a character in the input string may map differently
  depending on surrounding characters).**

## Case Folding

Case folding maps strings to a canonical form where case differences are erased.
Using the case folding API, ICU supports fast matches without regard to case in
lookups, since only binary comparison is required.

The CaseFolding.txt file in the Unicode Character Database is used for
performing locale-independent case folding. This text file is generated from the
case mappings in the Unicode Character Database, using both the single-character
and the multi-character mappings. The CaseFolding.txt data maps all
characters having different case forms to a common form. To compare two
strings for case-insensitive matching, you can fold each string and then
use a binary comparison. There are also functions to compare two strings
case-insensitively using the same case folding data.

Unicode case folding is not context-sensitive.
It is also not
language-sensitive, although there is a flag for whether to apply special
mappings for use with Turkic (Turkish/Azerbaijani) text data.

Character case folding API implementations are located in:

1. `uchar.h` for single character folding

2. `ustring.h` and `unistr.h` for character string folding.
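
The fold-then-compare approach described in this section can be sketched as follows. The sketch uses Python's built-in `str.casefold()`, which applies locale-independent Unicode case folding, as a stand-in for ICU's folding functions. `caseless_equal` is an illustrative name, and the Turkic option has no counterpart in this sketch.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings without regard to case (illustrative helper)."""
    # Fold both strings to a common form, then compare them directly.
    return a.casefold() == b.casefold()


# Folding uses multi-character mappings: U+00DF (sharp s) folds to "ss".
folded = "Stra\u00DFe".casefold()
same = caseless_equal("Stra\u00DFe", "STRASSE")

# Plain lowercasing is not enough here, because the sharp s stays as it is.
lower_same = "Stra\u00DFe".lower() == "STRASSE".lower()

# Folding is not context-sensitive: final and non-final sigma fold alike.
sigma_same = caseless_equal("\u039F\u0394\u039F\u03A3", "\u03BF\u03B4\u03BF\u03C2")
```

Folding both operands once and storing the folded keys is what makes repeated lookups cheap, since each later comparison is a plain binary one.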