---
layout: default
title: Chars and Strings
nav_order: 600
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Strings

## Overview

This section explains how to handle Unicode strings with ICU in C and C++.

Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/ustring/ustring.cpp).

## Text Access Overview

Strings are the most common and fundamental form of handling text in software.
Logically, and often physically, they contain contiguous arrays (vectors) of
basic units. Most of the ICU API functions work directly with simple strings,
and where possible, this is preferred.

Sometimes, text needs to be accessed via more powerful and complicated methods.
For example, text may be stored in discontiguous chunks in order to deal with
frequent modification (like typing) and large amounts, or it may not be stored
in the internal encoding, or it may have associated attributes like bold or
italic styles.

### Guidance

ICU provides multiple text access interfaces which were added over time. If
simple strings cannot be used, then consider the following:

1. [UText](utext.md): Added in ICU4C 3.4 as a technology preview. Intended to
   be the strategic text access API for use with ICU. C API, high performance,
   writable, supports native indexes for efficient non-UTF-16 text storage. So
   far (3.4) only supported in BreakIterator. Some API changes are anticipated
   for ICU 3.6.

2. Replaceable (Java & C++) and UReplaceable (C): Writable, designed for use
   with Transliterator.

3. CharacterIterator (Java JDK & C++): Read-only, used in many APIs. Large
   differences between the JDK and C++ versions.

4. UCharacterIterator (Java): Back-port of the C++ CharacterIterator to ICU4J
   for support of supplementary code points and post-increment iteration.

5. UCharIterator (C): Read-only, C interface used mostly in incremental
   normalization and collation.

The following provides some historical perspective and comparison between the
interfaces.

### CharacterIterator

ICU has long provided the CharacterIterator interface for some services.
It allows for abstract text access, but has limitations:

1. It has a per-character function call overhead.

2. Originally, it was designed for UCS-2 operation and did not support direct
   handling of supplementary Unicode code points. Such support was later added.

3. Its pre-increment iteration semantics are uncommon, and are inefficient when
   used with a variable-width encoding form (UTF-16). Functions for
   post-increment iteration were added later.

4. The C++ version added iteration start/limit boundaries only because the C++
   UnicodeString copies string contents during substringing; the Java
   CharacterIterator does not have these extra boundaries because substringing
   is more efficient in Java.

5. CharacterIterator is not available for use in C.

6. CharacterIterator is a read-only interface.

7. It uses UTF-16 indexes into the text, which is not efficient for other
   encoding forms.

8. With the additions to the API over time, the number of methods that have to
   be overridden by subclasses has become rather large.

Core Java adopted an early version of CharacterIterator; later
functionality, like support for supplementary code points, was back-ported from
ICU4C to ICU4J to form the UCharacterIterator class.

The UCharIterator C interface was added to allow for incremental normalization
and collation in C. It is entirely code unit (UChar)-oriented, uses only
post-increment iteration and has a smaller number of overridable methods.

### Replaceable

The Replaceable (Java & C++) and UReplaceable (C) interfaces are designed for,
and used in, Transliterator. They are random-access interfaces, not iterators.

### UText

The [UText](utext.md) text access interface was designed as a possible
replacement for all previous interfaces listed above, with additional
functionality. It allows for high-performance operation through the use of
storage-native indexes (for efficient use of non-UTF-16 text) and through
accessing multiple characters per function call. Code point iteration is
available with functions as well as with C macros, for maximum performance.
UText is also writable, mostly patterned after Replaceable. For details see the
UText chapter.

## Strings in ICU

### Strings in Java

In Java, ICU uses the standard String and StringBuffer classes, `char[]`, etc.
See the Java documentation for details.

### Strings in C/C++

Strings in C and C++ are, at the lowest level, arrays of some particular base
type. In most cases, the base type is a char, which is an 8-bit byte in modern
compilers. Some APIs use a "wide character" type wchar_t that is typically 8,
16, or 32 bits wide and upwards compatible with char. C code passes `char *` or
wchar_t pointers to the first element of an array. C++ enables you to create a
class for encapsulating these kinds of character arrays in handy and safe
objects.

The interpretation of the byte or wchar_t values depends on the platform, the
compiler, the signed state of both char and wchar_t, and the width of wchar_t.
These characteristics are not specified in the language standards. When using
internationalized text, the encoding often uses multiple chars for most
characters and a wchar_t that is wide enough to hold exactly one character code
point value each. Some APIs, especially in the standard library (stdlib), assume
that wchar_t strings use a fixed-width encoding with exactly one character code
point per wchar_t.

### ICU: 16-bit Unicode strings

In order to take advantage of Unicode with its large character repertoire and
its well-defined properties, there must be types with consistent definitions and
semantics.
The Unicode standard defines a default encoding based on 16-bit code
units. This is supported in ICU by the definition of the UChar to be an unsigned
16-bit integer type. This is the base type for character arrays for strings in
ICU.

> :point_right: **Note**: *Endianness is not an issue on this level because the interpretation of an
integer is fixed within any given platform.*

With the UTF-16 encoding form, a single Unicode code point is encoded with
either one or two 16-bit UChar code units (unambiguously). "Supplementary" code
points, which are encoded with pairs of code units, are rare in most texts. The
two code units are called "surrogates", and their unit value ranges are distinct
from each other and from single-unit value ranges. Code should be generally
optimized for the common, single-unit case.

16-bit Unicode strings in internal processing contain sequences of 16-bit code
units that may not always be well-formed UTF-16. ICU treats single, unpaired
surrogates as surrogate code points, i.e., they are returned in per-code point
iteration, they are included in the number of code points of a string, and they
are generally treated much like normal, unassigned code points in most APIs.
Surrogate code points have Unicode properties although they cannot be assigned
an actual character.

ICU string handling functions (including append, substring, etc.)
do not
automatically protect against producing malformed UTF-16 strings. Most of the
time, indexes into strings are naturally at code point boundaries because they
result from other functions that always produce such indexes. If necessary, the
user can test for proper boundaries by checking the code unit values, or adjust
arbitrary indexes to code point boundaries by using the C macros
U16_SET_CP_START() and U16_SET_CP_LIMIT() (see utf.h) and the UnicodeString
functions getChar32Start() and getChar32Limit().

UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and
convenience functions (ustring.h), but only a subset of APIs works with UTF-8
directly as string encoding form.

**See the [UTF-8](utf-8.md) subpage for details about working with
UTF-8.** Some of the following sections apply to UTF-8 APIs as well; for example
sections about handling lengths and overflows.

### Separate type for single code points

A Unicode code point is an integer with a value from 0 to 0x10FFFF. ICU 2.4 and
later define the UChar32 type for single code point values as a 32-bit
signed integer (int32_t). This allows the use of easily testable negative values
as sentinels, to indicate errors, exceptions or "done" conditions. All negative
values and positive values greater than 0x10FFFF are illegal as Unicode code
points.
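
As a minimal illustration of why a signed code point type is convenient, the
following standalone C sketch returns a negative sentinel when iteration is
finished. It does not use ICU headers: `CodePoint`, `CP_DONE`,
`is_code_point()` and `next_code_point()` are hypothetical names that stand in
for UChar32 and ICU's own facilities, and the handling of unpaired surrogates
follows the behavior described earlier in this section.

```c
#include <stdint.h>

/* Hypothetical stand-in for ICU's UChar32: a signed 32-bit integer. */
typedef int32_t CodePoint;

/* Negative sentinel: it can never collide with a real code point. */
#define CP_DONE ((CodePoint)-1)

/* Returns 1 if c is in the Unicode code point range 0..0x10FFFF, else 0. */
static int is_code_point(CodePoint c) {
    return c >= 0 && c <= 0x10FFFF;
}

/*
 * Reads the code point at *index from a UTF-16 array with the given length
 * and advances *index past it. Returns CP_DONE at the end of the string.
 * An unpaired surrogate is returned as itself (a surrogate code point).
 */
static CodePoint next_code_point(const uint16_t *s, int32_t length,
                                 int32_t *index) {
    if (*index >= length) {
        return CP_DONE;
    }
    CodePoint c = s[(*index)++];
    if (c >= 0xD800 && c <= 0xDBFF && *index < length) {
        uint16_t trail = s[*index];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++*index;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}
```

A caller can loop with `while ((c = next_code_point(s, length, &i)) >= 0)`,
because every valid return value is non-negative.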

ICU 2.2 and earlier defined UChar32 depending on the platform: If the compiler's
wchar_t was 32 bits wide, then UChar32 was defined to be the same as wchar_t.
Otherwise, it was defined to be an unsigned 32-bit integer. This means that
UChar32 was either a signed or unsigned integer type depending on the compiler.
This was meant for better interoperability with existing libraries, but was of
little use because ICU does not process 32-bit strings; UChar32 is only used
for single code points. The platform dependence of UChar32 could cause problems
with C++ function overloading.

### Compiler-dependent definitions

The compiler's and the runtime character set's codepage encodings are not
specified by the C/C++ language standards and are usually not a Unicode encoding
form. They typically depend on the settings of the individual system, process,
or thread. Therefore, it is not possible to instantiate a Unicode character or
string variable directly with C/C++ character or string literals. The only safe
way is to use numeric values. It is not an issue for User Interface (UI) strings
that are translated. These UI strings are loaded from a resource bundle, which
is generated from a text file that can be in Unicode or in any other
ICU-provided codepage. The binary form generated by the genrb tool contains
UTF-16 strings that are ready for direct use.

There is a useful exception to this for program-internal strings and test
strings. Within each "family" of character encodings, there is a set of
characters that have the same numeric code values. Such characters include Latin
letters, the basic digits, the space, and some punctuation. Most of the ASCII
graphic characters are invariant characters. The same set, with different but
again consistent numeric values, is invariant among almost all EBCDIC codepages.
For details, see
[icu4c/source/common/unicode/utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html).
With strings that contain only these invariant characters, it is possible to
use efficient ICU constructs to write a C/C++ string literal and use it to
initialize Unicode strings.

In some APIs, ICU uses `char *` strings. This is either for file system paths or
for strings that contain invariant characters only (such as locale identifiers).
These strings are in the platform-specific encoding of either ASCII or EBCDIC.
All other codepage differences do not matter for invariant characters, and such
strings can be manipulated with C stdlib functions like strcpy().

In some APIs where identifiers are used, ICU uses `char *` strings with invariant
characters.
Such strings do not require the full Unicode repertoire and are
easier to handle in C and C++ with `char *` string literals and standard C
library functions. Their useful character repertoire is actually smaller than
the set of graphic ASCII characters; for details, see
[utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html). Examples of
`char *` identifier uses are converter names, locale IDs, and resource bundle
table keys.

There is another, less efficient way to have human-readable Unicode string
literals in C and C++ code. ICU provides a small number of functions that allow
any Unicode characters to be inserted into a string with escape sequences
similar to those used in the C and C++ languages. In addition to the
familiar \\n and \\xhh etc., ICU also provides the \\uhhhh syntax with four hex
digits and the \\Uhhhhhhhh syntax with eight hex digits for hexadecimal Unicode
code point values. This is very similar to the newer escape sequences used in
Java and defined in the latest C and C++ standards. Since ICU is not a compiler
extension, the "unescaping" is done at runtime and the backslash itself must be
escaped (duplicated) so that the compiler does not attempt to "unescape" the
sequence itself.
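
To show what runtime unescaping involves, here is a standalone C sketch. It is
deliberately not ICU's implementation (in ICU this job is done by functions
such as `u_unescape()` in ustring.h): the helper names `hex_value()` and
`unescape_sketch()` are hypothetical, only `\uhhhh`, `\Uhhhhhhhh`, `\n` and
`\\` are handled, and all other characters are copied by value, which is only
correct on ASCII-based platforms.

```c
#include <stdint.h>

/* Returns the value of hex digit ch, or -1 if ch is not a hex digit. */
static int hex_value(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/*
 * Converts the NUL-terminated src to UTF-16 code units in dest.
 * Returns the number of code units written, or -1 for a malformed
 * escape sequence or insufficient capacity. No NUL is appended.
 */
static int32_t unescape_sketch(const char *src, uint16_t *dest,
                               int32_t capacity) {
    int32_t n = 0;
    while (*src != 0) {
        uint32_t c = (unsigned char)*src++;
        if (c == '\\') {
            char kind = *src++;
            if (kind == 'u' || kind == 'U') {
                int digits = (kind == 'u') ? 4 : 8;
                c = 0;
                for (int k = 0; k < digits; ++k) {
                    int h = hex_value(*src);
                    if (h < 0) return -1;      /* also catches end of string */
                    c = (c << 4) | (uint32_t)h;
                    ++src;
                }
                if (c > 0x10FFFF) return -1;   /* not a code point */
            } else if (kind == 'n') {
                c = 0x0A;
            } else if (kind == '\\') {
                c = 0x5C;
            } else {
                return -1;                     /* unsupported in this sketch */
            }
        }
        if (c <= 0xFFFF) {
            if (n + 1 > capacity) return -1;
            dest[n++] = (uint16_t)c;
        } else {
            if (n + 2 > capacity) return -1;
            dest[n++] = (uint16_t)((c >> 10) + 0xD7C0);    /* lead surrogate */
            dest[n++] = (uint16_t)((c & 0x3FF) | 0xDC00);  /* trail surrogate */
        }
    }
    return n;
}
```

Note the doubled backslashes in a call such as
`unescape_sketch("Gr\\u00FC\\U0001F600", buffer, 8)`: the C compiler turns each
`\\` into a single backslash, and the function then interprets `\u00FC` and
`\U0001F600` at runtime.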

## Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string are
always counted in terms of UChar code units, not in terms of UChar32 code
points. (This is the same as in common C library functions that use `char *`
strings with multi-byte encodings.)

Often, a user thinks of a "character" as a complete unit in a language, like an
'Ä', while it may be represented with multiple Unicode code points including a
base character and combining marks. (See the Unicode standard for details.) This
often requires users to index and pass strings (UnicodeString or `UChar *`) with
multiple code units or code points. It cannot be done with single-integer
character types. Indexing of such "characters" is done with the BreakIterator
class (in C: ubrk_ functions).

Even with such "higher-level" indexing functions, the actual index values will
be expressed in terms of UChar code units. When more than one code unit is used
at a time, the index value changes by more than one at a time.

ICU uses signed 32-bit integers (int32_t) for lengths and offsets. Because of
internal computations, strings (and arrays in general) are limited to 1G base
units or 2G bytes, whichever is smaller.

## Using C Strings: NUL-Terminated vs. Length Parameters

Strings are either terminated with a NUL character (code point 0, U+0000) or
their length is specified. In the latter case, it is possible to have one or
more NUL characters inside the string.

**Input string** arguments are typically passed with two parameters: The (const)
`UChar *` pointer and an int32_t length argument. If the length is -1 then the
string must be NUL-terminated and the ICU function will call the u_strlen()
function or treat it equivalently. If the input string contains embedded NUL
characters, then the length must be specified.

**Output string** arguments are typically passed with a destination `UChar *`
pointer and an int32_t capacity argument and the function returns the length of
the output as an int32_t. There is also almost always a UErrorCode argument.
Essentially, a `UChar[]` array is passed in with its start and the number of
available UChars. The array is filled with the output and if space permits the
output will be NUL-terminated. The length of the output string is returned. In
all cases the length of the output string does not include the terminating NUL.
This is the same behavior found in most ICU and non-ICU string APIs, for example
u_strlen(). The output string may **contain** NUL characters as part of its
actual contents, depending on the input and the operation.
Note that the
UErrorCode parameter is used to indicate both errors and warnings (non-errors).
The following describes some of the situations in which the UErrorCode will be
set to a non-zero value:

1. If the output length is greater than the output array capacity, then the
   UErrorCode will be set to U_BUFFER_OVERFLOW_ERROR and the contents of the
   output array are undefined.

2. If the output length is equal to the capacity, then the output has been
   completely written minus the terminating NUL. This is also indicated by
   setting the UErrorCode to U_STRING_NOT_TERMINATED_WARNING.
   Note that U_STRING_NOT_TERMINATED_WARNING does not indicate failure (it
   passes the U_SUCCESS() macro).
   Note also that it is more reliable to check the output length against the
   capacity, rather than checking for the warning code, because warning codes
   do not cause the early termination of a function and may subsequently be
   overwritten.

3. If neither of these two conditions applies, the error code will indicate
   success and not a U_STRING_NOT_TERMINATED_WARNING. (If a
   U_STRING_NOT_TERMINATED_WARNING code had been set in the UErrorCode
   parameter before the function call, then it is reset to a U_ZERO_ERROR.)

**Preflighting:** The returned length is always the full output length even if
the output buffer is too small.
It is possible to pass in a capacity of 0 (and
an output array pointer of NULL) for "pure preflighting" to determine the
necessary output buffer size. Add one to the returned length to leave room for
the terminating NUL.

Note that, whether the caller intends to "preflight" or not, if the output
length is equal to or greater than the capacity, then the UErrorCode is set to
U_STRING_NOT_TERMINATED_WARNING or U_BUFFER_OVERFLOW_ERROR respectively, as
described above.

However, "pure preflighting" is very expensive because the operation has to be
processed twice: once for calculating the output length, and a second time to
actually generate the output. It is much more efficient to always provide an
output buffer that is expected to be large enough for most cases, and to
reallocate and repeat the operation only when an overflow occurred. (Remember to
reset the UErrorCode to U_ZERO_ERROR before calling the function again.) In
C/C++, the initial output buffer can be a stack buffer. In case of a
reallocation, it may be possible and useful to cache and reuse the new, larger
buffer.
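
The buffer handling described above can be sketched in standalone C. This is
an illustration of the calling convention only, not ICU code: `SketchStatus`,
`repeat_units()` and `sum_repeated()` are hypothetical names, and the status
values merely imitate the roles of U_ZERO_ERROR,
U_STRING_NOT_TERMINATED_WARNING and U_BUFFER_OVERFLOW_ERROR (warnings negative,
errors positive).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes that imitate the UErrorCode convention. */
typedef enum {
    SKETCH_NOT_TERMINATED_WARNING = -1, /* output complete, but no NUL */
    SKETCH_OK = 0,
    SKETCH_BUFFER_OVERFLOW_ERROR = 1    /* output did not fit */
} SketchStatus;

/*
 * Example operation: writes src repeated `times` times into dest.
 * Always returns the full output length, even if dest is too small.
 */
static int32_t repeat_units(const uint16_t *src, int32_t srcLength,
                            int32_t times, uint16_t *dest,
                            int32_t destCapacity, SketchStatus *status) {
    int32_t length = srcLength * times;
    if (length > destCapacity) {
        *status = SKETCH_BUFFER_OVERFLOW_ERROR;
        return length;                  /* preflight result */
    }
    for (int32_t t = 0; srcLength > 0 && t < times; ++t) {
        memcpy(dest + t * srcLength, src,
               (size_t)srcLength * sizeof(uint16_t));
    }
    if (length < destCapacity) {
        dest[length] = 0;               /* room for the terminating NUL */
        *status = SKETCH_OK;
    } else {
        *status = SKETCH_NOT_TERMINATED_WARNING;
    }
    return length;
}

/*
 * Typical caller: try a stack buffer, reallocate and repeat only on overflow.
 * As a placeholder for real work it returns the sum of the output code units
 * (or -1 if malloc() fails); *usedHeap reports whether a retry was needed.
 */
static int64_t sum_repeated(const uint16_t *src, int32_t srcLength,
                            int32_t times, int *usedHeap) {
    uint16_t stackBuffer[8];
    uint16_t *buffer = stackBuffer;
    SketchStatus status = SKETCH_OK;
    int32_t length = repeat_units(src, srcLength, times, buffer, 8, &status);
    *usedHeap = 0;
    if (status == SKETCH_BUFFER_OVERFLOW_ERROR) {
        buffer = (uint16_t *)malloc((size_t)length * sizeof(uint16_t));
        if (buffer == NULL) return -1;
        status = SKETCH_OK;             /* reset before calling again */
        repeat_units(src, srcLength, times, buffer, length, &status);
        *usedHeap = 1;
    }
    int64_t sum = 0;
    for (int32_t k = 0; k < length; ++k) sum += buffer[k];
    if (buffer != stackBuffer) free(buffer);
    return sum;
}
```

The caller relies on the returned length rather than on the warning code, and
the second call only happens in the (hopefully rare) overflow case.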

> :point_right: **Note**: *The exceptions to these rules are the ANSI-C-style functions like u_strcpy(),
which generally require NUL-terminated strings, forbid embedded NULs, and do not
take capacity arguments for buffer overflow checking.*

## Using Unicode Strings in C

In C, Unicode strings are similar to standard `char *` strings. Unicode strings
are arrays of UChar and most APIs take a `UChar *` pointer to the first element
and an input length and/or output capacity (see above). ICU has a number of
functions that provide the Unicode equivalent of the stdlib functions such as
strcpy(), strstr(), etc. Compared with their C standard counterparts, their
function names begin with u_. Otherwise, their semantics are equivalent. These
functions are defined in icu/source/common/unicode/ustring.h.

### Code Point Access

Sometimes, Unicode code points need to be accessed in C for iteration, movement
forward, or movement backward in a string. A string might also need to be
written from code point values. ICU provides a number of macros that are
defined in the icu/source/common/unicode/utf.h and utf8.h/utf16.h headers that
it includes (utf.h is in turn included with utypes.h).

Macros for 16-bit Unicode strings have a U16_ prefix.
For example:

    U16_NEXT(s, i, length, c)
    U16_PREV(s, start, i, c)
    U16_APPEND(s, i, length, c, isError)

There are also macros with a U_ prefix for code point range checks (e.g., test
for non-character code point), and U8_ macros for 8-bit (UTF-8) strings. See the
header files and the API References for more details.

#### UTF Macros before ICU 2.4

In ICU 2.4, the utf\*.h macros were revamped, improved, simplified, and
renamed. The old macros continue to be available. They are in utf_old.h,
together with an explanation of the change. utf.h, utf8.h and utf16.h contain
the new macros instead. The new macros are intended to be more consistent, more
useful, and less confusing. Some macros were simply renamed for consistency with
a new naming scheme.

The documentation of the old macros has been removed. If you need it, see a User
Guide version from ICU 4.2 or earlier (see the [download
page](https://icu.unicode.org/download)).
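
To make the macro families listed under "Code Point Access" more concrete, here
is a standalone C sketch of what an append operation and a backward step
involve for UTF-16. It is not ICU code and does not include ICU headers:
`utf16_append()` and `utf16_prev()` are hypothetical helper names, and code
that uses ICU would normally call the `U16_APPEND` and `U16_PREV` macros
instead.

```c
#include <stdint.h>

/*
 * Appends code point c at s[*i] and advances *i (in the spirit of
 * U16_APPEND). Returns 0 on success, or -1 if c is not a code point
 * or does not fit into the given capacity.
 */
static int utf16_append(uint16_t *s, int32_t *i, int32_t capacity,
                        int32_t c) {
    if (c >= 0 && c <= 0xFFFF && *i < capacity) {
        s[(*i)++] = (uint16_t)c;                       /* one code unit */
        return 0;
    }
    if (c > 0xFFFF && c <= 0x10FFFF && *i + 1 < capacity) {
        s[(*i)++] = (uint16_t)((c >> 10) + 0xD7C0);    /* lead surrogate */
        s[(*i)++] = (uint16_t)((c & 0x3FF) | 0xDC00);  /* trail surrogate */
        return 0;
    }
    return -1;
}

/*
 * Moves *i back to the start of the code point that ends just before it
 * and returns that code point (in the spirit of U16_PREV).
 * The caller must ensure that start < *i.
 */
static int32_t utf16_prev(const uint16_t *s, int32_t start, int32_t *i) {
    int32_t c = s[--(*i)];
    if (c >= 0xDC00 && c <= 0xDFFF && *i > start) {
        int32_t lead = s[*i - 1];
        if (lead >= 0xD800 && lead <= 0xDBFF) {
            --(*i);
            c = 0x10000 + ((lead - 0xD800) << 10) + (c - 0xDC00);
        }
    }
    return c;
}
```

In both helpers the index moves by one or two code units per code point, which
is why string indexes are always counted in code units.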

### C Unicode String Literals

There is a pair of macros that together enable users to instantiate a Unicode
string in C (a `UChar []` array) from a C string literal:

    /*
     * In C, we need two macros: one to declare the UChar[] array, and
     * one to populate it; the second one is a noop on platforms where
     * wchar_t is compatible with UChar and ASCII-based.
     * The length of the string literal must be counted for both macros.
     */
    /* declare the invString array for the string */
    U_STRING_DECL(invString, "such characters are safe 123 %-.", 32);
    /* populate it with the characters */
    U_STRING_INIT(invString, "such characters are safe 123 %-.", 32);

With invariant characters, it is also possible to efficiently convert `char *`
strings to and from `UChar *` strings:

    static const char *cs1="such characters are safe 123 %-.";
    static UChar us1[40];
    static char cs2[40];
    u_charsToUChars(cs1, us1, 33); /* include the terminating NUL */
    u_UCharsToChars(us1, cs2, 33);

## Testing for well-formed UTF-16 strings

It is sometimes useful to test if a 16-bit Unicode string is well-formed UTF-16,
that is, that it does not contain unpaired surrogate code units.
For a boolean
test, call a function like u_strToUTF8() which sets an error code if the input
string is malformed. (Provide a zero-capacity destination buffer and treat the
buffer overflow error as "is well-formed".) If you need to know the position of
the unpaired surrogate, you can iterate through the string with U16_NEXT() and
U_IS_SURROGATE().

## Using Unicode Strings in C++

[UnicodeString](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classUnicodeString.html) is
a C++ string class that wraps a UChar array and associated bookkeeping. It
provides a rich set of string handling functions.

UnicodeString combines elements of both the Java String and StringBuffer
classes. Many UnicodeString functions are named and work similarly to Java String
methods but modify the object (UnicodeString is "mutable").

UnicodeString provides functions for random access and use (insert/append/find
etc.) of both code units and code points. For each non-iterative string/code
point macro in utf.h there is at least one UnicodeString member function. The
names of most of these functions contain "32" to indicate the use of a UChar32.

Code point and code unit iteration is provided by the
[CharacterIterator](characteriterator.md) abstract class and its subclasses.
There are concrete iterator implementations for UnicodeString objects and plain
`UChar []` arrays.

Most UnicodeString constructors and functions do not have a UErrorCode
parameter. Instead, if the construction of a UnicodeString fails, for example
when it is constructed from a NULL `UChar *` pointer, then the UnicodeString
object becomes "bogus". This can be tested with the isBogus() function. A
UnicodeString can be put into the "bogus" state explicitly with the setToBogus()
function. This is different from an empty string (although a "bogus" string also
returns true from isEmpty()) and may be used equivalently to NULL in `UChar *` C
APIs (or null references in Java, or NULL values in SQL). A string remains
"bogus" until a non-bogus string value is assigned to it. For complete details
of the behavior of "bogus" strings see the description of the setToBogus()
function.

Some APIs work with the
[Replaceable](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classReplaceable.html)
abstract class. It defines a simple interface for random access and text
modification and is useful for operations on text that may have associated
meta-data (e.g., styled text), especially in the Transliterator API.
UnicodeString implements Replaceable.
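
To make the distinction between code units and code points in this section
concrete, the following standalone sketch decodes a UTF-16 string into code
points without using ICU. The helper name `decodeCodePoints` is hypothetical and
is not part of any ICU header; it only illustrates the idea. With a
UnicodeString, member functions whose names contain "32", such as
`countChar32()` and `char32At()`, cover this kind of access.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an ICU API): decode UTF-16 code units into code points.
// A well-formed lead/trail surrogate pair becomes one supplementary code point.
// An unpaired surrogate is passed through unchanged as its own value.
std::vector<char32_t> decodeCodePoints(const std::u16string &s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        const bool isLead = (c >= 0xD800 && c <= 0xDBFF);
        if (isLead && i + 1 < s.size()) {
            const char32_t trail = s[i + 1];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                // Combine the pair and skip the trail surrogate.
                c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
                ++i;
            }
        }
        out.push_back(c);
    }
    return out;
}
```

For example, the string `u"a\U0001F600b"` consists of four UTF-16 code units
but three code points, so the result of `decodeCodePoints()` is one element
shorter than the input.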

### C++ Unicode String Literals

As in C, there are macros that enable users to instantiate a UnicodeString
from a C string literal. One macro requires the length of the string as in the C
macros, the other one implies a strlen().

    UnicodeString s1=UNICODE_STRING("such characters are safe 123 %-.", 32);
    UnicodeString s2=UNICODE_STRING_SIMPLE("such characters are safe 123 %-.");

It is possible to efficiently convert between invariant-character strings and
UnicodeStrings by using constructor, setTo() or extract() overloads that take
codepage data (`const char *`) and specifying an empty string ("") as the
codepage name.

## Using C++ Strings in C APIs

The internal buffer of UnicodeString objects is available for direct handling in
C (or C-style) APIs that take `UChar *` arguments. It is possible but usually not
necessary to copy the string contents with one of the extract functions. The
following describes several direct buffer access methods.

The UnicodeString function getBuffer() const returns a readonly const `UChar *`.
The length of the string is indicated by UnicodeString's length() function.
Generally, UnicodeString does not NUL-terminate the contents of its internal
buffer.
However, it is possible to check for a NUL character if the length of
the string is less than the capacity of the buffer. The following expression is
an example of how to check whether the buffer happens to be NUL-terminated,
where `buffer` is the pointer returned by getBuffer() const:
`(s.length()<s.getCapacity() && buffer[s.length()]==0)`

An easier way to NUL-terminate the buffer and get a `const UChar *` pointer to it
is the getTerminatedBuffer() function. Unlike getBuffer() const,
getTerminatedBuffer() is not a const function because it may have to (reallocate
and) modify the buffer to append a terminating NUL. Therefore, use getBuffer()
const if you do not need a NUL-terminated buffer.

There is also a pair of functions that allow controlled write access to the
buffer of a UnicodeString: `UChar *getBuffer(int32_t minCapacity)` and
`releaseBuffer(int32_t newLength)`. `UChar *getBuffer(int32_t minCapacity)`
provides a writable buffer of at least the requested capacity and returns a
pointer to it. The actual capacity of the buffer after the
`getBuffer(minCapacity)` call may be larger than the requested capacity and can be
determined with `getCapacity()`.

Once the buffer contents are modified, the buffer must be released with the
`releaseBuffer(int32_t newLength)` function, which sets the new length of the
UnicodeString (newLength=-1 can be passed to determine the length of
NUL-terminated contents like `u_strlen()`).

Between the `getBuffer(minCapacity)` and `releaseBuffer(newLength)` function calls,
the contents of the UnicodeString are unknown and the object behaves like it
contains an empty string. A nested `getBuffer(minCapacity)`, `getBuffer() const` or
`getTerminatedBuffer()` will fail (return NULL) and modifications of the string
via UnicodeString member functions will have no effect. Copying a string with an
"open buffer" yields an empty copy. The move constructor, move assignment
operator and Return Value Optimization (RVO) transfer the state, including the
open buffer.

See the UnicodeString API documentation for more information.

## Using C Strings in C++ APIs

There are efficient ways to wrap C-style strings in C++ UnicodeString objects
without copying the string contents. In order to use C strings in C++ APIs, the
`UChar *` pointer and length need to be wrapped into a UnicodeString. This can be
done efficiently in two ways: with a readonly alias or with a writable alias. The
UnicodeString object that is constructed actually uses the `UChar *` pointer as
its internal buffer pointer instead of allocating a new buffer and copying the
string contents.

If the original string is a readonly `const UChar *`, then the UnicodeString must
be constructed with a readonly alias.
If the original string is a writable
(non-const) `UChar *` and is to be modified (e.g., if the `UChar *` buffer is an
output buffer) then the UnicodeString should be constructed with a writable
alias. For more details see the section "Maximizing Performance with the
UnicodeString Storage Model" and search the unistr.h header file for "alias".

## Maximizing Performance with the UnicodeString Storage Model

UnicodeString uses four storage methods to maximize performance and minimize
memory consumption:

1. Short strings are normally stored inside the UnicodeString object. The
   object has fields for the "bookkeeping" and a small UChar array. When the
   object is copied, the internal characters are copied into the destination
   object.
2. Longer strings are normally stored in allocated memory. The allocated UChar
   array is preceded by a reference counter. When the string object is copied,
   the allocated buffer is shared by incrementing the reference counter. If any
   of the objects that share the same string buffer are modified, they receive
   their own copy of the buffer and decrement the reference counter of the
   previously co-used buffer.
3. A UnicodeString can be constructed (or set with a setTo() function) so that
   it aliases a readonly buffer instead of copying the characters.
   In this
   case, the string object uses this aliased buffer for as long as the object
   is not modified and it will never attempt to modify or release the buffer.
   This model has copy-on-write semantics. For example, when the string object
   is modified, the buffer contents are first copied into writable memory
   (inside the object for short strings or the allocated buffer for longer
   strings). When a UnicodeString with a readonly setting is copied to another
   UnicodeString using the fastCopyFrom() function, then both string objects
   share the same readonly setting and point to the same storage. Copying a
   string with the normal assignment operator or copy constructor will copy the
   buffer. This prevents accidental misuse of readonly-aliased strings. (This
   is new in ICU 2.4; earlier, the assignment operator and copy constructor
   behaved like the new fastCopyFrom() does now.)
   **Important:**
   1. The aliased buffer must remain valid for as long as any UnicodeString
      object aliases it. This includes unmodified fastCopyFrom() and
      `movedFrom()` copies of the object (including moves via the move
      constructor and move assignment operator), and when the compiler uses
      Return Value Optimization (RVO) where a function returns a UnicodeString
      by value.
   2. Be prepared that return-by-value may either make a copy (which does not
      preserve aliasing), or move the value or use RVO (which do preserve
      aliasing).
   3. It is an error to readonly-alias temporary buffers and then pass the
      resulting UnicodeString objects (or references/pointers to them) to APIs
      that store them for longer than the buffers are valid.
   4. If it is necessary to make sure that a string is not a readonly alias,
      then use any modifying function without actually changing the contents
      (for example, s.setCharAt(0, s.charAt(0))).
   5. In ICU 2.4 and later, a simple assignment or copy construction will also
      copy the buffer.
4. A UnicodeString can be constructed (or set with a setTo() function) so that
   it aliases a writable buffer instead of copying the characters. The
   difference from the above is that the string object writes through to this
   aliased buffer for write operations. A new buffer is allocated and the
   contents are copied only when the capacity of the buffer is not sufficient.
   An efficient way to get the string contents into the original buffer is to
   use the `extract(..., UChar *dst, ...)` function.
   The `extract(..., UChar *dst, ...)` function copies the string contents only
   if the dst buffer is different from the buffer of the string object itself.
   If a string grows and shrinks during a sequence of operations, then it will
   not use the same buffer, even if the string would fit. When a UnicodeString
   with a writable alias is assigned to another UnicodeString, the contents are
   always copied.
   The destination string will not point to the buffer that the source string
   aliases. However, a move constructor, move assignment operator, and
   Return Value Optimization (RVO) do preserve aliasing.

In general, UnicodeString objects have "copy-on-write" semantics. Several
objects may share the same string buffer, but a modification only affects the
object that is modified itself. This is achieved by copying the string contents
if they are not owned exclusively by this one object. Only after that is the
object modified.

Even though it is fairly efficient to copy UnicodeString objects, it is even
more efficient, if possible, to work with references or pointers. Functions that
output strings can be faster by appending their results to a UnicodeString that
is passed in by reference, compared with returning a UnicodeString object or
just setting the local results alone into a string reference.

> :point_right: **Note**: *UnicodeStrings can be copied in a thread-safe manner by just using their
standard copy constructors and assignment operators.
fastCopyFrom() is also
thread-safe, but if the original string is a readonly alias, then the copy
shares the same aliased buffer.*

## Using UTF-8 strings with ICU

As mentioned in the overview of this chapter, ICU and most other
Unicode-supporting software use 16-bit Unicode for internal processing.
However, there are circumstances where UTF-8 is used instead. This is usually
the case for software that does little or no processing of non-ASCII characters,
and/or for APIs that predate Unicode, use byte-based strings, and cannot be
changed or replaced for various reasons.

A common perception is that UTF-8 has an advantage because it was designed for
compatibility with byte-based, ASCII-based systems, although it was designed for
string storage (of Unicode characters in Unix file names) rather than for
processing performance.

While ICU mostly does not natively use UTF-8 strings, there are many ways to
work with UTF-8 strings and ICU. For more information see the newer
[UTF-8](utf-8.md) subpage.

## Using UTF-32 strings with ICU

It is even rarer to use UTF-32 for string processing than UTF-8.
While 32-bit
Unicode is convenient because it is the only fixed-width UTF, there are few or
no legacy systems with 32-bit string processing that would benefit from a
compatible format, and the memory bandwidth requirements of UTF-32 diminish the
performance and handling advantage of the fixed-width format.

Over time, the wchar_t type of some C/C++ compilers became a 32-bit integer, and
some C libraries do use it for Unicode processing. However, application software
with good Unicode support tends to have little use for the rudimentary Unicode
and Internationalization support of the standard C/C++ libraries and often uses
custom types (like ICU's) and UTF-16 or UTF-8.

For those systems where 32-bit Unicode strings are used, ICU offers some
convenience functions.

1. Conversion of whole strings: u_strFromUTF32() and u_strToUTF32() in
   ustring.h.

2. Access to code points is trivial and does not require any macros.

3. Using a UTF-32 converter with all of the ICU conversion APIs in ucnv.h,
   including ones with an "Algorithmic" suffix.

4. UnicodeString has `fromUTF32()` and `toUTF32()` methods.

5. For conversion directly between UTF-32 and another charset use
   ucnv_convertEx().
   However, since ICU converters work with byte streams in
   external charsets on the non-"Unicode" side, the UTF-32 string will be
   treated as a byte stream (UTF-32 Character Encoding *Scheme*) rather than a
   sequence of 32-bit code units (UTF-32 Character Encoding *Form*). The
   correct converter must be used: UTF-32BE or UTF-32LE according to the
   platform endianness (U_IS_BIG_ENDIAN). Treating the string like a byte
   stream also makes a difference in data types (`char *`), lengths and indexes
   (counting bytes), and NUL-termination handling (input NUL-termination not
   possible, output writes only a NUL byte, not a NUL 32-bit code unit). For
   the difference between internal encoding forms and external encoding schemes
   see the Unicode Standard.

6. Some ICU APIs work with a CharacterIterator, a UText or a UCharIterator
   instead of directly with a C/C++ string parameter. There is currently no ICU
   instance of any of these interfaces that reads UTF-32, although an
   application could provide one.

## Changes in ICU 2.0

Beginning with ICU release 2.0, there are a few changes to the ICU string
facilities compared with earlier ICU releases.

Some of the NUL-termination behavior was inconsistent across the ICU API
functions.
In particular, the following functions used to count the terminating
NUL character in their output length (counted one more before ICU 2.0 than now):
ucnv_toUChars, ucnv_fromUChars, uloc_getLanguage, uloc_getCountry,
uloc_getVariant, uloc_getName, uloc_getDisplayLanguage, uloc_getDisplayCountry,
uloc_getDisplayVariant, uloc_getDisplayName

Some functions used to set an overflow error code even when only the terminating
NUL did not fit into the output buffer. These functions now set UErrorCode to
U_STRING_NOT_TERMINATED_WARNING rather than to U_BUFFER_OVERFLOW_ERROR.

The aliasing UnicodeString constructors and most extract functions have existed
for several releases prior to ICU 2.0. There is now an additional extract
function with a UErrorCode parameter. Also, the getBuffer, releaseBuffer and
getCapacity functions are new to ICU 2.0.

For more information about these changes, please consult the old and new API
documentation.