---
layout: default
title: Chars and Strings
nav_order: 600
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Strings

## Overview

This section explains how to handle Unicode strings with ICU in C and C++.

Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/ustring/ustring.cpp).

## Text Access Overview

Strings are the most common and fundamental form of handling text in software.
Logically, and often physically, they contain contiguous arrays (vectors) of
basic units. Most of the ICU API functions work directly with simple strings,
and where possible, this is preferred.

Sometimes, text needs to be accessed via more powerful and complicated methods.
For example, text may be stored in discontiguous chunks in order to deal with
frequent modification (like typing) and large amounts of text, or it may not be
stored in the internal encoding, or it may have associated attributes like bold
or italic styles.

### Guidance

ICU provides multiple text access interfaces which were added over time. If
simple strings cannot be used, then consider the following:

1.  [UText](utext.md): Added in ICU4C 3.4 as a technology preview. Intended to
    be the strategic text access API for use with ICU. C API, high performance,
    writable, supports native indexes for efficient non-UTF-16 text storage. So
    far (3.4) only supported in BreakIterator. Some API changes are anticipated
    for ICU 3.6.

2.  Replaceable (Java & C++) and UReplaceable (C): Writable, designed for use
    with Transliterator.

3.  CharacterIterator (Java JDK & C++): Read-only, used in many APIs. Large
    differences between the JDK and C++ versions.

4.  UCharacterIterator (Java): Back-port of the C++ CharacterIterator to ICU4J
    for support of supplementary code points and post-increment iteration.

5.  UCharIterator (C): Read-only, C interface used mostly in incremental
    normalization and collation.

The following provides some historical perspective and comparison between the
interfaces.

### CharacterIterator

ICU has long provided the CharacterIterator interface for some services. It
allows for abstract text access, but has limitations:

1.  It has a per-character function call overhead.

2.  Originally, it was designed for UCS-2 operation and did not support direct
    handling of supplementary Unicode code points. Such support was later added.

3.  Its pre-increment iteration semantics are uncommon, and are inefficient when
    used with a variable-width encoding form (UTF-16). Functions for
    post-increment iteration were added later.

4.  The C++ version added iteration start/limit boundaries only because the C++
    UnicodeString copies string contents during substringing; the Java
    CharacterIterator does not have these extra boundaries – substringing is
    more efficient in Java.

5.  CharacterIterator is not available for use in C.

6.  CharacterIterator is a read-only interface.

7.  It uses UTF-16 indexes into the text, which is not efficient for other
    encoding forms.

8.  With the additions to the API over time, the number of methods that have to
    be overridden by subclasses has become rather large.

The core Java adopted an early version of CharacterIterator; later
functionality, like support for supplementary code points, was back-ported from
ICU4C to ICU4J to form the UCharacterIterator class.

The UCharIterator C interface was added to allow for incremental normalization
and collation in C. It is entirely code unit (UChar)-oriented, uses only
post-increment iteration and has a smaller number of overridable methods.

### Replaceable

The Replaceable (Java & C++) and UReplaceable (C) interfaces are designed for,
and used in, Transliterator. They are random-access interfaces, not iterators.

### UText

The [UText](utext.md) text access interface was designed as a possible
replacement for all previous interfaces listed above, with additional
functionality. It allows for high-performance operation through the use of
storage-native indexes (for efficient use of non-UTF-16 text) and through
accessing multiple characters per function call. Code point iteration is
available with functions as well as with C macros, for maximum performance.
UText is also writable, mostly patterned after Replaceable. For details see the
UText chapter.

## Strings in ICU

### Strings in Java

In Java, ICU uses the standard String and StringBuffer classes, `char[]`, etc.
See the Java documentation for details.

### Strings in C/C++

Strings in C and C++ are, at the lowest level, arrays of some particular base
type. In most cases, the base type is a char, which is an 8-bit byte in modern
compilers. Some APIs use a "wide character" type wchar_t that is typically 8,
16, or 32 bits wide and upwards compatible with char. C code passes `char *` or
wchar_t pointers to the first element of an array. C++ enables you to create a
class for encapsulating these kinds of character arrays in handy and safe
objects.

The interpretation of the byte or wchar_t values depends on the platform, the
compiler, the signed state of both char and wchar_t, and the width of wchar_t.
These characteristics are not specified in the language standards. With
internationalized text, a char-based encoding often uses multiple chars for most
characters, while a wchar_t is often wide enough to hold exactly one character
code point value. Some APIs, especially in the standard library (stdlib), assume
that wchar_t strings use a fixed-width encoding with exactly one character code
point per wchar_t.

### ICU: 16-bit Unicode strings

In order to take advantage of Unicode with its large character repertoire and
its well-defined properties, there must be types with consistent definitions and
semantics. The Unicode standard defines a default encoding based on 16-bit code
units. ICU supports this by defining UChar as an unsigned 16-bit integer type.
This is the base type for character arrays for strings in ICU.

> :point_right: **Note**: *Endianness is not an issue on this level because the interpretation of an
integer is fixed within any given platform.*

With the UTF-16 encoding form, a single Unicode code point is encoded with
either one or two 16-bit UChar code units (unambiguously). "Supplementary" code
points, which are encoded with pairs of code units, are rare in most texts. The
two code units are called "surrogates", and their unit value ranges are distinct
from each other and from single-unit value ranges. Code should generally be
optimized for the common, single-unit case.

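
The one-or-two-unit rule can be made concrete with a little arithmetic. The following is a minimal sketch in plain C99, not ICU code: the function name `utf16_encode` is hypothetical, and `uint16_t` stands in for UChar. In real ICU code the `U16_APPEND()` macro from utf16.h serves this purpose.

```c
#include <stdint.h>

/*
 * Hand-written illustration (hypothetical helper, not part of ICU):
 * encode one code point as UTF-16. Writes one or two code units to out[]
 * and returns how many were written, or 0 if c is above 0x10FFFF.
 */
static int utf16_encode(uint32_t c, uint16_t out[2]) {
    if (c <= 0xFFFF) {
        /* Common case: the code point fits into a single code unit. */
        out[0] = (uint16_t)c;
        return 1;
    }
    if (c <= 0x10FFFF) {
        /* Supplementary code point: split the remaining 20 bits. */
        c -= 0x10000;
        out[0] = (uint16_t)(0xD800 + (c >> 10));   /* lead surrogate */
        out[1] = (uint16_t)(0xDC00 + (c & 0x3FF)); /* trail surrogate */
        return 2;
    }
    return 0; /* not a Unicode code point */
}
```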
16-bit Unicode strings in internal processing contain sequences of 16-bit code
units that may not always be well-formed UTF-16. ICU treats single, unpaired
surrogates as surrogate code points, i.e., they are returned in per-code point
iteration, they are included in the number of code points of a string, and they
are generally treated much like normal, unassigned code points in most APIs.
Surrogate code points have Unicode properties although they cannot be assigned
an actual character.

ICU string handling functions (including append, substring, etc.) do not
automatically protect against producing malformed UTF-16 strings. Most of the
time, indexes into strings are naturally at code point boundaries because they
result from other functions that always produce such indexes. If necessary, the
user can test for proper boundaries by checking the code unit values, or adjust
arbitrary indexes to code point boundaries by using the C macros
U16_SET_CP_START() and U16_SET_CP_LIMIT() (see utf.h) and the UnicodeString
functions getChar32Start() and getChar32Limit().

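
To illustrate what adjusting an index to a code point boundary means, here is a small sketch in plain C99. The helpers `cp_start` and `cp_limit` are hypothetical stand-ins written for this example, with `uint16_t` in place of UChar; they are not ICU's implementation. ICU code would use the `U16_SET_CP_START()` and `U16_SET_CP_LIMIT()` macros instead.

```c
#include <stdint.h>

/* Surrogate tests on a single code unit (hypothetical helpers). */
static int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/*
 * If s[i] is the trail surrogate of a pair, return the index of its lead
 * surrogate; otherwise return i unchanged. start is the lowest valid index.
 */
static int32_t cp_start(const uint16_t *s, int32_t start, int32_t i) {
    if (i > start && is_trail(s[i]) && is_lead(s[i - 1])) {
        return i - 1;
    }
    return i;
}

/*
 * If the limit index i falls between the two units of a surrogate pair,
 * return the index behind the pair; otherwise return i unchanged.
 */
static int32_t cp_limit(const uint16_t *s, int32_t start, int32_t i,
                        int32_t length) {
    if (i > start && i < length && is_lead(s[i - 1]) && is_trail(s[i])) {
        return i + 1;
    }
    return i;
}
```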
UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and
convenience functions (ustring.h), but only a subset of APIs works with UTF-8
directly as the string encoding form.

**See the [UTF-8](utf-8.md) subpage for details about working with
UTF-8.** Some of the following sections apply to UTF-8 APIs as well, for example
the sections about handling lengths and overflows.

### Separate type for single code points

A Unicode code point is an integer with a value from 0 to 0x10FFFF. ICU 2.4 and
later defines the UChar32 type for single code point values as a signed 32-bit
integer (int32_t). This allows the use of easily testable negative values
as sentinels, to indicate errors, exceptions or "done" conditions. All negative
values and positive values greater than 0x10FFFF are illegal as Unicode code
points.

ICU 2.2 and earlier defined UChar32 depending on the platform: If the compiler's
wchar_t was 32 bits wide, then UChar32 was defined to be the same as wchar_t.
Otherwise, it was defined to be an unsigned 32-bit integer. This means that
UChar32 was either a signed or unsigned integer type depending on the compiler.
This was meant for better interoperability with existing libraries, but was of
little use because ICU does not process 32-bit strings; UChar32 is only used
for single code points. The platform dependence of UChar32 could cause problems
with C++ function overloading.

### Compiler-dependent definitions

The compiler's and the runtime character set's codepage encodings are not
specified by the C/C++ language standards and are usually not a Unicode encoding
form. They typically depend on the settings of the individual system, process,
or thread. Therefore, it is not possible to instantiate a Unicode character or
string variable directly with C/C++ character or string literals. The only safe
way is to use numeric values. This is not an issue for User Interface (UI)
strings that are translated. These UI strings are loaded from a resource bundle,
which is generated from a text file that can be in Unicode or in any other
ICU-provided codepage. The binary form produced by the genrb tool contains
UTF-16 strings that are ready for direct use.

There is a useful exception to this for program-internal strings and test
strings. Within each "family" of character encodings, there is a set of
characters, called invariant characters, that have the same numeric code values.
Such characters include Latin letters, the basic digits, the space, and some
punctuation. Most of the ASCII graphic characters are invariant characters. The
same set, with different but again consistent numeric values, is invariant among
almost all EBCDIC codepages. For details, see
[icu4c/source/common/unicode/utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html).
With strings that contain only these invariant characters, it is possible to
use efficient ICU constructs to write a C/C++ string literal and use it to
initialize Unicode strings.

In some APIs, ICU uses `char *` strings. This is either for file system paths or
for strings that contain invariant characters only (such as locale identifiers).
These strings are in the platform-specific encoding of either ASCII or EBCDIC.
Other codepage differences do not matter for invariant characters, and the
strings are manipulated with C stdlib functions like strcpy().

In some APIs where identifiers are used, ICU uses `char *` strings with invariant
characters. Such strings do not require the full Unicode repertoire and are
easier to handle in C and C++ with `char *` string literals and standard C
library functions. Their useful character repertoire is actually smaller than
the set of graphic ASCII characters; for details, see
[utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html). Examples of
`char *` identifier uses are converter names, locale IDs, and resource bundle
table keys.

There is another, less efficient way to have human-readable Unicode string
literals in C and C++ code. ICU provides a small number of functions that allow
any Unicode characters to be inserted into a string with escape sequences
similar to those used in the C and C++ languages. In addition to the
familiar `\n` and `\xhh` etc., ICU also provides the `\uhhhh` syntax with four
hex digits and the `\Uhhhhhhhh` syntax with eight hex digits for hexadecimal
Unicode code point values. This is very similar to the newer escape sequences
used in Java and defined in the latest C and C++ standards. Since ICU is not a
compiler extension, the "unescaping" is done at runtime and the backslash itself
must be escaped (duplicated) so that the compiler does not attempt to "unescape"
the sequence itself.

## Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string are
always counted in terms of UChar code units, not in terms of UChar32 code
points. (This is the same as in common C library functions that use `char *`
strings with multi-byte encodings.)

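
The difference between the two ways of counting shows up as soon as a string contains a supplementary code point. The sketch below is plain C99 with a hypothetical helper `count_code_points` and `uint16_t` in place of UChar; it is an illustration, not ICU's implementation. ICU provides `u_countChar32()` in ustring.h for this purpose.

```c
#include <stdint.h>

/*
 * Hypothetical helper: count the code points in length code units.
 * A trail surrogate that directly follows a lead surrogate completes a
 * pair and is not counted again; every other unit, including an unpaired
 * surrogate, counts as one code point.
 */
static int32_t count_code_points(const uint16_t *s, int32_t length) {
    int32_t n = 0;
    for (int32_t i = 0; i < length; ++i) {
        if (i > 0 && (s[i] & 0xFC00) == 0xDC00 &&
                (s[i - 1] & 0xFC00) == 0xD800) {
            continue; /* second half of a surrogate pair */
        }
        ++n;
    }
    return n;
}
```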
Often, a user thinks of a "character" as a complete unit in a language, like an
'Ä', while it may be represented with multiple Unicode code points including a
base character and combining marks. (See the Unicode standard for details.) This
often requires users to index and pass strings (UnicodeString or `UChar *`) with
multiple code units or code points. It cannot be done with single-integer
character types. Indexing of such "characters" is done with the BreakIterator
class (in C: ubrk_ functions).

Even with such "higher-level" indexing functions, the actual index values will
be expressed in terms of UChar code units. When more than one code unit is used
at a time, the index value changes by more than one at a time.

ICU uses signed 32-bit integers (int32_t) for lengths and offsets. Because of
internal computations, strings (and arrays in general) are limited to 1G base
units or 2G bytes, whichever is smaller.

## Using C Strings: NUL-Terminated vs. Length Parameters

Strings are either terminated with a NUL character (code point 0, U+0000) or
their length is specified. In the latter case, it is possible to have one or
more NUL characters inside the string.

**Input string** arguments are typically passed with two parameters: The (const)
`UChar *` pointer and an int32_t length argument. If the length is -1 then the
string must be NUL-terminated and the ICU function will call the u_strlen()
function or treat it equivalently. If the input string contains embedded NUL
characters, then the length must be specified.

**Output string** arguments are typically passed with a destination `UChar *`
pointer and an int32_t capacity argument and the function returns the length of
the output as an int32_t. There is also almost always a UErrorCode argument.
Essentially, a `UChar[]` array is passed in with its start and the number of
available UChars. The array is filled with the output and if space permits the
output will be NUL-terminated. The length of the output string is returned. In
all cases the length of the output string does not include the terminating NUL.
This is the same behavior found in most ICU and non-ICU string APIs, for example
u_strlen(). The output string may **contain** NUL characters as part of its
actual contents, depending on the input and the operation. Note that the
UErrorCode parameter is used to indicate both errors and warnings (non-errors).
The following describes some of the situations in which the UErrorCode will be
set to a non-zero value:

1.  If the output length is greater than the output array capacity, then the
    UErrorCode will be set to U_BUFFER_OVERFLOW_ERROR and the contents of the
    output array are undefined.

2.  If the output length is equal to the capacity, then the output has been
    completely written minus the terminating NUL. This is also indicated by
    setting the UErrorCode to U_STRING_NOT_TERMINATED_WARNING.
    Note that U_STRING_NOT_TERMINATED_WARNING does not indicate failure (it
    passes the U_SUCCESS() macro).
    Note also that it is more reliable to check the output length against the
    capacity, rather than checking for the warning code, because warning codes
    do not cause the early termination of a function and may subsequently be
    overwritten.

3.  If neither of these two conditions applies, the error code will indicate
    success and not a U_STRING_NOT_TERMINATED_WARNING. (If a
    U_STRING_NOT_TERMINATED_WARNING code had been set in the UErrorCode
    parameter before the function call, then it is reset to a U_ZERO_ERROR.)

**Preflighting:** The returned length is always the full output length even if
the output buffer is too small. It is possible to pass in a capacity of 0 (and
an output array pointer of NULL) for "pure preflighting" to determine the
necessary output buffer size. Add one to the returned length to leave room for
the terminating NUL.

Note that, whether the caller intends to "preflight" or not, if the output
length is equal to or greater than the capacity, then the UErrorCode is set to
U_STRING_NOT_TERMINATED_WARNING or U_BUFFER_OVERFLOW_ERROR respectively, as
described above.

However, "pure preflighting" is very expensive because the operation has to be
processed twice: once for calculating the output length, and a second time to
actually generate the output. It is much more efficient to always provide an
output buffer that is expected to be large enough for most cases, and to
reallocate and repeat the operation only when an overflow occurred. (Remember to
reset the UErrorCode to U_ZERO_ERROR before calling the function again.) In
C/C++, the initial output buffer can be a stack buffer. In case of a
reallocation, it may be possible and useful to cache and reuse the new, larger
buffer.

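
The calling pattern recommended above (stack buffer first, reallocate and retry only on overflow) can be sketched as follows. To keep the example self-contained, it does not call ICU: `demo_duplicate` is a hypothetical function that merely follows the output-string convention described in this section, the `DEMO_*` constants are hypothetical stand-ins for `U_ZERO_ERROR`, `U_STRING_NOT_TERMINATED_WARNING` and `U_BUFFER_OVERFLOW_ERROR`, and `uint16_t` stands in for UChar. With a real ICU function, the caller side would look much the same.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for UErrorCode values (not ICU's numbers). */
enum {
    DEMO_OK = 0,
    DEMO_NOT_TERMINATED_WARNING = -1,
    DEMO_BUFFER_OVERFLOW_ERROR = 1
};

/*
 * Hypothetical operation following the convention above: writes every
 * source unit twice, NUL-terminates if space permits, and always returns
 * the full output length (without the NUL).
 */
static int32_t demo_duplicate(const uint16_t *src, int32_t srcLength,
                              uint16_t *dest, int32_t capacity, int *status) {
    int32_t needed = 2 * srcLength;
    for (int32_t i = 0; i < needed && i < capacity; ++i) {
        dest[i] = src[i / 2];
    }
    if (needed < capacity) {
        dest[needed] = 0;
        *status = DEMO_OK;
    } else if (needed == capacity) {
        *status = DEMO_NOT_TERMINATED_WARNING; /* complete, but no NUL */
    } else {
        *status = DEMO_BUFFER_OVERFLOW_ERROR;  /* dest contents undefined */
    }
    return needed;
}

/*
 * Caller: try a stack buffer, fall back to the heap only on overflow.
 * Returns the sum of the output units as a stand-in for "using" the result.
 */
static uint32_t sum_of_duplicated(const uint16_t *src, int32_t srcLength,
                                  int *usedHeap) {
    uint16_t stackBuf[8];
    uint16_t *buf = stackBuf;
    int status = DEMO_OK;
    int32_t length = demo_duplicate(src, srcLength, buf, 8, &status);
    *usedHeap = 0;
    if (status == DEMO_BUFFER_OVERFLOW_ERROR) {
        /* The returned length is the full size; add one for the NUL. */
        buf = malloc(((size_t)length + 1) * sizeof *buf);
        if (buf == NULL) {
            return 0;
        }
        *usedHeap = 1;
        status = DEMO_OK; /* reset before calling the function again */
        length = demo_duplicate(src, srcLength, buf, length + 1, &status);
    }
    /* Use the returned length, not the NUL, to find the end of the output. */
    uint32_t sum = 0;
    for (int32_t i = 0; i < length; ++i) {
        sum += buf[i];
    }
    if (buf != stackBuf) {
        free(buf);
    }
    return sum;
}
```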
> :point_right: **Note**: *The exceptions to these rules are the ANSI-C-style functions like u_strcpy(),
which generally require NUL-terminated strings, forbid embedded NULs, and do not
take capacity arguments for buffer overflow checking.*

## Using Unicode Strings in C

In C, Unicode strings are similar to standard `char *` strings. Unicode strings
are arrays of UChar and most APIs take a `UChar *` pointer to the first element
and an input length and/or output capacity, see above. ICU has a number of
functions that provide the Unicode equivalent of the stdlib functions such as
strcpy(), strstr(), etc. Compared with their C standard counterparts, their
function names begin with u_. Otherwise, their semantics are equivalent. These
functions are defined in icu/source/common/unicode/ustring.h.

### Code Point Access

Sometimes, Unicode code points need to be accessed in C for iteration, movement
forward, or movement backward in a string. A string might also need to be
written from code point values. ICU provides a number of macros for this. They
are defined in icu/source/common/unicode/utf.h and in the utf8.h and utf16.h
headers that it includes (utf.h is in turn included with utypes.h).

Macros for 16-bit Unicode strings have a U16_ prefix. For example:

    U16_NEXT(s, i, length, c)
    U16_PREV(s, start, i, c)
    U16_APPEND(s, i, length, c, isError)

There are also macros with a U_ prefix for code point range checks (e.g., test
for non-character code point), and U8_ macros for 8-bit (UTF-8) strings. See the
header files and the API References for more details.

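
As an illustration of forward code point iteration, the following plain C99 sketch reads one code point at a time and advances the index by one or two code units. The function `utf16_next` is a hypothetical stand-in written for this example (with `uint16_t` for UChar and `int32_t` for UChar32), not the definition of `U16_NEXT()`. It returns a negative sentinel at the end of the string, which works because code point values are held in a signed 32-bit type.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the code point that starts at s[*i] and
 * advance *i behind it. An unpaired surrogate is returned as its own
 * (surrogate) code point. Returns -1 when *i is already at the end.
 */
static int32_t utf16_next(const uint16_t *s, int32_t *i, int32_t length) {
    if (*i >= length) {
        return -1; /* "done" sentinel */
    }
    int32_t c = s[(*i)++];
    if ((c & 0xFC00) == 0xD800 && *i < length &&
            (s[*i] & 0xFC00) == 0xDC00) {
        /* Lead surrogate followed by a trail surrogate: combine them. */
        c = 0x10000 + ((c - 0xD800) << 10) + (s[(*i)++] - 0xDC00);
    }
    return c;
}
```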
#### UTF Macros before ICU 2.4

In ICU 2.4, the utf\*.h macros were revamped, improved, simplified, and
renamed. The old macros continue to be available. They are in utf_old.h,
together with an explanation of the change. utf.h, utf8.h and utf16.h contain
the new macros instead. The new macros are intended to be more consistent, more
useful, and less confusing. Some macros were simply renamed for consistency with
a new naming scheme.

The documentation of the old macros has been removed. If you need it, see a User
Guide version from ICU 4.2 or earlier (see the [download
page](https://icu.unicode.org/download)).

### C Unicode String Literals

There is a pair of macros that together enable users to instantiate a Unicode
string in C (a `UChar []` array) from a C string literal:

    /*
     * In C, we need two macros: one to declare the UChar[] array, and
     * one to populate it; the second one is a noop on platforms where
     * wchar_t is compatible with UChar and ASCII-based.
     * The length of the string literal must be counted for both macros.
     */
    /* declare the invString array for the string */
    U_STRING_DECL(invString, "such characters are safe 123 %-.", 32);
    /* populate it with the characters */
    U_STRING_INIT(invString, "such characters are safe 123 %-.", 32);

With invariant characters, it is also possible to efficiently convert `char *`
strings to and from `UChar *` strings:

    static const char *cs1="such characters are safe 123 %-.";
    static UChar us1[40];
    static char cs2[40];
    u_charsToUChars(cs1, us1, 33); /* include the terminating NUL */
    u_UCharsToChars(us1, cs2, 33);

4122e5b6d6dSopenharmony_ci## Testing for well-formed UTF-16 strings
4132e5b6d6dSopenharmony_ci
4142e5b6d6dSopenharmony_ciIt is sometimes useful to test if a 16-bit Unicode string is well-formed UTF-16,
4152e5b6d6dSopenharmony_cithat is, that it does not contain unpaired surrogate code units. For a boolean
4162e5b6d6dSopenharmony_citest, call a function like u_strToUTF8() which sets an error code if the input
4172e5b6d6dSopenharmony_cistring is malformed. (Provide a zero-capacity destination buffer and treat the
4182e5b6d6dSopenharmony_cibuffer overflow error as "is well-formed".) If you need to know the position of
4192e5b6d6dSopenharmony_cithe unpaired surrogate, you can iterate through the string with U16_NEXT() and
4202e5b6d6dSopenharmony_ciU_IS_SURROGATE().
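
The position scan described above can be sketched with the standard library alone. This is a minimal sketch, not ICU code: `firstUnpairedSurrogate` is a hypothetical helper name, and the bit-mask tests mirror the lead/trail surrogate ranges that the utf16.h macros check. In code that uses ICU, prefer U16_NEXT() and U_IS_SURROGATE().

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU API): returns the index of the first
// unpaired surrogate code unit in s, or -1 if s is well-formed UTF-16.
long firstUnpairedSurrogate(const std::u16string &s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t c = s[i];
        bool isLead = (c & 0xFC00) == 0xD800;   // U+D800..U+DBFF
        bool isTrail = (c & 0xFC00) == 0xDC00;  // U+DC00..U+DFFF
        if (isLead) {
            // A lead surrogate must be followed by a trail surrogate.
            if (i + 1 < s.size() && (s[i + 1] & 0xFC00) == 0xDC00) {
                ++i;  // skip the trail unit of a well-formed pair
            } else {
                return static_cast<long>(i);
            }
        } else if (isTrail) {
            // A trail surrogate without a preceding lead surrogate.
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

A return value of -1 corresponds to the "is well-formed" result of the boolean test; any other value is the code unit index of the offending surrogate.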
4212e5b6d6dSopenharmony_ci
4222e5b6d6dSopenharmony_ci## Using Unicode Strings in C++
4232e5b6d6dSopenharmony_ci
4242e5b6d6dSopenharmony_ci[UnicodeString](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classUnicodeString.html) is
4252e5b6d6dSopenharmony_cia C++ string class that wraps a UChar array and associated bookkeeping. It
4262e5b6d6dSopenharmony_ciprovides a rich set of string handling functions.
4272e5b6d6dSopenharmony_ci
UnicodeString combines elements of both the Java String and StringBuffer
classes. Many UnicodeString functions are named like Java String methods and
work similarly, but they modify the object (UnicodeString is "mutable").
4312e5b6d6dSopenharmony_ci
4322e5b6d6dSopenharmony_ciUnicodeString provides functions for random access and use (insert/append/find
4332e5b6d6dSopenharmony_cietc.) of both code units and code points. For each non-iterative string/code
4342e5b6d6dSopenharmony_cipoint macro in utf.h there is at least one UnicodeString member function. The
4352e5b6d6dSopenharmony_cinames of most of these functions contain "32" to indicate the use of a UChar32.
4362e5b6d6dSopenharmony_ci
4372e5b6d6dSopenharmony_ciCode point and code unit iteration is provided by the
4382e5b6d6dSopenharmony_ci[CharacterIterator](characteriterator.md) abstract class and its subclasses.
4392e5b6d6dSopenharmony_ciThere are concrete iterator implementations for UnicodeString objects and plain
4402e5b6d6dSopenharmony_ci`UChar []` arrays.
4412e5b6d6dSopenharmony_ci
4422e5b6d6dSopenharmony_ciMost UnicodeString constructors and functions do not have a UErrorCode
4432e5b6d6dSopenharmony_ciparameter. Instead, if the construction of a UnicodeString fails, for example
4442e5b6d6dSopenharmony_ciwhen it is constructed from a NULL `UChar *` pointer, then the UnicodeString
4452e5b6d6dSopenharmony_ciobject becomes "bogus". This can be tested with the isBogus() function. A
4462e5b6d6dSopenharmony_ciUnicodeString can be put into the "bogus" state explicitly with the setToBogus()
4472e5b6d6dSopenharmony_cifunction. This is different from an empty string (although a "bogus" string also
4482e5b6d6dSopenharmony_cireturns true from isEmpty()) and may be used equivalently to NULL in `UChar *` C
4492e5b6d6dSopenharmony_ciAPIs (or null references in Java, or NULL values in SQL). A string remains
4502e5b6d6dSopenharmony_ci"bogus" until a non-bogus string value is assigned to it. For complete details
4512e5b6d6dSopenharmony_ciof the behavior of "bogus" strings see the description of the setToBogus()
4522e5b6d6dSopenharmony_cifunction.
4532e5b6d6dSopenharmony_ci
4542e5b6d6dSopenharmony_ciSome APIs work with the
4552e5b6d6dSopenharmony_ci[Replaceable](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classReplaceable.html)
4562e5b6d6dSopenharmony_ciabstract class. It defines a simple interface for random access and text
4572e5b6d6dSopenharmony_cimodification and is useful for operations on text that may have associated
4582e5b6d6dSopenharmony_cimeta-data (e.g., styled text), especially in the Transliterator API.
4592e5b6d6dSopenharmony_ciUnicodeString implements Replaceable.
4602e5b6d6dSopenharmony_ci
4612e5b6d6dSopenharmony_ci### C++ Unicode String Literals
4622e5b6d6dSopenharmony_ci
4632e5b6d6dSopenharmony_ciLike in C, there are macros that enable users to instantiate a UnicodeString
4642e5b6d6dSopenharmony_cifrom a C string literal. One macro requires the length of the string as in the C
4652e5b6d6dSopenharmony_cimacros, the other one implies a strlen().
4662e5b6d6dSopenharmony_ci
    UnicodeString s1=UNICODE_STRING("such characters are safe 123 %-.", 32);
    UnicodeString s2=UNICODE_STRING_SIMPLE("such characters are safe 123 %-.");
4692e5b6d6dSopenharmony_ci
4702e5b6d6dSopenharmony_ciIt is possible to efficiently convert between invariant-character strings and
4712e5b6d6dSopenharmony_ciUnicodeStrings by using constructor, setTo() or extract() overloads that take
4722e5b6d6dSopenharmony_cicodepage data (`const char *`) and specifying an empty string ("") as the
4732e5b6d6dSopenharmony_cicodepage name.
4742e5b6d6dSopenharmony_ci
4752e5b6d6dSopenharmony_ci## Using C++ Strings in C APIs
4762e5b6d6dSopenharmony_ci
4772e5b6d6dSopenharmony_ciThe internal buffer of UnicodeString objects is available for direct handling in
4782e5b6d6dSopenharmony_ciC (or C-style) APIs that take `UChar *` arguments. It is possible but usually not
4792e5b6d6dSopenharmony_cinecessary to copy the string contents with one of the extract functions. The
4802e5b6d6dSopenharmony_cifollowing describes several direct buffer access methods.
4812e5b6d6dSopenharmony_ci
The UnicodeString function getBuffer() const returns a readonly `const UChar *`.
The length of the string is indicated by UnicodeString's length() function.
Generally, UnicodeString does not NUL-terminate the contents of its internal
buffer. However, if the length of the string is less than the capacity of the
buffer, it is possible to check whether a NUL character happens to follow the
contents. The following expression performs that check, where `buffer` is the
pointer returned by getBuffer() const:
`(s.length()<s.getCapacity() && buffer[s.length()]==0)`
4892e5b6d6dSopenharmony_ci
4902e5b6d6dSopenharmony_ciAn easier way to NUL-terminate the buffer and get a `const UChar *` pointer to it
4912e5b6d6dSopenharmony_ciis the getTerminatedBuffer() function. Unlike getBuffer() const,
4922e5b6d6dSopenharmony_cigetTerminatedBuffer() is not a const function because it may have to (reallocate
4932e5b6d6dSopenharmony_ciand) modify the buffer to append a terminating NUL. Therefore, use getBuffer()
4942e5b6d6dSopenharmony_ciconst if you do not need a NUL-terminated buffer.
4952e5b6d6dSopenharmony_ci
4962e5b6d6dSopenharmony_ciThere is also a pair of functions that allow controlled write access to the
4972e5b6d6dSopenharmony_cibuffer of a UnicodeString: `UChar *getBuffer(int32_t minCapacity)` and
4982e5b6d6dSopenharmony_ci`releaseBuffer(int32_t newLength)`. `UChar *getBuffer(int32_t minCapacity)`
4992e5b6d6dSopenharmony_ciprovides a writeable buffer of at least the requested capacity and returns a
5002e5b6d6dSopenharmony_cipointer to it. The actual capacity of the buffer after the
5012e5b6d6dSopenharmony_ci`getBuffer(minCapacity)` call may be larger than the requested capacity and can be
5022e5b6d6dSopenharmony_cidetermined with `getCapacity()`.
5032e5b6d6dSopenharmony_ci
5042e5b6d6dSopenharmony_ciOnce the buffer contents are modified, the buffer must be released with the
5052e5b6d6dSopenharmony_ci`releaseBuffer(int32_t newLength)` function, which sets the new length of the
5062e5b6d6dSopenharmony_ciUnicodeString (newLength=-1 can be passed to determine the length of
5072e5b6d6dSopenharmony_ciNUL-terminated contents like `u_strlen()`).
5082e5b6d6dSopenharmony_ci
Between the `getBuffer(minCapacity)` and `releaseBuffer(newLength)` function calls,
the contents of the UnicodeString are unknown and the object behaves as if it
contained an empty string. A nested `getBuffer(minCapacity)`, `getBuffer() const` or
5122e5b6d6dSopenharmony_ci`getTerminatedBuffer()` will fail (return NULL) and modifications of the string
5132e5b6d6dSopenharmony_civia UnicodeString member functions will have no effect. Copying a string with an
5142e5b6d6dSopenharmony_ci"open buffer" yields an empty copy. The move constructor, move assignment
5152e5b6d6dSopenharmony_cioperator and Return Value Optimization (RVO) transfer the state, including the
5162e5b6d6dSopenharmony_ciopen buffer.
5172e5b6d6dSopenharmony_ci
5182e5b6d6dSopenharmony_ciSee the UnicodeString API documentation for more information.
5192e5b6d6dSopenharmony_ci
5202e5b6d6dSopenharmony_ci## Using C Strings in C++ APIs
5212e5b6d6dSopenharmony_ci
5222e5b6d6dSopenharmony_ciThere are efficient ways to wrap C-style strings in C++ UnicodeString objects
5232e5b6d6dSopenharmony_ciwithout copying the string contents. In order to use C strings in C++ APIs, the
`UChar *` pointer and length need to be wrapped into a UnicodeString. This can be
done efficiently in two ways: with a readonly alias or with a writable alias. The
5262e5b6d6dSopenharmony_ciUnicodeString object that is constructed actually uses the `UChar *` pointer as
5272e5b6d6dSopenharmony_ciits internal buffer pointer instead of allocating a new buffer and copying the
5282e5b6d6dSopenharmony_cistring contents.
5292e5b6d6dSopenharmony_ci
If the original string is a readonly `const UChar *`, then the UnicodeString must
be constructed with a readonly alias. If the original string is a writable
(non-const) `UChar *` and is to be modified (e.g., if the `UChar *` buffer is an
output buffer) then the UnicodeString should be constructed with a writable
alias. For more details see the section "Maximizing Performance with the
5352e5b6d6dSopenharmony_ciUnicodeString Storage Model" and search the unistr.h header file for "alias".
5362e5b6d6dSopenharmony_ci
5372e5b6d6dSopenharmony_ci## Maximizing Performance with the UnicodeString Storage Model
5382e5b6d6dSopenharmony_ci
5392e5b6d6dSopenharmony_ciUnicodeString uses four storage methods to maximize performance and minimize
5402e5b6d6dSopenharmony_cimemory consumption:
5412e5b6d6dSopenharmony_ci
5422e5b6d6dSopenharmony_ci1.  Short strings are normally stored inside the UnicodeString object. The
5432e5b6d6dSopenharmony_ci    object has fields for the "bookkeeping" and a small UChar array. When the
5442e5b6d6dSopenharmony_ci    object is copied, the internal characters are copied into the destination
5452e5b6d6dSopenharmony_ci    object.
5462e5b6d6dSopenharmony_ci2.  Longer strings are normally stored in allocated memory. The allocated UChar
5472e5b6d6dSopenharmony_ci    array is preceded by a reference counter. When the string object is copied,
5482e5b6d6dSopenharmony_ci    the allocated buffer is shared by incrementing the reference counter. If any
5492e5b6d6dSopenharmony_ci    of the objects that share the same string buffer are modified, they receive
5502e5b6d6dSopenharmony_ci    their own copy of the buffer and decrement the reference counter of the
5512e5b6d6dSopenharmony_ci    previously co-used buffer.
5522e5b6d6dSopenharmony_ci3.  A UnicodeString can be constructed (or set with a setTo() function) so that
5532e5b6d6dSopenharmony_ci    it aliases a readonly buffer instead of copying the characters. In this
5542e5b6d6dSopenharmony_ci    case, the string object uses this aliased buffer for as long as the object
5552e5b6d6dSopenharmony_ci    is not modified and it will never attempt to modify or release the buffer.
5562e5b6d6dSopenharmony_ci    This model has copy-on-write semantics. For example, when the string object
5572e5b6d6dSopenharmony_ci    is modified, the buffer contents are first copied into writable memory
5582e5b6d6dSopenharmony_ci    (inside the object for short strings or the allocated buffer for longer
5592e5b6d6dSopenharmony_ci    strings). When a UnicodeString with a readonly setting is copied to another
5602e5b6d6dSopenharmony_ci    UnicodeString using the fastCopyFrom() function, then both string objects
5612e5b6d6dSopenharmony_ci    share the same readonly setting and point to the same storage. Copying a
5622e5b6d6dSopenharmony_ci    string with the normal assignment operator or copy constructor will copy the
5632e5b6d6dSopenharmony_ci    buffer. This prevents accidental misuse of readonly-aliased strings. (This
5642e5b6d6dSopenharmony_ci    is new in ICU 2.4; earlier, the assignment operator and copy constructor
5652e5b6d6dSopenharmony_ci    behaved like the new fastCopyFrom() does now.)
5662e5b6d6dSopenharmony_ci    **Important:**
    1.  The aliased buffer must remain valid for as long as any UnicodeString
        object aliases it. This includes unmodified fastCopyFrom() and
        `movedFrom()` copies of the object (including moves via the move
        constructor and move assignment operator), and when the compiler uses
        Return Value Optimization (RVO) where a function returns a UnicodeString
        by value.
    2.  Be prepared that return-by-value may either make a copy (which does not
        preserve aliasing), or move the value or use RVO (which do preserve
        aliasing).
5762e5b6d6dSopenharmony_ci    3.  It is an error to readonly-alias temporary buffers and then pass the
5772e5b6d6dSopenharmony_ci        resulting UnicodeString objects (or references/pointers to them) to APIs
5782e5b6d6dSopenharmony_ci        that store them for longer than the buffers are valid.
5792e5b6d6dSopenharmony_ci    4.  If it is necessary to make sure that a string is not a readonly alias,
5802e5b6d6dSopenharmony_ci        then use any modifying function without actually changing the contents
5812e5b6d6dSopenharmony_ci        (for example, s.setCharAt(0, s.charAt(0))).
5822e5b6d6dSopenharmony_ci    5.  In ICU 2.4 and later, a simple assignment or copy construction will also
5832e5b6d6dSopenharmony_ci        copy the buffer.
5842e5b6d6dSopenharmony_ci4.  A UnicodeString can be constructed (or set with a setTo() function) so that
5852e5b6d6dSopenharmony_ci    it aliases a writable buffer instead of copying the characters. The
5862e5b6d6dSopenharmony_ci    difference from the above is that the string object writes through to this
5872e5b6d6dSopenharmony_ci    aliased buffer for write operations. A new buffer is allocated and the
5882e5b6d6dSopenharmony_ci    contents are copied only when the capacity of the buffer is not sufficient.
    An efficient way to get the string contents into the original buffer is to
    use the `extract(..., UChar *dst, ...)` function, which copies the string
    contents only if the dst buffer is different from the buffer of the string
    object itself. If a string grows and shrinks during a sequence of
    operations, then it will not use the same buffer, even if the string would
    fit. When a UnicodeString with a writable alias is assigned to another
    UnicodeString, the contents are always copied. The destination string will
    not point to the buffer that the source string aliases. However, a move
    constructor, move assignment operator, and Return Value Optimization (RVO)
    do preserve aliasing.
5992e5b6d6dSopenharmony_ci
6002e5b6d6dSopenharmony_ciIn general, UnicodeString objects have "copy-on-write" semantics. Several
6012e5b6d6dSopenharmony_ciobjects may share the same string buffer, but a modification only affects the
object that is modified itself. This is achieved by copying the string contents
if they are not owned exclusively by this one object. Only after that is the object
6042e5b6d6dSopenharmony_cimodified.
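
The copy-on-write behavior can be illustrated with a toy model. The sketch below is not how UnicodeString is implemented: `CowString` and its member names are hypothetical, it uses `std::shared_ptr` in place of ICU's own reference counter, and it ignores thread safety. It only shows the idea that copies share a buffer until one of them is modified.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Toy model only: a hypothetical CowString, not ICU's UnicodeString.
class CowString {
public:
    explicit CowString(std::u16string text)
        : buf_(std::make_shared<std::u16string>(std::move(text))) {}

    // Copying shares the buffer; only the reference count changes.
    CowString(const CowString &) = default;
    CowString &operator=(const CowString &) = default;

    char16_t charAt(std::size_t i) const { return (*buf_)[i]; }

    void setCharAt(std::size_t i, char16_t c) {
        assert(i < buf_->size());
        // Detach first if another object co-uses the buffer, so that the
        // modification affects only this object.
        if (buf_.use_count() > 1) {
            buf_ = std::make_shared<std::u16string>(*buf_);
        }
        (*buf_)[i] = c;
    }

    // True while both objects still point at the same shared buffer.
    bool sharesBufferWith(const CowString &other) const {
        return buf_ == other.buf_;
    }

private:
    std::shared_ptr<std::u16string> buf_;
};
```

In this model, copying `a` into `b` leaves both objects on one buffer, and the first `b.setCharAt(...)` gives `b` its own copy while `a` keeps the original contents.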
6052e5b6d6dSopenharmony_ci
6062e5b6d6dSopenharmony_ciEven though it is fairly efficient to copy UnicodeString objects, it is even
6072e5b6d6dSopenharmony_cimore efficient, if possible, to work with references or pointers. Functions that
6082e5b6d6dSopenharmony_cioutput strings can be faster by appending their results to a UnicodeString that
6092e5b6d6dSopenharmony_ciis passed in by reference, compared with returning a UnicodeString object or
6102e5b6d6dSopenharmony_cijust setting the local results alone into a string reference.
6112e5b6d6dSopenharmony_ci
6122e5b6d6dSopenharmony_ci> :point_right: **Note**: *UnicodeStrings can be copied in a thread-safe manner by just using their
6132e5b6d6dSopenharmony_cistandard copy constructors and assignment operators. fastCopyFrom() is also
6142e5b6d6dSopenharmony_cithread-safe, but if the original string is a readonly alias, then the copy
6152e5b6d6dSopenharmony_cishares the same aliased buffer.*
6162e5b6d6dSopenharmony_ci
6172e5b6d6dSopenharmony_ci## Using UTF-8 strings with ICU
6182e5b6d6dSopenharmony_ci
As mentioned in the overview of this chapter, ICU and most other
Unicode-supporting software use 16-bit Unicode for internal processing.
6212e5b6d6dSopenharmony_ciHowever, there are circumstances where UTF-8 is used instead. This is usually
6222e5b6d6dSopenharmony_cithe case for software that does little or no processing of non-ASCII characters,
6232e5b6d6dSopenharmony_ciand/or for APIs that predate Unicode, use byte-based strings, and cannot be
6242e5b6d6dSopenharmony_cichanged or replaced for various reasons.
6252e5b6d6dSopenharmony_ci
6262e5b6d6dSopenharmony_ciA common perception is that UTF-8 has an advantage because it was designed for
6272e5b6d6dSopenharmony_cicompatibility with byte-based, ASCII-based systems, although it was designed for
6282e5b6d6dSopenharmony_cistring storage (of Unicode characters in Unix file names) rather than for
6292e5b6d6dSopenharmony_ciprocessing performance.
6302e5b6d6dSopenharmony_ci
6312e5b6d6dSopenharmony_ciWhile ICU mostly does not natively use UTF-8 strings, there are many ways to
6322e5b6d6dSopenharmony_ciwork with UTF-8 strings and ICU. For more information see the newer
6332e5b6d6dSopenharmony_ci[UTF-8](utf-8.md) subpage.
6342e5b6d6dSopenharmony_ci
6352e5b6d6dSopenharmony_ci## Using UTF-32 strings with ICU
6362e5b6d6dSopenharmony_ci
6372e5b6d6dSopenharmony_ciIt is even rarer to use UTF-32 for string processing than UTF-8. While 32-bit
6382e5b6d6dSopenharmony_ciUnicode is convenient because it is the only fixed-width UTF, there are few or
6392e5b6d6dSopenharmony_cino legacy systems with 32-bit string processing that would benefit from a
6402e5b6d6dSopenharmony_cicompatible format, and the memory bandwidth requirements of UTF-32 diminish the
6412e5b6d6dSopenharmony_ciperformance and handling advantage of the fixed-width format.
6422e5b6d6dSopenharmony_ci
6432e5b6d6dSopenharmony_ciOver time, the wchar_t type of some C/C++ compilers became a 32-bit integer, and
6442e5b6d6dSopenharmony_cisome C libraries do use it for Unicode processing. However, application software
6452e5b6d6dSopenharmony_ciwith good Unicode support tends to have little use for the rudimentary Unicode
6462e5b6d6dSopenharmony_ciand Internationalization support of the standard C/C++ libraries and often uses
6472e5b6d6dSopenharmony_cicustom types (like ICU's) and UTF-16 or UTF-8.
6482e5b6d6dSopenharmony_ci
6492e5b6d6dSopenharmony_ciFor those systems where 32-bit Unicode strings are used, ICU offers some
6502e5b6d6dSopenharmony_ciconvenience functions.
6512e5b6d6dSopenharmony_ci
1.  Conversion of whole strings: u_strFromUTF32() and u_strToUTF32() in
    ustring.h.
6542e5b6d6dSopenharmony_ci
6552e5b6d6dSopenharmony_ci2.  Access to code points is trivial and does not require any macros.
6562e5b6d6dSopenharmony_ci
6572e5b6d6dSopenharmony_ci3.  Using a UTF-32 converter with all of the ICU conversion APIs in ucnv.h,
6582e5b6d6dSopenharmony_ci    including ones with an "Algorithmic" suffix.
6592e5b6d6dSopenharmony_ci
6602e5b6d6dSopenharmony_ci4.  UnicodeString has `fromUTF32()` and `toUTF32()` methods.
6612e5b6d6dSopenharmony_ci
6622e5b6d6dSopenharmony_ci5.  For conversion directly between UTF-32 and another charset use
6632e5b6d6dSopenharmony_ci    ucnv_convertEx(). However, since ICU converters work with byte streams in
6642e5b6d6dSopenharmony_ci    external charsets on the non-"Unicode" side, the UTF-32 string will be
6652e5b6d6dSopenharmony_ci    treated as a byte stream (UTF-32 Character Encoding *Scheme*) rather than a
6662e5b6d6dSopenharmony_ci    sequence of 32-bit code units (UTF-32 Character Encoding *Form*). The
6672e5b6d6dSopenharmony_ci    correct converter must be used: UTF-32BE or UTF-32LE according to the
6682e5b6d6dSopenharmony_ci    platform endianness (U_IS_BIG_ENDIAN). Treating the string like a byte
6692e5b6d6dSopenharmony_ci    stream also makes a difference in data types (`char *`), lengths and indexes
6702e5b6d6dSopenharmony_ci    (counting bytes), and NUL-termination handling (input NUL-termination not
6712e5b6d6dSopenharmony_ci    possible, output writes only a NUL byte, not a NUL 32-bit code unit). For
6722e5b6d6dSopenharmony_ci    the difference between internal encoding forms and external encoding schemes
6732e5b6d6dSopenharmony_ci    see the Unicode Standard.
6742e5b6d6dSopenharmony_ci
6752e5b6d6dSopenharmony_ci6.  Some ICU APIs work with a CharacterIterator, a UText or a UCharIterator
6762e5b6d6dSopenharmony_ci    instead of directly with a C/C++ string parameter. There is currently no ICU
6772e5b6d6dSopenharmony_ci    instance of any of these interfaces that reads UTF-32, although an
6782e5b6d6dSopenharmony_ci    application could provide one.
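
The difference between the UTF-32 encoding form and the UTF-32BE encoding scheme mentioned in item 5 can be sketched with the standard library alone. `toUtf32BeBytes` is a hypothetical helper, not an ICU function; it only shows how each 32-bit code unit becomes four bytes in big-endian order, which is the kind of byte stream a converter expects on its charset side.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an ICU API): serializes UTF-32 code units
// (encoding form) into UTF-32BE bytes (encoding scheme).
std::vector<std::uint8_t> toUtf32BeBytes(const std::u32string &s) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(s.size() * 4);  // lengths are now counted in bytes
    for (char32_t c : s) {
        // Most significant byte first, independent of platform endianness.
        bytes.push_back(static_cast<std::uint8_t>((c >> 24) & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>((c >> 16) & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>((c >> 8) & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(c & 0xFF));
    }
    return bytes;
}
```

On a big-endian platform the in-memory UTF-32 string already has this byte layout, which is why the converter name has to match the platform endianness when the string is passed as `char *` data without such a serialization step.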
6792e5b6d6dSopenharmony_ci
6802e5b6d6dSopenharmony_ci## Changes in ICU 2.0
6812e5b6d6dSopenharmony_ci
6822e5b6d6dSopenharmony_ciBeginning with ICU release 2.0, there are a few changes to the ICU string
6832e5b6d6dSopenharmony_cifacilities compared with earlier ICU releases.
6842e5b6d6dSopenharmony_ci
6852e5b6d6dSopenharmony_ciSome of the NUL-termination behavior was inconsistent across the ICU API
6862e5b6d6dSopenharmony_cifunctions. In particular, the following functions used to count the terminating
6872e5b6d6dSopenharmony_ciNUL character in their output length (counted one more before ICU 2.0 than now):
ucnv_toUChars, ucnv_fromUChars, uloc_getLanguage, uloc_getCountry,
uloc_getVariant, uloc_getName, uloc_getDisplayLanguage, uloc_getDisplayCountry,
uloc_getDisplayVariant, uloc_getDisplayName.
6912e5b6d6dSopenharmony_ci
6922e5b6d6dSopenharmony_ciSome functions used to set an overflow error code even when only the terminating
6932e5b6d6dSopenharmony_ciNUL did not fit into the output buffer. These functions now set UErrorCode to
6942e5b6d6dSopenharmony_ciU_STRING_NOT_TERMINATED_WARNING rather than to U_BUFFER_OVERFLOW_ERROR.
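
The resulting convention can be modeled in a few lines. The sketch below is a simplified illustration, not an ICU function: `copyOut` and `CopyStatus` are hypothetical names standing in for a string-output function and its UErrorCode result, and unlike some real functions it writes nothing at all when the contents do not fit.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the relevant UErrorCode values.
enum class CopyStatus { Ok, NotTerminatedWarning, BufferOverflowError };

// Hypothetical helper modeling the convention; not an ICU function.
// Returns the full output length, which never counts the terminating NUL.
std::size_t copyOut(const std::u16string &src, char16_t *dest,
                    std::size_t capacity, CopyStatus &status) {
    std::size_t length = src.size();
    if (length > capacity) {
        // The contents themselves do not fit: a real error.
        status = CopyStatus::BufferOverflowError;
        return length;
    }
    for (std::size_t i = 0; i < length; ++i) {
        dest[i] = src[i];
    }
    if (length < capacity) {
        dest[length] = 0;  // room for the terminating NUL
        status = CopyStatus::Ok;
    } else {
        // Everything fits except the NUL: only a warning.
        status = CopyStatus::NotTerminatedWarning;
    }
    return length;
}
```

With a three-unit string, a capacity of 4 yields a terminated result, a capacity of 3 yields the warning, and a capacity of 2 yields the overflow error; the returned length is 3 in all three cases.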
6952e5b6d6dSopenharmony_ci
6962e5b6d6dSopenharmony_ciThe aliasing UnicodeString constructors and most extract functions have existed
6972e5b6d6dSopenharmony_cifor several releases prior to ICU 2.0. There is now an additional extract
6982e5b6d6dSopenharmony_cifunction with a UErrorCode parameter. Also, the getBuffer, releaseBuffer and
6992e5b6d6dSopenharmony_cigetCapacity functions are new to ICU 2.0.
7002e5b6d6dSopenharmony_ci
7012e5b6d6dSopenharmony_ciFor more information about these changes, please consult the old and new API
7022e5b6d6dSopenharmony_cidocumentation.
703