---
layout: default
title: Chars and Strings
nav_order: 600
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Strings

## Overview

This section explains how to handle Unicode strings with ICU in C and C++.

Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/ustring/ustring.cpp).

## Text Access Overview

Strings are the most common and fundamental form of handling text in software.
Logically, and often physically, they contain contiguous arrays (vectors) of
basic units. Most of the ICU API functions work directly with simple strings,
and where possible, this is preferred.

Sometimes, text needs to be accessed via more powerful and complicated methods.
For example, text may be stored in discontiguous chunks in order to deal with
frequent modification (like typing) and large amounts of text, or it may not be
stored in the internal encoding, or it may have associated attributes like bold
or italic styles.

### Guidance

ICU provides multiple text access interfaces that were added over time. If
simple strings cannot be used, then consider the following:

1.  [UText](utext.md): Added in ICU4C 3.4 as a technology preview. Intended to
    be the strategic text access API for use with ICU. C API, high performance,
    writable, supports native indexes for efficient non-UTF-16 text storage. So
    far (3.4) only supported in BreakIterator. Some API changes are anticipated
    for ICU 3.6.

2.  Replaceable (Java & C++) and UReplaceable (C): Writable, designed for use
    with Transliterator.

3.  CharacterIterator (Java JDK & C++): Read-only, used in many APIs. Large
    differences between the JDK and C++ versions.

4.  UCharacterIterator (Java): Back-port of the C++ CharacterIterator to ICU4J
    for support of supplementary code points and post-increment iteration.

5.  UCharIterator (C): Read-only, C interface used mostly in incremental
    normalization and collation.

The following provides some historical perspective and comparison between the
interfaces.

### CharacterIterator

ICU has long provided the CharacterIterator interface for some services. It
allows for abstract text access, but has limitations:

1.  It has a per-character function call overhead.

2.  Originally, it was designed for UCS-2 operation and did not support direct
    handling of supplementary Unicode code points. Such support was later added.

3.  Its pre-increment iteration semantics are uncommon, and are inefficient when
    used with a variable-width encoding form (UTF-16). Functions for
    post-increment iteration were added later.

4.  The C++ version added iteration start/limit boundaries only because the C++
    UnicodeString copies string contents during substringing; the Java
    CharacterIterator does not have these extra boundaries because substringing
    is more efficient in Java.

5.  CharacterIterator is not available for use in C.

6.  CharacterIterator is a read-only interface.

7.  It uses UTF-16 indexes into the text, which is not efficient for other
    encoding forms.

8.  With the additions to the API over time, the number of methods that have to
    be overridden by subclasses has become rather large.

Core Java adopted an early version of CharacterIterator; later
functionality, like support for supplementary code points, was back-ported from
ICU4C to ICU4J to form the UCharacterIterator class.

The UCharIterator C interface was added to allow for incremental normalization
and collation in C. It is entirely code unit (UChar)-oriented, uses only
post-increment iteration and has a smaller number of overridable methods.

### Replaceable

The Replaceable (Java & C++) and UReplaceable (C) interfaces are designed for,
and used in, Transliterator. They are random-access interfaces, not iterators.

### UText

The [UText](utext.md) text access interface was designed as a possible
replacement for all previous interfaces listed above, with additional
functionality. It allows for high-performance operation through the use of
storage-native indexes (for efficient use of non-UTF-16 text) and through
accessing multiple characters per function call. Code point iteration is
available with functions as well as with C macros, for maximum performance.
UText is also writable, mostly patterned after Replaceable. For details see the
[UText](utext.md) chapter.

## Strings in ICU

### Strings in Java

In Java, ICU uses the standard String and StringBuffer classes, `char[]`, etc.
See the Java documentation for details.

### Strings in C/C++

Strings in C and C++ are, at the lowest level, arrays of some particular base
type. In most cases, the base type is a char, which is an 8-bit byte in modern
compilers. Some APIs use a "wide character" type wchar_t that is typically 8,
16, or 32 bits wide and upwards compatible with char. C code passes `char *` or
`wchar_t *` pointers to the first element of an array. C++ enables you to create
a class for encapsulating these kinds of character arrays in handy and safe
objects.

The interpretation of the byte or wchar_t values depends on the platform, the
compiler, the signed state of both char and wchar_t, and the width of wchar_t.
These characteristics are not specified in the language standards. When using
internationalized text, the encoding often uses multiple chars for most
characters and a wchar_t that is wide enough to hold exactly one character code
point value each. Some APIs, especially in the standard library (stdlib), assume
that wchar_t strings use a fixed-width encoding with exactly one character code
point per wchar_t.

### ICU: 16-bit Unicode strings

In order to take advantage of Unicode with its large character repertoire and
its well-defined properties, there must be types with consistent definitions and
semantics. The Unicode standard defines a default encoding based on 16-bit code
units. This is supported in ICU by the definition of the UChar to be an unsigned
16-bit integer type. This is the base type for character arrays for strings in
ICU.

> :point_right: **Note**: *Endianness is not an issue on this level because the interpretation of an
> integer is fixed within any given platform.*

With the UTF-16 encoding form, a single Unicode code point is encoded with
either one or two 16-bit UChar code units (unambiguously). "Supplementary" code
points, which are encoded with pairs of code units, are rare in most texts. The
two code units are called "surrogates", and their unit value ranges are distinct
from each other and from single-unit value ranges. Code should be generally
optimized for the common, single-unit case.

16-bit Unicode strings in internal processing contain sequences of 16-bit code
units that may not always be well-formed UTF-16. ICU treats single, unpaired
surrogates as surrogate code points, i.e., they are returned in per-code point
iteration, they are included in the number of code points of a string, and they
are generally treated much like normal, unassigned code points in most APIs.
Surrogate code points have Unicode properties although they cannot be assigned
an actual character.
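
As a rough illustration of the pair arithmetic described above, the following
sketch decodes one code point from an array of 16-bit units. The helper name
`demo_next_code_point` is hypothetical and `uint16_t` stands in for UChar; in
real code, the U16_NEXT() macro from utf16.h does this job. Like ICU, the sketch
returns an unpaired surrogate as its own code point.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not an ICU function: reads one code point starting at
 * s[*i] and advances *i past it. The caller must ensure *i < length.
 */
int32_t demo_next_code_point(const uint16_t *s, int32_t *i, int32_t length) {
    uint16_t unit = s[(*i)++];
    /* A lead surrogate (0xD800..0xDBFF) may be followed by a trail (0xDC00..0xDFFF). */
    if ((unit & 0xFC00) == 0xD800 && *i < length) {
        uint16_t trail = s[*i];
        if ((trail & 0xFC00) == 0xDC00) {
            ++(*i);
            return (((int32_t)unit - 0xD800) << 10) + ((int32_t)trail - 0xDC00) + 0x10000;
        }
    }
    /* Common case: a single-unit code point (or an unpaired surrogate, returned as is). */
    return unit;
}
```

For the units 0xD83D 0xDE00 the function returns 0x1F600 and advances the index
by two; for any other unit it advances by one.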

ICU string handling functions (including append, substring, etc.) do not
automatically protect against producing malformed UTF-16 strings. Most of the
time, indexes into strings are naturally at code point boundaries because they
result from other functions that always produce such indexes. If necessary, the
user can test for proper boundaries by checking the code unit values, or adjust
arbitrary indexes to code point boundaries by using the C macros
U16_SET_CP_START() and U16_SET_CP_LIMIT() (see utf.h) and the UnicodeString
functions getChar32Start() and getChar32Limit().
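
The boundary adjustment can be sketched in plain C as follows. The two helpers
are hypothetical stand-ins that follow the idea behind U16_SET_CP_START() and
U16_SET_CP_LIMIT(); they return the adjusted index instead of modifying it in
place, and `uint16_t` stands in for UChar.

```c
#include <stdint.h>

/* Hypothetical: if i (start <= i < length) points at the trail unit of a
 * surrogate pair, move back to the lead unit. */
int32_t demo_set_cp_start(const uint16_t *s, int32_t start, int32_t i) {
    if (i > start && (s[i] & 0xFC00) == 0xDC00 && (s[i - 1] & 0xFC00) == 0xD800) {
        --i;
    }
    return i;
}

/* Hypothetical: if the limit i (start <= i <= length) falls between the two
 * units of a surrogate pair, move forward past the trail unit. */
int32_t demo_set_cp_limit(const uint16_t *s, int32_t start, int32_t i, int32_t length) {
    if (start < i && i < length &&
        (s[i - 1] & 0xFC00) == 0xD800 && (s[i] & 0xFC00) == 0xDC00) {
        ++i;
    }
    return i;
}
```

Indexes that already sit on a code point boundary are returned unchanged, so the
helpers are safe to apply to any index within the stated ranges.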

UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and
convenience functions (ustring.h), but only a subset of APIs works with UTF-8
directly as string encoding form.

**See the [UTF-8](utf-8.md) subpage for details about working with
UTF-8.** Some of the following sections apply to UTF-8 APIs as well; for example
sections about handling lengths and overflows.

### Separate type for single code points

A Unicode code point is an integer with a value from 0 to 0x10FFFF. ICU 2.4 and
later defines the UChar32 type for single code point values as a 32-bit
signed integer (int32_t). This allows the use of easily testable negative values
as sentinels, to indicate errors, exceptions or "done" conditions. All negative
values and positive values greater than 0x10FFFF are illegal as Unicode code
points.
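
A minimal sketch of why a signed type is convenient here: a single range check
rejects both the negative sentinel and values above 0x10FFFF. The names
`DemoCodePoint`, `DEMO_DONE`, and `demo_is_code_point` are hypothetical and only
stand in for UChar32 and a sentinel value.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int32_t DemoCodePoint;          /* hypothetical stand-in for UChar32 */
#define DEMO_DONE ((DemoCodePoint)-1)   /* hypothetical "done"/error sentinel */

/* Returns true only for values in the Unicode code point range 0..0x10FFFF. */
bool demo_is_code_point(DemoCodePoint c) {
    return 0 <= c && c <= 0x10FFFF;
}
```

A caller can test `c < 0` to detect the sentinel without needing a separate
out-of-band flag.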

ICU 2.2 and earlier defined UChar32 depending on the platform: If the compiler's
wchar_t was 32 bits wide, then UChar32 was defined to be the same as wchar_t.
Otherwise, it was defined to be an unsigned 32-bit integer. This means that
UChar32 was either a signed or unsigned integer type depending on the compiler.
This was meant for better interoperability with existing libraries, but was of
little use because ICU does not process 32-bit strings; UChar32 is only used
for single code points. The platform dependence of UChar32 could cause problems
with C++ function overloading.

### Compiler-dependent definitions

The compiler's and the runtime character set's codepage encodings are not
specified by the C/C++ language standards and are usually not a Unicode encoding
form. They typically depend on the settings of the individual system, process,
or thread. Therefore, it is not possible to instantiate a Unicode character or
string variable directly with C/C++ character or string literals. The only safe
way is to use numeric values. This is not an issue for User Interface (UI)
strings that are translated. These UI strings are loaded from a resource bundle,
which is generated from a text file that can be in Unicode or in any other
ICU-provided codepage. The genrb tool generates a binary form that contains
UTF-16 strings ready for direct use.

There is a useful exception to this for program-internal strings and test
strings. Within each "family" of character encodings, there is a set of
characters that have the same numeric code values. Such characters include Latin
letters, the basic digits, the space, and some punctuation. Most of the ASCII
graphic characters are invariant characters. The same set, with different but
again consistent numeric values, is invariant among almost all EBCDIC codepages.
For details, see
[icu4c/source/common/unicode/utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html).
With strings that contain only these invariant characters, it is possible to
use efficient ICU constructs to write a C/C++ string literal and use it to
initialize Unicode strings.

In some APIs, ICU uses `char *` strings. This is either for file system paths or
for strings that contain invariant characters only (such as locale identifiers).
These strings are in the platform-specific encoding of either ASCII or EBCDIC.
Other codepage differences do not matter for invariant characters, and such
strings can be manipulated with C stdlib functions like strcpy().

In some APIs where identifiers are used, ICU uses `char *` strings with invariant
characters. Such strings do not require the full Unicode repertoire and are
easier to handle in C and C++ with `char *` string literals and standard C
library functions. Their useful character repertoire is actually smaller than
the set of graphic ASCII characters; for details, see
[utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html). Examples of
`char *` identifier uses are converter names, locale IDs, and resource bundle
table keys.

There is another, less efficient way to have human-readable Unicode string
literals in C and C++ code. ICU provides a small number of functions that allow
any Unicode characters to be inserted into a string with escape sequences
similar to the ones used in the C and C++ languages. In addition to the
familiar `\n` and `\xhh` etc., ICU also provides the `\uhhhh` syntax with four hex
digits and the `\Uhhhhhhhh` syntax with eight hex digits for hexadecimal Unicode
code point values. This is very similar to the newer escape sequences used in
Java and defined in the latest C and C++ standards. Since ICU is not a compiler
extension, the "unescaping" is done at runtime and the backslash itself must be
escaped (duplicated) so that the compiler does not attempt to "unescape" the
sequence itself.
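
To make the runtime "unescaping" concrete, here is a small self-contained
sketch that handles only the `\uhhhh` and `\Uhhhhhhhh` forms. It is not ICU's
implementation: in ICU the real entry points are u_unescape() in ustring.h and
UnicodeString::unescape(). The names `demo_hex_value` and `demo_unescape` are
hypothetical, `uint16_t` stands in for UChar, and the source is assumed to be
ASCII.

```c
#include <stdint.h>

/* Returns the value of one hex digit, or -1 if ch is not a hex digit. */
int demo_hex_value(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: expands \uhhhh and \Uhhhhhhhh into UTF-16 code units and
 * copies all other characters as they are. Returns the number of units written,
 * or -1 for a malformed escape, an out-of-range value, or too little capacity.
 */
int32_t demo_unescape(const char *src, uint16_t *dest, int32_t capacity) {
    int32_t n = 0;
    while (*src != '\0') {
        uint32_t c = (unsigned char)*src++;
        if (c == '\\' && (*src == 'u' || *src == 'U')) {
            int digits = (*src++ == 'u') ? 4 : 8;
            c = 0;
            for (; digits > 0; --digits, ++src) {
                int v = demo_hex_value(*src);
                if (v < 0) return -1;           /* also catches a premature NUL */
                c = (c << 4) | (uint32_t)v;
            }
            if (c > 0x10FFFF) return -1;
        }
        if (c >= 0x10000) {                     /* supplementary: surrogate pair */
            if (n + 2 > capacity) return -1;
            c -= 0x10000;
            dest[n++] = (uint16_t)(0xD800 + (c >> 10));
            dest[n++] = (uint16_t)(0xDC00 + (c & 0x3FF));
        } else {
            if (n + 1 > capacity) return -1;
            dest[n++] = (uint16_t)c;
        }
    }
    return n;
}
```

Note the doubled backslash in the caller's source code, for example
`demo_unescape("A\\u00E4", buffer, 8)`: the compiler turns `\\` into a single
backslash, and the function then interprets `\u00E4` at runtime.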

## Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string are
always counted in terms of UChar code units, not in terms of UChar32 code
points. (This is the same as in common C library functions that use `char *`
strings with multi-byte encodings.)

Often, a user thinks of a "character" as a complete unit in a language, like an
'Ä', while it may be represented with multiple Unicode code points including a
base character and combining marks. (See the Unicode standard for details.) This
often requires users to index and pass strings (UnicodeString or `UChar *`) with
multiple code units or code points. It cannot be done with single-integer
character types. Indexing of such "characters" is done with the BreakIterator
class (in C: ubrk_ functions).

Even with such "higher-level" indexing functions, the actual index values will
be expressed in terms of UChar code units. When more than one code unit is used
at a time, the index value changes by more than one at a time.
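
The difference between code unit counts and code point counts can be shown with
a short sketch. The function name `demo_count_code_points` is hypothetical and
`uint16_t` stands in for UChar; ICU offers u_countChar32() in ustring.h for the
same purpose.

```c
#include <stdint.h>

/* Hypothetical helper: counts code points in a UTF-16 array of `length` units.
 * An unpaired surrogate counts as one code point, as in ICU. */
int32_t demo_count_code_points(const uint16_t *s, int32_t length) {
    int32_t count = 0;
    for (int32_t i = 0; i < length; ++count) {
        /* A lead surrogate followed by a trail surrogate is one code point in two units. */
        if ((s[i] & 0xFC00) == 0xD800 && i + 1 < length && (s[i + 1] & 0xFC00) == 0xDC00) {
            i += 2;
        } else {
            i += 1;
        }
    }
    return count;
}
```

For the four units 0x0041 0xD83D 0xDE00 0x0308 the string length is 4, while
the function reports 3 code points. A user would likely perceive even fewer
"characters" because U+0308 is a combining mark.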

ICU uses signed 32-bit integers (int32_t) for lengths and offsets. Because of
internal computations, strings (and arrays in general) are limited to 1G base
units or 2G bytes, whichever is smaller.

## Using C Strings: NUL-Terminated vs. Length Parameters

Strings are either terminated with a NUL character (code point 0, U+0000) or
their length is specified. In the latter case, it is possible to have one or
more NUL characters inside the string.

**Input string** arguments are typically passed with two parameters: The (const)
`UChar *` pointer and an int32_t length argument. If the length is -1 then the
string must be NUL-terminated and the ICU function will call the u_strlen()
function or treat it equivalently. If the input string contains embedded NUL
characters, then the length must be specified.

**Output string** arguments are typically passed with a destination `UChar *`
pointer and an int32_t capacity argument and the function returns the length of
the output as an int32_t. There is also almost always a UErrorCode argument.
Essentially, a `UChar[]` array is passed in with its start and the number of
available UChars. The array is filled with the output and if space permits the
output will be NUL-terminated. The length of the output string is returned. In
all cases the length of the output string does not include the terminating NUL.
This is the same behavior found in most ICU and non-ICU string APIs, for example
u_strlen(). The output string may **contain** NUL characters as part of its
actual contents, depending on the input and the operation.

Note that the UErrorCode parameter is used to indicate both errors and warnings
(non-errors). The following describes some of the situations in which the
UErrorCode will be set to a non-zero value:

1.  If the output length is greater than the output array capacity, then the
    UErrorCode will be set to U_BUFFER_OVERFLOW_ERROR and the contents of the
    output array are undefined.

2.  If the output length is equal to the capacity, then the output has been
    completely written minus the terminating NUL. This is also indicated by
    setting the UErrorCode to U_STRING_NOT_TERMINATED_WARNING.
    Note that U_STRING_NOT_TERMINATED_WARNING does not indicate failure (it
    passes the U_SUCCESS() macro).
    Note also that it is more reliable to check the output length against the
    capacity, rather than checking for the warning code, because warning codes
    do not cause the early termination of a function and may subsequently be
    overwritten.

3.  If neither of these two conditions applies, the error code will indicate
    success and not a U_STRING_NOT_TERMINATED_WARNING. (If a
    U_STRING_NOT_TERMINATED_WARNING code had been set in the UErrorCode
    parameter before the function call, then it is reset to a U_ZERO_ERROR.)

**Preflighting:** The returned length is always the full output length even if
the output buffer is too small. It is possible to pass in a capacity of 0 (and
an output array pointer of NULL) for "pure preflighting" to determine the
necessary output buffer size. Add one to make the output string NUL-terminated.

Note that, whether the caller intends to "preflight" or not, if the output
length is equal to or greater than the capacity, then the UErrorCode is set to
U_STRING_NOT_TERMINATED_WARNING or U_BUFFER_OVERFLOW_ERROR respectively, as
described above.

However, "pure preflighting" is very expensive because the operation has to be
processed twice: once for calculating the output length, and a second time to
actually generate the output. It is much more efficient to always provide an
output buffer that is expected to be large enough for most cases, and to
reallocate and repeat the operation only when an overflow occurred. (Remember to
reset the UErrorCode to U_ZERO_ERROR before calling the function again.) In
C/C++, the initial output buffer can be a stack buffer. In case of a
reallocation, it may be possible and useful to cache and reuse the new, larger
buffer.
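
The following sketch shows both sides of this convention without depending on
ICU. Everything in it is hypothetical: `DemoStatus` stands in for UErrorCode,
`demo_to_upper` is a toy operation (ASCII-only uppercasing) that follows the
length, capacity, and termination rules described above, and
`demo_to_upper_auto` is a caller that tries a caller-supplied (for example
stack) buffer first and allocates only when that buffer is too small.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for UErrorCode values: warnings < 0, errors > 0. */
typedef enum {
    DEMO_NOT_TERMINATED_WARNING = -1,
    DEMO_ZERO_ERROR = 0,
    DEMO_BUFFER_OVERFLOW_ERROR = 1
} DemoStatus;

/* Toy operation: copies src to dest, mapping ASCII a-z to A-Z. A length of -1
 * means src is NUL-terminated. Always returns the full output length. */
int32_t demo_to_upper(const uint16_t *src, int32_t length,
                      uint16_t *dest, int32_t capacity, DemoStatus *status) {
    if (*status > DEMO_ZERO_ERROR) return 0;      /* earlier failure: do nothing */
    if (length < 0) {                             /* count up to the terminating NUL */
        for (length = 0; src[length] != 0; ++length) {}
    }
    for (int32_t i = 0; i < length && i < capacity; ++i) {
        uint16_t u = src[i];
        dest[i] = (u >= 0x61 && u <= 0x7A) ? (uint16_t)(u - 0x20) : u;
    }
    if (length < capacity) {
        dest[length] = 0;                         /* room for the terminating NUL */
        *status = DEMO_ZERO_ERROR;
    } else if (length == capacity) {
        *status = DEMO_NOT_TERMINATED_WARNING;    /* complete, but not terminated */
    } else {
        *status = DEMO_BUFFER_OVERFLOW_ERROR;     /* length is still reported */
    }
    return length;
}

/* Caller side: returns a NUL-terminated result, either in `buffer` itself or in
 * a heap block that the caller must free(). Returns NULL if allocation fails. */
uint16_t *demo_to_upper_auto(const uint16_t *src, uint16_t *buffer,
                             int32_t bufferCapacity, int32_t *resultLength) {
    DemoStatus status = DEMO_ZERO_ERROR;
    int32_t n = demo_to_upper(src, -1, buffer, bufferCapacity, &status);
    *resultLength = n;
    if (n < bufferCapacity) {
        return buffer;                            /* fits, including the NUL */
    }
    /* Check the length rather than the status, then retry with n + 1 units. */
    uint16_t *heap = malloc(((size_t)n + 1) * sizeof(uint16_t));
    if (heap == NULL) return NULL;
    status = DEMO_ZERO_ERROR;                     /* reset before calling again */
    demo_to_upper(src, -1, heap, n + 1, &status);
    return heap;
}
```

Calling `demo_to_upper(src, -1, NULL, 0, &status)` is the "pure preflighting"
case: nothing is written, the overflow status is set, and the returned length
tells the caller how many units (plus one for the NUL) to allocate.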

> :point_right: **Note**: *The exceptions to these rules are the ANSI-C-style functions like u_strcpy(),
> which generally require NUL-terminated strings, forbid embedded NULs, and do not
> take capacity arguments for buffer overflow checking.*

## Using Unicode Strings in C

In C, Unicode strings are similar to standard `char *` strings. Unicode strings
are arrays of UChar and most APIs take a `UChar *` pointer to the first element
and an input length and/or output capacity, see above. ICU has a number of
functions that provide the Unicode equivalent of the stdlib functions such as
strcpy(), strstr(), etc. Compared with their C standard counterparts, their
function names begin with u_. Otherwise, their semantics are equivalent. These
functions are defined in icu/source/common/unicode/ustring.h.

### Code Point Access

Sometimes, Unicode code points need to be accessed in C for iteration, movement
forward, or movement backward in a string. A string might also need to be
written from code point values. ICU provides a number of macros that are
defined in the icu/source/common/unicode/utf.h and utf8.h/utf16.h headers that
it includes (utf.h is in turn included with utypes.h).

Macros for 16-bit Unicode strings have a U16_ prefix. For example:

    U16_NEXT(s, i, length, c)
    U16_PREV(s, start, i, c)
    U16_APPEND(s, i, length, c, isError)

There are also macros with a U_ prefix for code point range checks (e.g., test
for non-character code point), and U8_ macros for 8-bit (UTF-8) strings. See the
header files and the API References for more details.
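
To show what writing a string from code point values involves, here is a plain C
sketch of the encoding step. The function `demo_append` is hypothetical and only
follows the idea behind U16_APPEND(); it reports failure through its return
value instead of an isError argument, and `uint16_t` stands in for UChar.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: writes code point c at s[*i] as one or two UTF-16 units
 * and advances *i. Returns false, writing nothing, if c is not a code point or
 * if there is not enough room before `capacity`. */
bool demo_append(uint16_t *s, int32_t *i, int32_t capacity, int32_t c) {
    if (c >= 0 && c <= 0xFFFF) {
        if (*i >= capacity) return false;
        s[(*i)++] = (uint16_t)c;
        return true;
    }
    if (c > 0xFFFF && c <= 0x10FFFF) {
        if (*i + 1 >= capacity) return false;
        s[(*i)++] = (uint16_t)((c >> 10) + 0xD7C0);    /* lead surrogate */
        s[(*i)++] = (uint16_t)((c & 0x3FF) | 0xDC00);  /* trail surrogate */
        return true;
    }
    return false;
}
```

Appending U+0041 and then U+1F600 to an empty three-unit buffer yields the units
0x0041 0xD83D 0xDE00 and leaves the index at 3; a further append fails because
the buffer is full.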

#### UTF Macros before ICU 2.4

In ICU 2.4, the utf\*.h macros have been revamped, improved, simplified, and
renamed. The old macros continue to be available. They are in utf_old.h,
together with an explanation of the change. utf.h, utf8.h and utf16.h contain
the new macros instead. The new macros are intended to be more consistent, more
useful, and less confusing. Some macros were simply renamed for consistency with
a new naming scheme.

The documentation of the old macros has been removed. If you need it, see a User
Guide version from ICU 4.2 or earlier (see the [download
page](https://icu.unicode.org/download)).

### C Unicode String Literals

There is a pair of macros that together enable users to instantiate a Unicode
string in C (a `UChar []` array) from a C string literal:

    /*
    * In C, we need two macros: one to declare the UChar[] array, and
    * one to populate it; the second one is a noop on platforms where
    * wchar_t is compatible with UChar and ASCII-based.
    * The length of the string literal must be counted for both macros.
    */
    /* declare the invString array for the string */
    U_STRING_DECL(invString, "such characters are safe 123 %-.", 32);
    /* populate it with the characters */
    U_STRING_INIT(invString, "such characters are safe 123 %-.", 32);

With invariant characters, it is also possible to efficiently convert `char *`
strings to and from `UChar *` strings:

    static const char *cs1="such characters are safe 123 %-.";
    static UChar us1[40];
    static char cs2[40];
    u_charsToUChars(cs1, us1, 33); /* include the terminating NUL */
    u_UCharsToChars(us1, cs2, 33);

## Testing for well-formed UTF-16 strings

It is sometimes useful to test if a 16-bit Unicode string is well-formed UTF-16,
that is, that it does not contain unpaired surrogate code units. For a boolean
test, call a function like u_strToUTF8() which sets an error code if the input
string is malformed. (Provide a zero-capacity destination buffer and treat the
buffer overflow error as "is well-formed".) If you need to know the position of
the unpaired surrogate, you can iterate through the string with U16_NEXT() and
U_IS_SURROGATE().
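
The position search can be sketched without ICU as follows. The function name
`demo_find_unpaired_surrogate` is hypothetical and `uint16_t` stands in for
UChar; with ICU, the same loop would typically be written with U16_NEXT() and
U_IS_SURROGATE().

```c
#include <stdint.h>

/* Hypothetical helper: returns the index of the first unpaired surrogate in a
 * UTF-16 array of `length` units, or -1 if the string is well-formed. */
int32_t demo_find_unpaired_surrogate(const uint16_t *s, int32_t length) {
    for (int32_t i = 0; i < length; ++i) {
        uint16_t u = s[i];
        if ((u & 0xF800) != 0xD800) {
            continue;                   /* not a surrogate at all */
        }
        /* A lead surrogate must be followed immediately by a trail surrogate. */
        if ((u & 0x0400) == 0 && i + 1 < length && (s[i + 1] & 0xFC00) == 0xDC00) {
            ++i;                        /* well-formed pair: skip the trail unit */
            continue;
        }
        return i;                       /* lone lead or lone trail surrogate */
    }
    return -1;
}
```

A return value of -1 corresponds to the boolean "is well-formed" result; any
other value is the code unit index of the first offending surrogate.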

## Using Unicode Strings in C++

[UnicodeString](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classUnicodeString.html) is
a C++ string class that wraps a UChar array and associated bookkeeping. It
provides a rich set of string handling functions.

UnicodeString combines elements of both the Java String and StringBuffer
classes. Many UnicodeString functions are named like and work similarly to Java
String methods but modify the object (UnicodeString is "mutable").

UnicodeString provides functions for random access and use (insert/append/find
etc.) of both code units and code points. For each non-iterative string/code
point macro in utf.h there is at least one UnicodeString member function. The
names of most of these functions contain "32" to indicate the use of a UChar32.

Code point and code unit iteration is provided by the
[CharacterIterator](characteriterator.md) abstract class and its subclasses.
There are concrete iterator implementations for UnicodeString objects and plain
`UChar []` arrays.

Most UnicodeString constructors and functions do not have a UErrorCode
parameter. Instead, if the construction of a UnicodeString fails, for example
when it is constructed from a NULL `UChar *` pointer, then the UnicodeString
object becomes "bogus". This can be tested with the isBogus() function. A
UnicodeString can be put into the "bogus" state explicitly with the setToBogus()
function. This is different from an empty string (although a "bogus" string also
returns true from isEmpty()) and may be used equivalently to NULL in `UChar *` C
APIs (or null references in Java, or NULL values in SQL). A string remains
"bogus" until a non-bogus string value is assigned to it. For complete details
of the behavior of "bogus" strings see the description of the setToBogus()
function.

Some APIs work with the
[Replaceable](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classReplaceable.html)
abstract class. It defines a simple interface for random access and text
modification and is useful for operations on text that may have associated
meta-data (e.g., styled text), especially in the Transliterator API.
UnicodeString implements Replaceable.

### C++ Unicode String Literals

As in C, there are macros that enable users to instantiate a UnicodeString
from a C string literal. One macro requires the length of the string as in the C
macros, the other one implies a strlen().

    UnicodeString s1=UNICODE_STRING("such characters are safe 123 %-.", 32);
    UnicodeString s2=UNICODE_STRING_SIMPLE("such characters are safe 123 %-.");

It is possible to efficiently convert between invariant-character strings and
UnicodeStrings by using constructor, setTo() or extract() overloads that take
codepage data (`const char *`) and specifying an empty string ("") as the
codepage name.

## Using C++ Strings in C APIs

The internal buffer of UnicodeString objects is available for direct handling in
C (or C-style) APIs that take `UChar *` arguments. It is possible but usually not
necessary to copy the string contents with one of the extract functions. The
following describes several direct buffer access methods.

The UnicodeString function getBuffer() const returns a readonly const `UChar *`.
The length of the string is indicated by UnicodeString's length() function.
Generally, UnicodeString does not NUL-terminate the contents of its internal
buffer. However, the buffer may happen to be NUL-terminated, which can only be
checked if the length of the string is less than the capacity of the buffer. The
following expression performs this check, where `buffer` is the pointer returned
by getBuffer() const:
`(s.length()<s.getCapacity() && buffer[s.length()]==0)`

An easier way to NUL-terminate the buffer and get a `const UChar *` pointer to it
is the getTerminatedBuffer() function. Unlike getBuffer() const,
getTerminatedBuffer() is not a const function because it may have to (reallocate
and) modify the buffer to append a terminating NUL. Therefore, use getBuffer()
const if you do not need a NUL-terminated buffer.

There is also a pair of functions that allow controlled write access to the
buffer of a UnicodeString: `UChar *getBuffer(int32_t minCapacity)` and
`releaseBuffer(int32_t newLength)`. `UChar *getBuffer(int32_t minCapacity)`
provides a writeable buffer of at least the requested capacity and returns a
pointer to it. The actual capacity of the buffer after the
`getBuffer(minCapacity)` call may be larger than the requested capacity and can be
determined with `getCapacity()`.

Once the buffer contents are modified, the buffer must be released with the
`releaseBuffer(int32_t newLength)` function, which sets the new length of the
UnicodeString (newLength=-1 can be passed to determine the length of
NUL-terminated contents like `u_strlen()`).

Between the `getBuffer(minCapacity)` and `releaseBuffer(newLength)` function calls,
the contents of the UnicodeString are unknown and the object behaves like it
contains an empty string. A nested `getBuffer(minCapacity)`, `getBuffer() const` or
`getTerminatedBuffer()` will fail (return NULL) and modifications of the string
via UnicodeString member functions will have no effect. Copying a string with an
"open buffer" yields an empty copy. The move constructor, move assignment
operator and Return Value Optimization (RVO) transfer the state, including the
open buffer.

See the UnicodeString API documentation for more information.

## Using C Strings in C++ APIs

There are efficient ways to wrap C-style strings in C++ UnicodeString objects
without copying the string contents. In order to use C strings in C++ APIs, the
`UChar *` pointer and length need to be wrapped into a UnicodeString. This can be
done efficiently in two ways: with a readonly alias or with a writable alias. The
UnicodeString object that is constructed actually uses the `UChar *` pointer as
its internal buffer pointer instead of allocating a new buffer and copying the
string contents.

If the original string is a readonly `const UChar *`, then the UnicodeString must
be constructed with a readonly alias. If the original string is a writable
(non-const) `UChar *` and is to be modified (e.g., if the `UChar *` buffer is an
output buffer) then the UnicodeString should be constructed with a writable
alias. For more details see the section "Maximizing Performance with the
UnicodeString Storage Model" and search the unistr.h header file for "alias".

## Maximizing Performance with the UnicodeString Storage Model

UnicodeString uses four storage methods to maximize performance and minimize
memory consumption:

1.  Short strings are normally stored inside the UnicodeString object. The
    object has fields for the "bookkeeping" and a small UChar array. When the
    object is copied, the internal characters are copied into the destination
    object.
2.  Longer strings are normally stored in allocated memory. The allocated UChar
    array is preceded by a reference counter. When the string object is copied,
    the allocated buffer is shared by incrementing the reference counter. If any
    of the objects that share the same string buffer are modified, they receive
    their own copy of the buffer and decrement the reference counter of the
    previously co-used buffer.
3.  A UnicodeString can be constructed (or set with a setTo() function) so that
    it aliases a readonly buffer instead of copying the characters. In this
    case, the string object uses this aliased buffer for as long as the object
    is not modified and it will never attempt to modify or release the buffer.
    This model has copy-on-write semantics. For example, when the string object
    is modified, the buffer contents are first copied into writable memory
    (inside the object for short strings or the allocated buffer for longer
    strings). When a UnicodeString with a readonly setting is copied to another
    UnicodeString using the fastCopyFrom() function, then both string objects
    share the same readonly setting and point to the same storage. Copying a
    string with the normal assignment operator or copy constructor will copy the
    buffer. This prevents accidental misuse of readonly-aliased strings. (This
    is new in ICU 2.4; earlier, the assignment operator and copy constructor
    behaved like the new fastCopyFrom() does now.)
    **Important:**
    1.  The aliased buffer must remain valid for as long as any UnicodeString
        object aliases it. This includes unmodified fastCopyFrom() and
        `movedFrom()` copies of the object (including moves via the move
        constructor and move assignment operator), and when the compiler uses
        Return Value Optimization (RVO) where a function returns a UnicodeString
        by value.
    2.  Be prepared that return-by-value may either make a copy (which does not
        preserve aliasing), or move the value or use RVO (which do preserve
        aliasing).
    3.  It is an error to readonly-alias temporary buffers and then pass the
        resulting UnicodeString objects (or references/pointers to them) to APIs
        that store them for longer than the buffers are valid.
    4.  If it is necessary to make sure that a string is not a readonly alias,
        then use any modifying function without actually changing the contents
        (for example, s.setCharAt(0, s.charAt(0))).
    5.  In ICU 2.4 and later, a simple assignment or copy construction will also
        copy the buffer.
4.  A UnicodeString can be constructed (or set with a setTo() function) so that
    it aliases a writable buffer instead of copying the characters. The
    difference from the above is that the string object writes through to this
    aliased buffer for write operations. A new buffer is allocated and the
    contents are copied only when the capacity of the buffer is not sufficient.
    An efficient way to get the string contents into the original buffer is to
    use the `extract(..., UChar *dst, ...)` function.
    The `extract(..., UChar *dst, ...)` function copies the string contents only
    if the dst buffer is different from the buffer of the string object itself.
    If a string grows beyond the aliased buffer and later shrinks during a
    sequence of operations, then it will not go back to using the original
    buffer, even if the string would fit again. When a UnicodeString with a
    writable alias is assigned to another UnicodeString, the contents are always
    copied. The destination string will not point to the buffer that the source
    string aliases. However, a move constructor, move assignment operator, and
    Return Value Optimization (RVO) do preserve aliasing.

In general, UnicodeString objects have "copy-on-write" semantics. Several
objects may share the same string buffer, but a modification only affects the
object that is modified itself. This is achieved by copying the string contents
if they are not owned exclusively by this one object. Only after that is the
object modified.

Even though it is fairly efficient to copy UnicodeString objects, it is even
more efficient, if possible, to work with references or pointers. Functions that
output strings can be faster when they append their results to a UnicodeString
that is passed in by reference than when they return a UnicodeString object by
value or overwrite the referenced string with only their own result.

> :point_right: **Note**: *UnicodeStrings can be copied in a thread-safe manner by just using their
standard copy constructors and assignment operators. fastCopyFrom() is also
thread-safe, but if the original string is a readonly alias, then the copy
shares the same aliased buffer.*

## Using UTF-8 strings with ICU

As mentioned in the overview of this chapter, ICU and most other
Unicode-supporting software use 16-bit Unicode for internal processing.
However, there are circumstances where UTF-8 is used instead. This is usually
the case for software that does little or no processing of non-ASCII characters,
and/or for APIs that predate Unicode, use byte-based strings, and cannot be
changed or replaced for various reasons.

A common perception is that UTF-8 has an advantage because it was designed for
compatibility with byte-based, ASCII-based systems, although it was designed for
string storage (of Unicode characters in Unix file names) rather than for
processing performance.

While ICU mostly does not natively use UTF-8 strings, there are many ways to
work with UTF-8 strings and ICU. For more information see the newer
[UTF-8](utf-8.md) subpage.

## Using UTF-32 strings with ICU

It is even rarer to use UTF-32 for string processing than UTF-8. While 32-bit
Unicode is convenient because it is the only fixed-width UTF, there are few or
no legacy systems with 32-bit string processing that would benefit from a
compatible format, and the memory bandwidth requirements of UTF-32 diminish the
performance and handling advantage of the fixed-width format.

Over time, the wchar_t type of some C/C++ compilers became a 32-bit integer, and
some C libraries do use it for Unicode processing. However, application software
with good Unicode support tends to have little use for the rudimentary Unicode
and Internationalization support of the standard C/C++ libraries and often uses
custom types (like ICU's) and UTF-16 or UTF-8.

For those systems where 32-bit Unicode strings are used, ICU offers some
convenience functions.

1.  Conversion of whole strings: u_strFromUTF32() and u_strToUTF32() in
    ustring.h.

2.  Access to code points is trivial and does not require any macros.

3.  Using a UTF-32 converter with all of the ICU conversion APIs in ucnv.h,
    including ones with an "Algorithmic" suffix.

4.  UnicodeString has `fromUTF32()` and `toUTF32()` methods.

5.  For conversion directly between UTF-32 and another charset use
    ucnv_convertEx(). However, since ICU converters work with byte streams in
    external charsets on the non-"Unicode" side, the UTF-32 string will be
    treated as a byte stream (UTF-32 Character Encoding *Scheme*) rather than a
    sequence of 32-bit code units (UTF-32 Character Encoding *Form*). The
    correct converter must be used: UTF-32BE or UTF-32LE according to the
    platform endianness (U_IS_BIG_ENDIAN). Treating the string like a byte
    stream also makes a difference in data types (`char *`), lengths and indexes
    (counting bytes), and NUL-termination handling (input NUL-termination not
    possible, output writes only a NUL byte, not a NUL 32-bit code unit). For
    the difference between internal encoding forms and external encoding schemes
    see the Unicode Standard.

6.  Some ICU APIs work with a CharacterIterator, a UText or a UCharIterator
    instead of directly with a C/C++ string parameter. There is currently no ICU
    instance of any of these interfaces that reads UTF-32, although an
    application could provide one.

## Changes in ICU 2.0

Beginning with ICU release 2.0, there are a few changes to the ICU string
facilities compared with earlier ICU releases.

Some of the NUL-termination behavior was inconsistent across the ICU API
functions. In particular, the following functions used to count the terminating
NUL character in their output length (counted one more before ICU 2.0 than now):
ucnv_toUChars, ucnv_fromUChars, uloc_getLanguage, uloc_getCountry,
uloc_getVariant, uloc_getName, uloc_getDisplayLanguage, uloc_getDisplayCountry,
uloc_getDisplayVariant, uloc_getDisplayName

Some functions used to set an overflow error code even when only the terminating
NUL did not fit into the output buffer. These functions now set UErrorCode to
U_STRING_NOT_TERMINATED_WARNING rather than to U_BUFFER_OVERFLOW_ERROR.

The aliasing UnicodeString constructors and most extract functions have existed
for several releases prior to ICU 2.0. There is now an additional extract
function with a UErrorCode parameter. Also, the getBuffer, releaseBuffer and
getCapacity functions are new to ICU 2.0.

For more information about these changes, please consult the old and new API
documentation.