---
layout: default
title: UTF-8
nav_order: 1
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UTF-8

*Note: This page is only relevant for C/C++. In Java, all strings are encoded in
UTF-16, except for conversion from bytes to strings (via InputStreamReader or
similar) and from strings to bytes (OutputStreamWriter etc.).*

While most of ICU works with UTF-16 strings and uses data structures optimized
for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized
for UTF-8, or work with Unicode code points (21-bit integer values) regardless
of string encoding. Some data structures are designed to work equally well with
UTF-16 and UTF-8.

For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t`
lengths, with semantics parallel to UTF-16 handling: an input length of -1
means NUL-terminated, output is NUL-terminated if there is space, and output
overflow is handled with preflighting. For details see the parent [Strings
page](index.md). Some newer APIs take an `icu::StringPiece` argument and write
to an `icu::ByteSink` or to a string class object like `std::string`.

## Conversion Between UTF-8 and UTF-16

The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++
`icu::UnicodeString` methods `fromUTF8(const StringPiece &utf8)` and
`toUTF8String(StringClass &result)`. There is also `toUTF8(ByteSink &sink)`.

In C, `unicode/ustring.h` has functions like `u_strFromUTF8WithSub()` and
`u_strToUTF8WithSub()`. (Also `u_strFromUTF8()`, `u_strToUTF8()` and
`u_strFromUTF8Lenient()`.)

The conversion functions in `unicode/ucnv.h` are intended for very flexible
handling of conversion to/from external byte streams (with customizable error
handling and support for split buffers at arbitrary boundaries), which is
normally unnecessary for internal strings.

Note: `icu::UnicodeString` has constructors, `setTo()` and `extract()` methods
that take either a converter object or a charset name. These can be used for
UTF-8, but are not as efficient or convenient as the
`fromUTF8()`/`toUTF8()`/`toUTF8String()` methods mentioned above. (Among
conversion methods, APIs with a charset name are more convenient but internally
open and close a converter; ones with a converter object parameter avoid this.)

## UTF-8 as Default Charset

ICU has many functions that take or return `char *` strings that are assumed to
be in the default charset, which should match the system encoding. Since this
could be one of many charsets, and the charset can be different for different
processes on the same system, ICU uses its conversion framework for converting
to and from UTF-16.

If it is known that the default charset is always UTF-8 on the target platform,
then you should `#define U_CHARSET_IS_UTF8 1` in or before `unicode/utypes.h`.
(For example, modify the default value there or pass `-DU_CHARSET_IS_UTF8=1`
as a compiler flag.) This will change most of the implementation code to use
dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the
conversion framework. (Avoiding such dependencies helps with statically linked
libraries and may allow the use of `UCONFIG_NO_LEGACY_CONVERSION` or even
`UCONFIG_NO_CONVERSION` \[see `unicode/uconfig.h`\].)

## Low-Level UTF-8 String Operations

`unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16
macros in `unicode/utf16.h`. The macros handle many cases inline, but call
internal functions for complicated parts of the UTF-8 encoding form. For
example, the following code snippet counts white space characters in a string:

```cpp
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "unicode/utf8.h"
#include "unicode/uchar.h"

// StringPiece is a C++ class in the icu namespace.
int32_t countWhiteSpace(icu::StringPiece sp) {
    const char *s=sp.data();
    int32_t length=sp.length();
    int32_t count=0;
    for(int32_t i=0; i<length;) {
        UChar32 c;
        // Decode the code point at index i and advance i past it;
        // c is negative for an ill-formed byte sequence.
        U8_NEXT(s, i, length, c);
        if(u_isUWhiteSpace(c)) {
            ++count;
        }
    }
    return count;
}
```

## Dedicated UTF-8 APIs

ICU has some APIs dedicated to UTF-8. They tend to have been added for "worker
functions" like comparing strings, to avoid the string conversion overhead,
rather than for "builder functions" like factory methods and attribute setters.

For example, `icu::Collator::compareUTF8()` compares two UTF-8 strings
incrementally, without converting all of the two strings to UTF-16 if there is
an early base letter difference.

`ucnv_convertEx()` can convert between UTF-8 and another charset, if one of the
two `UConverter`s is a UTF-8 converter. The conversion *from UTF-8 to* most
other charsets uses a dedicated, optimized code path, avoiding the pivot through
UTF-16. (Conversion *from* other charsets *to UTF-8* could be optimized as well,
but that has not been implemented yet as of ICU 4.4.)

Other examples: (This list may or may not be complete.)

*   `ucasemap_utf8ToLower()`, `ucasemap_utf8ToUpper()`, `ucasemap_utf8ToTitle()`,
    `ucasemap_utf8FoldCase()`
*   `ucnvsel_selectForUTF8()`
*   `icu::UnicodeSet::spanUTF8()`, `spanBackUTF8()` and `uset_spanUTF8()`,
    `uset_spanBackUTF8()` (These are highly optimized for UTF-8 processing.)
*   `ures_getUTF8String()`, `ures_getUTF8StringByIndex()`, `ures_getUTF8StringByKey()`
*   `uspoof_checkUTF8()`, `uspoof_areConfusableUTF8()`, `uspoof_getSkeletonUTF8()`

## Abstract Text APIs

ICU offers several interfaces for text access, designed for different use cases.
(Some interfaces are simply newer and more modern than others.) Some ICU
services work with some of these interfaces, and for some of these interfaces
ICU offers UTF-8 implementations out of the box.

`UText` can be used with `BreakIterator` APIs (character/word/sentence/...
segmentation). `utext_openUTF8()` creates a read-only `UText` for a UTF-8
string.

*   *Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any
    other charset with non-1:1 index conversion to UTF-16) if the break
    iterator does not use a dictionary. This excludes Thai word break. See
    [ticket #5532](https://unicode-org.atlassian.net/browse/ICU-5532).*
*   *As a workaround for Thai word breaking, you can convert the string to
    UTF-16 and convert indexes to UTF-8 string indexes via
    `u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index)`.*
*   *ICU 4.4 has a technology preview for UText in the regular expression API,
    but some of the UText regex API and semantics are likely to change for ICU
    4.6. (Especially indexing semantics.)*

A `UCharIterator` can be used with several collation APIs (although there is
also the newer `icu::Collator::compareUTF8()`) and with `u_strCompareIter()`.
`uiter_setUTF8()` sets up a `UCharIterator` for a UTF-8 string.

It is also possible to create a `CharacterIterator` subclass for UTF-8 strings,
but `CharacterIterator` has a lot of virtual methods and it requires UTF-16
string index semantics.