---
layout: default
title: UTF-8
nav_order: 1
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UTF-8

*Note: This page is only relevant for C/C++. In Java, all strings are encoded in
UTF-16, except for conversion from bytes to strings (via InputStreamReader or
similar) and from strings to bytes (OutputStreamWriter etc.).*

While most of ICU works with UTF-16 strings and uses data structures optimized
for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized
for UTF-8, or work with Unicode code points (21-bit integer values) regardless
of string encoding. Some data structures are designed to work equally well with
UTF-16 and UTF-8.
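
To illustrate how one code point relates to the two encoding forms, here is a minimal sketch in plain C++ that does not use ICU. The helper names `utf8Length` and `utf16Length` are hypothetical and exist only for this example; they return how many code units a single code point occupies in each encoding.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an ICU API): number of bytes needed to encode
// the code point c in UTF-8. Surrogate code points are not encodable.
inline int utf8Length(char32_t c) {
    assert(c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF));
    if (c <= 0x7F) return 1;
    if (c <= 0x7FF) return 2;
    if (c <= 0xFFFF) return 3;
    return 4;
}

// Hypothetical helper (not an ICU API): number of 16-bit code units needed
// to encode the code point c in UTF-16 (2 means a surrogate pair).
inline int utf16Length(char32_t c) {
    assert(c <= 0x10FFFF);
    return c <= 0xFFFF ? 1 : 2;
}
```

For example, U+20AC takes three bytes in UTF-8 but one code unit in UTF-16, which is why string indexes generally differ between the two encodings.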

For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t`
lengths, normally with semantics parallel to UTF-16 handling. (An input length
of -1 means that the string is NUL-terminated, output is NUL-terminated if
there is space, and output overflow is handled with preflighting; for details
see the parent [Strings page](index.md).) Some newer APIs take an
`icu::StringPiece` argument and write to an `icu::ByteSink` or to a string
class object like `std::string`.
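
The length and output conventions described above can be sketched with a small stand-alone function. This is an illustration in plain C++, not an ICU API: the function `copyUTF8` and the `CopyStatus` enum are hypothetical (ICU reports these conditions through `UErrorCode`), and the sketch assumes that nothing is written when the output does not fit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical status values for this sketch only.
enum class CopyStatus { Ok, NotTerminated, BufferOverflow };

// Hypothetical function (not an ICU API) following the conventions above:
// - srcLength == -1 means that src is NUL-terminated.
// - The return value is the full output length even if it did not fit,
//   so a caller can preflight with dest == nullptr and capacity == 0.
// - The output is NUL-terminated only if there is room for the terminator.
int32_t copyUTF8(char *dest, int32_t capacity,
                 const char *src, int32_t srcLength, CopyStatus &status) {
    assert(src != nullptr && srcLength >= -1 && capacity >= 0);
    if (srcLength < 0) {
        srcLength = static_cast<int32_t>(std::strlen(src));
    }
    if (srcLength > capacity) {
        // Assumption of this sketch: write nothing on overflow.
        status = CopyStatus::BufferOverflow;
    } else {
        if (srcLength > 0) {
            std::memcpy(dest, src, static_cast<size_t>(srcLength));
        }
        if (srcLength < capacity) {
            dest[srcLength] = 0;
            status = CopyStatus::Ok;
        } else {
            status = CopyStatus::NotTerminated;
        }
    }
    return srcLength;
}
```

A caller would typically preflight once to learn the required length, allocate a buffer of at least that size plus one for the terminator, and then call the function again.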

## Conversion Between UTF-8 and UTF-16

The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++
`icu::UnicodeString` methods `fromUTF8(const StringPiece &utf8)` and
`toUTF8String(StringClass &result)`. There is also `toUTF8(ByteSink &sink)`.

In C, `unicode/ustring.h` has functions like `u_strFromUTF8WithSub()` and
`u_strToUTF8WithSub()`. (Also `u_strFromUTF8()`, `u_strToUTF8()` and
`u_strFromUTF8Lenient()`.)
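
To make the idea of conversion with a substitution character concrete, here is a minimal sketch in plain C++ that does not use ICU. It is not how ICU implements `u_strFromUTF8WithSub()`. The helper name `utf8ToUtf16WithSub` is hypothetical, and this sketch writes one substitution character per rejected byte, which may differ from ICU's behavior for ill-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU API): converts UTF-8 to UTF-16 and writes
// `sub` for every byte that does not start a well-formed UTF-8 sequence.
std::u16string utf8ToUtf16WithSub(const std::string &utf8, char16_t sub = 0xFFFD) {
    assert(!(sub >= 0xD800 && sub <= 0xDFFF));  // sub must not be a surrogate
    std::u16string out;
    const size_t n = utf8.size();
    size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {  // ASCII
            out.push_back(static_cast<char16_t>(b0));
            ++i;
            continue;
        }
        // Number of trail bytes, initial bits, and smallest code point
        // allowed for this sequence length (rejects overlong forms).
        int trail;
        char32_t c, minValue;
        if (b0 >= 0xC2 && b0 <= 0xDF) { trail = 1; c = b0 & 0x1F; minValue = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { trail = 2; c = b0 & 0x0F; minValue = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { trail = 3; c = b0 & 0x07; minValue = 0x10000; }
        else { out.push_back(sub); ++i; continue; }  // stray trail or invalid lead byte
        size_t j = i + 1;
        int seen = 0;
        while (seen < trail && j < n) {
            const unsigned char t = static_cast<unsigned char>(utf8[j]);
            if ((t & 0xC0) != 0x80) break;
            c = (c << 6) | (t & 0x3F);
            ++seen;
            ++j;
        }
        const bool ok = seen == trail && c >= minValue && c <= 0x10FFFF &&
                        !(c >= 0xD800 && c <= 0xDFFF);
        if (!ok) {  // substitute and resynchronize after the lead byte
            out.push_back(sub);
            ++i;
            continue;
        }
        if (c <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(c));
        } else {  // supplementary code point: write a surrogate pair
            c -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        }
        i = j;
    }
    return out;
}
```

In real code, prefer the ICU functions named above; the sketch only shows what "conversion with substitution" means at the byte level.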

The conversion functions in `unicode/ucnv.h` are intended for very flexible
handling of conversion to/from external byte streams (with customizable error
handling and support for split buffers at arbitrary boundaries) which is
normally unnecessary for internal strings.

Note: `icu::UnicodeString` has constructors, `setTo()` and `extract()` methods
which take either a converter object or a charset name. These can be used for
UTF-8, but are not as efficient or convenient as the
`fromUTF8()`/`toUTF8()`/`toUTF8String()` methods mentioned above. (Among
conversion methods, APIs with a charset name are more convenient but internally
open and close a converter; ones with a converter object parameter avoid this.)

## UTF-8 as Default Charset

ICU has many functions that take or return `char *` strings that are assumed to
be in the default charset, which should match the system encoding. Since this
could be one of many charsets, and the charset can be different for different
processes on the same system, ICU uses its conversion framework for converting
to and from UTF-16.

If it is known that the default charset is always UTF-8 on the target platform,
then you should `#define U_CHARSET_IS_UTF8 1` in or before `unicode/utypes.h`.
(For example, modify the default value there or pass `-DU_CHARSET_IS_UTF8=1`
as a compiler flag.) This will change most of the implementation code to use
dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the
conversion framework. (Avoiding such dependencies helps with statically linked
libraries and may allow the use of `UCONFIG_NO_LEGACY_CONVERSION` or even
`UCONFIG_NO_CONVERSION` \[see `unicode/uconfig.h`\].)

## Low-Level UTF-8 String Operations

`unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16
macros in `unicode/utf16.h`. The macros handle many cases inline, but call
internal functions for complicated parts of the UTF-8 encoding form. For
example, the following code snippet counts white space characters in a string:

```cpp
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "unicode/utf8.h"
#include "unicode/uchar.h"

// Counts the code points in a UTF-8 string that have the White_Space property.
int32_t countWhiteSpace(icu::StringPiece sp) {
    const char *s=sp.data();
    int32_t length=sp.length();
    int32_t count=0;
    for(int32_t i=0; i<length;) {
        UChar32 c;
        // Decodes one code point and advances i past it;
        // c is set to a negative value for an ill-formed sequence.
        U8_NEXT(s, i, length, c);
        if(u_isUWhiteSpace(c)) {
            ++count;
        }
    }
    return count;
}
```

## Dedicated UTF-8 APIs

ICU has some APIs dedicated to UTF-8. They tend to have been added for "worker
functions" like comparing strings, to avoid the string conversion overhead,
rather than for "builder functions" like factory methods and attribute setters.

For example, `icu::Collator::compareUTF8()` compares two UTF-8 strings
incrementally, without converting both strings entirely to UTF-16 if there is
an early base letter difference.

`ucnv_convertEx()` can convert between UTF-8 and another charset, if one of the
two `UConverter`s is a UTF-8 converter. The conversion *from UTF-8 to* most
other charsets uses a dedicated, optimized code path, avoiding the pivot through
UTF-16. (Conversion *from* other charsets *to UTF-8* could be optimized as well,
but that has not been implemented yet as of ICU 4.4.)

Other examples: (This list may or may not be complete.)

*   `ucasemap_utf8ToLower()`, `ucasemap_utf8ToUpper()`, `ucasemap_utf8ToTitle()`,
    `ucasemap_utf8FoldCase()`
*   `ucnvsel_selectForUTF8()`
*   `icu::UnicodeSet::spanUTF8()`, `spanBackUTF8()` and `uset_spanUTF8()`,
    `uset_spanBackUTF8()` (These are highly optimized for UTF-8 processing.)
*   `ures_getUTF8String()`, `ures_getUTF8StringByIndex()`, `ures_getUTF8StringByKey()`
*   `uspoof_checkUTF8()`, `uspoof_areConfusableUTF8()`, `uspoof_getSkeletonUTF8()`

## Abstract Text APIs

ICU offers several interfaces for text access, designed for different use cases.
(Some interfaces are simply newer and more modern than others.) Some ICU
services work with some of these interfaces, and for some of these interfaces
ICU offers UTF-8 implementations out of the box.

`UText` can be used with `BreakIterator` APIs (character/word/sentence/...
segmentation). `utext_openUTF8()` creates a read-only `UText` for a UTF-8
string.

*   *Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any
    other charset with non-1:1 index conversion to UTF-16) if no dictionary is
    supported. This excludes Thai word break. See [ticket #5532](https://unicode-org.atlassian.net/browse/ICU-5532).*
*   *As a workaround for Thai word breaking, you can convert the string to
    UTF-16 and convert indexes to UTF-8 string indexes via
    `u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index)`.*
*   *ICU 4.4 has a technology preview for UText in the regular expression API,
    but some of the UText regex API and semantics are likely to change for ICU
    4.6. (Especially indexing semantics.)*
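
The index conversion in the workaround above amounts to counting how many UTF-8 bytes the UTF-16 prefix before a boundary would occupy. The following is a minimal sketch of that mapping in plain C++ without ICU. The helper name `utf8IndexFromUtf16Index` is hypothetical, and the sketch assumes well-formed UTF-16 and an index that lies on a code point boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU API): returns the UTF-8 byte offset that
// corresponds to the UTF-16 code unit offset utf16Index in text.
// Assumes well-formed UTF-16 and an index on a code point boundary.
size_t utf8IndexFromUtf16Index(const std::u16string &text, size_t utf16Index) {
    assert(utf16Index <= text.size());
    size_t utf8Index = 0;
    for (size_t i = 0; i < utf16Index; ++i) {
        const char16_t u = text[i];
        if (u <= 0x7F) {
            utf8Index += 1;
        } else if (u <= 0x7FF) {
            utf8Index += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF) {
            utf8Index += 4;  // a surrogate pair becomes four UTF-8 bytes
            ++i;             // skip the trail surrogate
        } else {
            utf8Index += 3;
        }
    }
    return utf8Index;
}
```

Thai letters are all in the range that takes three bytes in UTF-8, so a word boundary at UTF-16 index 2 in Thai-only text maps to UTF-8 index 6.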

A `UCharIterator` can be used with several collation APIs (although there is
also the newer `icu::Collator::compareUTF8()`) and with `u_strCompareIter()`.
`uiter_setUTF8()` creates a `UCharIterator` for a UTF-8 string.

It is also possible to create a `CharacterIterator` subclass for UTF-8 strings,
but `CharacterIterator` has a lot of virtual methods and it requires UTF-16
string index semantics.