---
layout: default
title: UTF-8
nav_order: 1
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UTF-8

*Note: This page is only relevant for C/C++. In Java, all strings are encoded in
UTF-16, except for conversion from bytes to strings (via InputStreamReader or
similar) and from strings to bytes (OutputStreamWriter etc.).*

While most of ICU works with UTF-16 strings and uses data structures optimized
for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized
for UTF-8, or work with Unicode code points (21-bit integer values) regardless
of string encoding. Some data structures are designed to work equally well with
UTF-16 and UTF-8.

For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t`
lengths, normally with semantics parallel to UTF-16 handling. (Input length=-1
means NUL-terminated, output is NUL-terminated if there is space, output
overflow is handled with preflighting; for details see the parent [Strings
page](index.md).) Some newer APIs take an `icu::StringPiece` argument and write
to an `icu::ByteSink` or to a string class object like `std::string`.

## Conversion Between UTF-8 and UTF-16

The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++
`icu::UnicodeString` methods `fromUTF8(const StringPiece &utf8)` and
`toUTF8String(StringClass &result)`. There is also `toUTF8(ByteSink &sink)`.

In C, `unicode/ustring.h` has functions like `u_strFromUTF8WithSub()` and
`u_strToUTF8WithSub()`. (Also `u_strFromUTF8()`, `u_strToUTF8()` and
`u_strFromUTF8Lenient()`.)

The conversion functions in `unicode/ucnv.h` are intended for very flexible
handling of conversion to/from external byte streams (with customizable error
handling and support for split buffers at arbitrary boundaries) which is
normally unnecessary for internal strings.

Note: `icu::UnicodeString` has constructors, `setTo()` and `extract()` methods
which take either a converter object or a charset name. These can be used for
UTF-8, but are not as efficient or convenient as the
`fromUTF8()`/`toUTF8()`/`toUTF8String()` methods mentioned above. (Among
conversion methods, APIs with a charset name are more convenient but internally
open and close a converter; ones with a converter object parameter avoid this.)

## UTF-8 as Default Charset

ICU has many functions that take or return `char *` strings that are assumed to
be in the default charset which should match the system encoding.
Since this could be one of many charsets, and the charset can be different for
different processes on the same system, ICU uses its conversion framework for
converting to and from UTF-16.

If it is known that the default charset is always UTF-8 on the target platform,
then you should `#define U_CHARSET_IS_UTF8 1` in or before `unicode/utypes.h`.
(For example, modify the default value there or pass `-DU_CHARSET_IS_UTF8=1`
as a compiler flag.) This will change most of the implementation code to use
dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the
conversion framework. (Avoiding such dependencies helps with statically linked
libraries and may allow the use of `UCONFIG_NO_LEGACY_CONVERSION` or even
`UCONFIG_NO_CONVERSION` \[see `unicode/uconfig.h`\].)

## Low-Level UTF-8 String Operations

`unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16
macros in `unicode/utf16.h`. The macros handle many cases inline, but call
internal functions for complicated parts of the UTF-8 encoding form.
For example, the following code snippet counts white space characters in a
string:

```cpp
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "unicode/utf8.h"
#include "unicode/uchar.h"

// Counts the code points in a UTF-8 string that have the White_Space property.
int32_t countWhiteSpace(icu::StringPiece sp) {
    const char *s=sp.data();
    int32_t length=sp.length();
    int32_t count=0;
    for(int32_t i=0; i<length;) {
        UChar32 c;
        // Decodes one code point and advances i; c is negative for ill-formed input.
        U8_NEXT(s, i, length, c);
        if(u_isUWhiteSpace(c)) {
            ++count;
        }
    }
    return count;
}
```

## Dedicated UTF-8 APIs

ICU has some APIs dedicated for UTF-8. They tend to have been added for "worker
functions" like comparing strings, to avoid the string conversion overhead,
rather than for "builder functions" like factory methods and attribute setters.

For example, `icu::Collator::compareUTF8()` compares two UTF-8 strings
incrementally, without converting all of the two strings to UTF-16 if there is
an early base letter difference.

`ucnv_convertEx()` can convert between UTF-8 and another charset, if one of the
two `UConverter`s is a UTF-8 converter.
The conversion *from UTF-8 to* most other charsets uses a dedicated, optimized
code path, avoiding the pivot through UTF-16. (Conversion *from* other charsets
*to UTF-8* could be optimized as well, but that has not been implemented yet as
of ICU 4.4.)

Other examples: (This list may or may not be complete.)

* ucasemap_utf8ToLower(), ucasemap_utf8ToUpper(), ucasemap_utf8ToTitle(),
  ucasemap_utf8FoldCase()
* ucnvsel_selectForUTF8()
* icu::UnicodeSet::spanUTF8(), spanBackUTF8() and uset_spanUTF8(),
  uset_spanBackUTF8() (These are highly optimized for UTF-8 processing.)
* ures_getUTF8String(), ures_getUTF8StringByIndex(), ures_getUTF8StringByKey()
* uspoof_checkUTF8(), uspoof_areConfusableUTF8(), uspoof_getSkeletonUTF8()

## Abstract Text APIs

ICU offers several interfaces for text access, designed for different use cases.
(Some interfaces are simply newer and more modern than others.) Some ICU
services work with some of these interfaces, and for some of these interfaces
ICU offers UTF-8 implementations out of the box.

`UText` can be used with `BreakIterator` APIs (character/word/sentence/...
segmentation). `utext_openUTF8()` creates a read-only `UText` for a UTF-8
string.

* *Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any
  other charset with non-1:1 index conversion to UTF-16) if no dictionary is
  supported. This excludes Thai word break. See [ticket #5532](https://unicode-org.atlassian.net/browse/ICU-5532).*
* *As a workaround for Thai word breaking, you can convert the string to
  UTF-16 and convert indexes to UTF-8 string indexes via
  `u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index)`.*
* *ICU 4.4 has a technology preview for UText in the regular expression API,
  but some of the UText regex API and semantics are likely to change for ICU
  4.6. (Especially indexing semantics.)*

A `UCharIterator` can be used with several collation APIs (although there is
also the newer `icu::Collator::compareUTF8()`) and with `u_strCompareIter()`.
`uiter_setUTF8()` creates a UCharIterator for a UTF-8 string.

It is also possible to create a `CharacterIterator` subclass for UTF-8 strings,
but `CharacterIterator` has a lot of virtual methods and it requires UTF-16
string index semantics.
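
The mismatch between UTF-16 indexes and UTF-8 byte offsets that underlies the notes in this section can be illustrated without any ICU calls. The following is a minimal sketch, assuming well-formed UTF-8 input; the function name `utf8OffsetForUTF16Index` is hypothetical and not an ICU API. In ICU-based code, the `u_strToUTF8()` preflighting call mentioned in the Thai word breaking workaround yields the same kind of offset from the UTF-16 side.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not an ICU API): returns the UTF-8 byte offset that
// corresponds to a UTF-16 code unit index into the same text.
// Assumes well-formed UTF-8. Returns -1 if utf16Index is out of range or
// falls between the two surrogates of a supplementary code point.
int32_t utf8OffsetForUTF16Index(const std::string &utf8, int32_t utf16Index) {
    int32_t units = 0;   // UTF-16 code units consumed so far
    std::size_t i = 0;   // UTF-8 byte offset
    while (i < utf8.size() && units < utf16Index) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80)      { i += 1; units += 1; }  // ASCII
        else if (lead < 0xE0) { i += 2; units += 1; }  // 2-byte sequence
        else if (lead < 0xF0) { i += 3; units += 1; }  // 3-byte sequence
        else                  { i += 4; units += 2; }  // 4 bytes = surrogate pair
    }
    if (units != utf16Index || i > utf8.size()) {
        return -1;
    }
    return static_cast<int32_t>(i);
}
```

Because the mapping is not 1:1, each lookup walks the string from the start; code that converts many indexes would normally convert them in increasing order and resume from the previous position.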