---
layout: default
title: UnicodeSet
nav_order: 5
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UnicodeSet

## Overview

A UnicodeSet is an object that represents a set of Unicode characters or
character strings. The contents of that object can be specified either by
patterns or by building them programmatically.

Here are a few examples of sets:

| Pattern        | Description                                                 |
|----------------|-------------------------------------------------------------|
| `[a-z]`        | The lower case letters a through z                          |
| `[abc123]`     | The six characters a,b,c,1,2 and 3                          |
| `[\p{Letter}]` | All characters with the Unicode General Category of Letter. |

### String Values

In addition to being a set of characters (of Unicode code points),
a UnicodeSet may also contain string values. Conceptually, the UnicodeSet is
always a set of strings, not a set of characters, although in many common use
cases the strings are all of length one, which reduces to being a set of
characters.
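
The "set of strings" view can be pictured with ordinary Java collections. The sketch below is only a conceptual model built on `java.util.TreeSet`; it is not ICU's implementation or API, and the class and method names (`StringSetModel`, `elements`, `isPlainCharacterSet`) are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class StringSetModel {
    // Models the elements of a set as strings; a single character is a
    // string holding exactly one code point.
    public static Set<String> elements(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // True when every element is exactly one code point long, that is,
    // when the set "reduces to" a plain set of characters.
    public static boolean isPlainCharacterSet(Set<String> set) {
        for (String s : set) {
            if (s.codePointCount(0, s.length()) != 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> letters = elements("a", "b", "c");
        Set<String> mixed = elements("a", "b", "c", "def");
        System.out.println(isPlainCharacterSet(letters)); // true
        System.out.println(isPlainCharacterSet(mixed));   // false
        System.out.println(mixed.size());                 // 4
    }
}
```

In this model, the set with the extra element "def" still has a well-defined size and membership test; it simply is no longer describable as a set of characters alone.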

This concept can be confusing when first encountered, probably because similar
set constructs from other environments
(e.g., character classes in most regular expression implementations)
can only contain characters.

Through ICU 68, it was not possible for a UnicodeSet to contain the empty string.
In Java, an exception was thrown. In C++, the empty string was silently ignored.

Starting with ICU 69 ([ICU-13702](https://unicode-org.atlassian.net/browse/ICU-13702)),
the empty string is supported as a set element;
however, it is ignored in matching functions such as `span(string)`.

## UnicodeSet Patterns

A pattern is a series of characters bounded by square brackets that contains
lists of characters and Unicode property sets. A list is a sequence of
characters that may include ranges, indicated by a '-' between two characters, as in
"a-z". A range specifies all characters from the left one to the
right one, in Unicode order. For example, `[a c d-f m]` is equivalent to `[a c d e f m]`.
Whitespace can be used freely for clarity: `[a c d-f m]` means the same
as `[acd-fm]`.

Unicode property sets are specified by a Unicode property, such as `[:Letter:]`.
For a list of supported properties, see the [Properties](properties.md) chapter.
For details on the use of short vs.
long property and property value names, see
the Property Values section below. The syntax for specifying the property names is an
extension of either POSIX or Perl syntax with the addition of "=value". For
example, you can match letters by using the POSIX syntax `[:Letter:]`, or by
using the Perl-style syntax `\p{Letter}`. The type can be omitted for the
Category and Script properties, but is required for other properties.

The table below shows the two kinds of syntax: POSIX and Perl style. The
table also shows the "Negative" form, which matches all characters that do not
have the given property value. For example, `[:^Letter:]` matches all characters that are not
in `[:Letter:]`.

|                    | Positive         | Negative          |
|--------------------|------------------|-------------------|
| POSIX-style Syntax | `[:type=value:]` | `[:^type=value:]` |
| Perl-style Syntax  | `\p{type=value}` | `\P{type=value}`  |

These low-level lists and properties can then be freely combined with
the normal set operations (union, inverse, difference, and intersection):

|       | Example                   | Corresponding Method | Meaning |
|-------|---------------------------|----------------------|---------|
| A B   | `[[:letter:] [:number:]]` | `A.addAll(B)`        | To union two sets A and B, simply concatenate them. |
| A & B | `[[:letter:] & [a-z]]`    | `A.retainAll(B)`     | To intersect two sets A and B, use the '&' operator. |
| A - B | `[[:letter:] - [a-z]]`    | `A.removeAll(B)`     | To take the set-difference of two sets A and B, use the '-' operator. |
| [^A]  | `[^a-z]`                  | `A.complement()`     | To invert a set A, place a '^' immediately after the opening '['. Note that the complement only affects code points, not string values. In any other location, the '^' does not have a special meaning. |

### Precedence

The binary operators of union, intersection, and set-difference have equal
precedence and bind left-to-right. Thus the following are equivalent:

* `[[:letter:] - [a-z] [:number:] & [\u0100-\u01FF]]`
* `[[[[:letter:] - [a-z]] [:number:]] & [\u0100-\u01FF]]`

Another example is that the set `[[ace][bdf] - [abc][def]]` is **not**
the empty set, but instead the set `[def]`. That is because the syntax
corresponds to the following UnicodeSet operations:

1. start with `[ace]`
2. addAll `[bdf]` *-- we now have `[abcdef]`*
3. removeAll `[abc]` *-- we now have `[def]`*
4. addAll `[def]` *-- no effect, we still have `[def]`*

This only really matters when the difference and intersection
operations are involved, as the union operation is commutative.
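
The four steps above can be replayed with plain Java collections. The following is a minimal sketch that models each set as a `java.util.TreeSet<Character>` rather than using ICU itself; the class and method names are hypothetical. It also evaluates a variant in which the two right-hand sets are grouped before the difference is taken.

```java
import java.util.Set;
import java.util.TreeSet;

public class PrecedenceModel {
    // Builds a sorted set holding each character of the given string.
    public static Set<Character> set(String chars) {
        Set<Character> s = new TreeSet<>();
        for (char c : chars.toCharArray()) {
            s.add(c);
        }
        return s;
    }

    // Models [[ace][bdf] - [abc][def]] evaluated strictly left to right.
    public static Set<Character> leftToRight() {
        Set<Character> result = set("ace");
        result.addAll(set("bdf"));     // now a, b, c, d, e, f
        result.removeAll(set("abc"));  // now d, e, f
        result.addAll(set("def"));     // no effect, still d, e, f
        return result;
    }

    // Models the same operands with the right-hand union grouped first.
    public static Set<Character> grouped() {
        Set<Character> right = set("abc");
        right.addAll(set("def"));      // a, b, c, d, e, f
        Set<Character> result = set("ace");
        result.addAll(set("bdf"));     // a, b, c, d, e, f
        result.removeAll(right);       // empty
        return result;
    }

    public static void main(String[] args) {
        System.out.println(leftToRight()); // [d, e, f]
        System.out.println(grouped());     // []
    }
}
```
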
To make sure that the '-' is
the main operator, add brackets to group the operations as desired, such as
`[[ace][bdf] - [[abc][def]]]`.

Another caveat with the '&' and '-' operators is that they operate between
**sets**. That is, they must be immediately preceded and immediately followed by
a set. For example, the pattern `[[:Lu:]-A]` is illegal, since it is
interpreted as the set `[:Lu:]` followed by the incomplete range `-A`. To specify
the set of uppercase letters except for 'A', enclose the 'A' in a set:
`[[:Lu:]-[A]]`.

### Examples

| Pattern             | Description |
|---------------------|-------------|
| `[a]`               | The set containing 'a' |
| `[a-z]`             | The set containing 'a' through 'z' and all letters in between, in Unicode order |
| `[^a-z]`            | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
| `[[pat1][pat2]]`    | The union of sets specified by pat1 and pat2 |
| `[[pat1] & [pat2]]` | The intersection of sets specified by pat1 and pat2 |
| `[[pat1] - [pat2]]` | The asymmetric difference of sets specified by pat1 and pat2 |
| `[:Lu:]`            | The set of characters belonging to the given Unicode category, as defined by `Character.getType()`; in this case, Unicode uppercase letters. The long form for this is `[:UppercaseLetter:]`. |
| `[:L:]`             | The set of characters belonging to all Unicode categories starting with 'L', that is, `[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]`. The long form for this is `[:Letter:]`. |

### String Values in Sets

String values are enclosed in {curly brackets}.

| Set expression        | Description |
|-----------------------|-------------|
| `[abc{def}]`          | A set containing four members: the single characters a, b and c, and the string “def”. |
| `[{abc}{def}]`        | A set containing two members: the string “abc” and the string “def”. |
| `[{a}{b}{c}]` `[abc]` | These two sets are equivalent. Each contains three items, the three individual characters a, b and c. A {string} containing a single character is equivalent to that same character specified in any other way. |

### Character Quoting and Escaping in Unicode Set Patterns

#### Single Quote

Two single quotes represent a single quote, either inside or outside single
quotes.

Text within single quotes is not interpreted in any way (except for two adjacent
single quotes). It is taken as literal text (special characters become
non-special).
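
As an illustration of these two rules, the sketch below builds a quoted literal for embedding in a pattern string. It is plain Java string handling, not an ICU API; the class name `PatternQuoting` and the helper `quoteLiteral` are hypothetical.

```java
public class PatternQuoting {
    // Wraps text in single quotes so that, under the rules above, a pattern
    // would read it literally; each embedded single quote is doubled.
    public static String quoteLiteral(String text) {
        if (text.isEmpty()) {
            // A bare '' would denote one literal quote, so emit nothing.
            return "";
        }
        return "'" + text.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        System.out.println(quoteLiteral("a-z"));              // 'a-z'
        System.out.println(quoteLiteral("it's"));             // 'it''s'
        System.out.println("[" + quoteLiteral("[]^-") + "]"); // ['[]^-']
    }
}
```

Quoting the whole run of text is a simple choice for this sketch; backslash-escaping individual special characters would be an alternative.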

These quoting conventions for ICU UnicodeSets differ from those of regular
expression character set expressions. In regular expressions, single quotes have
no special meaning and are treated like any other literal character.

#### Backslash Escapes

Outside of single quotes, certain backslashed characters have special meaning:

| Escape       | Meaning                                |
|--------------|----------------------------------------|
| `\uhhhh`     | Exactly 4 hex digits; h in [0-9A-Fa-f] |
| `\Uhhhhhhhh` | Exactly 8 hex digits                   |
| `\xhh`       | 1-2 hex digits                         |
| `\ooo`       | 1-3 octal digits; o in [0-7]           |
| `\a`         | U+0007 (BELL)                          |
| `\b`         | U+0008 (BACKSPACE)                     |
| `\t`         | U+0009 (HORIZONTAL TAB)                |
| `\n`         | U+000A (LINE FEED)                     |
| `\v`         | U+000B (VERTICAL TAB)                  |
| `\f`         | U+000C (FORM FEED)                     |
| `\r`         | U+000D (CARRIAGE RETURN)               |
| `\\`         | U+005C (BACKSLASH)                     |

Anything else following a backslash is mapped to itself, except in an
environment where it is defined to have some special meaning. For example,
`\p{Lu}` is the set of uppercase letters in UnicodeSet.

Any character formed as the result of a backslash escape loses any special
meaning and is treated as a literal. In particular, note that `\u` and `\U`
escapes create literal characters.
(In contrast, the Java compiler treats
Unicode escapes as just a way to represent arbitrary characters in an ASCII
source file, and any resulting characters are **not** tagged as literals.)

#### Whitespace

Whitespace (as defined by the ICU API) is ignored unless it is quoted or
backslashed.

> :point_right: **Note**: *The rules for quoting and white space handling are common to most ICU APIs that
process rule or expression strings, including UnicodeSet, Transliteration and
Break Iterators.*

> :point_right: **Note**: *ICU Regular Expression set expressions have a different (but similar) syntax,
and a different set of recognized backslash escapes. \[Sets\] in ICU Regular
Expressions follow the conventions from Perl and Java regular expressions rather
than the pattern syntax from ICU UnicodeSet.*

## Using a UnicodeSet

For best performance, once the set contents are complete, call `freeze()` on the set to
make it immutable and to speed up `contains()` and `span()` operations (for which it
builds a small additional data structure).

The most basic operation is `contains(code point)` or, if relevant,
`contains(string)`.

For splitting and partitioning strings, it is simpler and faster to use `span()`
and `spanBack()` than to iterate over code points and call `contains()`.
In
Java, there is also a class UnicodeSetSpanner for somewhat higher-level
operations. See also the “Lookup” section of the [Properties](properties.md)
chapter.

## Programmatically Building UnicodeSets

ICU users can programmatically build a UnicodeSet by adding or removing ranges
of characters or by using the retain (intersection), remove (difference), and
add (union) operations.

## Property Values

The following property value variants are recognized:

| Format | Description                                                                                         | Example                           |
|--------|-----------------------------------------------------------------------------------------------------|-----------------------------------|
| short  | omits the type (used to prevent ambiguity and only allowed with the Category and Script properties) | Lu                                |
| medium | uses an abbreviated type and value                                                                  | gc=Lu                             |
| long   | uses a full type and value                                                                          | General_Category=Uppercase_Letter |

If the type or value is omitted, then the equals sign is also omitted. The short
style is only
used for Category and Script properties because these properties are very common
and their omission is unambiguous.

In actual practice, you can mix type names and values that are omitted,
abbreviated, or full.
For example, for Category=Unassigned you could use the forms shown
in the table, or mixed forms such as `\p{gc=Unassigned}`, `\p{Category=Cn}`, or
`\p{Unassigned}`.

When these are processed, case and whitespace are ignored, so you may use them
for clarity, if desired. For example, `\p{Category = Uppercase Letter}` or
`\p{Category = uppercase letter}`.

For a list of supported properties, see the [Properties](properties.md) chapter.

## Getting UnicodeSet from Script

ICU can build a UnicodeSet from script codes. Here is an
example of generating a pattern from all the scripts that are associated with a
locale and then getting the UnicodeSet based on the generated pattern.

**In C++:**

    // Requires unicode/uscript.h, unicode/uniset.h, unicode/unistr.h and <stdio.h>.
    UErrorCode err = U_ZERO_ERROR;
    const int32_t capacity = 10;
    const char * shortname = NULL;
    int32_t num, j;
    int32_t strLength = 4;   // script short names are four letters long
    UChar32 c = 0x00003096;
    UScriptCode script[10] = {USCRIPT_INVALID_CODE};
    // Fill script[] with the script codes associated with the locale "ja".
    num = uscript_getCode("ja", script, capacity, &err);
    printf("%s %d \n", "Number of script codes associated:", num);
    UnicodeString pattern = UnicodeString("[", 1, US_INV);
    for (j = 0; j < num; j++) {
        shortname = uscript_getShortName(script[j]);
        UnicodeString str(shortname, strLength, US_INV);
        // Concatenating sets inside a pattern forms their union; a '+'
        // between them would be added to the set as a literal character.
        pattern.append("[:");
        pattern.append(str);
        pattern.append(":]");
    }
    pattern.append("]");
    UnicodeSet cnvSet(pattern, err);
    printf("%d\n", cnvSet.size());
    printf("%d\n", cnvSet.contains(c));

**In Java:**

    // Requires com.ibm.icu.lang.UScript, com.ibm.icu.text.UnicodeSet
    // and com.ibm.icu.util.ULocale.
    ULocale ul = new ULocale("ja");
    int script[] = UScript.getCode(ul);
    String str = "[";
    for (int i = 0; i < script.length; i++) {
        // Concatenating sets inside a pattern forms their union; a '+'
        // between them would be added to the set as a literal character.
        str = str + "[:" + UScript.getShortName(script[i]) + ":]";
    }
    String pattern = str + "]";
    System.out.println(pattern);
    UnicodeSet ucs = new UnicodeSet(pattern);
    System.out.println(ucs.size());
    System.out.println(ucs.contains(0x00003096));