---
layout: default
title: UnicodeSet
nav_order: 5
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UnicodeSet

## Overview

A UnicodeSet is an object that represents a set of Unicode characters or
character strings. The contents of that object can be specified either by
patterns or by building them programmatically.

Here are a few examples of sets:

| Pattern | Description |
|--------------|-------------------------------------------------------------|
| `[a-z]` | The lower case letters a through z |
| `[abc123]` | The six characters a, b, c, 1, 2, and 3 |
| `[\p{Letter}]` | All characters with the Unicode General Category of Letter. |

### String Values

In addition to being a set of characters (that is, of Unicode code points),
a UnicodeSet may also contain string values. Conceptually, the UnicodeSet is
always a set of strings, not a set of characters, although in many common use
cases the strings are all of length one, which reduces to being a set of
characters.

This concept can be confusing when first encountered, probably because similar
set constructs from other environments
(e.g., character classes in most regular expression implementations)
can only contain characters.

Through ICU 68, it was not possible for a UnicodeSet to contain the empty string.
In Java, an exception was thrown. In C++, the empty string was silently ignored.

Starting with ICU 69 ([ICU-13702](https://unicode-org.atlassian.net/browse/ICU-13702)),
the empty string is supported as a set element;
however, it is ignored in matching functions such as `span(string)`.

## UnicodeSet Patterns

Patterns are a series of characters bounded by square brackets that contain
lists of characters and Unicode property sets. Lists are a sequence of
characters that may have ranges indicated by a '-' between two characters, as in
"a-z". A range specifies all characters from the left one to the
right one, in Unicode order. For example, `[a c d-f m]` is equivalent to `[a c d e f m]`.
Whitespace can be used freely for clarity: `[a c d-f m]` means the same
as `[acd-fm]`.

Unicode property sets are specified by a Unicode property, such as `[:Letter:]`.
For a list of supported properties, see the [Properties](properties.md) chapter.
For details on the use of short vs. long property and property value names, see
the Property Values section below. The syntax for specifying the property names is an
extension of either POSIX or Perl syntax with the addition of "=value". For
example, you can match letters by using the POSIX syntax `[:Letter:]`, or by
using the Perl-style syntax `\p{Letter}`. The type can be omitted for the
Category and Script properties, but is required for other properties.

The table below shows the two kinds of syntax: POSIX and Perl style. Also, the
table shows the "Negative", which is a property that excludes all characters of
a given kind. For example, `[:^Letter:]` matches all characters that are not
`[:Letter:]`.

| Syntax | Positive | Negative |
|--------------------|------------------|-------------------|
| POSIX-style Syntax | `[:type=value:]` | `[:^type=value:]` |
| Perl-style Syntax  | `\p{type=value}` | `\P{type=value}`  |

These low-level lists or properties can then be freely combined with
the normal set operations (union, inverse, difference, and intersection):

| Operation | Example | Corresponding Method | Meaning |
|-------|-------------------------|----------------------|------------------------------------------------------------|
| A B | `[[:letter:] [:number:]]` | `A.addAll(B)` | To union two sets A and B, simply concatenate them. |
| A & B | `[[:letter:] & [a-z]]` | `A.retainAll(B)` | To intersect two sets A and B, use the '&' operator. |
| A - B | `[[:letter:] - [a-z]]` | `A.removeAll(B)` | To take the set-difference of two sets A and B, use the '-' operator. |
| [^A] | `[^a-z]` | `A.complement()` | To invert a set A, place a '^' immediately after the opening '['. Note that the complement only affects code points, not string values. In any other location, the '^' does not have a special meaning. |

### Precedence

The binary operators of union, intersection, and set-difference have equal
precedence and bind left-to-right. Thus the following are equivalent:

*   `[[:letter:] - [a-z] [:number:] & [\u0100-\u01FF]]`
*   `[[[[:letter:] - [a-z]] [:number:]] & [\u0100-\u01FF]]`

Another example is that the set `[[ace][bdf] - [abc][def]]` is **not**
the empty set, but instead the set `[def]`. That is because the syntax
corresponds to the following UnicodeSet operations:

1.  start with `[ace]`
2.  addAll `[bdf]` *-- we now have `[abcdef]`*
3.  removeAll `[abc]` *-- we now have `[def]`*
4.  addAll `[def]` *-- no effect, we still have `[def]`*

The order only really matters when difference or intersection operations
are present, as the union operation is commutative. To make sure that the '-' is
the main operator, add brackets to group the operations as desired, such as
`[[ace][bdf] - [[abc][def]]]`.

Another caveat with the '&' and '-' operators is that they operate between
**sets**. That is, they must be immediately preceded and immediately followed by
a set. For example, the pattern `[[:Lu:]-A]` is illegal, since it is
interpreted as the set `[:Lu:]` followed by the incomplete range `-A`. To specify
the set of uppercase letters except for 'A', enclose the 'A' in a set:
`[[:Lu:]-[A]]`.
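To make the left-to-right evaluation described in this section concrete, the following minimal sketch replays the four steps for `[[ace][bdf] - [abc][def]]` and then the grouped form `[[ace][bdf] - [[abc][def]]]`. It deliberately models the sets with `java.util.TreeSet` so that it runs without ICU on the classpath; the class and method names (`PrecedenceDemo`, `chars`, `show`) are illustrative and not part of ICU. With ICU4J, the same steps would be `addAll` and `removeAll` calls on `UnicodeSet`, as listed in the table of set operations.

```java
import java.util.TreeSet;

class PrecedenceDemo {
    // Turns a string such as "ace" into the set {a, c, e}.
    static TreeSet<Character> chars(String s) {
        TreeSet<Character> set = new TreeSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Joins the set members back into a string, in sorted order.
    static String show(TreeSet<Character> set) {
        StringBuilder sb = new StringBuilder();
        for (char c : set) {
            sb.append(c);
        }
        return sb.toString();
    }

    // [[ace][bdf] - [abc][def]] evaluated strictly left to right.
    static String leftToRight() {
        TreeSet<Character> s = chars("ace");
        s.addAll(chars("bdf"));     // union: abcdef
        s.removeAll(chars("abc"));  // difference: def
        s.addAll(chars("def"));     // union: still def
        return show(s);
    }

    // [[ace][bdf] - [[abc][def]]]: the right-hand operand is grouped first.
    static String grouped() {
        TreeSet<Character> left = chars("ace");
        left.addAll(chars("bdf"));
        TreeSet<Character> right = chars("abc");
        right.addAll(chars("def"));
        left.removeAll(right);      // abcdef minus abcdef leaves nothing
        return show(left);
    }

    public static void main(String[] args) {
        System.out.println("left to right: [" + leftToRight() + "]");
        System.out.println("grouped:       [" + grouped() + "]");
    }
}
```

The first line printed shows `[def]` and the second shows an empty set, which is the difference the extra pair of brackets makes.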

### Examples

| Pattern | Description |
|------------------|------------------------------------------------------------|
| `[a]` | The set containing 'a' |
| `[a-z]` | The set containing 'a' through 'z' and all letters in between, in Unicode order |
| `[^a-z]` | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
| `[[pat1][pat2]]` | The union of sets specified by pat1 and pat2 |
| `[[pat1]& [pat2]]` | The intersection of sets specified by pat1 and pat2 |
| `[[pat1]- [pat2]]` | The asymmetric difference of sets specified by pat1 and pat2 |
| `[:Lu:]` | The set of characters belonging to the given Unicode category, as defined by `Character.getType()`; in this case, Unicode uppercase letters. The long form for this is `[:UppercaseLetter:]`. |
| `[:L:]` | The set of characters belonging to all Unicode categories starting with 'L', that is, `[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]`. The long form for this is `[:Letter:]`. |

### String Values in Sets

String values are enclosed in {curly brackets}.

| Set expression | Description |
|------------------|------------------------------------------------------------|
| `[abc{def}]` | A set containing four members: the single characters a, b, and c, and the string “def”. |
| `[{abc}{def}]` | A set containing two members, the string “abc” and the string “def”. |
| `[{a}{b}{c}]` `[abc]` | These two sets are equivalent. Each contains three items, the three individual characters a, b and c. A {string} containing a single character is equivalent to that same character specified in any other way. |
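The rows of this table can be checked directly against the API. The sketch below assumes ICU4J is on the classpath and uses only the `UnicodeSet(String)` constructor, `size()`, `contains()`, and `equals()`; the class name `StringValueDemo` is illustrative.

```java
import com.ibm.icu.text.UnicodeSet;

class StringValueDemo {
    // Three single characters plus one multi-character string value.
    static final UnicodeSet MIXED = new UnicodeSet("[abc{def}]");

    public static void main(String[] args) {
        // Number of members: the characters a, b, c and the string "def".
        System.out.println(MIXED.size());
        // The string is a member of the set...
        System.out.println(MIXED.contains("def"));
        // ...but the characters inside the string are not.
        System.out.println(MIXED.contains('d'));
        // A one-character {string} is the same as the bare character.
        System.out.println(new UnicodeSet("[{a}{b}{c}]").equals(new UnicodeSet("[abc]")));
    }
}
```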

### Character Quoting and Escaping in Unicode Set Patterns

#### Single Quote

Two adjacent single quotes represent a single quote, either inside or outside
single quotes.

Text within single quotes is not interpreted in any way (except for two adjacent
single quotes). It is taken as literal text (special characters become
non-special).

These quoting conventions for ICU UnicodeSets differ from those of regular
expression character set expressions. In regular expressions, single quotes have
no special meaning and are treated like any other literal character.

#### Backslash Escapes

Outside of single quotes, certain backslashed characters have special meaning:

| Escape | Meaning |
|------------|----------------------------------------|
| `\uhhhh` | Exactly 4 hex digits; h in [0-9A-Fa-f] |
| `\Uhhhhhhhh` | Exactly 8 hex digits |
| `\xhh` | 1-2 hex digits |
| `\ooo` | 1-3 octal digits; o in [0-7] |
| `\a` | U+0007 (BELL) |
| `\b` | U+0008 (BACKSPACE) |
| `\t` | U+0009 (HORIZONTAL TAB) |
| `\n` | U+000A (LINE FEED) |
| `\v` | U+000B (VERTICAL TAB) |
| `\f` | U+000C (FORM FEED) |
| `\r` | U+000D (CARRIAGE RETURN) |
| `\\` | U+005C (BACKSLASH) |

Anything else following a backslash is mapped to itself, except in an
environment where it is defined to have some special meaning. For example,
`\p{Lu}` is the set of uppercase letters in UnicodeSet.

Any character formed as the result of a backslash escape loses any special
meaning and is treated as a literal. In particular, note that `\u` and `\U`
escapes create literal characters. (In contrast, the Java compiler treats
Unicode escapes as just a way to represent arbitrary characters in an ASCII
source file, and any resulting characters are **not** tagged as literals.)

#### Whitespace

Whitespace (as defined by our API) is ignored unless it is quoted or
backslashed.

> :point_right: **Note**: *The rules for quoting and white space handling are common to most ICU APIs that
process rule or expression strings, including UnicodeSet, Transliteration and
Break Iterators.*

> :point_right: **Note**: *ICU Regular Expression set expressions have a different (but similar) syntax,
and a different set of recognized backslash escapes. \[Sets\] in ICU Regular
Expressions follow the conventions from Perl and Java regular expressions rather
than the pattern syntax from ICU UnicodeSet.*

## Using a UnicodeSet

For best performance, once the set contents are complete, `freeze()` the set to
make it immutable and to speed up `contains()` and `span()` operations (for which it
builds a small additional data structure).

The most basic operation is `contains(code point)` or, if relevant,
`contains(string)`.

For splitting and partitioning strings, it is simpler and faster to use `span()`
and `spanBack()` than to iterate over code points and call `contains()`. In
Java, there is also a class `UnicodeSetSpanner` for somewhat higher-level
operations. See also the “Lookup” section of the [Properties](properties.md)
chapter.
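As a minimal sketch of these operations, assuming ICU4J is on the classpath, the example below freezes a small set and then uses `contains()` and `span()` with `UnicodeSet.SpanCondition.CONTAINED`. The class name `SpanDemo` and the helper `leadingDigits` are illustrative and not part of ICU.

```java
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.UnicodeSet.SpanCondition;

class SpanDemo {
    // Build the set once, then freeze it: it becomes immutable and
    // contains()/span() can use the extra lookup structure.
    static final UnicodeSet DIGITS = new UnicodeSet("[0-9]").freeze();

    // Returns the length of the longest prefix of s that consists
    // only of members of the set.
    static int leadingDigits(String s) {
        return DIGITS.span(s, SpanCondition.CONTAINED);
    }

    public static void main(String[] args) {
        System.out.println(DIGITS.contains('7'));      // code point lookup
        System.out.println(DIGITS.contains("7"));      // string lookup
        System.out.println(leadingDigits("2024-05"));  // index of the first non-digit
    }
}
```

For the input `"2024-05"` the span stops at the hyphen, so `leadingDigits` returns 4 without the caller iterating over code points.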

## Programmatically Building UnicodeSets

ICU users can programmatically build a UnicodeSet by adding or removing ranges
of characters or by using the retain (intersection), remove (difference), and
add (union) operations.
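The following sketch, assuming ICU4J is on the classpath, builds a set step by step with the operations named above. The particular ranges are arbitrary and chosen only for illustration, as is the class name `BuildDemo`.

```java
import com.ibm.icu.text.UnicodeSet;

class BuildDemo {
    static UnicodeSet build() {
        UnicodeSet set = new UnicodeSet();           // starts out empty
        set.add('a', 'z');                           // add the range a-z
        set.remove('x');                             // remove a single code point
        set.retainAll(new UnicodeSet("[a-m x-z]"));  // intersection
        set.removeAll(new UnicodeSet("[aeiou]"));    // difference
        set.addAll(new UnicodeSet("[0-9]"));         // union
        set.add("ch");                               // add a string value
        return set;
    }

    public static void main(String[] args) {
        UnicodeSet set = build();
        System.out.println(set.toPattern(false));    // pattern form of the result
        System.out.println(set.size());              // code points plus strings
    }
}
```

Each call modifies the set in place, so the order of the calls matters in the same way the order of operators matters in a pattern.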

## Property Values

The following property value variants are recognized:

| Format | Description | Example |
|--------|------------------------------------------------------------|-----------------------------------|
| short | omits the type (the type is what prevents ambiguity, so this form is only allowed with the Category and Script properties) | Lu |
| medium | uses an abbreviated type and value | gc=Lu |
| long | uses a full type and value | General_Category=Uppercase_Letter |

If the type or value is omitted, then the equals sign is also omitted. The short
style is only used for Category and Script properties because these properties
are very common and omitting their type is unambiguous.

In actual practice, you can mix type names and values that are omitted,
abbreviated, or full. For example, for Category=Unassigned you could use the
forms shown in the table, or mixed forms such as `\p{gc=Unassigned}`,
`\p{Category=Cn}`, or `\p{Unassigned}`.

When these are processed, case and whitespace are ignored so you may use them
for clarity, if desired. For example, `\p{Category = Uppercase Letter}` or
`\p{Category = uppercase letter}`.

For a list of supported properties, see the [Properties](properties.md) chapter.
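The case- and whitespace-insensitive matching of names described in this section can be pictured as comparing normalized keys. The sketch below is a plain-Java illustration of that idea and not ICU's implementation; the class and method names are hypothetical, and treating underscores and hyphens as ignorable too is an assumption taken from the loose matching rules in UAX #44.

```java
import java.util.Locale;

class LooseNameDemo {
    // Reduces a property name or value to a comparison key by dropping
    // whitespace, underscores, and hyphens, then lowercasing.
    // Assumption: underscores and hyphens are ignored like whitespace.
    static String key(String name) {
        return name.replaceAll("[\\s_-]", "").toLowerCase(Locale.ROOT);
    }

    // Two spellings name the same property or value if their keys match.
    static boolean sameName(String a, String b) {
        return key(a).equals(key(b));
    }

    public static void main(String[] args) {
        System.out.println(sameName("Uppercase Letter", "uppercase_letter"));
        System.out.println(sameName("General_Category", "general category"));
        System.out.println(sameName("Lu", "Ll"));
    }
}
```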

## Getting UnicodeSet from Script

ICU can build a UnicodeSet from script codes. Here is an
example of generating a pattern from all the scripts that are associated with a
Locale and then getting the UnicodeSet based on the generated pattern.

**In C++:**

    #include <stdio.h>
    #include "unicode/uniset.h"
    #include "unicode/unistr.h"
    #include "unicode/uscript.h"

    using namespace icu;

    int main() {
        UErrorCode err = U_ZERO_ERROR;
        const int32_t capacity = 10;
        const int32_t strLength = 4;  // script short names are four letters
        UChar32 c = 0x3096;           // HIRAGANA LETTER SMALL KE
        UScriptCode script[capacity] = {USCRIPT_INVALID_CODE};

        // Get the script codes associated with the locale.
        int32_t num = uscript_getCode("ja", script, capacity, &err);
        if (U_FAILURE(err)) {
            printf("uscript_getCode failed: %s\n", u_errorName(err));
            return 1;
        }
        printf("Number of script codes associated: %d\n", num);

        // Concatenating sets forms their union, so no separator is needed.
        UnicodeString pattern("[", 1, US_INV);
        for (int32_t j = 0; j < num; j++) {
            const char *shortname = uscript_getShortName(script[j]);
            UnicodeString str(shortname, strLength, US_INV);
            pattern.append("[:");
            pattern.append(str);
            pattern.append(":]");
        }
        pattern.append("]");

        UnicodeSet cnvSet(pattern, err);
        if (U_FAILURE(err)) {
            printf("UnicodeSet construction failed: %s\n", u_errorName(err));
            return 1;
        }
        printf("%d\n", cnvSet.size());
        printf("%d\n", cnvSet.contains(c));
        return 0;
    }

**In Java:**

    import com.ibm.icu.lang.UScript;
    import com.ibm.icu.text.UnicodeSet;
    import com.ibm.icu.util.ULocale;

    ULocale ul = new ULocale("ja");
    // Get the script codes associated with the locale.
    int[] script = UScript.getCode(ul);
    // Concatenating sets forms their union, so no separator is needed.
    StringBuilder str = new StringBuilder("[");
    for (int i = 0; i < script.length; i++) {
        str.append("[:").append(UScript.getShortName(script[i])).append(":]");
    }
    String pattern = str.append("]").toString();
    System.out.println(pattern);
    UnicodeSet ucs = new UnicodeSet(pattern);
    System.out.println(ucs.size());
    System.out.println(ucs.contains(0x3096));  // HIRAGANA LETTER SMALL KE