---
layout: default
title: UnicodeSet
nav_order: 5
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UnicodeSet

## Overview

A UnicodeSet is an object that represents a set of Unicode characters or
character strings. The contents of that object can be specified either by
patterns or by building them programmatically.

Here are a few examples of sets:

| Pattern | Description |
|--------------|-------------------------------------------------------------|
| `[a-z]` | The lowercase letters a through z |
| `[abc123]` | The six characters a, b, c, 1, 2, and 3 |
| `[\p{Letter}]` | All characters with the Unicode General Category of Letter. |

### String Values

In addition to being a set of characters (that is, of Unicode code points),
a UnicodeSet may also contain string values. Conceptually, the UnicodeSet is
always a set of strings, not a set of characters, although in many common use
cases the strings are all of length one, which reduces to being a set of
characters.

This concept can be confusing when first encountered, probably because similar
set constructs from other environments
(e.g., character classes in most regular expression implementations)
can only contain characters.

Until ICU 68, it was not possible for a UnicodeSet to contain the empty string.
In Java, an exception was thrown. In C++, the empty string was silently ignored.

Starting with ICU 69 ([ICU-13702](https://unicode-org.atlassian.net/browse/ICU-13702)),
the empty string is supported as a set element;
however, it is ignored in matching functions such as `span(string)`.

## UnicodeSet Patterns

A pattern is a series of characters bounded by square brackets that contains
lists of characters and Unicode property sets. A list is a sequence of
characters that may include ranges, indicated by a '-' between two characters, as in
"a-z". A range specifies all characters from the left character to the
right character, in Unicode order. For example, `[a c d-f m]` is equivalent to `[a c d e f m]`.
Whitespace can be used freely for clarity: `[a c d-f m]` means the same
as `[acd-fm]`.

Unicode property sets are specified by a Unicode property, such as `[:Letter:]`.
For a list of supported properties, see the [Properties](properties.md) chapter.
For details on the use of short vs. long property and property value names, see
the end of this section. The syntax for specifying the property names is an
extension of either POSIX or Perl syntax with the addition of "=value". For
example, you can match letters by using the POSIX syntax `[:Letter:]`, or by
using the Perl-style syntax `\p{Letter}`. The type can be omitted for the
Category and Script properties, but is required for other properties.

The table below shows the two kinds of syntax: POSIX and Perl style. The
table also shows the "Negative" form, which excludes all characters of
a given kind. For example, `[:^Letter:]` matches all characters that are not
in `[:Letter:]`.

|  | Positive | Negative |
|--------------------|------------------|-------------------|
| POSIX-style Syntax | `[:type=value:]` | `[:^type=value:]` |
| Perl-style Syntax  | `\p{type=value}` | `\P{type=value}`  |

These low-level lists or properties can then be freely combined with
the normal set operations (union, inverse, difference, and intersection):

| Operation | Example | Corresponding Method | Meaning |
|-------|-------------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| A B | `[[:letter:] [:number:]]` | `A.addAll(B)` | To union two sets A and B, simply concatenate them. |
| A & B | `[[:letter:] & [a-z]]` | `A.retainAll(B)` | To intersect two sets A and B, use the '&' operator. |
| A - B | `[[:letter:] - [a-z]]` | `A.removeAll(B)` | To take the set-difference of two sets A and B, use the '-' operator. |
| [^A] | `[^a-z]` | `A.complement()` | To invert a set A, place a '^' immediately after the opening '['. Note that the complement only affects code points, not string values. In any other location, the '^' does not have a special meaning. |

### Precedence

The binary operators of union, intersection, and set-difference have equal
precedence and bind left-to-right. Thus the following are equivalent:

*   `[[:letter:] - [a-z] [:number:] & [\u0100-\u01FF]]`
*   `[[[[:letter:] - [a-z]] [:number:]] & [\u0100-\u01FF]]`

Another example is that the set `[[ace][bdf] - [abc][def]]` is **not**
the empty set, but instead the set `[def]`. That is because the syntax
corresponds to the following UnicodeSet operations:

1.  start with `[ace]`
2.  addAll `[bdf]` *-- we now have `[abcdef]`*
3.  removeAll `[abc]` *-- we now have `[def]`*
4.  addAll `[def]` *-- no effect, we still have `[def]`*

This only really matters for the difference and intersection
operations, as the union operation is commutative. To make sure that the '-' is
the main operator, add brackets to group the operations as desired, such as
`[[ace][bdf] - [[abc][def]]]`.

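The left-to-right evaluation can be traced with a small plain-Java model. This is only a sketch: it uses `java.util.TreeSet<Character>` as a stand-in for UnicodeSet (the class and helper names are hypothetical, and no ICU code is involved), relying on the fact that the JDK collection methods `addAll` and `removeAll` have the same set semantics as the UnicodeSet methods named in the steps above.

```java
import java.util.TreeSet;

public class PrecedenceSketch {
    // Builds a sorted set holding each character of the given string.
    static TreeSet<Character> chars(String s) {
        TreeSet<Character> set = new TreeSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Joins the set members into a string, in sorted order.
    static String join(TreeSet<Character> set) {
        StringBuilder sb = new StringBuilder();
        for (char c : set) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Models [[ace][bdf] - [abc][def]]: operators applied strictly left to right.
    static String leftToRight() {
        TreeSet<Character> result = chars("ace");
        result.addAll(chars("bdf"));     // now abcdef
        result.removeAll(chars("abc"));  // now def
        result.addAll(chars("def"));     // still def
        return join(result);
    }

    // Models [[ace][bdf] - [[abc][def]]]: the right-hand side is grouped first.
    static String grouped() {
        TreeSet<Character> left = chars("ace");
        left.addAll(chars("bdf"));       // abcdef
        TreeSet<Character> right = chars("abc");
        right.addAll(chars("def"));      // abcdef
        left.removeAll(right);           // nothing remains
        return join(left);
    }

    public static void main(String[] args) {
        System.out.println("left to right: [" + leftToRight() + "]");
        System.out.println("grouped:       [" + grouped() + "]");
    }
}
```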
Another caveat with the '&' and '-' operators is that they operate between
**sets**. That is, they must be immediately preceded and immediately followed by
a set. For example, the pattern `[[:Lu:]-A]` is illegal, since it is
interpreted as the set `[:Lu:]` followed by the incomplete range `-A`. To specify
the set of uppercase letters except for 'A', enclose the 'A' in a set:
`[[:Lu:]-[A]]`.

### Examples

| Pattern | Description |
|------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `[a]` | The set containing 'a' |
| `[a-z]` | The set containing 'a' through 'z' and all letters in between, in Unicode order |
| `[^a-z]` | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
| `[[pat1][pat2]]` | The union of sets specified by pat1 and pat2 |
| `[[pat1]& [pat2]]` | The intersection of sets specified by pat1 and pat2 |
| `[[pat1]- [pat2]]` | The asymmetric difference of sets specified by pat1 and pat2 |
| `[:Lu:]` | The set of characters belonging to the given Unicode category, as defined by `Character.getType()`; in this case, Unicode uppercase letters. The long form for this is `[:UppercaseLetter:]`. |
| `[:L:]` | The set of characters belonging to all Unicode categories starting with 'L', that is, `[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]`. The long form for this is `[:Letter:]`. |

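To make the category rows more concrete, here is a minimal sketch that checks General Category membership with the JDK's own `Character.getType(int)`. It is only an approximation of `[:Lu:]` and `[:L:]`: the class and method names are hypothetical, no ICU code is used, and the JDK may implement a different Unicode version than ICU, so results can differ for recently added characters.

```java
public class CategorySketch {
    // Rough stand-in for membership in [:Lu:], based on JDK Unicode data.
    static boolean isUppercaseLetter(int cp) {
        return Character.getType(cp) == Character.UPPERCASE_LETTER;
    }

    // Rough stand-in for membership in [:L:], i.e. [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]].
    static boolean isLetter(int cp) {
        switch (Character.getType(cp)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        int[] samples = {'A', 'a', 0x01C5, 0x3042, '1'};
        for (int cp : samples) {
            System.out.printf("U+%04X Lu=%b L=%b%n",
                    cp, isUppercaseLetter(cp), isLetter(cp));
        }
    }
}
```

U+01C5 is a titlecase letter (Lt), so it falls in `[:L:]` but not in `[:Lu:]`.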
### String Values in Sets

String values are enclosed in {curly brackets}.

| Set expression | Description |
|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `[abc{def}]` | A set containing four members, the single characters a, b and c, and the string “def” |
| `[{abc}{def}]` | A set containing two members, the string “abc” and the string “def”. |
| `[{a}{b}{c}]` `[abc]` | These two sets are equivalent. Each contains three items, the three individual characters a, b and c. A {string} containing a single character is equivalent to that same character specified in any other way. |

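The "set of strings" view can be illustrated with a toy parser written in plain Java. This is a sketch under strong assumptions, not ICU's parser: the class and method names are hypothetical, and it handles only literal characters and `{strings}` inside a single pair of brackets (no ranges, properties, escapes, quoting, or whitespace).

```java
import java.util.TreeSet;

public class StringValueSketch {
    // Toy parser: every member, whether a character or a {string}, is stored as a String.
    static TreeSet<String> parse(String pattern) {
        TreeSet<String> members = new TreeSet<>();
        // Drop the enclosing '[' and ']'.
        String body = pattern.substring(1, pattern.length() - 1);
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '{') {
                int close = body.indexOf('}', i);
                if (close < 0) {
                    throw new IllegalArgumentException("unclosed '{' in " + pattern);
                }
                members.add(body.substring(i + 1, close));
                i = close + 1;
            } else {
                members.add(String.valueOf(c));
                i++;
            }
        }
        return members;
    }

    public static void main(String[] args) {
        System.out.println(parse("[abc{def}]"));
        System.out.println(parse("[{abc}{def}]"));
        System.out.println(parse("[{a}{b}{c}]").equals(parse("[abc]")));
    }
}
```

Because single characters are stored as one-character strings, `[{a}{b}{c}]` and `[abc]` produce equal sets in this model, which mirrors the equivalence described in the last table row.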
### Character Quoting and Escaping in Unicode Set Patterns

#### Single Quote

Two single quotes represent a single quote, either inside or outside single
quotes.

Text within single quotes is not interpreted in any way (except for two adjacent
single quotes). It is taken as literal text (special characters become
non-special).

These quoting conventions for ICU UnicodeSets differ from those of regular
expression character set expressions. In regular expressions, single quotes have
no special meaning and are treated like any other literal character.

#### Backslash Escapes

Outside of single quotes, certain backslashed characters have special meaning:

| Escape | Meaning |
|------------|----------------------------------------|
| `\uhhhh` | Exactly 4 hex digits; h in [0-9A-Fa-f] |
| `\Uhhhhhhhh` | Exactly 8 hex digits |
| `\xhh` | 1-2 hex digits |
| `\ooo` | 1-3 octal digits; o in [0-7] |
| `\a` | U+0007 (BELL) |
| `\b` | U+0008 (BACKSPACE) |
| `\t` | U+0009 (HORIZONTAL TAB) |
| `\n` | U+000A (LINE FEED) |
| `\v` | U+000B (VERTICAL TAB) |
| `\f` | U+000C (FORM FEED) |
| `\r` | U+000D (CARRIAGE RETURN) |
| `\\` | U+005C (BACKSLASH) |

Anything else following a backslash is mapped to itself, except in an
environment where it is defined to have some special meaning. For example,
`\p{Lu}` is the set of uppercase letters in UnicodeSet.

Any character formed as the result of a backslash escape loses any special
meaning and is treated as a literal. In particular, note that `\u` and `\U`
escapes create literal characters. (In contrast, the Java compiler treats
Unicode escapes as just a way to represent arbitrary characters in an ASCII
source file, and any resulting characters are **not** tagged as literals.)

#### Whitespace

Whitespace (as defined by our API) is ignored unless it is quoted or
backslashed.

> :point_right: **Note**: *The rules for quoting and white space handling are common to most ICU APIs that
> process rule or expression strings, including UnicodeSet, Transliteration and
> Break Iterators.*

> :point_right: **Note**: *ICU Regular Expression set expressions have a different (but similar) syntax,
> and a different set of recognized backslash escapes. \[Sets\] in ICU Regular
> Expressions follow the conventions from Perl and Java regular expressions rather
> than the pattern syntax from ICU UnicodeSet.*

## Using a UnicodeSet

For best performance, once the set contents are complete, call `freeze()` on the set to
make it immutable and to speed up `contains()` and `span()` operations (for which it
builds a small additional data structure).

The most basic operation is `contains(code point)` or, if relevant,
`contains(string)`.

For splitting and partitioning strings, it is simpler and faster to use `span()`
and `spanBack()` than to iterate over code points and call `contains()`. In
Java, there is also a class `UnicodeSetSpanner` for somewhat higher-level
operations. See also the “Lookup” section of the [Properties](properties.md)
chapter.

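As an illustration of what span-style partitioning computes, the following plain-Java sketch splits a string into alternating runs of in-set and out-of-set text. It is a model only, with several assumptions: the `span` helper here is hypothetical and is not the ICU method, the set is represented by an `IntPredicate` over code points (so string members are not handled), and none of the performance benefits of a frozen UnicodeSet apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class SpanSketch {
    // Returns the index just past the run of code points, starting at 'start',
    // for which inSet.test(cp) equals 'wanted'.
    static int span(String s, int start, IntPredicate inSet, boolean wanted) {
        int i = start;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (inSet.test(cp) != wanted) {
                break;
            }
            i += Character.charCount(cp);
        }
        return i;
    }

    // Splits s into alternating runs of in-set and out-of-set text.
    static List<String> partition(String s, IntPredicate inSet) {
        List<String> runs = new ArrayList<>();
        int pos = 0;
        boolean wanted = true;
        while (pos < s.length()) {
            int end = span(s, pos, inSet, wanted);
            if (end > pos) {
                runs.add(s.substring(pos, end));
            }
            pos = end;
            wanted = !wanted;  // alternate between "in the set" and "not in the set"
        }
        return runs;
    }

    public static void main(String[] args) {
        IntPredicate lower = cp -> cp >= 'a' && cp <= 'z';  // stands in for [a-z]
        System.out.println(partition("abc123def", lower));  // prints [abc, 123, def]
    }
}
```

Each call to the helper consumes a whole run at once, which is the shape of loop that `span()` and `spanBack()` make possible without a per-code-point `contains()` call in user code.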
## Programmatically Building UnicodeSets

ICU users can programmatically build a UnicodeSet by adding or removing ranges
of characters or by using the retain (intersection), remove (difference), and
add (union) operations.

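The effect of these operations on ranges can be modeled with `java.util.BitSet`, where each set bit represents one code point. This is a sketch of the semantics only: the class and helper names are hypothetical, no ICU code is used, and the `or`, `and`, and `andNot` calls merely play the roles of add (union), retain (intersection), and remove (difference).

```java
import java.util.BitSet;

public class BuildSketch {
    // Creates a set holding every code point from start to end, inclusive.
    static BitSet range(int start, int end) {
        BitSet set = new BitSet();
        set.set(start, end + 1);  // BitSet.set takes an exclusive upper bound
        return set;
    }

    // Renders the members as a string, in code point order.
    static String render(BitSet set) {
        StringBuilder sb = new StringBuilder();
        for (int cp = set.nextSetBit(0); cp >= 0; cp = set.nextSetBit(cp + 1)) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    static String build() {
        BitSet set = range('a', 'z');   // start with the range a-z
        set.or(range('0', '9'));        // add (union): a-z plus 0-9
        set.and(range('0', 'm'));       // retain (intersection): keep only 0..m
        set.andNot(range('c', 'k'));    // remove (difference): drop c-k
        return render(set);
    }

    public static void main(String[] args) {
        System.out.println(build());    // prints 0123456789ablm
    }
}
```

The order of the calls matters in the same way as the order of operators in a pattern: each operation is applied to the result of the previous ones.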
## Property Values

The following property value variants are recognized:

| Format | Description | Example |
|--------|-----------------------------------------------------------------------------------------------------|-----------------------------------|
| short | omits the type (allowed only with the Category and Script properties, to prevent ambiguity) | Lu |
| medium | uses an abbreviated type and value | gc=Lu |
| long | uses a full type and value | General_Category=Uppercase_Letter |

If the type or value is omitted, then the equals sign is also omitted. The short
style is only used for Category and Script properties because these properties
are very common and their omission is unambiguous.

In actual practice, you can mix type names and values that are omitted,
abbreviated, or full. For example, for Category=Unassigned you could use any of
the forms shown in the table, or a mixed form such as `\p{gc=Unassigned}`,
`\p{Category=Cn}`, or `\p{Unassigned}`.

When these are processed, case and whitespace are ignored, so you may use them
for clarity if desired. For example, `\p{Category = Uppercase Letter}` or
`\p{Category = uppercase letter}`.

For a list of supported properties, see the [Properties](properties.md) chapter.

## Getting UnicodeSet from Script

ICU can build a UnicodeSet from script codes. Here is an
example of generating a pattern from all the scripts that are associated with a
Locale and then getting the UnicodeSet based on the generated pattern.

**In C++:**

    #include <cstdio>
    #include "unicode/uniset.h"
    #include "unicode/unistr.h"
    #include "unicode/uscript.h"

    using icu::UnicodeSet;
    using icu::UnicodeString;

    int main() {
        UErrorCode err = U_ZERO_ERROR;
        const int32_t capacity = 10;
        const int32_t strLength = 4;  // script short names are four letters long
        UChar32 c = 0x00003096;
        UScriptCode script[10] = {USCRIPT_INVALID_CODE};

        // Get the script codes associated with the "ja" locale.
        int32_t num = uscript_getCode("ja", script, capacity, &err);
        if (U_FAILURE(err)) {
            return 1;
        }
        printf("Number of associated script codes: %d\n", num);

        // Build a pattern of the form "[[:xxxx:][:yyyy:]]". Adjacent sets are
        // unioned, so no separator is needed ('+' would be a literal member).
        UnicodeString pattern("[", 1, US_INV);
        for (int32_t j = 0; j < num; j++) {
            const char *shortname = uscript_getShortName(script[j]);
            UnicodeString str(shortname, strLength, US_INV);
            pattern.append("[:");
            pattern.append(str);
            pattern.append(":]");
        }
        pattern.append("]");

        UnicodeSet cnvSet(pattern, err);
        if (U_FAILURE(err)) {
            return 1;
        }
        printf("%d\n", cnvSet.size());
        printf("%d\n", cnvSet.contains(c));
        return 0;
    }

**In Java:**

    // Requires: com.ibm.icu.lang.UScript, com.ibm.icu.text.UnicodeSet,
    // com.ibm.icu.util.ULocale
    ULocale ul = new ULocale("ja");
    int[] script = UScript.getCode(ul);
    // Build a pattern of the form "[[:xxxx:][:yyyy:]]". Adjacent sets are
    // unioned, so no separator is needed ('+' would be a literal member).
    StringBuilder str = new StringBuilder("[");
    for (int i = 0; i < script.length; i++) {
        str.append("[:").append(UScript.getShortName(script[i])).append(":]");
    }
    String pattern = str.append("]").toString();
    System.out.println(pattern);
    UnicodeSet ucs = new UnicodeSet(pattern);
    System.out.println(ucs.size());
    System.out.println(ucs.contains(0x00003096));