---
layout: default
title: String Search
nav_order: 4
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# String Search Service
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

String searching, also known as string matching, is an important subject in
the wider domain of text processing and analysis. Many software applications
rely on the basic string search functions that most operating systems and
runtime libraries provide. With the popularity of the Internet, the quantity of
available data from different parts of the world has increased dramatically
within a short time. Therefore, a string search algorithm that is
language-aware has become more important. A bitwise match that uses the
`u_strstr` (C), `UnicodeString::indexOf` (C++) or `String.indexOf` (Java) APIs
will not yield the correct result for a particular language's requirements,
because all the issues that are important to language-sensitive collation also
apply to text searching. The following issues are applicable to text
searching:

1.  Accented letters\
    In English, accents are treated as minor variations of a letter. In French,
    accented letters have much more significance as they can actually change the
    meaning of a word. Very often, an accented letter is actually a distinct
    letter. For example, letter 'å' (\\u00e5) may be just a letter 'a' with an
    accent symbol to English speakers. However, it is actually a distinct letter
    in Danish; in Danish, searching for 'a' should generally not match 'å' and
    vice versa. In some cases, such as in traditional German, an accented letter
    is short-hand for something longer. In sorting, an 'ä' (\\u00e4) is treated
    as 'ae'. Note that primary- and secondary-level distinctions for *searching*
    may not be the same as those for sorting; in ICU, many languages provide a
    special "search" collator with the appropriate level settings for search.

2.  Conjoined letters\
    Special handling is required when a single letter is treated as equivalent
    to two distinct letters and vice versa. For example, in German, the letter
    'ß' (\\u00df) is treated as 'ss' in sorting. Also, in most languages, 'æ'
    (\\u00e6) is considered equivalent to the letter 'a' followed by the letter
    'e'. Conversely, some letter sequences are treated as distinct letters in
    their own right. For example, 'ch' is treated as a distinct letter between
    the letter 'c' and the letter 'd' in traditional Spanish.

3.  Ignorable punctuation\
    As in collation, it is important that the user is able to choose to ignore
    punctuation symbols when searching for a pattern in a string. For example,
    a user may search for "blackbird" and want to include entries such as
    "black-bird".
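
The gap between a bitwise match and a language-sensitive comparison can be
sketched in a few lines of Java. To stay runnable without ICU4J on the
classpath, this sketch assumes the JDK's `java.text.Collator` as a stand-in for
an ICU collator; the class name `BitwiseVsCollation` and the sample strings are
illustrative only.

```java
import java.text.Collator;
import java.util.Locale;

final class BitwiseVsCollation {
    /** Bitwise search: true only if the exact code units of pattern occur in text. */
    static boolean bitwiseContains(String text, String pattern) {
        return text.indexOf(pattern) >= 0;
    }

    /** Language-sensitive equality at primary strength (accent and case differences ignored). */
    static boolean primaryEquals(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // U+00E9 is the letter e with an acute accent.
        String accented = "r\u00e9sum\u00e9";
        System.out.println(bitwiseContains(accented, "resume"));
        System.out.println(primaryEquals(accented, "resume", Locale.ENGLISH));
    }
}
```

With the JDK's English rules, the bitwise search is expected to find nothing,
while the primary-strength comparison is expected to treat the two spellings as
equal.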

## ICU String Search Model

The ICU string search service provides APIs similar to those of the other text
iterating services, allowing users to specify the starting position and
direction within the text string to be searched. For more information, please
see the [Boundary Analysis](../boundaryanalysis/index.md) chapter. The user can
locate one or all occurrences of a pattern in a string. For a given collator, a
pattern match is located at the offsets <start, end> in a string if the
collator finds that the sub-string between start and end is equal to the
pattern.

The string search service supports two different types of canonical match
behavior.

Let S' be the sub-string of a text string S between the offsets start and end
<start, end>.
A pattern string P matches a text string S at the offsets <start, end> if

1.  option 1. P matches some canonical equivalent string of S'. Suppose the
    collator used for searching has tertiary collation strength, so that all
    accents are non-ignorable. If the pattern "a\\u0300" is searched in the
    target text "a\\u0325\\u0300", a match will be found, since the target text
    is canonically equivalent to "a\\u0300\\u0325".

2.  option 2. P matches S' and if P starts or ends with a combining mark, there
    exists no non-ignorable combining mark before or after S' in S respectively.
    Following the example above, the pattern "a\\u0300" will not find a match in
    "a\\u0325\\u0300", since there exists a non-ignorable accent '\\u0325'
    between 'a' and '\\u0300'. Even with a target text of "a\\u0300\\u0325" a
    match will not be found because of the non-ignorable trailing accent
    \\u0325.
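
The canonical equivalence that option 1 relies on can be checked independently
of the search service. The following sketch uses the JDK's
`java.text.Normalizer` (an assumption made so the example runs without ICU4J;
ICU provides its own normalization API) to confirm that the two spellings from
the example above decompose to the same sequence. The class and method names
are illustrative.

```java
import java.text.Normalizer;

final class CanonicalEquivalence {
    /** Two strings are canonically equivalent if their NFD forms are identical. */
    static boolean canonicallyEquivalent(String a, String b) {
        String nfdA = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nfdB = Normalizer.normalize(b, Normalizer.Form.NFD);
        return nfdA.equals(nfdB);
    }

    public static void main(String[] args) {
        String target = "a\u0325\u0300";    // a + ring below + grave
        String reordered = "a\u0300\u0325"; // a + grave + ring below
        System.out.println(canonicallyEquivalent(target, reordered)); // true
        System.out.println(target.contains("a\u0300"));               // false
    }
}
```

The second line shows why a bitwise search for "a\\u0300" fails on this target
even though a canonically equivalent spelling of the target starts with the
pattern.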

One restriction is to be noted for option 1. Currently there are no composite
characters that consist of a character with combining class greater than 0
before a character with combining class equal to 0. However, if such a
character exists in the future, the string search service may not work correctly
with option 1 when such characters are encountered.

Furthermore, option 1 could generate more than one "encompassing" match. For
example, in Danish, 'å' (\\u00e5) and 'aa' are considered equivalent. So the
pattern "baad" will match "a--båd--man" (a--b\\u00e5d--man) at the start offset
3 and the end offset 5. However, the start offset could also be 1 or 2 and the
end offset could also be 6 or 7, because "-" (hyphen) is ignorable for certain
collations. The ICU implementation always returns the offsets of the shortest
matching sub-string. To be more exact, the string search service adds a
"tightest" match condition. In other words, if the pattern matches at offsets
<start, end> as well as offsets <start + 1, end>, the offsets <start, end> are
not considered a match. Likewise, if the pattern matches at offsets
<start, end> as well as offsets <start, end + 1>, the offsets <start, end + 1>
are not considered a match. Therefore, when option 1 is chosen with a Danish
collator, "baad" will match in the string "a--båd--man" (a--b\\u00e5d--man)
ONLY at offsets <3,5>.
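
The "tightest" match condition can be illustrated with a toy model. The sketch
below does not use ICU; it assumes a hypothetical `key` function that drops
hyphens and expands 'å' to "aa" as a stand-in for a Danish collator with
ignorable punctuation, and it treats a window as a match when its key equals
the key of the pattern. Offsets are inclusive, as in the text above.

```java
import java.util.ArrayList;
import java.util.List;

final class TightestMatch {
    /** Toy stand-in for collation: hyphens are ignorable, U+00E5 expands to "aa". */
    static String key(String s) {
        return s.replace("-", "").replace("\u00e5", "aa");
    }

    /** True if the window text[start..end] (inclusive) compares equal to the pattern. */
    static boolean matches(String text, String pattern, int start, int end) {
        return key(text.substring(start, end + 1)).equals(key(pattern));
    }

    /** Returns every inclusive start/end pair that satisfies the tightest-match condition. */
    static List<int[]> tightestMatches(String text, String pattern) {
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start; end < text.length(); end++) {
                if (!matches(text, pattern, start, end)) {
                    continue;
                }
                // Reject the window if dropping its first or last character still matches.
                boolean shorterFromStart = start < end && matches(text, pattern, start + 1, end);
                boolean shorterFromEnd = start < end && matches(text, pattern, start, end - 1);
                if (!shorterFromStart && !shorterFromEnd) {
                    result.add(new int[] {start, end});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] m : tightestMatches("a--b\u00e5d--man", "baad")) {
            System.out.println("<" + m[0] + ", " + m[1] + ">"); // prints "<3, 5>"
        }
    }
}
```

In this model the windows starting at 1 or 2 and ending at 6 or 7 all compare
equal to the pattern, but each of them can be shortened and still match, so
only <3, 5> survives.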

The default behavior is that described in option 2 above. To obtain the behavior
described in option 1, you must set the normalization mode to ON in the collator
used for search.

> :point_right: **Note**: The "tightest match" behavior described above
> is defined as "Minimal Match" in
> [Section 8 Searching and Matching in UTS #10 Unicode Collation Algorithm](http://www.unicode.org/reports/tr10/#Searching).
> "Medial Match" and "Maximal Match" are not yet implemented by the ICU String Search service.

The string search service also supports two varieties of “asymmetric search” as
described in *[Section 8.2 Asymmetric Search in UTS #10 Unicode Collation
Algorithm](http://www.unicode.org/reports/tr10/#Asymmetric_Search)*.
With asymmetric search, for example, unaccented characters are treated as
“wildcards” that may match any character with the same primary weight. This
behavior can be applied just to characters in the search pattern, or to
characters in both the search pattern and the searched text. With the former
behavior, searching with French behavior for 'e' might match 'e', 'è', 'é', 'ê',
and so on, while searching for 'é' would only match 'é'.
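
The former behavior (asymmetry applied only to the pattern) can be modeled
roughly as follows. This is a simplified sketch, not ICU's algorithm: it
assumes that stripping combining marks from the NFD form, using the JDK's
`java.text.Normalizer`, approximates "same primary weight" for single letters.
The class and method names are illustrative.

```java
import java.text.Normalizer;

final class AsymmetricMatch {
    /** Removes combining marks after canonical decomposition, e.g. U+00E9 becomes "e". */
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    /**
     * Pattern-side asymmetric comparison of single letters:
     * an unaccented pattern letter acts as a wildcard over accents,
     * while an accented pattern letter must match exactly.
     */
    static boolean matches(String patternLetter, String textLetter) {
        String p = Normalizer.normalize(patternLetter, Normalizer.Form.NFD);
        String t = Normalizer.normalize(textLetter, Normalizer.Form.NFD);
        boolean patternIsUnaccented = p.equals(stripMarks(p));
        return patternIsUnaccented ? p.equals(stripMarks(t)) : p.equals(t);
    }

    public static void main(String[] args) {
        System.out.println(matches("e", "\u00e8"));      // true:  'e' matches e-grave
        System.out.println(matches("e", "\u00ea"));      // true:  'e' matches e-circumflex
        System.out.println(matches("\u00e9", "\u00e9")); // true:  e-acute matches itself
        System.out.println(matches("\u00e9", "e"));      // false: e-acute does not match 'e'
        System.out.println(matches("\u00e9", "\u00e8")); // false: e-acute does not match e-grave
    }
}
```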

Either a locale or a collator can be used to specify the language-sensitive
rules for searches. When a locale is specified, a collator is created internally
and the `StringSearch` instance that is created owns that collator. All the
collation attributes are considered during the string search operation.
However, users can set the collator attributes only through the collator APIs.
Normalization is usually done within collation and the process is outside the
scope of the string search service.

As in other iterator interfaces, the string search service provides APIs to
perform string matching for the first pattern occurrence, the immediate next
match, the previous match, and the last pattern occurrence. There are also
options to allow for overlapping matching. For example, in English, if the
string is "ababab" and the pattern is "abab", overlapping matching produces
results at offsets <0, 3> and <2, 5>. Otherwise, mutually exclusive matching
produces the result at offsets <0, 3> only. To find a whole word match, the user
can provide a locale-specific `BreakIterator` object to a `StringSearch`
instance to correctly locate the word boundaries. By default, for example, if
"c" exists in the string "abc", a match is returned. However, this behavior can
be overridden by supplying a word `BreakIterator`.
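
The difference between overlapping and mutually exclusive matching can be
reproduced with plain `String.indexOf`. This sketch does not involve a collator;
the class name and the `overlapping` flag are hypothetical and only mirror the
idea described above. Offsets are reported as inclusive start/end pairs to match
the text.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapDemo {
    /** Collects inclusive start/end offsets of every exact occurrence of pattern. */
    static List<int[]> findAll(String text, String pattern, boolean overlapping) {
        List<int[]> matches = new ArrayList<>();
        if (pattern.isEmpty()) {
            return matches; // an empty pattern never matches in this sketch
        }
        int from = 0;
        while (true) {
            int start = text.indexOf(pattern, from);
            if (start < 0) {
                break;
            }
            int end = start + pattern.length() - 1;
            matches.add(new int[] {start, end});
            // Overlapping: resume one position after the match start.
            // Mutually exclusive: resume after the match end.
            from = overlapping ? start + 1 : end + 1;
        }
        return matches;
    }

    public static void main(String[] args) {
        for (int[] m : findAll("ababab", "abab", true)) {
            System.out.println("overlapping: <" + m[0] + ", " + m[1] + ">");
        }
        for (int[] m : findAll("ababab", "abab", false)) {
            System.out.println("exclusive:   <" + m[0] + ", " + m[1] + ">");
        }
    }
}
```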

In the ICU string search service implementation, the minimum unit of a match is
aligned to an extended grapheme cluster as defined by [UAX #29 Unicode Text
Segmentation](http://www.unicode.org/reports/tr29/). Therefore, all matches will
begin and end on extended grapheme cluster boundaries. If the given input search
pattern starts with a non-base character, no matches will be returned.
When there are contractions in the collation sequence and a contraction
happens to span the boundary of a match, it is not considered a match.
For example, in traditional Spanish where 'ch' is a contraction, the "har"
pattern will not match in the string "uno charo". Boundaries that are
discontiguous contractions will yield a match result similar to those described
above, where the end of the match returned will be one character before the
immediately following base letter. In addition, only the first match will be
located if a pattern contains only combining marks and the search string
contains more than one occurrence of the pattern consecutively. For example, if
the user searches for the pattern "´" (\\u00b4) in the string "A´´B",
(A\\u00b4\\u00b4B) the result will be offsets <1, 2>.
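
The boundary requirement can be pictured as a check applied to each candidate
match. The sketch below uses the JDK's `java.text.BreakIterator` as a stand-in
for ICU's break iterators (an assumption so that the example runs without
ICU4J); `isAligned` is a hypothetical helper, and `end` is exclusive here,
unlike the inclusive offsets used in the prose. The same check with a word
iterator models the whole-word restriction.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BoundaryCheck {
    /** True if both start and end (exclusive) fall on boundaries reported by the iterator. */
    static boolean isAligned(BreakIterator boundaries, String text, int start, int end) {
        boundaries.setText(text);
        return boundaries.isBoundary(start) && boundaries.isBoundary(end);
    }

    public static void main(String[] args) {
        BreakIterator graphemes = BreakIterator.getCharacterInstance(Locale.ROOT);
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);

        // "a" + combining grave: a candidate covering only "a" splits the cluster.
        System.out.println(isAligned(graphemes, "a\u0300b", 0, 1));
        System.out.println(isAligned(graphemes, "a\u0300b", 0, 2));

        // "c" inside "abc" is not a whole word, but the standalone "c" is.
        System.out.println(isAligned(words, "abc", 2, 3));
        System.out.println(isAligned(words, "a b c", 4, 5));
    }
}
```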

### Example

**In C:**

```c
    /* Requires <stdio.h>, "unicode/ustring.h", and "unicode/usearch.h". */
    const char *tgtstr = "The quick brown fox jumps over the lazy dog.";
    const char *patstr = "fox";
    UChar target[64];
    UChar pattern[16];
    int32_t pos = 0;
    UErrorCode status = U_ZERO_ERROR;
    UStringSearch *search = NULL;

    /* Copy the default-codepage C strings into UTF-16 buffers. */
    u_uastrcpy(target, tgtstr);
    u_uastrcpy(pattern, patstr);

    /* A length of -1 means NUL-terminated; no break iterator is supplied. */
    search = usearch_open(pattern, -1, target, -1, "en_US",
                          NULL, &status);

    if (U_FAILURE(status)) {
        fprintf(stderr, "Could not create a UStringSearch.\n");
        return;
    }

    /* Walk through all matches from the beginning of the target. */
    for (pos = usearch_first(search, &status);
         U_SUCCESS(status) && pos != USEARCH_DONE;
         pos = usearch_next(search, &status))
    {
        fprintf(stdout, "Match found at position %d.\n", (int)pos);
    }

    if (U_FAILURE(status)) {
        fprintf(stderr, "Error searching for pattern.\n");
    }

    /* Release the search object. */
    usearch_close(search);
```

**In C++:**

```c++
    // Requires <stdio.h> and "unicode/stsearch.h".
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString target("Jackdaws love my big sphinx of quartz.");
    UnicodeString pattern("sphinx");
    // No break iterator is supplied, so matches are not restricted to word boundaries.
    StringSearch search(pattern, target, Locale::getUS(), nullptr, status);

    if (U_FAILURE(status)) {
        fprintf(stderr, "Could not create a StringSearch object.\n");
        return;
    }

    // Walk through all matches from the beginning of the target.
    for (int32_t pos = search.first(status);
         U_SUCCESS(status) && pos != USEARCH_DONE;
         pos = search.next(status))
    {
        fprintf(stdout, "Match found at position %d.\n", (int)pos);
    }

    if (U_FAILURE(status)) {
        fprintf(stderr, "Error searching for pattern.\n");
    }
```

**In Java:**

```java
    // Requires com.ibm.icu.text.StringSearch, java.text.StringCharacterIterator,
    // and java.util.Locale.
    StringCharacterIterator target = new StringCharacterIterator(
                                         "Pack my box with five dozen liquor jugs.");
    String pattern = "box";

    try {
        StringSearch search = new StringSearch(pattern, target, Locale.US);

        // Walk through all matches from the beginning of the target.
        for (int pos = search.first();
             pos != StringSearch.DONE;
             pos = search.next())
        {
            System.out.println("Match found for pattern at position " + pos);
        }
    } catch (Exception e) {
        System.err.println("StringSearch failure: " + e.toString());
    }
```

## Performance and Other Implications

The ICU string search service is designed to be on top of the ICU collation
service. Therefore, all the performance implications that apply to a collator
are also applicable to the string search service. To obtain the best
performance, use the default collator attributes described in the Performance
and Storage Implications of Attributes section in the [Collation Service
Architecture](architecture#performance-and-storage-implications-of-attributes)
chapter. In addition, users need to be aware of
the following `StringSearch` specific considerations:

### Search Algorithm

ICU4C (C/C++) releases up to 3.8 used the Boyer-Moore search algorithm in the string
search service. There were some known issues in these previous releases.
(See ICU tickets [ICU-5024](https://unicode-org.atlassian.net/browse/ICU-5024),
[ICU-5382](https://unicode-org.atlassian.net/browse/ICU-5382),
[ICU-5420](https://unicode-org.atlassian.net/browse/ICU-5420)).

In ICU4C 4.0, the string search service was updated to use a simple linear search
algorithm, which locates a match by shifting a cursor in the target text one by one,
and these issues were fixed.

In ICU4C 4.0.1, the Boyer-Moore search code was reintroduced as a separate API with
technology preview status. However, in ICU4C 51.1, this was removed.
(See ICU ticket [ICU-9573](https://unicode-org.atlassian.net/browse/ICU-9573)).

Similarly, in ICU4J 53 (Java) the Boyer-Moore search algorithm was replaced by the
simple linear search algorithm, ported from ICU4C. (See ICU ticket [ICU-6288](https://unicode-org.atlassian.net/browse/ICU-6288)).

The Boyer-Moore search algorithm pre-processes the pattern, exploiting combinatorial
properties of strings, and is known to be much faster than a linear search
for longer search patterns. According to a performance evaluation of
these two implementations, the Boyer-Moore search is faster than the
linear search when the pattern text is longer than 3 or 4 characters.
However, it is very tricky to get correct results with a collation-based Boyer-Moore search.

### Change Iterating Direction

The ICU string search service provides a set of very dynamic APIs that allow
users to change the iterating direction at any time. For example, users can
search for a particular word going forward by calling the `usearch_next` (C),
`StringSearch::next` (C++) or `StringSearch.next` (Java) APIs and then search
backwards at any point of the search operation by calling the `usearch_previous`
(C), `StringSearch::previous` (C++) or `StringSearch.previous` (Java) APIs. Another
way to change the iterating direction is to first call the `usearch_reset` (C),
`StringSearch::reset` (C++) or `StringSearch.reset` (Java) APIs. Though the
direction change can occur without calling the reset APIs first, doing so
comes with a reduction in speed.

> :point_right: **Note**: Backward search is not available with the
> ICU4C Boyer-Moore search technology preview introduced in ICU4C 4.0.1;
> it is available only with the linear search implementation.

### Thai and Lao Character Boundaries

In collation, certain Thai and Lao vowels are swapped with the next character.
For example, the text string "A ขเ" (A \\u0e02\\u0e40) is processed internally
in collation as "A เข" (A \\u0e40\\u0e02). Therefore, if the user searches for
the pattern "Aเ" (A\\u0e40) in "A ขเ" (A \\u0e02\\u0e40), the string search
service will match starting at offset 0. Since this normalization process is
internal to collation, there is no notification that the swapping has happened.
The returned offsets in this example will be <0, 2> even though the range
encompasses one extra character.

### Case Level Search

Case level string search is currently done with the strength set to tertiary.
When searching with the strength set to primary and the case level attribute
turned on, the results may not be correct. The case level attribute is
different from tertiary strength in that accents are ignored but case
differences are not. Suppose you want to search for “A” in the text
“ABC\\u00C5a”. With the case level attribute, matches should be found at
offsets 0 and 3. However, searching with primary strength and the case level
attribute turned on currently finds matches at 0, 3, and 4, which includes the
lower case 'a'. To ensure that case level differences are not ignored, string
search must be done with at least tertiary strength.