---
layout: default
title: ICU Data Build Tool
nav_order: 1
parent: ICU Data
---
<!--
© 2019 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Data Build Tool
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU 64 provides a tool for configuring your ICU locale data file with finer
granularity.  This page explains how to use this tool to customize and reduce
your data file size.

## Overview: What is in the ICU data file?

There are hundreds of **locales** supported in ICU (including script and
region variants), and ICU supports many different **features**.  For each
locale and for each feature, data is stored in one or more data files.

Those data files are compiled and then bundled into a `.dat` file called
something like `icudt64l.dat`, which is little-endian data for ICU 64. This
dat file is packaged into the `libicudata.so` on Linux or `libicudata.dll.a`
on Windows. In ICU4J, it is bundled into a jar file named `icudata.jar`.

At a high level, the size of the ICU data file corresponds to the
cross-product of locales and features, except that not all features require
locale-specific data, and not all locales require data for all features. The
data file contents can be approximately visualized like this:

<img alt="Features vs. Locales" src="../assets/features_locales.svg" style="max-width:600px" />

The `icudt64l.dat` file is 27 MiB uncompressed and 11 MiB gzipped.  This file
size is too large for certain use cases, such as bundling the data file into a
smartphone app or an embedded device.  This is something the ICU Data Build
Tool aims to solve.

## ICU Data Configuration File

The ICU Data Build Tool enables you to write a configuration file that
specifies what features and locales to include in a custom data bundle.

The configuration file may be written in either [JSON](http://json.org/) or
[Hjson](https://hjson.org/).  To build ICU4C with custom data, set the
`ICU_DATA_FILTER_FILE` environment variable when running `runConfigureICU` on
Unix or when building the data package on Windows.  For example:

    ICU_DATA_FILTER_FILE=filters.json path/to/icu4c/source/runConfigureICU Linux

**Important:** You *must* have the data sources in order to use the ICU Data
Build Tool. Check for the file `icu4c/source/data/locales/root.txt`. If that file
is missing, you need to download "icu4c-\*-data.zip", delete the old
`icu4c/source/data` directory, and replace it with the data directory from the zip
file. If there is a \*.dat file in `icu4c/source/data/in`, that file will be used
even if you gave ICU custom filter rules.

In order to use Hjson syntax, the `hjson` pip module must be installed on
your system.  You should also consider installing the `jsonschema` module to
print messages when errors are found in your config file.

    $ pip3 install --user hjson jsonschema

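To check that a filter file parses before starting a full build, a small script along these lines may help. This is only a sketch: `load_filter_config` is a hypothetical helper written for this page, not part of the ICU Data Build Tool, and the `hjson` branch assumes the pip module above is installed.

```python
import json


def load_filter_config(text, use_hjson=False):
    """Parse filter-file text as Hjson when requested, otherwise as strict JSON."""
    if use_hjson:
        # Third-party module; only imported when Hjson parsing is requested.
        import hjson
        return hjson.loads(text)
    return json.loads(text)


# A strict-JSON filter file parses without any extra dependency.
config = load_filter_config(
    '{"localeFilter": {"filterType": "language", "includelist": ["en"]}}'
)
print(config["localeFilter"]["includelist"])
```
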
To build ICU4J with custom data, you must first build ICU4C with custom data
and then generate the JAR file.  For more information on building ICU4J, read the
[ICU4J Readme](../icu4j/).

### Locale Slicing

The simplest way to slice ICU data is by locale.  The ICU Data Build Tool
makes it easy to select your desired locales to suit a number of use cases.

#### Filtering by Language Only

Here is a *filters.json* file that builds ICU data with support for English,
Chinese, and German, including *all* script and regional variants for those
languages:

    {
      "localeFilter": {
        "filterType": "language",
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

The *filterType* "language" only supports slicing by entire languages.

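As a rough mental model of what "slicing by entire languages" means, the sketch below keeps every locale whose language subtag is listed. The function names and the sample locale list are assumptions made for illustration; this is not the build tool's actual implementation.

```python
def language_of(locale):
    """Return the language subtag, e.g. 'zh' for 'zh_Hant_TW'."""
    return locale.split("_")[0]


def language_filter(locales, includelist):
    """Keep locales whose language subtag is listed; 'root' is always kept."""
    wanted = set(includelist)
    return [loc for loc in locales if loc == "root" or language_of(loc) in wanted]


# Illustrative sample of locale file stems (not the full ICU list).
available = ["root", "en", "en_GB", "de", "de_CH", "fr", "fr_CA",
             "zh", "zh_Hant", "zh_Hant_TW"]
selected = language_filter(available, ["en", "de", "zh"])
print(selected)
# -> ['root', 'en', 'en_GB', 'de', 'de_CH', 'zh', 'zh_Hant', 'zh_Hant_TW']
```
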
##### Terminology: Includelist, Excludelist, Whitelist, Blacklist

Prior to ICU 68, use `"whitelist"` and `"blacklist"` instead of `"includelist"`
and `"excludelist"`, respectively. ICU 68 allows all four terms.

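If one filter file has to serve both older and newer ICU versions, the keywords can be rewritten mechanically. The following is a minimal sketch; `to_legacy_keywords` is a hypothetical helper, not something shipped with ICU.

```python
# Mapping from the ICU 68+ keywords to the names required before ICU 68.
LEGACY_KEYS = {"includelist": "whitelist", "excludelist": "blacklist"}


def to_legacy_keywords(node):
    """Recursively rename includelist/excludelist keys; values are left untouched."""
    if isinstance(node, dict):
        return {LEGACY_KEYS.get(key, key): to_legacy_keywords(value)
                for key, value in node.items()}
    if isinstance(node, list):
        return [to_legacy_keywords(item) for item in node]
    return node


modern = {"localeFilter": {"filterType": "language", "includelist": ["en", "de", "zh"]}}
legacy = to_legacy_keywords(modern)
print(legacy)
```
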
#### Filtering by Locale

For more control, use *filterType* "locale".  Here is a *filters.hjson* file that
includes the same three languages as above, including regional variants, but
only the default script (e.g., Simplified Han for Chinese):

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Adding Script Variants (includeScripts = true)

You may set the *includeScripts* option to true to include all scripts for a
language while using *filterType* "locale".  This results in behavior similar
to *filterType* "language".  In the following JSON example, all scripts for
Chinese are included:

    {
      "localeFilter": {
        "filterType": "locale",
        "includeScripts": true,
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

If you wish to explicitly list the scripts, you may put the script code in the
locale tag in the includelist, and you do not need the *includeScripts* option
enabled.  For example, in Hjson, to include Han Traditional ***but not Han
Simplified***:

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh_Hant
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

**Note:** the option *includeScripts* is only supported at the language level;
i.e., in order to include all scripts for a particular language, you must
specify the language alone, without a region tag.

#### Removing Regional Variants (includeChildren = false)

If you wish to enumerate exactly which regional variants you wish to support,
you may use *filterType* "locale" with the *includeChildren* setting set to
false.  The following *filters.hjson* file includes English (US), English
(UK), German (Germany), and Chinese (China, Han Simplified), as well as their
dependencies, *but not* other regional variants like English (Australia),
German (Switzerland), or Chinese (Taiwan, Han Traditional):

    localeFilter: {
      filterType: locale
      includeChildren: false
      includelist: [
        en_US
        en_GB
        de_DE
        zh_CN
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

Including dependencies, the above filter would include the following data files:

- root.txt
- en.txt
- en_US.txt
- en_001.txt
- en_GB.txt
- de.txt
- de_DE.txt
- zh.txt
- zh_Hans.txt
- zh_Hans_CN.txt
- zh_CN.txt

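The file list above can be reproduced by following each requested locale up its dependency chain. The sketch below does this with a hand-written table that covers only this example; the table, `DEPENDS_ON`, and `with_dependencies` are assumptions for illustration, whereas the real tool derives these relationships from the ICU data sources.

```python
# Hypothetical dependency table for this example only: each locale maps to the
# locale it falls back to (or is an alias of). 'root' has no dependency.
DEPENDS_ON = {
    "en": "root", "en_US": "en", "en_001": "en", "en_GB": "en_001",
    "de": "root", "de_DE": "de",
    "zh": "root", "zh_Hans": "zh", "zh_Hans_CN": "zh_Hans", "zh_CN": "zh_Hans_CN",
}


def with_dependencies(requested):
    """Return the requested locales plus everything they depend on."""
    needed = set()
    for locale in requested:
        # Walk up the chain until reaching root or an already-visited locale.
        while locale is not None and locale not in needed:
            needed.add(locale)
            locale = DEPENDS_ON.get(locale)
    return sorted(needed)


files = [stem + ".txt" for stem in with_dependencies(["en_US", "en_GB", "de_DE", "zh_CN"])]
print(files)
```
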
### File Slicing (coarse-grained features)

ICU provides a lot of features, of which you probably need only a small subset
for your application.  Feature slicing is a powerful way to prune out data for
any features you are not using.

***CAUTION:*** When slicing by features, you must manually include all
dependencies.  For example, if you are formatting dates, you must include not
only the date formatting data but also the number formatting data, since dates
contain numbers.  Expect to spend a fair bit of time debugging your feature
filter to get it to work the way you expect it to.

The data for many ICU features live in individual files.  The ICU Data Build
Tool puts similar *types* of files into categories.  The following table
summarizes the ICU data files and their corresponding features and categories:

| Feature | Category ID(s) | Data Files <br/> ([icu4c/source/data](https://github.com/unicode-org/icu/tree/main/icu4c/source/data)) | Resource Size <br/> (as of ICU 64) |
|---|---|---|---|
| Break Iteration | `"brkitr_rules"` <br/> `"brkitr_dictionaries"` <br/> `"brkitr_tree"` | brkitr/rules/\*.txt <br/> brkitr/dictionaries/\*.txt <br/> brkitr/\*.txt | 522 KiB <br/> **2.8 MiB** <br/> 14 KiB |
| Charset Conversion | `"conversion_mappings"` | mappings/\*.ucm | **4.9 MiB** |
| Collation <br/> *[more info](#collation-ucadata)* | `"coll_ucadata"` <br/> `"coll_tree"` | in/coll/ucadata-\*.icu <br/> coll/\*.txt | 511 KiB <br/> **2.8 MiB** |
| Confusables | `"confusables"` | unidata/confusables\*.txt | 45 KiB |
| Currencies | `"misc"` <br/> `"curr_supplemental"` <br/> `"curr_tree"` | misc/currencyNumericCodes.txt <br/> curr/supplementalData.txt <br/> curr/\*.txt | 3.1 KiB <br/> 27 KiB <br/> **2.5 MiB** |
| Language Display <br/> Names | `"lang_tree"` | lang/\*.txt | **2.1 MiB** |
| Language Tags | `"misc"` | misc/keyTypeData.txt <br/> misc/langInfo.txt <br/> misc/likelySubtags.txt <br/> misc/metadata.txt | 6.8 KiB <br/> 37 KiB <br/> 53 KiB <br/> 33 KiB |
| Normalization | `"normalization"` | in/\*.nrm except in/nfc.nrm | 160 KiB |
| Plural Rules | `"misc"` | misc/pluralRanges.txt <br/> misc/plurals.txt | 3.3 KiB <br/> 33 KiB |
| Region Display <br/> Names | `"region_tree"` | region/\*.txt | **1.1 MiB** |
| Rule-Based <br/> Number Formatting <br/> (Spellout, Ordinals) | `"rbnf_tree"` | rbnf/\*.txt | 538 KiB |
| StringPrep | `"stringprep"` | sprep/\*.txt | 193 KiB |
| Time Zones | `"misc"` <br/> `"zone_tree"` <br/> `"zone_supplemental"` | misc/metaZones.txt <br/> misc/timezoneTypes.txt <br/> misc/windowsZones.txt <br/> misc/zoneinfo64.txt <br/> zone/\*.txt <br/> zone/tzdbNames.txt | 41 KiB <br/> 20 KiB <br/> 22 KiB <br/> 151 KiB <br/> **2.7 MiB** <br/> 4.8 KiB |
| Transliteration | `"translit"` | translit/\*.txt | 685 KiB |
| Unicode Emoji<br/>Properties | `"uemoji"` | in/uemoji.icu | 13 KiB |
| Unicode Character <br/> Names | `"unames"` | in/unames.icu | 269 KiB |
| Unicode Text Layout | `"ulayout"` | in/ulayout.icu | 14 KiB |
| Units | `"unit_tree"` | unit/\*.txt | **1.7 MiB** |
| **OTHER** | `"cnvalias"` <br/> `"misc"` <br/> `"locales_tree"` | mappings/convrtrs.txt <br/> misc/dayPeriods.txt <br/> misc/genderList.txt <br/> misc/numberingSystems.txt <br/> misc/supplementalData.txt <br/> locales/\*.txt | 63 KiB <br/> 19 KiB <br/> 0.5 KiB <br/> 5.6 KiB <br/> 228 KiB <br/> **2.4 MiB** |

#### Additive and Subtractive Modes

The ICU Data Build Tool allows two strategies for selecting features:
*additive* mode and *subtractive* mode.

The default is to use subtractive mode. This means that all ICU data is
included, and your configurations can remove or change data from that baseline.
Additive mode means that you start with an *empty* ICU data file, and you must
explicitly add the data required for your application.

There are two concrete differences between additive and subtractive mode:

|                         | Additive    | Subtractive |
|-------------------------|-------------|-------------|
| Default Feature Filter  | `"exclude"` | `"include"` |
| Default Resource Filter | `"-/"`, `"+/%%ALIAS"`, `"+/%%Parent"` | `"+/"` |

To enable additive mode, add the following setting to your filter file:

    strategy: "additive"

**Caution:** If using `"-/"` or similar top-level exclusion rules, be aware of
the fields `"+/%%Parent"` and `"+/%%ALIAS"`, which are required in locale tree
resource bundles. Excluding these paths may cause unexpected locale fallback
behavior.

#### Filter Types

You may list *filters* for each category in the *featureFilters* section of
your config file.  What follows are examples of the possible types of filters.

##### Inclusion Filter

To include a category, use the string `"include"` as your filter.

    featureFilters: {
      locales_tree: include
    }

If the category is a locale tree (ends with `_tree`), the inclusion filter
resolves to the `localeFilter`; for more information, see the section
"Locale-Tree Categories." Otherwise, the inclusion filter causes all files in
the category to be included.

**NOTE:** When subtractive mode is used (default), all categories implicitly
start with `"include"` as their filter.

##### Exclusion Filter

To exclude an entire category, use *filterType* "exclude".  For example, to
exclude all confusables data:

    featureFilters: {
      confusables: {
        filterType: exclude
      }
    }

Since ICU 65, you can also write simply:

    featureFilters: {
      confusables: exclude
    }

**NOTE:** When additive mode is used, all categories implicitly start with
`"exclude"` as their filter.

##### File Name Filter

To select specific files within a category, use the file name filter, which is
the default type of filter when *filterType* is not specified.  For example,
to include the Burmese break iteration dictionary but not any other
dictionaries:

    featureFilters: {
      brkitr_dictionaries: {
        includelist: [
          burmesedict
        ]
      }
    }

Do *not* include directories or file extensions.  They will be added
automatically for you.  Note that all files in a particular category have the
same directory and extension.

You can use either `"includelist"` or `"excludelist"` for the file name filter.
*If using ICU 67 or earlier, see note above regarding allowed keywords.*

##### Regex Filter

To exclude filenames matching a certain regular expression, use *filterType*
"regex".  For example, to reject the CJK-specific break iteration rules:

    featureFilters: {
      brkitr_rules: {
        filterType: regex
        excludelist: [
          ^.*_cj$
        ]
      }
    }

The Python standard library [*re*
module](https://docs.python.org/3/library/re.html) is used for evaluating the
regular expressions.  In case the regular expression engine is changed in the
future, however, you are encouraged to restrict yourself to a simple set of
regular expression operators.

As above, do not include directories or file extensions, and you can use
either an `"includelist"` or an `"excludelist"`.

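You can try a pattern against a handful of file stems before putting it into the filter file. The sketch below assumes patterns are tested against the stem (no directory, no extension) with `re.match`; `regex_exclude` and the sample stems are illustrative, not taken from the build tool.

```python
import re


def regex_exclude(file_stems, patterns):
    """Drop every stem that matches at least one of the exclusion patterns."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [stem for stem in file_stems
            if not any(rx.match(stem) for rx in compiled)]


# Illustrative sample of break iteration rule file stems.
stems = ["char", "line", "line_cj", "line_loose_cj", "sent", "word"]
kept = regex_exclude(stems, [r"^.*_cj$"])
print(kept)  # -> ['char', 'line', 'sent', 'word']
```

Because the pattern is anchored with `^` and `$`, it behaves the same under `re.match` and `re.search`, which keeps it robust if the matching mode differs from the assumption above.
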
##### Union Filter

You can combine the results of multiple filters with *filterType* "union".
This filter matches files that match *at least one* of the provided filters.
The syntax is:

    {
      filterType: union
      unionOf: [
        { /* filter 1 */ },
        { /* filter 2 */ },
        // ...
      ]
    }

This filter type is useful for combining "locale" filters with different
includeScripts or includeChildren options.

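The "at least one" semantics can be pictured as a logical OR over the member filters. In this sketch each filter is modeled as a plain predicate on a file stem; `name_filter` and `union_filter` are hypothetical names used only to illustrate the idea.

```python
def name_filter(includelist):
    """Model a file name filter as a predicate on the file stem."""
    names = set(includelist)
    return lambda stem: stem in names


def union_filter(filters):
    """A stem matches the union if at least one member filter matches it."""
    return lambda stem: any(matches(stem) for matches in filters)


matches = union_filter([
    name_filter(["en", "en_US"]),
    name_filter(["zh", "zh_Hant"]),
])
stems = ["en", "en_GB", "en_US", "zh", "zh_Hant", "de"]
selected = [stem for stem in stems if matches(stem)]
print(selected)  # -> ['en', 'en_US', 'zh', 'zh_Hant']
```
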
#### Locale-Tree Categories

Several categories have the `_tree` suffix.  These categories are for "locale
trees": they contain locale-specific data.  ***The [localeFilter configuration
option](#locale-slicing) sets the default file filter for all `_tree`
categories.***

If you want to include different locales for different locale file trees, you
can override their filter in the *featureFilters* section of the config file.
For example, to include only Italian data for currency symbols *instead of*
the common locales specified in *localeFilter*, you can do the following:

    featureFilters: {
      curr_tree: {
        filterType: locale
        includelist: [
          it
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

You can exclude an entire `_tree` category without affecting other categories.
For example, to exclude region display names:

    featureFilters: {
      region_tree: {
        filterType: exclude
      }
    }

Note that you are able to use any of the other filter types for `_tree`
categories, but you must be very careful that you are including all of the
correct files.  For example, `en_GB` requires `en_001`, and you must always
include `root`.  If you use the "language" or "locale" filter types, this
logic is done for you.

### Resource Bundle Slicing (fine-grained features)

The third section of the ICU filter config file is *resourceFilters*.  With
this section, you can dive inside resource bundle files to remove even more
data.

You can apply resource filters to all locale tree categories as well as to
categories that include resource bundles, such as the `"misc"` category.

For example, consider measurement units.  There is one unit file per locale (example:
[en.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unit/en.txt)),
and that file contains data for all measurement units in CLDR.  However, if
you are only formatting distances, for example, you may need the data for only
a small set of units.

Here is how you could include units of length in the "short" style but no
other units:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/unitsShort/length
        ]
      }
    ]

Conceptually, the rules are applied from top to bottom.  First, all data for
all three styles of units are removed, and then the short length units are
added back.

**NOTE:** In subtractive mode, resource paths are *included* by default. In
additive mode, resource paths are *excluded* by default.

#### Wildcard Character

You can use the wildcard character (`*`) to match a piece of the resource
path.  For example, to include length units for all three styles, you can do:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/*/length
        ]
      }
    ]

The wildcard must be the only character in its path segment. Future ICU
versions may expand the syntax.

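One way to reason about a rule list is to treat each rule as a path prefix and let the last matching rule decide. The sketch below models that for leaf resource paths, with `*` standing for exactly one path segment. It is a conceptual model written for this page, not the algorithm the ICU tools actually use, and `rule_matches` and `is_included` are hypothetical names.

```python
def rule_matches(rule_path, resource_path):
    """True if the rule's segments match the leading segments of the resource path."""
    rule_parts = [part for part in rule_path.split("/") if part]
    res_parts = [part for part in resource_path.split("/") if part]
    if len(rule_parts) > len(res_parts):
        return False
    # '*' matches any single segment; zip stops at the end of the rule.
    return all(r == "*" or r == p for r, p in zip(rule_parts, res_parts))


def is_included(resource_path, rules, default=True):
    """Apply rules top to bottom; the last matching rule wins.

    default=True models subtractive mode, default=False additive mode.
    """
    included = default
    for rule in rules:
        sign, path = rule[0], rule[1:]
        if rule_matches(path, resource_path):
            included = (sign == "+")
    return included


rules = ["-/units", "-/unitsNarrow", "-/unitsShort", "+/*/length"]
print(is_included("/unitsShort/length/meter", rules))  # -> True
print(is_included("/units/mass/gram", rules))          # -> False
```
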
#### Resource Filter for Specific File

The resource filter object takes an optional *files* setting which accepts a
file filter in the same syntax used above for file filtering.  For example, if
you wanted to apply a filter to misc/supplementalData.txt, you could do the
following (this example removes calendar data):

    resourceFilters: [
      {
        categories: ["misc"]
        files: {
          includelist: ["supplementalData"]
        }
        rules: [
          -/calendarData
        ]
      }
    ]

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Combining Multiple Resource Filter Specs

You can also list multiple resource filter objects in the *resourceFilters*
array; the filters are added from top to bottom.  For example, here is an
advanced configuration that includes "mile" for en-US and "kilometer" for
en-CA; this also makes use of the *files* option:

    resourceFilters: [
      {
        categories: ["unit_tree"]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_US"]
        }
        rules: [
          +/*/length/mile
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_CA"]
        }
        rules: [
          +/*/length/kilometer
        ]
      }
    ]

The above example would give en-US these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile

and en-CA these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/kilometer

In accordance with *filterType* "locale", the parent locales *en* and *root*
would get both units; this is required since both en-US and en-CA may inherit
from the parent locale:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile
    +/*/length/kilometer
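
The way the three filter objects above combine can be modeled in a few lines of
Python. This is only an illustrative sketch, not the build tool's
implementation: the names `RESOURCE_FILTERS`, `ancestors`, and `rules_for` are
hypothetical, parent locales are assumed to be found by simple truncation
(`en_US`, then `en`, then `root`), and child locales are ignored.

```python
# Illustrative model only; this is not the icutools.databuilder code.
RESOURCE_FILTERS = [
    {"rules": ["-/units", "-/unitsNarrow", "-/unitsShort"]},
    {"includelist": ["en_US"], "rules": ["+/*/length/mile"]},
    {"includelist": ["en_CA"], "rules": ["+/*/length/kilometer"]},
]


def ancestors(locale):
    """Return the locale followed by its parents, ending with 'root'.

    Assumption: parents are found by dropping trailing subtags. Real ICU
    parent-locale data has exceptions that this sketch ignores.
    """
    parts = locale.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return chain + ["root"]


def rules_for(locale, resource_filters=RESOURCE_FILTERS):
    """Collect, from top to bottom, the rules of every filter that applies."""
    rules = []
    for spec in resource_filters:
        include = spec.get("includelist")
        # A spec without a locale filter applies to every file. Otherwise it
        # applies to each listed locale and to the parents it inherits from.
        if include is None or any(locale in ancestors(req) for req in include):
            rules.extend(spec["rules"])
    return rules


for loc in ("en_US", "en_CA", "en", "root"):
    print(loc, rules_for(loc))
```

Under these assumptions, `rules_for("en")` and `rules_for("root")` both end
with the "mile" and "kilometer" rules, matching the last listing above.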

## Debugging Tips

**Run Python directly:** If you do not want to wait for `./runConfigureICU` to
finish, you can directly re-generate the rules using your filter file with the
following command line, run from *icu4c/source*:

    $ PYTHONPATH=python python3 -m icutools.databuilder \
      --mode=gnumake --src_dir=data > data/rules.mk
**Install jsonschema:** Install the `jsonschema` pip package to get warnings
about problems with your filter file.

**See what data is being used:** ICU is instrumented to allow you to trace
which resources are used at runtime. This can help you determine what data you
need to include. For more information, see [tracing.md](tracing.md).

**Inspect data/rules.mk:** The Python script outputs the file *rules.mk*
inside *icu4c/source/data*. To see what is going to get built, you can inspect
that file. First build ICU normally, and copy *rules.mk* to
*rules_default.mk*. Then build ICU with your filter file. Now you can take the
diff between *rules_default.mk* and *rules.mk* to see exactly what your filter
file is removing.
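
If you would rather see only the removed lines than read a full diff, the
comparison can be sketched with Python's standard `difflib` module. The helper
name `removed_lines` is hypothetical, and the sample strings are placeholders
rather than real *rules.mk* content.

```python
import difflib


def removed_lines(default_mk, filtered_mk):
    """Return lines found in the default rules text but not in the filtered one.

    A line that was changed rather than dropped also shows up here, in its
    old form, because ndiff reports a change as a removal plus an addition.
    """
    delta = difflib.ndiff(default_mk.splitlines(), filtered_mk.splitlines())
    return [line[2:] for line in delta if line.startswith("- ")]


# Placeholder text standing in for the contents of the two files.
default_text = "first target\nsecond target\nthird target\n"
filtered_text = "first target\nthird target\n"
print(removed_lines(default_text, filtered_text))
```

To apply this to a real build, read *rules_default.mk* and *rules.mk* into
strings first, for example with `pathlib.Path.read_text()`.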

**Inspect the output:** After a `make clean` and `make` with a new *rules.mk*,
you can look inside the directory *icu4c/source/data/out* to see the files
that got built.

**Inspect the compiled resource filter rules:** If you are using a resource
filter, the resource filter rules get compiled for each individual locale
inside *icu4c/source/data/out/tmp/filters*. You can look at those files to see
what filter rules are being applied to each individual locale.

**Run genrb in verbose mode:** For debugging a resource filter, you can run
genrb in verbose mode to see which resources got stripped. To do this, first
inspect the make output and find a command line like this:

    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/genrb --filterDir ./out/tmp/filters/unit_tree -s ./unit -d ./out/build/icudt64l/unit/ -i ./out/build/icudt64l --usePoolBundle ./out/build/icudt64l/unit/ -k en.txt

Copy that command line and re-run it from *icu4c/source/data* with the `-v`
flag added to the end. The command will print out exactly which resource paths
are being included and excluded as well as a model of the filter rules applied
to this file.

**Inspect .res files with derb:** The `derb` tool can convert .res files back
to .txt files after filtering. For example, to convert the above unit res file
back to a txt file, you can run this command from *icu4c/source*:

    LD_LIBRARY_PATH=lib bin/derb data/out/build/icudt64l/unit/en.res

That will produce a file *en.txt* in your current directory, which is the
original *data/unit/en.txt* but after resource filters were applied.

*Tip:* derb expects your res files to be rooted in a directory named
`icudt64l` (corresponding to your current ICU version and endianness). If your
files are not in such a directory, derb fails with U_MISSING_RESOURCE_ERROR.

**Put complex rules first** and **use the wildcard `*` sparingly:** The order
of the filter rules matters a great deal in how effective your data size
reduction can be, and the wildcard `*` can sometimes produce behavior that is
tricky to reason about. For example, these three lists of filter rules look
similar at first glance but actually produce different output:

<table>
<tr>
<th>Unit Resource Filter Rules</th>
<th>Unit Resource Size</th>
<th>Commentary</th>
<th>Result</th>
</tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/digital/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>77 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from digital units. Then, remove duration
unit patterns and long and narrow forms.
</td><td>
Digital units in short form are included; all other units are removed.
</td></tr>
<tr><td><pre>
-/durationUnits
-/units
-/unitsNarrow
-/*/*
+/*/digital
-/*/digital/*/dnam
</pre></td><td>125 KiB</td><td>
First, remove duration unit patterns and long and narrow forms. Then, remove
all unit types. Then, add back digital units across all unit widths. Then,
remove display names from digital units.
</td><td>
Digital units are included <em>in all widths</em>; all other units are removed.
</td></tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/*/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>191 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from all units. Then, remove duration unit
patterns and long and narrow forms.
</td><td>
Digital units in short form are included, as is the <em>tree structure</em>
for all other units, even though the other units have no real data.
</td></tr>
</table>

By design, empty tree structure is retained in the unit bundle. This is
because there are numerous instances in ICU data where the presence of an
empty tree carries meaning. However, it means that you must be careful when
building resource filter rules in order to achieve the optimal data bundle
size.

Using the `-v` option in genrb (described above) is helpful when debugging
these types of issues.
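
The difference between the first two rows of the table can be reproduced with
a small Python model. It assumes a simplified reading of rule order, in which
the last rule whose path matches a resource decides whether that resource is
kept. This is not genrb's actual algorithm, and it does not model the retained
empty tree structure that distinguishes the third row. The function names and
sample resource paths are hypothetical.

```python
def rule_matches(rule_path, resource_path):
    """True if the rule path is a prefix of the resource path.

    A '*' in the rule stands for exactly one key of the resource path.
    """
    rule_keys = rule_path.strip("/").split("/")
    res_keys = resource_path.strip("/").split("/")
    if len(rule_keys) > len(res_keys):
        return False
    return all(r in ("*", key) for r, key in zip(rule_keys, res_keys))


def is_included(resource_path, rules):
    """Simplified model: the last matching rule wins; the default is to keep."""
    included = True
    for rule in rules:
        sign, path = rule[0], rule[1:]
        if rule_matches(path, resource_path):
            included = sign == "+"
    return included


FIRST_ROW = ["-/*/*", "+/*/digital", "-/*/digital/*/dnam",
             "-/durationUnits", "-/units", "-/unitsNarrow"]
SECOND_ROW = ["-/durationUnits", "-/units", "-/unitsNarrow",
              "-/*/*", "+/*/digital", "-/*/digital/*/dnam"]

long_form = "/units/digital/byte/other"  # sample path, long width
print(is_included(long_form, FIRST_ROW))   # the later "-/units" removes it
print(is_included(long_form, SECOND_ROW))  # the later "+/*/digital" keeps it
```

In this model the long-form digital unit is dropped by the first list and kept
by the second, which is consistent with the size difference shown in the table.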

## Other Features of the ICU Data Build Tool

While data filtering is the primary reason the ICU Data Build Tool was
developed, there are additional use cases.

### Running Data Build without Configure/Make

You can build the dat file outside of the ICU build system by directly
invoking the Python module `icutools.databuilder`.  Run the following command
to see the help text for the CLI tool:

    $ PYTHONPATH=path/to/icu4c/source/python python3 -m icutools.databuilder --help

### Collation UCAData

For using collation (sorting and searching) in any language, the "root"
collation data file must be included. It provides the Unicode CLDR default
sort order for all code points, and forms the basis for language-specific
tailorings as well as for custom collators built at runtime.

There are two versions of the root collation data file:

- ucadata-unihan.txt (compiled size: 511 KiB)
- ucadata-implicithan.txt (compiled size: 178 KiB)

The unihan version sorts Han characters in radical-stroke order according to
Unicode, which is a somewhat useful default sort order, especially for use
with non-CJK languages.  The implicithan version sorts Han characters in the
order of their Unicode assignment, which is similar to radical-stroke order
for common characters but arbitrary for others.  For more information, see
[UTS #10 §10.1.3](https://www.unicode.org/reports/tr10/#Implicit_Weights).

By default, the unihan version is used.  The unihan version of the data file
is much larger than that for implicithan, so if you need collation but also
small data, then you may want to select the implicithan version.  To use the
implicithan version, put the following setting in your *filters.json* file:

    {
      "collationUCAData": "implicithan"
    }

### Disable Pool Bundle

By default, ICU uses a "pool bundle" to store strings shared between locales.
This saves space and is recommended for most users. However, when developing
a system where locale data files may be added "on the fly" and not included in
the original ICU distribution, those additional data files may not be able to
use a pool bundle due to name collisions with the existing pool bundle.

To disable the pool bundle in the current ICU build, put the following setting
in your *filters.json* file:

    {
      "usePoolBundle": false
    }

### File Substitution

Using the configuration file, you can perform whole-file substitutions.  For
example, suppose you want to replace the transliteration rules for
*Zawgyi_my*.  You could create a directory called `my_icu_substitutions`
containing your new `Zawgyi_my.txt` rule file, and then put this in your
configuration file:

    fileReplacements: {
      directory: "/path/to/my_icu_substitutions"
      replacements: [
        {
          src: "Zawgyi_my.txt"
          dest: "translit/Zawgyi_my.txt"
        },
        "misc/dayPeriods.txt"
      ]
    }

`directory` should either be an absolute path, or a path starting with one of
the following, and it should not contain a trailing slash:

- "$SRC" for the *icu4c/source/data* directory in the source tree
- "$FILTERS" for the directory containing filters.json
- "$CWD" for your current working directory

When an entry in the `replacements` array is an object, its `src` field names
a file in the substitution directory and its `dest` field names the file in
the ICU hierarchy that it should replace. When the entry is a string, the
same relative path is used for both `src` and `dest`.

Whole-file substitution happens before all other filters are applied.
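
The two conventions described above (the `directory` prefixes and the
string-or-object entries) can be sketched in Python as follows. The helper
names `resolve_directory` and `normalize_replacements` and the example paths
are hypothetical; this illustrates the documented behavior under those
assumptions and is not the build tool's own code.

```python
import os


def resolve_directory(directory, src_dir, filters_dir, cwd=None):
    """Expand a leading $SRC, $FILTERS, or $CWD; otherwise require an absolute path."""
    prefixes = {
        "$SRC": src_dir,          # icu4c/source/data in the source tree
        "$FILTERS": filters_dir,  # directory containing filters.json
        "$CWD": cwd or os.getcwd(),
    }
    for prefix, base in prefixes.items():
        if directory == prefix or directory.startswith(prefix + "/"):
            return base + directory[len(prefix):]
    if not os.path.isabs(directory):
        raise ValueError("directory must be absolute or start with "
                         "$SRC, $FILTERS, or $CWD: " + directory)
    return directory


def normalize_replacements(replacements):
    """Turn each entry into a (src, dest) pair of relative paths."""
    pairs = []
    for entry in replacements:
        if isinstance(entry, str):
            # A bare string uses the same relative path for src and dest.
            pairs.append((entry, entry))
        else:
            pairs.append((entry["src"], entry["dest"]))
    return pairs


entries = [
    {"src": "Zawgyi_my.txt", "dest": "translit/Zawgyi_my.txt"},
    "misc/dayPeriods.txt",
]
print(normalize_replacements(entries))
print(resolve_directory("$FILTERS/my_icu_substitutions",
                        "/icu/icu4c/source/data", "/home/me/config"))
```

Keeping the prefix expansion separate from the entry normalization mirrors the
way the two settings are described independently above.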