---
layout: default
title: ICU Data Build Tool
nav_order: 1
parent: ICU Data
---
<!--
© 2019 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Data Build Tool
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU 64 provides a tool for configuring your ICU locale data file with finer
granularity. This page explains how to use this tool to customize and reduce
your data file size.

## Overview: What is in the ICU data file?

There are hundreds of **locales** supported in ICU (including script and
region variants), and ICU supports many different **features**. For each
locale and for each feature, data is stored in one or more data files.

Those data files are compiled and then bundled into a `.dat` file called
something like `icudt64l.dat`, which is little-endian data for ICU 64. This
dat file is packaged into the `libicudata.so` on Linux or `libicudata.dll.a`
on Windows. In ICU4J, it is bundled into a jar file named `icudata.jar`.

At a high level, the size of the ICU data file corresponds to the
cross-product of locales and features, except that not all features require
locale-specific data, and not all locales require data for all features. The
data file contents can be approximately visualized like this:

<img alt="Features vs. Locales" src="../assets/features_locales.svg" style="max-width:600px" />

The `icudt64l.dat` file is 27 MiB uncompressed and 11 MiB gzipped. This file
size is too large for certain use cases, such as bundling the data file into a
smartphone app or an embedded device. This is something the ICU Data Build
Tool aims to solve.

## ICU Data Configuration File

The ICU Data Build Tool enables you to write a configuration file that
specifies what features and locales to include in a custom data bundle.

The configuration file may be written in either [JSON](http://json.org/) or
[Hjson](https://hjson.org/). To build ICU4C with custom data, set the
`ICU_DATA_FILTER_FILE` environment variable when running `runConfigureICU` on
Unix or when building the data package on Windows. For example:

    ICU_DATA_FILTER_FILE=filters.json path/to/icu4c/source/runConfigureICU Linux

**Important:** You *must* have the data sources in order to use the ICU Data
Build Tool. Check for the file icu4c/source/data/locales/root.txt.
If that file is missing, you need to download "icu4c-\*-data.zip", delete the
old icu4c/source/data directory, and replace it with the data directory from
the zip file. If there is a \*.dat file in icu4c/source/data/in, that file will
be used even if you gave ICU custom filter rules.

In order to use Hjson syntax, the `hjson` pip module must be installed on
your system. You should also consider installing the `jsonschema` module to
print messages when errors are found in your config file.

    $ pip3 install --user hjson jsonschema

To build ICU4J with custom data, you must first build ICU4C with custom data
and then generate the JAR file. For more information on building ICU4J, read the
[ICU4J Readme](../icu4j/).

### Locale Slicing

The simplest way to slice ICU data is by locale. The ICU Data Build Tool
makes it easy to select your desired locales to suit a number of use cases.
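
Before writing a locale filter, it can help to confirm the two prerequisites
from the previous section: the data sources are present, and no prebuilt
\*.dat file is sitting in icu4c/source/data/in. The following is a minimal
sketch of such a check, not part of the ICU Data Build Tool. The function name
`check_data_sources` and the `icu_root` parameter (assumed to be the directory
that contains icu4c/) are hypothetical.

```python
from pathlib import Path


def check_data_sources(icu_root):
    """Return a list of problems found under icu_root (hypothetical helper).

    icu_root is assumed to be the directory that contains icu4c/.
    An empty list means both checks passed.
    """
    data_dir = Path(icu_root) / "icu4c" / "source" / "data"
    problems = []

    # The data sources are considered present only if locales/root.txt exists.
    if not (data_dir / "locales" / "root.txt").is_file():
        problems.append("missing data sources: locales/root.txt not found")

    # A prebuilt .dat file in data/in is used even when filter rules are given.
    in_dir = data_dir / "in"
    if in_dir.is_dir():
        prebuilt = sorted(p.name for p in in_dir.glob("*.dat"))
        if prebuilt:
            problems.append("prebuilt data file(s) in data/in: " + ", ".join(prebuilt))

    return problems
```

Calling `check_data_sources(".")` from the top of a source checkout returns an
empty list when the filter file can be expected to take effect.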

#### Filtering by Language Only

Here is a *filters.json* file that builds ICU data with support for English,
Chinese, and German, including *all* script and regional variants for those
languages:

    {
      "localeFilter": {
        "filterType": "language",
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

The *filterType* "language" only supports slicing by entire languages.

##### Terminology: Includelist, Excludelist, Whitelist, Blacklist

Prior to ICU 68, use `"whitelist"` and `"blacklist"` instead of `"includelist"`
and `"excludelist"`, respectively. ICU 68 allows all four terms.

#### Filtering by Locale

For more control, use *filterType* "locale".
Here is a *filters.hjson* file that
includes the same three languages as above, including regional variants, but
only the default script (e.g., Simplified Han for Chinese):

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Adding Script Variants (includeScripts = true)

You may set the *includeScripts* option to true to include all scripts for a
language while using *filterType* "locale". This results in behavior similar
to *filterType* "language". In the following JSON example, all scripts for
Chinese are included:

    {
      "localeFilter": {
        "filterType": "locale",
        "includeScripts": true,
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

If you wish to explicitly list the scripts, you may put the script code in the
locale tag in the includelist, and you do not need the *includeScripts* option
enabled.
For example, in Hjson, to include Han Traditional ***but not Han
Simplified***:

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh_Hant
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

**Note:** the option *includeScripts* is only supported at the language level;
i.e., in order to include all scripts for a particular language, you must
specify the language alone, without a region tag.

#### Removing Regional Variants (includeChildren = false)

If you wish to enumerate exactly which regional variants you wish to support,
you may use *filterType* "locale" with the *includeChildren* setting turned to
false.
The following *filters.hjson* file includes English (US), English
(UK), German (Germany), and Chinese (China, Han Simplified), as well as their
dependencies, *but not* other regional variants like English (Australia),
German (Switzerland), or Chinese (Taiwan, Han Traditional):

    localeFilter: {
      filterType: locale
      includeChildren: false
      includelist: [
        en_US
        en_GB
        de_DE
        zh_CN
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

Including dependencies, the above filter would include the following data files:

- root.txt
- en.txt
- en_US.txt
- en_001.txt
- en_GB.txt
- de.txt
- de_DE.txt
- zh.txt
- zh_Hans.txt
- zh_Hans_CN.txt
- zh_CN.txt

### File Slicing (coarse-grained features)

ICU provides a lot of features, of which you probably need only a small subset
for your application. Feature slicing is a powerful way to prune out data for
any features you are not using.

***CAUTION:*** When slicing by features, you must manually include all
dependencies. For example, if you are formatting dates, you must include not
only the date formatting data but also the number formatting data, since dates
contain numbers. Expect to spend a fair bit of time debugging your feature
filter to get it to work the way you expect it to.

The data for many ICU features live in individual files. The ICU Data Build
Tool puts similar *types* of files into categories. The following table
summarizes the ICU data files and their corresponding features and categories:

| Feature | Category ID(s) | Data Files <br/> ([icu4c/source/data](https://github.com/unicode-org/icu/tree/main/icu4c/source/data)) | Resource Size <br/> (as of ICU 64) |
|---|---|---|---|
| Break Iteration | `"brkitr_rules"` <br/> `"brkitr_dictionaries"` <br/> `"brkitr_tree"` | brkitr/rules/\*.txt <br/> brkitr/dictionaries/\*.txt <br/> brkitr/\*.txt | 522 KiB <br/> **2.8 MiB** <br/> 14 KiB |
| Charset Conversion | `"conversion_mappings"` | mappings/\*.ucm | **4.9 MiB** |
| Collation <br/> *[more info](#collation-ucadata)* | `"coll_ucadata"` <br/> `"coll_tree"` | in/coll/ucadata-\*.icu <br/> coll/\*.txt | 511 KiB <br/> **2.8 MiB** |
| Confusables | `"confusables"` | unidata/confusables\*.txt | 45 KiB |
| Currencies | `"misc"` <br/> `"curr_supplemental"` <br/> `"curr_tree"` | misc/currencyNumericCodes.txt <br/> curr/supplementalData.txt <br/> curr/\*.txt | 3.1 KiB <br/> 27 KiB <br/> **2.5 MiB** |
| Language Display <br/> Names | `"lang_tree"` | lang/\*.txt | **2.1 MiB** |
| Language Tags | `"misc"` | misc/keyTypeData.txt <br/> misc/langInfo.txt <br/> misc/likelySubtags.txt <br/> misc/metadata.txt | 6.8 KiB <br/> 37 KiB <br/> 53 KiB <br/> 33 KiB |
| Normalization | `"normalization"` | in/\*.nrm except in/nfc.nrm | 160 KiB |
| Plural Rules | `"misc"` | misc/pluralRanges.txt <br/> misc/plurals.txt | 3.3 KiB <br/> 33 KiB |
| Region Display <br/> Names | `"region_tree"` | region/\*.txt | **1.1 MiB** |
| Rule-Based <br/> Number Formatting <br/> (Spellout, Ordinals) | `"rbnf_tree"` | rbnf/\*.txt | 538 KiB |
| StringPrep | `"stringprep"` | sprep/\*.txt | 193 KiB |
| Time Zones | `"misc"` <br/> `"zone_tree"` <br/> `"zone_supplemental"` | misc/metaZones.txt <br/> misc/timezoneTypes.txt <br/> misc/windowsZones.txt <br/> misc/zoneinfo64.txt <br/> zone/\*.txt <br/> zone/tzdbNames.txt | 41 KiB <br/> 20 KiB <br/> 22 KiB <br/> 151 KiB <br/> **2.7 MiB** <br/> 4.8 KiB |
| Transliteration | `"translit"` | translit/\*.txt | 685 KiB |
| Unicode Emoji<br/>Properties | `"uemoji"` | in/uemoji.icu | 13 KiB |
| Unicode Character <br/> Names | `"unames"` | in/unames.icu | 269 KiB |
| Unicode Text Layout | `"ulayout"` | in/ulayout.icu | 14 KiB |
| Units | `"unit_tree"` | unit/\*.txt | **1.7 MiB** |
| **OTHER** | `"cnvalias"` <br/> `"misc"` <br/> `"locales_tree"` | mappings/convrtrs.txt <br/> misc/dayPeriods.txt <br/> misc/genderList.txt <br/> misc/numberingSystems.txt <br/> misc/supplementalData.txt <br/> locales/\*.txt | 63 KiB <br/> 19 KiB <br/> 0.5 KiB <br/> 5.6 KiB <br/> 228 KiB <br/> **2.4 MiB** |

#### Additive and Subtractive Modes

The ICU Data Build Tool allows two strategies for selecting features:
*additive* mode and *subtractive* mode.

The default is to use subtractive mode. This means that all ICU data is
included, and your configurations can remove or change data from that baseline.
Additive mode means that you start with an *empty* ICU data file, and you must
explicitly add the data required for your application.

There are two concrete differences between additive and subtractive mode:

|                         | Additive    | Subtractive |
|-------------------------|-------------|-------------|
| Default Feature Filter  | `"exclude"` | `"include"` |
| Default Resource Filter | `"-/"`, `"+/%%ALIAS"`, `"+/%%Parent"` | `"+/"` |

To enable additive mode, add the following setting to your filter file:

    strategy: "additive"

**Caution:** If using `"-/"` or similar top-level exclusion rules, be aware of
the fields `"+/%%Parent"` and `"+/%%ALIAS"`, which are required in locale tree
resource bundles. Excluding these paths may cause unexpected locale fallback
behavior.

#### Filter Types

You may list *filters* for each category in the *featureFilters* section of
your config file. What follows are examples of the possible types of filters.

##### Inclusion Filter

To include a category, use the string `"include"` as your filter.

    featureFilters: {
      locales_tree: include
    }

If the category is a locale tree (ends with `_tree`), the inclusion filter
resolves to the `localeFilter`; for more information, see the section
"Locale-Tree Categories." Otherwise, the inclusion filter causes all files in
the category to be included.

**NOTE:** When subtractive mode is used (default), all categories implicitly
start with `"include"` as their filter.

##### Exclusion Filter

To exclude an entire category, use *filterType* "exclude". For example, to
exclude all confusables data:

    featureFilters: {
      confusables: {
        filterType: exclude
      }
    }

Since ICU 65, you can also write simply:

    featureFilters: {
      confusables: exclude
    }

**NOTE:** When additive mode is used, all categories implicitly start with
`"exclude"` as their filter.
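
The two notes above, together with the table in "Additive and Subtractive
Modes", amount to a simple lookup: a category uses its entry in
*featureFilters* if there is one, and otherwise falls back to the default for
the chosen strategy. The following is a minimal sketch of that lookup over an
already parsed config dictionary. It is an illustration, not the tool's own
code, and the names `STRATEGY_DEFAULTS` and `effective_feature_filter` are
hypothetical.

```python
# Defaults copied from the table in "Additive and Subtractive Modes".
STRATEGY_DEFAULTS = {
    "subtractive": {
        "feature_filter": "include",
        "resource_rules": ["+/"],
    },
    "additive": {
        "feature_filter": "exclude",
        "resource_rules": ["-/", "+/%%ALIAS", "+/%%Parent"],
    },
}


def effective_feature_filter(config, category):
    """Return the filter that applies to a category (hypothetical helper).

    config is the parsed filter file as a dict. Subtractive mode is assumed
    when no "strategy" key is present, as described in the text.
    """
    strategy = config.get("strategy", "subtractive")
    default = STRATEGY_DEFAULTS[strategy]["feature_filter"]
    return config.get("featureFilters", {}).get(category, default)


# An additive config that explicitly adds back one category.
config = {"strategy": "additive", "featureFilters": {"normalization": "include"}}
for category in ("normalization", "confusables"):
    print(category, effective_feature_filter(config, category))
```

Note that for `_tree` categories an `"include"` result still resolves to the
`localeFilter`, which this sketch does not model.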

##### File Name Filter

To include or exclude specific files within a category, use the file name
filter, which is the default type of filter when *filterType* is not
specified. For example, to include the Burmese break iteration dictionary but
not any other dictionaries:

    featureFilters: {
      brkitr_dictionaries: {
        includelist: [
          burmesedict
        ]
      }
    }

Do *not* include directories or file extensions. They will be added
automatically for you. Note that all files in a particular category have the
same directory and extension.

You can use either `"includelist"` or `"excludelist"` for the file name filter.
*If using ICU 67 or earlier, see note above regarding allowed keywords.*

##### Regex Filter

To exclude filenames matching a certain regular expression, use *filterType*
"regex".
For example, to reject the CJK-specific break iteration rules:

    featureFilters: {
      brkitr_rules: {
        filterType: regex
        excludelist: [
          ^.*_cj$
        ]
      }
    }

The Python standard library [*re*
module](https://docs.python.org/3/library/re.html) is used for evaluating the
regular expressions. In case the regular expression engine is changed in the
future, however, you are encouraged to restrict yourself to a simple set of
regular expression operators.

As above, do not include directories or file extensions, and you can use
either `"includelist"` or `"excludelist"`.

##### Union Filter

You can combine the results of multiple filters with *filterType* "union".
This filter matches files that match *at least one* of the provided filters.
The syntax is:

    {
      filterType: union
      unionOf: [
        { /* filter 1 */ },
        { /* filter 2 */ },
        // ...
      ]
    }

This filter type is useful for combining "locale" filters with different
includeScripts or includeChildren options.
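
To make the matching behavior of the file name, regex, and union filters
concrete, here is a minimal sketch that evaluates a parsed filter against a
file name without directory or extension. It is an illustration of the
semantics described above, not the tool's implementation. The function name
`matches` is hypothetical, the use of `re.match` is an assumption about how
patterns are applied, and the "language" and "locale" filter types are left
out.

```python
import re


def matches(file_filter, stem):
    """Return True if a file stem is kept by the filter (hypothetical helper)."""
    # Plain strings are the shorthand inclusion and exclusion filters.
    if file_filter == "include":
        return True
    if file_filter == "exclude":
        return False

    filter_type = file_filter.get("filterType")
    if filter_type == "exclude":
        return False
    if filter_type == "union":
        # A union keeps a file if at least one sub-filter keeps it.
        return any(matches(sub, stem) for sub in file_filter["unionOf"])

    if filter_type is None:
        # File name filter: entries are compared to the stem exactly.
        def entry_matches(entry):
            return entry == stem
    elif filter_type == "regex":
        # Assumption: patterns are applied with re.match.
        def entry_matches(entry):
            return re.match(entry, stem) is not None
    else:
        raise ValueError("filter type not covered by this sketch: " + filter_type)

    if "includelist" in file_filter:
        return any(entry_matches(e) for e in file_filter["includelist"])
    return not any(entry_matches(e) for e in file_filter["excludelist"])


# Stems below are illustrative inputs.
burmese_only = {"includelist": ["burmesedict"]}
no_cj = {"filterType": "regex", "excludelist": [r"^.*_cj$"]}
print(matches(burmese_only, "burmesedict"), matches(no_cj, "line_cj"))
```

Anchoring patterns with `^` and `$`, as in the example above, keeps the result
the same regardless of whether the engine matches from the start of the string
or searches within it.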

#### Locale-Tree Categories

Several categories have the `_tree` suffix. These categories are for "locale
trees": they contain locale-specific data. ***The [localeFilter configuration
option](#slicing-data-by-locale) sets the default file filter for all `_tree`
categories.***

If you want to include different locales for different locale file trees, you
can override their filter in the *featureFilters* section of the config file.
For example, to include only Italian data for currency symbols *instead of*
the common locales specified in *localeFilter*, you can do the following:

    featureFilters: {
      curr_tree: {
        filterType: locale
        includelist: [
          it
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

You can exclude an entire `_tree` category without affecting other categories.
For example, to exclude region display names:

    featureFilters: {
      region_tree: {
        filterType: exclude
      }
    }

Note that you are able to use any of the other filter types for `_tree`
categories, but you must be very careful that you are including all of the
correct files. For example, `en_GB` requires `en_001`, and you must always
include `root`. If you use the "language" or "locale" filter types, this
logic is done for you.

### Resource Bundle Slicing (fine-grained features)

The third section of the ICU filter config file is *resourceFilters*. With
this section, you can dive inside resource bundle files to remove even more
data.

You can apply resource filters to all locale tree categories as well as to
categories that include resource bundles, such as the `"misc"` category.

For example, consider measurement units. There is one unit file per locale (example:
[en.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unit/en.txt)),
and that file contains data for all measurement units in CLDR. However, if
you are only formatting distances, for example, you may need the data for only
a small set of units.

Here is how you could include units of length in the "short" style but no
other units:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/unitsShort/length
        ]
      }
    ]

Conceptually, the rules are applied from top to bottom. First, all data for
all three styles of units are removed, and then the short length units are
added back.

**NOTE:** In subtractive mode, resource paths are *included* by default. In
additive mode, resource paths are *excluded* by default.

#### Wildcard Character

You can use the wildcard character (`*`) to match a piece of the resource
path.
For example, to include length units for all three styles, you can do:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/*/length
        ]
      }
    ]

The wildcard must be the only character in its path segment. Future ICU
versions may expand the syntax.

#### Resource Filter for Specific File

The resource filter object takes an optional *files* setting which accepts a
file filter in the same syntax used above for file filtering.
For example, if you wanted to apply a filter to misc/supplementalData.txt, you
could do the following (this example removes calendar data):

    resourceFilters: [
      {
        categories: ["misc"]
        files: {
          includelist: ["supplementalData"]
        }
        rules: [
          -/calendarData
        ]
      }
    ]

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Combining Multiple Resource Filter Specs

You can also list multiple resource filter objects in the *resourceFilters*
array; the filters are added from top to bottom.
For example, here is an advanced configuration that includes "mile" for en-US
and "kilometer" for en-CA; this also makes use of the *files* option:

    resourceFilters: [
      {
        categories: ["unit_tree"]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_US"]
        }
        rules: [
          +/*/length/mile
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_CA"]
        }
        rules: [
          +/*/length/kilometer
        ]
      }
    ]

The above example would give en-US these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile

and en-CA these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/kilometer

In accordance with *filterType* "locale", the parent locales *en* and *root*
would get both units; this is required since both en-US and en-CA may inherit
from the parent locale:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile
    +/*/length/kilometer

## Debugging Tips

**Run Python directly:** If you do not want to wait for ./runConfigureICU to
finish, you can directly re-generate the rules using your filter file with the
following command line run from *icu4c/source*.

    $ PYTHONPATH=python python3 -m icutools.databuilder \
      --mode=gnumake --src_dir=data > data/rules.mk

**Install jsonschema:** Install the `jsonschema` pip package to get warnings
about problems with your filter file.

**See what data is being used:** ICU is instrumented to allow you to trace
which resources are used at runtime. This can help you determine what data you
need to include. For more information, see [tracing.md](tracing.md).

**Inspect data/rules.mk:** The Python script outputs the file *rules.mk*
inside *icu4c/source/data*. To see what is going to get built, you can inspect
that file. First build ICU normally, and copy *rules.mk* to
*rules_default.mk*. Then build ICU with your filter file.
Now you can take the
diff between *rules_default.mk* and *rules.mk* to see exactly what your filter
file is removing.

**Inspect the output:** After a `make clean` and `make` with a new *rules.mk*,
you can look inside the directory *icu4c/source/data/out* to see the files
that got built.

**Inspect the compiled resource filter rules:** If you are using a resource
filter, the resource filter rules get compiled for each individual locale
inside *icu4c/source/data/out/tmp/filters*. You can look at those files to see
what filter rules are being applied to each individual locale.

**Run genrb in verbose mode:** For debugging a resource filter, you can run
genrb in verbose mode to see which resources got stripped. To do this, first
inspect the make output and find a command line like this:

    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/genrb --filterDir ./out/tmp/filters/unit_tree -s ./unit -d ./out/build/icudt64l/unit/ -i ./out/build/icudt64l --usePoolBundle ./out/build/icudt64l/unit/ -k en.txt

Copy that command line and re-run it from *icu4c/source/data* with the `-v`
flag added to the end. The command will print out exactly which resource paths
are being included and excluded as well as a model of the filter rules applied
to this file.

**Inspect .res files with derb:** The `derb` tool can convert .res files back
to .txt files after filtering. For example, to convert the above unit res file
back to a txt file, you can run this command from *icu4c/source*:

    LD_LIBRARY_PATH=lib bin/derb data/out/build/icudt64l/unit/en.res

That will produce a file *en.txt* in your current directory, which is the
original *data/unit/en.txt* but after resource filters were applied.

*Tip:* derb expects your res files to be rooted in a directory named
`icudt64l` (corresponding to your current ICU version and endianness). If your
files are not in such a directory, derb fails with U_MISSING_RESOURCE_ERROR.

**Put complex rules first** and **use the wildcard `*` sparingly:** The order
of the filter rules matters a great deal in how effective your data size
reduction can be, and the wildcard `*` can sometimes produce behavior that is
tricky to reason about.
For example, these three lists of filter rules look
similar at first glance but actually produce different output:

<table>
<tr>
<th>Unit Resource Filter Rules</th>
<th>Unit Resource Size</th>
<th>Commentary</th>
<th>Result</th>
</tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/digital/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>77 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from digital units. Then, remove duration
unit patterns and long and narrow forms.
</td><td>
Digital units in short form are included; all other units are removed.
</td></tr>
<tr><td><pre>
-/durationUnits
-/units
-/unitsNarrow
-/*/*
+/*/digital
-/*/digital/*/dnam
</pre></td><td>125 KiB</td><td>
First, remove duration unit patterns and long and narrow forms. Then, remove
all unit types. Then, add back digital units across all unit widths. Then,
remove display names from digital units.
</td><td>
Digital units are included <em>in all widths</em>; all other units are removed.
</td></tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/*/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>191 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from all units. Then, remove duration unit
patterns and long and narrow forms.
</td><td>
Digital units in short form are included, as is the <em>tree structure</em>
for all other units, even though the other units have no real data.
</td></tr>
</table>

By design, empty tree structure is retained in the unit bundle. This is
because there are numerous instances in ICU data where the presence of an
empty tree carries meaning. However, it means that you must be careful when
building resource filter rules in order to achieve the optimal data bundle
size.

Using the `-v` option in genrb (described above) is helpful when debugging
these types of issues.

## Other Features of the ICU Data Build Tool

While data filtering is the primary reason the ICU Data Build Tool was
developed, there are additional use cases.

### Running Data Build without Configure/Make

You can build the dat file outside of the ICU build system by directly
invoking the Python icutools.databuilder. Run the following command to see the
help text for the CLI tool:

    $ PYTHONPATH=path/to/icu4c/source/python python3 -m icutools.databuilder --help

### Collation UCAData

For using collation (sorting and searching) in any language, the "root"
collation data file must be included. It provides the Unicode CLDR default
sort order for all code points, and forms the basis for language-specific
tailorings as well as for custom collators built at runtime.

There are two versions of the root collation data file:

- ucadata-unihan.txt (compiled size: 511 KiB)
- ucadata-implicithan.txt (compiled size: 178 KiB)

The unihan version sorts Han characters in radical-stroke order according to
Unicode, which is a somewhat useful default sort order, especially for use
with non-CJK languages. The implicithan version sorts Han characters in the
order of their Unicode assignment, which is similar to radical-stroke order
for common characters but arbitrary for others. For more information, see
[UTS #10 §10.1.3](https://www.unicode.org/reports/tr10/#Implicit_Weights).

By default, the unihan version is used. The unihan version of the data file
is much larger than that for implicithan, so if you need collation but also
small data, then you may want to select the implicithan version. To use the
implicithan version, put the following setting in your *filters.json* file:

    {
      "collationUCAData": "implicithan"
    }

### Disable Pool Bundle

By default, ICU uses a "pool bundle" to store strings shared between locales.
This saves space and is recommended for most users. However, when developing
a system where locale data files may be added "on the fly" and not included in
the original ICU distribution, those additional data files may not be able to
use a pool bundle due to name collisions with the existing pool bundle.

To disable the pool bundle in the current ICU build, put the following setting
in your *filters.json* file:

    {
      "usePoolBundle": false
    }

### File Substitution

Using the configuration file, you can perform whole-file substitutions. For
example, suppose you want to replace the transliteration rules for
*Zawgyi_my*.
You could create a directory called `my_icu_substitutions`
containing your new `Zawgyi_my.txt` rule file, and then put this in your
configuration file:

    fileReplacements: {
      directory: "/path/to/my_icu_substitutions"
      replacements: [
        {
          src: "Zawgyi_my.txt"
          dest: "translit/Zawgyi_my.txt"
        },
        "misc/dayPeriods.txt"
      ]
    }

`directory` should either be an absolute path, or a path starting with one of
the following, and it should not contain a trailing slash:

- "$SRC" for the *icu4c/source/data* directory in the source tree
- "$FILTERS" for the directory containing filters.json
- "$CWD" for your current working directory

When the entry in the `replacements` array is an object, the `src` and `dest`
fields indicate, for each file in the source directory (`src`), what file in
the ICU hierarchy it should replace (`dest`). When the entry is a string, the
same relative path is used for both `src` and `dest`.

Whole-file substitution happens before all other filters are applied.
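
The `directory` prefixes and the two forms of `replacements` entries described above can be illustrated with a small Python sketch. It is only a model of the documented behavior, not the tool's own code: `resolve_directory` and `normalize_replacements` are hypothetical helper names, and the base paths passed in are placeholders chosen for the example.

```python
import os


def resolve_directory(directory, src_dir, filters_dir, cwd=None):
    """Expand a leading $SRC, $FILTERS, or $CWD in `directory`.

    Hypothetical helper: absolute paths are returned unchanged, and any
    other relative path is rejected.
    """
    prefixes = {
        "$SRC": src_dir,          # icu4c/source/data in the source tree
        "$FILTERS": filters_dir,  # directory containing filters.json
        "$CWD": cwd if cwd is not None else os.getcwd(),
    }
    for prefix, base in prefixes.items():
        if directory == prefix or directory.startswith(prefix + "/"):
            return base + directory[len(prefix):]
    if not os.path.isabs(directory):
        raise ValueError(
            "directory must be absolute or start with $SRC, $FILTERS, or $CWD")
    return directory


def normalize_replacements(replacements):
    """Turn each entry into a (src, dest) pair.

    An object supplies both fields; a plain string uses the same relative
    path for src and dest.
    """
    pairs = []
    for entry in replacements:
        if isinstance(entry, str):
            pairs.append((entry, entry))
        else:
            pairs.append((entry["src"], entry["dest"]))
    return pairs
```

For the configuration shown above, `normalize_replacements` would pair `Zawgyi_my.txt` with `translit/Zawgyi_my.txt`, and `misc/dayPeriods.txt` with itself.
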