1---
2layout: default
3title: ICU Data
4nav_order: 1600
5has_children: true
6---
7<!--
8© 2020 and later: Unicode, Inc. and others.
9License & terms of use: http://www.unicode.org/copyright.html
10-->
11
12# ICU Data
13{: .no_toc }
14
15## Contents
16{: .no_toc .text-delta }
17
181. TOC
19{:toc}
20
21---
22
23## Overview
24
25ICU makes use of a wide variety of data tables to provide many of its services.
26Examples include converter mapping tables, collation rules, transliteration
27rules, break iterator rules and dictionaries, and other locale data. Additional
28data can be provided by users, either as customizations of ICU's data or as new
29data altogether.
30
31This section describes how ICU data is stored and located at run time. It also
32describes how ICU data can be customized to suit the needs of a particular
33application.
34
35For simple use of ICU's predefined data, this section on data management can
36safely be skipped. The data is built into a library that is loaded along with
37the rest of ICU. No specific action or setup is required of either the
38application program or the execution environment.
39
40Update: as of ICU 64, the standard data library is over 20 MB in size. We have
41introduced a new tool, the [ICU Data Build Tool](./buildtool.md),
42to give you more control over what goes into your ICU locale data file.
43
44> :point_right: **Note**: ICU for C by default comes with pre-built data.
45> The source data files are included as an "icu\*data.zip" file starting in ICU4C 49.
> Previously, they were not included unless ICU was downloaded from the [source repository](https://icu.unicode.org/repository).
47
48## ICU and CLDR Data
49
Most of ICU's data is sourced from the [Common Locale Data Repository
(CLDR)](http://cldr.unicode.org) project. Do not file bugs against ICU to
request data changes in CLDR; use the CLDR project's own page instead. Because
most ICU data files are autogenerated from CLDR, manually editing them is not
usually recommended.
55
56Data which is NOT sourced from CLDR includes:
57
58*   [Conversion Data](conversion/data.md)
*   Break Iterator Dictionary Data (Thai, CJK, etc.)
60*   Break Iterator Rule Data (as of this writing, it is manually kept in sync
61    with the CLDR datasets)
62
63For information on building ICU data from CLDR, see the
64[cldr-icu-readme](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt).
65
66## ICU Data Directory
67
68The ICU data directory is the default location for all ICU data. Any requests
69for data items that do not include an explicit directory path will be resolved
70to files located in the ICU data directory.
71
72The ICU data directory is determined as follows:
73
741.  If the application has called the function `u_setDataDirectory()`, use the
75    directory specified there, otherwise:
76
772.  If the environment variable `ICU_DATA` is set, use that, otherwise:
78
793.  If the C preprocessor variable `ICU_DATA_DIR` was set at the time ICU was
80    built, use its compiled-in value.
81
824.  Otherwise, the ICU data directory is an empty string. This is the default
83    behavior for ICU using a shared library for its data and provides the
84    highest data loading performance.
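
The precedence above can be sketched in plain C. This is only an illustrative
model under stated assumptions, not ICU's implementation:
`resolve_data_directory()` and its three parameters are hypothetical names that
stand in for the value passed to `u_setDataDirectory()`, the value of the
`ICU_DATA` environment variable, and the compiled-in `ICU_DATA_DIR` value.

```c
#include <stddef.h>

/*
 * Illustrative model of the lookup order described above; not ICU code.
 * Each argument is NULL when the corresponding source is not available:
 *   set_by_app   - directory passed to u_setDataDirectory(), if it was called
 *   env_icu_data - value of the ICU_DATA environment variable, if it is set
 *   compiled_in  - value of ICU_DATA_DIR at ICU build time, if it was defined
 */
const char *resolve_data_directory(const char *set_by_app,
                                   const char *env_icu_data,
                                   const char *compiled_in) {
    if (set_by_app != NULL) {
        return set_by_app;      /* step 1: explicit application setting */
    }
    if (env_icu_data != NULL) {
        return env_icu_data;    /* step 2: environment variable */
    }
    if (compiled_in != NULL) {
        return compiled_in;     /* step 3: compiled-in default */
    }
    return "";                  /* step 4: empty string */
}
```

In a real program the second argument would typically come from
`getenv("ICU_DATA")`; the sketch takes it as a parameter only to keep the
precedence easy to see.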
85
86> :point_right: **Note**: `u_setDataDirectory()` is not thread-safe. Call it
87> *before* calling ICU APIs from multiple threads. If you use both
88> `u_setDataDirectory()` and `u_init()`, then use `u_setDataDirectory()` first.
89> 
90> *Earlier versions of ICU supported two additional schemes: setting a data
91> directory relative to the location of the ICU shared libraries, and on Windows,
92> taking a location from the registry. These have both been removed to make the
93> behavior more predictable and easier to understand.*
94
95The ICU data directory does not need to be set in order to reference the
96standard built-in ICU data. Applications that just use standard ICU capabilities
97(converters, locales, collation, etc.) but do not build and reference their own
98data do not need to specify an ICU data directory.
99
100### Multiple-Item ICU Data Directory Values
101
102The ICU data directory string can contain multiple directories as well as .dat
103path/filenames. They must be separated by the path separator that is used on the
104platform, for example a semicolon (`;`) on Windows. Data files will be searched in
105all directories and .dat package files in the order of the directory string. For
106details, see the example below.
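
As a rough illustration of how such a string breaks down, the following sketch
splits a data directory string on a separator and tells directories apart from
.dat package files. The helper names (`split_data_directory()`,
`is_dat_package()`) and the sample paths are hypothetical, and the code is a
simplified model rather than ICU's own parsing logic.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ITEM_LEN 256

/* Returns 1 if the entry names a .dat package file, 0 otherwise. */
int is_dat_package(const char *entry) {
    size_t len = strlen(entry);
    return len >= 4 && strcmp(entry + len - 4, ".dat") == 0;
}

/*
 * Splits a data directory string on `sep` and copies up to `max_items`
 * entries into `items`, preserving their order. Empty entries and entries
 * that do not fit into MAX_ITEM_LEN bytes are skipped.
 * Returns the number of entries stored.
 */
size_t split_data_directory(const char *dirs, char sep,
                            char items[][MAX_ITEM_LEN], size_t max_items) {
    size_t count = 0;
    const char *start = dirs;
    while (*start != '\0' && count < max_items) {
        const char *end = strchr(start, sep);
        size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);
        if (len > 0 && len < MAX_ITEM_LEN) {
            memcpy(items[count], start, len);
            items[count][len] = '\0';
            count++;
        }
        if (end == NULL) {
            break;              /* last entry processed */
        }
        start = end + 1;        /* continue after the separator */
    }
    return count;
}
```

For the hypothetical Windows-style string `C:\app\data;C:\app\extra.dat` and
the separator `;`, this yields two entries in search order: a directory
followed by a .dat package file.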
107
108## Default ICU Data
109
110The default ICU data consists of the data needed for the converters, collators,
111locales, etc. that are provided with ICU. Default data must be present in order
112for ICU to function.
113
114The default data is most commonly built into a shared library that is installed
115with the other ICU libraries. Nothing is required of the application for this
116mechanism to work. ICU provides additional options for loading the default data
117if more flexibility is required.
118
119Here are the steps followed by ICU to locate its default data. This procedure
120happens only once per process, at the time an ICU data item is first requested.
121
1221.  If the application has called the function `udata_setCommonData()`, use the
123    data that was provided. The application specifies the address in memory of
124    an image of an ICU common format data file (either in shared-library format
125    or .dat package file format).
126
2.  Examine the contents of the default ICU data shared library. If it contains
    data, use that data. If the data library is empty (that is, it is a stub
    library), proceed to the next step. (A data shared library must always be
    present in order for ICU to successfully link and load. A stub data library
    is used when the actual ICU common data is to be provided from another
    source.)
132
1333.  Dynamically load (memory map, typically) a common format (.dat) file
134    containing the default ICU data. Loading is described in the section
135    [How Data Loading Works](#how-data-loading-works). The path to
136    the data is of the form  "icudt\<version\>\<flag\>", where \<version\> is
137    the two-digit ICU version number, and \<flag\> is a letter indicating the
138    internal format of the file (see the
139    [Sharing ICU Data Between Platforms](#sharing-icu-data-between-platforms)
140    section).
141
142Once the default ICU data has been located, loading of individual data items
143proceeds as described in the section
144[How Data Loading Works](#how-data-loading-works).
145
146## Building and Linking against ICU data
147
When using ICU's `configure` or `runConfigureICU` tool to build, several
different methods of packaging are available.
150
151> :point_right: **Note**: in all cases, you **must** link all ICU tools and
152applications against a "data library": either a data library containing the ICU
153data, or against the "stubdata" library located in icu/source/stubdata. For
154example, even if ICU is built in "files" mode, you must still link against the
155"stubdata" library or an undefined symbol error occurs.
156
157*   `--with-data-packaging=library`
158    This mode builds a shared library (DLL or .so). This is the simplest mode to
159    use, and is the default.
160    To use: link your application against the common and data libraries.
161    This is the only directly supported behavior on Windows builds.
162*   `--with-data-packaging=static`
163    This option builds ICU data as a single (large) static library. This mode is
164    more complex to use. If you encounter errors, you may need to build ICU
165    multiple times.
166*   `--with-data-packaging=files`
167    With this option, ICU outputs separate individual files (.res, .cnv, etc)
168    which will be loaded at runtime. Read the rest of this document, especially
169    the sections that discuss the ICU directory path.
170*   `--with-data-packaging=archive`
171    With this option, ICU outputs a single "icudt__.dat" file containing ICU
172    data. Read the rest of this document, especially the sections that discuss
173    the ICU directory path.
174
175## Time Zone Data
176
177Because time zone data requires frequent updates in response to countries
178changing their transition dates for daylight saving time, ICU provides
179additional options for loading time zone data from separate files, thus avoiding
180the need to update a combined ICU data package. Further information is found
181under [Time Zones](../datetime/timezone/index.md).
182
183## Application Data
184
185ICU-based applications can ship and use their own data for localized strings,
186custom conversion tables, etc. Each data item file must have a package name as a
187prefix, and this package name must match the basename of a .dat package file, if
188one is used. The package name must be used in ICU APIs, for example in
189`udata_setAppData()` (instead of `udata_setCommonData()` which is only used for
190ICU's own data) and in the pathname argument of `ures_open()`.
191
192The only real difference to ICU's own data is that application data cannot be
193simply loaded by specifying a NULL value for the path arguments of ICU APIs, and
194application data will not be used by APIs that do not have path/package name
195arguments at all.
196
The most important APIs that allow application data to be used are for Resource
Bundles, which are most often used for localized strings and other data. There
are also functions like `ucnv_openPackage()` that allow application data to be
specified, and the `udata.h` API can be used to load any data with minimum
requirements on the binary format, and without ICU interpreting the contents of
the data.
203
204The `pkgdata` tool, which is used to package the data into various formats (e.g.
205shared library), has an option (`--without-assembly` or `-w`) to not use
206assembly code when building and packaging the application specific data into a
207shared library. Building the data with assembly code, which is enabled by
208default, is faster and more efficient; however, there are some platform
209specific issues that may arise. The `--without-assembly` option may be
210necessary on certain platforms (e.g. Linux) which have trouble properly loading
211application data when it was built with assembly code and is packaged as a
212shared library.
213
214## Alignment
215
216ICU data is designed to be 16-aligned, with natural alignment of values inside
217the data structure, so that the data is usable as is when memory-mapped.
218("16-aligned" means that the start address is a multiple of 16 bytes.)
219
Memory-mapping (as well as memory allocation) provides at least 16-alignment on
modern platforms. Some CPUs require n-alignment of types of size n bytes (and
crash on unaligned reads); other CPUs usually operate faster on data that is
aligned properly.
224
225Some of the ICU code explicitly checks for proper alignment.
226
227The `icupkg` tool places data items into the .dat file at start offsets that are
228multiples of 16 bytes.
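
The two notions used here, "the start address is a multiple of 16" and "the
next offset that is a multiple of 16", can be written down in a few lines of C.
This is a minimal sketch with hypothetical helper names; it does not reproduce
the checks in ICU's own code or in `icupkg`.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address is a multiple of 16 bytes, 0 otherwise. */
int is_16_aligned(const void *p) {
    return ((uintptr_t)p % 16u) == 0;
}

/* Rounds an offset up to the next multiple of 16 (unchanged if already one). */
size_t align_up_16(size_t offset) {
    return (offset + 15u) & ~(size_t)15u;
}
```

With this rounding, an item that would otherwise start at offset 17 is placed
at offset 32, and up to 15 padding bytes precede it.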
229
230When using `genccode` to directly write a .o/.obj file, or to write assembler
231code, it specifies at least 16-alignment. When using `genccode` to write C code,
232it prepends the data with a double value which should yield at least 8-alignment
233on most platforms (usually `sizeof(double)=8`).
234
235## Flexibility vs. Installation vs. Performance
236
237There are choices that affect ICU data loading and depend on application
238requirements.
239
240### Data in Shared Libraries/DLLs vs. .dat package files
241
242Building ICU data into shared libraries (`--with-data-packaging=library`) is the
243most convenient packaging method because shared libraries (DLLs) are easily
244found if they are in the same directory as the application libraries, or if they
are on the system library path. The application installer usually just copies
the ICU shared libraries to the same place. On the other hand, shared libraries
are not portable across platforms.
248
249Packaging data into .dat files (`--with-data-packaging=archive`) allows them to
250be shared across platforms, but they must either be loaded by the application
251and set with `udata_setCommonData()` or `udata_setAppData()`, or they must be
252in a known location that is included in the ICU data directory string. This
253requires the application installer, or the application itself at runtime, to
254locate the ICU and/or application data by setting the ICU data directory (see
255the [ICU Data Directory](#icu-data-directory) section above) or by
256loading the data and providing it to one of the `udata_setXYZData()` functions.
257
Unlike shared libraries, .dat package files can be taken apart into separate
data item files with the `decmn` ICU tool. This allows post-installation
modification of a package file. The `gencmn` and `pkgdata` ICU tools can then be
used to reassemble the .dat package file.
262
263For more information about .dat package files see the section [Sharing ICU Data
264Between Platforms](#sharing-icu-data-between-platforms) below.
265
266### Data Overriding vs. Loading Performance
267
268If the ICU data directory string is empty, then ICU will not attempt to load
269data from the file system. It is then only possible to load data from the
270linked-in shared library or via `udata_setCommonData()` and
271`udata_setAppData()`. This is inflexible but provides the highest performance.
272
If the ICU data directory string is not empty, then data items are searched for
in all of the directories and matching .dat files that it mentions before
already-loaded package files are checked. This allows overriding of packaged
data items with single files after installation but costs some time for
filesystem accesses. This is usually done only once per data item; see
[User Data Caching](#user-data-caching) below.
279
280### Single Data Files vs. Packages
281
282Single data files (`--with-data-packaging=files`) are easy to replace and can
283override items inside data packages. However, it is usually desirable to reduce
284the number of files during installation, and package files use less disk space
285than many small files.
286
287## How Data Loading Works
288
ICU data items are referenced by three names: a path, a name, and a type. The
following are some examples:
291
292path                         |   name   | type
293-----------------------------|----------|-------
294 c:\\some\\path\\dataLibName | test     | dat
295 no path                     | cnvalias | icu
296 no path                     | cp1252   | cnv
297 no path                     | en       | res
298 no path                     | uprops   | icu
299
300
301Items with 'no path' specified are loaded from the default ICU data.
302
303Application data items include a path, and will be loaded from user data files,
304not from the ICU default data. For application data, the path argument need not
305contain an actual directory, but must contain the application data's package
306name after the last directory separator character (or by itself if there is no
307directory). If the path argument contains a directory, then it is logically
308prepended to the ICU data directory string and searched first for data. The path
309argument can contain at most one directory. (Path separators like semicolon (;)
310are not handled here.)
311
312> :point_right: **Note**: The ICU data directory string itself may
313contain multiple directories and path/filenames to .dat package files. See the
314[ICU Data Directory](#icu-data-directory) section.
315
316It is recommended to not include the directory in the path argument but to make
317sure via setting the application data or the ICU data directory string that the
318data can be located. This simplifies program maintenance and improves
319robustness.
320
321See the API descriptions for the functions `udata_open()` and
322`udata_openChoice()` for additional information on opening ICU data from within
323an application.
324
325Data items can exist as individual files, or a number of them can be packaged
326together in a single file for greater efficiency in loading and convenience of
327distribution. The combined files are called Common Files.
328
329Based on the supplied path and name, ICU searches several possible locations
330when opening data. To make things more concrete in the following descriptions,
331the following values of path, name and type are used:
332
333```
334path = "c:\\some\\path\\dataLibName"
335name = "test"
336type = "res"
337```
338
339In this case, "dataLibName" is the "package name" part of the path argument, and
340"c:\\some\\path\\" is the directory part of it.
341
342The search sequence for the data for "test.res" is as follows (the first
343successful loading attempt wins):
344
1.  Try to load the file "dataLibName_test.res" from c:\\some\\path\\.

2.  Try to load the file "dataLibName_test.res" from each of the directories in
    the ICU data directory string.

3.  Try to locate the data package for the package name "dataLibName".

    1.  Try to locate the data package in the internal cache.

    2.  Try to load the package file "dataLibName.dat" from c:\\some\\path\\.

    3.  Try to load the package file "dataLibName.dat" from each of the
        directories in the ICU data directory string.
358
359The first steps, loading the data item from an individual file, are omitted if
360no directory is specified in either the path argument or the ICU data directory
361string.
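
To make the ordering concrete, the following sketch lists the file names that
would be tried for one data item, in order. It is a simplified model built only
from the description in this section: `list_candidates()` is a hypothetical
helper, the internal package cache is not modeled, and every directory is
assumed to be passed with its trailing separator.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_CANDIDATE_LEN 512

/*
 * Fills `out` with the file names that would be tried, in order.
 *   path_dir  - directory part of the path argument (with trailing separator),
 *               or NULL if the path argument contains no directory
 *   package   - package name part of the path argument
 *   data_dirs - directories from the ICU data directory string
 * Returns the number of candidates written (at most max_out).
 */
size_t list_candidates(const char *path_dir, const char *package,
                       const char *name, const char *type,
                       const char *const data_dirs[], size_t n_data_dirs,
                       char out[][MAX_CANDIDATE_LEN], size_t max_out) {
    size_t n = 0;
    size_t i;

    /* Individual item files, named <package>_<name>.<type>. */
    if (path_dir != NULL && n < max_out) {
        snprintf(out[n], MAX_CANDIDATE_LEN, "%s%s_%s.%s",
                 path_dir, package, name, type);
        n++;
    }
    for (i = 0; i < n_data_dirs && n < max_out; i++) {
        snprintf(out[n], MAX_CANDIDATE_LEN, "%s%s_%s.%s",
                 data_dirs[i], package, name, type);
        n++;
    }

    /* Package files, named <package>.dat (tried after the cache lookup). */
    if (path_dir != NULL && n < max_out) {
        snprintf(out[n], MAX_CANDIDATE_LEN, "%s%s.dat", path_dir, package);
        n++;
    }
    for (i = 0; i < n_data_dirs && n < max_out; i++) {
        snprintf(out[n], MAX_CANDIDATE_LEN, "%s%s.dat", data_dirs[i], package);
        n++;
    }
    return n;
}
```

For the example values above and a single hypothetical data directory
`d:\icu\`, the candidates are `c:\some\path\dataLibName_test.res`,
`d:\icu\dataLibName_test.res`, `c:\some\path\dataLibName.dat`, and
`d:\icu\dataLibName.dat`, in that order. With no directory at all, the list is
empty, which matches the note that the individual-file steps are then omitted.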
362
363Package files are loaded at most once and then cached. They are identified only
364by their package name. Whenever a data item is requested from a package and that
365package has been loaded before, then the cached package is used immediately
366instead of searching through the filesystem.
367
368> :point_right: **Note**: ICU versions before 2.2 always searched data packages
369before looking for individual files, which made it impossible to override
370packaged data items. See the ICU 2.2 download page and the readme for more
371information about the changes.
372
373## User Data Caching
374
375Once loaded, data package files are cached, and stay loaded for the duration of
376the process. Any requests for data items from an already loaded data package
377file are routed directly to the cached data. No additional search for loadable
378files is made.
379
380The user data cache is keyed by the base file name portion of the requested
381path, with any directory portion stripped off and ignored. Using the previous
382example, for the path name "c:\\some\\path\\dataLibName", the cache key is
383"dataLibName". After this is cached, a subsequent request for "dataLibName", no
384matter what directory path is specified, will resolve to the cached data.
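
The key derivation can be pictured as stripping everything up to the last
directory separator. The sketch below is an assumption-laden illustration:
`package_cache_key()` is a hypothetical name, and accepting both `/` and `\`
as separators is a simplification made here, not a statement about what ICU
does on a given platform.

```c
#include <string.h>

/*
 * Returns a pointer to the base name inside `path`: the text after the last
 * '/' or '\\', or the whole string if it contains no separator.
 */
const char *package_cache_key(const char *path) {
    const char *slash = strrchr(path, '/');
    const char *backslash = strrchr(path, '\\');
    const char *last = slash;
    if (backslash != NULL && (last == NULL || backslash > last)) {
        last = backslash;       /* the backslash occurs later in the string */
    }
    return (last != NULL) ? last + 1 : path;
}
```

Under this model, `c:\some\path\dataLibName` and a bare `dataLibName` map to
the same key, which is why the directory of a later request has no effect once
the package is cached.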
385
386Data can be explicitly added to the cache of common format data by means of the
387`udata_setAppData()` function. This function takes as input the path (name) and
388a pointer to a memory image of a .dat file. The data is added to the cache,
389causing any subsequent requests for data items from that file name to be routed
390to the cache.
391
392Only data package files are cached. Separate data files that contain just a
393single data item are not cached; for these, multiple requests to ICU to open the
394data will result in multiple requests to the operating system to open the
395underlying file.
396
However, most ICU services (Resource Bundles, conversion, etc.) themselves cache
loaded data, so that data is usually loaded only once until the end of the
process (or until `u_cleanup()`, `ucnv_flushCache()`, or a similar function is
called).
400
401There is no mechanism for removing or updating cached data files.
402
403## Directory Separator Characters
404
405If a directory separator (generally '/' or '\\') is needed in a path parameter,
406use the form that is native to the platform. The ICU header `"putil.h"` defines
407`U_FILE_SEP_CHAR` appropriately for the platform.
408
409> :point_right: **Note**: On Windows, the directory separator must be '\\' for
410any paths passed to ICU APIs. This is different from native Windows APIs, which
411generally allow either '/' or '\\'.
412
413## Sharing ICU Data Between Platforms
414
ICU's default data is large (over 20 MB as of ICU 64). Because it is normally
built as a shared library, the file format is specific to each platform
(operating system). The data libraries cannot be shared between platforms even
though the actual data contents are identical.
419
420By distributing the default data in the form of common format .dat files rather
421than as shared libraries, a single data file can be shared among multiple
422platforms. This is beneficial if a single distribution of the application (a CD,
423for example) includes binaries for many platforms, and the size requirements for
424replicating the ICU data for each platform are a problem.
425
426ICU common format data files are not completely interchangeable between
427platforms. The format depends on these properties of the platform:
428
4291.  Byte Ordering (little endian vs. big endian)
430
4312.  Base character set - ASCII or EBCDIC
432
433This means, for example, that ICU data files are interchangeable between Windows
434and Linux on X86 (both are ASCII little endian), or between Macintosh and
435Solaris on SPARC (both are ASCII big endian), but not between Solaris on SPARC
436and Solaris on X86 (different byte ordering).
437
438The single letter following the version number in the file name of the default
439ICU data file encodes the properties of the file as follows:
440
441```
442icudt19l.dat Little Endian, ASCII
443icudt19b.dat Big Endian, ASCII
444icudt19e.dat Big Endian, EBCDIC
445```
446
447(There are no little endian EBCDIC systems. All non-EBCDIC encodings include an
448invariant subset of ASCII that is sufficient to enable these files to
449interoperate.)
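
The naming scheme can be expressed as a small mapping from platform properties
to the flag letter. The following is a sketch only: the function names are
hypothetical, the version string is passed in by the caller, and the code does
not reflect how ICU itself derives the name.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Returns 'l', 'b', or 'e' for the given platform properties, or 0 for the
 * little endian EBCDIC combination, which has no flag letter.
 */
char data_format_flag(int is_big_endian, int is_ebcdic) {
    if (is_ebcdic) {
        return is_big_endian ? 'e' : 0;
    }
    return is_big_endian ? 'b' : 'l';
}

/* Detects the byte order of the machine running this code. */
int host_is_big_endian(void) {
    const unsigned int one = 1u;
    return *(const unsigned char *)&one == 0;
}

/*
 * Writes "icudt<version><flag>.dat" into buf.
 * Returns 0 on success, -1 if the combination is unsupported or buf is too small.
 */
int default_data_file_name(char *buf, size_t size, const char *version,
                           int is_big_endian, int is_ebcdic) {
    char flag = data_format_flag(is_big_endian, is_ebcdic);
    int written;
    if (flag == 0) {
        return -1;
    }
    written = snprintf(buf, size, "icudt%s%c.dat", version, flag);
    return (written > 0 && (size_t)written < size) ? 0 : -1;
}
```

Two platforms can share a .dat file exactly when they map to the same flag
letter; `host_is_big_endian()` shows one way an application could determine its
own byte order when deciding which file to ship.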
450
451The packaging of the default ICU data as a .dat file rather than as a shared
452library is requested by using an option in the configure script at build time.
453Nothing is required at run time; ICU finds and uses whatever form of the data is
454available.
455
456> :point_right: **Note**: When the ICU data is built in the form of shared
457libraries, the library names have platform-specific prefixes and suffixes. On
458Unix-style platforms, all the libraries have the "lib" prefix and one of the
459usual (".dll", ".so", ".sl", etc.) suffixes. Other than these prefixes and
460suffixes, the library names are the same as the above .dat files.
461
462## Customizing ICU's Data Library
463
464ICU includes a standard library of data that is about 16 MB in size. Most of
465this consists of conversion tables and locale information. The data itself is
466normally placed into a single shared library.
467
468Update: as of ICU 64, the standard data library is over 20 MB in size. We have
469introduced a new tool, the [ICU Data Build Tool](./buildtool.md),
470to replace the makefiles explained below and give you more control over what
471goes into your ICU locale data file.
472
473### Adding Converters to ICU
474
475The first step is to obtain or create a .ucm (source) mapping data file for the
476desired converter. A large archive of converter data is maintained by the ICU
477team at <https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm>
478
479We will use `solaris-eucJP-2.7.ucm`, available from the repository mentioned
480above, as an example.
481
482#### Build the Converter
483
Converter source files are compiled into binary converter files (.cnv files) by
using the ICU tool `makeconv`. For the example, you can use this command:
486
487```
488makeconv -v solaris-eucJP-2.7.ucm
489```
490
491Some of the .ucm files from the repository will need additional header
492information before they can be built. Use the error messages from the makeconv
493tool, .ucm files for similar converters, and the ICU user guide documentation of
494.ucm files as a guide when making changes. For the `solaris-eucJP-2.7.ucm`
495example, we will borrow the missing header fields from
496`source/data/mappings/ibm-33722_P12A-2000.ucm`, which is the standard ICU eucJP
497converter data.
498
499The ucm file format is described in the
500["Conversion Data" chapter](../conversion/data.md) of this user guide.
501
502After adjustment, the header of the `solaris-eucJP-2.7.ucm` file contains these
503items:
504
505```
506<code_set_name>   "solaris-eucJP-2.7"
<subchar>         \x3F
508<uconv_class>     "MBCS"
509
510<mb_cur_max>      3
511<mb_cur_min>      1
512
513<icu:state>       0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
514<icu:state>       a1-fe
515<icu:state>       a1-e4
516<icu:state>       a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
517<icu:state>       a1-fe
518```
519
520The binary converter file produced by the `makeconv` tool is
521`solaris-eucJP-2.7.cnv`.
522
523#### Installation
524
525Copy the new .cnv file to the desired location for use. Set the environment
526variable `ICU_DATA` to the directory containing the data, or, alternatively,
527from within an application, tell ICU the location of the new data with the
528function `u_setDataDirectory()` before using the new converter.
529
530If ICU is already obtaining data from files rather than a shared library,
531install the new file in the same location as the existing ICU data file(s), and
532don't change/set the environment variable or data directory.
533
If you do not want to add a converter to ICU's base data, you can also generate
a conversion table with `makeconv`, use `pkgdata` to generate your own package,
and use `ucnv_openPackage()` to open a converter with that conversion table
from the generated package.
538
539#### Building the new converter into ICU
540
541The need to install a separate file and inform ICU of the data directory can be
542avoided by building the new converter into ICU's standard data library. Here is
543the procedure for doing so:
544
1.  Move the .ucm file(s) for the converter(s) to be added
    (`solaris-eucJP-2.7.ucm` for our example) into the directory
    `source/data/mappings/`.
548
5492.  Create, or edit, if it already exists, the file
550    `source/data/mappings/ucmlocal.mk`. Add this line:
551    
552    ```
553    UCM_SOURCE_LOCAL = solaris-eucJP-2.7.ucm
554    ```
555    
    Any number of converters can be listed. Extend the list to new lines with a
    backslash at the end of the line. The `ucmlocal.mk` file is described in
    more detail in `source/data/mappings/ucmfiles.mk`. (Even though they use very
    different build systems, `ucmlocal.mk` is used for both the Windows and UNIX
    builds.)
561
3.  Add the converter name and aliases to `source/data/mappings/convrtrs.txt`.
    This will allow your converter to be shown in the list of available
    converters when you call the `ucnv_getAvailableName()` function. The file
    syntax is described within the file.
566
5674.  Rebuild the ICU data.
568    For Windows, from MSVC choose the makedata project from the GUI, then build
569    the project.
570    For UNIX, `cd icu/source/data; gmake`
571
When opening an ICU converter (`ucnv_open()`), the converter name cannot be
qualified with a path that indicates the directory or common data file
containing the corresponding converter data. The required data must be present
either in the main ICU data library or as a separate .cnv file located in the
ICU data directory. This is different from opening resources or other types of
ICU data, which do allow a path.
578
579### Adding Locale Data to ICU's Data
580
581If you have data for a locale that is not included in ICU's standard build, then
582you can add it to the build in a very similar way as with conversion tables
583above. The ICU project provides a large number of additional locales in its
584[locale
585repository](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/locales/)
586on the web. Most of this locale data is derived from the CLDR ([Common Locale
587Data Repository](http://www.unicode.org/cldr/)) project.
588
589Dropping the txt file into the correct place in the source tree is sufficient to
590add it to your ICU build. You will need to re-configure in order to pick it up.
591
## Customizing ICU's Data Library for ICU 63 or earlier

The ICU data library can be easily customized, either by adding additional
converters or locales, or by removing some of the standard ones for the purpose
of saving space.
594
595> :point_right: **Note**: ICU for C by default comes with pre-built data.
596The source data files are included as an "icu\*data.zip" file starting in ICU4C
49. Previously, they were not included unless ICU was downloaded from the
598[source repository](https://github.com/unicode-org/icu). Alternatively, the
599[Data Customizer](http://apps.icu-project.org/datacustom/) may be used to
600customize the pre-built data.
601
602ICU can load data from individual data files as well as from its default
603library, so building a customized library when adding additional data is not
604strictly necessary. Adding to ICU's library can simplify application
605installation by eliminating the need to include separate files with an
606application distribution, and the need to tell ICU where they are installed.
607
608Reducing the size of ICU's data by eliminating unneeded resources can make
609sense on small systems with limited or no disk, but for desktop or server
610systems there is no real advantage to trimming. ICU's data is memory mapped
611into an application's address space, and only those portions of the data
612actually being used are ever paged in, so there are no significant RAM savings.
613As for disk space, with the large size of today's hard drives, saving a few MB
614is not worth the bother.
615
616By default, ICU builds with a large set of converters and with all available
617locales. This means that any extra items added must be provided by the
618application developer. There is no extra ICU-supplied data that could be
619specified.
620
621### Details
622
The converters and resources that ICU builds are in the following configuration
files. They are only available when building from ICU's source code repository.
Normally, the standard ICU distribution does not include these files.
626
627File                              | Description
628----------------------------------|--------------
629source/data/locales/resfiles.mk   | The standard set of locale data resource bundles
630source/data/locales/reslocal.mk   | User-provided file with additional resource bundles
631source/data/coll/colfiles.mk      | The standard set of collation data resource bundles
632source/data/coll/collocal.mk      | User-provided file with additional collation resource bundles
633source/data/brkitr/brkfiles.mk    | The standard set of break iterator data resource bundles
634source/data/brkitr/brklocal.mk    | User-provided file with additional break iterator resource bundles
635source/data/translit/trnsfiles.mk | The standard set of transliterator resource files
636source/data/translit/trnslocal.mk | User-provided file with a set of additional transliterator resource files
637source/data/mappings/ucmcore.mk   | Core set of conversion tables for MIME/Unix/Windows
638source/data/mappings/ucmfiles.mk  | Additional, large set of conversion tables for a wide range of uses
639source/data/mappings/ucmebcdic.mk | Large set of EBCDIC conversion tables
640source/data/mappings/ucmlocal.mk  | User-provided file with additional conversion tables
641source/data/misc/miscfiles.mk     | Miscellaneous data, like timezone information 
642
643These files function identically for both Windows and UNIX builds of ICU. ICU
644will automatically update the list of installed locales returned by
645`uloc_getAvailable()` whenever `resfiles.mk` or `reslocal.mk` are updated and
646the ICU data library is rebuilt. These files are only needed while building ICU.
647If any of these files are removed or renamed, the size of the ICU data library
648will be reduced.
649
650The optional files `reslocal.mk` and `ucmlocal.mk` are not included as part of
651a standard ICU distribution. Thus these customization files do not need to be
652merged or updated when updating versions of ICU.
653
Both `reslocal.mk` and `ucmlocal.mk` are makefile includes, so the usual rules
for makefiles apply. A line is continued onto the next one by ending it with a
backslash. Lines beginning with a `#` are comments. See
`ucmfiles.mk` and `resfiles.mk` for additional information.
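
As a rough illustration of those two rules, the following Python sketch reads a makefile-include fragment and extracts a space-separated file list. It is not part of ICU and is far simpler than what `make` really does: the function name `read_make_list` and the sample contents are hypothetical, and only backslash continuation and `#` comment lines are handled.

```python
from __future__ import annotations


def read_make_list(text: str, variable: str) -> list[str]:
    """Return the words assigned to `variable` in a makefile-include fragment."""
    logical_lines = []
    pending = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        # A line beginning with '#' is a comment (unless we are mid-continuation).
        if not pending and line.lstrip().startswith("#"):
            continue
        # A trailing backslash joins this line with the next one.
        if line.endswith("\\"):
            pending += line[:-1] + " "
            continue
        logical_lines.append(pending + line)
        pending = ""
    if pending:
        logical_lines.append(pending)

    for line in logical_lines:
        name, sep, value = line.partition("=")
        if sep and name.strip() == variable:
            return value.split()
    return []


# Hypothetical file contents, for illustration only.
SAMPLE = """\
# Additional resource bundles for my application
GENRB_SOURCE_LOCAL = myLocale.txt \\
    other.txt
"""

print(read_make_list(SAMPLE, "GENRB_SOURCE_LOCAL"))
```

A helper like this can be handy for checking which files a customization include actually lists before starting a long data rebuild.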
658
659### Reducing the Size of ICU's Data: Conversion Tables
660
661The size of the ICU data file in the standard build configuration is about 8 MB.
662The majority of this is used for conversion tables. ICU comes with so many
663conversion tables because many ICU users need to support many encodings from
664many platforms. There are conversion tables for EBCDIC and DOS codepages, for
665ISO 2022 variants, and for small variations of popular encodings.
666
667> :point_right: **Important**: ICU provides full internationalization
668functionality without **any** conversion table data. The common library
669contains code to handle several important encodings algorithmically: US-ASCII,
670ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e.,
671US-ASCII, ISO-8859-1, and all Unicode charsets; see
672source/data/mappings/convrtrs.txt for the current list).
673
Therefore, the easiest way to substantially reduce the size of ICU's data
(without limiting I18N support) is to reduce the number of conversion tables
that are built into the data file.
677
The conversion tables are listed for the build process in several makefiles,
`source/data/mappings/ucm*.mk`, roughly grouped by how commonly they are used.
680If you remove or rename any of these files, then the ICU build will exclude the
681conversion tables that are listed in that file. Beginning with ICU 2.0, all of
682these makefiles including the main one are optional. If you remove all of them,
683then ICU will include only very few conversion tables for "fallback" encodings
684(see note below).
685
If you remove or rename all `ucm*.mk` files, then ICU's data is reduced to
687about 3.6 MB. If you remove all these files except for `ucmcore.mk`, then ICU's
688data is reduced to about 4.7 MB, while keeping support for a core set of common
689MIME/Unix/Windows encodings.
690
691> :point_right: **Note**: If you remove the conversion table for an encoding
692that could be a default encoding on one of your platforms, then ICU will not be
693able to instantiate a default converter. In this case, ICU 2.0 and up will
694automatically fall back to a "lowest common denominator" and load a converter
695for US-ASCII (or, on EBCDIC platforms, for codepages 37 or 1047). This will be
696good enough for converting strings that contain only "ASCII" characters (see the
697comment about "invariant characters" in `utypes.h`).
698*When ICU is built with a reduced set of conversion tables, then some tests will
699fail that test the behavior of the converters based on known features of some
700encodings. Also, building the testdata will fail if you remove some conversion
701tables that are necessary for that (to test non-ASCII/Unicode resource bundle
702source files, for example). You can ignore these failures. Build with the
703standard set of conversion tables, if you want to run the tests.* 
704
705### Reducing the Size of ICU's Data: Locale Data
706
707If you need to reduce the size of ICU's data even further, then you need to
708remove other files or parts of files from the build as well.
709
710There are a number of different subdirectories of 'data' containing locale data
711split out by section. Each subdirectory has its own **.mk** file listing the
712locales which will be built. Subdirectories include **lang** for language names
713and **curr** for currency names.
714
You can remove data for entire locales by removing their files from
`source/data/locales/resfiles.mk` or the appropriate other .mk file. ICU will
then use the data of the parent locale instead, which for a language-level
locale is root.txt. Even if you remove all resource bundles for a given
language and its country/region/variant sublocales, **do not remove
root.txt!** Also, do not remove a parent locale if
child locales exist. For example, do not remove "en" while retaining "en_US".
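
The two rules above (keep root, and keep every parent of a retained locale) can be checked mechanically before rebuilding. The sketch below is a minimal illustration in Python, not an ICU tool. It assumes that the parent of a locale is obtained by dropping the last underscore-separated part, which is a simplification of ICU's real parent chain; the helper names are hypothetical.

```python
from __future__ import annotations


def parent_of(locale: str) -> str:
    """Parent by truncation (an assumption): 'en_US' -> 'en', 'en' -> 'root'."""
    if "_" in locale:
        return locale.rsplit("_", 1)[0]
    return "root"


def check_locale_list(kept: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the set looks consistent."""
    problems = []
    if "root" not in kept:
        problems.append("root is missing")
    for locale in sorted(kept):
        if locale == "root":
            continue
        parent = parent_of(locale)
        if parent not in kept:
            problems.append(f"{locale} is kept but its parent {parent} was removed")
    return problems


print(check_locale_list({"root", "en", "en_US"}))
print(check_locale_list({"root", "en_US"}))
```

The first call reports no problems; the second flags that "en_US" was retained without "en", which is exactly the situation the text warns against.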
721
722### Reducing the Size of ICU's Data: Collation Data
723
724Collation data (for sorting, searching and alphabetic indexes) is also large,
725especially the collation data for East Asian languages because they define
726multiple orderings of tens of thousands of Han characters. You can remove the
727collation data for those languages by removing references to those locales from
728`source/data/coll/colfiles.mk` files. When you do that, the collation for those
729languages will fall back to the root collator, that is, you lose
730language-specific behavior.
731
732A much less radical approach is to keep the collation data tables but remove the
733tailoring rule strings from which they were built. Those rule strings are
734rarely used at runtime. For documentation about their use and how to remove
735them see the section "Building on Existing Locales" in the
[Collation Customization chapter](../collation/customization/index.md).
737
738### Adding Locale Data to ICU's Data
To add a new locale, write a resource bundle file for it with a structure like
the existing locale resource bundles (e.g. `source/data/locales/ja.txt`,
`ru_RU.txt`, `kok_IN.txt`) and add it to the build by creating the file
`source/data/locales/reslocal.mk` described above. In this file, define the
list of additional resource bundles as
743
744```
745GENRB_SOURCE_LOCAL=myLocale.txt other.txt ...
746```
747
748Starting in ICU 2.2, these added locales are automatically listed by
749`uloc_getAvailable()`.
750
751## ICU Data File Formats
752
ICU uses several kinds of data files with specific source (plain text) and
binary data formats. The following lists provide links to descriptions of those
formats.
756
757Each ICU data object begins with a header before the actual, specific data. The
758header consists of a 16-bit header length value, the two "magic" bytes DA 27 and
759a [UDataInfo](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/structUDataInfo.html#_details)
760structure which specifies the data object's endianness, charset family, format,
761data version, etc.
762
763(This is not the case for the trie structures, which are not stand-alone,
764loadable data objects.)
765
766### Public Data Files
767
768#### ICU.dat package files
*   Source format: (list of files provided as input to the icupkg tool, or
    on the gencmn tool command line)
*   Binary format: .dat:
    [source/tools/toolutil/pkg_gencmn.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/toolutil/pkg_gencmn.cpp)
*   Generator tool:
    [icupkg](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/icupkg)
    or
    [gencmn](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gencmn)

778#### Resource bundles
779*   Source format: .txt:
780    [icuhtml/design/bnf_rb.txt](https://github.com/unicode-org/icu-docs/blob/main/design/bnf_rb.txt)
781*   Binary format: .res:
782    [source/common/uresdata.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/uresdata.h)
783*   Generator tool:
784    [genrb](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/genrb)
785
786#### Unicode conversion mapping tables
787*   Source format: .ucm: [Conversion Data chapter](../conversion/data.md)
788*   Binary format: .cnv:
789    [source/common/ucnvmbcs.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/ucnvmbcs.h)
790*   Generator tool:
791    [makeconv](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/makeconv)
792
793#### Conversion (charset) aliases
794*   Source format:
795    [source/data/mappings/convrtrs.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/convrtrs.txt):
796    contains format description. The command "uconv -l --canon" will also
797    generate the alias table from the currently used copy of ICU.
798*   Binary format: cnvalias.icu:
799    [source/common/ucnv_io.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/ucnv_io.cpp)
800*   Generator tool:
801    [gencnval](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gencnval)
802
803#### Unicode Character Data (Properties; for Java only: hardcoded in C common library)
804*   Source format:
805    [source/data/unidata/ppucd.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/ppucd.txt):
806    [Preparsed UCD](https://icu.unicode.org/design/props/ppucd)
807*   Binary format: uprops.icu:
808    [tools/unicode/c/genprops/corepropsbuilder.cpp](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/corepropsbuilder.cpp)
809*   Generator tool:
810    [genprops](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops)
811
812#### Unicode Character Data (Case mappings; for Java only: hardcoded in C common library)
813*   Source format:
814    [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata):
815    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
816*   Binary format: ucase.icu:
817    [tools/unicode/c/genprops/casepropsbuilder.cpp](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/casepropsbuilder.cpp)
818*   Generator tool:
819    [genprops](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops)
820
821#### Unicode Character Data (BiDi, and Arabic shaping; for Java only: hardcoded in C common library)
822*   Source format:
823    [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata):
824    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
825*   Binary format: ubidi.icu:
826    [tools/unicode/c/genprops/bidipropsbuilder.cpp](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/bidipropsbuilder.cpp)
827*   Generator tool:
828    [genprops](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops)
829
830#### Unicode Character Data (Normalization since ICU 4.4) & custom normalization data
831*   Source format:
832    [source/data/unidata/norm2/*.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/norm2):
833    Files derived from the [Unicode Character
834    Database](https://www.unicode.org/onlinedat/online.html), or custom data.
835*   Binary format: .nrm:
836    [source/common/normalizer2impl.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/normalizer2impl.h)
837*   Generator tool:
838    [gennorm2](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gennorm2)
839
840#### Unicode Character Data (Character names)
841*   Source format:
842    [source/data/unidata/UnicodeData.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/UnicodeData.txt):
843    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
844*   Binary format: unames.icu:
845    [tools/unicode/c/genprops/namespropsbuilder.cpp](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/namespropsbuilder.cpp)
846*   Generator tool:
847    [genprops](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops)
848
849#### Unicode Character Data (Property [value] aliases since ICU 4.8; for Java only: hardcoded in C common library since ICU 4.8)
*   Source format: [UCD Property*Aliases.txt](http://www.unicode.org/Public/UNIDATA/):
    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
852*   Binary format: pnames.icu:
853    [source/common/propname.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/propname.h)
854*   Generator tool:
855    [genprops](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops)
856
857#### Unicode Character Data (Text layout properties since ICU 64)
858*   Source format:
859    [source/data/unidata/ppucd.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/ppucd.txt):
860    [Preparsed UCD](https://icu.unicode.org/design/props/ppucd)
861*   Binary format: ulayout.icu:
862    [tools/unicode/c/genprops/layoutpropsbuilder.cpp](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/layoutpropsbuilder.cpp)
863*   Generator tool:
864    [genprops](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops)
865
866#### Unicode Character Data (Emoji properties since ICU 70)
867Emoji properties of code points moved out of uprops.icu.
868Emoji properties of strings added.
869*   Source format:
870    [source/data/unidata/emoji-sequences.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/emoji-sequences.txt) and
871    [source/data/unidata/emoji-zwj-sequences.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/emoji-zwj-sequences.txt):
872    [UTS #51 Data Files](https://www.unicode.org/reports/tr51/#Data_Files)
873*   Binary format: uemoji.icu:
874    [tools/unicode/c/genprops/emojipropsbuilder.cpp](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/emojipropsbuilder.cpp)
875*   Generator tool:
876    [genprops](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops)
877
878#### Collation data (root collation & tailorings; ICU 53 & later)
879*   Source format: Original data from allkeys_CLDR.txt in
880    [CLDR Root Collation Data Files](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Data_Files)
881    processed into
882    [source/data/unidata/FractionalUCA.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/FractionalUCA.txt)
883    by
884    [tool at unicode.org maintained by Mark Davis](https://sites.google.com/site/unicodetools/#TOC-UCA)
885    (call the Main class with option writeFractionalUCA); source tailorings (text rules) in
886    [source/data/coll/*.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/coll)
887    resource bundles: [Collation Customization chapter](../collation/customization/index.md).
888*   Binary format: ucadata.icu & binary tailorings in resource bundles:
889    [source/i18n/collationdatareader.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/i18n/collationdatareader.h)
890*   Generator tool:
891    [genuca](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genuca),
892    [genrb](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/genrb)
893
894#### Rule-based break iterator data
895*   Source format: .txt: [Boundary Analysis chapter](boundaryanalysis/index.md)
896*   Binary format: .brk:
897    [source/common/rbbidata.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/rbbidata.h)
898*   Generator tool:
899    [genbrk](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/genbrk)
900
901#### Dictionary-based break iterator data (ICU 50 & later)
*   Source format: .txt: [gendict.cpp
    comments](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gendict/gendict.cpp)
*   Binary format: .dict: see
    [source/common/dictionarydata.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/dictionarydata.h)
906*   Generator tool:
907    [gendict](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gendict)
908
909#### Rule-based transform (transliterator) data
910*   Source format: .txt (in resource bundles): [Transform Rule Tutorial chapter](transforms/general/rules.md)
911*   Binary format: Uses genrb to make binary format
912*   Generator tool: Does not apply
913
914#### Time zone data (ICU 4.4 & later)
*   Source format:
    [source/data/misc/zoneinfo64.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/misc/zoneinfo64.txt):
    ftp://elsie.nci.nih.gov/pub/ `tzdata<year><rev>.tar.gz`
918*   Binary format: zoneinfo64.res (generated by genrb and
919    [tzcode tools](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/tzcode/readme.txt)).
920*   Generator tool: Does not apply
921
922#### StringPrep profile data
*   Source format:
    [source/data/sprep/rfc3491.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/sprep/rfc3491.txt)
925*   Binary format: .spp:
926    [source/tools/gensprep/store.c](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gensprep/store.c)
927*   Generator tool:
928    [gensprep](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gensprep)
929
930#### Confusables data
931*   Source format:
932    [source/data/unidata/confusables.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/confusables.txt),
933    [source/data/unidata/confusablesWholeScript.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/confusablesWholeScript.txt)
*   Binary format: confusables.cfu:
    [source/i18n/uspoof_impl.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/i18n/uspoof_impl.h)
936*   Generator tool: [gencfu](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gencfu)
937
938### Public Data Files (old versions)
939
940#### Unicode Character Data (Normalization before ICU 4.4; for Java only: was hardcoded in C common library)
*   Source format:
    [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata):
943    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
944*   Binary format: unorm.icu:
945    [source/common/unormimp.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unormimp.h)
946*   Generator tool: gennorm
947
948#### Unicode Character Data (Property [value] aliases before ICU 4.8)
949*   Source format: source/data/unidata/Property*Aliases.txt: [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
950*   Binary format: pnames.icu: source/common/propname.h (ICU 4.6)
951*   Generator tool: genpname
952
953#### Collation data (UCA, code points to weights; ICU 52 & earlier)
954*   Source format: Same as in ICU 53
955*   Binary format: ucadata.icu & binary tailorings in resource bundles: source/i18n/ucol_imp.h (ICU 52)
956*   Generator tool:
957    [genuca](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genuca),
958    [genrb](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/genrb)
959
960#### Collation data (Inverse UCA, weights->code points; ICU 52 & earlier)
961*   Source format: Processed from FractionalUCA.txt like ICU 52 ucadata.icu
962*   Binary format: invuca.icu: source/i18n/ucol_imp.h (ICU 52)
963*   Generator tool:
964    [genuca](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genuca)
965
966#### Dictionary-based break iterator data (ICU 49 & earlier)
967*   Source format: .txt: genctd.cpp comments
*   Binary format: .ctd: see CompactTrieHeader in source/common/triedict.cpp
969*   Generator tool: genctd
970
971#### Time zone data (Before ICU 4.4)
*   Source format: source/data/misc/zoneinfo.txt (ICU 4.2): ftp://elsie.nci.nih.gov/pub/ `tzdata<year><rev>.tar.gz`
973*   Binary format: zoneinfo64.res (generated by genrb and
974    [tzcode tools](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/tzcode/readme.txt)).
975*   Generator tool: Does not apply
976
977### Non-File API Binary Data
978
979#### Converter selector data
980*   Source format: none
981*   Binary format:
982    [source/common/ucnvsel.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/ucnvsel.cpp)
983*   Generator tool:
984    [ucnvsel_open()](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/ucnvsel.cpp)
985
986### Test-Only Data Files
987
988#### test.icu (for udata API testing)
989*   Source format: none (fixed output from gentest when not using -r or -j options)
*   Binary format: test.icu: see `createData()` in
    [source/tools/gentest/gentest.c](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gentest/gentest.c)
992*   Generator tool:
993    [gentest](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/gentest/gentest.c)
994
995### Other Data Structures
996
997#### UCPTrie (C)/CodePointTrie (Java) (maps code points to integers)
998*   Source format: (public builder API)
999*   Binary format:
1000    [ICU Code Point Tries design doc](https://icu.unicode.org/design/struct/utrie),
1001    [icu4c/source/common/ucptrie_impl.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/ucptrie_impl.h)
1002*   Generator tool: (builder class)
1003
1004#### UTrie2 (C)/Trie2 (Java) (maps code points to integers)
1005*   Source format: (internal builder API)
1006*   Binary format:
1007    [ICU Code Point Tries design doc](https://icu.unicode.org/design/struct/utrie),
1008    [icu4c/source/common/utrie2_impl.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/utrie2_impl.h)
1009*   Generator tool: (builder class)
1010
1011#### BytesTrie (maps byte sequences to 32-bit integers)
1012*   Source format: (public builder API)
1013*   Binary format:
1014    [BytesTrie design doc](https://icu.unicode.org/design/struct/tries/bytestrie),
1015    [icu4c/source/common/unicode/bytestrie.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/bytestrie.h)
1016*   Generator tool: (builder class)
1017
1018#### UCharsTrie (C++)/CharsTrie (Java) (maps 16-bit-Unicode strings to 32-bit integers)
1019*   Source format: (public builder API)
1020*   Binary format:
1021    [UCharsTrie design doc](https://icu.unicode.org/design/struct/tries/ucharstrie),
1022    [icu4c/source/common/unicode/ucharstrie.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/ucharstrie.h)
1023*   Generator tool: (builder class)
1024
1025## ICU4J Resource Information
1026
Starting with release 2.1, ICU4J includes its own resource information which is
completely independent of the JRE resource information. (Note: in ICU4J 2.8 to
3.4, time zone information depends on the underlying JRE.) The new ICU4J
information is equivalent to the information in ICU4C and many resources are,
in fact, the same binary files that ICU4C uses.
1032
1033By default the ICU4J distribution includes all of the standard resource
1034information. It is located under the directory `com/ibm/icu/impl/data`.
1035Depending on the service, the data is in different locations and in different
1036formats. Note: This will continue to change from release to release, so clients
1037should not depend on the exact organization of the data in ICU4J.
1038
1.  The primary **locale data** is under the directory `icudt38b`, as a set of
    ".res" files whose names are the locale identifiers. Locale naming is
    documented in the `com.ibm.icu.util.ULocale` class, and the use of these
    names in searching for resources is documented in
    `com.ibm.icu.util.UResourceBundle`.
1044
10452.  The **collation data** is under the directory `icudt38b/coll`, as a set of
1046    ".res" files.
1047
10483.  The **rule-based transliterator data** is under the directory
1049    `icudt38b/translit` as a set of ".res" files. (**Note:** the Han
1050    transliterator test data is no longer included in the core icu4j.jar file by
1051    default.)
1052
10534.  The **rule-based number format data** is under the directory `icudt38b/rbnf`
1054    as a set of ".res" files.
1055
10565.  The **break iterator data** is directly under the data directory, as a set
1057    of ".brk" files, named according to the type of break and the locale where
1058    there are locale-specific versions.
1059
10606.  The **holiday data** is under the data directory, as a set of ".class"
1061    files, named "HolidayBundle_" followed by the locale ID.
1062
7.  The **character property data** as well as assorted **normalization data**
    and default **Unicode Collation Algorithm (UCA) data** is found under the
    data directory as a set of ".icu" files.
1066
10678.  The **character set converter data** is under the directory `icudt38b/`, as
1068    a set of ".cnv" files. These files are currently included only in
1069    icu-charset.jar.
1070
10719.  The **time zone data** is named `zoneinfo.res` under the directory
1072    `icudt38b`.
1073
1074Some of the data files alias or otherwise reference data from other data files.
One reason for this is that some locale names have changed. For example,
1076he_IL used to be iw_IL. In order to support both names but not duplicate the
1077data, one of the resource files refers to the other file's data. In other cases,
1078a file may alias a portion of another file's data in order to save space.
1079Currently ICU4J provides no tool for revealing these dependencies.
1080
1081> :point_right: **Note**: Java's Locale class silently converts the language
1082code "he" to "iw" when you construct the Locale (for versions of Java through
1083Java 5). Thus Java cannot be used to locate resources that use the "he" language
1084code. ICU, on the other hand, does not perform this conversion in ULocale, and
1085instead uses aliasing in the locale data to represent the same set of data under
1086different locale ids.
1087
Resource files that use locale ids form a hierarchy, with up to four levels: a
root, language, region (country), and variant. Searches for locale data attempt
to match as far down the hierarchy as possible. For example, "he_IL" will match
he_IL, but "he_US" will match he (since there is no US variant for he), and
"xx_YY" will match root (the default fallback locale) since there is no xx
language code in the locale hierarchy. Again, see `java.util.ResourceBundle` for
more information.
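
The matching behavior described above can be sketched in a few lines of Python. This is only an illustration of truncation-based fallback under simplifying assumptions: it ignores aliases, scripts, and any other special cases in the real ICU4J lookup, and the function names and the `AVAILABLE` set are hypothetical.

```python
from __future__ import annotations


def fallback_chain(locale_id: str) -> list[str]:
    """'he_IL' -> ['he_IL', 'he', 'root'], dropping one trailing part at a time."""
    chain = []
    current = locale_id
    while current:
        chain.append(current)
        current = current.rpartition("_")[0]
    chain.append("root")
    return chain


def best_match(locale_id: str, available: set[str]) -> str:
    """Return the most specific available bundle for the requested locale."""
    for candidate in fallback_chain(locale_id):
        if candidate in available:
            return candidate
    raise LookupError("no root bundle available")


# Hypothetical set of bundles present in the data.
AVAILABLE = {"root", "he", "he_IL"}

for requested in ("he_IL", "he_US", "xx_YY"):
    print(requested, "->", best_match(requested, AVAILABLE))
```

With this set of bundles, the three requests resolve to he_IL, he, and root respectively, matching the examples in the paragraph above.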
1095
1096Currently ICU4J provides no tool for revealing these dependencies between data
1097files, so trimming the data directly in the ICU4J project is a hit-or-miss
1098affair. The key point when you remove data is to make sure to remove all
1099dependencies on that data as well. For example, if you remove he.res, you need
1100to remove he_IL.res, since it is lower in the hierarchy, and you must remove
1101iw.res, since it references he.res, and iw_IL.res, since it depends on it (and
1102also references he_IL.res).
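
Since no tool reveals these dependencies, one option is to record the known aliases by hand and compute the full removal set from them. The Python sketch below does this for the example above. The alias table is a hand-written assumption (ICU4J does not export it), bundle names are given without the `.res` suffix, and the parent of a bundle is assumed to be its name with the last underscore-separated part dropped.

```python
from __future__ import annotations


def files_to_remove(target: str, files: set[str], aliases: dict[str, str]) -> set[str]:
    """Return `target` plus every bundle that depends on it, directly or indirectly."""

    def depends_on(name: str) -> set[str]:
        deps = set()
        if "_" in name:
            deps.add(name.rsplit("_", 1)[0])  # parent in the locale hierarchy
        if name in aliases:
            deps.add(aliases[name])  # bundle whose data this one references
        return deps

    removed = {target}
    changed = True
    while changed:
        changed = False
        for name in files - removed:
            if depends_on(name) & removed:
                removed.add(name)
                changed = True
    return removed


# Hypothetical inventory and hand-maintained alias table.
FILES = {"root", "en", "he", "he_IL", "iw", "iw_IL"}
ALIASES = {"iw": "he", "iw_IL": "he_IL"}

print(sorted(files_to_remove("he", FILES, ALIASES)))
```

For the inventory shown, removing "he" pulls in "he_IL", "iw", and "iw_IL", while "root" and "en" are left alone.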
1103
1104Unfortunately, the jar tool in the JDK provides no way to remove items from a
1105jar file. Thus you have to extract the resources, remove the ones you don't
1106want, and then create a new jar file with the remaining resources. See the jar
tool information for how to do this. Before re-jarring the files, be sure to
1108thoroughly test your application with the remaining resources, making sure each
1109required resource is present.
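
Because a jar file is a zip archive, the extract-and-recreate step can also be scripted. The following Python sketch copies every entry of a jar into a new archive except the ones listed as unwanted. It is an illustration only, not a replacement for the jar tool: it does nothing special for manifests or signatures, and the entry names used in the demonstration are hypothetical.

```python
from __future__ import annotations

import io
import zipfile


def copy_jar_without(src, dst, unwanted: set[str]) -> list[str]:
    """Copy entries from jar `src` to `dst`, skipping names in `unwanted`.

    `src` and `dst` may be file paths or file-like objects.
    Returns the names of the entries that were kept.
    """
    kept = []
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            if info.filename in unwanted:
                continue
            zout.writestr(info, zin.read(info.filename))
            kept.append(info.filename)
    return kept


def demo() -> list[str]:
    """Build a tiny in-memory archive and strip two entries from it."""
    src = io.BytesIO()
    with zipfile.ZipFile(src, "w") as jar:
        for name in ("data/he.res", "data/he_IL.res", "data/en.res"):
            jar.writestr(name, b"placeholder")
    src.seek(0)
    return copy_jar_without(src, io.BytesIO(), {"data/he.res", "data/he_IL.res"})


print(demo())
```

Whichever way the new jar is produced, the advice above still applies: test the application against the reduced resource set before shipping it.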
1110
### Using additional resource files with ICU4J
1112
1113> :point_right: **Note**: Resource file formats can change across releases of ICU4J!
1114> 
1115> *The format of ICU4J resources is not part of the API. Clients who develop their
1116> own resources for use with ICU4J should be prepared to regenerate them when they
1117> move to new releases of ICU4J.*
1118
We are still developing ICU4J's resource mechanism. Currently it is not possible
to mix ICU's new binary .res resources with traditional Java-style .class or
.txt resources. We might allow for this in a future release, but since the
1122resource data and format is not formally supported, you run the risk of
1123incompatibilities with future releases of ICU4J.
1124
1125Resource data in ICU4J is checked in to the repository as a jar file containing
1126the resource binaries, icudata.jar. This means that inspecting the contents of
1127these resources is difficult. They currently are compiled from ICU4C .txt file
1128data. You can view the contents of the ICU4C text resource files to understand
1129the contents of the ICU4J resources.
1130
1131The files in icudata.jar get extracted to com/ibm/icu/impl/data in the build
1132directory when the 'core' target is built. Building the 'resources' target will
1133force the resources to once again be extracted. Extraction will overwrite any
1134corresponding resource files already in that directory.
1135
1136### Building ICU4J Resources from ICU4C
1137
1138#### Requirements
1139
11401.  [ICU4C](https://icu.unicode.org/download)
1141
11422.  Compilers and tools required for [building ICU4C](../icu4c/build.md).
1143
11443.  J2SE SDK version 5 or above
1145
1146#### Procedure
1147
1.  Download and build ICU4C on a Windows or Linux machine. For instructions on
    downloading and building ICU4C, see
    [Building ICU4C](../icu4c/build.md).
1150
11512.  Follow the remaining instructions in
1152    the [ICU4J Readme](../icu4j/).
1153