16e652d70Sopenharmony_ciUnicode ident
26e652d70Sopenharmony_ci=============
36e652d70Sopenharmony_ci
46e652d70Sopenharmony_ci[<img alt="github" src="https://img.shields.io/badge/github-dtolnay/unicode--ident-8da0cb?style=for-the-badge&labelColor=555555&logo=github" height="20">](https://github.com/dtolnay/unicode-ident)
56e652d70Sopenharmony_ci[<img alt="crates.io" src="https://img.shields.io/crates/v/unicode-ident.svg?style=for-the-badge&color=fc8d62&logo=rust" height="20">](https://crates.io/crates/unicode-ident)
66e652d70Sopenharmony_ci[<img alt="docs.rs" src="https://img.shields.io/badge/docs.rs-unicode--ident-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs" height="20">](https://docs.rs/unicode-ident)
76e652d70Sopenharmony_ci[<img alt="build status" src="https://img.shields.io/github/actions/workflow/status/dtolnay/unicode-ident/ci.yml?branch=master&style=for-the-badge" height="20">](https://github.com/dtolnay/unicode-ident/actions?query=branch%3Amaster)
86e652d70Sopenharmony_ci
96e652d70Sopenharmony_ciImplementation of [Unicode Standard Annex #31][tr31] for determining which
106e652d70Sopenharmony_ci`char` values are valid in programming language identifiers.
116e652d70Sopenharmony_ci
126e652d70Sopenharmony_ci[tr31]: https://www.unicode.org/reports/tr31/
136e652d70Sopenharmony_ci
146e652d70Sopenharmony_ciThis crate is a better optimized implementation of the older `unicode-xid`
156e652d70Sopenharmony_cicrate. This crate uses less static storage, and is able to classify both ASCII
166e652d70Sopenharmony_ciand non-ASCII codepoints with better performance, 2&ndash;10&times; faster than
176e652d70Sopenharmony_ci`unicode-xid`.
186e652d70Sopenharmony_ci
196e652d70Sopenharmony_ci<br>
206e652d70Sopenharmony_ci
216e652d70Sopenharmony_ci## Comparison of performance
226e652d70Sopenharmony_ci
236e652d70Sopenharmony_ciThe following table shows a comparison between five Unicode identifier
246e652d70Sopenharmony_ciimplementations.
256e652d70Sopenharmony_ci
266e652d70Sopenharmony_ci- `unicode-ident` is this crate;
276e652d70Sopenharmony_ci- [`unicode-xid`] is a widely used crate run by the "unicode-rs" org;
286e652d70Sopenharmony_ci- `ucd-trie` and `fst` are two data structures supported by the [`ucd-generate`] tool;
296e652d70Sopenharmony_ci- [`roaring`] is a Rust implementation of Roaring bitmap.
306e652d70Sopenharmony_ci
316e652d70Sopenharmony_ciThe *static storage* column shows the total size of `static` tables that the
326e652d70Sopenharmony_cicrate bakes into your binary, measured in 1000s of bytes.
336e652d70Sopenharmony_ci
346e652d70Sopenharmony_ciThe remaining columns show the **cost per call** to evaluate whether a single
356e652d70Sopenharmony_ci`char` has the XID\_Start or XID\_Continue Unicode property, comparing across
366e652d70Sopenharmony_cidifferent ratios of ASCII to non-ASCII codepoints in the input data.
376e652d70Sopenharmony_ci
386e652d70Sopenharmony_ci[`unicode-xid`]: https://github.com/unicode-rs/unicode-xid
396e652d70Sopenharmony_ci[`ucd-generate`]: https://github.com/BurntSushi/ucd-generate
406e652d70Sopenharmony_ci[`roaring`]: https://github.com/RoaringBitmap/roaring-rs
416e652d70Sopenharmony_ci
426e652d70Sopenharmony_ci| | static storage | 0% nonascii | 1% | 10% | 100% nonascii |
436e652d70Sopenharmony_ci|---|---|---|---|---|---|
446e652d70Sopenharmony_ci| **`unicode-ident`** | 10.0 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns |
456e652d70Sopenharmony_ci| **`unicode-xid`** | 11.5 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns |
466e652d70Sopenharmony_ci| **`ucd-trie`** | 10.2 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns |
476e652d70Sopenharmony_ci| **`fst`** | 138 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns |
486e652d70Sopenharmony_ci| **`roaring`** | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns |
496e652d70Sopenharmony_ci
506e652d70Sopenharmony_ciSource code for the benchmark is provided in the *bench* directory of this repo
516e652d70Sopenharmony_ciand may be repeated by running `cargo criterion`.
526e652d70Sopenharmony_ci
536e652d70Sopenharmony_ci<br>
546e652d70Sopenharmony_ci
556e652d70Sopenharmony_ci## Comparison of data structures
566e652d70Sopenharmony_ci
576e652d70Sopenharmony_ci#### unicode-xid
586e652d70Sopenharmony_ci
596e652d70Sopenharmony_ciThey use a sorted array of character ranges, and do a binary search to look up
606e652d70Sopenharmony_ciwhether a given character lands inside one of those ranges.
616e652d70Sopenharmony_ci
626e652d70Sopenharmony_ci```rust
636e652d70Sopenharmony_cistatic XID_Continue_table: [(char, char); 763] = [
646e652d70Sopenharmony_ci    ('\u{30}', '\u{39}'),  // 0-9
656e652d70Sopenharmony_ci    ('\u{41}', '\u{5a}'),  // A-Z
666e652d70Sopenharmony_ci676e652d70Sopenharmony_ci    ('\u{e0100}', '\u{e01ef}'),
686e652d70Sopenharmony_ci];
696e652d70Sopenharmony_ci```
706e652d70Sopenharmony_ci
716e652d70Sopenharmony_ciThe static storage used by this data structure scales with the number of
726e652d70Sopenharmony_cicontiguous ranges of identifier codepoints in Unicode. Every table entry
736e652d70Sopenharmony_ciconsumes 8 bytes, because it consists of a pair of 32-bit `char` values.
746e652d70Sopenharmony_ci
756e652d70Sopenharmony_ciIn some ranges of the Unicode codepoint space, this is quite a sparse
766e652d70Sopenharmony_cirepresentation &ndash; there are some ranges where tens of thousands of adjacent
776e652d70Sopenharmony_cicodepoints are all valid identifier characters. In other places, the
786e652d70Sopenharmony_cirepresentation is quite inefficient. A characater like `µ` (U+00B5) which is
796e652d70Sopenharmony_cisurrounded by non-identifier codepoints consumes 64 bits in the table, while it
806e652d70Sopenharmony_ciwould be just 1 bit in a dense bitmap.
816e652d70Sopenharmony_ci
826e652d70Sopenharmony_ciOn a system with 64-byte cache lines, binary searching the table touches 7 cache
836e652d70Sopenharmony_cilines on average. Each cache line fits only 8 table entries. Additionally, the
846e652d70Sopenharmony_cibranching performed during the binary search is probably mostly unpredictable to
856e652d70Sopenharmony_cithe branch predictor.
866e652d70Sopenharmony_ci
876e652d70Sopenharmony_ciOverall, the crate ends up being about 10&times; slower on non-ASCII input
886e652d70Sopenharmony_cicompared to the fastest crate.
896e652d70Sopenharmony_ci
906e652d70Sopenharmony_ciA potential improvement would be to pack the table entries more compactly.
916e652d70Sopenharmony_ciRust's `char` type is a 21-bit integer padded to 32 bits, which means every
926e652d70Sopenharmony_citable entry is holding 22 bits of wasted space, adding up to 3.9 K. They could
936e652d70Sopenharmony_ciinstead fit every table entry into 6 bytes, leaving out some of the padding, for
946e652d70Sopenharmony_cia 25% improvement in space used. With some cleverness it may be possible to fit
956e652d70Sopenharmony_ciin 5 bytes or even 4 bytes by storing a low char and an extent, instead of low
966e652d70Sopenharmony_cichar and high char. I don't expect that performance would improve much but this
976e652d70Sopenharmony_cicould be the most efficient for space across all the libraries, needing only
986e652d70Sopenharmony_ciabout 7 K to store.
996e652d70Sopenharmony_ci
1006e652d70Sopenharmony_ci#### ucd-trie
1016e652d70Sopenharmony_ci
1026e652d70Sopenharmony_ciTheir data structure is a compressed trie set specifically tailored for Unicode
1036e652d70Sopenharmony_cicodepoints. The design is credited to Raph Levien in [rust-lang/rust#33098].
1046e652d70Sopenharmony_ci
1056e652d70Sopenharmony_ci[rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098
1066e652d70Sopenharmony_ci
1076e652d70Sopenharmony_ci```rust
1086e652d70Sopenharmony_cipub struct TrieSet {
1096e652d70Sopenharmony_ci    tree1_level1: &'static [u64; 32],
1106e652d70Sopenharmony_ci    tree2_level1: &'static [u8; 992],
1116e652d70Sopenharmony_ci    tree2_level2: &'static [u64],
1126e652d70Sopenharmony_ci    tree3_level1: &'static [u8; 256],
1136e652d70Sopenharmony_ci    tree3_level2: &'static [u8],
1146e652d70Sopenharmony_ci    tree3_level3: &'static [u64],
1156e652d70Sopenharmony_ci}
1166e652d70Sopenharmony_ci```
1176e652d70Sopenharmony_ci
1186e652d70Sopenharmony_ciIt represents codepoint sets using a trie to achieve prefix compression. The
1196e652d70Sopenharmony_cifinal states of the trie are embedded in leaves or "chunks", where each chunk is
1206e652d70Sopenharmony_cia 64-bit integer. Each bit position of the integer corresponds to whether a
1216e652d70Sopenharmony_ciparticular codepoint is in the set or not. These chunks are not just a compact
1226e652d70Sopenharmony_cirepresentation of the final states of the trie, but are also a form of suffix
1236e652d70Sopenharmony_cicompression. In particular, if multiple ranges of 64 contiguous codepoints have
1246e652d70Sopenharmony_cithe same Unicode properties, then they all map to the same chunk in the final
1256e652d70Sopenharmony_cilevel of the trie.
1266e652d70Sopenharmony_ci
1276e652d70Sopenharmony_ciBeing tailored for Unicode codepoints, this trie is partitioned into three
1286e652d70Sopenharmony_cidisjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints \[0,
1296e652d70Sopenharmony_ci0x800), the second \[0x800, 0x10000) and the third \[0x10000, 0x110000). These
1306e652d70Sopenharmony_cipartitions conveniently correspond to the space of 1 or 2 byte UTF-8 encoded
1316e652d70Sopenharmony_cicodepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints,
1326e652d70Sopenharmony_cirespectively.
1336e652d70Sopenharmony_ci
1346e652d70Sopenharmony_ciLookups in this data structure are significantly more efficient than binary
1356e652d70Sopenharmony_cisearch. A lookup touches either 1, 2, or 3 cache lines based on which of the
1366e652d70Sopenharmony_citrie partitions is being accessed.
1376e652d70Sopenharmony_ci
1386e652d70Sopenharmony_ciOne possible performance improvement would be for this crate to expose a way to
1396e652d70Sopenharmony_ciquery based on a UTF-8 encoded string, returning the Unicode property
1406e652d70Sopenharmony_cicorresponding to the first character in the string. Without such an API, the
1416e652d70Sopenharmony_cicaller is required to tokenize their UTF-8 encoded input data into `char`, hand
1426e652d70Sopenharmony_cithe `char` into `ucd-trie`, only for `ucd-trie` to undo that work by converting
1436e652d70Sopenharmony_ciback into the variable-length representation for trie traversal.
1446e652d70Sopenharmony_ci
1456e652d70Sopenharmony_ci#### fst
1466e652d70Sopenharmony_ci
1476e652d70Sopenharmony_ciUses a [finite state transducer][fst]. This representation is built into
1486e652d70Sopenharmony_ci[ucd-generate] but I am not aware of any advantage over the `ucd-trie`
1496e652d70Sopenharmony_cirepresentation. In particular `ucd-trie` is optimized for storing Unicode
1506e652d70Sopenharmony_ciproperties while `fst` is not.
1516e652d70Sopenharmony_ci
1526e652d70Sopenharmony_ci[fst]: https://github.com/BurntSushi/fst
1536e652d70Sopenharmony_ci[ucd-generate]: https://github.com/BurntSushi/ucd-generate
1546e652d70Sopenharmony_ci
1556e652d70Sopenharmony_ciAs far as I can tell, the main thing that causes `fst` to have large size and
1566e652d70Sopenharmony_cislow lookups for this use case relative to `ucd-trie` is that it does not
1576e652d70Sopenharmony_cispecialize for the fact that only 21 of the 32 bits in a `char` are meaningful.
1586e652d70Sopenharmony_ciThere are some dense arrays in the structure with large ranges that could never
1596e652d70Sopenharmony_cipossibly be used.
1606e652d70Sopenharmony_ci
1616e652d70Sopenharmony_ci#### roaring
1626e652d70Sopenharmony_ci
1636e652d70Sopenharmony_ciThis crate is a pure-Rust implementation of [Roaring Bitmap], a data structure
1646e652d70Sopenharmony_cidesigned for storing sets of 32-bit unsigned integers.
1656e652d70Sopenharmony_ci
1666e652d70Sopenharmony_ci[Roaring Bitmap]: https://roaringbitmap.org/about/
1676e652d70Sopenharmony_ci
1686e652d70Sopenharmony_ciRoaring bitmaps are compressed bitmaps which tend to outperform conventional
1696e652d70Sopenharmony_cicompressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be
1706e652d70Sopenharmony_cihundreds of times faster and they often offer significantly better compression.
1716e652d70Sopenharmony_ci
1726e652d70Sopenharmony_ciIn this use case the performance was reasonably competitive but still
1736e652d70Sopenharmony_cisubstantially slower than the Unicode-optimized crates. Meanwhile the
1746e652d70Sopenharmony_cicompression was significantly worse, requiring 6&times; as much storage for the
1756e652d70Sopenharmony_cidata structure.
1766e652d70Sopenharmony_ci
1776e652d70Sopenharmony_ciI also benchmarked the [`croaring`] crate which is an FFI wrapper around the C
1786e652d70Sopenharmony_cireference implementation of Roaring Bitmap. This crate was consistently about
1796e652d70Sopenharmony_ci15% slower than pure-Rust `roaring`, which could just be FFI overhead. I did not
1806e652d70Sopenharmony_ciinvestigate further.
1816e652d70Sopenharmony_ci
1826e652d70Sopenharmony_ci[`croaring`]: https://crates.io/crates/croaring
1836e652d70Sopenharmony_ci
1846e652d70Sopenharmony_ci#### unicode-ident
1856e652d70Sopenharmony_ci
1866e652d70Sopenharmony_ciThis crate is most similar to the `ucd-trie` library, in that it's based on
1876e652d70Sopenharmony_cibitmaps stored in the leafs of a trie representation, achieving both prefix
1886e652d70Sopenharmony_cicompression and suffix compression.
1896e652d70Sopenharmony_ci
1906e652d70Sopenharmony_ciThe key differences are:
1916e652d70Sopenharmony_ci
1926e652d70Sopenharmony_ci- Uses a single 2-level trie, rather than 3 disjoint partitions of different
1936e652d70Sopenharmony_ci  depth each.
1946e652d70Sopenharmony_ci- Uses significantly larger chunks: 512 bits rather than 64 bits.
1956e652d70Sopenharmony_ci- Compresses the XID\_Start and XID\_Continue properties together
1966e652d70Sopenharmony_ci  simultaneously, rather than duplicating identical trie leaf chunks across the
1976e652d70Sopenharmony_ci  two.
1986e652d70Sopenharmony_ci
1996e652d70Sopenharmony_ciThe following diagram show the XID\_Start and XID\_Continue Unicode boolean
2006e652d70Sopenharmony_ciproperties in uncompressed form, in row-major order:
2016e652d70Sopenharmony_ci
2026e652d70Sopenharmony_ci<table>
2036e652d70Sopenharmony_ci<tr><th>XID_Start</th><th>XID_Continue</th></tr>
2046e652d70Sopenharmony_ci<tr>
2056e652d70Sopenharmony_ci<td><img alt="XID_Start bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-afec-49b2-9ef5-c03e9d1e0760.png"></td>
2066e652d70Sopenharmony_ci<td><img alt="XID_Continue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2362-4d7d-8cd7-d21c011d329b.png"></td>
2076e652d70Sopenharmony_ci</tr>
2086e652d70Sopenharmony_ci</table>
2096e652d70Sopenharmony_ci
2106e652d70Sopenharmony_ciUncompressed, these would take 140 K to store, which is beyond what would be
2116e652d70Sopenharmony_cireasonable. However, as you can see there is a large degree of similarity
2126e652d70Sopenharmony_cibetween the two bitmaps and across the rows, which lends well to compression.
2136e652d70Sopenharmony_ci
2146e652d70Sopenharmony_ciThis crate stores one 512-bit "row" of the above bitmaps in the leaf level of a
2156e652d70Sopenharmony_citrie, and a single additional level to index into the leafs. It turns out there
2166e652d70Sopenharmony_ciare 124 unique 512-bit chunks across the two bitmaps so 7 bits are sufficient to
2176e652d70Sopenharmony_ciindex them.
2186e652d70Sopenharmony_ci
2196e652d70Sopenharmony_ciThe chunk size of 512 bits is selected as the size that minimizes the total size
2206e652d70Sopenharmony_ciof the data structure. A smaller chunk, like 256 or 128 bits, would achieve
2216e652d70Sopenharmony_cibetter deduplication but require a larger index. A larger chunk would increase
2226e652d70Sopenharmony_ciredundancy in the leaf bitmaps. 512 bit chunks are the optimum for total size of
2236e652d70Sopenharmony_cithe index plus leaf bitmaps.
2246e652d70Sopenharmony_ci
2256e652d70Sopenharmony_ciIn fact since there are only 124 unique chunks, we can use an 8-bit index with a
2266e652d70Sopenharmony_cispare bit to index at the half-chunk level. This achieves an additional 8.5%
2276e652d70Sopenharmony_cicompression by eliminating redundancies between the second half of any chunk and
2286e652d70Sopenharmony_cithe first half of any other chunk. Note that this is not the same as using
2296e652d70Sopenharmony_cichunks which are half the size, because it does not necessitate raising the size
2306e652d70Sopenharmony_ciof the trie's first level.
2316e652d70Sopenharmony_ci
2326e652d70Sopenharmony_ciIn contrast to binary search or the `ucd-trie` crate, performing lookups in this
2336e652d70Sopenharmony_cidata structure is straight-line code with no need for branching.
2346e652d70Sopenharmony_ci
2356e652d70Sopenharmony_ci```asm
2366e652d70Sopenharmony_ciis_xid_start:
2376e652d70Sopenharmony_ci	mov eax, edi
2386e652d70Sopenharmony_ci	shr eax, 9
2396e652d70Sopenharmony_ci	lea rcx, [rip + unicode_ident::tables::TRIE_START]
2406e652d70Sopenharmony_ci	add rcx, rax
2416e652d70Sopenharmony_ci	xor eax, eax
2426e652d70Sopenharmony_ci	cmp edi, 201728
2436e652d70Sopenharmony_ci	cmovb rax, rcx
2446e652d70Sopenharmony_ci	test rax, rax
2456e652d70Sopenharmony_ci	lea rcx, [rip + .L__unnamed_1]
2466e652d70Sopenharmony_ci	cmovne rcx, rax
2476e652d70Sopenharmony_ci	movzx eax, byte ptr [rcx]
2486e652d70Sopenharmony_ci	shl rax, 5
2496e652d70Sopenharmony_ci	mov ecx, edi
2506e652d70Sopenharmony_ci	shr ecx, 3
2516e652d70Sopenharmony_ci	and ecx, 63
2526e652d70Sopenharmony_ci	add rcx, rax
2536e652d70Sopenharmony_ci	lea rax, [rip + unicode_ident::tables::LEAF]
2546e652d70Sopenharmony_ci	mov al, byte ptr [rax + rcx]
2556e652d70Sopenharmony_ci	and dil, 7
2566e652d70Sopenharmony_ci	mov ecx, edi
2576e652d70Sopenharmony_ci	shr al, cl
2586e652d70Sopenharmony_ci	and al, 1
2596e652d70Sopenharmony_ci	ret
2606e652d70Sopenharmony_ci```
2616e652d70Sopenharmony_ci
2626e652d70Sopenharmony_ci<br>
2636e652d70Sopenharmony_ci
2646e652d70Sopenharmony_ci## License
2656e652d70Sopenharmony_ci
2666e652d70Sopenharmony_ciUse of the Unicode Character Database, as this crate does, is governed by the <a
2676e652d70Sopenharmony_cihref="LICENSE-UNICODE">Unicode License Agreement &ndash; Data Files and Software
2686e652d70Sopenharmony_ci(2016)</a>.
2696e652d70Sopenharmony_ci
2706e652d70Sopenharmony_ciAll intellectual property within this crate that is **not generated** using the
2716e652d70Sopenharmony_ciUnicode Character Database as input is licensed under either of <a
2726e652d70Sopenharmony_cihref="LICENSE-APACHE">Apache License, Version 2.0</a> or <a
2736e652d70Sopenharmony_cihref="LICENSE-MIT">MIT license</a> at your option.
2746e652d70Sopenharmony_ci
2756e652d70Sopenharmony_ciThe **generated** files incorporate tabular data derived from the Unicode
2766e652d70Sopenharmony_ciCharacter Database, together with intellectual property from the original source
2776e652d70Sopenharmony_cicode content of the crate. One must comply with the terms of both the Unicode
2786e652d70Sopenharmony_ciLicense Agreement and either of the Apache license or MIT license when those
2796e652d70Sopenharmony_cigenerated files are involved.
2806e652d70Sopenharmony_ci
2816e652d70Sopenharmony_ciUnless you explicitly state otherwise, any contribution intentionally submitted
2826e652d70Sopenharmony_cifor inclusion in this crate by you, as defined in the Apache-2.0 license, shall
2836e652d70Sopenharmony_cibe licensed as just described, without any additional terms or conditions.
284