16e652d70Sopenharmony_ciUnicode ident 26e652d70Sopenharmony_ci============= 36e652d70Sopenharmony_ci 46e652d70Sopenharmony_ci[<img alt="github" src="https://img.shields.io/badge/github-dtolnay/unicode--ident-8da0cb?style=for-the-badge&labelColor=555555&logo=github" height="20">](https://github.com/dtolnay/unicode-ident) 56e652d70Sopenharmony_ci[<img alt="crates.io" src="https://img.shields.io/crates/v/unicode-ident.svg?style=for-the-badge&color=fc8d62&logo=rust" height="20">](https://crates.io/crates/unicode-ident) 66e652d70Sopenharmony_ci[<img alt="docs.rs" src="https://img.shields.io/badge/docs.rs-unicode--ident-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs" height="20">](https://docs.rs/unicode-ident) 76e652d70Sopenharmony_ci[<img alt="build status" src="https://img.shields.io/github/actions/workflow/status/dtolnay/unicode-ident/ci.yml?branch=master&style=for-the-badge" height="20">](https://github.com/dtolnay/unicode-ident/actions?query=branch%3Amaster) 86e652d70Sopenharmony_ci 96e652d70Sopenharmony_ciImplementation of [Unicode Standard Annex #31][tr31] for determining which 106e652d70Sopenharmony_ci`char` values are valid in programming language identifiers. 116e652d70Sopenharmony_ci 126e652d70Sopenharmony_ci[tr31]: https://www.unicode.org/reports/tr31/ 136e652d70Sopenharmony_ci 146e652d70Sopenharmony_ciThis crate is a better optimized implementation of the older `unicode-xid` 156e652d70Sopenharmony_cicrate. This crate uses less static storage, and is able to classify both ASCII 166e652d70Sopenharmony_ciand non-ASCII codepoints with better performance, 2–10× faster than 176e652d70Sopenharmony_ci`unicode-xid`. 186e652d70Sopenharmony_ci 196e652d70Sopenharmony_ci<br> 206e652d70Sopenharmony_ci 216e652d70Sopenharmony_ci## Comparison of performance 226e652d70Sopenharmony_ci 236e652d70Sopenharmony_ciThe following table shows a comparison between five Unicode identifier 246e652d70Sopenharmony_ciimplementations. 256e652d70Sopenharmony_ci 266e652d70Sopenharmony_ci- `unicode-ident` is this crate; 276e652d70Sopenharmony_ci- [`unicode-xid`] is a widely used crate run by the "unicode-rs" org; 286e652d70Sopenharmony_ci- `ucd-trie` and `fst` are two data structures supported by the [`ucd-generate`] tool; 296e652d70Sopenharmony_ci- [`roaring`] is a Rust implementation of Roaring bitmap. 306e652d70Sopenharmony_ci 316e652d70Sopenharmony_ciThe *static storage* column shows the total size of `static` tables that the 326e652d70Sopenharmony_cicrate bakes into your binary, measured in 1000s of bytes. 336e652d70Sopenharmony_ci 346e652d70Sopenharmony_ciThe remaining columns show the **cost per call** to evaluate whether a single 356e652d70Sopenharmony_ci`char` has the XID\_Start or XID\_Continue Unicode property, comparing across 366e652d70Sopenharmony_cidifferent ratios of ASCII to non-ASCII codepoints in the input data. 376e652d70Sopenharmony_ci 386e652d70Sopenharmony_ci[`unicode-xid`]: https://github.com/unicode-rs/unicode-xid 396e652d70Sopenharmony_ci[`ucd-generate`]: https://github.com/BurntSushi/ucd-generate 406e652d70Sopenharmony_ci[`roaring`]: https://github.com/RoaringBitmap/roaring-rs 416e652d70Sopenharmony_ci 426e652d70Sopenharmony_ci| | static storage | 0% nonascii | 1% | 10% | 100% nonascii | 436e652d70Sopenharmony_ci|---|---|---|---|---|---| 446e652d70Sopenharmony_ci| **`unicode-ident`** | 10.0 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns | 456e652d70Sopenharmony_ci| **`unicode-xid`** | 11.5 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns | 466e652d70Sopenharmony_ci| **`ucd-trie`** | 10.2 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns | 476e652d70Sopenharmony_ci| **`fst`** | 138 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns | 486e652d70Sopenharmony_ci| **`roaring`** | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns | 496e652d70Sopenharmony_ci 506e652d70Sopenharmony_ciSource code for the benchmark is provided in the *bench* directory of this repo 516e652d70Sopenharmony_ciand may be repeated by running `cargo criterion`. 526e652d70Sopenharmony_ci 536e652d70Sopenharmony_ci<br> 546e652d70Sopenharmony_ci 556e652d70Sopenharmony_ci## Comparison of data structures 566e652d70Sopenharmony_ci 576e652d70Sopenharmony_ci#### unicode-xid 586e652d70Sopenharmony_ci 596e652d70Sopenharmony_ciThey use a sorted array of character ranges, and do a binary search to look up 606e652d70Sopenharmony_ciwhether a given character lands inside one of those ranges. 616e652d70Sopenharmony_ci 626e652d70Sopenharmony_ci```rust 636e652d70Sopenharmony_cistatic XID_Continue_table: [(char, char); 763] = [ 646e652d70Sopenharmony_ci ('\u{30}', '\u{39}'), // 0-9 656e652d70Sopenharmony_ci ('\u{41}', '\u{5a}'), // A-Z 666e652d70Sopenharmony_ci … 676e652d70Sopenharmony_ci ('\u{e0100}', '\u{e01ef}'), 686e652d70Sopenharmony_ci]; 696e652d70Sopenharmony_ci``` 706e652d70Sopenharmony_ci 716e652d70Sopenharmony_ciThe static storage used by this data structure scales with the number of 726e652d70Sopenharmony_cicontiguous ranges of identifier codepoints in Unicode. Every table entry 736e652d70Sopenharmony_ciconsumes 8 bytes, because it consists of a pair of 32-bit `char` values. 746e652d70Sopenharmony_ci 756e652d70Sopenharmony_ciIn some ranges of the Unicode codepoint space, this is quite a sparse 766e652d70Sopenharmony_cirepresentation – there are some ranges where tens of thousands of adjacent 776e652d70Sopenharmony_cicodepoints are all valid identifier characters. In other places, the 786e652d70Sopenharmony_cirepresentation is quite inefficient. A characater like `µ` (U+00B5) which is 796e652d70Sopenharmony_cisurrounded by non-identifier codepoints consumes 64 bits in the table, while it 806e652d70Sopenharmony_ciwould be just 1 bit in a dense bitmap. 816e652d70Sopenharmony_ci 826e652d70Sopenharmony_ciOn a system with 64-byte cache lines, binary searching the table touches 7 cache 836e652d70Sopenharmony_cilines on average. Each cache line fits only 8 table entries. Additionally, the 846e652d70Sopenharmony_cibranching performed during the binary search is probably mostly unpredictable to 856e652d70Sopenharmony_cithe branch predictor. 866e652d70Sopenharmony_ci 876e652d70Sopenharmony_ciOverall, the crate ends up being about 10× slower on non-ASCII input 886e652d70Sopenharmony_cicompared to the fastest crate. 896e652d70Sopenharmony_ci 906e652d70Sopenharmony_ciA potential improvement would be to pack the table entries more compactly. 916e652d70Sopenharmony_ciRust's `char` type is a 21-bit integer padded to 32 bits, which means every 926e652d70Sopenharmony_citable entry is holding 22 bits of wasted space, adding up to 3.9 K. They could 936e652d70Sopenharmony_ciinstead fit every table entry into 6 bytes, leaving out some of the padding, for 946e652d70Sopenharmony_cia 25% improvement in space used. With some cleverness it may be possible to fit 956e652d70Sopenharmony_ciin 5 bytes or even 4 bytes by storing a low char and an extent, instead of low 966e652d70Sopenharmony_cichar and high char. I don't expect that performance would improve much but this 976e652d70Sopenharmony_cicould be the most efficient for space across all the libraries, needing only 986e652d70Sopenharmony_ciabout 7 K to store. 996e652d70Sopenharmony_ci 1006e652d70Sopenharmony_ci#### ucd-trie 1016e652d70Sopenharmony_ci 1026e652d70Sopenharmony_ciTheir data structure is a compressed trie set specifically tailored for Unicode 1036e652d70Sopenharmony_cicodepoints. The design is credited to Raph Levien in [rust-lang/rust#33098]. 1046e652d70Sopenharmony_ci 1056e652d70Sopenharmony_ci[rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098 1066e652d70Sopenharmony_ci 1076e652d70Sopenharmony_ci```rust 1086e652d70Sopenharmony_cipub struct TrieSet { 1096e652d70Sopenharmony_ci tree1_level1: &'static [u64; 32], 1106e652d70Sopenharmony_ci tree2_level1: &'static [u8; 992], 1116e652d70Sopenharmony_ci tree2_level2: &'static [u64], 1126e652d70Sopenharmony_ci tree3_level1: &'static [u8; 256], 1136e652d70Sopenharmony_ci tree3_level2: &'static [u8], 1146e652d70Sopenharmony_ci tree3_level3: &'static [u64], 1156e652d70Sopenharmony_ci} 1166e652d70Sopenharmony_ci``` 1176e652d70Sopenharmony_ci 1186e652d70Sopenharmony_ciIt represents codepoint sets using a trie to achieve prefix compression. The 1196e652d70Sopenharmony_cifinal states of the trie are embedded in leaves or "chunks", where each chunk is 1206e652d70Sopenharmony_cia 64-bit integer. Each bit position of the integer corresponds to whether a 1216e652d70Sopenharmony_ciparticular codepoint is in the set or not. These chunks are not just a compact 1226e652d70Sopenharmony_cirepresentation of the final states of the trie, but are also a form of suffix 1236e652d70Sopenharmony_cicompression. In particular, if multiple ranges of 64 contiguous codepoints have 1246e652d70Sopenharmony_cithe same Unicode properties, then they all map to the same chunk in the final 1256e652d70Sopenharmony_cilevel of the trie. 1266e652d70Sopenharmony_ci 1276e652d70Sopenharmony_ciBeing tailored for Unicode codepoints, this trie is partitioned into three 1286e652d70Sopenharmony_cidisjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints \[0, 1296e652d70Sopenharmony_ci0x800), the second \[0x800, 0x10000) and the third \[0x10000, 0x110000). These 1306e652d70Sopenharmony_cipartitions conveniently correspond to the space of 1 or 2 byte UTF-8 encoded 1316e652d70Sopenharmony_cicodepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints, 1326e652d70Sopenharmony_cirespectively. 1336e652d70Sopenharmony_ci 1346e652d70Sopenharmony_ciLookups in this data structure are significantly more efficient than binary 1356e652d70Sopenharmony_cisearch. A lookup touches either 1, 2, or 3 cache lines based on which of the 1366e652d70Sopenharmony_citrie partitions is being accessed. 1376e652d70Sopenharmony_ci 1386e652d70Sopenharmony_ciOne possible performance improvement would be for this crate to expose a way to 1396e652d70Sopenharmony_ciquery based on a UTF-8 encoded string, returning the Unicode property 1406e652d70Sopenharmony_cicorresponding to the first character in the string. Without such an API, the 1416e652d70Sopenharmony_cicaller is required to tokenize their UTF-8 encoded input data into `char`, hand 1426e652d70Sopenharmony_cithe `char` into `ucd-trie`, only for `ucd-trie` to undo that work by converting 1436e652d70Sopenharmony_ciback into the variable-length representation for trie traversal. 1446e652d70Sopenharmony_ci 1456e652d70Sopenharmony_ci#### fst 1466e652d70Sopenharmony_ci 1476e652d70Sopenharmony_ciUses a [finite state transducer][fst]. This representation is built into 1486e652d70Sopenharmony_ci[ucd-generate] but I am not aware of any advantage over the `ucd-trie` 1496e652d70Sopenharmony_cirepresentation. In particular `ucd-trie` is optimized for storing Unicode 1506e652d70Sopenharmony_ciproperties while `fst` is not. 1516e652d70Sopenharmony_ci 1526e652d70Sopenharmony_ci[fst]: https://github.com/BurntSushi/fst 1536e652d70Sopenharmony_ci[ucd-generate]: https://github.com/BurntSushi/ucd-generate 1546e652d70Sopenharmony_ci 1556e652d70Sopenharmony_ciAs far as I can tell, the main thing that causes `fst` to have large size and 1566e652d70Sopenharmony_cislow lookups for this use case relative to `ucd-trie` is that it does not 1576e652d70Sopenharmony_cispecialize for the fact that only 21 of the 32 bits in a `char` are meaningful. 1586e652d70Sopenharmony_ciThere are some dense arrays in the structure with large ranges that could never 1596e652d70Sopenharmony_cipossibly be used. 1606e652d70Sopenharmony_ci 1616e652d70Sopenharmony_ci#### roaring 1626e652d70Sopenharmony_ci 1636e652d70Sopenharmony_ciThis crate is a pure-Rust implementation of [Roaring Bitmap], a data structure 1646e652d70Sopenharmony_cidesigned for storing sets of 32-bit unsigned integers. 1656e652d70Sopenharmony_ci 1666e652d70Sopenharmony_ci[Roaring Bitmap]: https://roaringbitmap.org/about/ 1676e652d70Sopenharmony_ci 1686e652d70Sopenharmony_ciRoaring bitmaps are compressed bitmaps which tend to outperform conventional 1696e652d70Sopenharmony_cicompressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be 1706e652d70Sopenharmony_cihundreds of times faster and they often offer significantly better compression. 1716e652d70Sopenharmony_ci 1726e652d70Sopenharmony_ciIn this use case the performance was reasonably competitive but still 1736e652d70Sopenharmony_cisubstantially slower than the Unicode-optimized crates. Meanwhile the 1746e652d70Sopenharmony_cicompression was significantly worse, requiring 6× as much storage for the 1756e652d70Sopenharmony_cidata structure. 1766e652d70Sopenharmony_ci 1776e652d70Sopenharmony_ciI also benchmarked the [`croaring`] crate which is an FFI wrapper around the C 1786e652d70Sopenharmony_cireference implementation of Roaring Bitmap. This crate was consistently about 1796e652d70Sopenharmony_ci15% slower than pure-Rust `roaring`, which could just be FFI overhead. I did not 1806e652d70Sopenharmony_ciinvestigate further. 1816e652d70Sopenharmony_ci 1826e652d70Sopenharmony_ci[`croaring`]: https://crates.io/crates/croaring 1836e652d70Sopenharmony_ci 1846e652d70Sopenharmony_ci#### unicode-ident 1856e652d70Sopenharmony_ci 1866e652d70Sopenharmony_ciThis crate is most similar to the `ucd-trie` library, in that it's based on 1876e652d70Sopenharmony_cibitmaps stored in the leafs of a trie representation, achieving both prefix 1886e652d70Sopenharmony_cicompression and suffix compression. 1896e652d70Sopenharmony_ci 1906e652d70Sopenharmony_ciThe key differences are: 1916e652d70Sopenharmony_ci 1926e652d70Sopenharmony_ci- Uses a single 2-level trie, rather than 3 disjoint partitions of different 1936e652d70Sopenharmony_ci depth each. 1946e652d70Sopenharmony_ci- Uses significantly larger chunks: 512 bits rather than 64 bits. 1956e652d70Sopenharmony_ci- Compresses the XID\_Start and XID\_Continue properties together 1966e652d70Sopenharmony_ci simultaneously, rather than duplicating identical trie leaf chunks across the 1976e652d70Sopenharmony_ci two. 1986e652d70Sopenharmony_ci 1996e652d70Sopenharmony_ciThe following diagram show the XID\_Start and XID\_Continue Unicode boolean 2006e652d70Sopenharmony_ciproperties in uncompressed form, in row-major order: 2016e652d70Sopenharmony_ci 2026e652d70Sopenharmony_ci<table> 2036e652d70Sopenharmony_ci<tr><th>XID_Start</th><th>XID_Continue</th></tr> 2046e652d70Sopenharmony_ci<tr> 2056e652d70Sopenharmony_ci<td><img alt="XID_Start bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-afec-49b2-9ef5-c03e9d1e0760.png"></td> 2066e652d70Sopenharmony_ci<td><img alt="XID_Continue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2362-4d7d-8cd7-d21c011d329b.png"></td> 2076e652d70Sopenharmony_ci</tr> 2086e652d70Sopenharmony_ci</table> 2096e652d70Sopenharmony_ci 2106e652d70Sopenharmony_ciUncompressed, these would take 140 K to store, which is beyond what would be 2116e652d70Sopenharmony_cireasonable. However, as you can see there is a large degree of similarity 2126e652d70Sopenharmony_cibetween the two bitmaps and across the rows, which lends well to compression. 2136e652d70Sopenharmony_ci 2146e652d70Sopenharmony_ciThis crate stores one 512-bit "row" of the above bitmaps in the leaf level of a 2156e652d70Sopenharmony_citrie, and a single additional level to index into the leafs. It turns out there 2166e652d70Sopenharmony_ciare 124 unique 512-bit chunks across the two bitmaps so 7 bits are sufficient to 2176e652d70Sopenharmony_ciindex them. 2186e652d70Sopenharmony_ci 2196e652d70Sopenharmony_ciThe chunk size of 512 bits is selected as the size that minimizes the total size 2206e652d70Sopenharmony_ciof the data structure. A smaller chunk, like 256 or 128 bits, would achieve 2216e652d70Sopenharmony_cibetter deduplication but require a larger index. A larger chunk would increase 2226e652d70Sopenharmony_ciredundancy in the leaf bitmaps. 512 bit chunks are the optimum for total size of 2236e652d70Sopenharmony_cithe index plus leaf bitmaps. 2246e652d70Sopenharmony_ci 2256e652d70Sopenharmony_ciIn fact since there are only 124 unique chunks, we can use an 8-bit index with a 2266e652d70Sopenharmony_cispare bit to index at the half-chunk level. This achieves an additional 8.5% 2276e652d70Sopenharmony_cicompression by eliminating redundancies between the second half of any chunk and 2286e652d70Sopenharmony_cithe first half of any other chunk. Note that this is not the same as using 2296e652d70Sopenharmony_cichunks which are half the size, because it does not necessitate raising the size 2306e652d70Sopenharmony_ciof the trie's first level. 2316e652d70Sopenharmony_ci 2326e652d70Sopenharmony_ciIn contrast to binary search or the `ucd-trie` crate, performing lookups in this 2336e652d70Sopenharmony_cidata structure is straight-line code with no need for branching. 2346e652d70Sopenharmony_ci 2356e652d70Sopenharmony_ci```asm 2366e652d70Sopenharmony_ciis_xid_start: 2376e652d70Sopenharmony_ci mov eax, edi 2386e652d70Sopenharmony_ci shr eax, 9 2396e652d70Sopenharmony_ci lea rcx, [rip + unicode_ident::tables::TRIE_START] 2406e652d70Sopenharmony_ci add rcx, rax 2416e652d70Sopenharmony_ci xor eax, eax 2426e652d70Sopenharmony_ci cmp edi, 201728 2436e652d70Sopenharmony_ci cmovb rax, rcx 2446e652d70Sopenharmony_ci test rax, rax 2456e652d70Sopenharmony_ci lea rcx, [rip + .L__unnamed_1] 2466e652d70Sopenharmony_ci cmovne rcx, rax 2476e652d70Sopenharmony_ci movzx eax, byte ptr [rcx] 2486e652d70Sopenharmony_ci shl rax, 5 2496e652d70Sopenharmony_ci mov ecx, edi 2506e652d70Sopenharmony_ci shr ecx, 3 2516e652d70Sopenharmony_ci and ecx, 63 2526e652d70Sopenharmony_ci add rcx, rax 2536e652d70Sopenharmony_ci lea rax, [rip + unicode_ident::tables::LEAF] 2546e652d70Sopenharmony_ci mov al, byte ptr [rax + rcx] 2556e652d70Sopenharmony_ci and dil, 7 2566e652d70Sopenharmony_ci mov ecx, edi 2576e652d70Sopenharmony_ci shr al, cl 2586e652d70Sopenharmony_ci and al, 1 2596e652d70Sopenharmony_ci ret 2606e652d70Sopenharmony_ci``` 2616e652d70Sopenharmony_ci 2626e652d70Sopenharmony_ci<br> 2636e652d70Sopenharmony_ci 2646e652d70Sopenharmony_ci## License 2656e652d70Sopenharmony_ci 2666e652d70Sopenharmony_ciUse of the Unicode Character Database, as this crate does, is governed by the <a 2676e652d70Sopenharmony_cihref="LICENSE-UNICODE">Unicode License Agreement – Data Files and Software 2686e652d70Sopenharmony_ci(2016)</a>. 2696e652d70Sopenharmony_ci 2706e652d70Sopenharmony_ciAll intellectual property within this crate that is **not generated** using the 2716e652d70Sopenharmony_ciUnicode Character Database as input is licensed under either of <a 2726e652d70Sopenharmony_cihref="LICENSE-APACHE">Apache License, Version 2.0</a> or <a 2736e652d70Sopenharmony_cihref="LICENSE-MIT">MIT license</a> at your option. 2746e652d70Sopenharmony_ci 2756e652d70Sopenharmony_ciThe **generated** files incorporate tabular data derived from the Unicode 2766e652d70Sopenharmony_ciCharacter Database, together with intellectual property from the original source 2776e652d70Sopenharmony_cicode content of the crate. One must comply with the terms of both the Unicode 2786e652d70Sopenharmony_ciLicense Agreement and either of the Apache license or MIT license when those 2796e652d70Sopenharmony_cigenerated files are involved. 2806e652d70Sopenharmony_ci 2816e652d70Sopenharmony_ciUnless you explicitly state otherwise, any contribution intentionally submitted 2826e652d70Sopenharmony_cifor inclusion in this crate by you, as defined in the Apache-2.0 license, shall 2836e652d70Sopenharmony_cibe licensed as just described, without any additional terms or conditions. 284