16e652d70Sopenharmony_ci//! [![github]](https://github.com/dtolnay/unicode-ident) [![crates-io]](https://crates.io/crates/unicode-ident) [![docs-rs]](https://docs.rs/unicode-ident) 26e652d70Sopenharmony_ci//! 36e652d70Sopenharmony_ci//! [github]: https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github 46e652d70Sopenharmony_ci//! [crates-io]: https://img.shields.io/badge/crates.io-fc8d62?style=for-the-badge&labelColor=555555&logo=rust 56e652d70Sopenharmony_ci//! [docs-rs]: https://img.shields.io/badge/docs.rs-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs 66e652d70Sopenharmony_ci//! 76e652d70Sopenharmony_ci//! <br> 86e652d70Sopenharmony_ci//! 96e652d70Sopenharmony_ci//! Implementation of [Unicode Standard Annex #31][tr31] for determining which 106e652d70Sopenharmony_ci//! `char` values are valid in programming language identifiers. 116e652d70Sopenharmony_ci//! 126e652d70Sopenharmony_ci//! [tr31]: https://www.unicode.org/reports/tr31/ 136e652d70Sopenharmony_ci//! 146e652d70Sopenharmony_ci//! This crate is a better optimized implementation of the older `unicode-xid` 156e652d70Sopenharmony_ci//! crate. This crate uses less static storage, and is able to classify both 166e652d70Sopenharmony_ci//! ASCII and non-ASCII codepoints with better performance, 2–10× 176e652d70Sopenharmony_ci//! faster than `unicode-xid`. 186e652d70Sopenharmony_ci//! 196e652d70Sopenharmony_ci//! <br> 206e652d70Sopenharmony_ci//! 216e652d70Sopenharmony_ci//! ## Comparison of performance 226e652d70Sopenharmony_ci//! 236e652d70Sopenharmony_ci//! The following table shows a comparison between five Unicode identifier 246e652d70Sopenharmony_ci//! implementations. 256e652d70Sopenharmony_ci//! 266e652d70Sopenharmony_ci//! - `unicode-ident` is this crate; 276e652d70Sopenharmony_ci//! - [`unicode-xid`] is a widely used crate run by the "unicode-rs" org; 286e652d70Sopenharmony_ci//! - `ucd-trie` and `fst` are two data structures supported by the 296e652d70Sopenharmony_ci//! [`ucd-generate`] tool; 306e652d70Sopenharmony_ci//! - [`roaring`] is a Rust implementation of Roaring bitmap. 316e652d70Sopenharmony_ci//! 326e652d70Sopenharmony_ci//! The *static storage* column shows the total size of `static` tables that the 336e652d70Sopenharmony_ci//! crate bakes into your binary, measured in 1000s of bytes. 346e652d70Sopenharmony_ci//! 356e652d70Sopenharmony_ci//! The remaining columns show the **cost per call** to evaluate whether a 366e652d70Sopenharmony_ci//! single `char` has the XID\_Start or XID\_Continue Unicode property, 376e652d70Sopenharmony_ci//! comparing across different ratios of ASCII to non-ASCII codepoints in the 386e652d70Sopenharmony_ci//! input data. 396e652d70Sopenharmony_ci//! 406e652d70Sopenharmony_ci//! [`unicode-xid`]: https://github.com/unicode-rs/unicode-xid 416e652d70Sopenharmony_ci//! [`ucd-generate`]: https://github.com/BurntSushi/ucd-generate 426e652d70Sopenharmony_ci//! [`roaring`]: https://github.com/RoaringBitmap/roaring-rs 436e652d70Sopenharmony_ci//! 446e652d70Sopenharmony_ci//! | | static storage | 0% nonascii | 1% | 10% | 100% nonascii | 456e652d70Sopenharmony_ci//! |---|---|---|---|---|---| 466e652d70Sopenharmony_ci//! | **`unicode-ident`** | 9.75 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns | 476e652d70Sopenharmony_ci//! | **`unicode-xid`** | 11.34 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns | 486e652d70Sopenharmony_ci//! | **`ucd-trie`** | 9.95 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns | 496e652d70Sopenharmony_ci//! | **`fst`** | 133 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns | 506e652d70Sopenharmony_ci//! | **`roaring`** | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns | 516e652d70Sopenharmony_ci//! 526e652d70Sopenharmony_ci//! Source code for the benchmark is provided in the *bench* directory of this 536e652d70Sopenharmony_ci//! repo and may be repeated by running `cargo criterion`. 546e652d70Sopenharmony_ci//! 556e652d70Sopenharmony_ci//! <br> 566e652d70Sopenharmony_ci//! 576e652d70Sopenharmony_ci//! ## Comparison of data structures 586e652d70Sopenharmony_ci//! 596e652d70Sopenharmony_ci//! #### unicode-xid 606e652d70Sopenharmony_ci//! 616e652d70Sopenharmony_ci//! They use a sorted array of character ranges, and do a binary search to look 626e652d70Sopenharmony_ci//! up whether a given character lands inside one of those ranges. 636e652d70Sopenharmony_ci//! 646e652d70Sopenharmony_ci//! ```rust 656e652d70Sopenharmony_ci//! # const _: &str = stringify! { 666e652d70Sopenharmony_ci//! static XID_Continue_table: [(char, char); 763] = [ 676e652d70Sopenharmony_ci//! ('\u{30}', '\u{39}'), // 0-9 686e652d70Sopenharmony_ci//! ('\u{41}', '\u{5a}'), // A-Z 696e652d70Sopenharmony_ci//! # " 706e652d70Sopenharmony_ci//! … 716e652d70Sopenharmony_ci//! # " 726e652d70Sopenharmony_ci//! ('\u{e0100}', '\u{e01ef}'), 736e652d70Sopenharmony_ci//! ]; 746e652d70Sopenharmony_ci//! # }; 756e652d70Sopenharmony_ci//! ``` 766e652d70Sopenharmony_ci//! 776e652d70Sopenharmony_ci//! The static storage used by this data structure scales with the number of 786e652d70Sopenharmony_ci//! contiguous ranges of identifier codepoints in Unicode. Every table entry 796e652d70Sopenharmony_ci//! consumes 8 bytes, because it consists of a pair of 32-bit `char` values. 806e652d70Sopenharmony_ci//! 816e652d70Sopenharmony_ci//! In some ranges of the Unicode codepoint space, this is quite a sparse 826e652d70Sopenharmony_ci//! representation – there are some ranges where tens of thousands of 836e652d70Sopenharmony_ci//! adjacent codepoints are all valid identifier characters. In other places, 846e652d70Sopenharmony_ci//! the representation is quite inefficient. A characater like `µ` (U+00B5) 856e652d70Sopenharmony_ci//! which is surrounded by non-identifier codepoints consumes 64 bits in the 866e652d70Sopenharmony_ci//! table, while it would be just 1 bit in a dense bitmap. 876e652d70Sopenharmony_ci//! 886e652d70Sopenharmony_ci//! On a system with 64-byte cache lines, binary searching the table touches 7 896e652d70Sopenharmony_ci//! cache lines on average. Each cache line fits only 8 table entries. 906e652d70Sopenharmony_ci//! Additionally, the branching performed during the binary search is probably 916e652d70Sopenharmony_ci//! mostly unpredictable to the branch predictor. 926e652d70Sopenharmony_ci//! 936e652d70Sopenharmony_ci//! Overall, the crate ends up being about 10× slower on non-ASCII input 946e652d70Sopenharmony_ci//! compared to the fastest crate. 956e652d70Sopenharmony_ci//! 966e652d70Sopenharmony_ci//! A potential improvement would be to pack the table entries more compactly. 976e652d70Sopenharmony_ci//! Rust's `char` type is a 21-bit integer padded to 32 bits, which means every 986e652d70Sopenharmony_ci//! table entry is holding 22 bits of wasted space, adding up to 3.9 K. They 996e652d70Sopenharmony_ci//! could instead fit every table entry into 6 bytes, leaving out some of the 1006e652d70Sopenharmony_ci//! padding, for a 25% improvement in space used. With some cleverness it may be 1016e652d70Sopenharmony_ci//! possible to fit in 5 bytes or even 4 bytes by storing a low char and an 1026e652d70Sopenharmony_ci//! extent, instead of low char and high char. I don't expect that performance 1036e652d70Sopenharmony_ci//! would improve much but this could be the most efficient for space across all 1046e652d70Sopenharmony_ci//! the libraries, needing only about 7 K to store. 1056e652d70Sopenharmony_ci//! 1066e652d70Sopenharmony_ci//! #### ucd-trie 1076e652d70Sopenharmony_ci//! 1086e652d70Sopenharmony_ci//! Their data structure is a compressed trie set specifically tailored for 1096e652d70Sopenharmony_ci//! Unicode codepoints. The design is credited to Raph Levien in 1106e652d70Sopenharmony_ci//! [rust-lang/rust#33098]. 1116e652d70Sopenharmony_ci//! 1126e652d70Sopenharmony_ci//! [rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098 1136e652d70Sopenharmony_ci//! 1146e652d70Sopenharmony_ci//! ```rust 1156e652d70Sopenharmony_ci//! pub struct TrieSet { 1166e652d70Sopenharmony_ci//! tree1_level1: &'static [u64; 32], 1176e652d70Sopenharmony_ci//! tree2_level1: &'static [u8; 992], 1186e652d70Sopenharmony_ci//! tree2_level2: &'static [u64], 1196e652d70Sopenharmony_ci//! tree3_level1: &'static [u8; 256], 1206e652d70Sopenharmony_ci//! tree3_level2: &'static [u8], 1216e652d70Sopenharmony_ci//! tree3_level3: &'static [u64], 1226e652d70Sopenharmony_ci//! } 1236e652d70Sopenharmony_ci//! ``` 1246e652d70Sopenharmony_ci//! 1256e652d70Sopenharmony_ci//! It represents codepoint sets using a trie to achieve prefix compression. The 1266e652d70Sopenharmony_ci//! final states of the trie are embedded in leaves or "chunks", where each 1276e652d70Sopenharmony_ci//! chunk is a 64-bit integer. Each bit position of the integer corresponds to 1286e652d70Sopenharmony_ci//! whether a particular codepoint is in the set or not. These chunks are not 1296e652d70Sopenharmony_ci//! just a compact representation of the final states of the trie, but are also 1306e652d70Sopenharmony_ci//! a form of suffix compression. In particular, if multiple ranges of 64 1316e652d70Sopenharmony_ci//! contiguous codepoints have the same Unicode properties, then they all map to 1326e652d70Sopenharmony_ci//! the same chunk in the final level of the trie. 1336e652d70Sopenharmony_ci//! 1346e652d70Sopenharmony_ci//! Being tailored for Unicode codepoints, this trie is partitioned into three 1356e652d70Sopenharmony_ci//! disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints 1366e652d70Sopenharmony_ci//! \[0, 0x800), the second \[0x800, 0x10000) and the third \[0x10000, 1376e652d70Sopenharmony_ci//! 0x110000). These partitions conveniently correspond to the space of 1 or 2 1386e652d70Sopenharmony_ci//! byte UTF-8 encoded codepoints, 3 byte UTF-8 encoded codepoints and 4 byte 1396e652d70Sopenharmony_ci//! UTF-8 encoded codepoints, respectively. 1406e652d70Sopenharmony_ci//! 1416e652d70Sopenharmony_ci//! Lookups in this data structure are significantly more efficient than binary 1426e652d70Sopenharmony_ci//! search. A lookup touches either 1, 2, or 3 cache lines based on which of the 1436e652d70Sopenharmony_ci//! trie partitions is being accessed. 1446e652d70Sopenharmony_ci//! 1456e652d70Sopenharmony_ci//! One possible performance improvement would be for this crate to expose a way 1466e652d70Sopenharmony_ci//! to query based on a UTF-8 encoded string, returning the Unicode property 1476e652d70Sopenharmony_ci//! corresponding to the first character in the string. Without such an API, the 1486e652d70Sopenharmony_ci//! caller is required to tokenize their UTF-8 encoded input data into `char`, 1496e652d70Sopenharmony_ci//! hand the `char` into `ucd-trie`, only for `ucd-trie` to undo that work by 1506e652d70Sopenharmony_ci//! converting back into the variable-length representation for trie traversal. 1516e652d70Sopenharmony_ci//! 1526e652d70Sopenharmony_ci//! #### fst 1536e652d70Sopenharmony_ci//! 1546e652d70Sopenharmony_ci//! Uses a [finite state transducer][fst]. This representation is built into 1556e652d70Sopenharmony_ci//! [ucd-generate] but I am not aware of any advantage over the `ucd-trie` 1566e652d70Sopenharmony_ci//! representation. In particular `ucd-trie` is optimized for storing Unicode 1576e652d70Sopenharmony_ci//! properties while `fst` is not. 1586e652d70Sopenharmony_ci//! 1596e652d70Sopenharmony_ci//! [fst]: https://github.com/BurntSushi/fst 1606e652d70Sopenharmony_ci//! [ucd-generate]: https://github.com/BurntSushi/ucd-generate 1616e652d70Sopenharmony_ci//! 1626e652d70Sopenharmony_ci//! As far as I can tell, the main thing that causes `fst` to have large size 1636e652d70Sopenharmony_ci//! and slow lookups for this use case relative to `ucd-trie` is that it does 1646e652d70Sopenharmony_ci//! not specialize for the fact that only 21 of the 32 bits in a `char` are 1656e652d70Sopenharmony_ci//! meaningful. There are some dense arrays in the structure with large ranges 1666e652d70Sopenharmony_ci//! that could never possibly be used. 1676e652d70Sopenharmony_ci//! 1686e652d70Sopenharmony_ci//! #### roaring 1696e652d70Sopenharmony_ci//! 1706e652d70Sopenharmony_ci//! This crate is a pure-Rust implementation of [Roaring Bitmap], a data 1716e652d70Sopenharmony_ci//! structure designed for storing sets of 32-bit unsigned integers. 1726e652d70Sopenharmony_ci//! 1736e652d70Sopenharmony_ci//! [Roaring Bitmap]: https://roaringbitmap.org/about/ 1746e652d70Sopenharmony_ci//! 1756e652d70Sopenharmony_ci//! Roaring bitmaps are compressed bitmaps which tend to outperform conventional 1766e652d70Sopenharmony_ci//! compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can 1776e652d70Sopenharmony_ci//! be hundreds of times faster and they often offer significantly better 1786e652d70Sopenharmony_ci//! compression. 1796e652d70Sopenharmony_ci//! 1806e652d70Sopenharmony_ci//! In this use case the performance was reasonably competitive but still 1816e652d70Sopenharmony_ci//! substantially slower than the Unicode-optimized crates. Meanwhile the 1826e652d70Sopenharmony_ci//! compression was significantly worse, requiring 6× as much storage for 1836e652d70Sopenharmony_ci//! the data structure. 1846e652d70Sopenharmony_ci//! 1856e652d70Sopenharmony_ci//! I also benchmarked the [`croaring`] crate which is an FFI wrapper around the 1866e652d70Sopenharmony_ci//! C reference implementation of Roaring Bitmap. This crate was consistently 1876e652d70Sopenharmony_ci//! about 15% slower than pure-Rust `roaring`, which could just be FFI overhead. 1886e652d70Sopenharmony_ci//! I did not investigate further. 1896e652d70Sopenharmony_ci//! 1906e652d70Sopenharmony_ci//! [`croaring`]: https://crates.io/crates/croaring 1916e652d70Sopenharmony_ci//! 1926e652d70Sopenharmony_ci//! #### unicode-ident 1936e652d70Sopenharmony_ci//! 1946e652d70Sopenharmony_ci//! This crate is most similar to the `ucd-trie` library, in that it's based on 1956e652d70Sopenharmony_ci//! bitmaps stored in the leafs of a trie representation, achieving both prefix 1966e652d70Sopenharmony_ci//! compression and suffix compression. 1976e652d70Sopenharmony_ci//! 1986e652d70Sopenharmony_ci//! The key differences are: 1996e652d70Sopenharmony_ci//! 2006e652d70Sopenharmony_ci//! - Uses a single 2-level trie, rather than 3 disjoint partitions of different 2016e652d70Sopenharmony_ci//! depth each. 2026e652d70Sopenharmony_ci//! - Uses significantly larger chunks: 512 bits rather than 64 bits. 2036e652d70Sopenharmony_ci//! - Compresses the XID\_Start and XID\_Continue properties together 2046e652d70Sopenharmony_ci//! simultaneously, rather than duplicating identical trie leaf chunks across 2056e652d70Sopenharmony_ci//! the two. 2066e652d70Sopenharmony_ci//! 2076e652d70Sopenharmony_ci//! The following diagram show the XID\_Start and XID\_Continue Unicode boolean 2086e652d70Sopenharmony_ci//! properties in uncompressed form, in row-major order: 2096e652d70Sopenharmony_ci//! 2106e652d70Sopenharmony_ci//! <table> 2116e652d70Sopenharmony_ci//! <tr><th>XID_Start</th><th>XID_Continue</th></tr> 2126e652d70Sopenharmony_ci//! <tr> 2136e652d70Sopenharmony_ci//! <td><img alt="XID_Start bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-afec-49b2-9ef5-c03e9d1e0760.png"></td> 2146e652d70Sopenharmony_ci//! <td><img alt="XID_Continue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2362-4d7d-8cd7-d21c011d329b.png"></td> 2156e652d70Sopenharmony_ci//! </tr> 2166e652d70Sopenharmony_ci//! </table> 2176e652d70Sopenharmony_ci//! 2186e652d70Sopenharmony_ci//! Uncompressed, these would take 140 K to store, which is beyond what would be 2196e652d70Sopenharmony_ci//! reasonable. However, as you can see there is a large degree of similarity 2206e652d70Sopenharmony_ci//! between the two bitmaps and across the rows, which lends well to 2216e652d70Sopenharmony_ci//! compression. 2226e652d70Sopenharmony_ci//! 2236e652d70Sopenharmony_ci//! This crate stores one 512-bit "row" of the above bitmaps in the leaf level 2246e652d70Sopenharmony_ci//! of a trie, and a single additional level to index into the leafs. It turns 2256e652d70Sopenharmony_ci//! out there are 124 unique 512-bit chunks across the two bitmaps so 7 bits are 2266e652d70Sopenharmony_ci//! sufficient to index them. 2276e652d70Sopenharmony_ci//! 2286e652d70Sopenharmony_ci//! The chunk size of 512 bits is selected as the size that minimizes the total 2296e652d70Sopenharmony_ci//! size of the data structure. A smaller chunk, like 256 or 128 bits, would 2306e652d70Sopenharmony_ci//! achieve better deduplication but require a larger index. A larger chunk 2316e652d70Sopenharmony_ci//! would increase redundancy in the leaf bitmaps. 512 bit chunks are the 2326e652d70Sopenharmony_ci//! optimum for total size of the index plus leaf bitmaps. 2336e652d70Sopenharmony_ci//! 2346e652d70Sopenharmony_ci//! In fact since there are only 124 unique chunks, we can use an 8-bit index 2356e652d70Sopenharmony_ci//! with a spare bit to index at the half-chunk level. This achieves an 2366e652d70Sopenharmony_ci//! additional 8.5% compression by eliminating redundancies between the second 2376e652d70Sopenharmony_ci//! half of any chunk and the first half of any other chunk. Note that this is 2386e652d70Sopenharmony_ci//! not the same as using chunks which are half the size, because it does not 2396e652d70Sopenharmony_ci//! necessitate raising the size of the trie's first level. 2406e652d70Sopenharmony_ci//! 2416e652d70Sopenharmony_ci//! In contrast to binary search or the `ucd-trie` crate, performing lookups in 2426e652d70Sopenharmony_ci//! this data structure is straight-line code with no need for branching. 2436e652d70Sopenharmony_ci 2446e652d70Sopenharmony_ci#![no_std] 2456e652d70Sopenharmony_ci#![doc(html_root_url = "https://docs.rs/unicode-ident/1.0.8")] 2466e652d70Sopenharmony_ci#![allow(clippy::doc_markdown, clippy::must_use_candidate)] 2476e652d70Sopenharmony_ci 2486e652d70Sopenharmony_ci#[rustfmt::skip] 2496e652d70Sopenharmony_cimod tables; 2506e652d70Sopenharmony_ci 2516e652d70Sopenharmony_ciuse crate::tables::{ASCII_CONTINUE, ASCII_START, CHUNK, LEAF, TRIE_CONTINUE, TRIE_START}; 2526e652d70Sopenharmony_ci 2536e652d70Sopenharmony_cipub fn is_xid_start(ch: char) -> bool { 2546e652d70Sopenharmony_ci if ch.is_ascii() { 2556e652d70Sopenharmony_ci return ASCII_START.0[ch as usize]; 2566e652d70Sopenharmony_ci } 2576e652d70Sopenharmony_ci let chunk = *TRIE_START.0.get(ch as usize / 8 / CHUNK).unwrap_or(&0); 2586e652d70Sopenharmony_ci let offset = chunk as usize * CHUNK / 2 + ch as usize / 8 % CHUNK; 2596e652d70Sopenharmony_ci unsafe { LEAF.0.get_unchecked(offset) }.wrapping_shr(ch as u32 % 8) & 1 != 0 2606e652d70Sopenharmony_ci} 2616e652d70Sopenharmony_ci 2626e652d70Sopenharmony_cipub fn is_xid_continue(ch: char) -> bool { 2636e652d70Sopenharmony_ci if ch.is_ascii() { 2646e652d70Sopenharmony_ci return ASCII_CONTINUE.0[ch as usize]; 2656e652d70Sopenharmony_ci } 2666e652d70Sopenharmony_ci let chunk = *TRIE_CONTINUE.0.get(ch as usize / 8 / CHUNK).unwrap_or(&0); 2676e652d70Sopenharmony_ci let offset = chunk as usize * CHUNK / 2 + ch as usize / 8 % CHUNK; 2686e652d70Sopenharmony_ci unsafe { LEAF.0.get_unchecked(offset) }.wrapping_shr(ch as u32 % 8) & 1 != 0 2696e652d70Sopenharmony_ci} 270