---
layout: default
title: StringPrep
nav_order: 7
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# StringPrep

## Overview

Comparing strings in a consistent manner becomes imperative when a large
repertoire of characters such as Unicode is used in network protocols.
StringPrep provides sets of rules for the use of Unicode and syntax for the
prevention of spoofing. The implementation of the StringPrep and IDNA services
and their usage in ICU are described below.

## StringPrep

StringPrep, the process of preparing Unicode strings for use in network
protocols, is defined in RFC 3454 (<http://www.rfc-editor.org/rfc/rfc3454.txt>).
The RFC defines a broad framework and rules for processing the strings.

Protocols that prescribe the use of StringPrep must define a profile of
StringPrep whose applicability is limited to the protocol. A profile is a set
of rules and data tables that describe how the strings should be prepared. A
profile can turn normalization and checking for bidirectional characters on or
off. It can also add or remove mappings, unassigned code points, and prohibited
code points from the tables provided.

StringPrep uses Unicode Version 3.2 and defines a set of tables for use by the
profiles. The profiles can choose to include or exclude tables or code points
from the tables defined by the RFC.

StringPrep defines tables that can be broadly classified into:

1. *Unassigned Table*: Contains code points that are unassigned in Unicode
   Version 3.2. Unassigned code points may be allowed or disallowed in the
   output string depending on the application. The table in Appendix A.1 of the
   RFC contains the code points.

2. *Mapping Tables*: Code points that are commonly deleted from the output and
   code points that are case mapped are included in this table.
   There are two mapping tables in the Appendix, namely B.1 and B.2.

3. *Prohibited Tables*: Contains code points that are prohibited from the
   output string. Control codes, private use area code points, non-character
   code points, surrogate code points, tagging and deprecated code points are
   included in this table. There are nine tables in the Appendix that list the
   prohibited code points, namely C.1, C.2, C.3, C.4, C.5, C.6, C.7, C.8 and
   C.9.

The procedure for preparing strings for use can be described in the following
steps:

1. *Map*: For each code point in the input, check if it has a mapping defined
   in the mapping table. If so, replace it with the mapping in the output.

2. *Normalize*: Normalize the output of step 1 using Unicode Normalization Form
   NFKC, if the option is set. The normalization algorithm must conform to
   UAX 15.

3. *Prohibit*: For each code point in the output of step 2, check if the code
   point is present in the prohibited table. If so, fail and return an error.

4. *Check BiDi*: Check for code points with strong right-to-left directionality
   in the output of step 3. If any are present, check that the string satisfies
   the rules for bidirectional strings as specified.

## NamePrep

NamePrep is a profile of StringPrep for use in IDNA. This profile is defined in
RFC 3491 (<http://www.rfc-editor.org/rfc/rfc3491.txt>).

The profile specifies the following rules:

1. *Map*: Include all code point mappings specified in StringPrep.

2. *Normalize*: Normalize the output of step 1 according to NFKC.

3. *Prohibit*: Prohibit all code points specified as prohibited in StringPrep,
   except for the space (U+0020) code point, from the output of step 2.

4. *Check BiDi*: Check for bidirectional code points and process according to
   the rules specified in StringPrep.

## Punycode

Punycode is an encoding scheme for Unicode for use in IDNA.
Punycode converts Unicode text to a unique sequence of ASCII text and back to
Unicode. It is an ASCII Compatible Encoding (ACE). Punycode is described in
RFC 3492 (<http://www.rfc-editor.org/rfc/rfc3492.txt>).

The Punycode algorithm is a form of the general Bootstring algorithm, which
allows strings composed of a smaller set of code points to uniquely represent
any string of code points from a larger set. Punycode represents Unicode code
points from U+0000 to U+10FFFF by using the smaller ASCII set U+0000 to U+007F.
The algorithm can also preserve case information of the code points in the
larger set while encoding and decoding. This feature, however, is not used in
IDNA.

## Internationalizing Domain Names in Applications (IDNA)

The Domain Name System (DNS) protocol defines the procedure for matching ASCII
strings case insensitively to the names in the lookup tables that map IP
(Internet Protocol) addresses to server names. When Unicode is used instead of
ASCII in server names, two problems arise that need to be dealt with
differently. When the server name is displayed to the user, Unicode text should
be displayed. When Unicode text is stored in lookup tables, the text should be
the ASCII equivalent, for compatibility with the older DNS protocol and the
resolver libraries. The IDNA protocol, defined by RFC 3490
(<http://www.rfc-editor.org/rfc/rfc3490.txt>), satisfies the above
requirements.

Server names stored in the DNS lookup tables are usually formed by
concatenating domain labels with a label separator, the ASCII Full Stop
(U+002E).

The protocol defines operations to be performed on domain labels before the
names are stored in the lookup tables and before the names fetched from lookup
tables are displayed to the user. The operations are:

1. ToASCII: This operation is performed on domain labels before sending the
   name to a resolver and before storing the name in the DNS lookup table. The
   domain labels are processed by the StringPrep algorithm using the rules
   specified by the NamePrep profile. The output of this step is then encoded
   using Punycode, and an ACE prefix is added to denote that the text is
   encoded using Punycode. IDNA uses "xn--" before the encoded label.

2. ToUnicode: This operation is performed on domain labels before displaying
   the names to users. If the domain label is prefixed with the ACE prefix for
   IDNA, then the label excluding the prefix is decoded using Punycode. The
   output of the Punycode decoder is verified by applying the ToASCII operation
   and comparing the output with the input to the ToUnicode operation.

Unicode contains code points that are glyphically similar to the ASCII Full
Stop (U+002E). These code points must be treated as label separators when
performing the ToASCII operation. These code points are:

1. Ideographic Full Stop (U+3002)

2. Fullwidth Full Stop (U+FF0E)

3. Halfwidth Ideographic Full Stop (U+FF61)

Unassigned code points in Unicode Version 3.2, as given in the StringPrep
tables, are treated differently depending on how the processed string is used.
For query operations, where a registrar is asked about the availability of a
certain domain name, unassigned code points are allowed to be present in the
string. For storing the string in DNS lookup tables, unassigned code points are
prohibited from the input.

IDNA specifies that the ToUnicode and ToASCII operations have options to check
for Letter-Digit-Hyphen code points and adhere to the STD3 ASCII Rules.

IDNA specifies that domain labels are equivalent if and only if the outputs of
the ToASCII operation on the labels match using case insensitive ASCII
comparison.
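
The ToASCII and ToUnicode operations, the extra label separators, and the
equivalence rule can be tried out without ICU. The following is a minimal
sketch that uses the JDK's `java.net.IDN` class, which also implements RFC 3490,
as a stand-in for the ICU API. The class name `IdnaSketch` and the helper
`sameDomain` are hypothetical names chosen for this illustration.

```java
import java.net.IDN;

// Illustrative only: uses the JDK's java.net.IDN (RFC 3490), not ICU's IDNA API.
final class IdnaSketch {

    // Domain names are treated as equivalent if their ToASCII forms match
    // under case-insensitive ASCII comparison.
    static boolean sameDomain(String a, String b) {
        return IDN.toASCII(a).equalsIgnoreCase(IDN.toASCII(b));
    }

    public static void main(String[] args) {
        // ToASCII: NamePrep, then Punycode, then the "xn--" ACE prefix.
        String ace = IDN.toASCII("b\u00FCcher.example");
        System.out.println(ace); // xn--bcher-kva.example

        // ToUnicode: decode the ACE label again for display.
        System.out.println(IDN.toUnicode(ace).equals("b\u00FCcher.example")); // true

        // U+3002 acts as a label separator, and case differences do not matter.
        System.out.println(sameDomain("B\u00DCCHER\u3002EXAMPLE", ace)); // true
    }
}
```

The all-ASCII label `EXAMPLE` is returned unchanged by ToASCII, which is why the
final comparison has to be case insensitive.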

## StringPrep Service in ICU

The StringPrep service in ICU is data driven. The service is based on the
Open-Use-Close pattern. A StringPrep profile is opened, the strings are
processed according to the rules specified in the profile, and the profile is
closed once it is ready to be disposed of.

Tools are provided for filtering RFC 3454 and producing a rule file that can be
compiled into a binary format containing all the information required by the
service.

The procedure for producing a StringPrep profile data file is given below:

1. Run the filterRFC3454.pl Perl tool to filter the RFC file and produce a rule
   file. Clients can edit the resulting text file to add or delete mappings or
   prohibited code points.

2. Run the gensprep tool to compile the rule file into a binary format. The
   options to turn on normalization of strings and checking of bidirectional
   code points are passed as command line options to the tool. This tool
   produces a binary profile file with the extension "spp".

3. Open the StringPrep profile, passing the path to the binary and the name of
   the binary profile file to the open call. The profile data files are memory
   mapped and cached for optimum performance.

### Code Snippets

> :point_right: **Note**: The code snippets demonstrate the usage of the APIs.
> Applications should keep the profile object around for reuse, instead of
> opening and closing the profile each time.

#### C++

    /* requires #include "unicode/usprep.h" */
    UErrorCode status = U_ZERO_ERROR;
    UParseError parseError;
    /* open the StringPrep profile */
    UStringPrepProfile* nameprep = usprep_open("/usr/joe/mydata",
                                               "nfscsi", &status);
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* prepare the string for use according
     * to the rules specified in the profile;
     * the profile is the first argument
     */
    int32_t retLen = usprep_prepare(nameprep, src, srcLength, dest,
                                    destCapacity, USPREP_ALLOW_UNASSIGNED,
                                    &parseError, &status);
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* close the profile */
    usprep_close(nameprep);

#### Java

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UnsupportedEncodingException;
    import com.ibm.icu.text.StringPrep;
    import com.ibm.icu.text.StringPrepParseException;
    import com.ibm.icu.text.UCharacterIterator;

    public final class NFSCSIStringPrep {
        // profile loaded once and kept for reuse
        private final StringPrep nfscsi;
        // singleton instance
        private static final NFSCSIStringPrep prep = new NFSCSIStringPrep();

        private NFSCSIStringPrep() {
            try {
                // TestUtil is a helper class from the ICU4J test suite
                InputStream nfscsiFile = TestUtil.getDataStream("nfscsi.spp");
                nfscsi = new StringPrep(nfscsiFile);
                nfscsiFile.close();
            } catch(IOException e) {
                throw new RuntimeException(e.toString());
            }
        }

        private static byte[] prepare(byte[] src, StringPrep prep)
                throws StringPrepParseException, UnsupportedEncodingException {
            String s = new String(src, "UTF-8");
            UCharacterIterator iter = UCharacterIterator.getInstance(s);
            StringBuffer out = prep.prepare(iter, StringPrep.DEFAULT);
            return out.toString().getBytes("UTF-8");
        }
    }

## IDNA API in ICU

ICU provides APIs for performing the ToASCII, ToUnicode and compare operations
as defined by RFC 3490. Convenience methods for comparing IDNs are also
provided. These APIs follow ICU policies for string manipulation and coding
guidelines.

### Code Snippets

> :point_right: **Note**: The code snippets demonstrate the usage of the APIs.
> Applications should keep the profile object around for reuse, instead of
> opening and closing the profile each time.

### ToASCII operation

***C***

    UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                            UIDNA_DEFAULT, &parseError, &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        status = U_ZERO_ERROR;
        destCapacity = destLen + 1; /* for the terminating Null */
        free(dest); /* free the memory */
        /* allocate the full new capacity, not just destLen */
        dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
        destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                                UIDNA_DEFAULT, &parseError, &status);
    }
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* do interesting stuff with output */

***Java***

    try {
        StringBuffer out = IDNA.convertToASCII(inBuf, IDNA.DEFAULT);
    } catch(StringPrepParseException ex) {
        /* handle the exception */
    }

### ToUnicode operation

***C***

    UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                              UIDNA_DEFAULT,
                              &parseError, &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        status = U_ZERO_ERROR;
        destCapacity = destLen + 1; /* for the terminating Null */
        /* free the memory */
        free(dest);
        /* allocate the full new capacity, not just destLen */
        dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
        destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                                  UIDNA_DEFAULT, &parseError, &status);
    }
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* do interesting stuff with output */

***Java***

    try {
        StringBuffer out = IDNA.convertToUnicode(inBuf, IDNA.DEFAULT);
    } catch(StringPrepParseException ex) {
        // handle the exception
    }

### compare operation

***C***

    int32_t rc = uidna_compare(source1, length1,
                               source2, length2,
                               UIDNA_DEFAULT,
                               &status);
    if(U_FAILURE(status)) {
        /* handle the error */
    } else if(rc == 0) {
        /* the IDNs are same ... do something interesting */
    } else {
        /* the IDNs are different ... do something */
    }

***Java***

    try {
        int retVal = IDNA.compare(s1, s2, IDNA.DEFAULT);
        // do something interesting with retVal
    } catch(StringPrepParseException e) {
        // handle the exception
    }

## Design Considerations

StringPrep profiles exhibit the following characteristics:

1. The profiles contain information about code points. StringPrep allows
   profiles to add or delete code points or mappings.

2. Options such as turning normalization and checking for bidirectional code
   points on or off are properties of the profiles.

3. The StringPrep algorithm is not overridden by the profile.

4. Once defined, the profiles do not change.

The StringPrep profiles are used in network protocols, so runtime performance
is important.

Many profiles have been and are being defined, so applications should be able
to plug in arbitrary profiles and get the desired result out of the framework.

ICU is designed for this usage by providing build-time tools for arbitrary
StringPrep profile definitions, and loading them from application-supplied data
in binary form with data structures optimized for runtime use.

## Demo

A web application at <https://icu4c-demos.unicode.org/icu-bin/idnbrowser>
illustrates the use of the IDNA API. The source code for the application is
available at <https://github.com/unicode-org/icu-demos/tree/main/idnbrowser>.

## Appendix

#### NFS Version 4 Profiles

Network File System Version 4, defined by RFC 3530
(<http://www.rfc-editor.org/rfc/rfc3530.txt>), defines the use of Unicode text
in the protocol. ICU provides the requisite profiles as part of the test suite,
and code for processing the strings according to the profiles as part of the
samples.

The RFC defines three profiles:

1. *nfs4_cs_prep Profile*: This profile is used for preparing file and path
   name strings. Normalization of code points and checking for bidirectional
   code points are turned off. Case mappings are included if the NFS
   implementation supports case insensitive file and path names.

2. *nfs4_cis_prep Profile*: This profile is used for preparing NFS server
   names. Normalization of code points and checking for bidirectional code
   points are turned on. This profile is equivalent to the NamePrep profile.

3. *nfs4_mixed_prep Profile*: This profile is used for preparing strings in the
   Access Control Entries of NFS servers. These strings consist of two parts,
   prefix and suffix, separated by '@' (U+0040). The prefix is processed with
   case mappings turned off and the suffix is processed with case mappings
   turned on. Normalization of code points and checking for bidirectional code
   points are turned on.

#### XMPP Profiles

Extensible Messaging and Presence Protocol (XMPP) is an XML-based protocol for
near real-time extensible messaging and presence. This protocol defines the use
of two StringPrep profiles:

1. *ResourcePrep Profile*: This profile is used for processing the resource
   identifiers within XMPP. Normalization of code points and checking of
   bidirectional code points are turned on. Case mappings are excluded. The
   space code point (U+0020) is excluded from the prohibited code points set.

2. *NodePrep Profile*: This profile is used for processing the node identifiers
   within XMPP. Normalization of code points and checking of bidirectional code
   points are turned on. Case mappings are included. All code points specified
   as prohibited in StringPrep are prohibited. Additional code points are added
   to the prohibited set.
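
The profiles in this appendix differ mainly in which steps are switched on and
which code points are prohibited. The sketch below models that idea as a small
configuration object. It is not ICU's StringPrep service and does not use the
real RFC 3454 tables: the class `ToyProfile`, its fields, and the tiny
prohibited sets are hypothetical, `Character.toLowerCase` only approximates the
case-mapping table, `java.text.Normalizer` follows the JDK's Unicode version
rather than Unicode 3.2, and the BiDi check is omitted.

```java
import java.text.Normalizer;
import java.util.Set;

// Toy model of a StringPrep profile; NOT the real RFC 3454 tables.
final class ToyProfile {
    private final boolean caseMap;          // include case mappings in the Map step
    private final boolean normalize;        // apply NFKC in the Normalize step
    private final Set<Integer> prohibited;  // illustrative prohibited code points

    ToyProfile(boolean caseMap, boolean normalize, Set<Integer> prohibited) {
        this.caseMap = caseMap;
        this.normalize = normalize;
        this.prohibited = prohibited;
    }

    String prepare(String input) {
        // Step 1, Map: drop U+00AD (mapped to nothing in table B.1) and
        // optionally lowercase as a rough substitute for table B.2.
        StringBuilder mapped = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == 0x00AD) continue;
            mapped.appendCodePoint(caseMap ? Character.toLowerCase(cp) : cp);
        }
        // Step 2, Normalize: NFKC when the profile enables it.
        String out = normalize
                ? Normalizer.normalize(mapped, Normalizer.Form.NFKC)
                : mapped.toString();
        // Step 3, Prohibit: fail on the first prohibited code point.
        out.codePoints().forEach(cp -> {
            if (prohibited.contains(cp)) {
                throw new IllegalArgumentException(
                        String.format("prohibited code point U+%04X", cp));
            }
        });
        return out; // Step 4, the BiDi check, is left out of this sketch.
    }

    public static void main(String[] args) {
        // Case mappings on, space and '@' prohibited (illustrative subset).
        ToyProfile nodeLike = new ToyProfile(true, true, Set.of(0x20, 0x40));
        // Case mappings off, space allowed.
        ToyProfile resourceLike = new ToyProfile(false, true, Set.of());
        System.out.println(nodeLike.prepare("Alice"));               // alice
        System.out.println(resourceLike.prepare("Home Of\uFB01ce")); // Home Office
    }
}
```

With these settings, `nodeLike.prepare("a@b")` throws because U+0040 is in its
prohibited set, while `resourceLike` accepts the space in `"Home Office"`. A
real profile would be compiled from the RFC tables with the ICU tools described
earlier.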