---
layout: default
title: StringPrep
nav_order: 7
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# StringPrep

## Overview

Comparing strings in a consistent manner becomes imperative when a large
repertoire of characters such as Unicode is used in network protocols.
StringPrep provides sets of rules for use of Unicode and syntax for prevention
of spoofing. The implementation of the StringPrep and IDNA services and their
usage in ICU are described below.

## StringPrep

StringPrep, the process of preparing Unicode strings for use in network
protocols, is defined in RFC 3454 (<http://www.rfc-editor.org/rfc/rfc3454.txt>).
The RFC defines a broad framework and rules for processing the strings.

Protocols that prescribe use of StringPrep must define a profile of StringPrep,
whose applicability is limited to the protocol. Profiles are a set of rules and
data tables which describe how the strings should be prepared. The profiles
can choose to turn on or turn off normalization and checking for bidirectional
characters. They can also choose to add or remove mappings, unassigned and
prohibited code points from the tables provided.

StringPrep uses Unicode Version 3.2 and defines a set of tables for use by the
profiles. The profiles can choose to include or exclude tables or code points
from the tables defined by the RFC.

StringPrep defines tables that can be broadly classified into the following
categories:

1.  *Unassigned Table*: Contains code points that are unassigned in Unicode
    Version 3.2. Unassigned code points may be allowed or disallowed in the
    output string depending on the application. The table in Appendix A.1 of the
    RFC contains the code points.

2.  *Mapping Tables*: Contain code points that are commonly deleted from the
    output and code points that are case mapped. There are two mapping tables
    in the Appendix, namely B.1 and B.2.

3.  *Prohibited Tables*: Contain code points that are prohibited from the
    output string. Control codes, private use area code points, non-character
    code points, surrogate code points, tagging and deprecated code points are
    included in these tables. There are nine tables in the Appendix that list
    the prohibited code points, namely C.1, C.2, C.3, C.4, C.5, C.6, C.7,
    C.8 and C.9.

The procedure for preparing strings for use can be described in the following
steps:

1.  *Map*: For each code point in the input, check if it has a mapping defined
    in the mapping table. If so, replace it with the mapping in the output.

2.  *Normalize*: Normalize the output of step 1 using Unicode Normalization Form
    NFKC, if the option is set. The normalization algorithm must conform to
    UAX #15.

3.  *Prohibit*: For each code point in the output of step 2, check if the code
    point is present in the prohibited table. If so, fail and return an error.

4.  *Check BiDi*: Check for code points with strong right-to-left directionality
    in the output of step 3. If present, check if the string satisfies the rules
    for bidirectional strings as specified.
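
The four steps above can be sketched in plain Java. This is a minimal
illustration, not ICU's implementation: the class name `StringPrepSketch` and
its two tiny tables are hypothetical stand-ins for the RFC 3454 tables, and the
JDK's `Normalizer` and `Character.getDirectionality` follow the JDK's Unicode
version rather than Unicode 3.2.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.Set;

/** Toy StringPrep pipeline. The two tables are hypothetical, not RFC 3454 data. */
final class StringPrepSketch {
    // Hypothetical mapping table: delete SOFT HYPHEN (U+00AD), map 'A' to 'a'.
    private static final Map<Integer, String> MAPPINGS = Map.of(0x00AD, "", 0x0041, "a");
    // Hypothetical prohibited table: BELL (U+0007) and one private-use code point.
    private static final Set<Integer> PROHIBITED = Set.of(0x0007, 0xE000);

    private static boolean isRandAL(int cp) {
        byte d = Character.getDirectionality(cp);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    static String prepare(String input) {
        // Step 1, Map: replace each code point that has a mapping.
        StringBuilder mapped = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String replacement = MAPPINGS.get(cp);
            if (replacement != null) mapped.append(replacement);
            else mapped.appendCodePoint(cp);
        });

        // Step 2, Normalize: NFKC as implemented by the JDK.
        String normalized = Normalizer.normalize(mapped, Normalizer.Form.NFKC);
        int[] cps = normalized.codePoints().toArray();

        // Step 3, Prohibit: fail on the first prohibited code point.
        for (int cp : cps) {
            if (PROHIBITED.contains(cp)) {
                throw new IllegalArgumentException("prohibited: U+" + Integer.toHexString(cp));
            }
        }

        // Step 4, Check BiDi: a string with right-to-left code points must not
        // contain left-to-right ones and must start and end with a right-to-left one.
        boolean hasRandAL = false;
        boolean hasL = false;
        for (int cp : cps) {
            if (isRandAL(cp)) hasRandAL = true;
            else if (Character.getDirectionality(cp) == Character.DIRECTIONALITY_LEFT_TO_RIGHT) hasL = true;
        }
        if (hasRandAL && (hasL || !isRandAL(cps[0]) || !isRandAL(cps[cps.length - 1]))) {
            throw new IllegalArgumentException("bidirectional check failed");
        }
        return normalized;
    }

    public static void main(String[] args) {
        System.out.println(prepare("A\u00ADb")); // soft hyphen deleted, 'A' mapped: prints "ab"
        System.out.println(prepare("\uFB01"));   // NFKC expands the "fi" ligature: prints "fi"
    }
}
```

A real profile would load its tables from data, as the ICU service described
later on this page does, instead of hard-coding them.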

## NamePrep

NamePrep is a profile of StringPrep for use in IDNA. This profile is defined in
RFC 3491 (<http://www.rfc-editor.org/rfc/rfc3491.txt>).

The profile specifies the following rules:

1.  *Map*: Include all code point mappings specified in StringPrep.

2.  *Normalize*: Normalize the output of step 1 according to NFKC.

3.  *Prohibit*: Prohibit all code points specified as prohibited in StringPrep,
    except for the space (U+0020) code point, from the output of step 2.

4.  *Check BiDi*: Check for bidirectional code points and process according to
    the rules specified in StringPrep.
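
The effect of the NamePrep mappings can be observed without ICU through the
JDK's `java.net.IDN` class, which implements RFC 3490 and therefore applies
NamePrep to non-ASCII labels. The sketch below is only an illustration using
that JDK class, not the ICU API. The class name `NamePrepDemo` is hypothetical.

```java
import java.net.IDN;

/** Shows NamePrep case mapping through the JDK's IDNA (RFC 3490) support. */
final class NamePrepDemo {
    /** Applies ToASCII (and therefore NamePrep) to a single label. */
    static String ace(String label) {
        return IDN.toASCII(label);
    }

    public static void main(String[] args) {
        // Upper and lower case forms map to the same prepared label.
        System.out.println(ace("B\u00DCCHER").equals(ace("b\u00FCcher"))); // true
        // U+00DF (sharp s) is mapped to "ss", so the result is plain ASCII.
        System.out.println(ace("stra\u00DFe")); // strasse
    }
}
```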

## Punycode

Punycode is an encoding scheme for Unicode for use in IDNA. Punycode converts
Unicode text to a unique sequence of ASCII text and back to Unicode. It is an
ASCII Compatible Encoding (ACE). Punycode is described in RFC 3492
(<http://www.rfc-editor.org/rfc/rfc3492.txt>).

The Punycode algorithm is a form of the general Bootstring algorithm, which
allows strings composed of a smaller set of code points to uniquely represent
any string of code points from a larger set. Punycode represents Unicode code
points from U+0000 to U+10FFFF by using the smaller ASCII set U+0000 to U+007F.
The algorithm can also preserve case information of the code points in the
larger set while encoding and decoding. This feature, however, is not used in
IDNA.
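
To make the Bootstring idea concrete, here is a compact encoder written from
the pseudocode and parameter values in RFC 3492. It is a sketch for
illustration only: the class name `PunycodeSketch` is hypothetical, the
overflow checks required by the RFC are omitted, and it does not add the ACE
prefix or preserve case information.

```java
/** Minimal Punycode encoder following RFC 3492; overflow checks omitted. */
final class PunycodeSketch {
    private static final int BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
    private static final int INITIAL_BIAS = 72, INITIAL_N = 128;

    // Bias adaptation function from RFC 3492, section 6.1.
    private static int adapt(int delta, int numPoints, boolean firstTime) {
        delta = firstTime ? delta / DAMP : delta / 2;
        delta += delta / numPoints;
        int k = 0;
        while (delta > ((BASE - TMIN) * TMAX) / 2) {
            delta /= BASE - TMIN;
            k += BASE;
        }
        return k + ((BASE - TMIN + 1) * delta) / (delta + SKEW);
    }

    // Digit values 0..25 map to 'a'..'z', 26..35 map to '0'..'9'.
    private static char digit(int d) {
        return (char) (d < 26 ? 'a' + d : '0' + (d - 26));
    }

    static String encode(String input) {
        int[] cps = input.codePoints().toArray();
        StringBuilder out = new StringBuilder();
        // Copy the basic (ASCII) code points first, followed by a delimiter.
        for (int cp : cps) if (cp < 0x80) out.append((char) cp);
        int basic = out.length();
        int handled = basic;
        if (basic > 0) out.append('-');

        int n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
        while (handled < cps.length) {
            // Find the smallest code point not yet handled.
            int m = Integer.MAX_VALUE;
            for (int cp : cps) if (cp >= n && cp < m) m = cp;
            delta += (m - n) * (handled + 1);
            n = m;
            for (int cp : cps) {
                if (cp < n) delta++;
                if (cp == n) {
                    // Emit delta as a generalized variable-length integer.
                    int q = delta;
                    for (int k = BASE; ; k += BASE) {
                        int t = k <= bias ? TMIN : (k >= bias + TMAX ? TMAX : k - bias);
                        if (q < t) break;
                        out.append(digit(t + (q - t) % (BASE - t)));
                        q = (q - t) / (BASE - t);
                    }
                    out.append(digit(q));
                    bias = adapt(delta, handled + 1, handled == basic);
                    delta = 0;
                    handled++;
                }
            }
            delta++;
            n++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("b\u00FCcher")); // bcher-kva
    }
}
```

The ASCII letters are copied through unchanged, and the position and value of
the single non-ASCII code point are packed into the three trailing characters.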

## Internationalizing Domain Names in Applications (IDNA)

The Domain Name System (DNS) protocol defines the procedure for matching
ASCII strings case insensitively to the names in the lookup tables that map
IP (Internet Protocol) addresses to server names. When Unicode is used instead
of ASCII in server names, two requirements arise that need to be dealt with
differently. When the server name is displayed to the user, Unicode text should
be displayed. When Unicode text is stored in lookup tables, the text should be
the ASCII equivalent, for compatibility with the older DNS protocol and the
resolver libraries. The IDNA protocol, defined by RFC 3490
(<http://www.rfc-editor.org/rfc/rfc3490.txt>), satisfies both requirements.

Server names stored in the DNS lookup tables are usually formed by concatenating
domain labels with a label separator, for example `www.example.com`, where the
labels `www`, `example` and `com` are joined by the full stop (U+002E).

The protocol defines operations to be performed on domain labels before the
names are stored in the lookup tables and before the names fetched from lookup
tables are displayed to the user. The operations are:

1.  ToASCII: This operation is performed on domain labels before sending the
    name to a resolver and before storing the name in the DNS lookup table. The
    domain labels are processed by the StringPrep algorithm using the rules
    specified by the NamePrep profile. The output of this step is then encoded
    using Punycode and an ACE prefix is added to denote that the text is encoded
    using Punycode. IDNA uses “xn--” before the encoded label.

2.  ToUnicode: This operation is performed on domain labels before displaying
    the names to users. If the domain label is prefixed with the ACE prefix
    for IDNA, then the label excluding the prefix is decoded using Punycode. The
    output of the Punycode decoder is verified by applying the ToASCII operation
    and comparing the output with the input to the ToUnicode operation.
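
The two operations can be tried out with the JDK's `java.net.IDN` class, which
implements RFC 3490. The sketch below uses that JDK class rather than the ICU
API shown later on this page. The class name `IdnRoundTrip` is hypothetical.

```java
import java.net.IDN;

/** Round trip through ToASCII and ToUnicode using the JDK's RFC 3490 support. */
final class IdnRoundTrip {
    /** ToASCII: the form sent to a resolver or stored in lookup tables. */
    static String toAscii(String name) {
        return IDN.toASCII(name);
    }

    /** ToUnicode: the form shown to users. */
    static String toDisplay(String name) {
        return IDN.toUnicode(name);
    }

    public static void main(String[] args) {
        String ace = toAscii("b\u00FCcher.example");
        System.out.println(ace);                 // xn--bcher-kva.example
        System.out.println(toDisplay(ace).equals("b\u00FCcher.example")); // true
    }
}
```

Only the label that contains non-ASCII text receives the “xn--” prefix. The
all-ASCII label `example` passes through unchanged.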

Unicode contains code points that are visually similar to the ASCII Full Stop
(U+002E). These code points must be treated as label separators when performing
the ToASCII operation. These code points are:

1.  Ideographic Full Stop (U+3002)

2.  Fullwidth Full Stop (U+FF0E)

3.  Halfwidth Ideographic Full Stop (U+FF61)
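
Splitting a name into labels therefore has to recognize all four separators.
A minimal sketch follows. The class name `LabelSeparators` and its helper
methods are hypothetical, and the second helper relies on the JDK's
`java.net.IDN`, whose documentation lists the same four separators.

```java
import java.net.IDN;

/** Splits a domain name on the four label separators recognized by IDNA. */
final class LabelSeparators {
    // U+002E plus the three look-alike full stops listed above.
    private static final String SEPARATORS = "[.\u3002\uFF0E\uFF61]";

    /** Returns the labels of a name; the limit -1 keeps a trailing empty label. */
    static String[] labels(String name) {
        return name.split(SEPARATORS, -1);
    }

    /** ToASCII over a whole name, using the JDK implementation. */
    static String toAscii(String name) {
        return IDN.toASCII(name);
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", labels("www\u3002example\uFF0Ecom")));
        System.out.println(toAscii("b\u00FCcher\u3002example"));
    }
}
```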

Unassigned code points in Unicode Version 3.2, as given in the StringPrep
tables, are treated differently depending on how the processed string is used.
For query operations, where a registrar is asked about the availability of a
certain domain name, unassigned code points are allowed to be present in the
string. For storing the string in DNS lookup tables, unassigned code points are
prohibited from the input.

IDNA specifies that the ToUnicode and ToASCII operations have an option to check
for Letter-Digit-Hyphen code points and adhere to the STD3 ASCII rules.
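
Both options surface as flags in the JDK's `java.net.IDN` class
(`ALLOW_UNASSIGNED` and `USE_STD3_ASCII_RULES`), which makes their effect easy
to observe. The sketch below is an illustration with that JDK class, not the
ICU API. The class name `IdnOptions` is hypothetical, and U+0221 is used as an
example of a code point that is listed as unassigned in RFC 3454, Table A.1.

```java
import java.net.IDN;

/** Probes the two ToASCII options using the JDK's RFC 3490 implementation. */
final class IdnOptions {
    /** Returns true if ToASCII accepts the name under the given flags. */
    static boolean accepts(String name, int flags) {
        try {
            IDN.toASCII(name, flags);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // the JDK reports violations this way
        }
    }

    public static void main(String[] args) {
        // Underscore is not a letter, digit or hyphen.
        System.out.println(accepts("a_b.example", 0));
        System.out.println(accepts("a_b.example", IDN.USE_STD3_ASCII_RULES));
        // U+0221 is unassigned in Unicode 3.2: "stored" form versus "query" form.
        System.out.println(accepts("\u0221", 0));
        System.out.println(accepts("\u0221", IDN.ALLOW_UNASSIGNED));
    }
}
```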

IDNA specifies that domain labels are equivalent if and only if the outputs of
the ToASCII operation on the labels match using case-insensitive ASCII
comparison.
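
This equivalence rule can be sketched in a few lines. The example again uses
the JDK's `java.net.IDN` rather than ICU, the class name `IdnEquivalence` is
hypothetical, and `equalsIgnoreCase` stands in for ASCII case-insensitive
comparison, which is adequate here because both ToASCII outputs are ASCII.

```java
import java.net.IDN;

/** Compares two domain names the way IDNA defines equivalence. */
final class IdnEquivalence {
    static boolean sameName(String a, String b) {
        // Both sides are converted with ToASCII, then compared ignoring case.
        return IDN.toASCII(a).equalsIgnoreCase(IDN.toASCII(b));
    }

    public static void main(String[] args) {
        System.out.println(sameName("B\u00DCCHER.example", "xn--bcher-kva.EXAMPLE")); // true
        System.out.println(sameName("b\u00FCcher.example", "bucher.example"));        // false
    }
}
```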

## StringPrep Service in ICU

The StringPrep service in ICU is data driven. The service is based on the
Open-Use-Close pattern. A StringPrep profile is opened, the strings are
processed according to the rules specified in the profile, and the profile is
closed once it is no longer needed.

Tools are provided for filtering RFC 3454 and producing a rule file that can be
compiled into a binary format containing all the information required by the
service.

The procedure for producing a StringPrep profile data file is as follows:

1.  Run the filterRFC3454.pl Perl tool to filter the RFC file and produce a rule
    file. The text file produced can be edited by the clients to add/delete
    mappings or add/delete prohibited code points.

2.  Run the gensprep tool to compile the rule file into a binary format. The
    options to turn on normalization of strings and checking of bidirectional
    code points are passed as command line options to the tool. This tool
    produces a binary profile file with the extension “spp”.

3.  Open the StringPrep profile with the path to the binary and the name of the
    binary profile file as the options to the open call. The profile data files
    are memory mapped and cached for optimum performance.

### Code Snippets

> :point_right: **Note**: The code snippets demonstrate the usage of the APIs. Applications should
keep the profile object around for reuse, instead of opening and closing the
profile each time.

#### C++

    UErrorCode status = U_ZERO_ERROR;
    UParseError parseError;
    /* open the StringPrep profile */
    UStringPrepProfile* nameprep = usprep_open("/usr/joe/mydata",
                                               "nfscsi", &status);
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* prepare the string for use according
     * to the rules specified in the profile;
     * the profile is the first argument of usprep_prepare
     */
    int32_t retLen = usprep_prepare(nameprep, src, srcLength, dest,
                                    destCapacity, USPREP_ALLOW_UNASSIGNED,
                                    &parseError, &status);
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* close the profile */
    usprep_close(nameprep);

#### Java

    // not static final: the profile is assigned in the constructor
    private StringPrep nfscsi = null;
    //singleton instance
    private static final NFSCSIStringPrep prep=new NFSCSIStringPrep();
    private NFSCSIStringPrep() {
        try {
            InputStream nfscsiFile = TestUtil.getDataStream("nfscsi.spp");
            nfscsi = new StringPrep(nfscsiFile);
            nfscsiFile.close();
        } catch(IOException e) {
            throw new RuntimeException(e.toString());
        }
    }
    private static byte[] prepare(byte[] src, StringPrep prep)
            throws StringPrepParseException, UnsupportedEncodingException {
        String s = new String(src, "UTF-8");
        UCharacterIterator iter = UCharacterIterator.getInstance(s);
        StringBuffer out = prep.prepare(iter,StringPrep.DEFAULT);
        return out.toString().getBytes("UTF-8");
    }

## IDNA API in ICU

ICU provides APIs for performing the ToASCII, ToUnicode and compare operations
as defined by RFC 3490. Convenience methods for comparing IDNs are also
provided. These APIs follow ICU policies for string manipulation and coding
guidelines.

### Code Snippets

> :point_right: **Note**: The code snippets demonstrate the usage of the APIs. Applications should
keep the profile object around for reuse, instead of opening and closing the
profile each time.

### ToASCII operation

***C***

    UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                            UIDNA_DEFAULT, &parseError, &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        status = U_ZERO_ERROR;
        destCapacity= destLen + 1; /* for the terminating Null */
        free(dest); /* free the memory */
        /* allocate the full new capacity, including the terminator */
        dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
        destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                                UIDNA_DEFAULT, &parseError, &status);
    }
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* do interesting stuff with output*/

***Java***

    try {
        StringBuffer out= IDNA.convertToASCII(inBuf,IDNA.DEFAULT);
    } catch(StringPrepParseException ex) {
        /*handle the exception*/
    }

### ToUnicode operation

***C***

    UChar * dest = (UChar *) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                              UIDNA_DEFAULT,
                              &parseError, &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        status = U_ZERO_ERROR;
        destCapacity= destLen + 1; /* for the terminating Null */
        /* free the memory */
        free(dest);
        /* allocate the full new capacity, including the terminator */
        dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
        destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                                  UIDNA_DEFAULT, &parseError, &status);
    }
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* do interesting stuff with output*/

***Java***

    try {
        StringBuffer out= IDNA.convertToUnicode(inBuf,IDNA.DEFAULT);
    } catch(StringPrepParseException ex) {
        // handle the exception
    }

### compare operation

***C***

    int32_t rc = uidna_compare(source1, length1,
                               source2, length2,
                               UIDNA_DEFAULT,
                               &status);
    if(U_FAILURE(status)) {
        /* handle the error; rc is not meaningful here */
    } else if(rc==0) {
        /* the IDNs are same ... do something interesting */
    } else {
        /* the IDNs are different ... do something */
    }

***Java***

    try {
        int retVal = IDNA.compare(s1,s2,IDNA.DEFAULT);
        // do something interesting with retVal
    } catch(StringPrepParseException e) {
        // handle the exception
    }

## Design Considerations

StringPrep profiles exhibit the following characteristics:

1.  The profiles contain information about code points. StringPrep allows
    profiles to add/delete code points or mappings.

2.  Options such as turning normalization and checking for bidirectional code
    points on or off are properties of the profiles.

3.  The StringPrep algorithm is not overridden by the profile.

4.  Once defined, the profiles do not change.

The StringPrep profiles are used in network protocols, so runtime performance is
important.

Many profiles have been and are being defined, so applications should be able to
plug in arbitrary profiles and get the desired result out of the framework.

ICU is designed for this usage by providing build-time tools for arbitrary
StringPrep profile definitions, and loading them from application-supplied data
in binary form with data structures optimized for runtime use.

## Demo

A web application at <https://icu4c-demos.unicode.org/icu-bin/idnbrowser>
illustrates the use of the IDNA API. The source code for the application is
available at <https://github.com/unicode-org/icu-demos/tree/main/idnbrowser>.

## Appendix

#### NFS Version 4 Profiles

Network File System Version 4, defined by RFC 3530
(<http://www.rfc-editor.org/rfc/rfc3530.txt>), defines use of Unicode text in
the protocol. ICU provides the requisite profiles as part of the test suite, and
code for processing the strings according to the profiles as part of the
samples.

The RFC defines three profiles:

1.  *nfs4_cs_prep Profile*: This profile is used for preparing file and path
    name strings. Normalization of code points and checking for bidirectional
    code points are turned off. Case mappings are included if the NFS
    implementation supports case insensitive file and path names.

2.  *nfs4_cis_prep Profile*: This profile is used for preparing NFS server
    names. Normalization of code points and checking for bidirectional code
    points are turned on. This profile is equivalent to the NamePrep profile.

3.  *nfs4_mixed_prep Profile*: This profile is used for preparing strings in the
    Access Control Entries of NFS servers. These strings consist of two parts,
    prefix and suffix, separated by '@' (U+0040). The prefix is processed with
    case mappings turned off and the suffix is processed with case mappings
    turned on. Normalization of code points and checking for bidirectional code
    points are turned on.

#### XMPP Profiles

Extensible Messaging and Presence Protocol (XMPP) is an XML-based protocol for
near real-time extensible messaging and presence. This protocol defines use of
two StringPrep profiles:

1.  *ResourcePrep Profile*: This profile is used for processing the resource
    identifiers within XMPP. Normalization of code points and checking of
    bidirectional code points are turned on. Case mappings are excluded. The
    space code point (U+0020) is excluded from the prohibited code points set.

2.  *NodePrep Profile*: This profile is used for processing the node identifiers
    within XMPP. Normalization of code points and checking of bidirectional code
    points are turned on. Case mappings are included. All code points specified
    as prohibited in StringPrep are prohibited. Additional code points are added
    to the prohibited set.