---
layout: default
title: StringPrep
nav_order: 7
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# StringPrep

## Overview

Comparing strings consistently becomes imperative when a large character
repertoire such as Unicode is used in network protocols. StringPrep provides
sets of rules for the use of Unicode, and syntax for the prevention of
spoofing. The implementation of the StringPrep and IDNA services in ICU, and
their usage, is described below.

## StringPrep

StringPrep, the process of preparing Unicode strings for use in network
protocols, is defined in RFC 3454 (<http://www.rfc-editor.org/rfc/rfc3454.txt>).
The RFC defines a broad framework and rules for processing the strings.

Protocols that prescribe the use of StringPrep must define a profile of
StringPrep, whose applicability is limited to that protocol. A profile is a set
of rules and data tables that describe how the strings should be prepared. A
profile can turn normalization and the checking of bidirectional characters on
or off. It can also add or remove mappings, unassigned code points and
prohibited code points from the tables provided.

StringPrep uses Unicode Version 3.2 and defines a set of tables for use by the
profiles. A profile can choose to include or exclude whole tables, or individual
code points from the tables defined by the RFC.

StringPrep defines tables that can be broadly classified into:

1.  *Unassigned Table*: Contains code points that are unassigned in Unicode
    Version 3.2. Unassigned code points may be allowed or disallowed in the
    output string depending on the application. The table in Appendix A.1 of the
    RFC contains the code points.

2.  *Mapping Tables*: Contain code points that are commonly deleted from the
    output and code points that are case mapped. There are three mapping tables
    in Appendix B of the RFC: B.1 (code points mapped to nothing), B.2 (case
    folding for use with NFKC) and B.3 (case folding for use without
    normalization).

3.  *Prohibited Tables*: Contain code points that are prohibited from the
    output string. Control codes, private use area code points, non-character
    code points, surrogate code points, tagging and deprecated code points are
    included in these tables. There are nine prohibited tables in Appendix C of
    the RFC: C.1, C.2, C.3, C.4, C.5, C.6, C.7, C.8 and C.9.

The procedure for preparing strings for use can be described in the following
steps:

1.  *Map*: For each code point in the input, check whether it has a mapping
    defined in the mapping table. If so, replace it with the mapping in the
    output.

2.  *Normalize*: Normalize the output of step 1 using Unicode Normalization Form
    NFKC, if the option is set. The normalization algorithm must conform to
    UAX 15.

3.  *Prohibit*: For each code point in the output of step 2, check whether the
    code point is present in the prohibited table. If so, fail and return an
    error.

4.  *Check BiDi*: Check for code points with strong right-to-left directionality
    in the output of step 3. If any are present, check that the string satisfies
    the rules for bidirectional strings specified in the RFC.
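
The four steps above can be sketched in a few lines of Java. This is a minimal
illustration and not ICU's implementation. The class name `StringPrepSketch`
and its tiny mapping and prohibited tables are hypothetical, and the JDK's
`java.text.Normalizer` and `Character.getDirectionality` follow the JDK's
Unicode version rather than Unicode 3.2. The bidirectional check assumes the
rules in section 6 of RFC 3454: a string that contains a right-to-left
character must not contain a left-to-right character, and must begin and end
with a right-to-left character.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.Set;

public class StringPrepSketch {
    // Hypothetical toy tables. A real profile takes these from RFC 3454.
    // Mapping: delete SOFT HYPHEN (U+00AD), case-map 'A' to 'a'.
    static final Map<Integer, String> MAPPINGS = Map.of(0x00AD, "", (int) 'A', "a");
    // Prohibited: a private use code point and REPLACEMENT CHARACTER.
    static final Set<Integer> PROHIBITED = Set.of(0xE000, 0xFFFD);

    static String prepare(String input, boolean normalize, boolean checkBidi) {
        // Step 1: Map each code point that has an entry in the mapping table.
        StringBuilder mapped = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String replacement = MAPPINGS.get(cp);
            if (replacement != null) {
                mapped.append(replacement);
            } else {
                mapped.appendCodePoint(cp);
            }
        });
        // Step 2: Normalize with NFKC if the profile asks for it.
        String out = normalize
                ? Normalizer.normalize(mapped, Normalizer.Form.NFKC)
                : mapped.toString();
        // Step 3: Fail on the first prohibited code point.
        out.codePoints().forEach(cp -> {
            if (PROHIBITED.contains(cp)) {
                throw new IllegalArgumentException(
                        String.format("prohibited code point U+%04X", cp));
            }
        });
        // Step 4: Apply the bidirectional rules if the profile asks for it.
        if (checkBidi) {
            checkBidi(out);
        }
        return out;
    }

    static boolean isRandAL(int cp) {
        byte d = Character.getDirectionality(cp);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    static void checkBidi(String s) {
        if (s.isEmpty() || s.codePoints().noneMatch(StringPrepSketch::isRandAL)) {
            return; // no right-to-left characters, nothing to check
        }
        boolean hasLeftToRight = s.codePoints().anyMatch(
                cp -> Character.getDirectionality(cp) == Character.DIRECTIONALITY_LEFT_TO_RIGHT);
        boolean edgesAreRandAL = isRandAL(s.codePointAt(0))
                && isRandAL(s.codePointBefore(s.length()));
        if (hasLeftToRight || !edgesAreRandAL) {
            throw new IllegalArgumentException("bidirectional rules violated");
        }
    }

    public static void main(String[] args) {
        System.out.println(prepare("A\u00ADb", true, true)); // prints "ab"
    }
}
```

Keeping the tables and the two boolean options outside `prepare` mirrors the
idea that a profile supplies data and switches, while the algorithm itself stays
fixed.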

## NamePrep

NamePrep is a profile of StringPrep for use in IDNA. This profile is defined in
RFC 3491 (<http://www.rfc-editor.org/rfc/rfc3491.txt>).

The profile specifies the following rules:

1.  *Map*: Include all code point mappings specified in StringPrep.

2.  *Normalize*: Normalize the output of step 1 according to NFKC.

3.  *Prohibit*: Prohibit all code points specified as prohibited in StringPrep
    except for the space (U+0020) code point from the output of step 2.

4.  *Check BiDi*: Check for bidirectional code points and process according to
    the rules specified in StringPrep.

## Punycode

Punycode is an encoding scheme for Unicode for use in IDNA. Punycode converts
Unicode text to a unique sequence of ASCII text and back to Unicode. It is an
ASCII Compatible Encoding (ACE). Punycode is described in RFC 3492
(<http://www.rfc-editor.org/rfc/rfc3492.txt>).

The Punycode algorithm is an instance of the general Bootstring algorithm, which
allows strings composed of a smaller set of code points to uniquely represent
any string of code points from a larger set. Punycode represents Unicode code
points from U+0000 to U+10FFFF by using the smaller ASCII set U+0000 to U+007F.
The algorithm can also preserve case information of the code points in the
larger set while encoding and decoding. This feature, however, is not used in
IDNA.
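
To make the Bootstring idea concrete, the following is a minimal sketch of the
encoding direction only, written from the parameters and pseudocode in RFC 3492.
It is not ICU's implementation: the class name `PunycodeSketch` is hypothetical,
and the sketch omits decoding and the optional case-preservation feature. It
produces the bare Punycode form, without the ACE prefix that IDNA adds.

```java
public class PunycodeSketch {
    // Bootstring parameters that RFC 3492 fixes for Punycode.
    static final int BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
    static final int INITIAL_BIAS = 72, INITIAL_N = 128;

    // Bias adaptation, following the adapt() pseudocode in the RFC.
    static int adapt(int delta, int numPoints, boolean firstTime) {
        delta = firstTime ? delta / DAMP : delta / 2;
        delta += delta / numPoints;
        int k = 0;
        while (delta > ((BASE - TMIN) * TMAX) / 2) {
            delta /= BASE - TMIN;
            k += BASE;
        }
        return k + ((BASE - TMIN + 1) * delta) / (delta + SKEW);
    }

    // Digit values 0..25 map to 'a'..'z', 26..35 map to '0'..'9'.
    static char digit(int d) {
        return (char) (d < 26 ? 'a' + d : '0' + (d - 26));
    }

    static String encode(String input) {
        int[] cps = input.codePoints().toArray();
        StringBuilder out = new StringBuilder();
        // Copy the basic (ASCII) code points first, in their original order.
        for (int cp : cps) {
            if (cp < 0x80) {
                out.append((char) cp);
            }
        }
        int b = out.length();
        int h = b;
        if (b > 0) {
            out.append('-'); // delimiter between basic part and encoded deltas
        }
        int n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
        while (h < cps.length) {
            // Find the smallest code point not yet handled.
            int m = Integer.MAX_VALUE;
            for (int cp : cps) {
                if (cp >= n && cp < m) {
                    m = cp;
                }
            }
            // addExact/multiplyExact throw ArithmeticException on overflow.
            delta = Math.addExact(delta, Math.multiplyExact(m - n, h + 1));
            n = m;
            for (int cp : cps) {
                if (cp < n) {
                    delta = Math.addExact(delta, 1);
                }
                if (cp == n) {
                    // Emit delta as a generalized variable-length integer.
                    int q = delta;
                    for (int k = BASE; ; k += BASE) {
                        int t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
                        if (q < t) {
                            break;
                        }
                        out.append(digit(t + (q - t) % (BASE - t)));
                        q = (q - t) / (BASE - t);
                    }
                    out.append(digit(q));
                    bias = adapt(delta, h + 1, h == b);
                    delta = 0;
                    h++;
                }
            }
            delta++;
            n++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("b\u00FCcher")); // prints "bcher-kva"
    }
}
```

The ASCII letters of the input survive unchanged at the front of the result,
which is why encoded labels often remain partly readable.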

## Internationalizing Domain Names in Applications (IDNA)

The Domain Name System (DNS) protocol defines the procedure for matching ASCII
strings case insensitively against the names in the lookup tables that map
server names to IP (Internet Protocol) addresses. When Unicode is used instead
of ASCII in server names, two requirements arise that must be handled
differently. When the server name is displayed to the user, Unicode text should
be displayed. When the name is stored in lookup tables, the text should be the
ASCII equivalent, for compatibility with the older DNS protocol and the resolver
libraries. The IDNA protocol, defined by RFC 3490
(<http://www.rfc-editor.org/rfc/rfc3490.txt>), satisfies both requirements.

Server names stored in the DNS lookup tables are usually formed by concatenating
domain labels with a label separator. For example, `www.example.com` consists of
the labels `www`, `example` and `com`, joined by the separator `.` (U+002E).

The protocol defines operations to be performed on domain labels before the
names are stored in the lookup tables and before the names fetched from lookup
tables are displayed to the user. The operations are:

1.  ToASCII: This operation is performed on domain labels before sending the
    name to a resolver and before storing the name in the DNS lookup table. The
    domain labels are processed by the StringPrep algorithm using the rules
    specified by the NamePrep profile. The output of this step is then encoded
    using Punycode, and an ACE prefix is added to denote that the text is
    encoded using Punycode. IDNA uses “xn--” before the encoded label.

2.  ToUnicode: This operation is performed on domain labels before displaying
    the names to users. If the domain label is prefixed with the ACE prefix
    for IDNA, then the label excluding the prefix is decoded using Punycode. The
    output of the Punycode decoder is verified by applying the ToASCII operation
    to it and comparing the result with the input to the ToUnicode operation.
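
The effect of the two operations can be observed with the JDK's own
`java.net.IDN` class, which implements RFC 3490. It is used here only because it
needs no extra library; it is not the ICU API, whose snippets appear later in
this chapter. The class and method names `IdnRoundTrip`, `forLookup` and
`forDisplay` are hypothetical.

```java
import java.net.IDN;

public class IdnRoundTrip {
    /** Applies ToASCII to each label, as done before a lookup or before storage. */
    static String forLookup(String name) {
        return IDN.toASCII(name);
    }

    /** Applies ToUnicode to each label, as done before display to a user. */
    static String forDisplay(String name) {
        return IDN.toUnicode(name);
    }

    public static void main(String[] args) {
        String ascii = forLookup("b\u00FCcher.example");
        System.out.println(ascii); // prints "xn--bcher-kva.example"
        // The ASCII-only label "example" passes through unchanged.
        System.out.println(forDisplay(ascii).equals("b\u00FCcher.example")); // prints "true"
    }
}
```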

Unicode contains code points that are visually similar to the ASCII Full Stop
(U+002E). These code points must be treated as label separators when performing
the ToASCII operation. These code points are:

1.  Ideographic Full Stop (U+3002)

2.  Fullwidth Full Stop (U+FF0E)

3.  Halfwidth Ideographic Full Stop (U+FF61)
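
A minimal sketch of splitting a name into labels on all four separators is shown
below. The class and method names `LabelSplitter` and `splitLabels` are
hypothetical, and a plain regular expression stands in for whatever a real IDNA
implementation does internally.

```java
import java.util.Arrays;
import java.util.List;

public class LabelSplitter {
    /** Splits on U+002E, U+3002, U+FF0E and U+FF61, keeping empty labels. */
    static List<String> splitLabels(String name) {
        return Arrays.asList(name.split("[.\u3002\uFF0E\uFF61]", -1));
    }

    public static void main(String[] args) {
        // Ideographic and fullwidth full stops act as separators.
        System.out.println(splitLabels("www\u3002example\uFF0Ecom")); // prints "[www, example, com]"
    }
}
```

Joining the resulting labels with U+002E after processing each one gives a name
that older resolvers can handle.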

Unassigned code points in Unicode Version 3.2, as given in the StringPrep
tables, are treated differently depending on how the processed string is used.
For query operations, where a registrar is asked whether a certain domain name
is available, unassigned code points are allowed to be present in the string.
For storing the string in DNS lookup tables, unassigned code points are
prohibited from the input.

IDNA specifies that the ToUnicode and ToASCII operations have options to check
for Letter-Digit-Hyphen code points and adhere to the STD3 ASCII Rules.

IDNA specifies that domain labels are equivalent if and only if the output of
ToASCII operation on the labels match using case insensitive ASCII comparison.
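
The STD3 option and the equivalence rule can be tried out with the JDK's
`java.net.IDN` class, again used only as a dependency-free stand-in for the ICU
API described later. The names `IdnCompare`, `sameName` and `passesStd3` are
hypothetical. The JDK class also offers an `IDN.ALLOW_UNASSIGNED` flag for the
query case described above.

```java
import java.net.IDN;

public class IdnCompare {
    /** Two names are equivalent if their ToASCII forms match, ignoring ASCII case. */
    static boolean sameName(String a, String b) {
        return IDN.toASCII(a).equalsIgnoreCase(IDN.toASCII(b));
    }

    /** Returns false if a label breaks the STD3 Letter-Digit-Hyphen rules. */
    static boolean passesStd3(String name) {
        try {
            IDN.toASCII(name, IDN.USE_STD3_ASCII_RULES);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameName("B\u00DCCHER.EXAMPLE", "b\u00FCcher.example")); // prints "true"
        System.out.println(passesStd3("a_b.example")); // prints "false": '_' is not a letter, digit or hyphen
    }
}
```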

## StringPrep Service in ICU

The StringPrep service in ICU is data driven and is based on the Open-Use-Close
pattern: a StringPrep profile is opened, strings are processed according to the
rules specified in the profile, and the profile is closed once it is no longer
needed.

Tools for filtering RFC 3454 and producing a rule file that can be compiled into
a binary format containing all the information required by the service are
provided.

The procedure for producing a StringPrep profile data file is as follows:

1.  Run the filterRFC3454.pl Perl tool to filter the RFC file and produce a rule
    file. Clients can edit the resulting text file to add or delete mappings, or
    to add or delete prohibited code points.

2.  Run the gensprep tool to compile the rule file into a binary format. The
    options to turn on normalization of strings and checking of bidirectional
    code points are passed as command line options to the tool. This tool
    produces a binary profile file with the extension “spp”.

3.  Open the StringPrep profile with path to the binary and name of the binary
    profile file as the options to the open call. The profile data files are
    memory mapped and cached for optimum performance.

### Code Snippets

> :point_right: **Note**: The code snippets demonstrate the usage of the APIs. Applications should
keep the profile object around for reuse, instead of opening and closing the
profile each time.

#### C++

    UErrorCode status = U_ZERO_ERROR;
    UParseError parseError;
    /* open the StringPrep profile */
    UStringPrepProfile* nameprep = usprep_open("/usr/joe/mydata",
                                               "nfscsi", &status);
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* prepare the string for use according
     * to the rules specified in the profile
     */
    int32_t retLen = usprep_prepare(src, srcLength, dest,
                                    destCapacity, USPREP_ALLOW_UNASSIGNED,
                                    nameprep, &parseError, &status);
    /* close the profile */
    usprep_close(nameprep);

#### Java

    // Instance field: it is assigned in the constructor, so it cannot be static final.
    private StringPrep nfscsi = null;
    //singleton instance
    private static final NFSCSIStringPrep prep=new NFSCSIStringPrep();
    private NFSCSIStringPrep() {
        try {
            InputStream nfscsiFile = TestUtil.getDataStream("nfscsi.spp");
            nfscsi = new StringPrep(nfscsiFile);
            nfscsiFile.close();
        } catch(IOException e) {
            throw new RuntimeException(e.toString());
        }
    }
    private static byte[] prepare(byte[] src, StringPrep prep)
            throws StringPrepParseException, UnsupportedEncodingException {
        String s = new String(src, "UTF-8");
        UCharacterIterator iter = UCharacterIterator.getInstance(s);
        StringBuffer out = prep.prepare(iter,StringPrep.DEFAULT);
        return out.toString().getBytes("UTF-8");
    }

## IDNA API in ICU

ICU provides APIs for performing the ToASCII, ToUnicode and compare operations
as defined by RFC 3490. Convenience methods for comparing IDNs are also
provided. These APIs follow ICU policies for string manipulation and coding
guidelines.

### Code Snippets

> :point_right: **Note**: The code snippets demonstrate the usage of the APIs. Applications should
keep the profile object around for reuse, instead of opening and closing the
profile each time.

### ToASCII operation

***C***

    UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                            UIDNA_DEFAULT, &parseError, &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        status = U_ZERO_ERROR;
        destCapacity= destLen + 1; /* for the terminating Null */
        free(dest); /* free the memory */
        /* allocate the full new capacity, including the terminating Null */
        dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
        destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                                UIDNA_DEFAULT, &parseError, &status);
    }
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* do interesting stuff with output*/

***Java***

    try {
        StringBuffer out= IDNA.convertToASCII(inBuf,IDNA.DEFAULT);
    } catch(StringPrepParseException ex) {
        /*handle the exception*/
    }

### ToUnicode operation

***C***

    UChar * dest = (UChar *) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                              UIDNA_DEFAULT,
                              &parseError, &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        status = U_ZERO_ERROR;
        destCapacity= destLen + 1; /* for the terminating Null */
        /* free the memory */
        free(dest);
        /* allocate the full new capacity, including the terminating Null */
        dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
        destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                                  UIDNA_DEFAULT, &parseError, &status);
    }
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* do interesting stuff with output*/

***Java***

    try {
        StringBuffer out= IDNA.convertToUnicode(inBuf,IDNA.DEFAULT);
    } catch(StringPrepParseException ex) {
        // handle the exception
    }

### compare operation

***C***

    int32_t rc = uidna_compare(source1, length1,
                               source2, length2,
                               UIDNA_DEFAULT,
                               &status);
    if(rc==0) {
        /* the IDNs are same ... do something interesting */
    } else {
        /* the IDNs are different ... do something */
    }

***Java***

    try {
        int retVal = IDNA.compare(s1,s2,IDNA.DEFAULT);
        // do something interesting with retVal
    } catch(StringPrepParseException e) {
       // handle the exception
    }

## Design Considerations

StringPrep profiles exhibit the following characteristics:

1.  The profiles contain information about code points. StringPrep allows
    profiles to add/delete code points or mappings.

2.  Options such as turning normalization and checking for bidirectional code
    points on or off are the properties of the profiles.

3.  The StringPrep algorithm is not overridden by the profile.

4.  Once defined, the profiles do not change.

The StringPrep profiles are used in network protocols so runtime performance is
important.

Many profiles have been and are being defined, so applications should be able to
plug in arbitrary profiles and get the desired result out of the framework.

ICU is designed for this usage by providing build-time tools for arbitrary
StringPrep profile definitions, and loading them from application-supplied data
in binary form with data structures optimized for runtime use.

## Demo

A web application at <https://icu4c-demos.unicode.org/icu-bin/idnbrowser>
illustrates the use of IDNA API. The source code for the application is
available at <https://github.com/unicode-org/icu-demos/tree/main/idnbrowser>.

## Appendix

#### NFS Version 4 Profiles

Network File System Version 4, defined by RFC 3530
(<http://www.rfc-editor.org/rfc/rfc3530.txt>), defines the use of Unicode text
in the protocol. ICU provides the requisite profiles as part of the test suite,
and code for processing the strings according to the profiles as part of the
samples.

The RFC defines three profiles:

1.  *nfs4_cs_prep Profile*: This profile is used for preparing file and path
    name strings. Normalization of code points and checking for bidirectional
    code points are turned off. Case mappings are included if the NFS
    implementation supports case insensitive file and path names.

2.  *nfs4_cis_prep Profile*: This profile is used for preparing NFS server
    names. Normalization of code points and checking for bidirectional code
    points are turned on. This profile is equivalent to NamePrep profile.

3.  *nfs4_mixed_prep Profile*: This profile is used for preparing strings in the
    Access Control Entries of NFS servers. These strings consist of two parts,
    prefix and suffix, separated by '@' (U+0040). The prefix is processed with
    case mappings turned off and the suffix is processed with case mappings
    turned on. Normalization of code points and checking for bidirectional code
    points are turned on.
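
The split handling of the mixed profile can be sketched as follows. This is an
illustration only and not the ICU sample code: the names `MixedPrepSketch` and
`mixedPrepare` are hypothetical, `toLowerCase(Locale.ROOT)` is a rough stand-in
for the StringPrep case-mapping table, the prohibited and bidirectional checks
are omitted, and treating a string without '@' as a prefix is an assumption.

```java
import java.text.Normalizer;
import java.util.Locale;

public class MixedPrepSketch {
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    static String mixedPrepare(String s) {
        int at = s.indexOf('@');
        if (at < 0) {
            // Assumption: with no separator, apply the prefix rules to the whole string.
            return nfkc(s);
        }
        String prefix = s.substring(0, at);   // case mappings turned off
        String suffix = s.substring(at + 1);  // case mappings turned on
        // Lower-casing stands in for the case-mapping table here.
        return nfkc(prefix) + "@" + nfkc(suffix.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(mixedPrepare("Alice@EXAMPLE.COM")); // prints "Alice@example.com"
    }
}
```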

#### XMPP Profiles

Extensible Messaging and Presence Protocol (XMPP) is an XML based protocol for
near real-time extensible messaging and presence. This protocol defines use of
two StringPrep profiles:

1.  *ResourcePrep Profile*: This profile is used for processing the resource
    identifiers within XMPP. Normalization of code points and checking of
    bidirectional code points are turned on. Case mappings are excluded. The
    space code point (U+0020) is excluded from the prohibited code points set.

2.  *NodePrep Profile*: This profile is used for processing the node identifiers
    within XMPP. Normalization of code points and checking of bidirectional code
    points are turned on. Case mappings are included. All code points specified
    as prohibited in StringPrep are prohibited. Additional code points are added
    to the prohibited set.