PCRE2TEST(1)              General Commands Manual             PCRE2TEST(1)

NAME
       pcre2test - a program for testing Perl-compatible regular expressions.

SYNOPSIS
       pcre2test [options] [input file [output file]]

       pcre2test is a test program for the PCRE2 regular expression
       libraries, but it can also be used for experimenting with regular
       expressions. This document describes the features of the test
       program; for details of the regular expressions themselves, see the
       pcre2pattern documentation. For details of the PCRE2 library function
       calls and their options, see the pcre2api documentation.

       The input for pcre2test is a sequence of regular expression patterns
       and subject strings to be matched. There are also command lines for
       setting defaults and controlling some special actions. The output
       shows the result of each match attempt. Modifiers on external or
       internal command lines, the patterns, and the subject lines specify
       PCRE2 function options, control how the subject is processed, and
       what output is produced.

       There are many obscure modifiers, some of which are specifically
       designed for use in conjunction with the test script and data files
       that are distributed as part of PCRE2. All the modifiers are
       documented here, some without much justification, but many of them
       are unlikely to be of use except when testing the libraries.

PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
       Different versions of the PCRE2 library can be built to support
       character strings that are encoded in 8-bit, 16-bit, or 32-bit code
       units. One, two, or all three of these libraries may be
       simultaneously installed. The pcre2test program can be used to test
       all the libraries. However, its own input and output are always in
       8-bit format. When testing the 16-bit or 32-bit libraries, patterns
       and subject strings are converted to 16-bit or 32-bit format before
       being passed to the library functions.
       Results are converted back to 8-bit code units for output.

       In the rest of this document, the names of library functions and
       structures are given in generic form, for example, pcre2_compile().
       The actual names used in the libraries have a suffix _8, _16, or _32,
       as appropriate.

INPUT ENCODING
       Input to pcre2test is processed line by line, either by calling the C
       library's fgets() function, or via the libreadline or libedit
       library. In some Windows environments character 26 (hex 1A) causes an
       immediate end of file, and no further data is read, so this character
       should be avoided unless you really want that action.

       The input is processed using C's string functions, so it must not
       contain binary zeros, even though in Unix-like environments fgets()
       treats any bytes other than newline as data characters. An error is
       generated if a binary zero is encountered. By default subject lines
       are processed for backslash escapes, which makes it possible to
       include any data value in strings that are passed to the library for
       matching. For patterns, there is a facility for specifying some or
       all of the 8-bit input characters as hexadecimal pairs, which makes
       it possible to include binary zeros.

   Input for the 16-bit and 32-bit libraries

       When testing the 16-bit or 32-bit libraries, there is a need to be
       able to generate character code points greater than 255 in the
       strings that are passed to the library. For subject lines, backslash
       escapes can be used. In addition, when the utf modifier (see "Setting
       compilation options" below) is set, the pattern and any following
       subject lines are interpreted as UTF-8 strings and translated to
       UTF-16 or UTF-32 as appropriate.

       For non-UTF testing of wide characters, the utf8_input modifier can
       be used. This is mutually exclusive with utf, and is allowed only in
       16-bit or 32-bit mode.
       It causes the pattern and following subject lines to be treated as
       UTF-8 according to the original definition (RFC 2279), which allows
       for character values up to 0x7fffffff. Each character is placed in
       one 16-bit or 32-bit code unit (in the 16-bit case, values greater
       than 0xffff cause an error to occur).

       UTF-8 (in its original definition) is not capable of encoding values
       greater than 0x7fffffff, but such values can be handled by the 32-bit
       library. When testing this library in non-UTF mode with utf8_input
       set, if any character is preceded by the byte 0xff (which is an
       invalid byte in UTF-8), 0x80000000 is added to the character's value.
       This is the only way of passing such code points in a pattern string.
       For subject strings, using an escape sequence is preferable.

COMMAND LINE OPTIONS
       -8        If the 8-bit library has been built, this option causes it
                 to be used (this is the default). If the 8-bit library has
                 not been built, this option causes an error.

       -16       If the 16-bit library has been built, this option causes it
                 to be used. If only the 16-bit library has been built, this
                 is the default. If the 16-bit library has not been built,
                 this option causes an error.

       -32       If the 32-bit library has been built, this option causes it
                 to be used. If only the 32-bit library has been built, this
                 is the default. If the 32-bit library has not been built,
                 this option causes an error.

       -ac       Behave as if each pattern has the auto_callout modifier,
                 that is, insert automatic callouts into every pattern that
                 is compiled.

       -AC       As for -ac, but in addition behave as if each subject line
                 has the callout_extra modifier, that is, show additional
                 information from callouts.

       -b        Behave as if each pattern has the fullbincode modifier; the
                 full internal binary form of the pattern is output after
                 compilation.

       -C        Output the version number of the PCRE2 library, and all
                 available information about the optional features that are
                 included, and then exit with zero exit code. All other
                 options are ignored. If both -C and -LM are present,
                 whichever is first is recognized.

       -C option Output information about a specific build-time option, then
                 exit. This functionality is intended for use in scripts
                 such as RunTest. The following options output the value and
                 set the exit code as indicated:

                   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                                0x15 or 0x25
                                0 if used in an ASCII environment
                                exit code is always 0
                   linksize   the configured internal link size (2, 3, or 4)
                                exit code is set to the link size
                   newline    the default newline setting:
                                CR, LF, CRLF, ANYCRLF, ANY, or NUL
                                exit code is always 0
                   bsr        the default setting for what \R matches:
                                ANYCRLF or ANY
                                exit code is always 0

                 The following options output 1 for true or 0 for false, and
                 set the exit code to the same value:

                   backslash-C  \C is supported (not locked out)
                   ebcdic       compiled for an EBCDIC environment
                   jit          just-in-time support is available
                   pcre2-16     the 16-bit library was built
                   pcre2-32     the 32-bit library was built
                   pcre2-8      the 8-bit library was built
                   unicode      Unicode support is available

                 If an unknown option is given, an error message is output;
                 the exit code is 0.

       -d        Behave as if each pattern has the debug modifier; the
                 internal form and information about the compiled pattern is
                 output after compilation; -d is equivalent to -b -i.

       -dfa      Behave as if each subject line has the dfa modifier;
                 matching is done using the pcre2_dfa_match() function
                 instead of the default pcre2_match().

       -error number[,number,...]
                 Call pcre2_get_error_message() for each of the error numbers
                 in the comma-separated list, display the resulting messages
                 on the standard output, then exit with zero exit code. The
                 numbers may be positive or negative. This is a convenience
                 facility for PCRE2 maintainers.

       -help     Output a brief summary of these options and then exit.

       -i        Behave as if each pattern has the info modifier; information
                 about the compiled pattern is given after compilation.

       -jit      Behave as if each pattern line has the jit modifier; after
                 successful compilation, each pattern is passed to the
                 just-in-time compiler, if available.

       -jitfast  Behave as if each pattern line has the jitfast modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and each subject line
                 is passed directly to the JIT matcher via its "fast path".

       -jitverify
                 Behave as if each pattern line has the jitverify modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and the use of JIT for
                 matching is verified.

       -LM       List modifiers: write a list of available pattern and
                 subject modifiers to the standard output, then exit with
                 zero exit code. All other options are ignored. If both -C
                 and any -Lx options are present, whichever is first is
                 recognized.

       -LP       List properties: write a list of recognized Unicode
                 properties to the standard output, then exit with zero exit
                 code. All other options are ignored. If both -C and any -Lx
                 options are present, whichever is first is recognized.

       -LS       List scripts: write a list of recognized Unicode script
                 names to the standard output, then exit with zero exit code.
                 All other options are ignored. If both -C and any -Lx
                 options are present, whichever is first is recognized.

       -pattern modifier-list
                 Behave as if each pattern line contains the given modifiers.

       -q        Do not output the version number of pcre2test at the start
                 of execution.

       -S size   On Unix-like systems, set the size of the run-time stack to
                 size mebibytes (units of 1024*1024 bytes).

       -subject modifier-list
                 Behave as if each subject line contains the given modifiers.

       -t        Run each compile and match many times with a timer, and
                 output the resulting times per compile or match. When JIT
                 is used, separate times are given for the initial compile
                 and the JIT compile. You can control the number of
                 iterations that are used for timing by following -t with a
                 number (as a separate item on the command line). For
                 example, "-t 1000" iterates 1000 times. The default is to
                 iterate 500,000 times.

       -tm       This is like -t except that it times only the matching
                 phase, not the compile phase.

       -T -TM    These behave like -t and -tm, but in addition, at the end of
                 a run, the total times for all compiles and matches are
                 output.

       -version  Output the PCRE2 version number and then exit.

DESCRIPTION
       If pcre2test is given two filename arguments, it reads from the first
       and writes to the second. If the first name is "-", input is taken
       from the standard input. If pcre2test is given only one argument, it
       reads from that file and writes to stdout. Otherwise, it reads from
       stdin and writes to stdout.

       When pcre2test is built, a configuration option can specify that it
       should be linked with the libreadline or libedit library. When this
       is done, if the input is from a terminal, it is read using the
       readline() function. This provides line-editing and history
       facilities. The output from the -help option states whether or not
       readline() will be used.

       The program handles any number of tests, each of which consists of a
       set of input lines.
       Each set starts with a regular expression pattern, followed by any
       number of subject lines to be matched against that pattern. In
       between sets of test data, command lines that begin with # may
       appear. This file format, with some restrictions, can also be
       processed by the perltest.sh script that is distributed with PCRE2 as
       a means of checking that the behaviour of PCRE2 and Perl is the same.
       For a specification of perltest.sh, see the comments near its
       beginning. See also the #perltest command below.

       When the input is a terminal, pcre2test prompts for each line of
       input, using "re>" to prompt for regular expression patterns, and
       "data>" to prompt for subject lines. Command lines starting with #
       can be entered only in response to the "re>" prompt.

       Each subject line is matched separately and independently. If you
       want to do multi-line matches, you have to use the \n escape sequence
       (or \r or \r\n, etc., depending on the newline setting) in a single
       line of input to encode the newline sequences. There is no limit on
       the length of subject lines; the input buffer is automatically
       extended if it is too small. There are replication features that make
       it possible to generate long repetitive pattern or subject lines
       without having to supply them explicitly.

       An empty line or the end of the file signals the end of the subject
       lines for a test, at which point a new pattern or command line is
       expected if there is still input to be read.

COMMAND LINES
       In between sets of test data, a line that begins with # is
       interpreted as a command line. If the first character is followed by
       white space or an exclamation mark, the line is treated as a comment,
       and ignored.
       Otherwise, the following commands are recognized:

         #forbid_utf

       Subsequent patterns automatically have the PCRE2_NEVER_UTF and
       PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
       and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start
       of patterns. This command also forces an error if a subsequent
       pattern contains any occurrences of \P, \p, or \X, which are still
       supported when PCRE2_UTF is not set, but which require Unicode
       property support to be included in the library.

       This is a trigger guard that is used in test files to ensure that UTF
       or Unicode property tests are not accidentally added to files that
       are used when Unicode support is not included in the library. Setting
       PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
       by the use of #pattern; the difference is that #forbid_utf cannot be
       unset, and the automatic options are not displayed in pattern
       information, to avoid cluttering up test output.

         #load <filename>

       This command is used to load a set of precompiled patterns from a
       file, as described in the section entitled "Saving and restoring
       compiled patterns" below.

         #loadtables <filename>

       This command is used to load a set of binary character tables that
       can be accessed by the tables=3 qualifier. Such tables can be created
       by the pcre2_dftables program with the -b option.

         #newline_default [<newline-list>]

       When PCRE2 is built, a default newline convention can be specified.
       This determines which characters and/or character pairs are
       recognized as indicating a newline in a pattern or subject string.
       The default can be overridden when a pattern is compiled. The
       standard test files contain tests of various newline conventions, but
       the majority of the tests expect a single linefeed to be recognized
       as a newline by default.
       Without special action the tests would fail when PCRE2 is compiled
       with either CR or CRLF as the default newline.

       The #newline_default command specifies a list of newline types that
       are acceptable as the default. The types must be one of CR, LF, CRLF,
       ANYCRLF, ANY, or NUL (in upper or lower case), for example:

         #newline_default LF Any anyCRLF

       If the default newline is in the list, this command has no effect.
       Otherwise, except when testing the POSIX API, a newline modifier that
       specifies the first newline convention in the list (LF in the above
       example) is added to any pattern that does not already have a newline
       modifier. If the newline list is empty, the feature is turned off.
       This command is present in a number of the standard test input files.

       When the POSIX API is being tested there is no way to override the
       default newline convention, though it is possible to set the newline
       convention from within the pattern. A warning is given if the posix
       or posix_nosub modifier is used when #newline_default would set a
       default for the non-POSIX API.

         #pattern <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent patterns. Modifiers on a pattern can change these
       settings.

         #perltest

       This line is used in test files that can also be processed by
       perltest.sh to confirm that Perl gives the same results as PCRE2.
       Subsequent tests are checked for the use of pcre2test features that
       are incompatible with the perltest.sh script.

       Patterns must use '/' as their delimiter, and only certain modifiers
       are supported. Comment lines, #pattern commands, and #subject
       commands that set or unset "mark" are recognized and acted on. The
       #perltest, #forbid_utf, and #newline_default commands, which are
       needed in the relevant pcre2test files, are silently ignored.
       All other command lines are ignored, but give a warning message. The
       #perltest command helps detect tests that are accidentally put in the
       wrong file or use the wrong delimiter. For more details of the
       perltest.sh script see the comments it contains.

         #pop [<modifiers>]
         #popcopy [<modifiers>]

       These commands are used to manipulate the stack of compiled patterns,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #save <filename>

       This command is used to save a set of compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.

         #subject <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent subject lines. Modifiers on a subject line can change
       these settings.

MODIFIER SYNTAX
       Modifier lists are used with both pattern and subject lines. Items in
       a list are separated by commas followed by optional white space.
       Trailing whitespace in a modifier list is ignored. Some modifiers may
       be given for both patterns and subject lines, whereas others are
       valid only for one or the other. Each modifier has a long name, for
       example "anchored", and some of them must be followed by an equals
       sign and a value, for example, "offset=12". Values cannot contain
       comma characters, but may contain spaces. Modifiers that do not take
       values may be preceded by a minus sign to turn off a previous
       setting.

       A few of the more common modifiers can also be specified as single
       letters, for example "i" for "caseless". In documentation, following
       the Perl convention, these are written with a slash ("the /i
       modifier") for clarity. Abbreviated modifiers must all be
       concatenated in the first item of a modifier list.
       If the first item is not recognized as a long modifier name, it is
       interpreted as a sequence of these abbreviations. For example:

         /abc/ig,newline=cr,jit=3

       This is a pattern line whose modifier list starts with two one-letter
       modifiers (/i and /g). The lower-case abbreviated modifiers are the
       same as used in Perl.

PATTERN SYNTAX
       A pattern line must start with one of the following characters
       (common symbols, excluding pattern meta-characters):

         / ! " ' ` - = _ : ; , % & @ ~

       This is interpreted as the pattern's delimiter. A regular expression
       may be continued over several input lines, in which case the newline
       characters are included within it. It is possible to include the
       delimiter as a literal within the pattern by escaping it with a
       backslash, for example

         /abc\/def/

       If you do this, the escape and the delimiter form part of the
       pattern, but since the delimiters are all non-alphanumeric, the
       inclusion of the backslash does not affect the pattern's
       interpretation. Note, however, that this trick does not work within
       \Q...\E literal bracketing because the backslash will itself be
       interpreted as a literal. If the terminating delimiter is immediately
       followed by a backslash, for example,

         /abc/\

       then a backslash is added to the end of the pattern. This is done to
       provide a way of testing the error condition that arises if a pattern
       finishes with a backslash, because

         /abc\/

       is interpreted as the first line of a pattern that starts with
       "abc/", causing pcre2test to read the next line as a continuation of
       the regular expression.

       A pattern can be followed by a modifier list (details below).
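
       The delimiter rules above can be sketched in a few lines of Python.
       This is an illustration only, not the pcre2test implementation: the
       function name split_pattern_line is hypothetical, only a single
       complete input line is handled (no continuation lines), and it is an
       assumption that a backslash always protects the character after it
       while scanning for the closing delimiter.

```python
# Delimiters listed in the PATTERN SYNTAX section.
DELIMITERS = set("/!\"'`-=_:;,%&@~")


def split_pattern_line(line):
    """Split one complete pattern line into (delimiter, pattern, modifiers).

    Hypothetical helper for illustration; raises ValueError when the line
    has no valid delimiter or no terminating delimiter.
    """
    if not line or line[0] not in DELIMITERS:
        raise ValueError("pattern line must start with a valid delimiter")
    delim = line[0]
    i = 1
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            i += 2  # the backslash and the escaped character both stay
            continue
        if ch == delim:
            break
        i += 1
    else:
        # pcre2test would read the next line as a continuation here.
        raise ValueError("no terminating delimiter on this line")
    pattern = line[1:i]
    rest = line[i + 1:]
    if rest.startswith("\\"):
        # A backslash right after the closing delimiter is appended
        # to the pattern, as in /abc/\ above.
        pattern += "\\"
        rest = rest[1:]
    return delim, pattern, rest.strip()


if __name__ == "__main__":
    print(split_pattern_line("/abc/ig,newline=cr,jit=3"))
    print(split_pattern_line(r"/abc\/def/"))
```

       With this sketch, /abc\/def/ keeps both the backslash and the slash
       in the pattern, whereas /abc\/ is reported as unterminated, which
       matches the continuation behaviour described above.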


SUBJECT LINE SYNTAX
       Before each subject line is passed to pcre2_match(),
       pcre2_dfa_match(), or pcre2_jit_match(), leading and trailing white
       space is removed, and the line is scanned for backslash escapes,
       unless the subject_literal modifier was set for the pattern. The
       following provide a means of encoding non-printing characters in a
       visible way:

         \a         alarm (BEL, \x07)
         \b         backspace (\x08)
         \e         escape (\x1b)
         \f         form feed (\x0c)
         \n         newline (\x0a)
         \r         carriage return (\x0d)
         \t         tab (\x09)
         \v         vertical tab (\x0b)
         \nnn       octal character (up to 3 octal digits); always
                      a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
         \o{dd...}  octal character (any number of octal digits)
         \xhh       hexadecimal byte (up to 2 hex digits)
         \x{hh...}  hexadecimal character (any number of hex digits)

       The use of \x{hh...} is not dependent on the use of the utf modifier
       on the pattern. It is recognized always. There may be any number of
       hexadecimal digits inside the braces; invalid values provoke error
       messages.

       Note that \xhh specifies one byte rather than one character in UTF-8
       mode; this makes it possible to construct invalid UTF-8 sequences for
       testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
       character in UTF-8 mode, generating more than one byte if the value
       is greater than 127. When testing the 8-bit library not in UTF-8
       mode, \x{hh} generates one byte for values less than 256, and causes
       an error for greater values.

       In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes
       it possible to construct invalid UTF-16 sequences for testing
       purposes.

       In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
       makes it possible to construct invalid UTF-32 sequences for testing
       purposes.
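
       The difference between \xhh and \x{hh...} for the 8-bit library can
       be illustrated with a short Python sketch. It is not how pcre2test
       is implemented: the function name decode_hex_escapes is
       hypothetical, only these two escapes are handled, the rest of the
       subject is assumed to be plain ASCII, and Python's own UTF-8 encoder
       is used, which rejects some values (surrogates, code points above
       0x10ffff) regardless of what pcre2test does with them.

```python
import re

# \x{hh...} (any number of hex digits) or \xhh (one or two hex digits).
_HEX_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}|\\x([0-9A-Fa-f]{1,2})")


def decode_hex_escapes(subject, utf):
    """Expand hex escapes in an ASCII subject line for the 8-bit library.

    Hypothetical helper for illustration. 'utf' says whether the pattern
    was compiled with the utf modifier.
    """
    out = bytearray()
    pos = 0
    for m in _HEX_ESCAPE.finditer(subject):
        out += subject[pos:m.start()].encode("ascii")
        if m.group(1) is not None:
            # \x{hh...} names a character.
            value = int(m.group(1), 16)
            if utf:
                out += chr(value).encode("utf-8")  # may be several bytes
            elif value < 256:
                out.append(value)
            else:
                raise ValueError("value too large for non-UTF 8-bit mode")
        else:
            # \xhh always produces exactly one byte, even in UTF-8 mode.
            out.append(int(m.group(2), 16))
        pos = m.end()
    out += subject[pos:].encode("ascii")
    return bytes(out)


if __name__ == "__main__":
    print(decode_hex_escapes(r"a\xe9b", utf=True))    # one raw byte
    print(decode_hex_escapes(r"a\x{e9}b", utf=True))  # two-byte UTF-8
```

       The first call yields a lone 0xe9 byte, which is not valid UTF-8,
       and that is exactly the kind of deliberately malformed subject the
       text says \xhh is meant to make possible.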

       There is a special backslash sequence that specifies replication of
       one or more characters:

         \[<characters>]{<count>}

       This makes it possible to test long strings without having to provide
       them as part of the file. For example:

         \[abc]{4}

       is converted to "abcabcabcabc". This feature does not support
       nesting. To include a closing square bracket in the characters, code
       it as \x5D.

       A backslash followed by an equals sign marks the end of the subject
       string and the start of a modifier list. For example:

         abc\=notbol,notempty

       If the subject string is empty and \= is followed by whitespace, the
       line is treated as a comment line, and is not used for matching. For
       example:

         \= This is a comment.
         abc\= This is an invalid modifier list.

       A backslash followed by any other non-alphanumeric character just
       escapes that character. A backslash followed by anything else causes
       an error. However, if the very last character in the line is a
       backslash (and there is no modifier list), it is ignored. This gives
       a way of passing an empty line as data, since a real empty line
       terminates the data input.

       If the subject_literal modifier is set for a pattern, all subject
       lines that follow are treated as literals, with no special treatment
       of backslashes. No replication is possible, and any subject modifiers
       must be set as defaults by a #subject command.

PATTERN MODIFIERS
       There are several types of modifier that can appear in pattern lines.
       Except where noted below, they may also be used in #pattern commands.
       A pattern's modifier list can add to or override default modifiers
       that were set by a previous #pattern command.

   Setting compilation options

       The following modifiers set options for pcre2_compile().
       Most of them set bits in the options argument of that function, but
       those whose names start with PCRE2_EXTRA are additional options that
       are set in the compile context. For the main options, there are some
       single-letter abbreviations that are the same as Perl options. There
       is special handling for /x: if a second x is present, PCRE2_EXTENDED
       is converted into PCRE2_EXTENDED_MORE as in Perl. A third appearance
       adds PCRE2_EXTENDED as well, though this makes no difference to the
       way pcre2_compile() behaves. See pcre2api for a description of the
       effects of these options.

           allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
           allow_lookaround_bsk      set PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
           allow_surrogate_escapes   set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
           alt_bsux                  set PCRE2_ALT_BSUX
           alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
           alt_verbnames             set PCRE2_ALT_VERBNAMES
           anchored                  set PCRE2_ANCHORED
           auto_callout              set PCRE2_AUTO_CALLOUT
           bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
       /i  caseless                  set PCRE2_CASELESS
           dollar_endonly            set PCRE2_DOLLAR_ENDONLY
       /s  dotall                    set PCRE2_DOTALL
           dupnames                  set PCRE2_DUPNAMES
           endanchored               set PCRE2_ENDANCHORED
           escaped_cr_is_lf          set PCRE2_EXTRA_ESCAPED_CR_IS_LF
       /x  extended                  set PCRE2_EXTENDED
       /xx extended_more             set PCRE2_EXTENDED_MORE
           extra_alt_bsux            set PCRE2_EXTRA_ALT_BSUX
           firstline                 set PCRE2_FIRSTLINE
           literal                   set PCRE2_LITERAL
           match_line                set PCRE2_EXTRA_MATCH_LINE
           match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
           match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
           match_word                set PCRE2_EXTRA_MATCH_WORD
       /m  multiline                 set PCRE2_MULTILINE
           never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
           never_ucp                 set PCRE2_NEVER_UCP
           never_utf                 set PCRE2_NEVER_UTF
       /n  no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
           no_auto_possess           set PCRE2_NO_AUTO_POSSESS
           no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
           no_start_optimize         set PCRE2_NO_START_OPTIMIZE
           no_utf_check              set PCRE2_NO_UTF_CHECK
           ucp                       set PCRE2_UCP
           ungreedy                  set PCRE2_UNGREEDY
           use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
           utf                       set PCRE2_UTF

       As well as turning on the PCRE2_UTF option, the utf modifier causes
       all non-printing characters in output strings to be printed using the
       \x{hh...} notation. Otherwise, those less than 0x100 are output in
       hex without the curly brackets. Setting utf in 16-bit or 32-bit mode
       also causes pattern and subject strings to be translated to UTF-16 or
       UTF-32, respectively, before being passed to library functions.

   Setting compilation controls

       The following modifiers affect the compilation process or request
       information about the pattern. There are single-letter abbreviations
       for some that are heavily used in the test files.

           bsr=[anycrlf|unicode]     specify \R handling
       /B  bincode                   show binary code without lengths
           callout_info              show callout information
           convert=<options>         request foreign pattern conversion
           convert_glob_escape=c     set glob escape character
           convert_glob_separator=c  set glob separator character
           convert_length            set convert buffer length
           debug                     same as info,fullbincode
           framesize                 show matching frame size
           fullbincode               show binary code with lengths
       /I  info                      show info about compiled pattern
           hex                       unquoted characters are hexadecimal
           jit[=<number>]            use JIT
           jitfast                   use JIT fast path
           jitverify                 verify JIT use
           locale=<name>             use this locale
           max_pattern_length=<n>    set the maximum pattern length
           memory                    show memory used
           newline=<type>            set newline type
           null_context              compile with a NULL context
           parens_nest_limit=<n>     set maximum parentheses depth
           posix                     use the POSIX API
           posix_nosub               use the POSIX API with REG_NOSUB
           push                      push compiled pattern onto the stack
           pushcopy                  push a copy onto the stack
           stackguard=<number>       test the stackguard feature
           subject_literal           treat all subject lines as literal
           tables=[0|1|2|3]          select internal tables
           use_length                do not zero-terminate the pattern
           utf8_input                treat input as UTF-8

       The effects of these modifiers are described in the following
       sections.

   Newline and \R handling

       The bsr modifier specifies what \R in a pattern should match. If it
       is set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to
       "unicode", \R matches any Unicode newline sequence. The default can
       be specified when PCRE2 is built; if it is not, the default is set to
       Unicode.

       The newline modifier specifies which characters are to be interpreted
       as newlines, both in the pattern and in subject lines. The type must
       be one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower
       case).

   Information about a pattern

       The debug modifier is a shorthand for info,fullbincode, requesting
       all available information.

       The bincode modifier causes a representation of the compiled code to
       be output after compilation. This information does not contain length
       and offset values, which ensures that the same output is generated
       for different internal link sizes and different code unit widths. By
       using bincode, the same regression tests can be used in different
       environments.

       The fullbincode modifier, by contrast, does include length and offset
       values. This is used in a few special tests that run only for
       specific code unit widths and link sizes, and is also useful for
       one-off tests.

       The info modifier requests information about the compiled pattern
       (whether it is anchored, has a fixed first character, and so on). The
       information is obtained from the pcre2_pattern_info() function.
       Here are some typical examples:

           re> /(?i)(^a|^b)/m,info
         Capture group count = 1
         Compile options: multiline
         Overall options: caseless multiline
         First code unit at start or follows newline
         Subject length lower bound = 1

           re> /(?i)abc/info
         Capture group count = 0
         Compile options: <none>
         Overall options: caseless
         First code unit = 'a' (caseless)
         Last code unit = 'c' (caseless)
         Subject length lower bound = 3

       "Compile options" are those specified by modifiers; "overall options"
       have added options that are taken or deduced from the pattern. If
       both sets of options are the same, just a single "options" line is
       output; if there are no options, the line is omitted. "First code
       unit" is where any match must start; if there is more than one they
       are listed as "starting code units". "Last code unit" is the last
       literal code unit that must be present in any match. This is not
       necessarily the last character. These lines are omitted if no
       starting or ending code units are recorded. The subject length line
       is omitted when no_start_optimize is set because the minimum length
       is not calculated when it can never be used.

       The framesize modifier shows the size, in bytes, of the storage
       frames used by pcre2_match() for handling backtracking. The size
       depends on the number of capturing parentheses in the pattern.

       The callout_info modifier requests information about all the callouts
       in the pattern. A list of them is output at the end of any other
       information that is requested. For each callout, either its number or
       string is given, followed by the item that follows it in the pattern.

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_compile(). If the
       null_context modifier is set, however, NULL is passed.
       This is for testing that pcre2_compile() behaves correctly in this
       case (it uses default values).

   Specifying pattern characters in hexadecimal

       The hex modifier specifies that the characters of the pattern,
       except for substrings enclosed in single or double quotes, are to be
       interpreted as pairs of hexadecimal digits. This feature is provided
       as a way of creating patterns that contain binary zeros and other
       non-printing characters. White space is permitted between pairs of
       digits. For example, this pattern contains three characters:

         /ab 32 59/hex

       Parts of such a pattern are taken literally if quoted. This pattern
       contains nine characters, only two of which are specified in
       hexadecimal:

         /ab "literal" 32/hex

       Either single or double quotes may be used. There is no way of
       including the delimiter within a substring. The hex and expand
       modifiers are mutually exclusive.

   Specifying the pattern's length

       By default, patterns are passed to the compiling functions as
       zero-terminated strings, but they can be passed by length instead.
       The use_length modifier causes this to happen. Using a length
       happens automatically (whether or not use_length is set) when hex is
       set, because patterns specified in hexadecimal may contain binary
       zeros.

       If hex or use_length is used with the POSIX wrapper API (see "Using
       the POSIX wrapper API" below), the REG_PEND extension is used to
       pass the pattern's length.

   Specifying wide characters in 16-bit and 32-bit modes

       In 16-bit and 32-bit modes, all input is automatically treated as
       UTF-8 and translated to UTF-16 or UTF-32 when the utf modifier is
       set. For testing the 16-bit and 32-bit libraries in non-UTF mode,
       the utf8_input modifier can be used. It is mutually exclusive with
       utf.
       Input lines are interpreted as UTF-8 as a means of specifying wide
       characters. More details are given in "Input encoding" above.

   Generating long repetitive patterns

       Some tests use long patterns that are very repetitive. Instead of
       creating a very long input line for such a pattern, you can use a
       special repetition feature, similar to the one described for subject
       lines above. If the expand modifier is present on a pattern, parts
       of the pattern that have the form

         \[<characters>]{<count>}

       are expanded before the pattern is passed to pcre2_compile(). For
       example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This
       construction cannot be nested. An initial "\[" sequence is
       recognized only if "]{" followed by decimal digits and "}" is found
       later in the pattern. If not, the characters remain in the pattern
       unaltered. The expand and hex modifiers are mutually exclusive.

       If part of an expanded pattern looks like an expansion, but is
       really part of the actual pattern, unwanted expansion can be avoided
       by giving two values in the quantifier. For example,
       \[AB]{6000,6000} is not recognized as an expansion item.

       If the info modifier is set on an expanded pattern, the result of
       the expansion is included in the information that is output.

   JIT compilation

       Just-in-time (JIT) compiling is a heavyweight optimization that can
       greatly speed up pattern matching. See the pcre2jit documentation
       for details. JIT compiling happens, optionally, after a pattern has
       been successfully compiled into an internal form. The JIT compiler
       converts this to optimized machine code. It needs to know whether
       the match-time options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are
       going to be used, because different code is generated for the
       different cases.
       See the partial modifier in "Subject Modifiers" below for details of
       how these options are specified for each match attempt.

       JIT compilation is requested by the jit pattern modifier, which may
       optionally be followed by an equals sign and a number in the range 0
       to 7. The three bits that make up the number specify which of the
       three JIT operating modes are to be compiled:

           1  compile JIT code for non-partial matching
           2  compile JIT code for soft partial matching
           4  compile JIT code for hard partial matching

       The possible values for the jit modifier are therefore:

           0  disable JIT
           1  normal matching only
           2  soft partial matching only
           3  normal and soft partial matching
           4  hard partial matching only
           5  normal and hard partial matching
           6  soft and hard partial matching only
           7  all three modes

       If no number is given, 7 is assumed. The phrase "partial matching"
       means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or
       the PCRE2_PARTIAL_HARD option set. Note that such a call may return
       a complete match; the options enable the possibility of a partial
       match, but do not require it. Note also that if you request JIT
       compilation only for partial matching (for example, jit=2) but do
       not set the partial modifier on a subject line, that match will not
       use JIT code because none was compiled for non-partial matching.

       If JIT compilation is successful, the compiled JIT code will
       automatically be used when an appropriate type of match is run,
       except when incompatible run-time options are specified. For more
       details, see the pcre2jit documentation. See also the jitstack
       modifier below for a way of setting the size of the JIT stack.
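
       The bit arithmetic behind the jit=<n> values can be sketched as
       follows. This is an illustration only, not part of pcre2test: the
       constant and function names are hypothetical, and only the bit
       values 1, 2, and 4 come from the description above.

```python
# Bit values taken from the description of the jit modifier above.
# The names themselves are hypothetical, chosen for this sketch.
JIT_NORMAL = 1         # non-partial matching
JIT_PARTIAL_SOFT = 2   # soft partial matching
JIT_PARTIAL_HARD = 4   # hard partial matching

_MODE_NAMES = {
    JIT_NORMAL: "normal",
    JIT_PARTIAL_SOFT: "soft partial",
    JIT_PARTIAL_HARD: "hard partial",
}


def jit_modes(value=7):
    """List the matching modes selected by jit=<value> (7 if no number)."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in _MODE_NAMES.items() if value & bit]


def jit_code_available(value, partial=None):
    """Say whether JIT code compiled with jit=<value> covers a match.

    partial is None for a non-partial match, or "soft" or "hard".
    """
    needed = {
        None: JIT_NORMAL,
        "soft": JIT_PARTIAL_SOFT,
        "hard": JIT_PARTIAL_HARD,
    }[partial]
    return bool(value & needed)


if __name__ == "__main__":
    for value in range(8):
        print(value, ", ".join(jit_modes(value)) or "JIT disabled")
```

       With jit=2, jit_code_available(2) is false for a subject line
       without a partial modifier, which corresponds to the case noted
       above where the match does not use JIT code.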

       If the jitfast modifier is specified, matching is done using the JIT
       "fast path" interface, pcre2_jit_match(), which skips some of the
       sanity checks that are done by pcre2_match(), and of course does not
       work when JIT is not supported. If jitfast is specified without jit,
       jit=7 is assumed.

       If the jitverify modifier is specified, information about the
       compiled pattern shows whether JIT compilation was or was not
       successful. If jitverify is specified without jit, jit=7 is assumed.
       If JIT compilation is successful when jitverify is set, the text
       "(JIT)" is added to the first output line after a match or non-match
       when JIT-compiled code was actually used in the match.

   Setting a locale

       The locale modifier must specify the name of a locale, for example:

         /pattern/locale=fr_FR

       The given locale is set, pcre2_maketables() is called to build a set
       of character tables for the locale, and this is then passed to
       pcre2_compile() when compiling the regular expression. The same
       tables are used when matching the following subject lines. The
       locale modifier applies only to the pattern on which it appears, but
       can be given in a #pattern command if a default is needed. Setting a
       locale and alternate character tables are mutually exclusive.

   Showing pattern memory

       The memory modifier causes the size in bytes of the memory used to
       hold the compiled pattern to be output. This does not include the
       size of the pcre2_code block; it is just the actual compiled data.
       If the pattern is subsequently passed to the JIT compiler, the size
       of the JIT compiled code is also output. Here is an example:

           re> /a(b)c/jit,memory
         Memory allocation (code space): 21
         Memory allocation (JIT code): 1910

   Limiting nested parentheses

       The parens_nest_limit modifier sets a limit on the depth of nested
       parentheses in a pattern.
       Breaching the limit causes a compilation error. The default for the
       library is set when PCRE2 is built, but pcre2test sets its own
       default of 220, which is required for running the standard test
       suite.

   Limiting the pattern length

       The max_pattern_length modifier sets a limit, in code units, to the
       length of pattern that pcre2_compile() will accept. Breaching the
       limit causes a compilation error. The default is the largest number
       a PCRE2_SIZE variable can hold (essentially unlimited).

   Using the POSIX wrapper API

       The posix and posix_nosub modifiers cause pcre2test to call PCRE2
       via the POSIX wrapper API rather than its native API. When
       posix_nosub is used, the POSIX option REG_NOSUB is passed to
       regcomp(). The POSIX wrapper supports only the 8-bit library. Note
       that it does not imply POSIX matching semantics; for more detail see
       the pcre2posix documentation. The following pattern modifiers set
       options for the regcomp() function:

         caseless           REG_ICASE
         multiline          REG_NEWLINE
         dotall             REG_DOTALL     )
         ungreedy           REG_UNGREEDY   ) These options are not part of
         ucp                REG_UCP        )   the POSIX standard
         utf                REG_UTF8       )

       The regerror_buffsize modifier specifies a size for the error buffer
       that is passed to regerror() in the event of a compilation error.
       For example:

         /abc/posix,regerror_buffsize=20

       This provides a means of testing the behaviour of regerror() when
       the buffer is too small for the error message. If this modifier has
       not been set, a large buffer is used.

       The aftertext and allaftertext subject modifiers work as described
       below. All other modifiers are either ignored, with a warning
       message, or cause an error.

       The pattern is passed to regcomp() as a zero-terminated string by
       default, but if the use_length or hex modifiers are set, the
       REG_PEND extension is used to pass it by length.

   Testing the stack guard feature

       The stackguard modifier is used to test the use of
       pcre2_set_compile_recursion_guard(), a function that is provided to
       enable stack availability to be checked during compilation (see the
       pcre2api documentation for details). If the number specified by the
       modifier is greater than zero, pcre2_set_compile_recursion_guard()
       is called to set up a callback from pcre2_compile() to a local
       function. The argument it receives is the current nesting
       parenthesis depth; if this is greater than the value given by the
       modifier, non-zero is returned, causing the compilation to be
       aborted.

   Using alternative character tables

       The value specified for the tables modifier must be one of the
       digits 0, 1, 2, or 3. It causes a specific set of built-in character
       tables to be passed to pcre2_compile(). This is used in the PCRE2
       tests to check behaviour with different character tables. The digit
       specifies the tables as follows:

         0   do not pass any special character tables
         1   the default ASCII tables, as distributed in
               pcre2_chartables.c.dist
         2   a set of tables defining ISO 8859 characters
         3   a set of tables loaded by the #loadtables command

       In tables 2, some characters whose codes are greater than 128 are
       identified as letters, digits, spaces, etc. Tables 3 can be used
       only after a #loadtables command has loaded them from a binary file.
       Setting alternate character tables and a locale are mutually
       exclusive.

   Setting certain match controls

       The following modifiers are really subject modifiers, and are
       described under "Subject Modifiers" below. However, they may be
       included in a pattern's modifier list, in which case they are
       applied to every subject line that is processed with that pattern.
       These modifiers do not affect the compilation process.

           aftertext                    show text after match
           allaftertext                 show text after captures
           allcaptures                  show all captures
           allvector                    show the entire ovector
           allusedtext                  show all consulted text
           altglobal                    alternative global matching
       /g  global                       global matching
           jitstack=<n>                 set size of JIT stack
           mark                         show mark values
           replace=<string>             specify a replacement string
           startchar                    show starting character when relevant
           substitute_callout           use substitution callouts
           substitute_extended          use PCRE2_SUBSTITUTE_EXTENDED
           substitute_literal           use PCRE2_SUBSTITUTE_LITERAL
           substitute_matched           use PCRE2_SUBSTITUTE_MATCHED
           substitute_overflow_length   use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_replacement_only  use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
           substitute_skip=<n>          skip substitution <n>
           substitute_stop=<n>          skip substitution <n> and following
           substitute_unknown_unset     use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty       use PCRE2_SUBSTITUTE_UNSET_EMPTY

       These modifiers may not appear in a #pattern command. If you want
       them as defaults, set them in a #subject command.

   Specifying literal subject lines

       If the subject_literal modifier is present on a pattern, all the
       subject lines that it matches are taken as literal strings, with no
       interpretation of backslashes. It is not possible to set subject
       modifiers on such lines, but any that are set as defaults by a
       #subject command are recognized.

   Saving a compiled pattern

       When a pattern with the push modifier is successfully compiled, it
       is pushed onto a stack of compiled patterns, and pcre2test expects
       the next line to contain a new pattern (or a command) instead of a
       subject line. This facility is used when saving compiled patterns to
       a file, as described in the section entitled "Saving and restoring
       compiled patterns" below.
       If pushcopy is used instead of push, a copy of the compiled pattern
       is stacked, leaving the original as current, ready to match the
       following input lines. This provides a way of testing the
       pcre2_code_copy() function. The push and pushcopy modifiers are
       incompatible with compilation modifiers such as global that act at
       match time. Any that are specified are ignored (for the stacked
       copy), with a warning message, except for replace, which causes an
       error. Note that jitverify, which is allowed, does not carry through
       to any subsequent matching that uses a stacked pattern.

   Testing foreign pattern conversion

       The experimental foreign pattern conversion functions in PCRE2 can
       be tested by setting the convert modifier. Its argument is a
       colon-separated list of options, which set the equivalent option for
       the pcre2_pattern_convert() function:

         glob                    PCRE2_CONVERT_GLOB
         glob_no_starstar        PCRE2_CONVERT_GLOB_NO_STARSTAR
         glob_no_wild_separator  PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
         posix_basic             PCRE2_CONVERT_POSIX_BASIC
         posix_extended          PCRE2_CONVERT_POSIX_EXTENDED
         unset                   Unset all options

       The "unset" value is useful for turning off a default that has been
       set by a #pattern command. When one of these options is set, the
       input pattern is passed to pcre2_pattern_convert(). If the
       conversion is successful, the result is reflected in the output and
       then passed to pcre2_compile(). The normal utf and no_utf_check
       options, if set, cause the PCRE2_CONVERT_UTF and
       PCRE2_CONVERT_NO_UTF_CHECK options to be passed to
       pcre2_pattern_convert().

       By default, the conversion function is allowed to allocate a buffer
       for its output. However, if the convert_length modifier is set to a
       value greater than zero, pcre2test passes a buffer of the given
       length. This makes it possible to test the length check.

       The convert_glob_escape and convert_glob_separator modifiers can be
       used to specify the escape and separator characters for glob
       processing, overriding the defaults, which are operating-system
       dependent.


SUBJECT MODIFIERS

       The modifiers that can appear in subject lines and the #subject
       command are of two types.

   Setting match options

       The following modifiers set options for pcre2_match() or
       pcre2_dfa_match(). See pcre2api for a description of their effects.

           anchored                  set PCRE2_ANCHORED
           endanchored               set PCRE2_ENDANCHORED
           dfa_restart               set PCRE2_DFA_RESTART
           dfa_shortest              set PCRE2_DFA_SHORTEST
           no_jit                    set PCRE2_NO_JIT
           no_utf_check              set PCRE2_NO_UTF_CHECK
           notbol                    set PCRE2_NOTBOL
           notempty                  set PCRE2_NOTEMPTY
           notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
           noteol                    set PCRE2_NOTEOL
           partial_hard (or ph)      set PCRE2_PARTIAL_HARD
           partial_soft (or ps)      set PCRE2_PARTIAL_SOFT

       The partial matching modifiers are provided with abbreviations
       because they appear frequently in tests.

       If the posix or posix_nosub modifier was present on the pattern,
       causing the POSIX wrapper API to be used, the only option-setting
       modifiers that have any effect are notbol, notempty, and noteol,
       causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to
       be passed to regexec(). The other modifiers are ignored, with a
       warning message.

       There is one additional modifier that can be used with the POSIX
       wrapper. It is ignored (with a warning) if used for non-POSIX
       matching.

           posix_startend=<n>[:<m>]

       This causes the subject string to be passed to regexec() using the
       REG_STARTEND option, which uses offsets to specify which part of the
       string is searched. If only one number is given, the end offset is
       passed as the end of the subject string.
       For more detail of REG_STARTEND, see the pcre2posix documentation.
       If the subject string contains binary zeros (coded as escapes such
       as \x{00} because pcre2test does not support actual binary zeros in
       its input), you must use posix_startend to specify its length.

   Setting match controls

       The following modifiers affect the matching process or request
       additional information. Some of them may also be specified on a
       pattern line (see above), in which case they apply to every subject
       line that is matched against that pattern, but can be overridden by
       modifiers on the subject.

           aftertext                    show text after match
           allaftertext                 show text after captures
           allcaptures                  show all captures
           allvector                    show the entire ovector
           allusedtext                  show all consulted text (non-JIT only)
           altglobal                    alternative global matching
           callout_capture              show captures at callout time
           callout_data=<n>             set a value to pass via callouts
           callout_error=<n>[:<m>]      control callout error
           callout_extra                show extra callout information
           callout_fail=<n>[:<m>]       control callout failure
           callout_no_where             do not show position of a callout
           callout_none                 do not supply a callout function
           copy=<number or name>        copy captured substring
           depth_limit=<n>              set a depth limit
           dfa                          use pcre2_dfa_match()
           find_limits                  find heap, match and depth limits
           find_limits_noheap           find match and depth limits
           get=<number or name>         extract captured substring
           getall                       extract all captured substrings
       /g  global                       global matching
           heap_limit=<n>               set a limit on heap memory (Kbytes)
           jitstack=<n>                 set size of JIT stack
           mark                         show mark values
           match_limit=<n>              set a match limit
           memory                       show heap memory usage
           null_context                 match with a NULL context
           null_replacement             substitute with NULL replacement
           null_subject                 match with NULL subject
           offset=<n>                   set starting offset
           offset_limit=<n>             set offset limit
           ovector=<n>                  set size of output vector
           recursion_limit=<n>          obsolete synonym for depth_limit
           replace=<string>             specify a replacement string
           startchar                    show startchar when relevant
           startoffset=<n>              same as offset=<n>
           substitute_callout           use substitution callouts
           substitute_extended          use PCRE2_SUBSTITUTE_EXTENDED
           substitute_literal           use PCRE2_SUBSTITUTE_LITERAL
           substitute_matched           use PCRE2_SUBSTITUTE_MATCHED
           substitute_overflow_length   use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_replacement_only  use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
           substitute_skip=<n>          skip substitution number n
           substitute_stop=<n>          skip substitution number n and greater
           substitute_unknown_unset     use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty       use PCRE2_SUBSTITUTE_UNSET_EMPTY
           zero_terminate               pass the subject as zero-terminated

       The effects of these modifiers are described in the following
       sections. When matching via the POSIX wrapper API, the aftertext,
       allaftertext, and ovector subject modifiers work as described below.
       All other modifiers are either ignored, with a warning message, or
       cause an error.

   Showing more text

       The aftertext modifier requests that as well as outputting the part
       of the subject string that matched the entire pattern, pcre2test
       should in addition output the remainder of the subject string. This
       is useful for tests where the subject contains multiple copies of
       the same substring. The allaftertext modifier requests the same
       action for captured substrings as well as the main matched
       substring. In each case the remainder is output on the following
       line with a plus character following the capture number.
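
       The effect of aftertext and allaftertext can be imitated with a
       short sketch. It uses Python's own re module as a stand-in for
       PCRE2, so it is an approximation only; the function name show_match
       is hypothetical, and the line layout is modelled on the description
       above rather than copied from pcre2test output.

```python
import re


def show_match(pattern, subject, aftertext=False, allaftertext=False):
    """Return result lines in the style described for aftertext.

    Python's re engine stands in for PCRE2 here, so this only
    illustrates the layout of the "remainder" lines.
    """
    match = re.search(pattern, subject)
    if match is None:
        return ["No match"]
    lines = []
    # Group 0 is the whole match; stop at the highest group that was set.
    for number in range((match.lastindex or 0) + 1):
        if match.group(number) is None:
            lines.append(f"{number:2d}: <unset>")
            continue
        lines.append(f"{number:2d}: {match.group(number)}")
        if allaftertext or (aftertext and number == 0):
            # The remainder of the subject follows a plus character.
            lines.append(f"{number:2d}+ {subject[match.end(number):]}")
    return lines


if __name__ == "__main__":
    print("\n".join(show_match(r"b(c)", "abcdbc", allaftertext=True)))
```

       With aftertext only the line for string 0 is followed by a
       remainder line; with allaftertext every captured substring gets one.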

       The allusedtext modifier requests that all the text that was
       consulted during a successful pattern match by the interpreter
       should be shown, for both full and partial matches. This feature is
       not supported for JIT matching, and if requested with JIT it is
       ignored (with a warning message). Setting this modifier affects the
       output if there is a lookbehind at the start of a match, or, for a
       complete match, a lookahead at the end, or if \K is used in the
       pattern. Characters that precede or follow the start and end of the
       actual match are indicated in the output by '<' or '>' characters
       underneath them. Here is an example:

           re> /(?<=pqr)abc(?=xyz)/
         data> 123pqrabcxyz456\=allusedtext
          0: pqrabcxyz
             <<<   >>>
         data> 123pqrabcxy\=ph,allusedtext
         Partial match: pqrabcxy
                        <<<

       The first, complete match shows that the matched string is "abc",
       with the preceding and following strings "pqr" and "xyz" having been
       consulted during the match (when processing the assertions). The
       partial match can indicate only the preceding string.

       The startchar modifier requests that the starting character for the
       match be indicated, if it is different to the start of the matched
       string. The only time when this occurs is when \K has been processed
       as part of the match. In this situation, the output for the matched
       string is displayed from the starting character instead of from the
       match point, with circumflex characters under the earlier
       characters. For example:

           re> /abc\Kxyz/
         data> abcxyz\=startchar
          0: abcxyz
             ^^^

       Unlike allusedtext, the startchar modifier can be used with JIT.
       However, these two modifiers are mutually exclusive.
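
       The marker line shown for allusedtext follows directly from four
       offsets: where the consulted text starts and ends, and where the
       actual match starts and ends. The sketch below reproduces the two
       marker lines of the allusedtext example. The function name and its
       parameters are hypothetical; pcre2test obtains the offsets from the
       match data, whereas here they are supplied by hand.

```python
def used_text_lines(subject, used_start, match_start, match_end, used_end,
                    prefix=" 0: "):
    """Build the text line and marker line for consulted text.

    Characters consulted before the match are marked '<' and characters
    consulted after it are marked '>'. The offsets are assumed inputs.
    """
    shown = subject[used_start:used_end]
    markers = (
        "<" * (match_start - used_start)
        + " " * (match_end - match_start)
        + ">" * (used_end - match_end)
    )
    # Indent the markers so they sit underneath the characters they mark.
    return prefix + shown, (" " * len(prefix) + markers).rstrip()


if __name__ == "__main__":
    # Complete match of "abc" with "pqr" and "xyz" consulted.
    for line in used_text_lines("123pqrabcxyz456", 3, 6, 9, 12):
        print(line)
    # Partial match: only the preceding text can be indicated.
    for line in used_text_lines("123pqrabcxy", 3, 6, 11, 11,
                                prefix="Partial match: "):
        print(line)
```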

   Showing the value of all capture groups

       The allcaptures modifier requests that the values of all potential
       captured parentheses be output after a match. By default, only those
       up to the highest one actually used in the match are output
       (corresponding to the return code from pcre2_match()). Groups that
       did not take part in the match are output as "<unset>". This
       modifier is not relevant for DFA matching (which does no capturing)
       and does not apply when replace is specified; it is ignored, with a
       warning message, if present.

   Showing the entire ovector, for all outcomes

       The allvector modifier requests that the entire ovector be shown,
       whatever the outcome of the match. Compare allcaptures, which shows
       only up to the maximum number of capture groups for the pattern, and
       then only for a successful complete non-DFA match. This modifier,
       which acts after any match result, and also for DFA matching,
       provides a means of checking that there are no unexpected
       modifications to ovector fields. Before each match attempt, the
       ovector is filled with a special value, and if this is found in both
       elements of a capturing pair, "<unchanged>" is output. After a
       successful match, this applies to all groups after the maximum
       capture group for the pattern. In other cases it applies to the
       entire ovector. After a partial match, the first two elements are
       the only ones that should be set. After a DFA match, the amount of
       ovector that is used depends on the number of matches that were
       found.

   Testing pattern callouts

       A callout function is supplied when pcre2test calls the library
       matching functions, unless callout_none is specified. Its behaviour
       can be controlled by various modifiers listed above whose names
       begin with callout_. Details are given in the section entitled
       "Callouts" below.
       Testing callouts from pcre2_substitute() is described separately in
       "Testing the substitution function" below.

   Finding all matches in a string

       Searching for all possible matches within a subject can be requested
       by the global or altglobal modifier. After finding a match, the
       matching function is called again to search the remainder of the
       subject. The difference between global and altglobal is that the
       former uses the start_offset argument to pcre2_match() or
       pcre2_dfa_match() to start searching at a new point within the
       entire string (which is what Perl does), whereas the latter passes
       over a shortened subject. This makes a difference to the matching
       process if the pattern begins with a lookbehind assertion (including
       \b or \B).

       If an empty string is matched, the next match is done with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to
       search for another, non-empty, match at the same point in the
       subject. If this match fails, the start offset is advanced, and the
       normal match is retried. This imitates the way Perl handles such
       cases when using the /g modifier or the split() function. Normally,
       the start offset is advanced by one character, but if the newline
       convention recognizes CRLF as a newline, and the current character
       is CR followed by LF, an advance of two characters occurs.

   Testing substring extraction functions

       The copy and get modifiers can be used to test the
       pcre2_substring_copy_xxx() and pcre2_substring_get_xxx() functions.
       They can be given more than once, and each can specify a capture
       group name or number, for example:

          abcd\=copy=1,copy=3,get=G1

       If the #subject command is used to set default copy and/or get
       lists, these can be unset by specifying a negative number to cancel
       all numbered groups and an empty name to cancel all named groups.

       The getall modifier tests pcre2_substring_list_get(), which extracts
       all captured substrings.

       If the subject line is successfully matched, the substrings
       extracted by the convenience functions are output with C, G, or L
       after the string number instead of a colon. This is in addition to
       the normal full list. The string length (that is, the return from
       the extraction function) is given in parentheses after each
       substring, followed by the name when the extraction was by name.

   Testing the substitution function

       If the replace modifier is set, the pcre2_substitute() function is
       called instead of one of the matching functions (or after one call
       of pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that
       replacement strings cannot contain commas, because a comma signifies
       the end of a modifier. This is not thought to be an issue in a test
       program.

       Specifying a completely empty replacement string disables this
       modifier. However, it is possible to specify an empty replacement by
       providing a buffer length, as described below, for an otherwise
       empty replacement.

       Unlike subject strings, pcre2test does not process replacement
       strings for escape sequences. In UTF mode, a replacement string is
       checked to see if it is a valid UTF-8 string. If so, it is correctly
       converted to a UTF string of the appropriate code unit width. If it
       is not a valid UTF-8 string, the individual code units are copied
       directly.
       This provides a means of passing an invalid UTF-8 string for testing
       purposes.

       The following modifiers set options (in addition to the normal match
       options) for pcre2_substitute():

         global                       PCRE2_SUBSTITUTE_GLOBAL
         substitute_extended          PCRE2_SUBSTITUTE_EXTENDED
         substitute_literal           PCRE2_SUBSTITUTE_LITERAL
         substitute_matched           PCRE2_SUBSTITUTE_MATCHED
         substitute_overflow_length   PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
         substitute_replacement_only  PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
         substitute_unknown_unset     PCRE2_SUBSTITUTE_UNKNOWN_UNSET
         substitute_unset_empty       PCRE2_SUBSTITUTE_UNSET_EMPTY

       See the pcre2api documentation for details of these options.

       After a successful substitution, the modified string is output,
       preceded by the number of replacements. This may be zero if there
       were no matches. Here is a simple example of a substitution test:

         /abc/replace=xxx
             =abc=abc=
          1: =xxx=abc=
             =abc=abc=\=global
          2: =xxx=xxx=

       Subject and replacement strings should be kept relatively short
       (fewer than 256 characters) for substitution tests, as fixed-size
       buffers are used. To make it easy to test for buffer overflow, if
       the replacement string starts with a number in square brackets, that
       number is passed to pcre2_substitute() as the size of the output
       buffer, with the replacement string starting at the next character.
       Here is an example that tests the edge case:

         /abc/
             123abc123\=replace=[10]XYZ
          1: 123XYZ123
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory

       The default action of pcre2_substitute() is to return
       PCRE2_ERROR_NOMEMORY when the output buffer is too small.
       However, if the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by
       using the substitute_overflow_length modifier), pcre2_substitute()
       continues to go through the motions of matching and substituting
       (but not doing any callouts), in order to compute the size of buffer
       that is required. When this happens, pcre2test shows the required
       buffer length (which includes space for the trailing zero) as part
       of the error message. For example:

         /abc/substitute_overflow_length
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory: 10 code units are needed

       A replacement string is ignored with POSIX and DFA matching.
       Specifying partial matching provokes an error return ("bad option
       value") from pcre2_substitute().

   Testing substitute callouts

       If the substitute_callout modifier is set, a substitution callout
       function is set up. The null_context modifier must not be set,
       because the address of the callout function is passed in a match
       context. When the callout function is called (after each
       substitution), details of the input and output strings are output.
       For example:

         /abc/g,replace=<$0>,substitute_callout
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc>"
          2(1) Old 6 9 "abc" New 8 13 "<abc>"
          2: <abc>def<abc>pqr

       The first number on each callout line is the count of matches. The
       parenthesized number is the number of pairs that are set in the
       ovector (that is, one more than the number of capturing groups that
       were set). Then are listed the offsets of the old substring, its
       contents, and the same for the replacement.

       By default, the substitution callout function returns zero, which
       accepts the replacement and causes matching to continue if /g was
       used. Two further modifiers can be used to test other return values.
       If substitute_skip is set to a value greater than zero, the callout
       function returns +1 for the match of that number, and similarly
       substitute_stop returns -1. These cause the replacement to be
       rejected, and -1 causes no further matching to take place. If either
       of them is set, substitute_callout is assumed. For example:

         /abc/g,replace=<$0>,substitute_skip=1
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc> SKIPPED"
          2(1) Old 6 9 "abc" New 6 11 "<abc>"
          2: abcdef<abc>pqr
             abcdefabcpqr\=substitute_stop=1
          1(1) Old 0 3 "abc" New 0 5 "<abc> STOPPED"
          1: abcdefabcpqr

       If both are set for the same number, stop takes precedence. Only a
       single skip or stop is supported, which is sufficient for testing
       that the feature works.

   Setting the JIT stack size

       The jitstack modifier provides a way of setting the maximum stack
       size that is used by the just-in-time optimization code. It is
       ignored if JIT optimization is not being used. The value is a number
       of kibibytes (units of 1024 bytes). Setting zero reverts to the
       default of 32KiB. Providing a stack that is larger than the default
       is necessary only for very complicated patterns. If jitstack is set
       non-zero on a subject line it overrides any value that was set on
       the pattern.

   Setting heap, match, and depth limits

       The heap_limit, match_limit, and depth_limit modifiers set the
       appropriate limits in the match context. These values are ignored
       when the find_limits or find_limits_noheap modifier is specified.
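
       As a rough illustration of what a match limit guards against, and
       of how a smallest workable limit can be searched for (the job of
       the find_limits modifier), here is a minimal Python sketch. It is
       not PCRE2 code and does not count the way PCRE2 does: toy_match(),
       find_min_limit(), and MatchLimitExceeded are hypothetical names,
       and the matcher is a brute-force stand-in for a pattern such as
       /^(a+)+b/.

```python
class MatchLimitExceeded(Exception):
    """Raised by the toy matcher when its call counter passes the limit."""


def toy_match(subject, match_limit):
    """Brute-force backtracking stand-in for /^(a+)+b/ (illustrative only).

    Returns (matched, calls). Raises MatchLimitExceeded when more than
    match_limit internal calls would be needed.
    """
    calls = 0

    def attempt(pos):
        nonlocal calls
        calls += 1
        if calls > match_limit:
            raise MatchLimitExceeded(match_limit)
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Try every possible length for one (a+) group, longest first.
        for stop in range(end, pos, -1):
            if subject[stop:stop + 1] == "b" or attempt(stop):
                return True
        return False

    return attempt(0), calls


def find_min_limit(subject):
    """Double the limit until the attempt completes, then bisect downwards."""

    def completes(limit):
        try:
            toy_match(subject, limit)
            return True
        except MatchLimitExceeded:
            return False

    high = 1
    while not completes(high):
        high *= 2
    low = high // 2  # largest value known (or assumed) to be too small
    while high - low > 1:
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high
```

       In this toy, find_min_limit("aaac") returns 8 even though the match
       itself fails: the limit only has to be large enough for the attempt
       to run to completion. Each extra "a" in the subject doubles the
       number of calls needed.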

   Finding minimum limits

       If the find_limits modifier is present on a subject line, pcre2test
       calls the relevant matching function several times, setting
       different values in the match context via pcre2_set_heap_limit(),
       pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds
       the smallest value for each parameter that allows the match to
       complete without a "limit exceeded" error. The match itself may
       succeed or fail. An alternative modifier, find_limits_noheap, omits
       the heap limit. This is used in the standard tests, because the
       minimum heap limit varies between systems. If JIT is being used,
       only the match limit is relevant, and the other two are
       automatically omitted.

       When using this modifier, the pattern should not contain any limit
       settings such as (*LIMIT_MATCH=...) within it. If such a setting is
       present and is lower than the minimum matching value, the minimum
       value cannot be found because pcre2_set_match_limit() etc. are only
       able to reduce the value of an in-pattern limit; they cannot
       increase it.

       For non-DFA matching, the minimum depth_limit number is a measure of
       how much nested backtracking happens (that is, how deeply the
       pattern's tree is searched). In the case of DFA matching,
       depth_limit controls the depth of recursive calls of the internal
       function that is used for handling pattern recursion, lookaround
       assertions, and atomic groups.

       For non-DFA matching, the match_limit number is a measure of the
       amount of backtracking that takes place, and learning the minimum
       value can be instructive. For most simple matches, the number is
       quite small, but for patterns with very large numbers of matching
       possibilities, it can become large very quickly with increasing
       length of subject string.
       In the case of DFA matching, match_limit controls the total number
       of calls, both recursive and non-recursive, to the internal matching
       function, thus controlling the overall amount of computing resource
       that is used.

       For both kinds of matching, the heap_limit number, which is in
       kibibytes (units of 1024 bytes), limits the amount of heap memory
       used for matching.

   Showing MARK names

       The mark modifier causes the names from backtracking control verbs
       that are returned from calls to pcre2_match() to be displayed. If a
       mark is returned for a match, non-match, or partial match, pcre2test
       shows it. For a match, it is on a line by itself, tagged with "MK:".
       Otherwise, it is added to the non-match message.

   Showing memory usage

       The memory modifier causes pcre2test to log the sizes of all heap
       memory allocation and freeing calls that occur during a call to
       pcre2_match() or pcre2_dfa_match(). In the latter case, heap memory
       is used only when a match requires more internal workspace than the
       default allocation on the stack, so in many cases there will be no
       output. No heap memory is allocated during matching with JIT. For
       this modifier to work, the null_context modifier must not be set on
       both the pattern and the subject, though it can be set on one or the
       other.

   Setting a starting offset

       The offset modifier sets an offset in the subject string at which
       matching starts. Its value is a number of code units, not
       characters.

   Setting an offset limit

       The offset_limit modifier sets a limit for unanchored matches. If a
       match cannot be found starting at or before this offset in the
       subject, a "no match" return is given. The data value is a number of
       code units, not characters.
       When this modifier is used, the use_offset_limit modifier must have
       been set for the pattern; if not, an error is generated.

   Setting the size of the output vector

       The ovector modifier applies only to the subject line in which it
       appears, though of course it can also be used to set a default in a
       #subject command. It specifies the number of pairs of offsets that
       are available for storing matching information. The default is 15.

       A value of zero is useful when testing the POSIX API because it
       causes regexec() to be called with a NULL capture vector. When not
       testing the POSIX API, a value of zero is used to cause
       pcre2_match_data_create_from_pattern() to be called, in order to
       create a match block of exactly the right size for the pattern. (It
       is not possible to create a match block with a zero-length ovector;
       there is always at least one pair of offsets.)

   Passing the subject as zero-terminated

       By default, the subject string is passed to a native API matching
       function with its correct length. In order to test the facility for
       passing a zero-terminated string, the zero_terminate modifier is
       provided. It causes the length to be passed as
       PCRE2_ZERO_TERMINATED. When matching via the POSIX interface, this
       modifier is ignored, with a warning.

       When testing pcre2_substitute(), this modifier also has the effect
       of passing the replacement string as zero-terminated.

   Passing a NULL context, subject, or replacement

       Normally, pcre2test passes a context block to pcre2_match(),
       pcre2_dfa_match(), pcre2_jit_match() or pcre2_substitute(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that the matching and substitution functions behave
       correctly in this case (they use default values).
       This modifier cannot be used with the find_limits,
       find_limits_noheap, or substitute_callout modifiers.

       Similarly, for testing purposes, if the null_subject or
       null_replacement modifier is set, the subject or replacement string
       pointers are passed as NULL, respectively, to the relevant
       functions.


THE ALTERNATIVE MATCHING FUNCTION

       By default, pcre2test uses the standard PCRE2 matching function,
       pcre2_match(), to match each subject line. PCRE2 also supports an
       alternative matching function, pcre2_dfa_match(), which operates in
       a different way, and has some restrictions. The differences between
       the two functions are described in the pcre2matching documentation.

       If the dfa modifier is set, the alternative matching function is
       used. This function finds all possible matches at a given point in
       the subject. If, however, the dfa_shortest modifier is set,
       processing stops after the first match is found. This is always the
       shortest possible match.


DEFAULT OUTPUT FROM pcre2test

       This section describes the output when the normal matching function,
       pcre2_match(), is being used.

       When a match succeeds, pcre2test outputs the list of captured
       substrings, starting with number 0 for the string that matched the
       whole pattern. Otherwise, it outputs "No match" when the return is
       PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
       matching substring when the return is PCRE2_ERROR_PARTIAL. (Note
       that this is the entire substring that was inspected during the
       partial match; it may include characters before the actual match
       start if a lookbehind assertion, \K, \b, or \B was involved.)

       For any other return, pcre2test outputs the PCRE2 negative error
       number and a short descriptive phrase.
       If the error is a failed UTF string check, the code unit offset of
       the start of the failing character is also output. Here is an
       example of an interactive pcre2test run.

         $ pcre2test
         PCRE2 version 10.22 2016-07-29

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       Unset capturing substrings that are not followed by one that is set
       are not shown by pcre2test unless the allcaptures modifier is
       specified. In the following example, there are two capturing
       substrings, but when the first data line is matched, the second,
       unset substring is not shown. An "internal" unset substring is shown
       as "<unset>", as for the second data line.

           re> /(a)|(b)/
         data> a
          0: a
          1: a
         data> b
          0: b
          1: <unset>
          2: b

       If the strings contain any non-printing characters, they are output
       as \xhh escapes if the value is less than 256 and UTF mode is not
       set. Otherwise they are output as \x{hh...} escapes. See below for
       the definition of non-printing characters. If the aftertext modifier
       is set, the output for substring 0 is followed by the rest of the
       subject string, identified by "0+" like this:

           re> /cat/aftertext
         data> cataract
          0: cat
          0+ aract

       If global matching is requested, the results of successive matching
       attempts are output in sequence, like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No match" is output only if the first match attempt fails.
       Here is an example of a failure message (the offset 4 that is
       specified by the offset modifier is past the end of the subject
       string):

           re> /xyz/
         data> xyz\=offset=4
         Error -24 (bad offset value)

       Note that whereas patterns can be continued over several lines (a
       plain ">" prompt is used for continuations), subject lines may not.
       However, newlines can be included in a subject by means of the \n
       escape (or \r, \r\n, etc., depending on the newline sequence
       setting).


OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre2_dfa_match(), is used,
       the output consists of a list of all the matches that start at the
       first point in the subject where there is at least one match. For
       example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\=dfa
          0: tangerine
          1: tang
          2: tan

       Using the normal matching function on this data finds only "tang".
       The longest matching string is always given first (and numbered
       zero). After a PCRE2_ERROR_PARTIAL return, the output is "Partial
       match:", followed by the partially matching substring. Note that
       this is the entire substring that was inspected during the partial
       match; it may include characters before the actual match start if a
       lookbehind assertion, \b, or \B was involved. (\K is not supported
       for DFA matching.)

       If global matching is requested, the search for further matches
       resumes at the end of the longest match. For example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\=dfa
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       The alternative matching function does not support substring
       capture, so the modifiers that are concerned with captured
       substrings are not relevant.
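
       The ordering rules described in this section (all matches at the
       first matching point, longest first, with a global search resuming
       at the end of the longest match) can be mimicked for the literal
       alternatives of the example with a short Python sketch. This is
       only an emulation for non-empty literal alternatives, not a DFA
       matcher and not PCRE2 code; dfa_style_matches() and
       dfa_style_global() are hypothetical names.

```python
def dfa_style_matches(alternatives, subject, start=0):
    """Find the first offset >= start where any alternative matches.

    Returns (offset, matches) with matches ordered longest first, or
    None if no alternative matches anywhere. Alternatives are assumed
    to be distinct, non-empty literal strings.
    """
    for offset in range(start, len(subject)):
        found = [alt for alt in alternatives if subject.startswith(alt, offset)]
        if found:
            return offset, sorted(set(found), key=len, reverse=True)
    return None


def dfa_style_global(alternatives, subject):
    """Repeat the search, resuming at the end of each longest match."""
    results = []
    start = 0
    while True:
        hit = dfa_style_matches(alternatives, subject, start)
        if hit is None:
            return results
        offset, matches = hit
        results.append(matches)
        start = offset + len(matches[0])  # end of the longest match
```

       Called with the alternatives "tang", "tangerine", and "tan" on the
       subject "yellow tangerine and tangy sultana", dfa_style_global()
       yields the same three groups, in the same order, as the global
       example above.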


RESTARTING AFTER A PARTIAL MATCH

       When the alternative matching function has given the
       PCRE2_ERROR_PARTIAL return, indicating that the subject partially
       matched the pattern, you can restart the match with additional
       subject data by means of the dfa_restart modifier. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=ps,dfa
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       For further information about partial matching, see the pcre2partial
       documentation.


CALLOUTS

       If the pattern contains any callout requests, pcre2test's callout
       function is called during matching unless callout_none is specified.
       This works with both matching functions, and with JIT, though there
       are some differences in behaviour. The output for callouts with
       numerical arguments and those with string arguments is slightly
       different.

   Callouts with numerical arguments

       By default, the callout function displays the callout number, the
       start and current positions in the subject text at the callout time,
       and the next pattern item to be tested. For example:

         --->pqrabcdef
           0    ^  ^     \d

       This output indicates that callout number 0 occurred for a match
       attempt starting at the fourth character of the subject string, when
       the pointer was at the seventh character, and when the next pattern
       item was \d. Just one circumflex is output if the start and current
       positions are the same, or if the current position precedes the
       start position, which can happen if the callout is in a lookbehind
       assertion.

       Callouts numbered 255 are assumed to be automatic callouts, inserted
       as a result of the auto_callout pattern modifier. In this case,
       instead of showing the callout number, the offset in the pattern,
       preceded by a plus, is output.
       For example:

           re> /\d?[A-E]\*/auto_callout
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       If a pattern contains (*MARK) items, an additional line is output
       whenever a change of latest mark is passed to the callout function.
       For example:

           re> /a(*MARK:X)bc/auto_callout
         data> abc
         --->abc
          +0 ^       a
          +1 ^^      (*MARK:X)
         +10 ^^      b
         Latest Mark: X
         +11 ^ ^     c
         +12 ^  ^
          0: abc

       The mark changes between matching "a" and "b", but stays the same
       for the rest of the match, so nothing more is output. If, as a
       result of backtracking, the mark reverts to being unset, the text
       "<unset>" is output.

   Callouts with string arguments

       The output for a callout with a string argument is similar, except
       that instead of outputting a callout number before the position
       indicators, the callout string and its offset in the pattern string
       are output before the reflection of the subject string, and the
       subject string is reflected for each callout. For example:

           re> /^ab(?C'first')cd(?C"second")ef/
         data> abcdefg
         Callout (7): 'first'
         --->abcdefg
             ^ ^         c
         Callout (20): "second"
         --->abcdefg
             ^   ^       e
          0: abcdef

   Callout modifiers

       The callout function in pcre2test returns zero (carry on matching)
       by default, but you can use a callout_fail modifier in a subject
       line to change this and other parameters of the callout (see below).

       If the callout_capture modifier is set, the current captured groups
       are output when a callout occurs. This is useful only for non-DFA
       matching, as pcre2_dfa_match() does not support capturing, so no
       captures are ever shown.

       The normal callout output, showing the callout number or pattern
       offset (as described above) is suppressed if the callout_no_where
       modifier is set.

       When using the interpretive matching function pcre2_match() without
       JIT, setting the callout_extra modifier causes additional output
       from pcre2test's callout function to be generated. For the first
       callout in a match attempt at a new starting position in the
       subject, "New match attempt" is output. If there has been a
       backtrack since the last callout (or start of matching if this is
       the first callout), "Backtrack" is output, followed by "No other
       matching paths" if the backtrack ended the previous match attempt.
       For example:

           re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
         data> aac\=callout_extra
         New match attempt
         --->aac
          +0 ^       (
          +1 ^       a+
          +3 ^ ^     )
          +4 ^ ^     b
         Backtrack
         --->aac
          +3 ^^      )
          +4 ^^      b
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0  ^      (
          +1  ^      a+
          +3  ^^     )
          +4  ^^     b
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0   ^     (
          +1   ^     a+
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0    ^    (
          +1    ^    a+
         No match

       Notice that various optimizations must be turned off if you want all
       possible matching paths to be scanned. If no_start_optimize is not
       used, there is an immediate "no match", without any callouts,
       because the starting optimization fails to find "b" in the subject,
       which it knows must be present for any match. If no_auto_possess is
       not used, the "a+" item is turned into "a++", which reduces the
       number of backtracks.

       The callout_extra modifier has no effect if used with the DFA
       matching function, or with JIT.

   Return values from callouts

       The default return from the callout function is zero, which allows
       matching to continue. The callout_fail modifier can be given one or
       two numbers.
       If there is only one number, 1 is returned instead of 0 (causing
       matching to backtrack) when a callout of that number is reached. If
       two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
       reached and there have been at least <m> callouts. The callout_error
       modifier is similar, except that PCRE2_ERROR_CALLOUT is returned,
       causing the entire matching process to be aborted. If both these
       modifiers are set for the same callout number, callout_error takes
       precedence. Note that callouts with string arguments are always
       given the number zero.

       The callout_data modifier can be given an unsigned or a negative
       number. This is set as the "user data" that is passed to the
       matching function, and passed back when the callout function is
       invoked. Any value other than zero is used as a return from
       pcre2test's callout function.

       Inserting callouts can be helpful when using pcre2test to check
       complicated regular expressions. For further information about
       callouts, see the pcre2callout documentation.


NON-PRINTING CHARACTERS

       When pcre2test is outputting text in the compiled version of a
       pattern, bytes other than 32-126 are always treated as non-printing
       characters and are therefore shown as hex escapes.

       When pcre2test is outputting text that is a matched part of a
       subject string, it behaves in the same way, unless a different
       locale has been set for the pattern (using the locale modifier). In
       this case, the isprint() function is used to distinguish printing
       and non-printing characters.


SAVING AND RESTORING COMPILED PATTERNS

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions. JIT data
       cannot be saved.
       The host on which the patterns are reloaded must be running the same
       version of PCRE2, with the same code unit width, and must also have
       the same endianness, pointer width and PCRE2_SIZE type. Before
       compiled patterns can be saved they must be serialized, that is,
       converted to a stream of bytes. A single byte stream may contain any
       number of compiled patterns, but they must all use the same
       character tables. A single copy of the tables is included in the
       byte stream (its size is 1088 bytes).

       The functions whose names begin with pcre2_serialize_ are used for
       serializing and de-serializing. They are described in the
       pcre2serialize documentation. In this section we describe the
       features of pcre2test that can be used to test these functions.

       Note that "serialization" in PCRE2 does not convert compiled
       patterns to an abstract format like Java or .NET. It just makes a
       reloadable byte code stream. Hence the restrictions on reloading
       mentioned above.

       In pcre2test, when a pattern with the push modifier is successfully
       compiled, it is pushed onto a stack of compiled patterns, and
       pcre2test expects the next line to contain a new pattern (or
       command) instead of a subject line. By contrast, the pushcopy
       modifier causes a copy of the compiled pattern to be stacked,
       leaving the original available for immediate matching. By using push
       and/or pushcopy, a number of patterns can be compiled and retained.
       These modifiers are incompatible with posix, and control modifiers
       that act at match time are ignored (with a message) for the stacked
       patterns. The jitverify modifier applies only at compile time.

       The command

         #save <filename>

       causes all the stacked patterns to be serialized and the result
       written to the named file. Afterwards, all the stacked patterns are
       freed.
       The command

         #load <filename>

       reads the data in the file, and then arranges for it to be
       de-serialized, with the resulting compiled patterns added to the
       pattern stack. The pattern on the top of the stack can be retrieved
       by the #pop command, which must be followed by lines of subjects
       that are to be matched with the pattern, terminated as usual by an
       empty line or end of file. This command may be followed by a
       modifier list containing only control modifiers that act after a
       pattern has been compiled. In particular, hex, posix, posix_nosub,
       push, and pushcopy are not allowed, nor are any option-setting
       modifiers. The JIT modifiers are, however, permitted. Here is an
       example that saves and reloads two patterns.

         /abc/push
         /xyz/push
         #save tempfile
         #load tempfile
         #pop info
         xyz

         #pop jit,bincode
         abc

       If jitverify is used with #pop, it does not automatically imply jit,
       which is different behaviour from when it is used on a pattern.

       The #popcopy command is analogous to the pushcopy modifier in that
       it makes current a copy of the topmost stack pattern, leaving the
       original still on the stack.


SEE ALSO

       pcre2(3), pcre2api(3), pcre2callout(3), pcre2jit(3),
       pcre2matching(3), pcre2partial(3), pcre2pattern(3),
       pcre2serialize(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 27 July 2022
       Copyright (c) 1997-2022 University of Cambridge.