11cb0ef41Sopenharmony_ci// Copyright 2020 the V8 project authors. All rights reserved. 21cb0ef41Sopenharmony_ci// Use of this source code is governed by a BSD-style license that can be 31cb0ef41Sopenharmony_ci// found in the LICENSE file. 41cb0ef41Sopenharmony_ci 51cb0ef41Sopenharmony_ci#ifndef V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_ 61cb0ef41Sopenharmony_ci#define V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_ 71cb0ef41Sopenharmony_ci 81cb0ef41Sopenharmony_ci#include <ios> 91cb0ef41Sopenharmony_ci 101cb0ef41Sopenharmony_ci#include "src/base/strings.h" 111cb0ef41Sopenharmony_ci#include "src/base/vector.h" 121cb0ef41Sopenharmony_ci#include "src/regexp/regexp-ast.h" 131cb0ef41Sopenharmony_ci 141cb0ef41Sopenharmony_ci// ---------------------------------------------------------------------------- 151cb0ef41Sopenharmony_ci// Definition and semantics of the EXPERIMENTAL bytecode. 161cb0ef41Sopenharmony_ci// Background: 171cb0ef41Sopenharmony_ci// - Russ Cox's blog post series on regular expression matching, in particular 181cb0ef41Sopenharmony_ci// https://swtch.com/~rsc/regexp/regexp2.html 191cb0ef41Sopenharmony_ci// - The re2 regular regexp library: https://github.com/google/re2 201cb0ef41Sopenharmony_ci// 211cb0ef41Sopenharmony_ci// This comment describes the bytecode used by the experimental regexp engine 221cb0ef41Sopenharmony_ci// and its abstract semantics in terms of a VM. An implementation of the 231cb0ef41Sopenharmony_ci// semantics that avoids exponential runtime can be found in `NfaInterpreter`. 241cb0ef41Sopenharmony_ci// 251cb0ef41Sopenharmony_ci// The experimental bytecode describes a non-deterministic finite automaton. It 261cb0ef41Sopenharmony_ci// runs on a multithreaded virtual machine (VM), i.e. in several threads 271cb0ef41Sopenharmony_ci// concurrently. (These "threads" don't need to be actual operating system 281cb0ef41Sopenharmony_ci// threads.) Apart from a list of threads, the VM maintains an immutable 291cb0ef41Sopenharmony_ci// shared input string which threads can read from. Each thread is given by a 301cb0ef41Sopenharmony_ci// program counter (PC, index of the current instruction), a fixed number of 311cb0ef41Sopenharmony_ci// registers of indices into the input string, and a monotonically increasing 321cb0ef41Sopenharmony_ci// index which represents the current position within the input string. 331cb0ef41Sopenharmony_ci// 341cb0ef41Sopenharmony_ci// For the precise encoding of the instruction set, see the definition `struct 351cb0ef41Sopenharmony_ci// RegExpInstruction` below. Currently we support the following instructions: 361cb0ef41Sopenharmony_ci// - CONSUME_RANGE: Check whether the codepoint of the current character is 371cb0ef41Sopenharmony_ci// contained in a non-empty closed interval [min, max] specified in the 381cb0ef41Sopenharmony_ci// instruction payload. Abort this thread if false, otherwise advance the 391cb0ef41Sopenharmony_ci// input position by 1 and continue with the next instruction. 401cb0ef41Sopenharmony_ci// - ACCEPT: Stop this thread and signify the end of a match at the current 411cb0ef41Sopenharmony_ci// input position. 421cb0ef41Sopenharmony_ci// - FORK: If executed by a thread t, spawn a new thread t0 whose register 431cb0ef41Sopenharmony_ci// values and input position agree with those of t, but whose PC value is set 441cb0ef41Sopenharmony_ci// to the value specified in the instruction payload. The register values of 451cb0ef41Sopenharmony_ci// t and t0 agree directly after the FORK, but they can diverge. Thread t 461cb0ef41Sopenharmony_ci// continues with the instruction directly after the current FORK 471cb0ef41Sopenharmony_ci// instruction. 481cb0ef41Sopenharmony_ci// - JMP: Instead of incrementing the PC value after execution of this 491cb0ef41Sopenharmony_ci// instruction by 1, set PC of this thread to the value specified in the 501cb0ef41Sopenharmony_ci// instruction payload and continue there. 511cb0ef41Sopenharmony_ci// - SET_REGISTER_TO_CP: Set a register specified in the paylod to the current 521cb0ef41Sopenharmony_ci// position (CP) within the input, then continue with the next instruction. 531cb0ef41Sopenharmony_ci// - CLEAR_REGISTER: Clear the register specified in the payload by resetting 541cb0ef41Sopenharmony_ci// it to the initial value -1. 551cb0ef41Sopenharmony_ci// 561cb0ef41Sopenharmony_ci// Special care must be exercised with respect to thread priority. It is 571cb0ef41Sopenharmony_ci// possible that more than one thread executes an ACCEPT statement. The output 581cb0ef41Sopenharmony_ci// of the program is given by the contents of the matching thread's registers, 591cb0ef41Sopenharmony_ci// so this is ambiguous in case of multiple matches. To resolve the ambiguity, 601cb0ef41Sopenharmony_ci// every implementation of the VM must output the match that a backtracking 611cb0ef41Sopenharmony_ci// implementation would output (i.e. behave the same as Irregexp). 621cb0ef41Sopenharmony_ci// 631cb0ef41Sopenharmony_ci// A backtracking implementation of the VM maintains a stack of postponed 641cb0ef41Sopenharmony_ci// threads. Upon encountering a FORK statement, this VM will create a copy of 651cb0ef41Sopenharmony_ci// the current thread, set the copy's PC value according to the instruction 661cb0ef41Sopenharmony_ci// payload, and push it to the stack of postponed threads. The VM will then 671cb0ef41Sopenharmony_ci// continue execution of the current thread. 681cb0ef41Sopenharmony_ci// 691cb0ef41Sopenharmony_ci// If at some point a thread t executes a MATCH statement, the VM stops and 701cb0ef41Sopenharmony_ci// outputs the registers of t. Postponed threads are discarded. On the other 711cb0ef41Sopenharmony_ci// hand, if a thread t is aborted because some input character didn't pass a 721cb0ef41Sopenharmony_ci// check, then the VM pops the topmost postponed thread and continues execution 731cb0ef41Sopenharmony_ci// with this thread. If there are no postponed threads, then the VM outputs 741cb0ef41Sopenharmony_ci// failure, i.e. no matches. 751cb0ef41Sopenharmony_ci// 761cb0ef41Sopenharmony_ci// Equivalently, we can describe the behavior of the backtracking VM in terms 771cb0ef41Sopenharmony_ci// of priority: Threads are linearly ordered by priority, and matches generated 781cb0ef41Sopenharmony_ci// by threads with high priority must be preferred over matches generated by 791cb0ef41Sopenharmony_ci// threads with low priority, regardless of the chronological order in which 801cb0ef41Sopenharmony_ci// matches were found. If a thread t executes a FORK statement and spawns a 811cb0ef41Sopenharmony_ci// thread t0, then the priority of t0 is such that the following holds: 821cb0ef41Sopenharmony_ci// * t0 < t, i.e. t0 has lower priority than t. 831cb0ef41Sopenharmony_ci// * For all threads u such that u != t and u != t0, we have t0 < u iff t < u, 841cb0ef41Sopenharmony_ci// i.e. the t0 compares to other threads the same as t. 851cb0ef41Sopenharmony_ci// For example, if there are currently 3 threads s, t, u such that s < t < u, 861cb0ef41Sopenharmony_ci// then after t executes a fork, the thread priorities will be s < t0 < t < u. 871cb0ef41Sopenharmony_ci 881cb0ef41Sopenharmony_cinamespace v8 { 891cb0ef41Sopenharmony_cinamespace internal { 901cb0ef41Sopenharmony_ci 911cb0ef41Sopenharmony_ci// Bytecode format. 921cb0ef41Sopenharmony_ci// Currently very simple fixed-size: The opcode is encoded in the first 4 931cb0ef41Sopenharmony_ci// bytes, the payload takes another 4 bytes. 941cb0ef41Sopenharmony_cistruct RegExpInstruction { 951cb0ef41Sopenharmony_ci enum Opcode : int32_t { 961cb0ef41Sopenharmony_ci ACCEPT, 971cb0ef41Sopenharmony_ci ASSERTION, 981cb0ef41Sopenharmony_ci CLEAR_REGISTER, 991cb0ef41Sopenharmony_ci CONSUME_RANGE, 1001cb0ef41Sopenharmony_ci FORK, 1011cb0ef41Sopenharmony_ci JMP, 1021cb0ef41Sopenharmony_ci SET_REGISTER_TO_CP, 1031cb0ef41Sopenharmony_ci }; 1041cb0ef41Sopenharmony_ci 1051cb0ef41Sopenharmony_ci struct Uc16Range { 1061cb0ef41Sopenharmony_ci base::uc16 min; // Inclusive. 1071cb0ef41Sopenharmony_ci base::uc16 max; // Inclusive. 1081cb0ef41Sopenharmony_ci }; 1091cb0ef41Sopenharmony_ci 1101cb0ef41Sopenharmony_ci static RegExpInstruction ConsumeRange(base::uc16 min, base::uc16 max) { 1111cb0ef41Sopenharmony_ci RegExpInstruction result; 1121cb0ef41Sopenharmony_ci result.opcode = CONSUME_RANGE; 1131cb0ef41Sopenharmony_ci result.payload.consume_range = Uc16Range{min, max}; 1141cb0ef41Sopenharmony_ci return result; 1151cb0ef41Sopenharmony_ci } 1161cb0ef41Sopenharmony_ci 1171cb0ef41Sopenharmony_ci static RegExpInstruction ConsumeAnyChar() { 1181cb0ef41Sopenharmony_ci return ConsumeRange(0x0000, 0xFFFF); 1191cb0ef41Sopenharmony_ci } 1201cb0ef41Sopenharmony_ci 1211cb0ef41Sopenharmony_ci static RegExpInstruction Fail() { 1221cb0ef41Sopenharmony_ci // This is encoded as the empty CONSUME_RANGE of characters 0xFFFF <= c <= 1231cb0ef41Sopenharmony_ci // 0x0000. 1241cb0ef41Sopenharmony_ci return ConsumeRange(0xFFFF, 0x0000); 1251cb0ef41Sopenharmony_ci } 1261cb0ef41Sopenharmony_ci 1271cb0ef41Sopenharmony_ci static RegExpInstruction Fork(int32_t alt_index) { 1281cb0ef41Sopenharmony_ci RegExpInstruction result; 1291cb0ef41Sopenharmony_ci result.opcode = FORK; 1301cb0ef41Sopenharmony_ci result.payload.pc = alt_index; 1311cb0ef41Sopenharmony_ci return result; 1321cb0ef41Sopenharmony_ci } 1331cb0ef41Sopenharmony_ci 1341cb0ef41Sopenharmony_ci static RegExpInstruction Jmp(int32_t alt_index) { 1351cb0ef41Sopenharmony_ci RegExpInstruction result; 1361cb0ef41Sopenharmony_ci result.opcode = JMP; 1371cb0ef41Sopenharmony_ci result.payload.pc = alt_index; 1381cb0ef41Sopenharmony_ci return result; 1391cb0ef41Sopenharmony_ci } 1401cb0ef41Sopenharmony_ci 1411cb0ef41Sopenharmony_ci static RegExpInstruction Accept() { 1421cb0ef41Sopenharmony_ci RegExpInstruction result; 1431cb0ef41Sopenharmony_ci result.opcode = ACCEPT; 1441cb0ef41Sopenharmony_ci return result; 1451cb0ef41Sopenharmony_ci } 1461cb0ef41Sopenharmony_ci 1471cb0ef41Sopenharmony_ci static RegExpInstruction SetRegisterToCp(int32_t register_index) { 1481cb0ef41Sopenharmony_ci RegExpInstruction result; 1491cb0ef41Sopenharmony_ci result.opcode = SET_REGISTER_TO_CP; 1501cb0ef41Sopenharmony_ci result.payload.register_index = register_index; 1511cb0ef41Sopenharmony_ci return result; 1521cb0ef41Sopenharmony_ci } 1531cb0ef41Sopenharmony_ci 1541cb0ef41Sopenharmony_ci static RegExpInstruction ClearRegister(int32_t register_index) { 1551cb0ef41Sopenharmony_ci RegExpInstruction result; 1561cb0ef41Sopenharmony_ci result.opcode = CLEAR_REGISTER; 1571cb0ef41Sopenharmony_ci result.payload.register_index = register_index; 1581cb0ef41Sopenharmony_ci return result; 1591cb0ef41Sopenharmony_ci } 1601cb0ef41Sopenharmony_ci 1611cb0ef41Sopenharmony_ci static RegExpInstruction Assertion(RegExpAssertion::Type t) { 1621cb0ef41Sopenharmony_ci RegExpInstruction result; 1631cb0ef41Sopenharmony_ci result.opcode = ASSERTION; 1641cb0ef41Sopenharmony_ci result.payload.assertion_type = t; 1651cb0ef41Sopenharmony_ci return result; 1661cb0ef41Sopenharmony_ci } 1671cb0ef41Sopenharmony_ci 1681cb0ef41Sopenharmony_ci Opcode opcode; 1691cb0ef41Sopenharmony_ci union { 1701cb0ef41Sopenharmony_ci // Payload of CONSUME_RANGE: 1711cb0ef41Sopenharmony_ci Uc16Range consume_range; 1721cb0ef41Sopenharmony_ci // Payload of FORK and JMP, the next/forked program counter (pc): 1731cb0ef41Sopenharmony_ci int32_t pc; 1741cb0ef41Sopenharmony_ci // Payload of SET_REGISTER_TO_CP and CLEAR_REGISTER: 1751cb0ef41Sopenharmony_ci int32_t register_index; 1761cb0ef41Sopenharmony_ci // Payload of ASSERTION: 1771cb0ef41Sopenharmony_ci RegExpAssertion::Type assertion_type; 1781cb0ef41Sopenharmony_ci } payload; 1791cb0ef41Sopenharmony_ci STATIC_ASSERT(sizeof(payload) == 4); 1801cb0ef41Sopenharmony_ci}; 1811cb0ef41Sopenharmony_ciSTATIC_ASSERT(sizeof(RegExpInstruction) == 8); 1821cb0ef41Sopenharmony_ci// TODO(mbid,v8:10765): This is rather wasteful. We can fit the opcode in 2-3 1831cb0ef41Sopenharmony_ci// bits, so the remaining 29/30 bits can be used as payload. Problem: The 1841cb0ef41Sopenharmony_ci// payload of CONSUME_RANGE consists of two 16-bit values `min` and `max`, so 1851cb0ef41Sopenharmony_ci// this wouldn't fit. We could encode the payload of a CONSUME_RANGE 1861cb0ef41Sopenharmony_ci// instruction by the start of the interval and its length instead, and then 1871cb0ef41Sopenharmony_ci// only allows lengths that fit into 14/13 bits. A longer range can then be 1881cb0ef41Sopenharmony_ci// encoded as a disjunction of smaller ranges. 1891cb0ef41Sopenharmony_ci// 1901cb0ef41Sopenharmony_ci// Another thought: CONSUME_RANGEs are only valid if the payloads are such that 1911cb0ef41Sopenharmony_ci// min <= max. Thus there are 1921cb0ef41Sopenharmony_ci// 1931cb0ef41Sopenharmony_ci// 2^16 + 2^16 - 1 + ... + 1 1941cb0ef41Sopenharmony_ci// = 2^16 * (2^16 + 1) / 2 1951cb0ef41Sopenharmony_ci// = 2^31 + 2^15 1961cb0ef41Sopenharmony_ci// 1971cb0ef41Sopenharmony_ci// valid payloads for a CONSUME_RANGE instruction. If we want to fit 1981cb0ef41Sopenharmony_ci// instructions into 4 bytes, we would still have almost 2^31 instructions left 1991cb0ef41Sopenharmony_ci// over if we encode everything as tight as possible. For example, we could 2001cb0ef41Sopenharmony_ci// use another 2^29 values for JMP, another 2^29 for FORK, 1 value for ACCEPT, 2011cb0ef41Sopenharmony_ci// and then still have almost 2^30 instructions left over for something like 2021cb0ef41Sopenharmony_ci// zero-width assertions and captures. 2031cb0ef41Sopenharmony_ci 2041cb0ef41Sopenharmony_cistd::ostream& operator<<(std::ostream& os, const RegExpInstruction& inst); 2051cb0ef41Sopenharmony_cistd::ostream& operator<<(std::ostream& os, 2061cb0ef41Sopenharmony_ci base::Vector<const RegExpInstruction> insts); 2071cb0ef41Sopenharmony_ci 2081cb0ef41Sopenharmony_ci} // namespace internal 2091cb0ef41Sopenharmony_ci} // namespace v8 2101cb0ef41Sopenharmony_ci 2111cb0ef41Sopenharmony_ci#endif // V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_ 212