11cb0ef41Sopenharmony_ci// Copyright 2020 the V8 project authors. All rights reserved.
21cb0ef41Sopenharmony_ci// Use of this source code is governed by a BSD-style license that can be
31cb0ef41Sopenharmony_ci// found in the LICENSE file.
41cb0ef41Sopenharmony_ci
51cb0ef41Sopenharmony_ci#ifndef V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_
61cb0ef41Sopenharmony_ci#define V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_
71cb0ef41Sopenharmony_ci
81cb0ef41Sopenharmony_ci#include <ios>
91cb0ef41Sopenharmony_ci
101cb0ef41Sopenharmony_ci#include "src/base/strings.h"
111cb0ef41Sopenharmony_ci#include "src/base/vector.h"
121cb0ef41Sopenharmony_ci#include "src/regexp/regexp-ast.h"
131cb0ef41Sopenharmony_ci
141cb0ef41Sopenharmony_ci// ----------------------------------------------------------------------------
151cb0ef41Sopenharmony_ci// Definition and semantics of the EXPERIMENTAL bytecode.
161cb0ef41Sopenharmony_ci// Background:
171cb0ef41Sopenharmony_ci// - Russ Cox's blog post series on regular expression matching, in particular
181cb0ef41Sopenharmony_ci//   https://swtch.com/~rsc/regexp/regexp2.html
191cb0ef41Sopenharmony_ci// - The re2 regular regexp library: https://github.com/google/re2
201cb0ef41Sopenharmony_ci//
211cb0ef41Sopenharmony_ci// This comment describes the bytecode used by the experimental regexp engine
221cb0ef41Sopenharmony_ci// and its abstract semantics in terms of a VM.  An implementation of the
231cb0ef41Sopenharmony_ci// semantics that avoids exponential runtime can be found in `NfaInterpreter`.
241cb0ef41Sopenharmony_ci//
251cb0ef41Sopenharmony_ci// The experimental bytecode describes a non-deterministic finite automaton. It
261cb0ef41Sopenharmony_ci// runs on a multithreaded virtual machine (VM), i.e. in several threads
271cb0ef41Sopenharmony_ci// concurrently.  (These "threads" don't need to be actual operating system
281cb0ef41Sopenharmony_ci// threads.)  Apart from a list of threads, the VM maintains an immutable
291cb0ef41Sopenharmony_ci// shared input string which threads can read from.  Each thread is given by a
301cb0ef41Sopenharmony_ci// program counter (PC, index of the current instruction), a fixed number of
311cb0ef41Sopenharmony_ci// registers of indices into the input string, and a monotonically increasing
321cb0ef41Sopenharmony_ci// index which represents the current position within the input string.
331cb0ef41Sopenharmony_ci//
341cb0ef41Sopenharmony_ci// For the precise encoding of the instruction set, see the definition `struct
351cb0ef41Sopenharmony_ci// RegExpInstruction` below.  Currently we support the following instructions:
361cb0ef41Sopenharmony_ci// - CONSUME_RANGE: Check whether the codepoint of the current character is
371cb0ef41Sopenharmony_ci//   contained in a non-empty closed interval [min, max] specified in the
381cb0ef41Sopenharmony_ci//   instruction payload.  Abort this thread if false, otherwise advance the
391cb0ef41Sopenharmony_ci//   input position by 1 and continue with the next instruction.
401cb0ef41Sopenharmony_ci// - ACCEPT: Stop this thread and signify the end of a match at the current
411cb0ef41Sopenharmony_ci//   input position.
421cb0ef41Sopenharmony_ci// - FORK: If executed by a thread t, spawn a new thread t0 whose register
431cb0ef41Sopenharmony_ci//   values and input position agree with those of t, but whose PC value is set
441cb0ef41Sopenharmony_ci//   to the value specified in the instruction payload.  The register values of
451cb0ef41Sopenharmony_ci//   t and t0 agree directly after the FORK, but they can diverge.  Thread t
461cb0ef41Sopenharmony_ci//   continues with the instruction directly after the current FORK
471cb0ef41Sopenharmony_ci//   instruction.
481cb0ef41Sopenharmony_ci// - JMP: Instead of incrementing the PC value after execution of this
491cb0ef41Sopenharmony_ci//   instruction by 1, set PC of this thread to the value specified in the
501cb0ef41Sopenharmony_ci//   instruction payload and continue there.
511cb0ef41Sopenharmony_ci// - SET_REGISTER_TO_CP: Set a register specified in the paylod to the current
521cb0ef41Sopenharmony_ci//   position (CP) within the input, then continue with the next instruction.
531cb0ef41Sopenharmony_ci// - CLEAR_REGISTER: Clear the register specified in the payload by resetting
541cb0ef41Sopenharmony_ci//   it to the initial value -1.
551cb0ef41Sopenharmony_ci//
561cb0ef41Sopenharmony_ci// Special care must be exercised with respect to thread priority.  It is
571cb0ef41Sopenharmony_ci// possible that more than one thread executes an ACCEPT statement.  The output
581cb0ef41Sopenharmony_ci// of the program is given by the contents of the matching thread's registers,
591cb0ef41Sopenharmony_ci// so this is ambiguous in case of multiple matches.  To resolve the ambiguity,
601cb0ef41Sopenharmony_ci// every implementation of the VM  must output the match that a backtracking
611cb0ef41Sopenharmony_ci// implementation would output (i.e. behave the same as Irregexp).
621cb0ef41Sopenharmony_ci//
631cb0ef41Sopenharmony_ci// A backtracking implementation of the VM maintains a stack of postponed
641cb0ef41Sopenharmony_ci// threads.  Upon encountering a FORK statement, this VM will create a copy of
651cb0ef41Sopenharmony_ci// the current thread, set the copy's PC value according to the instruction
661cb0ef41Sopenharmony_ci// payload, and push it to the stack of postponed threads.  The VM will then
671cb0ef41Sopenharmony_ci// continue execution of the current thread.
681cb0ef41Sopenharmony_ci//
691cb0ef41Sopenharmony_ci// If at some point a thread t executes a MATCH statement, the VM stops and
701cb0ef41Sopenharmony_ci// outputs the registers of t.  Postponed threads are discarded.  On the other
711cb0ef41Sopenharmony_ci// hand, if a thread t is aborted because some input character didn't pass a
721cb0ef41Sopenharmony_ci// check, then the VM pops the topmost postponed thread and continues execution
731cb0ef41Sopenharmony_ci// with this thread.  If there are no postponed threads, then the VM outputs
741cb0ef41Sopenharmony_ci// failure, i.e. no matches.
751cb0ef41Sopenharmony_ci//
761cb0ef41Sopenharmony_ci// Equivalently, we can describe the behavior of the backtracking VM in terms
771cb0ef41Sopenharmony_ci// of priority: Threads are linearly ordered by priority, and matches generated
781cb0ef41Sopenharmony_ci// by threads with high priority must be preferred over matches generated by
791cb0ef41Sopenharmony_ci// threads with low priority, regardless of the chronological order in which
801cb0ef41Sopenharmony_ci// matches were found.  If a thread t executes a FORK statement and spawns a
811cb0ef41Sopenharmony_ci// thread t0, then the priority of t0 is such that the following holds:
821cb0ef41Sopenharmony_ci// * t0 < t, i.e. t0 has lower priority than t.
831cb0ef41Sopenharmony_ci// * For all threads u such that u != t and u != t0, we have t0 < u iff t < u,
841cb0ef41Sopenharmony_ci//   i.e. the t0 compares to other threads the same as t.
851cb0ef41Sopenharmony_ci// For example, if there are currently 3 threads s, t, u such that s < t < u,
861cb0ef41Sopenharmony_ci// then after t executes a fork, the thread priorities will be s < t0 < t < u.
871cb0ef41Sopenharmony_ci
881cb0ef41Sopenharmony_cinamespace v8 {
891cb0ef41Sopenharmony_cinamespace internal {
901cb0ef41Sopenharmony_ci
911cb0ef41Sopenharmony_ci// Bytecode format.
921cb0ef41Sopenharmony_ci// Currently very simple fixed-size: The opcode is encoded in the first 4
931cb0ef41Sopenharmony_ci// bytes, the payload takes another 4 bytes.
941cb0ef41Sopenharmony_cistruct RegExpInstruction {
951cb0ef41Sopenharmony_ci  enum Opcode : int32_t {
961cb0ef41Sopenharmony_ci    ACCEPT,
971cb0ef41Sopenharmony_ci    ASSERTION,
981cb0ef41Sopenharmony_ci    CLEAR_REGISTER,
991cb0ef41Sopenharmony_ci    CONSUME_RANGE,
1001cb0ef41Sopenharmony_ci    FORK,
1011cb0ef41Sopenharmony_ci    JMP,
1021cb0ef41Sopenharmony_ci    SET_REGISTER_TO_CP,
1031cb0ef41Sopenharmony_ci  };
1041cb0ef41Sopenharmony_ci
1051cb0ef41Sopenharmony_ci  struct Uc16Range {
1061cb0ef41Sopenharmony_ci    base::uc16 min;  // Inclusive.
1071cb0ef41Sopenharmony_ci    base::uc16 max;  // Inclusive.
1081cb0ef41Sopenharmony_ci  };
1091cb0ef41Sopenharmony_ci
1101cb0ef41Sopenharmony_ci  static RegExpInstruction ConsumeRange(base::uc16 min, base::uc16 max) {
1111cb0ef41Sopenharmony_ci    RegExpInstruction result;
1121cb0ef41Sopenharmony_ci    result.opcode = CONSUME_RANGE;
1131cb0ef41Sopenharmony_ci    result.payload.consume_range = Uc16Range{min, max};
1141cb0ef41Sopenharmony_ci    return result;
1151cb0ef41Sopenharmony_ci  }
1161cb0ef41Sopenharmony_ci
1171cb0ef41Sopenharmony_ci  static RegExpInstruction ConsumeAnyChar() {
1181cb0ef41Sopenharmony_ci    return ConsumeRange(0x0000, 0xFFFF);
1191cb0ef41Sopenharmony_ci  }
1201cb0ef41Sopenharmony_ci
1211cb0ef41Sopenharmony_ci  static RegExpInstruction Fail() {
1221cb0ef41Sopenharmony_ci    // This is encoded as the empty CONSUME_RANGE of characters 0xFFFF <= c <=
1231cb0ef41Sopenharmony_ci    // 0x0000.
1241cb0ef41Sopenharmony_ci    return ConsumeRange(0xFFFF, 0x0000);
1251cb0ef41Sopenharmony_ci  }
1261cb0ef41Sopenharmony_ci
1271cb0ef41Sopenharmony_ci  static RegExpInstruction Fork(int32_t alt_index) {
1281cb0ef41Sopenharmony_ci    RegExpInstruction result;
1291cb0ef41Sopenharmony_ci    result.opcode = FORK;
1301cb0ef41Sopenharmony_ci    result.payload.pc = alt_index;
1311cb0ef41Sopenharmony_ci    return result;
1321cb0ef41Sopenharmony_ci  }
1331cb0ef41Sopenharmony_ci
1341cb0ef41Sopenharmony_ci  static RegExpInstruction Jmp(int32_t alt_index) {
1351cb0ef41Sopenharmony_ci    RegExpInstruction result;
1361cb0ef41Sopenharmony_ci    result.opcode = JMP;
1371cb0ef41Sopenharmony_ci    result.payload.pc = alt_index;
1381cb0ef41Sopenharmony_ci    return result;
1391cb0ef41Sopenharmony_ci  }
1401cb0ef41Sopenharmony_ci
1411cb0ef41Sopenharmony_ci  static RegExpInstruction Accept() {
1421cb0ef41Sopenharmony_ci    RegExpInstruction result;
1431cb0ef41Sopenharmony_ci    result.opcode = ACCEPT;
1441cb0ef41Sopenharmony_ci    return result;
1451cb0ef41Sopenharmony_ci  }
1461cb0ef41Sopenharmony_ci
1471cb0ef41Sopenharmony_ci  static RegExpInstruction SetRegisterToCp(int32_t register_index) {
1481cb0ef41Sopenharmony_ci    RegExpInstruction result;
1491cb0ef41Sopenharmony_ci    result.opcode = SET_REGISTER_TO_CP;
1501cb0ef41Sopenharmony_ci    result.payload.register_index = register_index;
1511cb0ef41Sopenharmony_ci    return result;
1521cb0ef41Sopenharmony_ci  }
1531cb0ef41Sopenharmony_ci
1541cb0ef41Sopenharmony_ci  static RegExpInstruction ClearRegister(int32_t register_index) {
1551cb0ef41Sopenharmony_ci    RegExpInstruction result;
1561cb0ef41Sopenharmony_ci    result.opcode = CLEAR_REGISTER;
1571cb0ef41Sopenharmony_ci    result.payload.register_index = register_index;
1581cb0ef41Sopenharmony_ci    return result;
1591cb0ef41Sopenharmony_ci  }
1601cb0ef41Sopenharmony_ci
1611cb0ef41Sopenharmony_ci  static RegExpInstruction Assertion(RegExpAssertion::Type t) {
1621cb0ef41Sopenharmony_ci    RegExpInstruction result;
1631cb0ef41Sopenharmony_ci    result.opcode = ASSERTION;
1641cb0ef41Sopenharmony_ci    result.payload.assertion_type = t;
1651cb0ef41Sopenharmony_ci    return result;
1661cb0ef41Sopenharmony_ci  }
1671cb0ef41Sopenharmony_ci
1681cb0ef41Sopenharmony_ci  Opcode opcode;
1691cb0ef41Sopenharmony_ci  union {
1701cb0ef41Sopenharmony_ci    // Payload of CONSUME_RANGE:
1711cb0ef41Sopenharmony_ci    Uc16Range consume_range;
1721cb0ef41Sopenharmony_ci    // Payload of FORK and JMP, the next/forked program counter (pc):
1731cb0ef41Sopenharmony_ci    int32_t pc;
1741cb0ef41Sopenharmony_ci    // Payload of SET_REGISTER_TO_CP and CLEAR_REGISTER:
1751cb0ef41Sopenharmony_ci    int32_t register_index;
1761cb0ef41Sopenharmony_ci    // Payload of ASSERTION:
1771cb0ef41Sopenharmony_ci    RegExpAssertion::Type assertion_type;
1781cb0ef41Sopenharmony_ci  } payload;
1791cb0ef41Sopenharmony_ci  STATIC_ASSERT(sizeof(payload) == 4);
1801cb0ef41Sopenharmony_ci};
1811cb0ef41Sopenharmony_ciSTATIC_ASSERT(sizeof(RegExpInstruction) == 8);
1821cb0ef41Sopenharmony_ci// TODO(mbid,v8:10765): This is rather wasteful.  We can fit the opcode in 2-3
1831cb0ef41Sopenharmony_ci// bits, so the remaining 29/30 bits can be used as payload.  Problem: The
1841cb0ef41Sopenharmony_ci// payload of CONSUME_RANGE consists of two 16-bit values `min` and `max`, so
1851cb0ef41Sopenharmony_ci// this wouldn't fit.  We could encode the payload of a CONSUME_RANGE
1861cb0ef41Sopenharmony_ci// instruction by the start of the interval and its length instead, and then
1871cb0ef41Sopenharmony_ci// only allows lengths that fit into 14/13 bits.  A longer range can then be
1881cb0ef41Sopenharmony_ci// encoded as a disjunction of smaller ranges.
1891cb0ef41Sopenharmony_ci//
1901cb0ef41Sopenharmony_ci// Another thought: CONSUME_RANGEs are only valid if the payloads are such that
1911cb0ef41Sopenharmony_ci// min <= max. Thus there are
1921cb0ef41Sopenharmony_ci//
1931cb0ef41Sopenharmony_ci//     2^16 + 2^16 - 1 + ... + 1
1941cb0ef41Sopenharmony_ci//   = 2^16 * (2^16 + 1) / 2
1951cb0ef41Sopenharmony_ci//   = 2^31 + 2^15
1961cb0ef41Sopenharmony_ci//
1971cb0ef41Sopenharmony_ci// valid payloads for a CONSUME_RANGE instruction.  If we want to fit
1981cb0ef41Sopenharmony_ci// instructions into 4 bytes, we would still have almost 2^31 instructions left
1991cb0ef41Sopenharmony_ci// over if we encode everything as tight as possible.  For example, we could
2001cb0ef41Sopenharmony_ci// use another 2^29 values for JMP, another 2^29 for FORK, 1 value for ACCEPT,
2011cb0ef41Sopenharmony_ci// and then still have almost 2^30 instructions left over for something like
2021cb0ef41Sopenharmony_ci// zero-width assertions and captures.
2031cb0ef41Sopenharmony_ci
2041cb0ef41Sopenharmony_cistd::ostream& operator<<(std::ostream& os, const RegExpInstruction& inst);
2051cb0ef41Sopenharmony_cistd::ostream& operator<<(std::ostream& os,
2061cb0ef41Sopenharmony_ci                         base::Vector<const RegExpInstruction> insts);
2071cb0ef41Sopenharmony_ci
2081cb0ef41Sopenharmony_ci}  // namespace internal
2091cb0ef41Sopenharmony_ci}  // namespace v8
2101cb0ef41Sopenharmony_ci
2111cb0ef41Sopenharmony_ci#endif  // V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_
212