17db96d56Sopenharmony_ci# Adding or extending a family of adaptive instructions.
27db96d56Sopenharmony_ci
37db96d56Sopenharmony_ci## Families of instructions
47db96d56Sopenharmony_ci
57db96d56Sopenharmony_ciThe core part of PEP 659 (specializing adaptive interpreter) is the families
67db96d56Sopenharmony_ciof instructions that perform the adaptive specialization.
77db96d56Sopenharmony_ci
87db96d56Sopenharmony_ciA family of instructions has the following fundamental properties:
97db96d56Sopenharmony_ci
107db96d56Sopenharmony_ci* It corresponds to a single instruction in the code
117db96d56Sopenharmony_ci  generated by the bytecode compiler.
127db96d56Sopenharmony_ci* It has a single adaptive instruction that records an execution count and,
137db96d56Sopenharmony_ci  at regular intervals, attempts to specialize itself. If not specializing,
147db96d56Sopenharmony_ci  it executes the non-adaptive instruction.
157db96d56Sopenharmony_ci* It has at least one specialized form of the instruction that is tailored
167db96d56Sopenharmony_ci  for a particular value or set of values at runtime.
177db96d56Sopenharmony_ci* All members of the family must have the same number of inline cache entries,
187db96d56Sopenharmony_ci  to ensure correct execution.
197db96d56Sopenharmony_ci  Individual family members do not need to use all of the entries,
207db96d56Sopenharmony_ci  but must skip over any unused entries when executing.
217db96d56Sopenharmony_ci
227db96d56Sopenharmony_ciThe current implementation also requires the following,
237db96d56Sopenharmony_cialthough these are not fundamental and may change:
247db96d56Sopenharmony_ci
257db96d56Sopenharmony_ci* All families uses one or more inline cache entries,
267db96d56Sopenharmony_ci  the first entry is always the counter.
277db96d56Sopenharmony_ci* All instruction names should start with the name of the non-adaptive
287db96d56Sopenharmony_ci  instruction.
297db96d56Sopenharmony_ci* The adaptive instruction should end in `_ADAPTIVE`.
307db96d56Sopenharmony_ci* Specialized forms should have names describing their specialization.
317db96d56Sopenharmony_ci
327db96d56Sopenharmony_ci## Example family
337db96d56Sopenharmony_ci
347db96d56Sopenharmony_ciThe `LOAD_GLOBAL` instruction (in Python/ceval.c) already has an adaptive
357db96d56Sopenharmony_cifamily that serves as a relatively simple example.
367db96d56Sopenharmony_ci
377db96d56Sopenharmony_ciThe `LOAD_GLOBAL_ADAPTIVE` instruction performs adaptive specialization,
387db96d56Sopenharmony_cicalling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
397db96d56Sopenharmony_ci
407db96d56Sopenharmony_ciThere are two specialized instructions in the family, `LOAD_GLOBAL_MODULE`
417db96d56Sopenharmony_ciwhich is specialized for global variables in the module, and
427db96d56Sopenharmony_ci`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables.
437db96d56Sopenharmony_ci
447db96d56Sopenharmony_ci## Performance analysis
457db96d56Sopenharmony_ci
467db96d56Sopenharmony_ciThe benefit of a specialization can be assessed with the following formula:
477db96d56Sopenharmony_ci`Tbase/Tadaptive`.
487db96d56Sopenharmony_ci
497db96d56Sopenharmony_ciWhere `Tbase` is the mean time to execute the base instruction,
507db96d56Sopenharmony_ciand `Tadaptive` is the mean time to execute the specialized and adaptive forms.
517db96d56Sopenharmony_ci
527db96d56Sopenharmony_ci`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
537db96d56Sopenharmony_ci
547db96d56Sopenharmony_ci`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
557db96d56Sopenharmony_cithe number of times that instruction is executed.
567db96d56Sopenharmony_ci`Tmiss` is the time to process a miss, including de-optimzation
577db96d56Sopenharmony_ciand the time to execute the base instruction.
587db96d56Sopenharmony_ci
597db96d56Sopenharmony_ciThe ideal situation is where misses are rare and the specialized
607db96d56Sopenharmony_ciforms are much faster than the base instruction.
617db96d56Sopenharmony_ci`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
627db96d56Sopenharmony_ciIn which case we have `Tadaptive ≈ sum(Ti*Ni)`.
637db96d56Sopenharmony_ciSince we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
647db96d56Sopenharmony_ci`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
657db96d56Sopenharmony_ciwe would expect the specialization of `LOAD_GLOBAL` to be profitable.
667db96d56Sopenharmony_ci
677db96d56Sopenharmony_ci## Design considerations
687db96d56Sopenharmony_ci
697db96d56Sopenharmony_ciWhile `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
707db96d56Sopenharmony_ci`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
717db96d56Sopenharmony_cilow for all specialized instructions and `Nmiss` as low as possible.
727db96d56Sopenharmony_ci
737db96d56Sopenharmony_ciKeeping `Nmiss` low means that there should be specializations for almost
747db96d56Sopenharmony_ciall values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
757db96d56Sopenharmony_cikeeping `Ti` low which means minimizing branches and dependent memory
767db96d56Sopenharmony_ciaccesses (pointer chasing). These two objectives may be in conflict,
777db96d56Sopenharmony_cirequiring judgement and experimentation to design the family of instructions.
787db96d56Sopenharmony_ci
797db96d56Sopenharmony_ciThe size of the inline cache should as small as possible,
807db96d56Sopenharmony_ciwithout impairing performance, to reduce the number of
817db96d56Sopenharmony_ci`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
827db96d56Sopenharmony_ci
837db96d56Sopenharmony_ci### Gathering data
847db96d56Sopenharmony_ci
857db96d56Sopenharmony_ciBefore choosing how to specialize an instruction, it is important to gather
867db96d56Sopenharmony_cisome data. What are the patterns of usage of the base instruction?
877db96d56Sopenharmony_ciData can best be gathered by instrumenting the interpreter. Since a
887db96d56Sopenharmony_cispecialization function and adaptive instruction are going to be required,
897db96d56Sopenharmony_ciinstrumentation can most easily be added in the specialization function.
907db96d56Sopenharmony_ci
917db96d56Sopenharmony_ci### Choice of specializations
927db96d56Sopenharmony_ci
937db96d56Sopenharmony_ciThe performance of the specializing adaptive interpreter relies on the
947db96d56Sopenharmony_ciquality of specialization and keeping the overhead of specialization low.
957db96d56Sopenharmony_ci
967db96d56Sopenharmony_ciSpecialized instructions must be fast. In order to be fast,
977db96d56Sopenharmony_cispecialized instructions should be tailored for a particular
987db96d56Sopenharmony_ciset of values that allows them to:
997db96d56Sopenharmony_ci1. Verify that incoming value is part of that set with low overhead.
1007db96d56Sopenharmony_ci2. Perform the operation quickly.
1017db96d56Sopenharmony_ci
1027db96d56Sopenharmony_ciThis requires that the set of values is chosen such that membership can be
1037db96d56Sopenharmony_citested quickly and that membership is sufficient to allow the operation to
1047db96d56Sopenharmony_ciperformed quickly.
1057db96d56Sopenharmony_ci
1067db96d56Sopenharmony_ciFor example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
1077db96d56Sopenharmony_cidictionaries that have a keys with the expected version.
1087db96d56Sopenharmony_ci
1097db96d56Sopenharmony_ciThis can be tested quickly:
1107db96d56Sopenharmony_ci* `globals->keys->dk_version == expected_version`
1117db96d56Sopenharmony_ci
1127db96d56Sopenharmony_ciand the operation can be performed quickly:
1137db96d56Sopenharmony_ci* `value = entries[cache->index].me_value;`.
1147db96d56Sopenharmony_ci
1157db96d56Sopenharmony_ciBecause it is impossible to measure the performance of an instruction without
1167db96d56Sopenharmony_cialso measuring unrelated factors, the assessment of the quality of a
1177db96d56Sopenharmony_cispecialization will require some judgement.
1187db96d56Sopenharmony_ci
1197db96d56Sopenharmony_ciAs a general rule, specialized instructions should be much faster than the
1207db96d56Sopenharmony_cibase instruction.
1217db96d56Sopenharmony_ci
1227db96d56Sopenharmony_ci### Implementation of specialized instructions
1237db96d56Sopenharmony_ci
1247db96d56Sopenharmony_ciIn general, specialized instructions should be implemented in two parts:
1257db96d56Sopenharmony_ci1. A sequence of guards, each of the form
1267db96d56Sopenharmony_ci  `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
1277db96d56Sopenharmony_ci2. The operation, which should ideally have no branches and
1287db96d56Sopenharmony_ci  a minimum number of dependent memory accesses.
1297db96d56Sopenharmony_ci
1307db96d56Sopenharmony_ciIn practice, the parts may overlap, as data required for guards
1317db96d56Sopenharmony_cican be re-used in the operation.
1327db96d56Sopenharmony_ci
1337db96d56Sopenharmony_ciIf there are branches in the operation, then consider further specialization
1347db96d56Sopenharmony_cito eliminate the branches.
1357db96d56Sopenharmony_ci
1367db96d56Sopenharmony_ci### Maintaining stats
1377db96d56Sopenharmony_ci
1387db96d56Sopenharmony_ciFinally, take care that stats are gather correctly.
1397db96d56Sopenharmony_ciAfter the last `DEOPT_IF` has passed, a hit should be recorded with
1407db96d56Sopenharmony_ci`STAT_INC(BASE_INSTRUCTION, hit)`.
1417db96d56Sopenharmony_ciAfter a optimization has been deferred in the `ADAPTIVE` form,
1427db96d56Sopenharmony_cithat should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
143