Specialization is typically done in the context of a JIT compiler, but research shows specialization in an interpreter can boost performance significantly, even outperforming a naive compiler.
That is, research has found that specialization done purely at the interpreter level can yield significant performance gains, sometimes even exceeding those of naive JIT schemes. The author cites one of his own earlier papers here; interested readers can find it in the references section of PEP 659.
This makes it clear that PEP 659 does not include a JIT compiler. Simply put, it generates no code; it merely moves code around. All possible specializations must be enumerated and their code prepared ahead of time; when matching conditions are observed, the bytecode is swapped for the specialized form, and when the conditions no longer hold, it must fall back gracefully to the unspecialized code to preserve program correctness.
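To make the swap-and-fall-back idea concrete, here is a minimal, self-contained toy in C (none of this is CPython source; the ADD/ADD_INT opcodes and all names are invented for illustration): a generic ADD handler rewrites its own instruction into an int-only fast path once it has seen two ints, and the fast path rewrites itself back the moment its guard fails.

#include <stdio.h>

enum { ADD, ADD_INT };  /* ADD_INT is a hypothetical specialized form of ADD */

typedef struct { int op; } Instr;
typedef struct { int is_int; int ival; } Value;

static Value add_generic(Value a, Value b) {
    Value r = { a.is_int && b.is_int, a.ival + b.ival };
    return r;
}

static Value execute(Instr *instr, Value a, Value b) {
    switch (instr->op) {
    case ADD:
        if (a.is_int && b.is_int) {
            instr->op = ADD_INT;   /* conditions observed: specialize in place */
        }
        return add_generic(a, b);
    case ADD_INT:
        if (!(a.is_int && b.is_int)) {
            instr->op = ADD;       /* guard failed: fall back gracefully */
            return add_generic(a, b);
        }
        return (Value){ 1, a.ival + b.ival };  /* fast path, no type checks */
    }
    return a;
}

int main(void) {
    Instr i = { ADD };
    Value x = { 1, 2 }, y = { 1, 3 };
    execute(&i, x, y);                 /* rewrites ADD -> ADD_INT */
    printf("opcode after warm call: %d\n", i.op);
    Value z = { 0, 0 };
    execute(&i, x, z);                 /* guard fails, back to ADD */
    printf("opcode after deopt: %d\n", i.op);
    return 0;
}

The real interpreter performs this same kind of in-place opcode rewriting, just across the full bytecode instruction set and with per-instruction caches attached.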
To make it feasible to enumerate the specializations and to switch code back and forth, a suitable optimization granularity has to be chosen. In the proposal's own words:
By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, we get a faster interpreter that also generates profiling information for more sophisticated optimizations in the future.
Typical optimizations for virtual machines are expensive, so a long "warm up" time is required to gain confidence that the cost of optimization is justified.

Specializing at the granularity of individual bytecode instructions keeps each optimization cheap, so only a short warm-up is needed before it pays off. CPython tracks that warm-up directly on the code object:
/* Bytecode object */
struct PyCodeObject {
    PyObject_HEAD
    // The hottest fields (in the eval loop) are grouped here at the top.
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    // ...
    int co_warmup;              /* Warmup counter for quickening */
    // ...
    /* Quickened instructions and cache, or NULL
       This should be treated as opaque by all code except the specializer and
       interpreter. */
    union _cache_or_instruction *co_quickened;
};
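For reference, the bookkeeping around co_warmup in the CPython development branch looked roughly like the following sketch (names and values follow the shape of the branch at the time, but treat them as illustrative rather than authoritative). The counter starts at a small negative value and is incremented as the code object runs; once it reaches zero, the object counts as warm and is quickened.

#define QUICKENING_WARMUP_DELAY 8
/* The counter starts negative and counts up, so "warm" is a cheap == 0 test. */
#define QUICKENING_INITIAL_WARMUP_VALUE (-QUICKENING_WARMUP_DELAY)
/* A positive value can never count back to zero: "never quicken this code". */
#define QUICKENING_WARMUP_COLDEST 1

static inline void
PyCodeObject_IncrementWarmup(PyCodeObject *co)
{
    co->co_warmup++;
}

/* Checked by the interpreter to decide when to quicken a code object. */
static inline int
PyCodeObject_IsWarmedUp(PyCodeObject *co)
{
    return (co->co_warmup == 0);
}

The QUICKENING_WARMUP_COLDEST trick explains a line we will meet again in _Py_Quicken below: parking the counter at a positive value keeps a code object permanently on the unquickened path.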
Once a code object is warm, it is quickened. This second excerpt of the same struct highlights the fields involved in swapping the instruction stream: co_code keeps the original, immutable bytecode, co_quickened owns the quickened copy together with its specialization caches, and co_firstinstr points at whichever instruction array the eval loop actually executes:

/* Bytecode object */
struct PyCodeObject {
    PyObject_HEAD
    // The hottest fields (in the eval loop) are grouped here at the top.
    PyObject *co_consts;         /* list (constants used) */
    PyObject *co_names;          /* list of strings (names used) */
    _Py_CODEUNIT *co_firstinstr; /* Pointer to first instruction, used for quickening. */
    // ...
    PyObject *co_code;           /* instruction opcodes */
    /* Quickened instructions and cache, or NULL
       This should be treated as opaque by all code except the specializer and
       interpreter. */
    union _cache_or_instruction *co_quickened;
};
Instructions themselves are 16-bit code units, one byte of opcode plus one byte of operand:

typedef uint16_t _Py_CODEUNIT;
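CPython splits a code unit into its opcode and operand with the _Py_OPCODE and _Py_OPARG macros, whose byte order depends on the platform:

#ifdef WORDS_BIGENDIAN
#  define _Py_OPCODE(word) ((word) >> 8)
#  define _Py_OPARG(word)  ((word) & 255)
#else
#  define _Py_OPCODE(word) ((word) & 255)
#  define _Py_OPARG(word)  ((word) >> 8)
#endif

With instructions decoded this way, quickening itself is performed by _Py_Quicken, which copies the bytecode into a freshly allocated buffer and then rewrites the copied instructions in place: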
int
_Py_Quicken(PyCodeObject *code) {
    if (code->co_quickened) {
        return 0;                        /* already quickened */
    }
    Py_ssize_t size = PyBytes_GET_SIZE(code->co_code);
    int instr_count = (int)(size / sizeof(_Py_CODEUNIT));
    if (instr_count > MAX_SIZE_TO_QUICKEN) {
        /* Too large to quicken: park the warm-up counter at a positive
           value so the code object never counts as warm again. */
        code->co_warmup = QUICKENING_WARMUP_COLDEST;
        return 0;
    }
    int entry_count = entries_needed(code->co_firstinstr, instr_count);
    SpecializedCacheOrInstruction *quickened = allocate(entry_count, instr_count);
    if (quickened == NULL) {
        return -1;
    }
    /* Copy the original bytecode into the quickened buffer, then rewrite
       the copy in place with adaptive instruction forms. */
    _Py_CODEUNIT *new_instructions = first_instruction(quickened);
    memcpy(new_instructions, code->co_firstinstr, size);
    optimize(quickened, instr_count);
    code->co_quickened = quickened;
    code->co_firstinstr = new_instructions;  /* eval loop now runs the copy */
    return 0;
}
Note that _Py_Quicken leaves co_code untouched, so code that introspects the bytecode keeps seeing the original instructions; the eval loop, however, now executes from the quickened buffer. That buffer packs instructions and cache entries into a single allocation, laid out as the source comment describes:

/* We layout the quickened data as a bi-directional array:
 * Instructions upwards, cache entries downwards.
 * first_instr is aligned to a SpecializedCacheEntry.
 * The nth instruction is located at first_instr[n]
 * The nth cache is located at ((SpecializedCacheEntry *)first_instr)[-1-n]
 * The first (index 0) cache entry is reserved for the count, to enable finding
 * the first instruction from the base pointer.
 * The cache_count argument must include space for the count.
 * We use the SpecializedCacheOrInstruction union to refer to the data
 * to avoid type punning.

 Layout of quickened data, each line 8 bytes for M cache entries and N instructions:

 <cache_count>                              <---- co->co_quickened
 <cache M-1>
 <cache M-2>
 ...
 <cache 0>
 <instr 0> <instr 1> <instr 2> <instr 3>    <--- co->co_first_instr
 <instr 4> <instr 5> <instr 6> <instr 7>
 ...
 <instr N-1>
*/
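Given this layout, finding the nth cache entry is plain pointer arithmetic downward from first_instr, exactly as the comment says. A minimal sketch (the 8-byte stand-in type and the helper name are mine; the real cache types are opaque unions):

#include <stdint.h>

/* Toy 8-byte stand-in for the real (opaque) cache entry type. */
typedef struct { uint8_t bytes[8]; } SpecializedCacheEntry;

static inline SpecializedCacheEntry *
nth_cache_entry(_Py_CODEUNIT *first_instr, int n)
{
    /* Instructions grow upwards from first_instr, cache entries grow
       downwards, so the nth cache entry lives at index -1-n. */
    return &((SpecializedCacheEntry *)first_instr)[-1 - n];
}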
For instructions that need specialization data, the operand in the quickened array will serve as a partial index, along with the offset of the instruction, to find the first specialization data entry for that instruction.
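A single operand byte can only distinguish 256 values, which is why it serves only as a partial index: quickening stores a delta in the oparg, and the runtime adds back a base derived from the instruction's own offset. A hedged sketch of the encode/decode pair (helper names and the base calculation are illustrative; check the CPython source for the exact formula):

/* Illustrative only: how an 8-bit oparg plus the instruction offset can
 * address more cache entries than the oparg alone could. */
static inline int
first_cache_index(int instr_base, int quickened_oparg)
{
    return instr_base + quickened_oparg;   /* runtime: rebuild the full index */
}

static inline int
quickened_oparg_for(int cache_index, int instr_base)
{
    return cache_index - instr_base;       /* quickening: store the delta */
}

Combined with the bi-directional layout above, an instruction can then reach its first cache entry as nth_cache_entry(first_instr, first_cache_index(...)) without any extra lookup table.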