Skip to content

gh-148171: Convert CALL_BUILTIN_FAST to leave inputs on the stack for refcount elimination in JIT#148172

Open
Fidget-Spinner wants to merge 9 commits intopython:mainfrom
Fidget-Spinner:pop_top_calls
Open

gh-148171: Convert CALL_BUILTIN_FAST to leave inputs on the stack for refcount elimination in JIT#148172
Fidget-Spinner wants to merge 9 commits intopython:mainfrom
Fidget-Spinner:pop_top_calls

Conversation

@Fidget-Spinner
Copy link
Copy Markdown
Member

@Fidget-Spinner Fidget-Spinner commented Apr 6, 2026

@Fidget-Spinner
Copy link
Copy Markdown
Member Author

Results are pretty good. 7% faster on call microbenchmark:

from _testcapi import meth_fastcall

def testfunc(n):
    x = 0
    for _ in range(n):
        meth_fastcall(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    return x

testfunc(20000000)

Summary
  PYTHON_JIT_STRESS=1 ./python ../add_int.py ran
    1.07 ± 0.10 times faster than PYTHON_JIT_STRESS=1 ../cpython_jit_pgo_tc_0/python ../add_int.py
  

Co-authored-by: Kumar Aditya <kumaraditya@python.org>
@markshannon
Copy link
Copy Markdown
Member

Rather than adding an extra stack item for all calls everywhere, I think it would make more sense to keep within stack bounds for this one uop.

The callable is almost always going to be immortal, and self_or_null will either by borrowed (for self) or NULL, so there are options to optimize away these values, leaving space for the result and the arguments.

For example:

op(_CALL_BUILTIN_FAST, (callable, self_or_null, args[oparg] -- res, s, a[oparg])) {
    ...
    PyStackRef_CLOSE(callable);
    ...

means we can optimize away the decrefs on self_or_null and args.

or, for function calls where we know self_or_null is NULL, we can just declare it dead:

op(_CALL_BUILTIN_FAST, (callable, self_or_null, args[oparg] -- res, c, a[oparg])) {
    ...
    DEAD(self_of_null);
    ...

assert(next->op.code == STORE_FAST);
operand = next->op.arg;
}
else if (uop == _POP_TOP_OPARG) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be special casing as few instructions as possible in the front-end.

This can probably be better handled in the optimizer, as we might want to leave a single _POP_TOP_OPARG rather than a long series of _POP_TOPs.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok done,

@Fidget-Spinner
Copy link
Copy Markdown
Member Author

@markshannon I chose to swap out callable, as it's the most convenient location, and I suspect self_or_null has a higher hit rate of being a NULL or borrowed than callable.

// To maintain a consistent view of the stack in the GC, callable's stack location
// must be overwritten before we close it.
_PyStackRef *callable_ptr = args - 2;
PyStackRef_XSETREF(*callable_ptr, PyStackRef_NULL);
Copy link
Copy Markdown
Member

@markshannon markshannon Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit strange. Does the code generator force you to do this?

Would this work?

    PyObject *res_o = _Py_BuiltinCallFast_StackRef(...)
    if (res_o == NULL) {
         ERROR_NO_POP();
    }
    s = self_or_null;
    DEAD(self_or_null);
    DEAD(args);
    res = PyStackRef_FromPyObjectSteal(res_o);
    PyStackRef_CLOSE(callable);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I do that, the code generator shrinks the "real" stack to nothing, sets the stack pointer to before callable, and then calls the close.

This is incorrect, as we need the self_or_null and args to still be visible from the stack when closing callable, as on FT, the stack is a GC root.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants