emacs-devel

Re: Storing Bytecode Offset


From: Rocky Bernstein
Subject: Re: Storing Bytecode Offset
Date: Sun, 12 Jul 2020 23:27:31 -0400

Thanks for looking over the changes and making suggestions.

On Sat, Jul 11, 2020 at 11:29 AM Tom Tromey <tom@tromey.com> wrote:
>>>>> "Zach" == Zach Shaftel <zshaftel@gmail.com> writes:

Zach> Rocky has pushed the code to Savannah. In the
Zach> feature/zach-soc-funcall-from-bytecode branch
[...]
Zach> But the code on that branch is just
Zach> a rough proof-of-concept and isn't 100% accurate. Doing it right would
Zach> require a lot of refactoring that could ultimately lead to a dead end.

Could you say more specifically what is wrong on that branch?  What kind
of refactoring is needed?

I think other interpreters out there take this "self-call" approach, and
it seems like a decent idea to try in Emacs.

Zach: the code added to exec_byte_code in that branch is almost identical to
the contents of Ffuncall, with the exception of the lexical bytecode
special case. That part is just a copy of some of funcall_lambda and
fetch_and_exec_byte_code. That doesn't seem to be a problem in terms of
performance or functionality (so far) but the code duplication is
obviously not ideal. I don't see any way to avoid this without some big
changes to the whole funcall call graph.
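
For anyone following along, here is the general shape of the "self-call"
idea reduced to a toy dispatch loop.  This is deliberately not Emacs code
(the real thing has to deal with argument templates, specbinding, handlers
and so on); the point is only that a bytecode-to-bytecode call stays inside
the interpreter loop instead of round-tripping through the generic call
machinery (the role Ffuncall plays in Emacs):

/* Toy sketch only -- not Emacs code.  */
#include <stdio.h>

enum { OP_CONST, OP_CALL, OP_RET };

struct func
{
  const unsigned char *code;   /* bytecode, or NULL for a "built-in" */
  int (*subr) (void);          /* C implementation when code == NULL */
  struct func *callee;         /* resolved callee for OP_CALL (toy) */
};

static int
interpret (struct func *f)
{
  const unsigned char *pc = f->code;
  int acc = 0;
  for (;;)
    switch (*pc++)
      {
      case OP_CONST:
        acc += *pc++;
        break;
      case OP_CALL:
        if (f->callee->code)
          /* "Self-call": the callee is bytecode too, so stay inside
             the interpreter rather than using the generic call path.  */
          acc += interpret (f->callee);
        else
          /* Generic path for non-bytecode callees.  */
          acc += f->callee->subr ();
        break;
      case OP_RET:
        return acc;
      }
}

static int
two (void)
{
  return 2;
}

int
main (void)
{
  static const unsigned char leaf_code[] = { OP_CALL, OP_RET };
  static const unsigned char top_code[] =
    { OP_CONST, 38, OP_CALL, OP_CALL, OP_RET };
  struct func builtin = { NULL, two, NULL };
  struct func leaf = { leaf_code, NULL, &builtin };
  struct func top = { top_code, NULL, &leaf };
  printf ("%d\n", interpret (&top));   /* prints 42 */
  return 0;
}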

That being said, Zach would love to have this in the bytecode
interpreter, so he would gladly make those changes if this ends up
being the best way forward.  However, unless it is necessary now, that
work would probably be done after the Summer of Code project, when we
hope to have more of the complete solution finished.
 

Zach> In the feature/zach-soc-bytecode-in-traceback branch, the offset is
Zach> stored in the thread_state struct. Prior to the most recent commit, the
Zach> offset was then stashed in the backtrace specbinding frame from
Zach> record_in_backtrace, so each bytecode call in the backtrace buffer was
Zach> annotated with the offset when the error occurred. This is the ultimate
Zach> goal, but the current implementation is flawed and a significant source
Zach> of slowdown.

I took a brief look at it.  One thing I noticed is that the assignment
is done in NEXT:

+#define NEXT UPDATE_OFFSET; goto *(targets[op = FETCH])

However, it seems to me that there are cases where this is not needed --
any bytecode that cannot possibly cause an exception doesn't need to
record the offset.  For example:

        CASE (Bvarref6):
          op = FETCH;
        varref:
          {
            Lisp_Object v1 = vectorp[op], v2;
            if (!SYMBOLP (v1)
                || XSYMBOL (v1)->u.s.redirect != SYMBOL_PLAINVAL
                || (v2 = SYMBOL_VAL (XSYMBOL (v1)), EQ (v2, Qunbound)))
              v2 = Fsymbol_value (v1);
            PUSH (v2);
            NEXT;
          }

Here, I think the updating only has to be done before the call to
Fsymbol_value.
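
That is, something like this (just a sketch, assuming UPDATE_OFFSET is
taken out of NEXT):

            if (!SYMBOLP (v1)
                || XSYMBOL (v1)->u.s.redirect != SYMBOL_PLAINVAL
                || (v2 = SYMBOL_VAL (XSYMBOL (v1)), EQ (v2, Qunbound)))
              {
                /* Only this path can signal, so only here does the
                   offset need to be current.  */
                UPDATE_OFFSET;
                v2 = Fsymbol_value (v1);
              }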

Or:

        CASE (Bgotoifnil):
          {
            Lisp_Object v1 = POP;
            op = FETCH2;
            if (NILP (v1))
              goto op_branch;
            NEXT;
          }

This opcode doesn't need to update the offset at all.

This matters because the change is introducing an extra store into
performance-sensitive code.

Absolutely correct.  There is about a 7% slowdown, and we've tried
various approaches, some that you mention below and some that were
suggested by Andrea Corallo.

In particular, we tried storing bytestr_data in the frame.  As a
fallback position, Andrea also suggested giving up on the most recent
offset and keeping only the offsets in the backtrace.

We think that would be a real loss, since the most recent position is
the one that is most desired.

No matter what is done, there is going to be some loss of performance
unless we gain it back somewhere else, e.g. with a bytecode-to-bytecode
call path, which is going to be complicated.

Please, developers, we need some feedback here.


1. What would a maximum acceptable slowdown be?

2. If we update the offset only where it is needed, there are going to
   be a massive number of changes, which will make review harder.  Will
   that diminish the chances of this being merged into master?

Note that if the code had saved offsets from the start (or had been
written less well), then a 7% slowdown probably wouldn't be an issue.
If you took your other favorite programming language that reports
bytecode offsets and announced that you could speed up its interpreter
by 7% at the cost of losing backtrace locations, I suspect the
community wouldn't go for it.


 

I had a couple of other ideas for how to micro-optimize this store.
I don't know if they will work, but you might try them.

The updating looks like this:

+#define UPDATE_OFFSET (backtrace_byte_offset = pc - bytestr_data)

If you also stored bytestr_data in the structure, then instead of doing
the subtraction in the hot loop, you could do the subtraction when
computing the location -- which happens more rarely, and which is not
performance-sensitive.

Andrea suggested this too. We will do this. 
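
Roughly what we have in mind is something like this (just a sketch;
the field names are not final):

+  /* Raw program counter and start of the byte string of the
+     byte-code function being executed.  The offset is computed as
+     (pc - bytestr_data) only when a backtrace is built.  */
+  const unsigned char *m_backtrace_pc;
+  const unsigned char *m_backtrace_bytestr_data;
+#define backtrace_pc (current_thread->m_backtrace_pc)

so that UPDATE_OFFSET becomes a single store:

+#define UPDATE_OFFSET (backtrace_pc = pc)

with m_backtrace_bytestr_data set once on entry to exec_byte_code and
the subtraction moved to the rare, non-hot code that reports the
location.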

 


Also, as you said, backtrace_byte_offset actually references a field in
the thread state:

+  /* The offset of the current op of the byte-code function being
+     executed. */
+  int m_backtrace_byte_offset;
+#define backtrace_byte_offset (current_thread->m_backtrace_byte_offset)

You might try changing this to be a pointer, and have it point to a
local variable that is updated by UPDATE_OFFSET.  This might be a bit
faster due to cache effects.

That's a great idea.  We will try that out.
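
Something along these lines, perhaps (again just a sketch, names
illustrative):

+  /* Points at a local in the active exec_byte_code frame that
+     mirrors the raw pc; NULL when no bytecode is executing.  */
+  const unsigned char **m_backtrace_pc_ptr;
+#define backtrace_pc_ptr (current_thread->m_backtrace_pc_ptr)

and in exec_byte_code something like:

+  const unsigned char *bc_pc = bytestr_data;
+  const unsigned char **prev_pc_ptr = backtrace_pc_ptr;
+  backtrace_pc_ptr = &bc_pc;
...
+#define UPDATE_OFFSET (bc_pc = pc)
...
+  backtrace_pc_ptr = prev_pc_ptr;   /* a real patch would also need to
+                                       restore this on non-local exit */

That keeps pc itself in a register while the store goes to a stack
slot that should stay in cache, and combined with the stored
bytestr_data above, the backtrace code can compute the offset as
*backtrace_pc_ptr - m_backtrace_bytestr_data at report time.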



Note that you do *not* want to have it point to the actual "pc" variable
-- if you take the address of "pc", the compiler will probably not put
it in a register.  I mean, you might try it, but I would expect a
performance loss in this case.

We saw this firsthand when we kept the pointers in the
specbinding frame. I'm glad you clarified what was going on because we
were puzzled.
 

Tom


Rocky and Zach 
