lilypond-devel

Re: Blockers for Guile 2.2


From: Luca Fascione
Subject: Re: Blockers for Guile 2.2
Date: Sun, 27 Feb 2022 10:39:34 +0100

On Sat, Feb 26, 2022 at 10:48 PM Jean Abou Samra <jean@abou-samra.fr> wrote:

>
> [Jonas]
> > He, I always thought auto-compilation didn't optimize! 😕 now don't
> > tell me Guile also applies optimizations while just reading and
> > supposedly interpreting code...
>
> I don't think it does. At least, you don't usually call eval or
> primitive-eval on code to run it repeatedly, just for one-shot
> code, so I don't see the sense it would make. Also, it seems
> to me that the Guile developers are much more focused on compilation.
>

Randomly, it seems to me my own problem with fingering might possibly bring
up a counterexample pattern (for eval vs. compile): attaching a callback to
something that is evaluated often. In the case of fingering there is an
after-line-breaking callback, which I am guessing runs
O(count(Fingering)) ~= O(count(NoteHead)) times. Isn't this a case of Scheme
code coming in from the .ly file (likely an included .ly file) and being run
O(N) times?
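
To make the pattern concrete, here is a toy example of the kind of thing I
mean (the callback body is a placeholder of my own, not real fingering code):

  {
    \override Fingering.after-line-breaking =
      #(lambda (grob)
         ;; placeholder body: runs once per Fingering grob
         (ly:message "fingering callback"))
    c'4-1 d'-2 e'-3 f'-4
  }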

I can't say I understand LilyPond super well, but it doesn't feel like it
would often employ algorithms that are superlinear in the NoteHead count (it
seems most activities would be organized as linear scans), so I'm guessing
O(N) is as bad as it gets, on a cursory evaluation?

In other words, is it approximately true that "for (almost) any real-life
score the total compilation time is proportional to the number of NoteHeads,
past a certain size"? I'm guessing you need a few pages' worth of material to
amortize away the constant overheads, but beyond that is it true that if you
double the source size you double the compilation time?

The background for this was what David was saying, about whether code in .ly
files would be optimized or not. At a guess, I'd think stuff that is big and
runs O(n) times might be worth looking at to see whether it needs optimizing.
I am sensitive to your other point about the nightmare it will be to keep
this stuff around and properly invalidate it upon changes.

> > [ skipping over the part regarding Guile 3, since I don't think it's
> > relevant here ]
>
> Perhaps I should have changed the title, but I do think it
> is relevant -- it gives hope for a state where development
> cycles will be easier. When we want Guile 3 is another question.
> I'm in favor of not making the move to Guile 2 wait more
> than it already has, but I wonder if at some point in this release
> cycle (by which I mean before the next stable release) we will want
> to switch to Guile 3.
>

fwiw, this sounds super reasonable to me.

> I was just trying to get orders
> of magnitude (for Guile 3 it's 1m30 vs. 4s, no need for
> precise benchmarks to see which is faster :-).
>

Indeed. I was thinking more that not only are the opt/no-opt numbers close,
but 18-ish and 19-ish seconds are also close to each other, so it is possible
that this difference too is spurious for some reason. (I guess I'm saying:
there's a possibility you have been a little lucky or a little unlucky, and
your actual runtime difference is closer to 2% or maybe 15%, but you happened
to take samples at "weird" times.)

I might have some time next week (after 7 March) to run these tests several
times, depending on some stuff I have going on for next weekend. I'll contact
you out of band if I do, for a quick exchange of guidance.

> I am no performance expert, but LilyPond's performance-critical parts
> are written in C++, so in general I would think the cost of Scheme code
> is spread a bit all over the code base (unlike C++ where you can find
> some rather critical parts taking lots of time, like page breaking,
> or skyline algorithms).


Well, but this would be a single call from Scheme into C++ that takes "a long
time", do I read you right?

Instead I was thinking more of a "death by a thousand cuts" kind of scenario.
[continued below]


> I am not sure what you mean by effort spent
> in "preparing" callbacks, could you elaborate?
>

Imagine the grob is a C++ object that remains opaque to Scheme. So it's a
thing for which, in Scheme, you move around a blind pointer, but to read the
Y-offset property you'd call something like (grob-get 'Y-offset), where
grob-get is C++.
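
For reference, I believe the real entry point in LilyPond is
ly:grob-property, so the read would look something like

  (ly:grob-property grob 'Y-offset)

but the shape of my made-up grob-get is the same.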

And then in C++ you'd just have a one-liner accessor that is conceptually

  float scheme_grob_get (Grob *grob, SCM propname)  // SCM is Guile's boxed value type
  {
    // typecheck propname to be a symbol
    // find out which accessor you want to delegate to for that symbol
    return grob->get_y_offset ();  // actual delegation
  }

and in turn the getter is something like

  inline float get_y_offset () const { return y_offset; }  // a plain float member, say

Code following a simple pattern like this, once compiled, will largely be
dominated by the scripting-language runtime overhead: traversing the callback
table, finding the function pointer for the dispatch, marshalling the values
from the Scheme representation to something you can mess with in C++, and the
way back upon return. I've read enough of your C++ to see that a lot of this
happens in client code, but either way it's all code whose cost dominates the
execution of the accessor itself.

On the contrary, it seems to me the Guile compiler has some latitude to
compile away a fair bit of this stuff; it could for example reduce the
"boxing" (*) overhead by compiling it away in runs of code that it can
analyze fully.

(*) 'boxing' is a C# word meaning to wrap a POD value in an object so it can
be moved around uniformly with the other objects in the language. C# people
need to keep boxing and unboxing costs in their heads in code with lots of
PODs being moved around, because that cost can dominate the execution of
their code. I'm not sure what word is used in Guile for this concept.

When above I said "preparing" I was referring to the overhead of:
 - finding what function to call (not sure if it can be compiled away)
 - preparing the values so that the C++ can use them (including whatever
   boxing/unboxing happens to be called by the client code)
as it relates to the cost of running the "meat" of the call.

Some groups call this overhead "marshalling"; it's a classic problem in
projects like PyQt, where you have to use your bindings code to make _both_
sides think they're running on their native datatypes. If you go and look at
how people gain speed when using stuff like numpy, as another example, it's
all about making sure you don't marshal through the boundary all the time,
but instead keep things on "one side" as long as you can, with things like
numpy.array(...) and such.

Obviously the "read me one float and return" case was an absurdly extreme
one. That said, it does happen in real life that you'd write code like that,
because it can make for a more uniform API, which is easier to wrap your head
around and code to.

> > Seems like it could be an interesting way forward for the dev group to run
> > 3.x with -O0 for iteration cycles, and then do what David is saying to
> > ship the scm file with optimizations on
> > and have the in-score Scheme be just built -O0.
>
>
> Hm, note that -O0 is different from what is currently done, which
> is not another -O level, but running code through the evaluator. There
> are two parts in Guile >= 2, the evaluator, which is still there
> for running code passed to eval and primitive-eval (think Python
> exec()), and the byte-compiler + virtual machine. Optimization is
> only done during bytecode generation. Currently, code embedded
> in .ly files is routed through the evaluator. So whether we
> want to byte-compile the code from .ly files and what its optimization
> level should be if we do want to byte-compile it are different
> questions (but probably the desirable level would be -O0).
>

So, going back to my scenario above in which the .ly files give you callbacks
that get invoked a lot, is there a case where it would make sense to instead
compile+optimize and bind the callback through the byte-compiler, rather than
routing it through "eval"/"primitive-eval" all the time?
Obviously I still don't know enough to have a gut feeling either way, but
with hundreds or thousands of calls you could see it add up in the end.
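
Purely to check my understanding, a toy sketch of the two paths, with a
made-up callback body standing in for whatever comes out of the .ly file:

  (use-modules (system base compile))

  ;; made-up callback body (placeholder, not real LilyPond code)
  (define callback-source '(lambda (grob) (* 2 2)))

  ;; path 1: the evaluator (what .ly-embedded Scheme goes through today)
  (define cb-evaluated (primitive-eval callback-source))

  ;; path 2: byte-compile once, then call the resulting closure O(N) times
  (define cb-compiled (compile callback-source #:env (current-module)))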

Re: keeping caches, I wonder if you can compile to memory, like you would do
in Python with regular expressions, for example.
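
Something very hand-wavy along these lines, say (names are mine):

  ;; `compile' comes from (system base compile), as above
  ;; hypothetical in-memory cache keyed on the source form, roughly like
  ;; Python's re module caching compiled patterns
  (define compiled-cache (make-hash-table))

  (define (compile-once form)
    (or (hash-ref compiled-cache form)
        (let ((proc (compile form #:env (current-module))))
          (hash-set! compiled-cache form proc)
          proc)))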

L

