qemu-riscv
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] target/riscv: Implement dynamic establishment of custom deco


From: Christoph Müllner
Subject: Re: [PATCH] target/riscv: Implement dynamic establishment of custom decoder
Date: Fri, 8 Mar 2024 12:56:43 +0100

On Thu, Mar 7, 2024 at 9:35 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> >>> -        for (size_t i = 0; i < ARRAY_SIZE(decoders); ++i) {
> >>> -            if (decoders[i].guard_func(ctx->cfg_ptr) &&
> >>> -                decoders[i].decode_func(ctx, opcode32)) {
> >>> +        for (size_t i = 0; i < decoder_table_size; ++i) {
> >>> +            if (ctx->decoder[i](ctx, opcode32)) {
> >>>                   return;
>
> By the way, you're adding layers of pointer chasing, so I suspect you'll find 
> all of this
> is a wash or worse, performance-wise.
>
>
> Indeed, since some of the predicates are trivial, going the other way might 
> help: allow
> everything to be inlined:
>
>      if (decode_insn32(...)) {
>          return;
>      }
>      if (has_xthead_p(...) && decode_xthead(...)) {
>          return;
>      }
>      ...
>
>
> Even if there are 10 entries here, so what?  All of the code has to be 
> compiled into QEMU.
>   You're not going to somehow add truly dynamic code that you've loaded from 
> a module.

I just tested this with GCC -O2/-O3. The generated code from the
existing decoder loop will
result in exactly what you have listed here (loop unrolling,
transforming the indirect calls
to direct calls, inlining, and evaluation of statically known conditions).
has_xthead_p() can get inlined as well, if the inlining costs are
considered low enough
(thank you, Richard, for giving some hints about that below).

What the commit message is not mentioning (and what this patch was
actually addressing and
therefore should have been mentioned):
Having dynamic control of the decoder order was meant to allow adding
a vendor_decoder
before decode_insn32() with minimal overhead (no guard_func) for users
that don't need
this particular vendor_decoder.

Being more explicit: there is interest in supporting the
non-conforming (conflicting) instruction
extension XTheadVector:
  
https://github.com/T-head-Semi/thead-extension-spec/blob/master/xtheadvector.adoc
XTheadVector uses the RVV 0.7.1 draft encoding, which conflicts with
the ratified RVV spec.
The idea is to avoid these conflicts with a call to
decode_xtheadvector() before decode_insn32().
This implies that everyone has to call has_xtheadvector_p() before
calling decode_insn32().
And the intent of this patch is to provide a mechanism to reduce this overhead.

When suggesting the dynamic decoder list, I was not aware that
always_true_p() will be
eliminated by the compiler. Since this is the case, I agree with the
"wash or worse" for
decode_insn32(). The elimination of following guard functions for
vendor decoders is likely
less performance relevant.

I don't think we should discuss efficiency any further unless we have
some data to
justify any changes. E.g. emulating a RISC-V SPEC CPU 2017 run on x86_64 and
looking at the runtimes could give the relevant insights for the
following scenarios:
* existing code on upstream/master
* existing code + adding a new extension that comes before
decode_insn32 in decoders[]
* existing code + this patch (dynamic decoders)
Related overheads that could be measured: adding 20 new instructions
to decode_insn32(),
which are not executed (to put the costs into perspective).

> You could perhaps streamline predicates such as has_xthead_p to not test 11 
> variables by
> adding an artificial "ext_xthead_any" configuration entry that is the sum of 
> all of those.
>
>
> r~



reply via email to

[Prev in Thread] Current Thread [Next in Thread]