qemu-riscv

Re: [RFC] risc-v vector (RVV) emulation performance issues


From: Richard Henderson
Subject: Re: [RFC] risc-v vector (RVV) emulation performance issues
Date: Tue, 25 Jul 2023 11:53:46 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 7/24/23 06:40, Daniel Henrique Barboza wrote:
Hi,

As some of you are already aware, the current RVV emulation could be faster.
We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
skip set tail when vta is zero") that tried to address at least part of the
problem.

Running a simple program like this:

-------

#include <stdlib.h>

#define SZ 10000000

int main ()
{
   int *a = malloc (SZ * sizeof (int));
   int *b = malloc (SZ * sizeof (int));
   int *c = malloc (SZ * sizeof (int));

   for (int i = 0; i < SZ; i++)
     c[i] = a[i] + b[i];
   return c[SZ - 1];
}

-------
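(The build commands aren't shown here; with a recent RISC-V cross toolchain,
something along these lines produces the two binaries. The compiler name and
-march strings are illustrative, and whether the loop actually gets
autovectorized depends on the compiler version:)

$ riscv64-linux-gnu-gcc -O3 -march=rv64gc  -o foo-novect.out foo.c
$ riscv64-linux-gnu-gcc -O3 -march=rv64gcv -o foo.out foo.c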

The binary compiled without RVV support runs in about 50 milliseconds:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out

real    0m0.043s
user    0m0.025s
sys    0m0.018s

Building the same program with RVV support slows it down by 4-5x:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out

real    0m0.196s
user    0m0.177s
sys    0m0.018s

Using the lowest allowed 'vlen' value (128) slows things down even further,
to ~0.260s, presumably because a smaller vlen means more vector instructions
(and thus more helper calls, each with fixed overhead) for the same amount
of data.


'perf record' shows the following profile on the aforementioned binary:

   23.27%  qemu-riscv64  qemu-riscv64             [.] do_ld4_mmu
   21.11%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
   14.05%  qemu-riscv64  qemu-riscv64             [.] cpu_ldl_le_data_ra
   11.51%  qemu-riscv64  qemu-riscv64             [.] cpu_stl_le_data_ra
    8.18%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
    8.04%  qemu-riscv64  qemu-riscv64             [.] do_st4_mmu
    2.04%  qemu-riscv64  qemu-riscv64             [.] ste_w
    1.15%  qemu-riscv64  qemu-riscv64             [.] lde_w
    1.02%  qemu-riscv64  [unknown]                [k] 0xffffffffb3001260
    0.90%  qemu-riscv64  qemu-riscv64             [.] cpu_get_tb_cpu_state
    0.64%  qemu-riscv64  qemu-riscv64             [.] tb_lookup
    0.64%  qemu-riscv64  qemu-riscv64             [.] riscv_cpu_mmu_index
    0.39%  qemu-riscv64  qemu-riscv64             [.] object_dynamic_cast_assert
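
(For reference, a profile like this can be reproduced with plain perf on the
host, along these lines:)

$ perf record ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo.out
$ perf report --stdio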


The first thing that caught my attention is vext_ldst_us from
target/riscv/vector_helper.c:

     /* load bytes from guest memory */
     for (i = env->vstart; i < evl; i++, env->vstart++) {
         k = 0;
         while (k < nf) {
             target_ulong addr = base + ((i * nf + k) << log2_esz);
             ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
             k++;
         }
     }
     env->vstart = 0;

Given that this is a unit-stride load that accesses contiguous elements in
memory, it seems that this loop could be optimized or even replaced, since
it's currently loading/storing one element at a time (see the sketch below).
I didn't find any TCG op to do that, though. I assume ARM SVE might have
something of the sort. Richard, care to comment?
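
What I have in mind, as a rough host-side sketch (invented names, not QEMU's
actual API; 'mem' stands for guest memory after address translation):

-------

#include <stdint.h>
#include <string.h>

/* For nf == 1 and no masking, the per-element loop above is equivalent
 * to a single contiguous copy: addr = base + (i << log2_esz), so
 * element i lands immediately after element i - 1. */
static void unit_stride_load_fast(void *vd, const uint8_t *mem,
                                  uint32_t evl, uint32_t log2_esz)
{
    memcpy(vd, mem, (size_t)evl << log2_esz);
}

-------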

Yes, SVE optimizes this case -- see

https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/sve_helper.c?ref_type=heads#L5651

It's not possible to do this generically, due to the predication. There's quite a lot of machinery that goes into expanding this such that each helper uses the correct host load/store insn in the fast case.
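
To make the constraint concrete, a minimal sketch with invented names ('pred'
holds one byte per element, non-zero meaning active): the bulk copy is only
legal when every lane is active, because inactive elements must not be
accessed at all.

-------

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustration only, not QEMU's API.  With predication, a single
 * memcpy is valid only for an all-ones mask; otherwise each active
 * element has to be copied individually. */
static void masked_load(void *vd, const uint8_t *mem, const uint8_t *pred,
                        uint32_t evl, uint32_t esz)
{
    bool all_active = true;
    for (uint32_t i = 0; i < evl; i++) {
        all_active = all_active && (pred[i] != 0);
    }
    if (all_active) {
        memcpy(vd, mem, (size_t)evl * esz);              /* fast path */
    } else {
        for (uint32_t i = 0; i < evl; i++) {             /* slow path */
            if (pred[i]) {
                memcpy((uint8_t *)vd + (size_t)i * esz,
                       mem + (size_t)i * esz, esz);
            }
        }
    }
}

-------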


r~


