qemu-riscv

[RFC] risc-v vector (RVV) emulation performance issues


From: Daniel Henrique Barboza
Subject: [RFC] risc-v vector (RVV) emulation performance issues
Date: Mon, 24 Jul 2023 10:40:08 -0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

Hi,

As some of you are already aware the current RVV emulation could be faster.
We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
skip set tail when vta is zero") that tried to address at least part of the
problem.

Running a simple program like this:

-------

#include <stdlib.h>

#define SZ 10000000

int main ()
{
  int *a = malloc (SZ * sizeof (int));
  int *b = malloc (SZ * sizeof (int));
  int *c = malloc (SZ * sizeof (int));

  for (int i = 0; i < SZ; i++)
    c[i] = a[i] + b[i];
  return c[SZ - 1];
}

-------

Compiling it without RVV support and running it takes ~50 milliseconds:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out

real    0m0.043s
user    0m0.025s
sys     0m0.018s

Building the same program with RVV support slows it down 4-5 times:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out

real    0m0.196s
user    0m0.177s
sys     0m0.018s

Using the lowest 'vlen' value allowed (128) slows things down even further, taking it to ~0.260s.


'perf record' shows the following profile on the aforementioned binary:

  23.27%  qemu-riscv64  qemu-riscv64             [.] do_ld4_mmu
  21.11%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
  14.05%  qemu-riscv64  qemu-riscv64             [.] cpu_ldl_le_data_ra
  11.51%  qemu-riscv64  qemu-riscv64             [.] cpu_stl_le_data_ra
   8.18%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
   8.04%  qemu-riscv64  qemu-riscv64             [.] do_st4_mmu
   2.04%  qemu-riscv64  qemu-riscv64             [.] ste_w
   1.15%  qemu-riscv64  qemu-riscv64             [.] lde_w
   1.02%  qemu-riscv64  [unknown]                [k] 0xffffffffb3001260
   0.90%  qemu-riscv64  qemu-riscv64             [.] cpu_get_tb_cpu_state
   0.64%  qemu-riscv64  qemu-riscv64             [.] tb_lookup
   0.64%  qemu-riscv64  qemu-riscv64             [.] riscv_cpu_mmu_index
   0.39%  qemu-riscv64  qemu-riscv64             [.] object_dynamic_cast_assert


The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:

    /* load bytes from guest memory */
    for (i = env->vstart; i < evl; i++, env->vstart++) {
        k = 0;
        while (k < nf) {
            target_ulong addr = base + ((i * nf + k) << log2_esz);
            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
            k++;
        }
    }
    env->vstart = 0;

Given that this is a unit-stride load that accesses contiguous elements in memory, it seems this loop could be optimized/removed, since it's loading/storing bytes one by one. I didn't find any TCG op to do that, though. I assume ARM SVE might have something of the sort. Richard, care to comment?

The current support we have is good enough for booting a kernel and running tests, but things degrade fast if one attempts to run the x264 SPEC benchmark with it. With a SPEC run we have other insns showing up as hot, but for now it would be good to see if we can optimize these loads and stores.


Any ideas on how to tackle this? Thanks,


Daniel



