Re: [PATCH 0/5] Add LoongArch v1.1 instructions

qemu-devel



From:	gaosong
Subject:	Re: [PATCH 0/5] Add LoongArch v1.1 instructions
Date:	Tue, 31 Oct 2023 20:12:45 +0800
User-agent:	Mozilla/5.0 (X11; Linux loongarch64; rv:68.0) Gecko/20100101 Thunderbird/68.7.0

On 2023/10/31 19:10, Jiajie Chen wrote:

On 2023/10/31 19:06, gaosong wrote:
On 2023/10/31 17:13, Jiajie Chen wrote:
On 2023/10/31 17:11, gaosong wrote:
On 2023/10/30 19:54, Jiajie Chen wrote:
On 2023/10/30 16:23, gaosong wrote:
On 2023/10/28 21:09, Jiajie Chen wrote:
On 2023/10/26 14:54, gaosong wrote:
On 2023/10/26 09:38, Jiajie Chen wrote:
On 2023/10/26 03:04, Richard Henderson wrote:
On 10/25/23 10:13, Jiajie Chen wrote:
On 2023/10/24 07:26, Richard Henderson wrote:
See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 block.
See target/ppc/translate.c, gen_stqcx_.
The situation here is slightly different: aarch64 and ppc64 both have 128-bit ll and sc, whereas LoongArch v1.1 only has 64-bit ll and 128-bit sc.
Ah, that does complicate things.
Possibly use the combination of ll.d and ld.d:


ll.d lo, base, 0
ld.d hi, base, 4

# do some computation

sc.q lo, hi, base

# try again if sc failed
Then a possible implementation of gen_ll() would be: align base to a 128-bit boundary, read 128 bits from memory, save the 64-bit part to rd and record the whole 128-bit data in llval. Then gen_sc_q() uses a 128-bit cmpxchg.
But what about the reversed instruction pattern: ll.d hi, base, 4; ld.d lo, base, 0?
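The scheme described above can be sketched as a small single-threaded C model. This is only an illustration under stated assumptions, not QEMU code: the names LLState, model_ll_d and model_sc_q are hypothetical, guest memory is a plain array, and the compare-and-store in model_sc_q stands in for what would have to be one 128-bit atomic cmpxchg in a real emulator. Because ll.d always snapshots the whole aligned block, the model behaves the same whichever half ll.d loads, which covers the reversed pattern as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit value: lo sits at the lower address. */
typedef struct { uint64_t lo, hi; } u128;

/* Hypothetical model of the emulated LL/SC state. */
typedef struct {
    bool     llbit;
    uint64_t lladdr;   /* 16-byte-aligned address recorded by ll.d */
    u128     llval;    /* whole 128-bit block observed by ll.d */
} LLState;

/* 64 bytes of modeled guest memory, stored as 64-bit words. */
static uint64_t mem[8];

static u128 load128(uint64_t addr)
{
    assert((addr & 15) == 0);          /* must be 128-bit aligned */
    u128 v = { mem[addr / 8], mem[addr / 8 + 1] };
    return v;
}

/* ll.d: align down, snapshot the full block, return the requested half. */
static uint64_t model_ll_d(LLState *s, uint64_t addr)
{
    uint64_t base = addr & ~(uint64_t)15;

    s->llbit  = true;
    s->lladdr = base;
    s->llval  = load128(base);
    return (addr & 8) ? s->llval.hi : s->llval.lo;
}

/*
 * sc.q: store {hi, lo} only if the whole block still equals the snapshot.
 * Returns 1 on success, 0 on failure. In an emulator the compare and the
 * store would need to be a single 128-bit atomic operation.
 */
static int model_sc_q(LLState *s, uint64_t addr, uint64_t lo, uint64_t hi)
{
    bool ok = false;

    if (s->llbit && s->lladdr == addr) {
        u128 cur = load128(addr);
        if (cur.lo == s->llval.lo && cur.hi == s->llval.hi) {
            mem[addr / 8]     = lo;
            mem[addr / 8 + 1] = hi;
            ok = true;
        }
    }
    s->llbit = false;
    return ok ? 1 : 0;
}
```

In this model, a store to either half of the block between ll.d and sc.q makes sc.q fail, regardless of which half ll.d returned.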
It would be worth asking your hardware engineers about the bounds of legal behaviour. Ideally there would be some very explicit language, similar to
I'm a community developer not affiliated with Loongson. Song Gao, could you provide some detail from Loongson Inc.?
ll.d   r1, base, 0
dbar 0x700          ==> see 2.2.8.1
ld.d  r2, base,  8
...
sc.q r1, r2, base
Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence and translate it into one tcg_gen_qemu_ld_i128, then split the result into two 64-bit parts. Can we do this in QEMU?
Oh, I'm not sure.
I think we just need to implement sc.q. We don't need to care about 'll.d-dbar-ld.d'. It's just like 'll.q'.
The user needs to ensure that.

'll.q' is:
1) ll.d r1, base, 0 ==> set LLbit, load the low 64 bits into r1
2) dbar 0x700
3) ld.d r2, base, 8 ==> load the high 64 bits into r2

sc.q needs to
1) Use 64-bit cmpxchg.
2) Write 128 bits to memory.
Consider the following code:


ll.d r1, base, 0

dbar 0x700

ld.d r2, base, 8

addi.d r2, r2, 1

sc.q r1, r2, base


We translate them into native code:


ld.d r1, base, 0

mv LLbit, 1

mv LLaddr, base

mv LLval, r1

dbar 0x700

ld.d r2, base, 8

addi.d r2, r2, 1

if (LLbit == 1 && LLaddr == base) {

    cmpxchg addr=base compare=LLval new=r1

    128-bit write {r2, r1} to base if cmpxchg succeeded

}

set r1 if sc.q succeeded
If the memory content at base+8 has changed between ld.d r2 and addi.d r2, atomicity is not guaranteed: only the high part has changed, the low part hasn't, so the 64-bit cmpxchg on the low part still succeeds.
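The failure mode described above can be reproduced with a small single-threaded C model. This is a sketch under assumptions, not the actual translation: the names LLState, sc_q_cmp64, sc_q_cmp128 and run_sequence are hypothetical, the interfering store from another CPU is simulated by a plain assignment, and the model simply assumes the emulator has managed to record both halves at load time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Modeled memory at base: mem[0] is the low half, mem[1] the high half. */
static uint64_t mem[2];

/* Hypothetical LL/SC state recorded by the emulator. */
typedef struct {
    bool     llbit;
    uint64_t llval_lo;   /* value seen by ll.d */
    uint64_t llval_hi;   /* value seen by ld.d (assumed to be recorded) */
} LLState;

/* sc.q that checks only the low 64 bits, then writes all 128 bits. */
static bool sc_q_cmp64(LLState *s, uint64_t lo, uint64_t hi)
{
    bool ok = s->llbit && mem[0] == s->llval_lo;

    if (ok) {
        mem[0] = lo;
        mem[1] = hi;
    }
    s->llbit = false;
    return ok;
}

/* sc.q that checks the full 128 bits before writing. */
static bool sc_q_cmp128(LLState *s, uint64_t lo, uint64_t hi)
{
    bool ok = s->llbit && mem[0] == s->llval_lo && mem[1] == s->llval_hi;

    if (ok) {
        mem[0] = lo;
        mem[1] = hi;
    }
    s->llbit = false;
    return ok;
}

/*
 * Runs: ll.d r1; dbar; ld.d r2; <other CPU stores to base+8>;
 * addi.d r2, r2, 1; sc.q r1, r2, base.
 * Returns whether sc.q reported success.
 */
static bool run_sequence(bool wide_compare)
{
    LLState s = { false, 0, 0 };
    uint64_t r1, r2;

    mem[0] = 1;
    mem[1] = 100;

    r1 = mem[0];            /* ll.d r1, base, 0 */
    s.llbit = true;
    s.llval_lo = r1;
    r2 = mem[1];            /* ld.d r2, base, 8 */
    s.llval_hi = r2;

    mem[1] = 200;           /* simulated store from another CPU */

    r2 += 1;                /* addi.d r2, r2, 1 */
    return wide_compare ? sc_q_cmp128(&s, r1, r2)
                        : sc_q_cmp64(&s, r1, r2);
}
```

With the 64-bit compare, run_sequence(false) reports success and leaves 101 in the high half, silently discarding the other CPU's 200. With the 128-bit compare, run_sequence(true) fails and the 200 survives, so the guest retries.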
Sorry, my mistake. We need to use cmpxchg_i128. See target/arm/tcg/translate-a64.c gen_store_exclusive().
gen_scq(rd, rk, rj)
{
    ...
    TCGv_i128 t16 = tcg_temp_new_i128();
    TCGv_i128 c16 = tcg_temp_new_i128();
    TCGv_i64 low = tcg_temp_new_i64();
    TCGv_i64 high = tcg_temp_new_i64();
    TCGv_i64 temp = tcg_temp_new_i64();

    /* New 128-bit value to store: rd is the low half, rk the high half. */
    tcg_gen_concat_i64_i128(t16, cpu_gpr[rd], cpu_gpr[rk]);

    /* Re-read both halves from memory to build the compare value. */
    tcg_gen_qemu_ld_i64(low, cpu_lladdr, ctx->mem_idx, MO_TEUQ);
    tcg_gen_addi_tl(temp, cpu_lladdr, 8);
    tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
    tcg_gen_qemu_ld_i64(high, temp, ctx->mem_idx, MO_TEUQ);
The problem is that the high value read here might not equal the one previously read by the ld.d r2, base, 8 instruction.
I think dbar 0x700 ensures that the two loads in 'll.q' are a 128-bit atomic operation.
The code does work on a real LoongArch machine. However, we are emulating LoongArch in QEMU, so we have to make it atomic, and right now it isn't.

Yes, I know. As I said before, we needn't care about 'll.q'; the user needs to ensure that.

In QEMU, I think the dbar instruction can make it atomic, but I am not sure this is right.


static bool trans_dbar(DisasContext *ctx, arg_dbar *a)
{
    tcg_gen_mb(TCG_BAR_SC | TCG_MO_ALL);
    return true;
}

Maybe this is already enough.

or

like this:
static bool trans_dbar(DisasContext *ctx, arg_dbar *a)
{
    TCGBar bar;

    /* Hint 0x700 only needs to order loads against loads. */
    if (a->hint == 0x700) {
        bar = TCG_BAR_SC | TCG_MO_LD_LD;
    } else {
        bar = TCG_BAR_SC | TCG_MO_ALL;
    }

    tcg_gen_mb(bar);
    return true;
}

Thanks.
Song Gao

    tcg_gen_concat_i64_i128(c16, low, high);
    tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16, ctx->mem_idx, MO_128);
    ...
}

I am not sure this is right.

I think Richard can give you more suggestions. @Richard

Thanks.
Song Gao
Thanks.
Song Gao
For this series,
I think we need to set the new config bits on the 'max' cpu, and change 'any' to 'max' in linux-user/target_elf.h, so that we can use these new instructions in linux-user mode.
I will work on it.
Thanks
Song Gao
https://developer.arm.com/documentation/ddi0487/latest/
B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions
But you could do the same thing, aligning and recording the entire 128-bit quantity, then extract the ll.d result based on address bit 6. This would complicate the implementation of sc.d as well, but would perhaps bring us "close enough" to the actual architecture.
Note that our Arm store-exclusive implementation isn't quite in spec either. There is quite a large comment within translate-a64.c store_exclusive() about the ways things are not quite right. But it seems to be close enough for actual usage to succeed.
r~
