tinycc-devel

Re: [Tinycc-devel] Optimizing for avx512


From: Samir Ribić
Subject: Re: [Tinycc-devel] Optimizing for avx512
Date: Sun, 6 Feb 2022 09:41:16 +0100

Even if inline assembly does not support AVX-512, you can still use NASM and link externally. Anyway, there are only six instruction forms you need to use:
VMOVUPS zmm1,[memory location]  ; Loads 16 floats from memory location to zmm1
VMULPS zmm1,zmm2,zmm3 ; multiplies the 16 floats in zmm2 by the 16 floats in zmm3 and stores the 16 results in zmm1
VADDPS zmm1,zmm2,zmm3 ; adds the 16 floats in zmm2 to the 16 floats in zmm3 and stores the 16 results in zmm1
VSUBPS zmm1,zmm2,zmm3 ; subtracts the 16 floats in zmm3 from the 16 floats in zmm2 and stores the 16 results in zmm1
VDIVPS zmm1,zmm2,zmm3 ; divides the 16 floats in zmm2 by the 16 floats in zmm3 and stores the 16 results in zmm1
VMOVUPS [memory location],zmm1  ; Stores 16 floats from zmm1 to memory location
For double precision, use the PD variants (VMOVUPD, VMULPD, VADDPD, VSUBPD, VDIVPD), which work on 8 doubles per zmm register.
VMOVAPS can be a bit faster than VMOVUPS, but the data must be at addresses divisible by 64.
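To make the block structure concrete, here is a minimal plain-C sketch of how 100 paths split into register-sized blocks. It assumes double precision (8 values per 512-bit register) and a made-up expression `a * b + c`; the function name `eval_blocks` and the macro `LANES` are hypothetical, and the inner loop only models what one vector multiply and one vector add would do, it is not AVX-512 code itself.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* doubles per 512-bit register (512 / 64) */

/* Hypothetical example expression: out = a * b + c, evaluated per path.
 * Processes n paths in blocks of LANES and returns the number of blocks,
 * i.e. how many vector operations of each kind a SIMD version would need. */
size_t eval_blocks(const double *a, const double *b, const double *c,
                   double *out, size_t n)
{
    size_t blocks = 0;
    assert(a && b && c && out);
    for (size_t base = 0; base < n; base += LANES) {
        /* The last block may be partial (100 = 12 * 8 + 4). */
        size_t width = (n - base < LANES) ? n - base : LANES;
        /* One vector multiply plus one vector add would cover this loop. */
        for (size_t lane = 0; lane < width; lane++)
            out[base + lane] = a[base + lane] * b[base + lane] + c[base + lane];
        blocks++;
    }
    return blocks;
}
```

For n = 100 the function returns 13: twelve full blocks and one block of four. A real vector version has to deal with that partial block, for example by padding the arrays to 104 elements or by using a masked load and store.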
Check whether your PC supports AVX-512. Many recent Xeon processors support it, Pentium and Celeron usually do not, and Core processors may or may not, depending on the model.
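One way to do that check from a program is sketched below. It assumes GCC or Clang on x86, because it relies on their `__builtin_cpu_supports` builtin, which tcc does not provide; on any other compiler or architecture this sketch simply reports no support. The function name `has_avx512f` is hypothetical. On Linux, looking for `avx512f` in the flags line of /proc/cpuinfo gives the same information without compiling anything.

```c
#include <stdio.h>

/* Returns 1 if the AVX-512 Foundation subset is reported as usable,
 * 0 otherwise. Assumption: GCC or Clang on x86; everything else
 * (including tcc) falls through to "not supported". */
int has_avx512f(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__TINYC__) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return 0;
#endif
}

/* Prints a one-line report; call it from main() or a startup routine. */
void report_avx512f(void)
{
    printf("AVX-512F: %s\n", has_avx512f() ? "available" : "not available");
}
```

A program would typically call `has_avx512f()` once at startup and pick either the externally assembled AVX-512 routine or a scalar fallback.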



On Sun, Feb 6, 2022 at 9:02 AM Yair Lenga <yair.lenga@gmail.com> wrote:
Thank you for the feedback. I understand the limits of tcc. In my specific problem, I am trying to speed up a user-provided expression in a simulation of 100 paths. Can I use the AVX-512 built-ins, e.g. work on 8 double-precision values with one operation, practically reducing the 100 evaluations to 13 (100/8, rounded up)?

User expressions are all in a form that can be handled by AVX SIMD instructions: add, multiply, …

Thanks, yair.

Sent from my iPad
_______________________________________________
Tinycc-devel mailing list
Tinycc-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/tinycc-devel
