avr-gcc-list

Re: Testing alternatives to functions from lib1funcs.S


From: Marek Michalkiewicz
Subject: Re: Testing alternatives to functions from lib1funcs.S
Date: Sun, 21 Apr 2024 23:32:10 +0200

If speed is more important than size (likely, as most AVR chips today
have much more flash than those available long ago when that code
was originally written), here is a proposed (untested) patch to unroll
the loop (simply repeat the body 8 times) for devices with >8K flash.
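
For anyone who wants to check the algorithm on a host before touching
the assembly, here is a minimal C sketch of the shift/compare/subtract
step and of both variants (counted loop and 8 repeated steps).  The
names div_state, div_step, div_init, udivmod8_loop and
udivmod8_unrolled are made up for this illustration and are not part
of libgcc; the carry handling is my reading of the rol/cp/brcs/sub/rol
sequence, so treat it as a model rather than a reference.

```c
#include <stdint.h>

/* Register state of the assembly routine, modelled in C.
   All names here are illustrative, not libgcc identifiers. */
typedef struct {
    uint8_t rem;     /* r_rem  (r25): remainder */
    uint8_t arg1;    /* r_arg1 (r24): dividend, then complemented quotient */
    unsigned carry;  /* the AVR C flag */
} div_state;

/* One rol/cp/brcs/sub/rol group. */
static void div_step(div_state *s, uint8_t arg2)
{
    s->rem = (uint8_t)((s->rem << 1) | s->carry);    /* rol r_rem */
    s->carry = s->rem < arg2;                        /* cp  r_rem,r_arg2 */
    if (!s->carry)                                   /* brcs 1f skips the sub */
        s->rem = (uint8_t)(s->rem - arg2);           /* sub leaves C = 0 */
    unsigned msb = s->arg1 >> 7;                     /* 1: rol r_arg1 */
    s->arg1 = (uint8_t)((s->arg1 << 1) | s->carry);
    s->carry = msb;
}

/* clr r_rem; lsl r_arg1 (the dividend's MSB moves into carry). */
static div_state div_init(uint8_t dividend)
{
    div_state s = { 0, (uint8_t)(dividend << 1), (unsigned)(dividend >> 7) };
    return s;
}

/* Size variant: counted loop, cnt plays the role of r_cnt. */
static div_state udivmod8_loop(uint8_t dividend, uint8_t divisor)
{
    div_state s = div_init(dividend);
    for (uint8_t cnt = 8; cnt != 0; cnt--)
        div_step(&s, divisor);
    s.arg1 = (uint8_t)~s.arg1;                       /* com r_arg1 */
    return s;
}

/* Speed variant: the step repeated 8 times (.rept 8), no counter. */
static div_state udivmod8_unrolled(uint8_t dividend, uint8_t divisor)
{
    div_state s = div_init(dividend);
    div_step(&s, divisor); div_step(&s, divisor);
    div_step(&s, divisor); div_step(&s, divisor);
    div_step(&s, divisor); div_step(&s, divisor);
    div_step(&s, divisor); div_step(&s, divisor);
    s.arg1 = (uint8_t)~s.arg1;                       /* com r_arg1 */
    return s;
}
```

After either function returns, arg1 holds the quotient and rem the
remainder.  The quotient bits are shifted in complemented (C is set
when the remainder is smaller than the divisor), which is why the
final com is needed.  Division by zero is not modelled.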

The larger and faster version doesn't use r23 as a loop counter.
Telling GCC that r23 is not clobbered (which may yield better code
around the calls to this function) is left as an exercise for the
reader.  So it could actually be a win in some cases on the smaller
chips too.  Your call.
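
To illustrate what "not clobbered" would mean at the C level, here is
a rough sketch modelled on the inline-asm wrappers in the quoted test
program.  Inside GCC the clobber is described in the machine
description, not in user code, so this only shows the idea.  The
wrapper name udivmod8_call is hypothetical, and dropping "r23" is
only valid under the assumption that libgcc was actually built with
the unrolled variant.  On a non-AVR host the sketch falls back to the
plain C operators so that it still compiles.

```c
#include <stdint.h>

#if defined(__AVR__)
extern void __udivmodqi4(void);
#endif

/* Hypothetical wrapper, not part of libgcc or avr-libc.
   Returns the quotient and stores the remainder in *rem. */
static inline uint8_t udivmod8_call(uint8_t dividend, uint8_t divisor,
                                    uint8_t *rem)
{
#if defined(__AVR__)
    register uint8_t r25 __asm__("r25");             /* remainder out */
    register uint8_t r24 __asm__("r24") = dividend;  /* dividend in, quotient out */
    register uint8_t r22 __asm__("r22") = divisor;   /* divisor, preserved */
#if defined(__AVR_HAVE_JMP_CALL__)
    /* ASSUMPTION: libgcc contains the unrolled routine, which never
       writes r23, so no clobber is listed here. */
    __asm__ ("%~call %x[func]"
             : "+r" (r24), "=r" (r25)
             : "r" (r22), [func] "i" (__udivmodqi4));
#else
    /* Looping routine: r23 is the loop counter and gets destroyed. */
    __asm__ ("%~call %x[func]"
             : "+r" (r24), "=r" (r25)
             : "r" (r22), [func] "i" (__udivmodqi4)
             : "r23");
#endif
    *rem = r25;
    return r24;
#else
    /* Host fallback so the sketch compiles off-target. */
    *rem = (uint8_t)(dividend % divisor);
    return (uint8_t)(dividend / divisor);
#endif
}
```

Without the clobber, the compiler is free to keep a live value in r23
across the call instead of moving or spilling it.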

Thanks,
Marek

--- libgcc_config_avr_lib1funcs.S.orig  2024-04-21 22:22:35.231870200 +0200
+++ libgcc_config_avr_lib1funcs.S       2024-04-21 23:11:54.285118400 +0200
@@ -1340,8 +1340,17 @@
 #if defined (L_udivmodqi4)
 DEFUN __udivmodqi4
        clr     r_rem           ; clear remainder
-       ldi     r_cnt,8         ; init loop counter
        lsl     r_arg1          ; shift dividend
+#ifdef __AVR_HAVE_JMP_CALL__ /* Optimize speed: 40 words, 40 cycles, r_cnt not used. */
+.rept 8
+       rol     r_rem           ; shift dividend into remainder
+       cp      r_rem,r_arg2    ; compare remainder & divisor
+       brcs    1f      ; remainder <= divisor
+       sub     r_rem,r_arg2    ; restore remainder
+1:     rol     r_arg1          ; shift dividend (with CARRY)
+.endr
+#else /* Optimize size: 8 words, 64 cycles. */
+       ldi     r_cnt,8         ; init loop counter
 __udivmodqi4_loop:
        rol     r_rem           ; shift dividend into remainder
        cp      r_rem,r_arg2    ; compare remainder & divisor
@@ -1351,6 +1360,7 @@
        rol     r_arg1          ; shift dividend (with CARRY)
        dec     r_cnt           ; decrement loop counter
        brne    __udivmodqi4_loop
+#endif
        com     r_arg1          ; complement result
                                ; because C flag was complemented in loop
        ret


On Sun, Apr 21, 2024 at 03:22:31PM +0200, Georg-Johann Lay wrote:
> On 21.04.24 at 10:08, Wolfgang Hospital wrote:
> > Dear all,
> >
> > Is there a test scaffold for the functions from lib1funcs.S,
> > correctness, size&speed over the variety of 8-bit AVR cores?
> 
> Size is the easiest one: Just compare the size of a build with, say,
> -nodefaultlibs -nostartfiles against a corresponding build
> with -Wl,-u,__divmodqi4.
> 
> Benchmarking speed is not so easy.  I am using the avrtest core
> simulator because it is fast, simulating a core is enough, and
> it has some extra features, e.g. get random values and get values
> out of the target, e.g. LOG_FMT_DOUBLE ("double = %f\n", x);
> 
> https://github.com/sprintersb/atest
> 
> See the end of this mail for an example.
> 
> For correctness, most of the functions are tested outside the testsuite
> by hand-written programs that test new implementations against
> existing ones, like in the code below.  Such tests don't make sense
> any more when the new version is integrated.  And performance
> tests / comparisons are misplaced in the GCC testsuite anyway.
> 
> > Is there a more comprehensive statement of calling conventions than
> > https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention,
> 
> It is comprehensive, but likely not complete.  For completeness, you'll
> have to resort to avr.md and the files it includes.  There is no
> table that lists the non-ABI stuff though; you'll have to find the
> transparent calls, usually of type "xcall".  Notice however that
> such functions may be ABI or non-ABI.  Transparent calls are basically
> used for two purposes:
> * Non-ABI calls like some mul stuff that gets param in X reg.
> * ABI calls that don't clobber all callee-used regs, in order to
>   model the smaller footprint.
> 
> > in particular explicitly stating which functions are guaranteed to have
> > __zero_reg__ 0 on entry/where it suffices to have __zero_reg__ 0 on
> > return as opposed to preserving its value?
> 
> When a function does /not/ have zero_reg=0 on entry, then the compiler
> or libc (or application code) has a bug.  Same when zero_reg!=0 on
> exit.
> 
> > I've been tinkering around, the "ldi  r_cnt, 9" + "rjmp entry point" in
> > __udivmodqi4 instead of "ldi  r_cnt, 8" + "lsl  r_arg1" annoying me for
> > years. (Biggest relative strict improvement I found, FWIW.)
> 
> I went ahead and applied it, see https://gcc.gnu.org/PR114794
> 
> In order to test it, I ran the following code with
> avrtest_log -q -no-log ...
> 
> <CODE>
> #include <stdint.h>
> #include "avrtest.h"
> 
> volatile uint8_t q8, my_q8;
> volatile uint8_t r8, my_r8;
> 
> extern void __udivmodqi4 (void);
> extern void my_udivmodqi4 (void);
> 
> __asm("\n"
> "r_rem        = 25    /* remainder */" "\n"
> "r_arg1       = 24    /* dividend, quotient */" "\n"
> "r_arg2       = 22    /* divisor */" "\n"
> "r_cnt        = 23    /* loop count */" "\n"
> ".pushsection .text" "\n"
> ".global my_udivmodqi4" "\n"
> "my_udivmodqi4:" "\n\t"
> "     sub     r_rem,r_rem     ; clear remainder and carry" "\n\t"
> "     ldi     r_cnt,8         ; init loop counter" "\n\t"
> "     lsl     r_arg1          ; shift dividend" "\n\t"
> "__udivmodqi4_loop:" "\n\t"
> "     rol     r_rem           ; shift dividend into remainder" "\n\t"
> "     cp      r_rem,r_arg2    ; compare remainder & divisor" "\n\t"
> "     brcs    __udivmodqi4_ep ; remainder <= divisor" "\n\t"
> "     sub     r_rem,r_arg2    ; restore remainder" "\n\t"
> "__udivmodqi4_ep:" "\n\t"
> "     rol     r_arg1          ; shift dividend (with CARRY)" "\n\t"
> "     dec     r_cnt           ; decrement loop counter" "\n\t"
> "     brne    __udivmodqi4_loop" "\n\t"
> "     com     r_arg1          ; complement result" "\n\t"
> "                             ; because C flag was complemented in loop" "\n\t"
> "     ret" "\n\t"
> ".popsection");
> 
> static inline __attribute__((__always_inline__))
> void my_divmod8 (volatile uint8_t *pq, volatile uint8_t *prem,
>                  uint8_t dividend, uint8_t divisor)
> {
>     register uint8_t rem asm("25");
>     register uint8_t q asm("24");
>     register uint8_t r22 asm("22") = divisor;
>     register uint8_t r24 asm("24") = dividend;
>     asm ("%~call %x[func]"
>          : "=r" (q), "=r" (rem)
>          : "r" (r22), "r" (r24), [func] "i" (my_udivmodqi4)
>          : "r23");
>     *pq = q;
>     *prem = rem;
> }
> 
> static inline __attribute__((__always_inline__))
> void divmod8 (volatile uint8_t *pq, volatile uint8_t *prem,
>               uint8_t dividend, uint8_t divisor)
> {
>     register uint8_t rem asm("25");
>     register uint8_t q asm("24");
>     register uint8_t r22 asm("22") = divisor;
>     register uint8_t r24 asm("24") = dividend;
>     asm ("%~call %x[func]"
>          : "=r" (q), "=r" (rem)
>          : "r" (r22), "r" (r24), [func] "i" (__udivmodqi4)
>          : "r23");
>     *pq = q;
>     *prem = rem;
> }
> 
> void bench_divmod8 (void)
> {
>     uint8_t a = 0;
>     do
>     {
>         uint8_t b = 1;
>         do
>         {
>             PERF_START_CALL (1);
>             divmod8 (&q8, &r8, a, b);
>             PERF_STOP (1);
> 
>             PERF_START_CALL (2);
>             my_divmod8 (&my_q8, &my_r8, a, b);
>             PERF_STOP (2);
> 
>             if (q8 != my_q8 || r8 != my_r8)
>                 __builtin_abort();
>         } while (++b);
>     } while (++a);
> }
> 
> int main (void)
> {
>     bench_divmod8();
>     PERF_DUMP_ALL;
>     return 0;
> }
> </CODE>
> 
> The input space is only 16 bits wide, so full coverage is possible.
> With larger input spaces, one could use avrtest_[p]rand() or
> similar means to randomize the input.
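
A note on the randomized approach for wider inputs: a minimal host-side
sketch could look like the following.  Everything in it is an
assumption made for illustration: xorshift32 is a generic PRNG standing
in for whatever random source the target offers, udivmod16_candidate is
a textbook restoring division and not the libgcc routine, and
random_check16 is a hypothetical helper name.  The candidate is
compared against the compiler's own / and % operators.

```c
#include <stdint.h>

/* Generic 32-bit xorshift PRNG (stand-in for a target-side random
   source).  The state must be nonzero. */
static uint32_t rng_state = 0x12345678u;

static uint32_t xorshift32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}

/* Implementation under test: textbook restoring division on 16 bits.
   Illustrative only, not the libgcc code. */
static void udivmod16_candidate(uint16_t n, uint16_t d,
                                uint16_t *q, uint16_t *r)
{
    uint32_t rem = 0;
    uint16_t quot = 0;
    for (int i = 15; i >= 0; i--) {
        rem = (rem << 1) | ((uint32_t)(n >> i) & 1u);  /* bring down next bit */
        quot = (uint16_t)(quot << 1);
        if (rem >= d) {                                /* subtract if it fits */
            rem -= d;
            quot |= 1u;
        }
    }
    *q = quot;
    *r = (uint16_t)rem;
}

/* Runs `rounds` random trials and returns the number of mismatches
   against the reference operators.  Divisor 0 is skipped. */
static unsigned long random_check16(unsigned long rounds)
{
    unsigned long mismatches = 0;
    while (rounds--) {
        uint32_t x = xorshift32();
        uint16_t n = (uint16_t)x;
        uint16_t d = (uint16_t)(x >> 16);
        uint16_t q, r;
        if (d == 0)
            continue;
        udivmod16_candidate(n, d, &q, &r);
        if (q != n / d || r != n % d)
            mismatches++;
    }
    return mismatches;
}
```

Uniform random inputs rarely hit corner cases such as a divisor of 1 or
operands of 0xffff, so it is probably worth adding those by hand.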
> 
> The output is as follows:
> 
> $ avrtest_log -mmcu=avr5 -no-log ben.elf -m 100000000 -q
> 
> --- Dump # 1:
>  Timer T1 "" (65280 rounds):  00ec--00fc
>               Instructions        Ticks
>     Total:      3765820         5222400
>     Mean:            57              80
>     Stand.Dev:      0.9             0.0
>     Min:             57              80
>     Max:             65              80
>     Calls (abs) in [   2,   3] was:   2 now:   2
>     Calls (rel) in [   0,   1] was:   0 now:   0
>     Stack (abs) in [08fb,08f9] was:08fb now:08fb
>     Stack (rel) in [   0,   2] was:   0 now:   0
> 
>            Min round Max round    Min tag           /   Max tag
>     Calls       -all-same-                          /
>     Stack       -all-same-                          /
>     Instr.         1     65026    -no-tag-          /   -no-tag-
>     Ticks       -all-same-                          /
> 
>  Timer T2 "" (65280 rounds):  0108--0116
>               Instructions        Ticks
>     Total:      3569980         4896000
>     Mean:            54              75
>     Stand.Dev:      0.9             0.0
>     Min:             54              75
>     Max:             62              75
>     Calls (abs) in [   2,   3] was:   2 now:   2
>     Calls (rel) in [   0,   1] was:   0 now:   0
>     Stack (abs) in [08fb,08f9] was:08fb now:08fb
>     Stack (rel) in [   0,   2] was:   0 now:   0
> 
>            Min round Max round    Min tag           /   Max tag
>     Calls       -all-same-                          /
>     Stack       -all-same-                          /
>     Instr.         1     65026    -no-tag-          /   -no-tag-
>     Ticks       -all-same-                          /
> 
> So the new code requires 5 fewer ticks (down from 80 to 75).
> 
> "Calls" is the (relative or absolute) call depth.
> "Stack" is the (relative or absolute) stack usage.
> 
> Johann
> 
> > Recommendations for a platform to vent such ideas welcome (I know of
> > stackoverflow.com).
> > 
> > regards
> > 
> > W. Hospital
> > 
> > -- 
> > Wolfgang Hospital
> 


