Re: Help with Hand-Optimized Assembly

help-gplusplus

From:	Jan Seiffert
Subject:	Re: Help with Hand-Optimized Assembly
Date:	Wed, 28 Mar 2012 18:29:58 -0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:9.0.1) Gecko/20120112 Firefox/9.0.1 SeaMonkey/2.6.1

Bill Woessner wrote:
> I'm a 100% total newbie at writing assembly.  But I figured it would
> be a good exercise.  And besides, this tiny chunk of code is
> definitely in the critical path of something I'm working on.  Any and
> all advice would be appreciated.
> 
> I'm trying to rewrite the following function in x86 assembly:
> 
> inline double DiffAngle(double theta1, double theta2)
> {
>   double delta(theta1 - theta2);
> 
>   return std::abs(delta) <= M_PI ? delta : delta - copysign(2 * M_PI,
> delta);
> }
> 
> To my great surprise, I've actually been somewhat successful.  Here's
> what I have so far:
> 
> double DiffAngle(double theta1, double theta2)
> {
>   asm(
>       "fldl    4(%esp);"

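Once the asm has operands (extended asm), write two percent signs to get a
literal percent sign.
End each instruction with \n\t instead of ';', trust me, it helps if you want
to read the generated asm.
Do not simply reference some stack location; let the compiler hand you the
arguments as operands.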

>       "fsubl   12(%esp);"
>       "fxam;"
>       "fnstsw  %ax;"
>       "fldl    TWO_PI;"
>       "testb   $2, %ah;"
>       "fldl    NEG_TWO_PI;"
>       "fcmovne %st(1), %st;"
>       "fstp    %st(1);"
>       "fsubr   %st(1), %st;"
>       "fldpi;"
>       "fld     %st(2);"
>       "fabs;"
>       "fcomip  %st(1), %st;"
>       "fstp    %st(0);"
>       "fcmovbe %st(1), %st;"
>       "fstp    %st(1);"
>       "rep;"
>       "ret;"

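YIKES!!!
Do not simply ret from your inline asm; the compiler does not know about it,
so the rest of the function (including its epilogue) never runs.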

>       "NEG_TWO_PI:;"
>       ".long   1413754136;"
>       ".long   1075388923;"
>       "TWO_PI:;"
>       ".long   1413754136;"
>       ".long   -1072094725;"

There are other ways to get your constants.

The basic problem is:
This inline asm is missing its inputs and outputs.

>       );
> }
> 
> This compiles, runs and produces the correct answers.  But I have a
> few issues with it:
> 
> 1) If I declare this function inline, it gives me garbage (like
> 10^-304)

Which is no wonder: when the compiler inlines the function, there is no call
of its own any more, so your stack references (4(%esp), 12(%esp)) are totally
bogus.

> 2) If I compile with -Wall, I get a warning that the function doesn't
> return a value, which is absolutely true, but I don't know how to fix
> it.

By creating an output from the inline asm and returning it from the function.

> 3) I don't like how TWO_PI and NEG_TWO_PI are defined.  I had to steal
> it from some generated assembly.  It would be nice to use M_PI,
> 4*atan(1) or something like that.
> 

The x87 has an instruction to load pi (fldpi); two pi is two loads and an
add+pop (faddp). The negative is a multiply by -1, or simply fchs. The trick is
to create the constants once and keep them in the lower stack registers. For
that you had better work on an array of values (where SSE also comes in handy),
because you have to leave the x87 stack the way you found it.
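
A minimal sketch of the fldpi idea, assuming GCC-style extended asm on x86.
The function name two_pi_x87 and the portable fallback are made up for
illustration and are not from Bill's code. The asm pushes exactly one value and
declares it as an output in st(0), so the compiler knows what the x87 stack
looks like afterwards:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: builds 2*pi on the x87 stack instead of
// hard-coding the bit pattern of the constant.
double two_pi_x87()
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    double r;
    asm("fldpi\n\t"                 // st0 = pi
        "fadd   %%st(0), %%st"      // st0 = pi + pi
        : "=t" (r));                // the result is left on top of the x87 stack
#else
    // Fallback for other targets, so the sketch still compiles there.
    double r = 2.0 * std::acos(-1.0);
#endif
    assert(r > 6.28 && r < 6.29);   // cheap sanity check on the result
    return r;
}
```

Because the value comes back through an output operand, the function can
simply return it, which also takes care of the -Wall warning from point 2.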


You have to understand how GCC handles inline ASM.
For GCC, inline ASM is just a bunch of text where GCC makes certain
substitutions (that's also why you should write every literal % as %%, % is a
substitution like in printf). It does not grok or understand your ASM. It can
make no deductions from what you write there.

To interface with the rest of the code, you have to tell GCC what you are doing
there, by means of inputs and outputs to your ASM, because the state of the
machine before and after the ASM is all GCC cares about.

Let's start simple:

asm (
        "add %%eax, %%edx"
);

This is a valid inline ASM, but it has the same problem as yours. The compiler
will put the text literally into the output. There is no guarantee that
anything meaningful is in eax or edx. That GCC does not throw your ASM away
(yes, GCC can optimize ASM a little bit according to its inputs and outputs)
comes from the special rule that an asm without output operands is treated as
volatile.
From GCC's point of view this is the same as

asm (
        "yadda yadda yadda"
);


The canonical format of an inline ASM in GCC is:
asm {volatile} (
        "instructions"
        : {outputs}
        : {inputs}
        : {clobbers}
);

You do not always have to write all parts; trailing empty parts can be left out.
A volatile tells the compiler not to delete the asm even if its outputs look
unused, but the exact semantics are complicated.

So for our example above one should write:
int a, b;
asm (
        "add %1, %0"
        : "=r" (a)
        : "r" (b), "0" (a)
);

Apart from the remaining problems (the compiler may see that the values are not
initialized when passed into the asm, so it may remove the asm), let's see what
we have here.

First, in the instruction section, we refer to our registers by using the number
of the operands. Counting starts at zero and goes from outputs to inputs.

Then there are the outputs. Here we have one output. With "=r" we tell the
compiler to expect a result (the =; all outputs must have an =, unless you use
special stuff like +) in a general purpose register (the r is the code for
that). In parentheses we say that the compiler should arrange for the variable
a to live in that register.

After that there are the inputs. First we say we want b to be in a general
purpose register. Then we say that a should be in the same spot as operand 0.

Normally you should also tell GCC that you clobbered the eflags register (the
condition codes), by adding a cc clobber like so:
int a, b;
asm (
        "add %1, %0"
        : "=r" (a)
        : "r" (b), "0" (a)
        : "cc"
);
But on x86 this is implicit: GCC always treats the flags as clobbered by an asm.
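
A complete, compilable version of the add example, as a sketch. It assumes
GCC-style extended asm on x86 and falls back to plain C++ elsewhere; the names
add_asm and check_add_asm are made up for illustration. It uses the "+r"
(read and write) form mentioned above instead of the "=r" plus "0" pair, and
the values are initialized because they arrive as function arguments:

```cpp
#include <cassert>

// Hypothetical example function: returns a + b using one add instruction.
int add_asm(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    asm("add %1, %0"        // AT&T order: %0 = %0 + %1
        : "+r" (a)          // %0: read and written, any general purpose register
        : "r" (b)           // %1: read only
        : "cc");            // the flags are modified
    return a;
#else
    return a + b;           // fallback so the sketch compiles on other targets
#endif
}

// Small self-check that can be called from any test driver.
inline void check_add_asm()
{
    assert(add_asm(2, 3) == 5);
    assert(add_asm(-7, 7) == 0);
}
```

Since a and b are real inputs here, the compiler cannot decide that the asm
works on garbage and drop it.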

So, now to your code.
First, you are entering a world of pain with x87. The stack-based nature of the
x87 does not play well with inline ASM.
But more importantly: never return prematurely from your inline ASM...

What you want looks more like this:

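double DiffAngle(double theta1, double theta2)
{
        /* M_PI comes from <cmath> */
        static const double TWO_PI = 2 * M_PI;
        static const double NEG_TWO_PI = -2 * M_PI;
        double ret;
        int t;

        asm(
                "fldl   %2\n\t"                 /* st0 = theta1 */
                "fsubl  %3\n\t"                 /* st0 = delta */
                "fxam\n\t"                      /* C1 = sign of delta */
                "fnstsw %w1\n\t"
                /* -2*pi first (Bill's TWO_PI label actually held -2*pi) */
                "fldl   %5\n\t"
                "testb  $2, %h1\n\t"            /* C1 is bit 1 of ah */
                "fldl   %4\n\t"                 /* +2*pi */
                "fcmovne        %%st(1), %%st\n\t" /* st0 = copysign(2*pi, delta) */
                "fstp   %%st(1)\n\t"
                "fsubr  %%st(1), %%st\n\t"      /* st0 = delta - st0 */
                "fldpi\n\t"
                "fld    %%st(2)\n\t"
                "fabs\n\t"
                "fcomip %%st(1), %%st\n\t"      /* compare |delta| with pi */
                "fstp   %%st(0)\n\t"
                "fcmovbe        %%st(1), %%st\n\t" /* |delta| <= pi: keep delta */
                "fstp   %%st(1)"
        : /* %0 */ "=t" (ret),
          /* %1 */ "=&a" (t)    /* & = written before all inputs are read */
        : /* %2 */ "m" (theta1),
          /* %3 */ "m" (theta2),
          /* %4 */ "m" (TWO_PI),
          /* %5 */ "m" (NEG_TWO_PI)
        );
        return ret;
}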

I did not check the calc stuff. We tell the compiler that ret will be on the top
of the x87 stack, we put a temporary into the eax register because we clobber
it, we pass in all inputs as memory operands.
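
Since the arithmetic in such a rewrite is easy to get wrong, it may help to
check any hand-written version against the plain C++ function. A small sketch;
DiffAngleRef, MatchesReference and kPi are names made up here, and the grid of
test angles is an arbitrary choice that stays away from the |delta| == pi
boundary, where an x87 version working in extended precision may legitimately
pick the other branch:

```cpp
#include <cassert>
#include <cmath>

const double kPi = 3.14159265358979323846;  // the value M_PI usually expands to

// Plain C++ reference, equivalent to Bill's original function.
double DiffAngleRef(double theta1, double theta2)
{
    const double delta = theta1 - theta2;
    return std::fabs(delta) <= kPi ? delta
                                   : delta - std::copysign(2 * kPi, delta);
}

// Hypothetical test helper: compares any candidate (for example an
// inline-asm version) against the reference on a grid of angle pairs.
template <typename F>
bool MatchesReference(F candidate, double tolerance = 1e-12)
{
    assert(tolerance > 0);
    for (int i = -63; i <= 63; ++i) {
        for (int j = -63; j <= 63; ++j) {
            const double t1 = i * 0.1, t2 = j * 0.1;
            if (std::fabs(candidate(t1, t2) - DiffAngleRef(t1, t2)) > tolerance)
                return false;
        }
    }
    return true;
}
```

Calling MatchesReference(DiffAngle) in a test then shows quickly whether the
asm version agrees with the compiler-generated one.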

This is just a little crash course; there are a lot of details and caveats
to inline ASM, esp. the freedoms the compiler has with your inline ASM.
For all the nitty-gritty details, look into the GCC manual under inline
assembly:
http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Extended-Asm.html
and the chapters after that.

Whatever you do, don't forget the basic rule: GCC does not understand your ASM,
it goes strictly by the inputs/outputs/clobbers. If you lie to GCC there about
what your ASM does, don't be disappointed when GCC does not obey your wishes
(and resorting to a giant "whack-over-the-head" like volatile and a memory
clobber may bite you later).
Faced with an inline ASM, GCC does not drop everything on the floor, crawl into
a hole and turn on dumb mode; it can still make optimization decisions based on
the inputs/outputs.
NB: GCC views an inline ASM as one single instruction, which can have some
implications.

(The whole point of implementing inline ASM the way GCC does is to avoid the
problems other compilers that support inline ASM have with it: for example,
such a compiler has to understand every single instruction (including the
seldom used system level instructions) and falls over when it encounters a new
instruction, or it switches to a totally dumb mode. One can use inline ASM for
fancy stuff (and I like to use it lavishly because it has some upsides), but
there are limits to it. Basically it was intended for getting at that single
special instruction the compiler cannot generate, mainly system level stuff.)

> Thanks in advance,

If I were you, I would forget about the whole thing.
This code is still "miserable", but in this case the compiler can do nothing
about it, because it does not understand ASM (and in particular cannot do
constant folding). In loops more optimizations would be possible, but now you
have to do them by hand. And one big bummer is that you have to write two
versions, one for i386 and one for x86_64, because the latter passes float args
in SSE registers, so the compiler would first have to copy the function's
arguments to the stack so that you can load them from there into the x87.
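
To illustrate the two-versions problem, here is a sketch of a trivial
subtraction written once per target. It assumes GCC-style extended asm; the
function names sub_asm and check_sub_asm are made up, and the third branch is
only there so the example compiles on other machines. On x86_64 the "x"
constraint asks for an SSE register, on i386 the operands go through memory
and the x87:

```cpp
#include <cassert>

// Hypothetical example: a - b, with one asm variant per target.
double sub_asm(double a, double b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    asm("subsd %1, %0"          // SSE2 scalar subtract: %0 = %0 - %1
        : "+x" (a)              // read and written, in an SSE register
        : "x" (b));
    return a;
#elif defined(__GNUC__) && defined(__i386__)
    double r;
    asm("fldl   %1\n\t"         // st0 = a
        "fsubl  %2"             // st0 = a - b
        : "=t" (r)              // result on top of the x87 stack
        : "m" (a), "m" (b));
    return r;
#else
    return a - b;               // fallback for other targets
#endif
}

// Small self-check that can be called from any test driver.
inline void check_sub_asm()
{
    assert(sub_asm(5.0, 2.0) == 3.0);
    assert(sub_asm(0.5, 0.75) == -0.25);
}
```

Even for a single instruction the two variants share nothing but the function
signature, which is the maintenance cost I mean.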

Your not being happy with the floating point code the compiler generates
probably stems from the fact that the compiler has to be very careful when it
comes to floats (NaNs, Inf, loss of precision, excess precision, etc.). So it
probably passes everything off to the proper lib calls when it cannot prove
that emitting the direct instructions is safe.

Additionally I expect that you are simply compiling for a generic x86 machine;
try playing with -march=pentium3 or -msse.

You may want to try -ffast-math, which unleashes GCC to get aggressive with your
floats. If -ffast-math generates wrong results in your program (they are not
really wrong, just effects of "imprecise" float arithmetic that stay hidden
when optimizing less aggressively), with a new enough GCC you can try:
__attribute__((optimize("fast-math")))
inline double DiffAngle_o(double theta1, double theta2)
{
        double delta(theta1 - theta2);

        return std::abs(delta) <= M_PI ?
                delta : delta - copysign(2 * M_PI,delta);
}

to only make this single function fast math.

N.B.: you do have -O set? Probably to -O2? Maybe even to -O3?

The only reason I would look into making this an asm is to diff a complete array
of angles with SSE vector instructions, calculating several values at once,
because, while I love GCC, I do not trust it with vectorization.
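
A rough sketch of what such an array version could look like. Note that it
uses SSE2 compiler intrinsics from <emmintrin.h> rather than inline asm, which
is a substitution on my part, and that the function name DiffAngles and the
array interface are assumptions. It handles two doubles per iteration without
branches and falls back to the scalar formula for a leftover element or when
SSE2 is not available:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical array version: out[i] = DiffAngle(a[i], b[i]).
void DiffAngles(const double* a, const double* b, double* out, std::size_t n)
{
    assert(n == 0 || (a && b && out));
    const double kPi = 3.14159265358979323846;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128d pi = _mm_set1_pd(kPi);
    const __m128d two_pi = _mm_set1_pd(2.0 * kPi);
    const __m128d sign_bit = _mm_set1_pd(-0.0);   // only the sign bit set
    for (; i + 2 <= n; i += 2) {
        __m128d delta = _mm_sub_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i));
        __m128d abs_delta = _mm_andnot_pd(sign_bit, delta);  // clear the sign
        __m128d wrap = _mm_cmpgt_pd(abs_delta, pi);  // all ones where |delta| > pi
        // copysign(2*pi, delta): 2*pi with the sign bit of delta OR-ed in
        __m128d corr = _mm_or_pd(two_pi, _mm_and_pd(sign_bit, delta));
        // subtract the correction only in the lanes selected by the mask
        delta = _mm_sub_pd(delta, _mm_and_pd(wrap, corr));
        _mm_storeu_pd(out + i, delta);
    }
#endif
    for (; i < n; ++i) {          // scalar tail (or everything without SSE2)
        const double delta = a[i] - b[i];
        out[i] = std::fabs(delta) <= kPi
                     ? delta
                     : delta - std::copysign(2.0 * kPi, delta);
    }
}
```

The unaligned loads and stores keep the sketch safe for any array; with
aligned data one could switch to the aligned variants.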


> Bill

Greetings
        Jan
