Re: glibc segfault on "special" long double values is _ok_!?

From: James Youngman
Subject: Re: glibc segfault on "special" long double values is _ok_!?
Date: Fri, 8 Jun 2007 11:54:17 +0100

On 6/8/07, Jan-Benedict Glaw <address@hidden> wrote:

In this setup, you control all the cluster and you can ensure that all
nodes use the same hardware and that no node will send data over the
network that wasn't the result of CPU calculation.

In the ticket, the case was different in that he got data fed in that
most probably was _not_ the result of a calculation done by the CPU,
but hand-crafted.

This won't happen in your controlled cluster.

It would be nice if that were true, but it is not, as I already wrote:

> Can the network infrastructure corrupt bits in the exchanged data?
> Yes.  Not often, but it does happen.  Same for the RAM.  So what do we
> do when we detect a problem?  Print debugging messages, as Nix already

Stop.  Would you continue with known-wrong data once detected?

Exactly.  Stop and prompt diagnosis of the problem.

hexdump (&my_long_double, sizeof my_long_double);
kill (getpid (), SIGABRT);

or just call abort() which is designed for the purpose.

That way, you get a nice core dump and can call GDB with it. With
"clean" floats, just use GDB's "print" to print it (or even call
printf() with it.)

If printf fails on the offending bit pattern, presumably that is not
going to help.
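
One way to log the offending value without going through printf's floating-point formatting is to format its raw bytes as hex by hand. The following is only a sketch: the name hex_bytes and its interface are hypothetical and not part of glibc or any existing library.

```c
#include <stddef.h>

/* Hypothetical helper: write the bytes at `data` into `out` as
 * space-separated lowercase hex ("00 ff 1a"), always NUL-terminated.
 * No floating-point formatting is involved, so a malformed long double
 * cannot upset it.  Returns the number of input bytes formatted, which
 * is less than `len` if `out` was too small. */
size_t hex_bytes(const void *data, size_t len, char *out, size_t outsz)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *p = data;
    size_t i, pos = 0;

    if (outsz == 0)
        return 0;

    for (i = 0; i < len; i++) {
        /* Two hex digits, plus a separating space after the first byte. */
        size_t needed = (i == 0) ? 2 : 3;

        /* Stop early if this byte plus the trailing NUL will not fit. */
        if (pos + needed + 1 > outsz)
            break;
        if (i != 0)
            out[pos++] = ' ';
        out[pos++] = digits[p[i] >> 4];
        out[pos++] = digits[p[i] & 0x0f];
    }
    out[pos] = '\0';
    return i;
}
```

A caller could pass &my_long_double and sizeof my_long_double, write the resulting string to the log with fputs() or write(), and only then call abort() to get the core dump.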

> Could we just print the raw bytes as hex or something?  Sure, but then
> we'd need to interpret that anyway.  The days of manually poring over
> core dumps that came out of the line printer should be behind us.

Once you have detected madness somewhere in your data, be sceptical of it.

Obviously.  There needs to exist some strategy where the offending
data can be logged and analysed.  The mechanism for problem diagnosis
needs to scale.
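
As an illustration of a check that scales, suspect values can be flagged from their raw bytes before anything tries to format them. The sketch below assumes the x87 80-bit extended format as stored in memory on little-endian x86: a 64-bit significand with an explicit integer bit, followed by a 15-bit exponent and a sign bit. The function name is hypothetical, and other long double formats would need a different test.

```c
#include <stddef.h>

/* Hypothetical check, assuming the x87 80-bit extended layout on
 * little-endian x86:
 *   b[0..7]  64-bit significand, bit 63 is the explicit integer bit
 *   b[8..9]  15-bit exponent, with the sign in the top bit of b[9]
 * Returns 1 for encodings the FPU does not produce as results:
 * pseudo-denormals (exponent zero, integer bit set) and unnormals,
 * pseudo-NaNs and pseudo-infinities (exponent non-zero, integer bit
 * clear).  Returns 0 for zeros, denormals, normals, infinities and
 * ordinary NaNs. */
int x87_encoding_is_suspect(const unsigned char b[10])
{
    unsigned exponent = ((unsigned)(b[9] & 0x7f) << 8) | b[8];
    int integer_bit = (b[7] & 0x80) != 0;

    if (exponent == 0)
        return integer_bit;   /* pseudo-denormal */
    return !integer_bit;      /* unnormal, pseudo-NaN or pseudo-infinity */
}
```

A receiving node could run such a check on incoming data, log the bytes of anything flagged, and then stop, rather than hand the value to printf.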

You can fully control your cluster, but in the case discussed here,
the data was injected by a non-controlled source.

No item of hardware is fully under control either.  Push enough bits
through it, some will get corrupted.  As I said in the email to which
you are replying, this happens in practice, for real.

