bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: 2 issues with binning


From: Erik Auerswald
Subject: Re: 2 issues with binning
Date: Wed, 22 Jun 2022 16:52:32 +0200

Hi Andreas,

On Sun, Jun 19, 2022 at 11:54:48PM +0200, Andreas Schamanek wrote:
> [...]
> ## Possible issue due to binary floating point arithmetics:
> 
> $ printf '%s\n' 4.19 4.2 4.21 | datamash --full bin:0.1 1
> 4.19  4.1
> 4.2   4.1
> 4.21  4.2
> 
> Of course, 4.2 should bin to 4.2, unless I mistaken.

I think that this is indeed a floating point issue:

$ printf '%s\n' 4.19 4.2000000000000000001 4.21 | env LC_NUMERIC=C datamash 
--full bin:0.1 1 | column -t
4.19                   4.1
4.2000000000000000001  4.2
4.21                   4.2

$ printf '%s\n' 4.19 4.20000000000000000001 4.21 | env LC_NUMERIC=C datamash 
--full bin:0.1 1 | column -t
4.19                    4.1
4.20000000000000000001  4.1
4.21                    4.2

> ## Possible issue with binning negative numbers:
> 
> $ printf '%s\n' 0 1 2  | datamash --full bin:2 1
> 0     0
> 1     0
> 2     2
> 
> For positive numbers, the bins are inclusive ("[") on the lower end,
> exclusive on the upper end (")"), i.e. here they are [0,2) and
> [2,4).
> 
> I expected this type of binning to continue for negative numbers,
> i.e. that the bins left of [0,2) are [-2,0) and next one would be
> [-4,2). However:
> 
> $ printf '%s\n' -2 -1 0  | datamash --full bin:2 1
> -2    -4
> -1    -2
> 0     0
> 
> I was expecting -2 to map to -2. Maybe, bin:1 shows my concerns even
> better:
> 
> $ printf '%s\n' -2 -1 0 | datamash --full bin:1 1
> -2    -3
> -1    -2
> 0     0
> 
> Curious, what you think!

I, too, think that this is unexpected.

Using a bin size of 1 as example, GNU Datamash uses the following bins:

..., (-3,-2], (-2,-1], (-1,-0], [0,1), [1,2), [2,3), ...

It seems to me as if it were less suprising to use:

..., [-3,-2), [-2,-1), [-1,0), [0,1), [1,2), [2,3), ...

The current behavior is documented (with an example) at
<https://www.gnu.org/software/datamash/manual/html_node/Binning-numbers.html>
and checked with test "bin5" in file "tests/datamash-tests-2".

The following patch might adjust binning to use the less suprising bins
(hardly tested, may not work that well):

---------------- 8< ----------------
diff --git a/src/field-ops.c b/src/field-ops.c
index a426b90..e1588b5 100644
--- a/src/field-ops.c
+++ b/src/field-ops.c
@@ -566,10 +566,11 @@ field_op_collect (struct fieldop *op,
     case OP_BIN_BUCKETS:
       {
         const long double val = num_value / op->params.bin_bucket_size;
-        modfl (val, & op->value);
+        const long double frac = modfl (val, & op->value);
         /* signbit will take care of negative-zero as well. */
-        if (signbit (op->value))
+        if (signbit (op->value) && !is_zero (frac))
           --op->value;
+        op->value = pos_zero (op->value);
         op->value *= op->params.bin_bucket_size;
       }
       break;
---------------- >8 ----------------

The above patch is intended to illustrate a possible way to change the
implementation.  It is not intended to be merged as-is.

The comment regarding negative-zero would need to be adjusted as well, the
new line with "pos_zero" is required to bin -0 into 0.  The tests would
need adjustments as well, at least "bin5" from "tests/datamash-tests-2".
The documentation would need to be adjusted, too.  The new behavior
would also need to be mentioned in the "NEWS" file.

HTH,
Erik



reply via email to

[Prev in Thread] Current Thread [Next in Thread]