bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [BUG] fractional bin sizes do not work in some locales (e.g., de_DE.


From: Erik Auerswald
Subject: Re: [BUG] fractional bin sizes do not work in some locales (e.g., de_DE.UTF-8)
Date: Sat, 25 Jun 2022 13:22:58 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

Hey Tim,

On 25.06.22 06:00, Tim Rice wrote:
[Erik Auerswald wrote:]

I suspect the operation parsing code, but have not yet looked at it.

I've been having a look. It comes down to the function `scanner_get_token()` in op-scanner.c.

This function returns TOK_FLOAT only when it sees a digit followed by a period. TOK_COMMA is handled separately and is never considered for being part of a float.

Turning the scanner code into a bunch of special cases for each locale would be a nasty piece of work. Let's not head in that direction.

The files src/key-compare.[ch] already work with the locale specific
decimal point.  I'd expect that we can use this.

Including src/key-compare.h provides a variable for the decimal point:

    src/key-compare.h:extern int decimal_point;

Invoking src/key-compare.c::init_key_spec() sets it.

Hm, this seems to be used only in decorate(1), and just including
src/key-compare.h in src/op-scanner.c does not work at all.

But I do think that this general idea should work:

* Add an init_scanner() function:
  * Read locale-specific decimal separator value
  * Save locale-specific decimal separator value in a variable
* Change scanner_get_token() to use the decimal separator variable
  instead of hard-coding '.'

I plan to further explore this idea.

Hm, thinking some more, this might interact badly with field lists:
is "1,2" a list of two fields (1 and 2) or the floating point number
"1,2".

I think we might want something like a function pointer which switches between scanners according to what operation is being parsed. We might have something like a "bin_scanner_get_token()" function. It would be set as the scanner whenever the current operation is `bin`. It knows everything after the ':' delimiter should be a float, and takes advantage of locale-aware code to do the right thing.

I'd rather look into changing out just the decimal separator, not
the complete scanner.  My first impression is that replacing the
complete scanner would either result in lots of duplicated code or
require quite a bit of refactoring.

OK, at first glance that looks like more than a small change, because
the scanner has some hard-coded "special characters", comma is one of
them, and there is currently no way to change that.

It might be simpler to extend backslash-escaping in the scanner from
the current identifiers-only support to also support it in operation
arguments, or even more generally.

Using function pointers like this could be a useful first step towards the modularization work proposed for v2.0, too. Admitting different scanners on an ad hoc basis will make it easier to introduce new ops, or do more sophisticated things with the current ops.

Possible, I would need to think more about this for a meaningful
comment. ;-)

Given the complexity of the work, I wonder what you think of defering this until after the v1.8 release?

Yes, I think that can be deferred after v1.8.  I would not want this
corner case issue to hold up the release.  This also holds for my
idea of extending backslash-escaping.

Locales with a comma as decimal separator have some unintended
interaction with several features, at least the default CSV value
delimiter and the field list value separator in operation parsing
(as seen in the token scanning part).

A widely used proprietary spreadsheet program that shall remain
unnamed uses a semicolon as default CSV value separator in Germany
(personal experience) and presumably uses a comma in the US (from
what I read on the web).  I like the idea of replacing comma with
a different character when input numbers are expected to comprise
a comma.  But this is not backwards compatible and might need its
own set of options (e.g., to change the default comma replacement),
documentation, and perhaps even informational output (e.g.,
"datamash: decimal separator is a comma, using semicolon instead
of comma as value separator") [with all the questions concerning
when a non-interactive command line program is supposed to emit
extra messages, where to send those, is it appropriate to send
something informational to STDERR, should there be --verbose
and/or --quiet options, …].  I would prefer to discuss and build
consensus regarding something like this before starting to
implement it.  IMHO this would be "v2.0+" material.

Best regards,
Erik



reply via email to

[Prev in Thread] Current Thread [Next in Thread]