Re: [Bug-datamash] histograms and/or CDFs


From: Assaf Gordon
Subject: Re: [Bug-datamash] histograms and/or CDFs
Date: Fri, 15 Aug 2014 10:12:16 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Icedove/24.7.0

Hello Miah,

On 08/13/2014 08:48 PM, Miah Ness wrote:
> Hi Assaf Gordon,
>
> Thanks for developing this nice tool. I've been looking for something
> like this for years, and always resorted to perl/awk one-liners.
>
> Would you be interested in integrating support for histograms and/or
> cumulative distribution functions?
>
> I'm thinking about something as follows:
>
> # datamash hist:0:10:5 1 data

Thanks again for this idea.
I have a working beta version, which I'll be happy to get feedback on.

A beta tarball is available here:
http://files.housegordon.org/datamash/src/datamash-1.0.6.38-ae71.tar.gz
The code is here:
http://gitweb.housegordon.org/datamash.git/shortlog/refs/heads/binning

(Note that this is a beta version with several new operations and features;
it is not stable yet.)


I've added two operations: "bin" and "cumsum" (cumulative sum).

"bin", like you've suggested, takes two parameters: bucket size and offset 
(offset is optional and defaults to zero).

This would bin the values in column 1 into buckets of size 5:
    $ seq 10 | datamash bin:5 1
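
To make the arithmetic concrete, here is a minimal awk sketch of what binning with bucket size 5 appears to do. It assumes each value is rounded down to the nearest multiple of the bucket size, which is consistent with the histogram counts shown in this message but is not taken from the datamash source. The name bin_awk is hypothetical (not part of datamash), and the optional offset is not modeled.

```shell
# Hypothetical helper, not part of datamash: emulate "bin:SIZE" on column 1.
# Assumption: a value maps to the largest multiple of SIZE that is <= the value.
bin_awk() {
    awk -v size="$1" '{
        b = int($1 / size) * size      # int() truncates toward zero
        if (b > $1) b -= size          # so step down once for negative values
        print b
    }'
}

# Values 1..4 fall into bin 0, 5..9 into bin 5, and 10 into bin 10.
seq 10 | bin_awk 5
```

How the beta handles negative values is an assumption here; the sketch simply rounds them down as well.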

Combining with the existing 'group + count' features, you'll get the histogram:

    $ seq 10 | datamash bin:5 1 | datamash -g1 count 1
    0     4
    5     5
    10    1

Combining with "cumsum" on column 2 you'll get the cumulative counts (an unnormalized CDF):

    $ seq 10 | datamash bin:5 1 | datamash -g1 count 1 | datamash -f cumsum 2
    0       4       4
    5       5       9
    10      1       10
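
For comparison, the same bin, count, and cumulative-sum steps can be sketched with plain awk and sort, which may help when checking results without the beta build. This is an illustration only: the name cdf_awk is hypothetical, and the binning rule (round down to a multiple of the bucket size) is an assumption inferred from the output above.

```shell
# Hypothetical helper, not part of datamash: bin, count, and accumulate.
# Usage: cdf_awk SIZE < data    (reads numeric values from column 1)
cdf_awk() {
    awk -v size="$1" -v OFS='\t' '{
        b = int($1 / size) * size      # assumed binning rule: round down
        if (b > $1) b -= size          # int() truncates, so fix negatives
        count[b]++                     # histogram: occurrences per bin
    }
    END { for (b in count) print b, count[b] }' |
    sort -n |                          # awk array order is unspecified
    awk -F'\t' -v OFS='\t' '{ total += $2; print $1, $2, total }'
}

seq 10 | cdf_awk 5
```

The columns are bin, count, and running total, matching the three columns printed above. Dividing the running total by its final value would give a normalized CDF.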

On 08/15/2014 12:55 AM, Miah Ness wrote:

>> Additionally, what do you think of splitting this into two
>> operations: 'bin' and 'count' (using the existing 'count') ?
>>
>> Such as:
>> $ cat data | datamash bin:0:10:5 1 | datamash -g 1 count 1
>> 0    3
>> 10   2
>> 30   4


> I'm not a big fan of this option, though I admit it still meets my
> needs. Do you foresee other use cases for having the operations
> split?

The current syntax is not yet as concise as it could be, mainly
because of implementation details.

I'm contemplating an improved syntax which will allow mixing per-line and 
per-group operators, perhaps something like:

    "datamash bin:10 1 groupby 1 count 1 cumsum 2"

and then "datamash cdf:10 1" would be a simple shortcut of that.
But that requires some more thinking and will take some time to implement.

In the meantime, I can suggest adding a 'dtm' alias for 'datamash' if you type
it often, or a tiny shell function that wraps it all together. You could run:
    # Create the shell function (or put it in ~/.bashrc )
    $ datamash_cdf() { BINSIZE=$1 ; COL=$2 ;
                 # 'bin' prints only the binned column, so the later
                 # stages always see it as column 1 and the count as column 2
                 datamash bin:"$BINSIZE" "$COL" |
                     datamash -g 1 count 1 |
                       datamash -f cumsum 2 ; }
    # Use the function
    $ seq 10 | datamash_cdf 5 1
    0       4       4
    5       5       9
    10      1       10


> I haven't looked at the project code much, however are you open to
> receiving patches?

Of course! Patches are always welcome.
But just a reminder that this is a GNU project, and all code must be licensed 
under GPLv3-or-later.


Regards,
 - Assaf


