bug-datamash



From: Tim Rice
Subject: Re: Suggestion: add the possibility to apply multiple operations to a single column (or multiple columns)
Date: Fri, 4 Nov 2022 12:18:17 +0000

Hey Tomas,

Thanks for the request. It's an interesting idea.

I've been struggling to find time for datamash development over the last couple of 
months. I've added your request to my "to do" list. I hope to get back to this 
in the coming months.

The main thing to be careful of with a "mean,max,count"-style operation is how it would 
interact with groupby or crosstab. E.g., I wonder whether "datamash groupby 1 mean,max,count 2" 
makes sense in any way.
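
One way to reason about the groupby question is to treat the comma-separated form as pure shorthand that gets expanded into the existing "operation field" pairs before anything else happens. The Python sketch below does only that expansion. expand_ops is a hypothetical helper written for this illustration, not part of datamash, and it assumes the grouping arguments are passed through unchanged.

```python
def expand_ops(op_spec, field):
    """Expand a comma-separated operation list applied to one field
    into the repeated "op field" pairs used by the current syntax."""
    pairs = []
    for op in op_spec.split(","):
        pairs.extend([op, field])
    return pairs


# What the shorthand would stand for when grouping on field 1
# (assumption: the shorthand is plain syntactic sugar).
args = ["datamash", "groupby", "1"] + expand_ops("mean,max,count", "2")
print(" ".join(args))
```

Under that assumption the grouped case reduces to a command line that is already expressible today, so the open question is mainly whether that reading is the one users would expect.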

Ranges like 1-2,4 could be less straightforward, especially when combined with the idea of applying 
multiple operations at once. When preparing a test for "mean,max,count 1-2,4", should the 
output columns be ordered by field, like "mean_1, max_1, count_1, mean_2, max_2, count_2, mean_4, max_4, count_4", 
or by operation, like "mean_1, mean_2, mean_4, max_1, max_2, max_4, count_1, count_2, count_4", or something else?

Such a test would need to be written out, and whatever order you choose, no 
doubt someone would disagree.
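
The ordering question can be made concrete with a small Python sketch that generates the output column labels for "mean,max,count 1-2,4" either grouped by field or grouped by operation. parse_fields and header_names are made-up helpers for illustration only, and the "op_field" labels simply follow the naming used in this message, not anything datamash is known to print.

```python
def parse_fields(spec):
    """Turn a field spec such as "1-2,4" into [1, 2, 4]."""
    fields = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            fields.extend(range(int(lo), int(hi) + 1))
        else:
            fields.append(int(part))
    return fields


def header_names(op_spec, field_spec, by="field"):
    """Return illustrative column labels in one of two orders."""
    ops = op_spec.split(",")
    fields = parse_fields(field_spec)
    if by == "field":
        # All operations for field 1, then all for field 2, and so on.
        return [f"{op}_{f}" for f in fields for op in ops]
    if by == "operation":
        # One operation across all fields, then the next operation.
        return [f"{op}_{f}" for op in ops for f in fields]
    raise ValueError(f"unknown ordering: {by}")


print(", ".join(header_names("mean,max,count", "1-2,4", by="field")))
print(", ".join(header_names("mean,max,count", "1-2,4", by="operation")))
```

Either order is easy to produce, so the choice is mostly about which one a test should pin down as the expected output.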

Is there any chance you could provide a preliminary patch and tests which would get the 
ball rolling? You could break it up into two patches, one for adding column ranges, and 
one for "lambda-ing" multiple operations over a column.

In the short term, I don't see this getting much traction unless patches are incoming. On 
longer time scales, I'm open to investigating further in a few months. 
But even then, there are no promises the work will be 
done. If it seems like too much refactoring is required, I might decide not to 
pursue it.

If you are interested in submitting a patch, you can use the instructions at 
https://www.gnu.org/software/datamash/ to git clone the latest sources. Then use either "git 
diff" or "git show" to generate a patch. Once you have the patch, you can email it 
as an attachment to this list. We (the GNU Datamash developers) can then apply the patch to our 
local copy of the sources to try it for ourselves.

~ Tim


On Fri, Nov 04, 2022 at 11:12:14AM +0100, Erik Auerswald wrote:
> Hi Tomas,
>
> On Thu, Nov 03, 2022 at 10:17:26AM +0100, Tomas Peitl wrote:
> > My suggestion is to make it possible to write
> >
> > datamash mean,max,count 2
> >
> > instead of
> >
> > datamash mean 2 max 2 count 2
> >
> > i.e. to remove the repetitiveness in the field identifier when
> > multiple operations are needed on the same field. Of course, it
> > doesn't have to be just a single field, you could also do
> >
> > datamash mean,max,count 1-2,4 perc 2-3
> >
> > or anything like that.
>
> I do like the idea, but I do not think that I have sufficient time and
> motivation to implement it, at least not in the short term.
>
> Cheers,
> Erik



