From: Tim Rice
Subject: Re: Suggestion: add the possibility to apply multiple operations to a single column (or multiple columns)
Date: Fri, 4 Nov 2022 12:18:17 +0000
Hey Tomas,
Thanks for the request. It's an interesting idea.
I've been struggling with finding time to do datamash development the last couple of
months. I've added your request to my "to do" list. I hope to get back to this
in the coming months.
The main thing to be careful of with a "mean,max,count"-style operation is how it would
interact with groupby or crosstab. E.g., I wonder whether "datamash groupby 1 mean,max,count 2"
makes sense in any way.
Ranges like 1-2,4 could be less straightforward, especially when combined with the former idea of providing
multiple operations simultaneously. When preparing a test for "mean,max,count 1-2,4", should the
test output columns like "mean_1, max_1, count_1, mean_2, max_2, count_2, mean_4, max_4, count_4",
or "mean_1, mean_2, mean_4, max_1, max_2, max_4, count_1, count_2, count_4", or something else?
Such a test would need to be written out, and whatever order you choose, no
doubt someone would disagree.
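
To make the ordering question concrete, here is a minimal Python sketch (not datamash code; the function names parse_fields, expand, headers and long_form are made up for illustration). It assumes the shorthand would simply expand to the existing "op field" pairs, and shows the two most obvious orders: grouped by field, or grouped by operation.

```python
# Hypothetical sketch: none of these helpers exist in datamash.

def parse_fields(spec):
    """Expand a field spec such as "1-2,4" into [1, 2, 4]."""
    fields = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            fields.extend(range(int(lo), int(hi) + 1))
        else:
            fields.append(int(part))
    return fields


def expand(ops_spec, fields_spec, by="field"):
    """Pair every operation with every field, in one of two orders."""
    ops = ops_spec.split(",")
    fields = parse_fields(fields_spec)
    if by == "field":
        # All operations for field 1, then all for field 2, ...
        return [(op, f) for f in fields for op in ops]
    if by == "operation":
        # The first operation on all fields, then the second, ...
        return [(op, f) for op in ops for f in fields]
    raise ValueError("by must be 'field' or 'operation'")


def headers(pairs):
    """Render pairs as output column names, e.g. "mean_1, max_1"."""
    return ", ".join(f"{op}_{f}" for op, f in pairs)


def long_form(pairs):
    """Render pairs in the existing "op field" syntax."""
    return " ".join(f"{op} {f}" for op, f in pairs)


print(headers(expand("mean,max,count", "1-2,4", by="field")))
# mean_1, max_1, count_1, mean_2, max_2, count_2, mean_4, max_4, count_4
print(headers(expand("mean,max,count", "1-2,4", by="operation")))
# mean_1, mean_2, mean_4, max_1, max_2, max_4, count_1, count_2, count_4
print("datamash " + long_form(expand("mean,max,count", "2")))
# datamash mean 2 max 2 count 2
```

Either order is easy to generate; the hard part is picking one and documenting it.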
Is there any chance you could provide a preliminary patch and tests which would get the
ball rolling? You could break it up into two patches: one for adding column ranges, and
one for applying multiple operations to a column.
In the short term, I don't see much traction unless patches are incoming. On
longer time scales, I'm open to investigating further after a few more months.
But even in those longer time scales, there are no promises the work will be
done. If it seems like too much refactoring is required, I might decide to not
pursue it.
If you are interested in submitting a patch, you can use the instructions at
https://www.gnu.org/software/datamash/ to git clone the latest sources. Then use either "git
diff" or "git show" to generate a patch. Once you have the patch, you can email it
as an attachment to this list. We (the GNU Datamash developers) can then apply the patch to our
local copy of the sources to try it for ourselves.
~ Tim
On Fri, Nov 04, 2022 at 11:12:14AM +0100, Erik Auerswald wrote:
> Hi Tomas,
>
> On Thu, Nov 03, 2022 at 10:17:26AM +0100, Tomas Peitl wrote:
> > My suggestion is to make it possible to write
> >
> >     datamash mean,max,count 2
> >
> > instead of
> >
> >     datamash mean 2 max 2 count 2
> >
> > i.e. to remove the repetitiveness in the field identifier when
> > multiple operations are needed on the same field. Of course, it
> > doesn't have to be just a single field, you could also do
> >
> >     datamash mean,max,count 1-2,4 perc 2-3
> >
> > or anything like that.
>
> I do like the idea, but I do not think that I have sufficient time and
> motivation to implement it, at least not in the short term.
>
> Cheers,
> Erik