Re: datamash performance question
From: Dima Kogan
Subject: Re: datamash performance question
Date: Fri, 25 Jun 2021 14:52:20 -0700
User-agent: mu4e 1.4.15; emacs 28.0.50
Jake VanEck <jake.vaneck@gmail.com> writes:
> So far, this option seems to be putting the data into memory, which I will
> far exceed. After just a few minutes, mawk is using over 3gb of memory and
> nothing is returned per your comment about how it will keep the running
> sums in memory and write them out when the input exhausted.
No. You are probably doing something slightly different from what you
described (unintentionally, most likely). Here is proof that mawk does
not store its input in memory:
seq 1000000000 | mawk '{print $1,int($1%5)}' |
  mawk '{s[$2] += $1;} END { for (k in s) {print k, s[k]; } }'
While that's running, you can look at the memory usage:
while true; do ps -h -O rss | grep mawk; usleep 250000; done |
  mawk -Winteractive '{print $2}'
Note that the memory usage isn't climbing as that computation runs. As
mentioned earlier, it IS storing the running sums in memory, by
necessity: one sum per group. So if the number of groups grows without
bound, the memory it's using will grow without bound too. In the above
command, it's grouping on $2, which is an integer in [0,4]: there are 5
bins. If you swap $1 and $2 in the mawk command, you'll have 1000000000
bins, and the memory usage WILL keep growing as the input is read. This
will happen in datamash and in everything else. You should make sure
that the grouping column you're using is what you think you're using.
And you don't actually have billions of groups, right?
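
One quick way to check the grouping column is to count the distinct keys
on a sample of the input before running the full job. This is a minimal
sketch, assuming whitespace-separated fields. The helper name
count_groups and the file name input.txt are made up for illustration
and are not part of datamash or mawk. It uses plain awk, but mawk should
work the same way.

```shell
# Hypothetical helper: print the number of distinct values found in
# column $1 (a 1-based field index) of whitespace-separated stdin.
# It keeps one array entry per distinct key, so feed it a sample
# (e.g. via head), not the whole multi-gigabyte input.
count_groups() {
    awk -v col="$1" '
        !seen[$col]++ { n++ }   # first time this key is seen: count it
        END { print n + 0 }     # "+ 0" prints 0 on empty input
    '
}

# Example usage (input.txt is a placeholder for the real data file):
#   head -n 1000000 input.txt | count_groups 2
```

If the count on the sample is close to the number of sampled lines, the
column is nearly unique and the per-group sums will not fit in memory
for a large input.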
> Any way to run datamash in parallel?
I want to know that too :)