Another appeal for `uniq --stream`


From: Tony Fischetti
Subject: Another appeal for `uniq --stream`
Date: Fri, 22 Jan 2021 13:34:12 -0500

For a while, I've been using a small program I wrote (with help from a
GPL-licensed AVL tree library) to filter duplicate lines from unsorted
input. I wanted to see whether this could be added to `uniq` (or
provided some other way), but I saw that a nearly identical proposal
(https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html)
had already been put forth and rejected.

I thought it might be worth making the case again, with an expanded
rationale, especially as I already have a proof of concept (available
below) and I'm willing to write the code, documentation, translations,
etc.

It was said in the replies to the original proposal that it's up to
the user to decide whether they want to run `sort` and then pipe its
output to `uniq`. But in all the years I've used coreutils, I've never
once used `uniq` without `sort`. I've spoken to many others, and their
experience comports with mine.

But this was not because I wanted the output to be sorted; in fact,
I specifically didn't. Most times, I want (and even require that) the
duplicated lines be stripped as soon as the data becomes available,
and remain in the original order. This is especially useful for log
files, journals, output from statistical software, etc.

The pervasive `sort | uniq` idiom, of course, besides changing the
order of the data, carries the other problem of completely arresting
the flow of data (as `sort` has to read all of the data in the pipe
before it can emit anything). I view this as a limitation since it
counteracts one of the main benefits of using a CLI pipeline, namely
that the whole pipeline works in unison and reads data in a streaming
fashion.
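
To make the streaming behaviour concrete, here is a minimal sketch in
Python. It is only an illustration of the idea, not the proof of
concept linked below: the names `stream_unique` and `main` are
hypothetical, and a built-in hash set stands in for the AVL tree.

```python
import sys
from typing import Iterable, Iterator


def stream_unique(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order.

    Assumption for illustration: a hash set replaces the AVL tree.
    Each line is yielded as soon as it is read, so nothing waits
    for end of input.
    """
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def main() -> None:
    # Hypothetical entry point: filter stdin to stdout, flushing after
    # every line so later pipeline stages are not stalled by buffering.
    for line in stream_unique(sys.stdin):
        sys.stdout.write(line)
        sys.stdout.flush()
```

Unlike `sort | uniq`, a filter of this shape passes the first
occurrence of a line along immediately and keeps the original order.
The cost is that memory use grows with the number of distinct lines
seen so far.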

The most sensible place to add this functionality (which I think many
people would enjoy) is as an option for the `uniq` command (like
`uniq --stream` or similar).

It was also said in the original replies that this might constitute
feature creep, and `uniq` as it stands now is fewer than 200 lines of
code. I'm sympathetic to this view, especially since adding a tree
or hash to `uniq` would considerably increase its size.

But maybe that's OK, especially if it brings the functionality of
`uniq` more in line with people's expectations of the command.

It would also not burden other users with increased memory usage;
the tree would be initialized only if the user explicitly requested
this option.


Proof of concept:
https://github.com/tonyfischetti/eweniq (apologies, this is from before
I knew better and would have chosen a freer host for version control)


