coreutils

Re: Support for CSV file format on sort


From: Grigoriy Sokolik
Subject: Re: Support for CSV file format on sort
Date: Sun, 31 Jan 2021 01:11:42 +0200

> If you implement csv in sort you’ll have to implement it in head, tail, uniq,
> join, wc, etc. etc. etc...

Could the format processing logic be extracted? Maybe that's also a place
for some kind of abstraction, like a format processor, an unquoted format
processor, etc.?
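
To make the idea of an extracted format processor a bit more concrete, here is a minimal Python sketch of the separator conversion that the quoted message below attributes to the imaginary "csvconv" tool. Everything in it is an assumption for illustration: the function name mirrors the imaginary tool, the parameter names are made up, and Python's standard csv module stands in for whatever quoting logic a real tool would use.

```python
import csv
import io


def csvconv(text, from_sep=",", to_sep="\x1f"):
    """Re-delimit CSV text while observing CSV quoting rules.

    Hypothetical helper: mirrors the imaginary "csvconv -f ... -t ..."
    tool, not an existing program or coreutils interface.
    """
    # newline="" lets the csv module handle line endings itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=from_sep)
    out = io.StringIO(newline="")
    # QUOTE_MINIMAL quotes a field only if it contains the output
    # separator, a quote character, or a line terminator character.
    writer = csv.writer(
        out,
        delimiter=to_sep,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    data = 'name,note\n"Doe, Jane",x\n'
    converted = csvconv(data)                 # comma -> Unit Separator
    restored = csvconv(converted, "\x1f", ",")  # and back again
    print(repr(converted))
    print(repr(restored))
```

With the Unit Separator as the delimiter, the field containing a comma no longer needs quotes, so a line-oriented tool can split on 0x1f directly; converting back restores the quotes. Fields with embedded newlines would still come out quoted here, so this sketch does not cover the NUL-terminated variant mentioned in the quoted message.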

On Sun 31. Jan 2021 at 0.59, Erik Auerswald <auerswal@unix-ag.uni-kl.de>
wrote:

> Hi,
>
> On 30.01.21 21:28, Eric Fischer wrote:
> > A couple of years ago I went down this route of thinking I would add CSV
> > support to sort, and then let myself get distracted into trying to follow
> >
> https://paulfitz.github.io/2017/01/24/the-year-of-poop-on-the-desktop.html
>
> Well, but not everyone is using PSV format, many are using some
> kind of CSV format.  I sometimes use CSV (or SSV, semicolon
> separated values ;) as a simple compatibility format when working
> with people not using the GNU operating system.
>
> Even with ASCII there are seldom used characters that look helpful
> for character separated value files, e.g., "Unit Separator" (0x1f),
> to practically get rid of the need for quoted fields.
>
> But since not everybody uses those characters already, a tool that
> bridges the worlds of RFC 4180 CSV(*) and GNU Coreutils might be
> handy.
>
> Seldom used ASCII (i.e., single byte) characters could be used as
> field separator to enable working with GNU tools, even if this is
> just used in a pipeline, but never seen by the user:
>
> csvconv -f, -t$'\x1f' data.csv | sort -t$'\x1f' | csvconv -f$'\x1f' -t,
>
> (This uses an imaginary CSV tool "csvconv" to convert from (-f) one
> separator to (-t) another while observing CSV quoting rules.)
>
> Disclaimer: I did not check if sort works correctly with "-t$'\x1f'".
>
> To allow newlines inside a field one could terminate each row of CSV
> data with NUL, and use "sort -z".  Thus the imaginary csvconv could
> use "--input-zero-terminated" and "--output-zero-terminated" options
> as well.
>
> The imaginary "csvconv"'s adherence to (generalized) CSV quoting
> rules would be the primary difference to "tr", "sed", or "awk".
>
> Thanks,
> Erik
>
> (*) RFC 4180 requires CRLF instead of LF as end-of-line sequence, but
>      many implementations just use the native end-of-line sequence.
>
> --
Thanks!
Best regards,
Grigorii

