coreutils

Re: Support for CSV file format on sort


From: Erik Auerswald
Subject: Re: Support for CSV file format on sort
Date: Sat, 30 Jan 2021 23:58:48 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

Hi,

On 30.01.21 21:28, Eric Fischer wrote:
> A couple of years ago I went down this route of thinking I would add CSV
> support to sort, and then let myself get distracted into trying to follow
> https://paulfitz.github.io/2017/01/24/the-year-of-poop-on-the-desktop.html

Well, but not everyone is using PSV format, many are using some
kind of CSV format.  I sometimes use CSV (or SSV, semicolon
separated values ;) as a simple compatibility format when working
with people not using the GNU operating system.

Even plain ASCII has seldom-used characters that look helpful for
character-separated value files, e.g., "Unit Separator" (0x1f),
which would practically eliminate the need for quoted fields.

But since not everybody uses those characters already, a tool that
bridges the worlds of RFC 4180 CSV(*) and GNU Coreutils might be
handy.

Seldom-used ASCII (i.e., single-byte) characters could serve as the
field separator to enable working with GNU tools, even if the separator
is used only inside a pipeline and never seen by the user:

csvconv -f, -t$'\x1f' data.csv | sort -t$'\x1f' | csvconv -f$'\x1f' -t,

(This uses an imaginary CSV tool "csvconv" to convert from (-f) one
separator to (-t) another while observing CSV quoting rules.)

Disclaimer: I did not check if sort works correctly with "-t$'\x1f'".
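
A quick way to check this is sketched below. It assumes a sort that
accepts an arbitrary single-byte argument to "-t" (GNU sort does) and
uses made-up sample data. The separator is produced with printf so the
snippet also runs in shells without $'...' quoting; in bash,
-t$'\x1f' passes the same byte.

```shell
# Build the ASCII Unit Separator (0x1f, octal 037) portably.
sep=$(printf '\037')

# Three sample records with two fields each. The first field contains
# a comma, which needs no quoting because the comma is not the separator.
# Sort numerically on the second field.
sorted=$(printf 'b,x%s2\na,y%s3\nc,z%s1\n' "$sep" "$sep" "$sep" |
  LC_ALL=C sort -t"$sep" -k2,2n)

# Show the result with the separator made visible as '|'.
printf '%s\n' "$sorted" | tr '\037' '|'
```

If the records come out ordered by the second field (1, 2, 3), the
separator was honored.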

To allow newlines inside a field one could terminate each row of CSV
data with NUL, and use "sort -z".  Thus the imaginary csvconv could
use "--input-zero-terminated" and "--output-zero-terminated" options
as well.
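
The "sort -z" half of that idea can be tried without any CSV tool. The
sketch below uses made-up records and assumes a sort that supports "-z"
(GNU sort does). It feeds NUL-terminated records, one of which contains
a newline, and replaces each NUL with ';' afterwards only to make the
record boundaries visible.

```shell
# Three NUL-terminated records; the first one spans two lines.
sorted=$(printf 'b line1\nb line2\0c\0a\0' |
  LC_ALL=C sort -z |
  tr '\000' ';')

# Each ';' marks the end of one record; the embedded newline stays
# inside its record instead of splitting it.
printf '%s\n' "$sorted"
```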

The imaginary "csvconv"'s adherence to (generalized) CSV quoting
rules would be the primary difference from "tr", "sed", or "awk".

Thanks,
Erik

(*) RFC 4180 requires CRLF instead of LF as end-of-line sequence, but
    many implementations just use the native end-of-line sequence.


