bug-gawk

Re: Quotes being stripped by "--csv"


From: Ed Morton
Subject: Re: Quotes being stripped by "--csv"
Date: Thu, 23 Nov 2023 11:35:52 -0600
User-agent: Mozilla Thunderbird

See my responses inline below.

On 11/19/2023 3:32 PM, Ben Hoyt wrote:
Hi Ed and Arnold,

Yes, this shows the difference between CSV for input mode and for output mode. The --csv option is only about "CSV input mode", and doesn't affect output, as you observe.

My post is not about input mode vs output mode; it's entirely about input mode: a way to either leave the quotes alone or strip them when populating fields, that is all. Output is left entirely up to the user in either case.


I think --csv unescaping doubled quotes and stripping the start and end quotes (as it does now) is the correct behaviour,

It is 1 of the 2 possible correct behaviors, and it's the one that I expect will be most useful most of the time.

because those extra quotes aren't part of the actual field value, so I think the behaviour you propose with --csv=q would be an unhelpful patch over the problem -- you can't actually use the value in the field without the quotes escaped/removed.

Yes, you can, e.g. with `print $1, $2` when you just want whichever field(s) you print to retain their quotes, if any.

You could make the same "they aren't part of the value of the field/record" argument for not needing the string that matched RS, or for not needing the spaces that separate fields, and yet in gawk we have `RT` and the 4th arg to `split($0,flds,fs,seps)`, both of which are, at times, very useful.


GoAWK and Frawk both had more extensive CSV support before awk and Gawk. (I based GoAWK's approach on Frawk's.) We have separate CSV controls for input mode and output mode. So you can say "goawk -i csv" for CSV input mode ("goawk --csv" is equivalent to that) but you can also say "goawk -o csv" for CSV output mode, and of course "goawk -i csv -o csv" for CSV input mode and output mode. Full GoAWK CSV docs here: https://github.com/benhoyt/goawk/blob/master/docs/csv.md

Original awk and Gawk don't have the concept of CSV output mode. I suspect they probably won't add it in a hurry. In the new second edition of "The AWK Programming Language" by Kernighan et al, chapter 3 page 39 basically says you should do the output part manually. To quote:

-----
By the way, generating CSV is straightforward. Here’s a function to_csv that converts a string to a properly quoted string by doubling each quote and surrounding the result with quotes. It’s an example of a function that could go into a personal library.

# to_csv - convert s to proper "..."
function to_csv(s) {
  gsub(/"/, "\"\"", s)
  return "\"" s "\""
}

(Note how quotes are quoted with backslashes.)

Yes, that's trivial and obvious. What is neither trivial nor obvious is something like:

function csvsplit(       tail,nf) {
    tail = $0
    $0 = ""
    nf = 0
    while ( (tail != "") && match(tail,/([^,]*)|("([^"]*|"")*")/) ) {
        $(++nf) = substr(tail,RSTART,RLENGTH)
        tail = substr(tail,RSTART+RLENGTH+1)
    }
}

{
    csvsplit()
    print $1, $2
}

so when using `--csv` (to handle newlines within quoted fields) we can populate and then print fields with the original quotes intact. Afterwards, though, we have to be careful not to change `$0`, or the fields will be re-split with the quotes stripped and we'd have to do the above again.
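
The splitting itself can be tried on single-line records without `--csv`. This is only a sketch, assuming any POSIX awk. The wrapper name `keep_quotes` is hypothetical, and the regex is anchored and reordered relative to the one above, so treat it as a variant rather than the exact function:

```shell
# keep_quotes FILE: print the first two CSV fields with their quotes intact.
# Assumes every record fits on one line (no --csv, so no embedded newlines).
keep_quotes() {
    awk -v OFS=',' '
    # Split $0 on commas that are outside quotes, keeping the quotes.
    function csvsplit(    tail, nf) {
        tail = $0
        $0 = ""
        nf = 0
        # A field is a quoted string (with "" escapes) or a run of non-commas.
        while (tail != "" && match(tail, /^("([^"]|"")*"|[^,]*)/)) {
            $(++nf) = substr(tail, RSTART, RLENGTH)
            tail = substr(tail, RSTART + RLENGTH + 1)   # skip field and comma
        }
    }
    { csvsplit(); print $1, $2 }
    ' "$1"
}
```

Given a file containing `"foo,""bar""",2,3` and `1,"foo,bar",3`, this prints `"foo,""bar""",2` and `1,"foo,bar"`. Like the function above, it drops a trailing empty field: for a record ending in a comma the loop stops as soon as `tail` is empty.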


We can use this function within a loop to insert commas between elements of an array to create a properly formatted CSV record for an associative array, or for an indexed array like the fields of a line, as illustrated in the functions rec_to_csv and arr_to_csv:

# rec_to_csv - convert a record to csv
function rec_to_csv(    s, i) {
  for (i = 1; i < NF; i++)
    s = s to_csv($i) ","
  s = s to_csv($NF)
  return s
}

# arr_to_csv - convert an indexed array to csv
function arr_to_csv(arr,    s, i, n) {
  n = length(arr)
  for (i = 1; i <= n; i++)
    s = s to_csv(arr[i]) ","
  return substr(s, 1, length(s)-1) # remove trailing comma
}
-----

Sure, simple stuff.
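
Note that the book's `to_csv` quotes every field unconditionally. A sketch of a variant that quotes only when needed, using the same test as the one-liner in the original report further down (comma, newline, or double quote present), assuming any POSIX awk; `to_csv_min` and `csv_quote_demo` are hypothetical names:

```shell
csv_quote_demo() {
    awk '
    # Quote s only when it contains a comma, a double quote, or a newline.
    function to_csv_min(s) {
        if (s ~ /[,"\n]/) {
            gsub(/"/, "\"\"", s)      # double any embedded quotes
            return "\"" s "\""
        }
        return s                      # plain values pass through unchanged
    }
    BEGIN {
        print to_csv_min("plain") "," to_csv_min("foo,bar") "," to_csv_min("say \"hi\"")
    }'
}

csv_quote_demo
```

This prints `plain,"foo,bar","say ""hi"""`, so fields that were fine unquoted stay unquoted.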

I don't love the lack of a built-in way to do this, hence the support for "CSV output mode" in GoAWK. But it is what it is for now. I'd definitely be interested to hear what Kernighan has to say.

IMHO we don't need a CSV output mode; we just need a simple way to not strip quotes when splitting input into fields. Everything else the user might want to do from there is trivial.

    Ed.


Cheers,
Ben.

On Sun, 19 Nov 2023 at 21:37, <arnold@skeeve.com> wrote:

    Hi.

    I understand what you're saying. I don't have an answer at this point.
    I think it would be helpful for you to open an issue on the GitHub repo
    for Brian Kernighan's awk, as CSV handling was his idea. Maybe he can
    come up with something.

    In any case, opening an issue there will allow for wider discussion
    amongst AWK implementors.

    Thanks,

    Arnold

    Ed Morton <mortoneccc@comcast.net> wrote:

    > Someone posted a question on stackoverflow about how to print just the
    > first 2 fields from a CSV so given this input:
    >
    >     "foo,""bar""",2,3
    >     1,"foo,bar",3
    >     1,"foo,
    >     bar",3
    >
    > the expected output would be:
    >
    >     "foo,""bar""",2
    >     1,"foo,bar"
    >     1,"foo,
    >     bar"
    >
    > I thought I'd answer with "--csv" but when I tried it I got this
    > output:
    >
    >     $ awk --csv -v OFS=',' '{print $1, $2}' file.csv
    >     foo,"bar",2
    >     1,foo,bar
    >     1,foo,
    >     bar
    >
    > The quotes around the fields that need to be quoted (and were quoted in
    > the input) are missing and the escaped double quotes (`""`) around the
    > first `bar` have become individual (`"`) so the output is no longer
    > valid CSV.
    >
    > I could get it back to valid CSV and produce the expected output by
    > writing this or similar:
    >
    >     $ awk --csv -v OFS=',' '{
    >         for (i=1; i<=NF; i++) {
    >             gsub(/"/,"\"\"",$i)
    >             if ($i ~ /[,\n"]/) { $i="\"" $i "\"" }
    >         }
    >         print $1, $2
    >     }' file.csv
    >     "foo,""bar""",2
    >     1,"foo,bar"
    >     1,"foo,
    >     bar"
    >
    > but that's counter-intuitive and frustrating to have to write and I
    > think many users wouldn't know how to, or understand why they need to,
    > write that code to get valid CSV output.
    >
    > I understand there is a benefit to stripping double quotes for working
    > on field contents and I appreciate that you need to make this work with
    > existing functionality (`OFS` values, etc.) so I understand why `--csv`
    > can't simply always output valid CSV and I also understand the "don't
    > provide constructs to do things that are easy to do with existing
    > constructs" awk mantra to avoid code bloat, but there has to be a way to
    > make it easier for people to just print a couple of fields from valid
    > CSV input and have the output still be valid CSV.
    >
    > If there was a way to have `--csv` optionally NOT strip double quotes
    > when reading the fields then that'd solve the problem, e.g. `--csv=q` or
    > `--csvq` or similar to indicate quotes in and around fields should be
    > retained. If we had that then I could write something like:
    >
    >      awk --csv=q -v OFS=',' '{print $1, $2}' file.csv
    >
    > or, less desirably as it's longer and can't be set on the command line
    > but would be better than nothing:
    >
    >      awk --csv -v OFS=',' 'BEGIN{PROCINFO["CSV"]="q"} {print $1, $2}' file.csv
    >
    > to get the desired output above and there are almost certainly other use
    > cases for people wanting to retain the quotes and there is no simple
    > alternative today (not using --csv but instead setting FPAT and counting
    > double quotes to know if a newline is inside or outside of a field, and
    > adding lines to $0 until you have a complete record).
    >
    > I don't think that would be hard for users to understand or result in
    > language bloat or introduce any additional complexity working with
    > existing constructs - you simply wouldn't strip quotes when reading the
    > input and so they'd still be there when producing output.
    >
    > Regards,
    >
    >      Ed.


