From: Ed Morton
Subject: Re: inconsistency with counting characters vs bytes for multi-byte characters
Date: Fri, 1 Sep 2023 02:51:02 -0500

You’re welcome and thanks for the quick turnaround on a fix.
Ed Morton
> On Aug 31, 2023, at 11:30 PM, arnold@skeeve.com wrote:
>
> Hi Ed.
>
> This was a really interesting corner case. Good catch. The fix
> is attached and will be in git eventually.
>
> Thanks for the report!
>
> Arnold
>
> Ed Morton <mortoneccc@comcast.net> wrote:
>
>> Configuration Information [Automatically generated, do not change]:
>> Machine: x86_64
>> OS: cygwin
>> Compiler: gcc
>> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
>> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
>> --param=ssp-buffer-size=4
>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1
>>
>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1
>>
>> -DNDEBUG
>> uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64
>> 2023-08-17 17:02 UTC x86_64 Cygwin
>> Machine Type: x86_64-pc-cygwin
>>
>> Gawk Version: 5.2.2
>>
>> Attestation 1:
>> I have read
>> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
>> Yes
>>
>> Attestation 2:
>> I have not modified the sources before building gawk.
>> True
>>
>> Description:
>> Different string handling functions produce different results
>> for multi-byte characters.
>>
>> Repeat-By:
>> Without "-b":
>>
>> $ awk 'BEGIN{str="\342\200\257"; print length(str);
>> match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
>> 1
>> 1
>> 4
>>
>> Note that length() reports the string as 1 character and the first
>> call to match() agrees, but the 2nd call to match() treats it as 3
>> characters (RSTART puts the "end of string" at position 4).
>>
>> Now with "-b" ("Cause gawk to treat all input data as
>> single-byte characters" per
>> https://www.gnu.org/software/gawk/manual/gawk.html#Options):
>>
>> $ awk -b 'BEGIN{str="\342\200\257"; print length(str);
>> match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
>> 3
>> 3
>> 4
>>
>> Note that length() now thinks that string is 3 characters, the
>> first call to match() agrees again, and then the 2nd call to match() now
>> also agrees.
>>
>> Per the manual "in gawk, length(), substr(), split(), match()
>> and the other string functions ... all work in terms of characters in
>> the local character set, and not in terms of bytes." (from
>> https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html)
>>
>> so I was expecting more consistent results between those 3 function
>> calls, and that they'd basically always agree with length()'s results.
>> It may be only match() that has an issue. I haven't noticed a problem
>> with any other function, but I haven't been looking for one.
> <fix.diff>
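
For anyone reading this thread later, the character-versus-byte arithmetic in the report can be cross-checked outside gawk. The sketch below uses Python rather than awk, so that the result does not depend on the gawk version or locale. The helper awk_style_match is a hypothetical name introduced here, not part of gawk or of the attached fix. It assumes the three octal bytes are meant as UTF-8 (they decode to U+202F) and mimics awk's 1-based RSTART and RLENGTH.

```python
import re

# The bytes from the report: octal \342\200\257 is the UTF-8 encoding
# of U+202F (NARROW NO-BREAK SPACE).
raw = b"\342\200\257"
text = raw.decode("utf-8")


def awk_style_match(subject, pattern):
    """Hypothetical helper: return (RSTART, RLENGTH) using awk's 1-based
    indexing, or (0, -1) when there is no match."""
    m = re.search(pattern, subject, re.DOTALL)
    if m is None:
        return 0, -1
    return m.start() + 1, m.end() - m.start()


# Character view: one code point, as in a UTF-8 locale without -b.
char_len = len(text)
char_plus = awk_style_match(text, r".+")
char_eos = awk_style_match(text, r"$")

# Byte view: three bytes, as with -b.
byte_len = len(raw)
byte_plus = awk_style_match(raw, rb".+")
byte_eos = awk_style_match(raw, rb"$")

# Same three values the awk one-liners print: length, RLENGTH, RSTART.
print(char_len, char_plus[1], char_eos[0])
print(byte_len, byte_plus[1], byte_eos[0])
```

Under those assumptions the character view gives 1, 1, 2 and the byte view gives 3, 3, 4. That suggests the consistent answer for the first run would have been an RSTART of 2 rather than 4, since matching /$/ on a string of n characters lands at position n+1.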