bug-gawk

From: Ed Morton
Subject: Re: inconsistency with counting characters vs bytes for multi-byte characters
Date: Fri, 1 Sep 2023 02:51:02 -0500

You’re welcome and thanks for the quick turnaround on a fix.

Ed Morton

> On Aug 31, 2023, at 11:30 PM, arnold@skeeve.com wrote:
> 
> Hi Ed.
> 
> This was a really interesting corner case. Good catch. The fix
> is attached and will be in git eventually.
> 
> Thanks for the report!
> 
> Arnold
> 
> Ed Morton <mortoneccc@comcast.net> wrote:
> 
>> Configuration Information [Automatically generated, do not change]:
>> Machine: x86_64
>> OS: cygwin
>> Compiler: gcc
>> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
>> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
>> --param=ssp-buffer-size=4
>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1
>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1
>> -DNDEBUG
>> uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64 
>> 2023-08-17 17:02 UTC x86_64 Cygwin
>> Machine Type: x86_64-pc-cygwin
>> 
>> Gawk Version: 5.2.2
>> 
>> Attestation 1:
>>         I have read https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
>>         Yes
>> 
>> Attestation 2:
>>         I have not modified the sources before building gawk.
>>         True
>> 
>> Description:
>>         Different string handling functions produce different results 
>> for multi-byte characters.
>> 
>> Repeat-By:
>>         Without "-b":
>> 
>>         $ awk 'BEGIN{str="\342\200\257"; print length(str); match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
>>         1
>>         1
>>         4
>> 
>>         Note that length() thinks that string is 1 character, the first 
>> call to match() agrees, but then the 2nd call to match() thinks it's 3 
>> characters (since RSTART tells us the "end of string" is at position 4).
>> 
>>         Now with "-b" ("Cause gawk to treat all input data as 
>> single-byte characters" per 
>> https://www.gnu.org/software/gawk/manual/gawk.html#Options):
>> 
>>         $ awk -b 'BEGIN{str="\342\200\257"; print length(str); match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
>>         3
>>         3
>>         4
>> 
>>         Note that length() now thinks that string is 3 characters, the 
>> first call to match() agrees again, and then the 2nd call to match() now 
>> also agrees.
>> 
>>         Per the manual, "in gawk, length(), substr(), split(), match()
>> and the other string functions ... all work in terms of characters in
>> the local character set, and not in terms of bytes." (from
>> https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html),
>> so I was expecting more consistent results between those 3 function
>> calls, and that they'd basically all always agree with length()'s results.
>> It may just be match() that has an issue; I haven't noticed a problem
>> with any other function, but I haven't been looking for one.
> <fix.diff>
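
As a cross-check outside gawk, here is a minimal Python sketch (not part of the original report or of the attached fix) that counts the same three bytes both ways. It assumes the string is meant to be read as UTF-8, where "\342\200\257" decodes to the single code point U+202F. The helper name awk_like is hypothetical, and Python's re module only stands in for gawk's regex matcher. Under those assumptions, a character-consistent answer to the third query would presumably be 2 rather than 4.

```python
import re

# The three octal escapes from the report, as raw bytes (0xE2 0x80 0xAF).
raw = b"\342\200\257"
# Decoded as UTF-8 they form one code point, U+202F.
text = raw.decode("utf-8")


def awk_like(s, dot_plus, end_anchor):
    """Hypothetical helper mirroring the three queries in the report.

    Returns (length(s), RLENGTH for /.+/, 1-based RSTART for /$/).
    """
    rlength = re.match(dot_plus, s).end()       # length of the leading .+ match
    rstart = re.search(end_anchor, s).start() + 1  # awk positions are 1-based
    return len(s), rlength, rstart


# Character view: every count is in code points.
print(awk_like(text, r".+", r"$"))    # (1, 1, 2)
# Byte view: every count is in bytes, like the "-b" run in the report.
print(awk_like(raw, rb".+", rb"$"))   # (3, 3, 4)
```

The byte view reproduces the "-b" output quoted above (3, 3, 4). The character view agrees with the first two numbers of the run without "-b" (1, 1), and differs only in the end-of-string position, which is the inconsistency the report describes.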



