Re: UTF8 above U+10FFFF treated inconsistently

bug-gawk


From:	Jason C. Kwan
Subject:	Re: UTF8 above U+10FFFF treated inconsistently
Date:	Sun, 26 Dec 2021 20:13:52 +0000 (UTC)

Hi

To follow up on the previous report: even in the latest version of gawk, I'm 
still noticing the error. Here's the code for full replication of the issue. On 
top of valid ASCII and 2 valid Unicode characters (one 2-byte, one 3-byte), the 
test string also intentionally includes

    * a 4-byte sequence that would encode U+110000 if that were a valid code 
      point (it's 1 over the max of U+10FFFF)

    * a 4-byte Unicode-look-alike sequence that is definitely invalid, as it 
      begins with \366; its hypothetical UTF-8 code point would be U+1987FF, 
      ord# 1,673,215

    * a 3-byte sequence that resides within the UTF-16 surrogates region

    * an extra 3rd continuation byte right after a valid 2-byte sequence

    * a 2-byte sequence supposedly representing U+0088 | \xC2\x88 | \302\210, 
      but intentionally using the earlier, Unicode-invalid byte \300

and finally,

    * a short-changed 3-byte sequence that is missing the second continuation 
      byte

The 2 valid multi-byte UTF-8 code points inside the string are U+076D (ARABIC 
LETTER SEEN WITH TWO DOTS VERTICALLY ABOVE) and U+B000 (Korean Hangul Syllable 
Ggwem).
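
For reference, the code points quoted above can be reproduced with a few lines 
of arithmetic. This is a minimal Python sketch, not part of the attached 
script; naive_code_point is a hypothetical helper that applies the generic 
UTF-8 bit layout (payload bits of the lead byte, then 6 bits per continuation 
byte) and deliberately skips every validity check.

```python
# Sketch only: compute the value a byte sequence WOULD encode if it were
# read with the generic UTF-8 bit layout, ignoring all validity rules.
# naive_code_point is a hypothetical helper, not a gawk or glibc function.

def naive_code_point(seq: bytes) -> int:
    n = len(seq)
    if n == 1:
        return seq[0]
    # The lead byte of an n-byte sequence carries (7 - n) payload bits.
    value = seq[0] & (0xFF >> (n + 1))
    for b in seq[1:]:
        # Each continuation byte contributes its low 6 bits.
        value = (value << 6) | (b & 0x3F)
    return value

# Octal escapes match the ones used in the test string.
samples = {
    "valid 3-byte (Hangul)":   b"\353\200\200",
    "valid 2-byte (Arabic)":   b"\335\255",
    "one past the maximum":    b"\364\220\200\200",
    "invalid lead byte \\366": b"\366\230\237\277",
    "UTF-16 surrogate range":  b"\355\271\272",
}

for label, seq in samples.items():
    print(f"{label:26s} U+{naive_code_point(seq):04X}")
```

The last three values fall outside what UTF-8 permits (above U+10FFFF, or 
inside U+D800..U+DFFF), which is why a strict decoder rejects them.
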
The test code (also included as an attachment, along with my zsh screen output) 
is included below.

A fully functional hex and octal encoder is included for your convenience. The 
correct number of characters is 17, as confirmed by GNU wc. The byte count is 
36. However, as you can see, each function reports something different, and 
they frequently do not agree with each other.
split() is now correct at 33;
however, length() and gsub(/./,"&") should be 17 instead of 33 and 20, 
respectively.
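
The two reference numbers can be cross-checked outside gawk. The snippet below 
is a minimal Python sketch under one assumption: that "valid characters" means 
whatever survives a strict UTF-8 decode with the invalid bytes dropped 
(Python's errors="ignore" handler). It is an independent check, not a 
statement about how gawk should count.

```python
# Sketch only: byte count and count of well-formed UTF-8 characters
# for the test string, using Python's decoder as an independent reference.
data = (b"apple\353\200\200A\364\220\200\200XYZ\366\230\237\277JR"
        b"\355\271\272z\335\255\232F\300\210Q\343\207W")

byte_count = len(data)

# errors="ignore" drops every byte that is not part of a well-formed
# UTF-8 sequence, leaving only the valid characters.
valid_chars = data.decode("utf-8", errors="ignore")

print(byte_count)        # 36
print(len(valid_chars))  # 17
```
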
As for match($0,/.*/), it's supposed to start at the first position of a valid 
code point and stop at the last valid code point that's contiguous from 
RSTART. If I'm counting correctly, it should end at the capital letter "A" 
(\101 :: ord 65 :: \x41), so RSTART should be 1 and RLENGTH should be 7. 
However, it currently reports 25.
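
The expected RLENGTH of 7 corresponds to the longest prefix that decodes 
cleanly. A minimal Python sketch of that expectation follows; it assumes the 
reading of /.*/ described above (stop at the first invalid byte), which is the 
expectation in this report and not documented gawk behaviour.

```python
# Sketch only: length of the longest prefix made of valid UTF-8 characters.
data = (b"apple\353\200\200A\364\220\200\200XYZ\366\230\237\277JR"
        b"\355\271\272z\335\255\232F\300\210Q\343\207W")

try:
    data.decode("utf-8")            # strict decode
    prefix_bytes = len(data)
except UnicodeDecodeError as err:
    prefix_bytes = err.start        # byte offset of the first invalid byte

prefix = data[:prefix_bytes].decode("utf-8")

print(prefix_bytes)   # 9
print(len(prefix))    # 7
print(ascii(prefix))  # ASCII-safe rendering of the prefix
```
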
If you run this command
 gsub(   /.+/, "\f&\f") gsub(/[\f]+/,    "\f")
then it's obvious where gsub() is counting incorrectly:
4
apple뀀A????XYZ????JR???zݭ                          ?                           
F                            ??                              Q                  
             ?                                W                                
It's clumping multiple invalid code points all within the first group. In 
another view:
gsub(/.../,"\f&\f");
app   le뀀       A????X             YZ????
 JR???                        zݭ?F??Q?W
It's supposed to add the form feeds only when it can find 3 consecutive valid 
code points. What lies between A and X isn't valid, so those should not have 
been grouped together; ditto for the 3rd item after YZ and the one after JR, 
and the tail group is just clumped together.
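
To make the expected grouping concrete, here is a minimal Python sketch. It 
assumes the semantics described above, namely that "." should match only 
valid code points, so /.../ can only match inside an unbroken run of valid 
characters. It marks invalid bytes with the surrogateescape error handler and 
splits on them; this is an illustration of the expectation, not of gawk's 
regex engine.

```python
# Sketch only: runs of consecutive valid characters, and the groups of
# three that /.../ would find if "." matched valid code points only.
import re

data = (b"apple\353\200\200A\364\220\200\200XYZ\366\230\237\277JR"
        b"\355\271\272z\335\255\232F\300\210Q\343\207W")

# surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF,
# so invalid bytes become easy-to-spot markers in the decoded string.
text = data.decode("utf-8", errors="surrogateescape")

# Maximal runs of valid characters between the invalid-byte markers.
runs = [r for r in re.split("[\udc80-\udcff]+", text) if r]

# Non-overlapping groups of 3 inside each run (leftovers do not match).
triples = [run[i:i + 3] for run in runs for i in range(0, len(run) - 2, 3)]

print([len(r) for r in runs])  # [7, 3, 2, 2, 1, 1, 1]
print(ascii(triples))          # ASCII-safe rendering of the groups
```

Under that assumption only three groups qualify, two in the leading run and 
one for XYZ; JR, the z-plus-Arabic-letter pair and the single letters are too 
short to match.
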
::::====:::::======:::====
[ The code also prints itself on the terminal, as well as the test string, a 
full cell-by-cell display of what array splitting looks like, and finally a 
full printout of the hex and octal mapping tables, to ensure they're 
accurately mapping the 8-bit bytes. ]

gprintf '\33c\e[3J'; echo;
str1="apple\353\200\200A\364\220\200\200XYZ\366\230\237\277JR\355\271\272z\335\255\232F\300\210Q\343\207W";
gprintf '%s\n' "test string :: ${str1}"; echo;
gprintf "${str1}" | gwc -lcm ; echo ;
cmd=' gprintf "${str1}" | gawk -e '\''
function hexencode(str,chr) {
  for(chr in b2hex) { if (chr!~/[[:alnum:]%\\]/) { gsub(chr,b2hex[chr],str) } };
  return str }
function octencode(str,chr) {
  gsub(/\\/,b2oct["\\\\"],str); gsub(/[0-7]/,"\\06&",str);
  for(chr in b2oct) { if(chr!~/[0-7\\]/) { gsub(chr,b2oct[chr],str) str } };
  return str }
BEGIN {
  offset=-4^4;
  for(x=0;x<256;x++) {
    byte=sprintf("%c",x+offset);
    b2hex[byte]=sprintf("\\x%.2X",x); b2oct[byte]=sprintf("\\%03o",x) };
  spc1="/\\^[]"; spc2="~!@#%&_-{}:;\42\47\140 <>,$.|()*+=?";
  for(x=length(spc1);x;x--) {
    byte=substr(spc1,x,1);
    b2hex[("\\"(byte))]=b2hex[byte]; b2oct[("\\"(byte))]=b2oct[byte];
    delete b2hex[byte]; delete b2oct[byte] };
  for(x=length(spc2);x;x--) {
    byte=substr(spc2,x,1);
    b2hex[("["(byte)"]")]=b2hex[byte]; b2oct[("["(byte)"]")]=b2oct[byte];
    delete b2hex[byte]; delete b2oct[byte] } }
function printtables() {
  PROCINFO["sorted_in"]="@val_num_asc"; cnt=4;
  for(x in b2oct) {
    printf(" %-4s:%s:%s |%s",(x~/[\040-\176]/) ? x : "[.]",b2hex[x],b2oct[x],--cnt?"":ORS);
    if(!cnt) { cnt=4 } } }
{
  printf("%cinput :: |%s|%c%c non-ALNUM-hex :: %s%c%cfull-octal :: %s%c%c", 10, $0, 10, 10, hexencode($0), 10, 10, octencode($0), 10, 10);
  print "byte count via match($0,/$/)-1 :: " , match($0,/$/)-1;
  print "gsub(/./,\"&\") :: " , gsub(/./,"&");
  match($0,/.*/); print "match($0,/.*/) :: ",RSTART, RLENGTH;
  print "length() :: ",length($0);
  print "split to array using empty-RE :: ", nx=split($0, arr, //);
  print ORS; print "($0~/^.+$/) :: " ($0~/^.+$/);
  print ORS; print "match($0,/.?$/) :: ",match($0,/.?$/);
  print ORS;
  for(x=1;x<=nx;x++) {
    printf("array cell # [ %2d ] <| %-6s | %16s | %16s |>\n", x, xa = arr[x], hexencode(xa), octencode(xa));
    xa=""} }
END { printtables() }
'\'' 2>&1 | gcat -n ; echo; uname -a; echo; locale; echo; gawk -V; echo ';
echo $'\n'"command is :: "$'\n'$'\n'"${cmd}"$'\n'; eval "${cmd}"; echo
And the system config is :

Darwin JCK-MBP18-Retina-13.local 20.6.0 Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64 x86_64

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
Copyright (C) 1989, 1991-2021 Free Software Foundation.

Thanks for your time,
Jason

On Saturday, October 2, 2021, 02:09:46 AM EDT, Nethox <nethox+awk@gmail.com> 
wrote: 

2021-09-29T21:29:55-06:00, <arnold@skeeve.com>:
> Asserts are for errors in code, not errors in data. mbrlen() has to
> return an error to user code, not fail in an assertion.

Yes. I meant the assert as a postcondition in the "recognized" cases, where 
glibc's full decoder/validator code should never reach with any of those 13 
invalid bytes.

test_gawk_script_output.txt
Description: Text document

test_gawk_script.sh.txt
Description: Text document
