Incorrect unicode escapes
From: Angus Duggan
Subject: Incorrect unicode escapes
Date: Mon, 26 Jun 2017 13:12:57 +0000
Sorry, bashbug didn't work under cygwin...
BASH_VERSION=4.4.12(3)-release
uname -a: CYGWIN_NT-6.1 xxxxxxx 2.8.0(0.309/5/3) 2017-04-01 20:47 x86_64 Cygwin
The function u32toutf16() in lib/sh/unicode.c implements surrogate pairs
incorrectly. \uff08 (U+FF08, FULLWIDTH LEFT PARENTHESIS) is encoded as the
invalid surrogate pair d7ff df08.
Unicode code points in the range 0xe000-0xffff should be encoded as a single
16-bit code unit.
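
For illustration, the values d7ff df08 are exactly what falls out if the surrogate-pair formula is applied to 0xff08 without a range check. The helper below is a hypothetical sketch (its name and signature are assumptions, not the code in lib/sh/unicode.c); it only reproduces the arithmetic:

```c
#include <stdint.h>

/* Hypothetical helper: applies the surrogate-pair formula with no
   range check, to show what happens for a code point below 0x10000. */
static void naive_surrogates(uint32_t c, uint16_t out[2])
{
    c -= 0x10000;                               /* wraps around for c < 0x10000 */
    out[0] = (uint16_t)((c >> 10) + 0xd800);    /* sum is truncated to 16 bits */
    out[1] = (uint16_t)((c & 0x3ff) + 0xdc00);
}
```

With c = 0xff08 the subtraction wraps to 0xffffff08, so out[0] becomes 0xd7ff (0x3fffff + 0xd800, truncated to 16 bits) and out[1] becomes 0xdf08 (0x308 + 0xdc00).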
To repeat (Windows 64-bit, cygwin):
export LANG=en_us.UTF-8
echo $'\uff08' | hexdump -C
This prints:
00000000 ed 9f bf ed bc 88 0a |.......|
00000007
This is the UTF-8 encoding of the two 16-bit values 0xd7ff 0xdf08. It is
invalid as a UTF-8 encoding: surrogate pairs should not be UTF-8 encoded.
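
The bytes in the hexdump can be decoded by hand to confirm which 16-bit values they carry. A minimal sketch, assuming a hypothetical helper decode_utf8_3() that unpacks one 3-byte sequence and does no validation:

```c
#include <stdint.h>

/* Hypothetical helper: unpack one 3-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx) into its 16-bit value.
   No validation is done; it is only meant for reading the hexdump. */
static uint16_t decode_utf8_3(const unsigned char b[3])
{
    return (uint16_t)(((b[0] & 0x0f) << 12) |
                      ((b[1] & 0x3f) << 6)  |
                       (b[2] & 0x3f));
}
```

Applied to ed 9f bf this gives 0xd7ff, and applied to ed bc 88 it gives 0xdf08. A correct encoding of U+FF08 would be the single sequence ef bc 88.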
The fix is simple: add tests for the e000-ffff range, or invert the test
order and add a test for dfff (CAVEAT EMPTOR! THIS IS UNTESTED!):
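  /* Supplementary planes: encode as a surrogate pair. */
  if (c >= 0x010000 && c <= 0x010ffff)
    {
      c -= 0x010000;
      s[0] = (unsigned short)((c >> 10) + 0xd800);
      s[1] = (unsigned short)((c & 0x3ff) + 0xdc00);
      l = 2;
    }
  /* BMP outside the surrogate range: a single 16-bit code unit.  The upper
     bound keeps values above 0x10ffff from being truncated and emitted. */
  else if (c < 0x0d800 || (c > 0x0dfff && c <= 0x0ffff))
    {
      s[0] = (unsigned short) (c & 0xFFFF);
      l = 1;
    }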
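
To try this logic outside of bash, here is a minimal standalone sketch. The name to_utf16() and its signature are assumptions for illustration, not the actual function in lib/sh/unicode.c, and the single-unit branch carries an explicit upper bound so that values above 0x10ffff are rejected rather than truncated:

```c
#include <stdint.h>

/* Hypothetical standalone version of the proposed logic.
   Returns the number of 16-bit code units written to s (0, 1 or 2). */
static int to_utf16(uint32_t c, uint16_t *s)
{
    int l = 0;

    if (c >= 0x10000 && c <= 0x10ffff) {
        /* Supplementary planes: surrogate pair. */
        c -= 0x10000;
        s[0] = (uint16_t)((c >> 10) + 0xd800);
        s[1] = (uint16_t)((c & 0x3ff) + 0xdc00);
        l = 2;
    } else if (c < 0xd800 || (c > 0xdfff && c <= 0xffff)) {
        /* BMP outside the surrogate range: one code unit. */
        s[0] = (uint16_t)c;
        l = 1;
    }
    /* Surrogate code points and values above 0x10ffff yield 0. */
    return l;
}
```

With this sketch, 0xff08 comes back as the single unit 0xff08, while 0x1f600 comes back as the pair d83d de00.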
a.