Incorrect unicode escapes

bug-bash

From:	Angus Duggan
Subject:	Incorrect unicode escapes
Date:	Mon, 26 Jun 2017 13:12:57 +0000

Sorry, bashbug didn't work under cygwin...

BASH_VERSION=4.4.12(3)-release
uname -a: CYGWIN_NT-6.1 xxxxxxx 2.8.0(0.309/5/3) 2017-04-01 20:47 x86_64 Cygwin

The function u32toutf16() in lib/sh/unicode.c incorrectly implements
surrogate pairs. \uff08 (Full Width Left Paren) is encoded to the invalid
surrogate pair d7ff df08.

Unicode code points in the range 0xe000-0xffff should be encoded as a single
16-bit code unit.

To repeat (Windows 64-bit, cygwin):

  export LANG=en_us.UTF-8
  echo $'\uff08' | hexdump -C

This prints:

00000000  ed 9f bf ed bc 88 0a                              |.......|
00000007

This is the UTF-8 encoding of the two 16-bit values 0xd7ff 0xdf08. It is
invalid as UTF-8: surrogate code units should never be UTF-8 encoded.
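
To check the hexdump by hand, the sketch below decodes a single three-byte
UTF-8 sequence. It is only an illustration: decode_utf8_3byte() is a
hypothetical helper, not anything in bash. Fed the bytes above it yields
0xd7ff and 0xdf08, whereas the expected encoding of U+FF08, ef bc 88,
decodes back to 0xff08.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of bash: decode one three-byte UTF-8
   sequence of the form 1110xxxx 10xxxxxx 10xxxxxx into its value.
   No surrogate or overlong checks are made, so it happily decodes the
   invalid output shown in the hexdump. */
static uint32_t decode_utf8_3byte(const unsigned char *b)
{
    /* The lead byte must be 1110xxxx, both trail bytes 10xxxxxx. */
    assert((b[0] & 0xf0) == 0xe0);
    assert((b[1] & 0xc0) == 0x80);
    assert((b[2] & 0xc0) == 0x80);

    return ((uint32_t)(b[0] & 0x0f) << 12)
         | ((uint32_t)(b[1] & 0x3f) << 6)
         |  (uint32_t)(b[2] & 0x3f);
}
```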

The fix is simple: either add a test for the e000-ffff range, or invert the
test order and add a test for dfff (CAVEAT EMPTOR! THIS IS UNTESTED!):

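    if (c >= 0x10000 && c <= 0x10ffff)
    {
      /* Supplementary planes: encode as a high/low surrogate pair. */
      c -= 0x10000;
      s[0] = (unsigned short)((c >> 10) + 0xd800);
      s[1] = (unsigned short)((c & 0x3ff) + 0xdc00);
      l = 2;
    }
    else if (c < 0xd800 || c > 0xdfff)
    {
      /* Not a surrogate: store the low 16 bits as a single code unit. */
      s[0] = (unsigned short) (c & 0xFFFF);
      l = 1;
    }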
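
For anyone who wants to try the idea outside bash, here is a minimal
self-contained sketch of the same test order. It is not the code in
lib/sh/unicode.c: the name u32_to_utf16_sketch(), the fixed-width types and
the return convention (number of code units written, 0 on failure) are all
assumptions made for illustration. Unlike the fragment above, it also
rejects values above 0x10ffff rather than truncating them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for u32toutf16(), not the bash source.
   Writes one or two UTF-16 code units to s and returns how many were
   written, or 0 if c is a surrogate or lies beyond U+10FFFF. */
static int u32_to_utf16_sketch(uint32_t c, uint16_t *s)
{
    assert(s != NULL);

    if (c >= 0x10000 && c <= 0x10ffff) {
        /* Supplementary planes: split into a surrogate pair. */
        c -= 0x10000;
        s[0] = (uint16_t)((c >> 10) + 0xd800);
        s[1] = (uint16_t)((c & 0x3ff) + 0xdc00);
        return 2;
    }
    if (c < 0xd800 || (c > 0xdfff && c <= 0xffff)) {
        /* BMP code point outside d800-dfff: a single code unit. */
        s[0] = (uint16_t)c;
        return 1;
    }
    return 0;   /* surrogate code point or out of range */
}
```

With this ordering, 0xff08 takes the second branch and comes out as the
single code unit ff08.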

a.