Incorrect unicode escapes

From: Angus Duggan
Subject: Incorrect unicode escapes
Date: Mon, 26 Jun 2017 13:12:57 +0000

Sorry, bashbug didn't work under cygwin...

uname -a: CYGWIN_NT-6.1 xxxxxxx 2.8.0(0.309/5/3) 2017-04-01 20:47 x86_64 Cygwin

The function u32toutf16() in lib/sh/unicode.c incorrectly implements
surrogate pairs. \uff08 (Full Width Left Paren) is encoded to the invalid
surrogate pair d7ff df08.

Unicode code points in the range 0xe000-0xffff should be encoded as a single
16-bit code unit.
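
The invalid pair can be reproduced by hand. The following is a minimal
sketch, assuming the code point reaches the surrogate-pair arithmetic
without a check that it is at least 0x10000. The helper name
misapplied_pair is hypothetical and is not taken from lib/sh/unicode.c;
the comments show the values for c = 0xff08 with a 16-bit unsigned short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: applies the surrogate-pair arithmetic to c
   without first checking that c >= 0x10000. */
static void
misapplied_pair (uint32_t c, unsigned short s[2])
{
  assert (s != 0);
  c -= 0x010000;                                   /* 0xff08 wraps to 0xffffff08 */
  s[0] = (unsigned short) ((c >> 10) + 0xd800);    /* 0x40d7ff truncates to 0xd7ff */
  s[1] = (unsigned short) ((c & 0x3ff) + 0xdc00);  /* 0x308 + 0xdc00 = 0xdf08 */
}
```

The unsigned subtraction wraps around, so the "high" unit ends up just
below the surrogate range and the "low" unit lands inside it.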

To repeat (Windows 64-bit, cygwin):

  export LANG=en_us.UTF-8
  echo $'\uff08' | hexdump -C

This prints:

00000000  ed 9f bf ed bc 88 0a                              |.......|

This is the UTF-8 encoding of the two 16-bit values 0xd7ff 0xdf08. The
result is invalid as UTF-8: surrogate code units must not be UTF-8 encoded.
The correct output for U+FF08 would be the bytes ef bc 88.
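
To check which 16-bit values the hexdump bytes carry, each 3-byte UTF-8
sequence can be unpacked by hand. This is an illustrative sketch only:
decode3 is a hypothetical helper, it handles nothing but 3-byte
sequences, and it does no validation beyond the assert.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: unpack one 3-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx) into the 16-bit value it carries. */
static unsigned int
decode3 (const unsigned char *p)
{
  assert (p != NULL && (p[0] & 0xf0) == 0xe0);
  return ((unsigned int) (p[0] & 0x0f) << 12)   /* top 4 bits */
         | ((unsigned int) (p[1] & 0x3f) << 6)  /* middle 6 bits */
         | (unsigned int) (p[2] & 0x3f);        /* low 6 bits */
}
```

Applied to ed 9f bf and ed bc 88 this yields two separate 16-bit values
rather than the single code point that was requested.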

The fix is simple: add tests for the e000-ffff range, or invert the test
order and add a test for dfff (CAVEAT EMPTOR! THIS IS UNTESTED!):

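    if (c >= 0x010000 && c <= 0x010ffff)
      {
        /* supplementary planes: encode as a surrogate pair */
        c -= 0x010000;
        s[0] = (unsigned short)((c >> 10) + 0xd800);
        s[1] = (unsigned short)((c & 0x3ff) + 0xdc00);
        l = 2;
      }
    else if (c < 0x0d800 || (c > 0x0dfff && c <= 0x0ffff))
      {
        /* BMP outside d800-dfff: a single 16-bit code unit */
        s[0] = (unsigned short) (c & 0xFFFF);
        l = 1;
      }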
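
For anyone who wants to try the inverted test order outside of bash, here
is a self-contained sketch. The function name u32toutf16_sketch, its
signature, and the convention of returning 0 for values that cannot be
encoded are assumptions made for illustration; they are not copied from
lib/sh/unicode.c.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: name, signature, and return convention are assumptions.
   Returns the number of 16-bit code units written to s, or 0 if c is a
   surrogate (d800-dfff) or lies above 0x10ffff. */
static int
u32toutf16_sketch (uint32_t c, unsigned short s[2])
{
  int l = 0;

  assert (s != 0);
  if (c >= 0x010000 && c <= 0x10ffff)
    {
      /* supplementary planes: surrogate pair */
      c -= 0x010000;
      s[0] = (unsigned short) ((c >> 10) + 0xd800);
      s[1] = (unsigned short) ((c & 0x3ff) + 0xdc00);
      l = 2;
    }
  else if (c < 0xd800 || (c > 0xdfff && c <= 0xffff))
    {
      /* BMP outside the surrogate range: one code unit */
      s[0] = (unsigned short) c;
      l = 1;
    }
  return l;
}
```

With this ordering, 0xff08 takes the single-unit branch, and only code
points at or above 0x10000 are split into a pair.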


