bug-gnulib

Re: u32_normalize UNINORM_NFKC on 0xD800


From: Simon Josefsson
Subject: Re: u32_normalize UNINORM_NFKC on 0xD800
Date: Fri, 27 May 2011 11:23:03 +0200
User-agent: Gnus/5.110018 (No Gnus v0.18) Emacs/23.2 (gnu/linux)

Bruno Haible <address@hidden> writes:

> Simon Josefsson wrote:
>> I'm doing some Unicode NFKC operations and noticing that u32_normalize
>> fails for U+D800.
>
> This is a valid behaviour, because U+D800 is a "surrogate" code point
> and therefore not a valid character code point.
>
> See the Unicode standard, chapter 2 [1], pages 23..24:
> Surrogate code points and other non-character code points "should never be
> interchanged". This means, for libunistring, that they are invalid input
> and invalid output in all functions taking or returning UTF-32 strings or
> UTF-8 strings.
>
> Character code points and code points that are in regions that may be assigned
> in future Unicode versions must not be rejected; these are valid input.

I'm not interchanging the code points; I'm calculating this IDNA2008
property

   toNFKC(toCaseFold(toNFKC(cp))) != cp

for all code points.  Is this impossible to do with the u32_normalize
interface?

I notice that ICU also gives an error in this situation:

http://demo.icu-project.org/icu-bin/nbrowser?t=&s=D800&uv=0

I wonder what the above expression means when toNFKC fails.
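
One possible reading, sketched here under my own assumptions rather than
anything IDNA2008 or libunistring specifies, is to treat the property as
undefined whenever one of the steps fails and to report that separately
from true/false.  The names prop_result, check_property, stub_nfkc and
stub_casefold are hypothetical.  The stubs only stand in for real
normalization and case-folding calls, and mapping one code point to one
code point is a simplification (real NFKC can produce several).

```c
#include <stdint.h>

/* Hypothetical three-valued result: the property does not hold, holds,
   or cannot be evaluated because a step rejected its input.  */
typedef enum { PROP_FALSE, PROP_TRUE, PROP_UNDEFINED } prop_result;

/* A transform maps one code point to one code point.
   Returns 0 on success, -1 on failure.  */
typedef int (*transform_fn) (uint32_t in, uint32_t *out);

/* Stub standing in for NFKC: rejects surrogates and values above
   U+10FFFF, and is the identity otherwise.  Not real normalization.  */
static int
stub_nfkc (uint32_t in, uint32_t *out)
{
  if ((in >= 0xD800 && in <= 0xDFFF) || in > 0x10FFFF)
    return -1;
  *out = in;
  return 0;
}

/* Stub standing in for case folding: lowercases ASCII A-Z only.  */
static int
stub_casefold (uint32_t in, uint32_t *out)
{
  *out = (in >= 'A' && in <= 'Z') ? in + ('a' - 'A') : in;
  return 0;
}

/* Evaluate toNFKC(toCaseFold(toNFKC(cp))) != cp, reporting
   PROP_UNDEFINED if any of the three steps fails.  */
static prop_result
check_property (uint32_t cp, transform_fn nfkc, transform_fn casefold)
{
  uint32_t a, b, c;
  if (nfkc (cp, &a) != 0 || casefold (a, &b) != 0 || nfkc (b, &c) != 0)
    return PROP_UNDEFINED;
  return c != cp ? PROP_TRUE : PROP_FALSE;
}

/* Tally the three outcomes over every code point U+0000..U+10FFFF.
   The array is indexed by prop_result.  */
static void
count_results (unsigned long counts[3])
{
  counts[PROP_FALSE] = counts[PROP_TRUE] = counts[PROP_UNDEFINED] = 0;
  for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
    counts[check_property (cp, stub_nfkc, stub_casefold)]++;
}
```

With these stubs, the surrogates U+D800..U+DFFF all end up in the
PROP_UNDEFINED bucket instead of aborting the whole loop.  Whether
"undefined" should then be folded into true or false for the IDNA2008
tables is a separate decision that this sketch does not make.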

I managed to work around this using a local patch to make u32_uctomb
mimic u32_mbtouc_unsafe's behaviour.  But I'm not sure if I'm going to
use it.

--- lib/unistr/u32-uctomb.c.orig        2011-05-27 11:16:00.112466242 +0200
+++ lib/unistr/u32-uctomb.c     2011-05-27 11:16:01.696467065 +0200
@@ -30,8 +30,10 @@
 int
 u32_uctomb (uint32_t *s, ucs4_t uc, int n)
 {
+#if CONFIG_UNICODE_SAFETY
   if (uc < 0xd800 || (uc >= 0xe000 && uc < 0x110000))
     {
+#endif
       if (n > 0)
         {
           *s = uc;
@@ -39,9 +41,11 @@
         }
       else
         return -2;
+#if CONFIG_UNICODE_SAFETY
     }
   else
     return -1;
+#endif
 }
 
 #endif
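
To make the effect of the patch easier to see in isolation, here is a
small standalone model.  It is not libunistring code: model_u32_uctomb
and its `strict' parameter are hypothetical names, with the runtime flag
standing in for the compile-time CONFIG_UNICODE_SAFETY macro from the
diff.  The return values follow the diff: 1 on success, -1 for a rejected
code point, -2 when the buffer has no room.

```c
#include <stdint.h>

/* Model of the patched function.  With strict != 0 it applies the same
   range check as the original code; with strict == 0 it stores any
   value, which is what the patch does when the macro is off.  */
static int
model_u32_uctomb (uint32_t *s, uint32_t uc, int n, int strict)
{
  /* Reject surrogates and values above U+10FFFF in strict mode only.  */
  if (strict && !(uc < 0xd800 || (uc >= 0xe000 && uc < 0x110000)))
    return -1;

  if (n > 0)
    {
      *s = uc;
      return 1;
    }
  return -2;
}
```

Note that turning the check off lets through not only surrogates but
also values of 0x110000 and above, since both are covered by the same
condition.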

/Simon


