Re: [Bug-gnulib] Re: strtok_r


From: Bruno Haible
Subject: Re: [Bug-gnulib] Re: strtok_r
Date: Fri, 12 Nov 2004 14:28:13 +0100
User-agent: KMail/1.5

Simon Josefsson wrote:
> I'll install this in gnulib now.
>
> /* Parse S into tokens separated by characters in DELIM.
>    If S is NULL, the saved pointer in SAVE_PTR is used as
>    the next starting point.  For example:
>       char s[] = "-abc-=-def";
>       char *sp;
>       x = strtok_r(s, "-", &sp);      // x = "abc", sp = "=-def"
>       x = strtok_r(NULL, "-=", &sp);  // x = "def", sp = NULL
>       x = strtok_r(NULL, "=", &sp);   // x = NULL
>               // s = "abc\0-def\0"
>
>    For the POSIX documentation for this function, see:
>    http://www.opengroup.org/onlinepubs/009695399/functions/strtok.html
>
>    Caveat: It modifies the original string.
>    Caveat: These functions cannot be used on constant strings.
>    Caveat: The identity of the delimiting character is lost.
>    Caveat: It doesn't work with multibyte strings unless all of the
>            delimiter characters are ASCII characters < 0x80.
>
>    See also strsep().
> */

Yes, this looks good, except that the 0x80 should really be 0x30. Most multibyte
encodings have the property that an ASCII character is encoded as a single
byte, with the same value as in ASCII. But here, in order to use, say, '0'
or 'A' as a delimiter, you need a different property: that every occurrence
of a byte with a given ASCII value means that ASCII character and is not
part of a multibyte character. This property is fulfilled for UTF-8 and the
EUC-* encodings. Unfortunately, the following widely used encodings don't
have this property:

  BIG5 BIG5-HKSCS GBK SHIFT_JIS
            don't have the property for 0x40 <= c <= 0x7E
  GB18030   doesn't have the property for 0x30 <= c <= 0x39, 0x40 <= c <= 0x7E
  JOHAB     doesn't have the property for 0x31 <= c <= 0x7E
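
To illustrate the failure, here is a minimal sketch. It assumes the character
U+8868, which is encoded as the bytes 0x95 0x5C in SHIFT_JIS and as
0xE8 0xA1 0xA8 in UTF-8. The helper count_tokens and the two sample arrays are
hypothetical names made up for this example. With '\' (0x5C) as the delimiter,
strtok_r splits the SHIFT_JIS text in the middle of the character, although the
text contains no backslash at all; the UTF-8 text stays in one piece.

```c
#define _POSIX_C_SOURCE 200809L  /* for the strtok_r declaration */
#include <string.h>

/* Hypothetical helper: copy INPUT into a scratch buffer (strtok_r
   modifies its argument) and count the tokens found with DELIM.
   Inputs longer than the buffer are truncated.  */
static int
count_tokens (const char *input, const char *delim)
{
  char buf[64];
  char *save_ptr;
  char *tok;
  int n = 0;

  strncpy (buf, input, sizeof buf - 1);
  buf[sizeof buf - 1] = '\0';
  for (tok = strtok_r (buf, delim, &save_ptr); tok != NULL;
       tok = strtok_r (NULL, delim, &save_ptr))
    n++;
  return n;
}

/* U+8868 followed by "xyz", in two encodings.  The literals are split
   so that the hex escapes do not swallow the following letters.  */
static const char sjis_text[] = "\x95\x5C" "xyz";      /* SHIFT_JIS */
static const char utf8_text[] = "\xE8\xA1\xA8" "xyz";  /* UTF-8 */
```

count_tokens (sjis_text, "\\") reports two tokens (the lone lead byte 0x95 and
"xyz"), whereas count_tokens (utf8_text, "\\") reports one.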

GB18030 in particular is probably bound to stay around for a long time.
Therefore 0x30 really is the limit for usable delimiters: only ASCII
characters < 0x30 are safe.
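
A caller who wants to enforce this limit could check the delimiter set before
calling strtok_r. The following is only a sketch: delimiters_are_portable is a
hypothetical helper, not something in gnulib, and it covers only the encodings
named in this message.

```c
#include <stdbool.h>

/* Hypothetical helper: return true if every byte in DELIM is below
   0x30, i.e. below the lowest trail-byte value listed above for
   BIG5, BIG5-HKSCS, GBK, SHIFT_JIS, GB18030 and JOHAB.  */
static bool
delimiters_are_portable (const char *delim)
{
  const unsigned char *p;

  for (p = (const unsigned char *) delim; *p != '\0'; p++)
    if (*p >= 0x30)
      return false;
  return true;
}
```

For example, " \t,-/" passes this check, while "0", "A" and ":" do not
(0x3A lies in the JOHAB range 0x31 <= c <= 0x7E).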

Bruno




