ctype.h functions on bytes 0x80..0xFF
From: Grisha Levit
Subject: ctype.h functions on bytes 0x80..0xFF
Date: Fri, 26 May 2023 05:55:23 -0400
On Mon, May 1, 2023 at 11:48 AM Chet Ramey <chet.ramey@case.edu> wrote:
>
> (And once we get these issues straightened out, if you look back to your
> original example, 0x240 is a blank in my locale, en_US.UTF-8, and will be
> removed from the input stream by the parser unless it's quoted.)
On at least recent macOS versions, it seems that the ctype.h functions
treat bytes in [0x80..0xFF] the same way the wctype.h functions treat
the code points U+0080..U+00FF. So while U+00A0 is a space character in
the en_US.UTF-8 locale, and iswspace(L'\u00A0') returns 1, it is also
the case that isspace(0xA0) returns 1. But I don't think it's correct
to actually rely on the latter, since the single byte 0xA0 (a UTF-8
continuation byte) doesn't represent _any_ character in the locale,
much less a space.
(I think that's the reason for the behavior Chet noted above, from a
previous thread.)
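
One rough way to observe the mismatch is to compare what isspace() says
about a byte with whether that byte decodes to a complete character on
its own. The sketch below is only illustrative: byte_is_character() and
probe_byte() are hypothetical helper names, and the output depends on
the platform and on the locale selected with setlocale().

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the single byte b, taken alone,
 * decodes to a complete character in the current locale, else 0. */
static int byte_is_character(unsigned char b)
{
    mbstate_t st;
    wchar_t wc;
    char c = (char)b;
    size_t n;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    n = mbrtowc(&wc, &c, 1, &st);

    /* (size_t)-1: invalid sequence; (size_t)-2: incomplete sequence */
    return n != (size_t)-1 && n != (size_t)-2;
}

/* Hypothetical helper: print the byte-oriented classification next to
 * whether the byte is a character at all in the current locale. */
static void probe_byte(unsigned char b)
{
    printf("0x%02X: isspace=%d, complete character=%d\n",
           b, isspace(b) != 0, byte_is_character(b));
}
```

Calling probe_byte(0xA0) after setlocale(LC_ALL, "en_US.UTF-8") would be
the case of interest: on a platform with the behavior described above,
isspace would be reported as 1 even though the byte is not a complete
character.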
For example, the outputs below would be correct with \uA0 in place of
\xA0, but I don't think the current behavior is expected:
$ eval $'printf "<%s>" [\xA0\xA0]'
<[><]>
$ [[ $'\xA0' == [[:space:]] ]]; echo $?
0
Perhaps on platforms like this it would be appropriate to mask ctype
results with something equivalent to `btowc(c) != WEOF'?
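
A minimal sketch of what such masking could look like, assuming it is
done in small wrappers around the ctype.h calls. The names
byte_isspace() and byte_isblank() are hypothetical and are not existing
bash or libc functions; where such a check would actually belong in
bash is not something this sketch addresses.

```c
#include <ctype.h>
#include <stdio.h>   /* EOF */
#include <wchar.h>   /* btowc, WEOF */

/* Hypothetical wrapper: isspace() is trusted only when the byte is
 * itself a valid single-byte character in the current locale. */
static int byte_isspace(int c)
{
    if (c == EOF)
        return 0;
    c = (unsigned char)c;               /* ctype.h needs an unsigned char value */
    return btowc(c) != WEOF && isspace(c);
}

/* Hypothetical wrapper: same masking applied to isblank(). */
static int byte_isblank(int c)
{
    if (c == EOF)
        return 0;
    c = (unsigned char)c;
    return btowc(c) != WEOF && isblank(c);
}
```

ASCII bytes are unaffected, since btowc() succeeds for them. In a UTF-8
locale, bytes in 0x80..0xFF are never complete characters by themselves,
so btowc() should return WEOF for them and the wrappers return 0
regardless of what the underlying ctype.h function reports.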
(See http://www.openradar.me/FB9973780 for an example of the issue in
an Apple-supplied program.)