From: dethrophes
Subject: Re: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is trying to do?
Date: Sun, 11 Mar 2012 23:10:30 +0100
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2

On 11.03.2012 00:02, dethrophes wrote:
On 10.03.2012 23:17, Chet Ramey wrote:
On 3/7/12 12:07 AM, John Kearney wrote:
You really should stop using this function. It is just plain wrong, and
is not predictable.

It may encode BIG5 and SJIS, but it is more by accident than intent.

If you want to do something like this then do it properly.

Basically, all of the multibyte systems have to have a detection method
for multibyte characters; most of them rely on bit 7 to indicate a
multibyte sequence or use vt100 SS3 escape sequences. You really can't
just inject random data into a text buffer. Even returning UTF-8 as a
fallback is a bug. The most that should be done is to return ASCII in the
error case, and I mean U+0-U+7f only, and ignore or warn about any unsupported

Using this function is dangerous and pointless.

I mean, seriously, in what world does it make sense to inject UTF-8 into a
big5 string? Or indeed into an ASCII string? Code should behave like an
adult, not like a frightened kid. By which I mean it shouldn't pretend
it knows what it's doing when it doesn't; it should admit the problem so
that the problem can be fixed.
Wow. Do you really think that personal insults are a good way to advance
an argument?

Listen: bottom line.  It's a fallback function.  It's called in the
unlikely event that iconv isn't available at all and we're not in a
UTF-8 locale.  Any fallback is as good as another, though maybe the
best one would be to return \uNNNN or \UNNNNNNNN (before you ask,
Posix leaves the \u/\U failure cases unspecified).  The real question
is what to do with invalid input data, since any transformation is
going to "inject random data" into the buffer.  Maybe the identity
function would be better after all.  But then you'd ask whether or
not it makes sense to inject a C-style escape sequence into a big5

I guess I was a bit terse, but I wouldn't call it a personal insult. Then again, I do have pretty thick skin, so sorry if you felt it was meant as one.

My point is that the fallback function/handler should report an error or warning rather than do something arbitrary and move on. Trying to recover from an unrecoverable error just makes it more difficult to figure out what is going on. Basically this is a script/environment error, so report the error; don't hide it.

It's a similar problem with the iconv fallback of returning UTF-8. If iconv says it can't encode the Unicode value in the destination charset, do we really know better? Again, it is better to report the error and move on, because injecting UTF-8 into big5 or whatever is also wrong. If UTF-8 were the destination charset, that would already have been detected or iconv would have worked, so in this context falling back to UTF-8 is wrong.
  if (iconv (localconv, (ICONV_CONST char **)&iptr, &sn, &optr, &obytesleft) == (size_t)-1)
    return n;   /* You get utf-8 if iconv fails */
Now don't forget that at this point we know iconv recognised both the source and destination charsets, so what we have is a Unicode character that is unsupported in the destination charset.

or here
  n = u32toutf8 (c, s);
  if (utf8locale || localconv == (iconv_t)-1)
    return n;
If the destination charset is UTF-8, OR the destination charset is NOT UTF-8 and iconv didn't recognise the destination charset, encode it as UTF-8.
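
For comparison, here is a minimal sketch of a conversion that reports failure instead of handing back UTF-8. It is not the code in unicode.c: the helper name convert_or_fail and its parameters are made up for illustration, and it assumes the iconv implementation accepts a plain char ** input pointer (as glibc does). It treats both an iconv error and a non-zero count of non-reversible conversions as "could not encode".

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of bash): convert UTF-8 input to the
   charset named by `to`.  Returns the number of bytes written to `out`,
   or -1 with errno set if the conversion cannot be done faithfully. */
int
convert_or_fail (const char *to, const char *utf8, size_t inlen,
                 char *out, size_t outsize)
{
  iconv_t cd = iconv_open (to, "UTF-8");
  if (cd == (iconv_t)-1)
    return -1;                  /* unknown charset: errno set by iconv_open */

  char *iptr = (char *)utf8;    /* iconv does not modify the input bytes */
  char *optr = out;
  size_t inleft = inlen, outleft = outsize;

  size_t r = iconv (cd, &iptr, &inleft, &optr, &outleft);
  int saved = errno;            /* iconv_close may clobber errno */
  iconv_close (cd);

  if (r == (size_t)-1)
    {
      errno = saved;            /* e.g. EILSEQ or E2BIG: report it */
      return -1;
    }
  if (r != 0)
    {
      errno = EILSEQ;           /* a substitute character was emitted */
      return -1;
    }
  return (int)(outsize - outleft);
}
```

A caller would check for -1 and print a warning, rather than receive bytes in an encoding it did not ask for.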

Let's say CTYPE=BIG5 and you try to encode the Unicode char U+F000, which is an invalid big5 char (at least I think it is).
So iconv returns an error.
Now the code inserts the UTF-8 encoding of U+F000, which is an invalid byte sequence in that charset.
This isn't helping anybody.
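
To see which bytes end up in the buffer in that case, here is a small illustrative UTF-8 encoder. It is a sketch written for this example, not the u32toutf8 from unicode.c, and the name sketch_utf8_encode is invented. It does not reject surrogate code points.

```c
#include <stddef.h>

/* Illustrative encoder: writes the UTF-8 form of code point `c` into `s`
   (which must hold at least 4 bytes) and returns the byte count, or 0 if
   `c` is above U+10FFFF. */
int
sketch_utf8_encode (unsigned long c, unsigned char *s)
{
  if (c < 0x80)
    {
      s[0] = (unsigned char)c;
      return 1;
    }
  if (c < 0x800)
    {
      s[0] = (unsigned char)(0xC0 | (c >> 6));
      s[1] = (unsigned char)(0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      s[0] = (unsigned char)(0xE0 | (c >> 12));
      s[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
      s[2] = (unsigned char)(0x80 | (c & 0x3F));
      return 3;
    }
  if (c < 0x110000)
    {
      s[0] = (unsigned char)(0xF0 | (c >> 18));
      s[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
      s[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
      s[3] = (unsigned char)(0x80 | (c & 0x3F));
      return 4;
    }
  return 0;
}
```

For U+F000 this yields the three bytes 0xEF 0x80 0x80, all with bit 7 set, so a reader expecting a multibyte legacy charset will try to interpret them as its own lead and trail bytes.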

Or let's put it another way.

Let's say you type
rm FileDoesntExist
Now "FileDoesntExist" doesn't exist, but instead of reporting an error rm deletes "FileDoesExist". I think we can agree this is unexpected behavior, though you could also argue that it is a fallback behavior.

This is equivalent to what the function is currently doing: it has checked and knows it shouldn't output UTF-8, so it tries to encode to the correct charset, which doesn't work for whatever reason (unrecognized/unsupported destination charset, character not present in destination charset). However, instead of reporting the encoding/transcoding error, it outputs UTF-8 anyway.
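
A rough sketch of the stricter fallback argued for in this thread might look like the following: pass through U+0000-U+007F only and signal failure for everything else, so the caller can warn. The function name strict_fallback and its return convention are assumptions made for illustration, not anything in bash, and it assumes the destination charset is ASCII-compatible.

```c
#include <stddef.h>

/* Hypothetical strict fallback: store `c` as a NUL-terminated one-byte
   string if it is plain ASCII and the buffer has room; otherwise write
   nothing and return -1 so the caller can report the problem. */
int
strict_fallback (unsigned long c, char *s, size_t size)
{
  if (c <= 0x7f && size >= 2)
    {
      s[0] = (char)c;
      s[1] = '\0';
      return 1;                 /* one byte written */
    }
  return -1;                    /* unsupported here: do not guess */
}
```

With this shape, the decision about what to print on failure (a warning, the original escape text, or nothing) stays with the caller instead of being hidden inside the encoder.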
