Re: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is tr

bug-bash

From:	dethrophes
Subject:	Re: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is trying to do?
Date:	Sun, 11 Mar 2012 23:10:30 +0100
User-agent:	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2

On 11.03.2012 00:02, dethrophes wrote:

On 10.03.2012 23:17, Chet Ramey wrote:
On 3/7/12 12:07 AM, John Kearney wrote:
You really should stop using this function. It is just plain wrong, and
is not predictable.


It may encode BIG5 and SJIS, but it is more by accident than intent.

If you want to do something like this then do it properly.

Basically, all of the multibyte systems have to have a detection method
for multibyte characters; most of them rely on bit 7 to indicate a
multibyte sequence or use VT100 SS3 escape sequences. You really can't
just inject random data into a text buffer. Even returning UTF-8 as a
fallback is a bug. The most that should be done is to return ASCII in the
error case, and I mean U+0-U+7f only, and ignore or warn about any
unsupported characters.
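
For illustration, the ASCII-only behaviour described above might look roughly like the sketch below. The name `u32_to_ascii_strict` and its return convention are made up for this example and are not part of bash; it only shows the idea of passing U+0-U+7f through and warning about everything else.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not bash code: store c in out only if it is ASCII.
   Returns 1 on success; returns -1 and warns on stderr otherwise. */
int u32_to_ascii_strict(unsigned long c, char *out)
{
    assert(out != NULL);
    if (c > 0x7f) {
        /* Report the problem instead of guessing at an encoding. */
        fprintf(stderr, "warning: U+%04lX is not representable, skipped\n", c);
        return -1;
    }
    out[0] = (char)c;
    out[1] = '\0';
    return 1;
}
```

The caller gets an explicit failure for anything outside ASCII and can decide whether to abort or carry on.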

Using this function is dangerous and pointless.
I mean seriously, in what world does it make sense to inject utf-8 into a
big5 string? Or indeed into an ascii string. Code should behave like an
adult, not like a frightened kid. By which I mean it shouldn't pretend
it knows what it's doing when it doesn't; it should admit the problem so
that the problem can be fixed.

Wow. Do you really think that personal insults are a good way to advance
an argument?

Listen: bottom line.  It's a fallback function.  It's called in the
unlikely event that iconv isn't available at all and we're not in a
UTF-8 locale.  Any fallback is as good as another, though maybe the
best one would be to return \uNNNN or \UNNNNNNNN (before you ask,
Posix leaves the \u/\U failure cases unspecified).  The real question
is what to do with invalid input data, since any transformation is
going to "inject random data" into the buffer.  Maybe the identity
function would be better after all.  But then you'd ask whether or
not it makes sense to inject a C-style escape sequence into a big5
string.
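
As a rough sketch of the \uNNNN / \UNNNNNNNN fallback mentioned above: the helper below just formats the code point back into a C-style escape. The name `u32_to_escape` and the choice of uppercase hex digits are assumptions made for this example, not anything bash does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not bash code: write \uNNNN (or \UNNNNNNNN for
   code points above U+FFFF) into out. Returns the length written,
   or -1 if the buffer is too small. */
int u32_to_escape(unsigned long c, char *out, size_t outsz)
{
    int n;

    assert(out != NULL);
    if (c <= 0xffff)
        n = snprintf(out, outsz, "\\u%04lX", c);
    else
        n = snprintf(out, outsz, "\\U%08lX", c);
    return (n < 0 || (size_t)n >= outsz) ? -1 : n;
}
```

The output is pure ASCII, so it cannot form an invalid multibyte sequence, although it is still not the character that was asked for.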

Chet

I guess I was a bit terse; I wouldn't call it a personal insult, though. Then again, I guess I do have pretty thick skin. Sorry if you felt it was meant as one.

My point is that the fallback function/handler should report an error/warning, not do anything else, and move on. Trying to recover from an irrecoverable error just makes it more difficult to figure out what is going on. Basically this is a script/environment error, so report the error, don't hide it.

It's a similar problem with the iconv fallback of returning UTF-8. If iconv says it can't encode the unicode value in the destination charset, do we really know better? Again, it is better to report the error and move on, because injecting utf-8 into big5 or whatever is also wrong. If utf-8 were the destination charset, it would already have been detected or iconv would have worked, so contextually we know this is wrong.

  if (iconv (localconv, (ICONV_CONST char **)&iptr, &sn, &optr, &obytesleft) == (size_t)-1)
    return n;   /* You get utf-8 if iconv fails */

Now don't forget: at this point we know that iconv recognises the source and destination charsets, so the failure means the unicode character is unsupported in the destination charset.
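
To make the alternative concrete, here is a minimal sketch of a conversion that reports the failure instead of handing back UTF-8. `convert_strict` is a made-up name and this is not how unicode.c is structured; it only uses the standard `iconv_open`/`iconv`/`iconv_close` calls. It also treats a non-zero count of non-reversible conversions as a failure, which is an assumption on my part about what "strict" should mean.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert UTF-8 input to charset `to`.
   Returns the number of bytes written, or -1 with errno set
   (EINVAL: charset unknown, EILSEQ: character not representable). */
long convert_strict(const char *to, const char *in, size_t inlen,
                    char *out, size_t outsz)
{
    iconv_t cd;
    char *iptr = (char *)in;
    char *optr = out;
    size_t ileft = inlen, oleft = outsz, rc;
    int err;

    assert(to != NULL && in != NULL && out != NULL);
    cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;                      /* iconv_open already set errno */

    rc = iconv(cd, &iptr, &ileft, &optr, &oleft);
    if (rc == (size_t)-1)
        err = errno;                    /* EILSEQ, EINVAL or E2BIG */
    else
        err = (rc != 0) ? EILSEQ : 0;   /* substitutions count as failure */
    iconv_close(cd);

    if (err != 0) {
        errno = err;
        return -1;                      /* no UTF-8 fallback: just fail */
    }
    return (long)(outsz - oleft);
}
```

For example, converting the euro sign (bytes E2 82 AC) to "ASCII" fails with -1 rather than leaving UTF-8 bytes in the output; the same applies to any destination charset that lacks the character. A stateful destination encoding would also need a final flush call, which is left out here.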
Or here:
  n = u32toutf8 (c, s);
  if (utf8locale || localconv == (iconv_t)-1)
    return n;
That is: if the destination charset is utf-8, OR the destination charset is NOT utf-8 and iconv didn't recognise the destination charset, encode it as utf-8.

Let's say CTYPE=BIG5 and you try to encode a unicode char U+F000, which is an invalid big5 char (at least I think it is).
So iconv returns an error.
Now the code inserts the utf-8 encoding of U+F000, which is an invalid string sequence.
This isn't helping anybody.
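
To see which bytes actually end up in the string, here is a small standalone UTF-8 encoder. It is an illustrative sketch, not bash's `u32toutf8`; the name `encode_utf8` and the decision to reject surrogates are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative encoder, not bash's u32toutf8.
   Returns the number of bytes written (1-4), or 0 if c is not encodable. */
int encode_utf8(unsigned long c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xc0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3f));
        return 2;
    }
    if (c >= 0xd800 && c <= 0xdfff)
        return 0;                       /* surrogates are not characters */
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xe0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (c & 0x3f));
        return 3;
    }
    if (c < 0x110000) {
        out[0] = (unsigned char)(0xf0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (c & 0x3f));
        return 4;
    }
    return 0;
}
```

For U+F000 this produces the three bytes EF 80 80. As far as I know, 0xEF is a plausible Big5 lead byte but 0x80 is not a valid Big5 trail byte, so a Big5 decoder reading that string would hit an invalid sequence.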

Or let's put it another way.

Let's say you type
rm FileDoesntExist

Now "FileDoesntExist" doesn't exist, but instead of reporting an error rm deletes "FileDoesExist". I think we can agree this is unexpected behavior, though you could also argue that it is a fall-back behavior.

This is equivalent to what the function is currently doing. It has checked and knows it shouldn't output UTF-8, so it tries to encode to the correct charset, which doesn't work for whatever reason (unrecognized/unsupported destination charset, character not present in destination charset). However, instead of reporting the encoding/trans-coding error, it outputs UTF-8 anyway.

Current Thread

Re: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is trying to do?, John Kearney, 2012/03/07
- Re: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is trying to do?, Chet Ramey, 2012/03/10
  - Re: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is trying to do?, dethrophes, 2012/03/10
    - Re: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is trying to do?, dethrophes <=