Re: Patch for unicode in varnames...

From: L A Walsh
Subject: Re: Patch for unicode in varnames...
Date: Tue, 06 Jun 2017 07:01:23 -0700
User-agent: Thunderbird

George wrote:
> On Mon, 2017-06-05 at 16:16 -0700, L A Walsh wrote:
>> George wrote:
>>> On Mon, 2017-06-05 at 15:59 +0700, Peter & Kelly Passchier wrote:
>>>> On 05/06/2560 15:52, George wrote:
>>>>> there's not a reliable mechanism in place to run a script in a locale whose character encoding doesn't match that of the script
>>>> From my experience running such scripts is no problem, but correct rendering might depend on the client/editor.
> It depends on the source and target encodings. For most pairs of source and target encodings there is some case where reinterpreting a string from the source encoding as a string in the target encoding (without proper conversion) will result in an invalid string in the target encoding. For instance, if a script were written in ISO-8859-1, many possible sequences involving accented characters would actually be invalid in UTF-8.
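For what it's worth, that failure mode is easy to demonstrate from the shell (a sketch of my own, assuming iconv exits nonzero when its input is invalid in the source encoding, as glibc's does):

```shell
# 0xE9 is 'e'-acute in ISO-8859-1, but a lone 0xE9 byte is not a valid
# UTF-8 sequence, so a strict UTF-8 decode of the Latin-1 bytes fails.
printf 'caf\xe9\n' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
    && echo "valid UTF-8" || echo "invalid UTF-8"

# The same text properly encoded as UTF-8 (0xC3 0xA9) decodes fine.
printf 'caf\xc3\xa9\n' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
    && echo "valid UTF-8" || echo "invalid UTF-8"
```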

    Um... I think you are answering a case different from the one
stated (i.e., the locale being the same as the one used in the script),
so no conversion should take place.

> Eduardo's patch ... can only work correctly if the character set configured in the locale is the same as the character set of the script.
Right. The 1st paragraph above (written by you) mentions that. Given
that paragraph (which no one is contesting), we are only talking about
the case where the run locale and the script locale are the same.

 The Passchiers wrote that regardless of such agreement, you can still find
editors that ignore the locale or have no locale support at all,
and only display characters from the charset the editor was written for.
While that is also true, it can't really be helped: if your local editor
only writes in Chinese and the script is written in ASCII, you may be
out of luck in having it display properly.

> Broadly speaking I think the approach taken in Eduardo's patch (interpreting the byte sequence according to the rules of its character encoding) is better than the approach taken in current versions of Bash (letting 0x80-0xFF slide through the parser) - but that approach only works if you know the correct character encoding to use when processing the script. The information has to be provided in the script somehow.
Not exactly -- the only variable-length encoding scheme that linux
systems have had to worry about is UTF-8. So if you encounter valid UTF-8
in the input, it is probable that you can use UTF-8 for the whole script.
Otherwise, use a binary decoding stream, letting 0x80-0xFF be treated
either as the 2nd half of a 256-byte charset, or as a 128-byte charset
whose parity bit is not stripped but left "as-is".
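That heuristic can be sketched in shell (my illustration, not Eduardo's patch; it assumes iconv rejects bytes that are invalid in the named source encoding):

```shell
#!/bin/sh
# Classify a script as ASCII, UTF-8, or "some 8-bit charset" by trying
# progressively stricter decodes.  A file that is pure 7-bit passes the
# ASCII check; one with high bytes forming valid UTF-8 sequences passes
# the UTF-8 check; anything else is treated byte-for-byte as-is.
classify() {
    if iconv -f ASCII -t ASCII "$1" >/dev/null 2>&1; then
        echo ASCII
    elif iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo UTF-8
    else
        echo 8BIT       # leave 0x80-0xFF untouched, per the text above
    fi
}

printf 'echo hi\n'  > /tmp/enc_a.sh     # pure ASCII
printf '\xc3\xa9\n' > /tmp/enc_b.sh     # 'e'-acute encoded as UTF-8
printf '\xe9\n'     > /tmp/enc_c.sh     # 'e'-acute in ISO-8859-1
classify /tmp/enc_a.sh
classify /tmp/enc_b.sh
classify /tmp/enc_c.sh
```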

The "file" utility is one example of a tool that can usually tell
the encoding of a text file -- at least telling the difference between
UTF-8, ASCII and some 8-bit local charset.
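For instance (a sketch; this assumes a file(1) recent enough to support --mime-encoding, and the exact wording of its output varies between versions):

```shell
printf 'hello\n'    > /tmp/fenc_ascii.txt    # pure 7-bit text
printf '\xc3\xa9\n' > /tmp/fenc_utf8.txt     # UTF-8 accented character
printf 'caf\xe9\n'  > /tmp/fenc_latin1.txt   # ISO-8859-1 accented character
# Typically reports something like us-ascii, utf-8, iso-8859-1:
file -b --mime-encoding /tmp/fenc_ascii.txt /tmp/fenc_utf8.txt /tmp/fenc_latin1.txt
```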

While such methods may not be 100% accurate, they are usually good enough
for most usages where one isn't running (we hope) random scripts of unknown
origin off the web.

   FWIW, I think we are in agreement, though it may not be clear!  ;-)
