Re: internationalizing syntax files

From: John Darrington
Subject: Re: internationalizing syntax files
Date: Fri, 16 Jun 2006 18:32:32 +0800
User-agent: Mutt/1.5.9i

On Thu, Jun 15, 2006 at 10:29:36AM -0700, Ben Pfaff wrote:

     [Pet peeve: of course I know what you mean, but in fact a
     "quantum" is the smallest possible amount of something.]

Quantum mechanics never was my forte.  However, as I understand it,
the metaphor stems from the fact that a quantum is the smallest
possible amount of energy needed to move an electron from one shell
of the Bohr atom to the next; in atomic terms, that is quite a large
amount of energy.  Anyway, when I use the expression "quantum leap",
I normally want to convey the idea of "operating at a different
level".

     OK, stipulate for the moment that we decide to move to wide
     characters and strings for syntax files.  The biggest issue in my
     mind is, then, deciding how many assumptions we want to make
     about wchar_t.  There are several levels.  In rough order of
     increasingly strong assumptions:
             1. Don't make any assumptions.  There is no benefit to
                this above using "char", because C99 doesn't actually
                say that wide strings can't have stateful or
                multi-unit encodings.  It also doesn't say that the
                encoding of wchar_t is locale-independent.
             2. Assume that wchar_t has a stateless encoding.
             3. Assume that wchar_t has a stateless and
                locale-independent encoding.
             4. Assume that wchar_t is Unicode (one of UCS-2, UTF-16,
                UTF-32), and for UTF-16 ignore the possibility of
                surrogate pairs.  C99 recommends but does not require
                use of Unicode for wchar_t.  (There's a standard macro
                __STDC_ISO_10646__ that indicates this.)
             5. Assume that wchar_t is UTF-32.
     GCC and glibc conform to level 5.  Native Windows conforms to
     level 4.

In the above, I'm assuming that when you say "wchar_t has a stateless
encoding", you mean that the entity reading the stream is stateless.
wchar_t is (on my machine at least) just a typedef to int, so it
can't carry any "state" beyond its face value.

So, that being so, I don't think we need to make any assumptions
beyond level 3. See below for elaboration:

     I'm saying that we can't blindly translate syntax files to UTF-8
     or UTF-32 unless we also translate all of the string and
     character literals that we use in conjunction with them to UTF-8
     or UTF-32 also.  If the execution character set is Unicode, then
     no translation is needed; otherwise, we'd have to call a function
     to do that, which is inconvenient and relatively slow.

Surely, the string and character literals are converted to UTF-32 by the
compiler?  Just by saying:

const wchar_t str[] = L"foo";

then str contains a UTF-32 string (or whatever the wchar_t encoding
for that platform happens to be).  We'd have to change strings like
"REGRESSION" to L"REGRESSION" in command.def and other files in
language/lexer, but that doesn't involve any function calls.
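
A keyword match in the lexer would then look something like this (a
sketch only; the function name lex_id_match_w and the keyword table
are placeholders, not current PSPP code):

```c
/* Sketch: keyword lookup against wide-string literals.  The compiler
   translates L"REGRESSION" etc. to the platform's wchar_t encoding at
   compile time, so no runtime conversion call is needed. */
#include <stddef.h>
#include <wchar.h>

static const wchar_t *const keywords[] =
  {
    L"DESCRIPTIVES",
    L"REGRESSION",
    L"T-TEST",
  };

/* Returns 1 if TOKEN matches a known command keyword, else 0. */
int
lex_id_match_w (const wchar_t *token)
{
  size_t i;

  for (i = 0; i < sizeof keywords / sizeof *keywords; i++)
    if (wcscmp (token, keywords[i]) == 0)
      return 1;
  return 0;
}
```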

Currently, syntax is read one line at a time, using ds_read_line from
str.c.  The way I see it working is that a wchar_t counterpart to
str.c is created (call it wstr.c).  In dws_read_line, the call to
getc(stream) is replaced by getwc(stream).  Now the man page for
getwc(3) says:

       The behaviour of fgetwc depends on the LC_CTYPE category of the current
       locale.

       In the absence of additional information passed to the fopen  call,  it
       is  reasonable  to  expect  that  fgetwc will actually read a multibyte
       sequence from the stream and then convert it to a wide character.

This "reasonable" expectation seems to be a statement of your
assumption #3 above.
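
A sketch of that dws_read_line counterpart (the wstring type and its
growth policy here are placeholders, not the real str.c interface):

```c
/* Sketch: reading one line of syntax as wide characters.  fgetwc
   performs the multibyte-to-wchar_t conversion according to the
   LC_CTYPE category of the current locale. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

struct wstring
  {
    wchar_t *data;
    size_t length;
    size_t capacity;
  };

static void
wstring_append (struct wstring *ws, wchar_t wc)
{
  if (ws->length >= ws->capacity)
    {
      ws->capacity = ws->capacity ? ws->capacity * 2 : 16;
      ws->data = realloc (ws->data, ws->capacity * sizeof *ws->data);
      if (ws->data == NULL)
        abort ();
    }
  ws->data[ws->length++] = wc;
}

/* Reads one line from STREAM into WS, discarding the newline.
   Returns false at end of file when nothing was read. */
bool
dws_read_line (struct wstring *ws, FILE *stream)
{
  wint_t wc;

  ws->length = 0;
  while ((wc = fgetwc (stream)) != WEOF && wc != L'\n')
    wstring_append (ws, (wchar_t) wc);
  return ws->length > 0 || wc != WEOF;
}
```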

So, let us assume that I'm running PSPP on a machine whose wchar_t
happens to be UTF-32 encoded, and whose native charset is EBCDIC.  So
long as my LC_CTYPE encoding specifies EBCDIC, syntax files will be
dutifully converted to UTF-32 and, during parsing, compared with
UTF-32 string constants.  If I'm provided with a syntax file encoded
in UTF-8, I can use it simply by changing LANG (or LC_CTYPE) to
en_AU.UTF-8 (or similar).

     > Similarly I cannot conceive that there would be many platforms today
     > that have a 16-bit wchar_t.  If there are, let's just issue
     > a warning at configure time.
     The elephant in the room here is Windows.  If we ever want to
     have native Windows support, its wchar_t is 16 bits and that's
     unlikely to change as I understand it.
I'm treading outside the bounds of my understanding of Unicode now.
But I read a bit of the web site, and from what I can infer, almost
all the glyphs for modern natural languages are located below 65536.
The "code points" above that are for ancient scripts, mathematical
symbols, etc.


PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See or any PGP keyserver for public key.

