Re: internationalizing syntax files

From: John Darrington
Subject: Re: internationalizing syntax files
Date: Fri, 16 Jun 2006 09:29:43 +0800
User-agent: Mutt/1.5.9i

There are some parts here where I'm still not following your
reasoning.  I'll write a more coherent response tonight or at the
weekend.


On Thu, Jun 15, 2006 at 10:29:36AM -0700, Ben Pfaff wrote:
     John Darrington <address@hidden> writes:
     > On Wed, Jun 14, 2006 at 05:59:31PM -0700, Ben Pfaff wrote:
     >      Either of these formats has the potential pitfall that the host
     >      machine's multibyte string encoding might not be based on ASCII
     >      (or its wide string encoding might not be based on Unicode), so
     >      that (wide) character and (wide) string literals would then be in
     >      the wrong encoding.  However, in practice there's only one real
     >      competitor to ASCII, which is EBCDIC.  On those systems, if we
     >      chose to support them at all, we could use UTF-EBCDIC.
     > ????
     > I don't understand this.  Even if pspp's running on some host that has
     > a totally weird esoteric character set, the compiler should interpret
     > literals in that charset.  So if I have a line like:
     >     int x = 'A';
     > Then in ASCII, x == 65, in EBCDIC x == something else ....  Similarly,
     > for wide_chars:
     >     int x = L'A';
     > will work.
     But UTF-8 or UTF-32 *isn't* that totally weird esoteric character
     set, so translating syntax files to it will cause problems.
     I'm saying that we can't blindly translate syntax files to UTF-8
     or UTF-32 unless we also translate all of the string and
     character literals that we use in conjunction with them to UTF-8
     or UTF-32 also.  If the execution character set is Unicode, then
     no translation is needed; otherwise, we'd have to call a function
     to do that, which is inconvenient and relatively slow.
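
A minimal sketch of the concern in the paragraph above, in C.  The
function names are hypothetical and are not taken from PSPP.  The first
helper reports whether a few literals in the execution character set
have their ASCII values.  The second tests a byte of UTF-8 text against
the code point U+002E rather than the literal '.', so its result does
not depend on the host's execution character set.

```c
#include <stdbool.h>

/* Returns true if a few representative character literals have their
   ASCII values in the execution character set.  (Hypothetical helper;
   it samples only four characters, so it is a heuristic.) */
bool
exec_charset_is_ascii_based (void)
{
  return 'A' == 0x41 && 'a' == 0x61 && '0' == 0x30 && '.' == 0x2e;
}

/* Returns true if BYTE, taken from UTF-8 text, is FULL STOP (U+002E).
   Writing `byte == '.'` instead would compare against 0x4B on an
   EBCDIC host and so would fail to match UTF-8 input. */
bool
utf8_byte_is_full_stop (unsigned char byte)
{
  return byte == 0x2e;
}
```

On an ASCII-based host the two spellings agree, which is why the
problem only shows up once syntax text and literals are in different
encodings.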
     > Personally, I'm leaning [toward UTF-32].  Largely because, although it
     > may be more of a quantum leap, I think that any problems that are
     > introduced are going to be much more obvious with UTF-32.   
     [Pet peeve: of course I know what you mean, but in fact a
     "quantum" is the smallest possible amount of something.]
     Yes, that's something important to note.
     > In fact, I suggest that LESS code will need to be rewritten
     > (much of it will be simple substitution of typenames and
     > function call names), but like you say, it does have to be
     > written all at once.  With the UTF-8 approach, I predict that
     > subtle problems will remain undiscovered for a long time,
     > whereas with UTF-32 most will be caught at compile time.  
     > ---  Like you say, at least in UTF-32 one cannot miss the
     > important bits. 
     > I don't think that the storage inefficiency of UTF-32 is an issue
     > these days.  Even if it means that 4 times the size of the syntax file
     > is needed, syntax files are not huge like casefiles.  Today memory is
     > cheap. 
     It may not be worth worrying about.
     > Similarly I cannot conceive that there would be many platforms today
     > that have a 16-bit wchar_t.  If one does, let's just issue
     > a warning at configure time.
     The elephant in the room here is Windows.  If we ever want to
     have native Windows support, its wchar_t is 16 bits and that's
     unlikely to change as I understand it.
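
To illustrate what a 16-bit wchar_t implies, here is a minimal sketch
of decoding UTF-16 in C.  It uses uint16_t rather than wchar_t so that
it behaves the same on every host, and the function name is
hypothetical.  A code point above U+FFFF occupies two units (a
surrogate pair), so "one unit is one character" no longer holds.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from the N UTF-16 units starting at S and
   stores it in *CP.  Returns the number of units consumed (1 or 2),
   or 0 if N is 0 or the input starts with an unpaired surrogate. */
size_t
utf16_decode (const uint16_t *s, size_t n, uint32_t *cp)
{
  if (n == 0)
    return 0;

  /* Anything outside the surrogate range is a complete code point. */
  if (s[0] < 0xd800 || s[0] > 0xdfff)
    {
      *cp = s[0];
      return 1;
    }

  /* A high surrogate must be followed by a low surrogate. */
  if (s[0] <= 0xdbff && n >= 2 && s[1] >= 0xdc00 && s[1] <= 0xdfff)
    {
      *cp = 0x10000 + (((uint32_t) (s[0] - 0xd800) << 10)
                       | (uint32_t) (s[1] - 0xdc00));
      return 2;
    }

  return 0;
}
```

For example, U+1D11E is stored as the two units 0xD834 0xDD1E, and the
function above consumes both to produce the single code point.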
     > That leaves the question of interfacing to existing libraries.  All
     > the stdio/stdlib/ctype functions (eg: printf) have existing wchar_t
     > counterparts. Which particular libraries are you concerned about?
     I don't have anything in particular in mind.  It may not be worth
     worrying about.
     OK, stipulate for the moment that we decide to move to wide
     characters and strings for syntax files.  The biggest issue in my
     mind is, then, deciding how many assumptions we want to make
     about wchar_t.  There are several levels.  In rough order of
     increasingly strong assumptions:
             1. Don't make any assumptions.  There is no benefit to
                this above using "char", because C99 doesn't actually
                say that wide strings can't have stateful or
                multi-unit encodings.  It also doesn't say that the
                encoding of wchar_t is locale-independent.
             2. Assume that wchar_t has a stateless encoding.
             3. Assume that wchar_t has a stateless and
                locale-independent encoding.
             4. Assume that wchar_t is Unicode (one of UCS-2, UTF-16,
                UTF-32), and for UTF-16 ignore the possibility of
                surrogate pairs.  C99 recommends but does not require
                use of Unicode for wchar_t.  (There's a standard macro
                __STDC_ISO_10646__ that indicates this.)
             5. Assume that wchar_t is UTF-32.
     GCC and glibc conform to level 5.  Native Windows conforms to
     level 4.
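
A minimal sketch, in C, of how the strongest of the levels above might
be detected at compile time.  The function name and the mapping are
assumptions made for illustration: the preprocessor can see only
__STDC_ISO_10646__ and WCHAR_MAX, so levels 2 and 3 are not detectable
this way, and the _WIN32 branch simply takes the statement about native
Windows on trust.

```c
#include <stddef.h>
#include <stdint.h>             /* WCHAR_MAX */

/* Returns the strongest assumption level about wchar_t that the
   preprocessor can confirm: 5, 4, or 0 for "nothing confirmed".
   (Hypothetical helper; the level numbers follow the list above.) */
int
wchar_assumption_level (void)
{
#if defined __STDC_ISO_10646__ && WCHAR_MAX >= 0x10ffff
  /* Unicode, and wide enough to hold any code point in one unit. */
  return 5;
#elif defined __STDC_ISO_10646__ || defined _WIN32
  /* Unicode, but a unit may be too narrow for every code point. */
  return 4;
#else
  /* No guarantee visible at compile time. */
  return 0;
#endif
}
```

A configure-time check could apply the same conditions and issue the
warning suggested earlier in the thread when the result is below 5.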
     Ben Pfaff 
     email: address@hidden
     pspp-dev mailing list

PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See or any PGP keyserver for public key.
