Re: internationalizing syntax files

From: John Darrington
Subject: Re: internationalizing syntax files
Date: Thu, 15 Jun 2006 10:48:48 +0800
User-agent: Mutt/1.5.9i

On Wed, Jun 14, 2006 at 05:59:31PM -0700, Ben Pfaff wrote:
     Either of these formats has the potential pitfall that the host
     machine's multibyte string encoding might not be based on ASCII
     (or its wide string encoding might not be based on Unicode), so
     that (wide) character and (wide) string literals would then be in
     the wrong encoding.  However, in practice there's only one real
     competitor to ASCII, which is EBCDIC.  On those systems, if we
     chose to support them at all, we could use UTF-EBCDIC.

I don't understand this.  Even if pspp's running on some host that has
a totally weird, esoteric character set, the compiler should interpret
literals in that charset.  So if I have a line like:

    int x = 'A';

Then in ASCII, x == 65; in EBCDIC, x == something else.  Similarly,
for wide characters:

    int x = L'A';

will work.
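
To make that concrete, here is a small sketch that asks the compiler
what it actually chose for those two literals.  The function names are
made up for illustration; only the literals themselves come from the
discussion above.  On an ASCII/Unicode host both checks should be
true, and on an EBCDIC host the first one would not be.

```c
#include <stdio.h>
#include <wchar.h>

/* Nonzero if the execution character set encodes 'A' as ASCII does. */
int narrow_literals_are_ascii(void)
{
  return 'A' == 65;
}

/* Nonzero if wide literals use the Unicode code point for 'A' (U+0041). */
int wide_literals_are_unicode(void)
{
  return L'A' == 0x41;
}

/* Print the values this compiler assigned to both literals. */
void report_literals(void)
{
  printf("'A' = %d, L'A' = %ld\n", 'A', (long) L'A');
}
```

Code that compares against 'A' keeps working whichever value the
compiler picks; only code that hard-codes 65 depends on the answer.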

The only time it'll fall down is if for some reason somebody has
decided to use numeric literals where character or string literals
should have been used.

     Here's a summary.

         UTF-8:
             - Needs multibyte support (but at least it's easy)
             + Some code needs to be rewritten (but which?)
             + Efficient storage of European characters
             + Easy interface to existing libraries

         UTF-32:
             + Less need for multibyte support (well, except that
               wchar_t might only be 16 bits)
             - All string-handling code must be rewritten (but at
               least you can't miss important parts)
             - European characters expand 2x to 4x
             - Difficult interfaces to existing libraries.

     What do you think?  I am leaning toward UTF-8, not least because
     it is possible to convert to using it in phases.  If we switch to
     UTF-32, then we have to convert pretty much everything all at
     once, because code will not compile or, if it does, will not
     work, when char pointers become wchar_t pointers.

Personally, I'm leaning the other way.  Largely because, although it
may be more of a quantum leap, I think that any problems that are
introduced are going to be much more obvious with UTF-32.   In fact, I
suggest that LESS code will need to be rewritten (much of it will be
simple substitution of typenames and function call names), but like
you say, it does have to be written all at once.   With the UTF-8
approach, I predict that subtle problems will remain undiscovered for
a long time, whereas with UTF-32 most will be caught at compile time.
For example, flip.c contains code similar to:

    make_new_var(const char *name)
    {
      const char *cp = name;   /* start at the first byte of name */
      if ( lex_is_id1(*cp) )
        ...
    }

In this case, if the first byte in name happens to be part of a
multi-byte sequence, then there's no way the compiler can know that
dereferencing cp this way is inappropriate.   There's a lot of pointer
arithmetic and array indexing in the string parsing code, and it'd
have to be carefully audited to have confidence it'll all work for
multibyte strings.
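
Here is a rough sketch of the kind of silent mismatch I mean.  Both
functions are hypothetical: is_id1_bytewise only stands in for a
byte-at-a-time test like lex_is_id1 (it is not the real pspp
function), and utf8_count assumes its input is valid UTF-8 on an
ASCII-based host.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a UTF-8 string by skipping continuation bytes
   (those of the form 10xxxxxx).  Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
  size_t n = 0;
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}

/* Hypothetical stand-in for a byte-wise identifier test: accepts only
   ASCII letters, and assumes an ASCII-based execution character set. */
int is_id1_bytewise(char c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
```

For the name "\xC3\xA9t\xC3\xA9" (the French word for summer, in
UTF-8), strlen reports 5 bytes while utf8_count reports 3 characters,
and is_id1_bytewise rejects the first byte even though the first
character is a letter.  None of that produces a compiler diagnostic.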

We don't currently have any developers who use pspp in a non-European
language, so we'd probably only learn about bugs when a Japanese user
complains.  Like you say, at least in UTF-32 one cannot miss the
important bits.

I don't think that the storage inefficiency of UTF-32 is an issue
these days.  Even if it means that 4 times the size of the syntax file
is needed, syntax files are not huge like casefiles.  Today memory is
cheap.

Similarly, I cannot conceive that there would be many platforms today
that have a 16-bit wchar_t.  If a platform does, let's just issue
a warning at configure time.
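
As a sketch of what such a check could test (the function names are
invented for illustration, and a real configure script would run
something equivalent at build time rather than at run time):

```c
#include <limits.h>
#include <wchar.h>

/* Width of wchar_t in bits on this platform. */
int wchar_bits(void)
{
  return (int) (sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero if wchar_t can represent every Unicode code point
   (up to U+10FFFF), which storing UTF-32 in it requires. */
int wchar_holds_utf32(void)
{
  return WCHAR_MAX >= 0x10FFFF;
}
```

In an Autoconf-based build, a macro such as AC_CHECK_SIZEOF could
supply the same number, and the warning would be issued when
wchar_holds_utf32 would come out false.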

That leaves the question of interfacing to existing libraries.  All
the stdio/stdlib/ctype functions (e.g. printf) have existing wchar_t
counterparts. Which particular libraries are you concerned about?
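
For instance, here is a small sketch using a few of those standard
counterparts (iswalpha for isalpha, swprintf for snprintf).  The two
helper functions are hypothetical and not taken from pspp.

```c
#include <wchar.h>
#include <wctype.h>

/* Length of the leading run of alphabetic characters in s.
   iswalpha is locale-dependent for non-ASCII characters. */
size_t leading_alpha(const wchar_t *s)
{
  size_t n = 0;
  while (s[n] != L'\0' && iswalpha((wint_t) s[n]))
    n++;
  return n;
}

/* Format "<name>=<value>" into buf, which holds size wide characters.
   Returns the number of wide characters written (excluding the
   terminator), or a negative value on error. */
int format_pair(wchar_t *buf, size_t size, const wchar_t *name, int value)
{
  return swprintf(buf, size, L"%ls=%d", name, value);
}
```

Most of the conversion is the mechanical substitution shown here:
char becomes wchar_t, string literals gain an L prefix, and each call
is swapped for its wide equivalent.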

PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See or any PGP keyserver for public key.

