Re: internationalizing syntax files

From: Ben Pfaff
Subject: Re: internationalizing syntax files
Date: Thu, 15 Jun 2006 10:29:36 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> On Wed, Jun 14, 2006 at 05:59:31PM -0700, Ben Pfaff wrote:
>      Either of these formats has the potential pitfall that the host
>      machine's multibyte string encoding might not be based on ASCII
>      (or its wide string encoding might not be based on Unicode), so
>      that (wide) character and (wide) string literals would then be in
>      the wrong encoding.  However, in practice there's only one real
>      competitor to ASCII, which is EBCDIC.  On those systems, if we
>      chose to support them at all, we could use UTF-EBCDIC.
> ????
> I don't understand this.  Even if pspp's running on some host that has
> a totally weird esoteric character set, the compiler should interpret
> literals in that charset.  So if I have a line like:
>     int x = 'A';
> Then in ASCII, x == 65, in EBCDIC x == something else ....  Similarly,
> for wide_chars:
>     int x = L'A';
> will work.

But UTF-8 or UTF-32 *isn't* that totally weird esoteric character
set, so translating syntax files to it will cause problems.

I'm saying that we can't blindly translate syntax files to UTF-8
or UTF-32 unless we also translate all of the string and
character literals that we use in conjunction with them to the
same encoding.  If the execution character set is Unicode, then
no translation is needed; otherwise, we'd have to call a function
to do that, which is inconvenient and relatively slow.
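
As a minimal sketch of that distinction (not PSPP code): the helper
name is_letter_a is hypothetical, and the sketch assumes that syntax
text has already been decoded to UTF-32 code points.  When the
standard macro __STDC_ISO_10646__ is defined, a wide literal can be
compared directly; otherwise the literal may be in some other
encoding, so the code falls back to the numeric code point.

```c
#include <stdbool.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: returns true if the UTF-32 code point CP is
   LATIN CAPITAL LETTER A (U+0041). */
static bool
is_letter_a (uint32_t cp)
{
#ifdef __STDC_ISO_10646__
  /* wchar_t values are ISO 10646 code points here, so the wide
     literal needs no translation. */
  return cp == (uint32_t) L'A';
#else
  /* The wide execution character set is unknown, so L'A' might not
     be 0x41.  Compare against the code point itself. */
  return cp == 0x41;
#endif
}
```

The fallback branch is the cheap case, a single character.  For
string literals the equivalent would be a run-time conversion, which
is the inconvenient and slow part.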

> Personally, I'm leaning [toward UTF-32].  Largely because, although it
> may be more of a quantum leap, I think that any problems that are
> introduced are going to be much more obvious with UTF-32.   

[Pet peeve: of course I know what you mean, but in fact a
"quantum" is the smallest possible amount of something.]

Yes, that problems would be more obvious with UTF-32 is something
important to note.

> In fact, I suggest that LESS code will need to be rewritten
> (much of it will be simple substitution of typenames and
> function call names), but like you say, it does have to be
> written all at once.  With the UTF-8 approach, I predict that
> subtle problems will remain undiscovered for a long time,
> whereas with UTF-32 most will be caught at compile time.  
> ---  Like you say, at least in UTF-32 one cannot miss the
> important bits. 
> I don't think that the storage inefficiency of UTF-32 is an issue
> these days.  Even if it means that 4 times the size of the syntax file
> is needed, syntax files are not huge like casefiles.  Today memory is
> cheap. 

It may not be worth worrying about.

> Similarly I cannot conceive that there would be many platforms today
> that have a sizeof(wchar_t) of 16 bits.  If it does, let's just issue
> a warning at configure time.

The elephant in the room here is Windows.  If we ever want to
have native Windows support, its wchar_t is 16 bits and that's
unlikely to change as I understand it.
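
A check along these lines could tell the two cases apart.  This is
only a sketch: the function names are hypothetical, and it relies on
nothing beyond CHAR_BIT from <limits.h> and WCHAR_MAX from <wchar.h>.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Returns the storage width of wchar_t in bits. */
static int
wchar_bits (void)
{
  return (int) (sizeof (wchar_t) * CHAR_BIT);
}

/* Returns nonzero if a single wchar_t can hold every Unicode code
   point, that is, any value up to U+10FFFF. */
static int
wchar_holds_any_code_point (void)
{
  return WCHAR_MAX >= 0x10FFFF;
}
```

With a 16-bit wchar_t the second function returns zero, which is the
case a configure-time warning would need to report.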

> That leaves the question of interfacing to existing libraries.  All
> the stdio/stdlib/ctype functions (eg: printf) have existing wchar_t
> counterparts. Which particular libraries are you concerned about?

I don't have anything in particular in mind.  It may not be worth
worrying about.

OK, stipulate for the moment that we decide to move to wide
characters and strings for syntax files.  The biggest issue in my
mind is, then, deciding how many assumptions we want to make
about wchar_t.  There are several levels.  In rough order of
increasingly strong assumptions:

        1. Don't make any assumptions.  There is no benefit to
           this over using "char", because C99 doesn't actually
           say that wide strings can't have stateful or
           multi-unit encodings.  It also doesn't say that the
           encoding of wchar_t is locale-independent.

        2. Assume that wchar_t has a stateless encoding.

        3. Assume that wchar_t has a stateless and
           locale-independent encoding.

        4. Assume that wchar_t is Unicode (one of UCS-2, UTF-16,
           UTF-32), and for UTF-16 ignore the possibility of
           surrogate pairs.  C99 recommends but does not require
           use of Unicode for wchar_t.  (There's a standard macro
           __STDC_ISO_10646__ that indicates this.)

        5. Assume that wchar_t is UTF-32.

GCC and glibc conform to level 5.  Native Windows conforms to
level 4.
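
Levels 4 and 5 are the only ones that can be partly detected from
standard macros.  A rough sketch, assuming the enum and function
names below (both hypothetical) and treating "WCHAR_MAX covers
U+10FFFF" as a stand-in for UTF-32:

```c
#include <wchar.h>

/* Hypothetical labels for the assumption levels listed above. */
enum wchar_level
  {
    WCHAR_LEVEL_UNKNOWN = 0,    /* Nothing can be assumed (levels 1-3). */
    WCHAR_LEVEL_UNICODE = 4,    /* Unicode, but narrower than UTF-32. */
    WCHAR_LEVEL_UTF32 = 5       /* Unicode and wide enough for UTF-32. */
  };

/* Classifies wchar_t using only __STDC_ISO_10646__ and WCHAR_MAX. */
static enum wchar_level
detect_wchar_level (void)
{
#ifdef __STDC_ISO_10646__
  return WCHAR_MAX >= 0x10FFFF ? WCHAR_LEVEL_UTF32 : WCHAR_LEVEL_UNICODE;
#else
  return WCHAR_LEVEL_UNKNOWN;
#endif
}
```

One caveat: an implementation is not required to define
__STDC_ISO_10646__ even when wchar_t is Unicode, so a result of
WCHAR_LEVEL_UNKNOWN only means that the macro gives no guarantee.  A
native Windows build would probably need a separate platform test.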

Ben Pfaff 
email: address@hidden