internationalizing syntax files

From: Ben Pfaff
Subject: internationalizing syntax files
Date: Wed, 14 Jun 2006 17:59:31 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

I spent some time yesterday thinking about handling
internationalization of syntax files.  I concluded that there are
really two viable approaches: UTF-8 or UTF-32.  That is, we
should convert text that we read from syntax files, at time of
input, to one of these formats and process it in that format.
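To make the "convert at time of input" idea concrete, here is a minimal sketch of such a conversion step using POSIX iconv.  The source encoding (ISO-8859-1 here) and the helper name are just assumptions for illustration; real code would pick the encoding from the locale or the file, and would handle iconv errors and output overflow properly.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a Latin-1 string IN into UTF-8,
   storing the result in OUT (of size OUT_SIZE).  Returns the
   number of bytes written, not counting the null terminator.
   Error handling is omitted for brevity. */
static size_t
latin1_to_utf8 (const char *in, char *out, size_t out_size)
{
  iconv_t cd = iconv_open ("UTF-8", "ISO-8859-1");
  char *inp = (char *) in;
  size_t in_left = strlen (in);
  char *outp = out;
  size_t out_left = out_size - 1;

  iconv (cd, &inp, &in_left, &outp, &out_left);
  iconv_close (cd);
  *outp = '\0';
  return outp - out;
}
```

Reading a syntax file would then funnel every input line through such a function once, so the rest of the program only ever sees UTF-8.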

If we use UTF-8, there is the advantage that much processing code
can stay the same.  Some has to change to properly handle
multibyte characters, but at least UTF-8 is about as sane and
simple as a multibyte encoding can be.  Also, UTF-8 stores
English and European characters efficiently, up to 4x more
efficiently than UTF-32.  UTF-8 is also easier to shoehorn into
existing library interfaces.
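As a sketch of what "properly handle multibyte characters" means in practice, here is a minimal UTF-8 decoder of my own devising; it is illustrative only (it does not reject overlong forms or truncated input), and the function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at S into *CP and return the
   number of bytes consumed (0 if S does not start a valid
   sequence).  ASCII bytes pass through unchanged, which is why so
   much existing byte-oriented code keeps working. */
static size_t
utf8_decode (const unsigned char *s, unsigned int *cp)
{
  if (s[0] < 0x80)
    { *cp = s[0]; return 1; }
  else if ((s[0] & 0xe0) == 0xc0)
    { *cp = (s[0] & 0x1f) << 6 | (s[1] & 0x3f); return 2; }
  else if ((s[0] & 0xf0) == 0xe0)
    { *cp = (s[0] & 0x0f) << 12 | (s[1] & 0x3f) << 6
            | (s[2] & 0x3f); return 3; }
  else if ((s[0] & 0xf8) == 0xf0)
    { *cp = (s[0] & 0x07) << 18 | (s[1] & 0x3f) << 12
            | (s[2] & 0x3f) << 6 | (s[3] & 0x3f); return 4; }
  return 0;
}
```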

If we use UTF-32, we will have to change all our processing
code.  On the other hand, at least there's no overlooking code
that needs to change in a sea of code that doesn't.  There won't
need to be any special-casing for multibyte characters, since
every character fits in a single wchar_t.  Well, except that
some systems use UTF-16 for wchar_t, so that we'd need to support
surrogate pairs if we wanted to be really complete.
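To show what that surrogate-pair support would amount to, here is the standard UTF-16 combining formula as a tiny sketch (the function name is invented for the example): a character above U+FFFF occupies two 16-bit code units, a high surrogate in 0xD800-0xDBFF and a low surrogate in 0xDC00-0xDFFF.

```c
#include <assert.h>

/* Combine a UTF-16 surrogate pair (HI in 0xD800-0xDBFF, LO in
   0xDC00-0xDFFF) into the code point it encodes.  Only needed on
   systems where wchar_t is 16 bits. */
static unsigned int
combine_surrogates (unsigned int hi, unsigned int lo)
{
  return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
}
```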

Either of these formats has the potential pitfall that the host
machine's multibyte string encoding might not be based on ASCII
(or its wide string encoding might not be based on Unicode), so
that (wide) character and (wide) string literals would then be in
the wrong encoding.  However, in practice there's only one real
competitor to ASCII, which is EBCDIC.  On those systems, if we
chose to support them at all, we could use UTF-EBCDIC.
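If we wanted to catch the non-ASCII-host case early rather than support it, one possibility is a compile-time check in the classic negative-array-size style (the variable name is arbitrary):

```c
#include <assert.h>

/* Fails to compile unless the execution character set is
   ASCII-based, so that character literals like 'A' match UTF-8
   byte values.  On an EBCDIC system 'A' is 0xc1 and the array
   size below becomes -1, which is a compile error. */
static char ascii_check[('A' == 0x41 && '0' == 0x30) ? 1 : -1];
```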

Here's a summary.

UTF-8:
        - Needs multibyte support (but at least it's easy)
        - Some code needs to be rewritten (but which?)
        + Efficient storage of European characters
        + Easy interface to existing libraries

UTF-32:
        + Less need for multibyte support (well, except that
          wchar_t might only be 16 bits)
        - All string-handling code must be rewritten (but at
          least you can't miss important parts)
        - European characters expand 2x to 4x
        - Difficult interfaces to existing libraries

What do you think?  I am leaning toward UTF-8, not least because
it is possible to convert to using it in phases.  If we switch to
UTF-32, then we have to convert pretty much everything at once,
because once char pointers become wchar_t pointers, unconverted
code will either fail to compile or compile but misbehave.
Ben Pfaff 
email: address@hidden
