i18n proposal
From: Ben Pfaff
Subject: i18n proposal
Date: Sun, 18 Jun 2006 14:50:37 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

Based on the ongoing discussion here, I'm trying to come up with
an acceptable proposal for i18n of syntax files, messages,
output, and data files. Here is what I have so far.
* We assume that wchar_t can be processed statelessly, that its
encoding is independent of the current locale, and that isw*(),
tow*(), etc. are independent of locale. This is true of glibc,
and C99 recommends but does not require such an environment.
* Syntax files are converted from the "syntax locale" into wide
characters as they are read. Processing of syntax thereafter
occurs exclusively in wide characters.
I am not sure what the default syntax locale should be.
Probably there should be a command-line option, which is what
GCC does, and possibly the default should be based on
environment variables (e.g. LANG, LC_ALL).
* String data that occurs in cases is primarily treated as opaque
octets. Even procedures like SORT CASES that could easily do
better (by using language-specific collation rules via, e.g.,
wcscoll()) are documented to use bytewise comparison.
Expression syntax is another place where it might seem that
encoding-specific rules would be handy, but it appears that
only a few functions actually have any: LOWER, UPPER,
MBLEN.BYTE. (PSPP doesn't yet implement the last of these.)
The "data locale" used for these locale-specific behaviors and
for converting string data into output data on PRINT, WRITE,
LIST, etc., is controlled by SET LOCALE or by environment
variables (e.g. LANG, LC_ALL). We could also add subcommands
on GET, etc., that specify the locale used in files that we
read.
At one point we discussed whether the "data locale" should be
global, or per-file, or per-variable. I think at the time I
was arguing for per-file, but now I'm beginning to believe that
per-variable might be practical to implement.
Literal strings ("..." and '...') get converted back to
multibyte strings during lexical analysis, and afterward we
mainly treat them as opaque octets also.
* Diagnostic messages need to be multibyte strings, for two
reasons both relating to the gettext interface. First, the
argument to gettext() is a multibyte string. Second, if
gettext() cannot find a translation for a message, then it
returns the message it was passed verbatim, without doing any
character set conversion. (This is documented as being
intentional in the GNU libc manual.)
We need to format messages in two locales: the "output locale"
used for output, and the "interface locale" used by the user
interface. The former is controlled by SET (DEF)OLANG, the
latter by environment variables (e.g. LANG, LC_ALL).
* The interface to the output subsystem (that is, primarily the
functions in output.h and tab.h) should use multibyte strings,
for these reasons. First, strings passed to the tab_*()
functions are often fed through gettext() along the way, so
wide strings would be inconvenient. Second, tables can get
very large, so wide strings would be wasteful.
(The ASCII driver might want to change its representation of
the page to wide strings, though, because this would be an easy
way for it to support Asian character sets.)
* Each "struct variable" is split between multibyte and wide
strings. Variable names are used as part of syntax processing,
so we will probably want to change "name" to a wide string.
Missing values and the value part of each value label are case
data, so they are opaque octets. Variable labels and the
label part of each value label are used primarily in output, so
they should probably remain multibyte strings.
* Finally, what should we pass to setlocale()? I think that we
should select, with LC_ALL, the "output locale". Rationale:
- We don't need to be in the "syntax locale", because
we're doing syntax processing as wide characters, which
are locale independent.
We may need to temporarily switch to the "syntax
locale" to read from the syntax file; otherwise
fgetwc() and other wide input functions won't do what
we like. Alternatively, we could read syntax files as
binary octets and convert to wide characters with
iconv().
- We don't need to be in the "data locale", because most
of the time we just deal with data as opaque octets.
When occasionally we do need to deal with data as
strings, we can switch locales temporarily.
- We don't need to be in the "interface locale", because
message output must call gettext() in both the output
and interface locales, so it will have to switch
locales in any case.
- We do want to be in the "output locale". Otherwise,
we'll need to change to the output locale before we
prepare output.
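
To make the "switch locales temporarily" idea above concrete, here is a
rough sketch of what reading syntax text under the "syntax locale" might
look like. The function name mbs_to_wcs_in_locale() is hypothetical, not
an existing PSPP function. The sketch changes only LC_CTYPE, which is an
assumption on my part (the proposal above speaks of LC_ALL), and it
relies on setlocale(), which is process-global, so it is not safe with
multiple threads.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts the multibyte string MBS, encoded
   according to LOCALE_NAME, into a newly allocated wide string.
   Temporarily switches LC_CTYPE and restores the previous setting
   before returning.  Returns NULL on failure; the caller frees the
   result. */
static wchar_t *
mbs_to_wcs_in_locale (const char *mbs, const char *locale_name)
{
  wchar_t *wcs = NULL;
  const char *cur;
  char *saved;

  assert (mbs != NULL && locale_name != NULL);

  /* setlocale() may return a pointer to static storage that a later
     call overwrites, so keep a private copy of the current name. */
  cur = setlocale (LC_CTYPE, NULL);
  if (cur == NULL)
    return NULL;
  saved = malloc (strlen (cur) + 1);
  if (saved == NULL)
    return NULL;
  strcpy (saved, cur);

  if (setlocale (LC_CTYPE, locale_name) != NULL)
    {
      const char *src = mbs;
      mbstate_t state;
      size_t n;

      /* First pass: count the wide characters needed. */
      memset (&state, 0, sizeof state);
      n = mbsrtowcs (NULL, &src, 0, &state);
      if (n != (size_t) -1)
        {
          wcs = malloc ((n + 1) * sizeof *wcs);
          if (wcs != NULL)
            {
              /* Second pass: convert, including the terminator. */
              src = mbs;
              memset (&state, 0, sizeof state);
              mbsrtowcs (wcs, &src, n + 1, &state);
            }
        }

      /* Switch back to whatever locale was in effect before. */
      setlocale (LC_CTYPE, saved);
    }

  free (saved);
  return wcs;
}
```

A caller would pass the syntax locale's name, e.g.
mbs_to_wcs_in_locale (line, syntax_locale), where both arguments are
placeholders. The iconv() alternative mentioned above would avoid
touching the global locale at all, at the cost of having to name the
encodings explicitly.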
Comments?
--
Ben Pfaff
email: address@hidden
web: http://benpfaff.org