i18n proposal

pspp-dev

From:	Ben Pfaff
Subject:	i18n proposal
Date:	Sun, 18 Jun 2006 14:50:37 -0700
User-agent:	Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

Based on the ongoing discussion here, I'm trying to come up with
an acceptable proposal for i18n of syntax files, messages,
output, and data files.  Here is what I have so far.

* We assume that wchar_t can be processed statelessly, that its
  encoding is independent of the current locale, and that isw*(),
  tow*(), etc. are independent of locale.  This is true of glibc,
  and C99 recommends, but does not require, such an environment.
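
  One compile-time hint for the first part of this assumption is the
  C99 macro __STDC_ISO_10646__, which an implementation defines when
  wchar_t values are ISO/IEC 10646 code points.  The sketch below only
  reports whether the macro is defined; the function name is
  hypothetical, and the macro says nothing about whether isw*() and
  tow*() are locale-independent.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
   wchar_t holds ISO/IEC 10646 code points, 0 if it makes no such
   promise. */
int
wchar_is_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  /* With the macro defined, basic characters have their ISO 10646
     values, so this sanity check should always hold. */
  assert (L'A' == 0x41);
  return 1;
#else
  return 0;
#endif
}
```

  A configure-time test could use the same macro to warn when the
  assumption is not known to hold.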

* Syntax files are converted from the "syntax locale" into wide
  characters as they are read.  Processing of syntax thereafter
  occurs exclusively in wide characters.

  I am not sure what the default syntax locale should be.
  Probably there should be a command-line option, which is what
  GCC does, and possibly the default should be based on
  environment variables (e.g. LANG, LC_ALL).
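
  As a rough illustration of the conversion step, the sketch below
  reads syntax as octets and converts them to wide characters with
  iconv().  It is not a proposed implementation: the function name is
  hypothetical, the "WCHAR_T" encoding name is an assumption that
  holds for glibc and GNU libiconv but is not standardized, and the
  output buffer assumes at most one wide character per input octet.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts the N octets at IN, encoded in
   SYNTAX_ENCODING, into a newly allocated null-terminated wide
   string.  Returns NULL on any failure, including invalid or
   truncated input.  The caller frees the result. */
wchar_t *
syntax_to_wide (const char *syntax_encoding, const char *in, size_t n)
{
  assert (syntax_encoding != NULL && (in != NULL || n == 0));

  /* "WCHAR_T" is an assumption: glibc and GNU libiconv accept it. */
  iconv_t cd = iconv_open ("WCHAR_T", syntax_encoding);
  if (cd == (iconv_t) -1)
    return NULL;

  /* Room for N wide characters plus the terminator. */
  wchar_t *out = malloc ((n + 1) * sizeof *out);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *inp = (char *) in;
  size_t in_left = n;
  char *outp = (char *) out;
  size_t out_left = n * sizeof *out;

  size_t rc = iconv (cd, &inp, &in_left, &outp, &out_left);
  iconv_close (cd);
  if (rc == (size_t) -1)
    {
      free (out);
      return NULL;
    }

  size_t n_wide = (size_t) (outp - (char *) out) / sizeof *out;
  out[n_wide] = L'\0';
  return out;
}
```

  For example, syntax_to_wide ("ISO-8859-1", buf, len) would convert
  a Latin-1 syntax file that had been read into buf.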

* String data that occurs in cases is primarily treated as opaque
  octets.  Even procedures like SORT CASES that could easily do
  better (by using language-specific collation rules via, e.g.,
  wcscoll()) are documented to use bytewise comparison.
  Expression syntax is another place where it might seem that
  encoding-specific rules would be handy, but it appears that
  only a few functions actually have any: LOWER, UPPER,
  MBLEN.BYTE.  (PSPP doesn't yet implement the last of these.)

  The "data locale" used for these locale-specific behaviors and
  for converting string data into output data on PRINT, WRITE,
  LIST, etc., is controlled by SET LOCALE or by environment
  variables (e.g. LANG, LC_ALL).  We could also add subcommands
  on GET, etc., that specify the locale used in files that we
  read.

  At one point we discussed whether the "data locale" should be
  global, or per-file, or per-variable.  I think at the time I
  was arguing for per-file, but now I'm beginning to believe that
  per-variable might be practical to implement.

  Literal strings ("..." and '...') get converted back to
  multibyte strings during lexical analysis, and afterward we
  mainly treat them as opaque octets also.
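
  To make "opaque octets" concrete, here is a minimal sketch of a
  bytewise comparison of two string values.  The function name and
  the fixed-width calling convention are assumptions made for the
  example, not PSPP's actual interface.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: three-way comparison of two string values of
   WIDTH octets each.  The octets are compared as unsigned values,
   with no regard to encoding or locale.  Returns -1, 0, or 1. */
int
compare_octets (const char *a, const char *b, size_t width)
{
  assert (a != NULL && b != NULL);

  int cmp = memcmp (a, b, width);
  return (cmp > 0) - (cmp < 0);
}
```

  Under this rule the Latin-1 octet 0xE9 (e with acute accent) sorts
  after "z", whereas a language-aware collation would typically place
  it near "e".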

* Diagnostic messages need to be multibyte strings, for two
  reasons both relating to the gettext interface.  First, the
  argument to gettext() is a multibyte string.  Second, if
  gettext() cannot find a translation for a message, then it
  returns the message it was passed verbatim, without doing any
  character set conversion.  (This is documented as being
  intentional in the GNU libc manual.)

  We need to format messages in two locales: the "output locale"
  used for output, and the "interface locale" used by the user
  interface.  The former is controlled by SET (DEF)OLANG, the
  latter by environment variables (e.g. LANG, LC_ALL).
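
  Formatting a message in a given locale might look roughly like the
  sketch below, which switches LC_MESSAGES around the gettext() call
  and then restores it.  The function name is hypothetical, and the
  code assumes a GNU-style <libintl.h>.  If no translation is found,
  or the locale is unavailable, the original message comes back
  unchanged.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: looks up MSGID with LC_MESSAGES temporarily
   set to LOCALE, then restores the previous setting.  Falls back to
   MSGID itself on any failure. */
const char *
message_in_locale (const char *locale, const char *msgid)
{
  assert (locale != NULL && msgid != NULL);

  /* A later setlocale() call may overwrite this string, so copy it. */
  const char *current = setlocale (LC_MESSAGES, NULL);
  if (current == NULL)
    return msgid;
  char *saved = malloc (strlen (current) + 1);
  if (saved == NULL)
    return msgid;
  strcpy (saved, current);

  const char *text = msgid;
  if (setlocale (LC_MESSAGES, locale) != NULL)
    text = gettext (msgid);

  setlocale (LC_MESSAGES, saved);
  free (saved);
  return text;
}
```

  Message output could then call this once with the output locale
  and once with the interface locale.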

* The interface to the output subsystem (that is, primarily the
  functions in output.h and tab.h) should use multibyte strings,
  for two reasons.  First, strings passed to the tab_*()
  functions are often fed through gettext() along the way, so
  wide strings would be inconvenient.  Second, tables can get
  very large, so wide strings would be wasteful.

  (The ASCII driver might want to change its representation of
  the page to wide strings, though, because this would be an easy
  way for it to support Asian character sets.)  

* Each "struct variable" is split between multibyte and wide
  strings.  Variable names are used as part of syntax processing,
  so we will probably want to change "name" to a wide string.
  Missing values and the value part of each value label are case
  data, so they are opaque octets.  Variable labels and the
  label part of each value label are used primarily in output, so
  they should probably remain multibyte strings.

* Finally, what should we pass to setlocale()?  I think that we
  should select, with LC_ALL, the "output locale".  Rationale:

        - We don't need to be in the "syntax locale", because
          we're doing syntax processing as wide characters, which
          are locale independent.

          We may need to temporarily switch to the "syntax
          locale" to read from the syntax file; otherwise
          fgetwc() and other wide input functions won't do what
          we like.  Alternatively, we could read syntax files as
          binary octets and convert to wide characters with
          iconv().

        - We don't need to be in the "data locale", because most
          of the time we just deal with data as opaque octets.
          When occasionally we do need to deal with data as
          strings, we can switch locales temporarily.

        - We don't need to be in the "interface locale", because
          message output must call gettext() in both the output
          and interface locales, so it has to be able to switch
          locales in any case.

        - We do want to be in the "output locale".  Otherwise,
          we'll need to change to the output locale before we
          prepare output.
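
The temporary switch to the "data locale" mentioned in the rationale
could be sketched as follows, using an UPPER-style operation as the
example.  The function name is hypothetical, and the octet-at-a-time
loop is only correct for single-byte encodings; a multibyte data
locale would need mbrtowc() and towupper() or similar.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts the N octets at S to upper case in
   place according to DATA_LOCALE, switching LC_CTYPE only for the
   duration of the call.  Returns 1 on success, 0 if the locale
   could not be saved or selected (S is then left untouched). */
int
upper_in_locale (const char *data_locale, char *s, size_t n)
{
  assert (data_locale != NULL && (s != NULL || n == 0));

  /* A later setlocale() call may overwrite this string, so copy it. */
  const char *current = setlocale (LC_CTYPE, NULL);
  if (current == NULL)
    return 0;
  char *saved = malloc (strlen (current) + 1);
  if (saved == NULL)
    return 0;
  strcpy (saved, current);

  if (setlocale (LC_CTYPE, data_locale) == NULL)
    {
      free (saved);
      return 0;
    }

  for (size_t i = 0; i < n; i++)
    s[i] = (char) toupper ((unsigned char) s[i]);

  setlocale (LC_CTYPE, saved);
  free (saved);
  return 1;
}
```

Because setlocale() affects the whole process, this pattern assumes
that no other thread depends on the locale while the switch is in
effect.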

Comments?
-- 
Ben Pfaff 
email: address@hidden
web: http://benpfaff.org
