Re: i18n proposal

pspp-dev

From:	Ben Pfaff
Subject:	Re: i18n proposal
Date:	Sun, 18 Jun 2006 19:09:15 -0700
User-agent:	Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> On Sun, Jun 18, 2006 at 02:50:37PM -0700, Ben Pfaff wrote:

>      * String data that occurs in cases is primarily treated as opaque
>        octets.  Even procedures like SORT CASES that could easily do
>        better (by using language-specific collation rules via, e.g.,
>        wcscoll()) are documented to use bytewise comparison.
>
> It's probably documented that way because it's easier to implement.
> It makes sense to me that SORT CASES should use the collation of the
> "data locale".  Let's at least look into the implications of doing so,
> and perhaps offer it under "enhanced" mode.
> A German wanting to, say, select all cities from 'N' to 'Z' might be
> very annoyed to find that pspp omitted 'Öhringen' (where they had the
> World Cup match last week).

I have thought a little about that.  I have a few ideas.
First, I don't think changing the default behavior is a good
idea, because it seems like it could be a surprising change.  But
I can think of a few other options:

        * Add a COLLATE keyword to SORT CASES that tells it to
          use proper locale-specific collation rules.

        * Add a COLLATE('a','b') function to the expression
          syntax and extend SORT CASES to allow an arbitrary
          expression to be used.

        * Add an XFRM('string') function to the expression
          syntax, then document that you can sort based on
          locale-specific rules using
                COMPUTE collate=XFRM(string).
                SORT CASES BY collate.
          (XFRM would be implemented via strxfrm().)

The last of those is kind of nice since you don't actually have
to change the sort algorithm at all.
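
A minimal sketch of the strxfrm() idea in C, not PSPP code: the helper
names xfrm_key() and compare_by_key() are hypothetical.  It shows why the
sort algorithm can stay bytewise: comparing the transformed keys with
strcmp() gives the same ordering that strcoll() gives on the original
strings in the current LC_COLLATE locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated sort key for S under
   the current LC_COLLATE locale, or NULL if allocation fails.
   The caller frees the result. */
char *
xfrm_key (const char *s)
{
  assert (s != NULL);

  /* With a size of 0, strxfrm() only reports the key length
     (excluding the terminating null byte). */
  size_t needed = strxfrm (NULL, s, 0);
  char *key = malloc (needed + 1);
  if (key != NULL)
    strxfrm (key, s, needed + 1);
  return key;
}

/* Hypothetical helper: orders A and B by plain bytewise comparison of
   their keys.  Returns <0, 0, or >0 like strcmp(); returns 0 if a key
   could not be allocated (a real caller would report the error). */
int
compare_by_key (const char *a, const char *b)
{
  char *ka = xfrm_key (a);
  char *kb = xfrm_key (b);
  int result = (ka != NULL && kb != NULL) ? strcmp (ka, kb) : 0;
  free (ka);
  free (kb);
  return result;
}
```

A caller would select the data locale first, for example with
setlocale (LC_COLLATE, ""), and compute each key once per case rather
than once per comparison.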

>      * The interface to the output subsystem (that is, primarily the
>        functions in output.h and tab.h) should use multibyte strings,
>        for these reasons.  First, strings passed to the tab_*()
>        functions are often fed through gettext() along the way, so
>        wide strings would be inconvenient.  Second, tables can get
>        very large, so wide strings would be wasteful.
>      
>        (The ASCII driver might want to change its representation of
>        the page to wide strings, though, because this would be an easy
>        way for it to support Asian character sets.)  
>
> Reading from the Unicode website, there are texts which suggest that
> this would not be the case.  Apparently, even in "monospace fonts"
> in the general case, the number of characters is not necessarily
> proportional to the width required to render them.  The advice there
> is to use multi-byte representation for all input/output operations.

Are you talking about Unicode Standard Annex #11 (East Asian
Width)?  I'm aware of the need to deal with single- and
double-width characters.  It would not be too hard to do, seeing
as the wcwidth() function will tell you the width of a character.
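
As an illustration of the wcwidth() point, here is a small sketch of
summing column widths over a wide string.  The function name
display_width() is an assumption for this example; POSIX also provides
wcswidth(), which does much the same for a bounded number of
characters.  Results depend on the current LC_CTYPE locale.

```c
#define _XOPEN_SOURCE 700       /* Exposes wcwidth() on glibc. */
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: returns the number of terminal columns needed
   to display S, or -1 if S contains a character that is not printable
   in the current LC_CTYPE locale. */
int
display_width (const wchar_t *s)
{
  assert (s != NULL);

  int total = 0;
  for (; *s != L'\0'; s++)
    {
      int w = wcwidth (*s);     /* 0, 1, or 2 columns; -1 if unprintable. */
      if (w < 0)
        return -1;
      total += w;
    }
  return total;
}
```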

I don't think that multi-byte representation would work well for
the ASCII driver's internal representation, because it's
difficult to index a multibyte string based on the number of
(single-)character widths from the left margin, which the ASCII
driver does all the time.

Of course, the output format of the ASCII output driver should be
multibyte characters.
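
A rough sketch of that split, with wide characters inside and multibyte
output at the edge.  Everything here (struct text_line, LINE_WIDTH, the
function names) is hypothetical and not taken from the PSPP source, and
it assumes one character per cell; double-width handling is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

enum { LINE_WIDTH = 16 };       /* Hypothetical page width in cells. */

/* One output line stored as wide characters, so that cell N is simply
   cells[N]. */
struct text_line
  {
    wchar_t cells[LINE_WIDTH + 1];
  };

/* Fills LINE with spaces. */
void
line_init (struct text_line *line)
{
  for (int i = 0; i < LINE_WIDTH; i++)
    line->cells[i] = L' ';
  line->cells[LINE_WIDTH] = L'\0';
}

/* Writes TEXT into LINE starting at column COL, clipping at the right
   margin. */
void
line_put (struct text_line *line, int col, const wchar_t *text)
{
  assert (col >= 0);
  for (int i = 0; text[i] != L'\0' && col + i < LINE_WIDTH; i++)
    line->cells[col + i] = text[i];
}

/* Converts LINE to a newly allocated multibyte string in the current
   LC_CTYPE encoding.  Returns NULL on allocation or conversion
   failure.  The caller frees the result. */
char *
line_render (const struct text_line *line)
{
  size_t max = LINE_WIDTH * MB_CUR_MAX + 1;
  char *out = malloc (max);
  if (out != NULL && wcstombs (out, line->cells, max) == (size_t) -1)
    {
      free (out);
      out = NULL;
    }
  return out;
}
```

Indexing by column is a plain array access here, which is the
operation that is awkward on a multibyte buffer.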

> Incidentally, if the ASCII driver is going to support other character
> sets, then it might want to be changed to a more appropriate name.

Yes, "text" or "plain text" is what I have in mind.

>      * Each "struct variable" is split between multibyte and wide
>        strings.  Variable names are used as part of syntax processing,
>        so we will probably want to change "name" to a wide string.
>
> But the short_name has to remain as it is, I think.

Yes.

>      * Finally, what should we pass to setlocale()?  I think that we
>        should select, with LC_ALL, the "output locale".  
>
> Like you say, there's going to be a lot of locale switching going on,
> and with that comes potential for mistakes; mistakes that might
> easily go unnoticed.  I suggest that we avoid direct calls to
> setlocale, and implement some wrappers.

Yes, but I want to keep locale switching to as much of a minimum
as we can.  I suspect that on some systems it actually causes
libc to go out and read a locale file.

On systems that have newlocale()/uselocale()/freelocale(), we
should use those.
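
One possible shape for such a wrapper, sketched with the POSIX 2008
per-thread locale functions.  The name collate_in_locale() and its
interface are assumptions for illustration only.  It switches only the
calling thread's LC_COLLATE, leaves the global locale alone, and
restores the previous locale before returning.

```c
#define _POSIX_C_SOURCE 200809L /* Exposes newlocale() and friends. */
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical wrapper: compares A and B under the collation rules of
   the locale called NAME.  On success, stores the strcoll() result in
   *RESULT and returns 0.  Returns -1 if the locale cannot be created. */
int
collate_in_locale (const char *name, const char *a, const char *b,
                   int *result)
{
  assert (name != NULL && a != NULL && b != NULL && result != NULL);

  locale_t loc = newlocale (LC_COLLATE_MASK, name, (locale_t) 0);
  if (loc == (locale_t) 0)
    return -1;

  locale_t old = uselocale (loc);       /* Affects this thread only. */
  *result = strcoll (a, b);
  uselocale (old);                      /* Restore before freeing. */
  freelocale (loc);
  return 0;
}
```

Since creating the locale object may be the expensive step, a real
wrapper would probably create it once and reuse it instead of calling
newlocale() on every comparison.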

> I've been wondering why pspp currently sets the LC_MONETARY category.
[...]

I don't recall.  Probably, it seemed harmless, so I chose to set it.

> Another option would be to preset the CCA format based upon the lconv
> struct, and leave the DOLLAR format as is.  But this would mean that
> DOLLAR is an unmitigated nuisance in countries with a non-dollar
> currency.  I wonder what SPSS in a European locale does?

Good questions.
-- 
"In the PARTIES partition there is a small section called the BEER.
 Prior to turning control over to the PARTIES partition,
 the BIOS must measure the BEER area into PCR[5]."
--TCPA PC Specific Implementation Specification
