From: Ben Pfaff
Subject: Re: i18n
Date: Sun, 19 Mar 2006 17:26:47 -0800
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> On Sat, Mar 18, 2006 at 05:38:17PM -0800, Ben Pfaff wrote:
>      
>      >    2b might be achieved by heuristics, using a library such as unac
>      >    http://home.gna.org/unac/unac.en.html or if all else fails, replace
>      >    unknown byte sequences by "...."
>      
>      I assumed that we'd just use the iconv library (which is
>      standardized) to convert between character encodings.
>      
>      I don't know about the unac library.  What are its advantages
>      over iconv?
>
> Iconv is only useful if we know the source encoding. If we don't know
> it we have to guess.  If we guess it wrong, then iconv will fail.
> Also, it won't convert between encodings where data would be lost.
> Unac, on the other hand, is (more) robust but lossy.  For example,
> given character 0xe1 (a with acute accent) in iso-8859-1, it'll
> convert to 'a' in ascii.  I don't know how it would handle
> converting from Japanese characters to ascii ....

I do not understand how unac could remove accents from text
without knowing the source encoding.  I don't see any indication
that it can do so, now that I have read the unac manpage from the
webpage you pointed out.  In fact, the first argument to the
unac_string() function is the name of the source encoding, and
unac is documented to use iconv internally to convert to UTF-16.

(Why would we want to remove accents, by the way?)
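
For what it's worth, iconv can be made lossy-but-robust in roughly
the way you describe, at least with the GNU implementation's
//TRANSLIT extension.  A minimal sketch, not PSPP code (the
recode() helper is mine):

    #include <iconv.h>
    #include <string.h>

    /* Hypothetical helper: convert SRC from encoding FROM to
       encoding TO, storing a null-terminated result in the
       DST_SIZE-byte buffer DST.  Returns 0 on success, -1 on
       failure (unknown encoding, invalid input, or overflow). */
    static int
    recode (const char *from, const char *to,
            const char *src, char *dst, size_t dst_size)
    {
      iconv_t cd = iconv_open (to, from);
      if (cd == (iconv_t) -1)
        return -1;

      char *inp = (char *) src;     /* iconv wants non-const. */
      size_t in_left = strlen (src);
      char *outp = dst;
      size_t out_left = dst_size - 1;
      size_t n = iconv (cd, &inp, &in_left, &outp, &out_left);
      iconv_close (cd);
      if (n == (size_t) -1)
        return -1;
      *outp = '\0';
      return 0;
    }

With GNU iconv, recode ("ISO-8859-1", "ASCII//TRANSLIT", ...)
turns your 0xe1 example into a plain 'a', much like unac; but
//TRANSLIT is a GNU extension, so we couldn't count on it being
available in every iconv out there.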

>      Of course it's sensible to keep everything in a common encoding
>      (at least within a dictionary).  But I don't think it's a good
>      idea to insist that this encoding be UTF-8 (or any other specific
>      encoding).  Instead, I would suggest that we use the local
>      encoding (the one from LC_CTYPE or from SET LOCALE) and convert
>      everything else we encounter into that.
>
> It's just that utf-8 can encompass just about every other encoding.
> If we try to convert from, say, Korean script into the local
> encoding (say, ascii), then we're not going to do a very good job.

That is surely true.  

>      >    Whilst that's feasible, casefiles cannot possibly (in the
>      >    current system) have this invariant, because the system files which
>      >    implement them may not in fact be utf8 and converting a casefile
>      >    doesn't scale.
>      
>      You mean, to convert all the string data in a casefile to a
>      common encoding?  I think that's a bad idea for other reasons
>      too.  First, we don't know that all the string data in the
>      casefile is actually alphanumeric.  It could just be binary bits;
>      SPSS provides expression operators that can extract and pack
>      data from strings, even though they're not all that convenient.
>      Second, conversions between encodings can lengthen or shorten
>      them, whereas string variables are fixed length.
>
> So we agree then that casefile data must not be meddled with.
> However, this also means that both a) the keys in Value Labels and
> b) the Missing Values must also be left verbatim.  Otherwise, they'll
> no longer match.  And this has a rather unfortunate consequence that
> the dictionary cannot be guaranteed to have a consistent encoding.
> Hence my suggestion of a per-variable encoding attribute.

This sounds like a mess.  Any reference to more than one string
variable will have to deal with encoding translation.  The most
obvious place where this happens is in string expressions;
consider the CONCAT function especially.  I'm sure we'll get
confused when we have to fix up code all over the tree to handle
that, and I bet that our users will get even more confused.
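
To illustrate the burden, here is a purely hypothetical sketch of
what even a two-operand CONCAT would have to do if each variable
carried its own encoding (none of these names exist in PSPP):

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical model of a per-variable encoding attribute. */
    struct variable { const char *name; const char *encoding; };

    /* Stand-in for an iconv-based converter that returns a
       malloc'd copy of S converted from FROM to TO. */
    extern char *recode_string (const char *from, const char *to,
                                const char *s);

    /* Concatenate the values of A and B into OUT: every operand
       must first be re-encoded into DST's encoding before the
       bytes can be joined. */
    static void
    concat (const struct variable *dst,
            const struct variable *a, const char *a_val,
            const struct variable *b, const char *b_val,
            char *out, size_t out_size)
    {
      char *ac = recode_string (a->encoding, dst->encoding, a_val);
      char *bc = recode_string (b->encoding, dst->encoding, b_val);
      snprintf (out, out_size, "%s%s", ac, bc);
      free (ac);
      free (bc);
    }

And that is before worrying about conversion failures or results
that no longer fit the fixed variable width.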

>      >    An alternative, would be to decide that it is the responsibility of
>      >    the user interface and output subsystem to convert to utf8.  In
>      >    which case, both these entities need to know the encoding of the
>      >    data they receive.  Since (as in the case of MATCH FILES)
>      >    variables can come from different system sources, each variable
>      >    within a dictionary may have a different encoding.   Thus it may be
>      >    desirable to add an encoding property to struct variable.
>      
>      I think that I disagree (but I may not quite understand what
>      you're saying).  I would think that the encoding would be a
>      property of the dictionary.  When we do something like MATCH
>      FILES that reads from multiple sources, we convert from the
>      encoding used by each source dictionary to the one used by the
>      target dictionary.  We'd assume that the source dictionaries and
>      the target dictionary are in the local encoding unless told
>      otherwise.
>      
>      As for converting the case data in string variables in the
>      various source files to a common encoding, I doubt we'd want to
>      try to do that automatically because there's no way to tell that
>      they even have character data in them.  
>
> Is it not the case that all variables with (Aw) format are intended to
> contain character data?  I thought that bit patterns, blobs and the
> like were supposed to use (AHEXw).

I've always thought that the display format only indicated how
data should be displayed, not what it actually contained; if you
never actually displayed what was in a variable, the display
format did nothing at all.  My initial reaction is that it seems
"rude" to use the display format for anything but displaying
variable content.

On the other hand, I bet that in normal use it's very rare to
have anything but character data in a string variable.  We could
probably make that assumption safely, as long as we documented
it and explained how to avoid getting binary data converted in
the exceptional case.
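
In code, that assumption might boil down to a predicate like this
sketch (the names imitate PSPP's formats but are hypothetical as
written here):

    #include <stdbool.h>

    /* Hypothetical: string display formats, with AHEX
       conventionally used for raw binary data. */
    enum fmt_type { FMT_A, FMT_AHEX /* ... others elided ... */ };
    struct fmt_spec { enum fmt_type type; int w; };

    /* Treat a string variable's contents as character data
       unless its display format is AHEX. */
    static bool
    fmt_is_character (const struct fmt_spec *f)
    {
      return f->type != FMT_AHEX;
    }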

>      Instead, I'd suggest
>      adding some way to convert character data in the active file from
>      one encoding to another.  (I can think of several possible
>      syntaxes: a new feature for RECODE, or a function for use with
>      COMPUTE, or adding a new command altogether.)
>
> So you're suggesting only to convert if explicitly requested by the
> user.

Yes.
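
For example, with syntax invented purely for illustration:

    COMPUTE name = CONVERT(name, 'ISO-8859-1', 'UTF-8').  /* hypothetical */

Anything along those lines would do, as long as conversion happens
only on explicit request.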

>      As for the UI, I guess we'd want to convert from dictionary
>      encoding to display encoding at the UI boundary.
>
> I'm thinking that too.
>      
>      > 4. However, when writing a system file, it would be sensible to
>      >    convert all variables to a common encoding first.
>      
>      The way I have been thinking about it, this would simply be a
>      consequence of having just one encoding within a dictionary.
>
> If we can have the dictionary in one common encoding (and for reasons
> above, I'm not sure that we can), this is fine.  But that still leaves
> the case data.  I think it'll open up a real can of worms to have a
> variable whose name and value labels are in one encoding, but the data
> corresponding to that variable in another.

Definitely.

> Consider the scenario where I'm conducting a global survey.  I
> have representatives in various parts of the world who collate
> information in their region and then each send me a system file.  The
> system files I receive have identical variables, but come in encodings
> appropriate to that locale (they might include personal names which
> cannot be written in ascii).  I then want to combine all these system
> files into one big system file before analysing it.  The only way I
> can do this without data loss is to use a universal encoding (such as
> utf-8). 
>
>
> In summary I think the logic of my argument goes like this:
>
> 1.  Case Data must not be changed (unless explicitly requested by the
>     user).
>
> 2.  Missing Value and Value Label keys must have the same encoding as
>     the data to which they refer.

Agreed.  Let me add a clarification:

1a. Case Data must not be changed, unless explicitly requested by the
    user.  We should make it easy for the user to make such a
    request.

> 3.  1 ^ 2 --> Missing Values and Value Label keys must never change
>     encodings. 

I would draw a different conclusion:

3a. 1 ^ 2 --> Missing values, value label keys, and case data
    must be re-encoded at the same time.

> 4.  Casefiles from different sources may come with arbitrary and
>     distinct encodings and may need to be combined into a common
>     casefile.  Further, every casefile has a corresponding dictionary.

Agreed.

> 5.  1 ^ 2 ^ 4 --> Missing Value and Value Label keys in the same
>     dictionary must in general be of different encodings.

Again, I would draw different conclusions:

5a. It is undesirable to mix encodings within PSPP, because that
    is hard on developers and users.

6a. 3a ^ 5a --> Casefiles within PSPP are all in a single locale,
    in particular the one established by the system locale or SET
    LOCALE.  When data is read into PSPP from an external source,
    it is converted to the common locale; when data is written by
    PSPP to some external source, it may be converted to an
    alternate locale.

Let me elaborate.  Here is the plan that I envision:

i. PSPP adopts a single locale that defaults to the system locale
   but can be changed with SET LOCALE.  (I'll call this the "PSPP
   locale".)

ii. All string data in all casefiles and dictionaries is in the
    PSPP locale, or at least we make that assumption.

iii. The GET command assumes by default that data read in is in
     the PSPP locale.  If the user provides a LOCALE subcommand
     specifying something different, then missing values and
     value label keys are converted as the dictionary is read and
     string case data is converted "on the fly" as data is read
     from the file.  We can also provide a NOCONVERT subcommand
     (with a better name, I hope) that flags string variables
     that are not to be converted.

iv. The SAVE command assumes by default that data written out is
    to be in the PSPP locale.  If the user provides a LOCALE
    subcommand specifying something different, then we convert
    string data, etc., as we write it, and again exceptions can
    be accommodated.  (See the example after this list.)

v. Users who want accurate translations, as in your survey
   example, choose a reasonable PSPP locale, e.g. something based
   on UTF-8.

vi. We look into the possibility of tagging system files with a
    locale.  The system file format is extensible enough that
    this would really just be a matter of testing whether SPSS
    will complain loudly about our extension records or just
    silently ignore them.
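
To make iii and iv concrete, with provisional subcommand spellings
invented for illustration:

    GET FILE='tokyo.sav' /LOCALE='SHIFT_JIS'.
    SAVE OUTFILE='combined.sav' /LOCALE='UTF-8'.

The GET converts the file's strings into the PSPP locale as it
reads them; the SAVE writes the combined data back out in UTF-8,
as in your survey example.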
-- 
"The road to hell is paved with convenient shortcuts."
--Peter da Silva



