Interpreting the Character Code field.

pspp-dev

From:	John Darrington
Subject:	Interpreting the Character Code field.
Date:	Sat, 28 Mar 2009 12:20:46 +0900
User-agent:	Mutt/1.5.18 (2008-05-17)

We've talked about this before, but now we need to make some
decisions.

SPSS system files which contain string data come in various character
encodings.  PSPP needs to know a file's encoding in order to load and
display its strings correctly.

First some observations:

1.    Later versions of SPSS write a record (type 7; subtype 20) containing an
      IANA-recognised string which names the encoding.  This can be passed
      to iconv_open to create a converter for that encoding.

2.    Files written by earlier versions do not contain this record, but they
      do contain a 32-bit field in the Machine Integer Info record, which
      according to Appendix B of the pspp-dev guide is called the
      "character code".  The guide also says:

             "1 indicates EBCDIC,
              2 indicates 7-bit ASCII,
              3 indicates 8-bit ASCII,
              4 indicates DEC Kanji.
              Windows code page numbers are also valid."

      However, in my entire collection of .sav files gathered from various
      sources, only two have a character code other than 2, even though
      many of the files are clearly NOT 7-bit ASCII!
      Those two files have it set to 1252 and 65001, which I presume refer
      to windows-1252 and UTF-8 respectively.


3.    I did some experiments using iconv on my GNU/Linux system.  The
      code snippet: 

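  #include <iconv.h>
  #include <stdio.h>

  int
  main (void)
  {
    int i;
    for (i = 0; i < 65535; ++i)
      {
        char cp[32];
        char w[32];
        iconv_t cpi, wi;

        /* Try the number as a "CP%d" code page name.  */
        snprintf (cp, sizeof cp, "CP%d", i);
        cpi = iconv_open ("UTF-8", cp);

        /* Try the number as a "windows-%d" code page name.  */
        snprintf (w, sizeof w, "windows-%d", i);
        wi = iconv_open ("UTF-8", w);

        /* iconv_open returns (iconv_t) -1 on failure.  */
        if (wi != (iconv_t) -1 || cpi != (iconv_t) -1)
          printf ("%d : %s\t%s\n", i,
                  cpi == (iconv_t) -1 ? "(null)" : cp,
                  wi == (iconv_t) -1 ? "(null)" : w);

        /* Close the descriptors so that they do not accumulate.  */
        if (cpi != (iconv_t) -1)
          iconv_close (cpi);
        if (wi != (iconv_t) -1)
          iconv_close (wi);
      }
    return 0;
  }
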

    revealed that the integers 874, 936 and 1250 through 1258
    are recognised as encoding names of the form "windows-%d".
    A total of 136 non-contiguous integers (including the above set)
    in the range 273 to 16804 are recognised as encoding names of
    the form "CP%d".

4. iconv --list shows some other interesting information.

5. The web page http://demo.icu-project.org/icu-bin/convexp?s=WINDOWS
   gives some additional information which doesn't seem to conflict with 
   these other observations.


Based on these observations, I propose that when reading a system file, we use 
the following method.

      A.      If record 7(20) exists and the name it contains is accepted by
              iconv, use that name.

      B.      Otherwise use the "Character Code" as follows:

                1      maps to "EBCDIC-US"
                2, 3   are ignored; go to C
                4      maps to "MS_KANJI"
                65000  maps to "UTF-7"
                65001  maps to "UTF-8"

                Any other value is used in the form "CP%d".

      C.      If these methods fail, then the encoding used is that of the      
              current locale as returned by gnulib's locale_charset function.

      There'll also be a way to manually override the above at some future date.
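
      To make steps A to C concrete, here is a rough sketch in C.  It is
      not existing PSPP code: the function names name_from_character_code,
      iconv_accepts and choose_encoding are invented for illustration, and
      the "fallback" parameter merely stands in for whatever
      locale_charset would return.  It also assumes that step B counts as
      having failed when iconv does not accept the resulting name.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Step B: translate the "character code" into an encoding name.
   Writes the name into BUF and returns BUF, or returns NULL when the
   code is 2 or 3, which the proposal ignores (go to step C).  */
static const char *
name_from_character_code (int code, char *buf, size_t size)
{
  assert (buf != NULL && size > 0);
  switch (code)
    {
    case 2:
    case 3:
      return NULL;
    case 1:
      snprintf (buf, size, "EBCDIC-US");
      break;
    case 4:
      snprintf (buf, size, "MS_KANJI");
      break;
    case 65000:
      snprintf (buf, size, "UTF-7");
      break;
    case 65001:
      snprintf (buf, size, "UTF-8");
      break;
    default:
      snprintf (buf, size, "CP%d", code);
      break;
    }
  return buf;
}

/* Returns nonzero if iconv can open a converter from NAME to UTF-8.  */
static int
iconv_accepts (const char *name)
{
  iconv_t cd;

  if (name == NULL || strlen (name) == 0)
    return 0;
  cd = iconv_open ("UTF-8", name);
  if (cd == (iconv_t) -1)
    return 0;
  iconv_close (cd);
  return 1;
}

/* Steps A, B and C in order.  RECORD_NAME is the string taken from
   record 7(20), or NULL if the file has no such record.  FALLBACK is
   a hypothetical stand-in for the result of locale_charset.  */
static const char *
choose_encoding (const char *record_name, int code,
                 const char *fallback, char *buf, size_t size)
{
  const char *name;

  /* A: the record exists and iconv accepts it.  */
  if (iconv_accepts (record_name))
    return record_name;

  /* B: derive a name from the character code.  */
  name = name_from_character_code (code, buf, size);
  if (iconv_accepts (name))
    return name;

  /* C: fall back to the locale's encoding.  */
  return fallback;
}
```

      With this sketch, a file with no record 7(20) and a character code
      of 2 ends up with the fallback, whereas a character code of 1252
      yields "CP1252" provided the local iconv recognises that name.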



Some of this is rather arbitrary, but it seems to be a reasonable solution to a
rather messy problem.  Any comments?

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
