Re: even more about character encoding names

From: Ben Pfaff
Subject: Re: even more about character encoding names
Date: Mon, 03 Jan 2011 10:45:12 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)

John Darrington <address@hidden> writes:

> On Sun, Jan 02, 2011 at 01:59:05PM -0800, Ben Pfaff wrote:
>      I had been under the impression, from previous discussions, that
>      the string in .sav file record 7, subtype 20
>      (e.g. "windows-1252") provided additional important information
>      on top of what was in the code page number in record 7 subtype 3
>      (e.g. 1252).
>      But now that I go through my trove of .sav files that I have
>      found around the Internet, I can only find one where this is the
>      case.  That one is one written by PSPP itself (version
>      0.7.4-g44daa4)!  In all the others, the encoding string just
>      repeats what we already know from the code page number.
>      Have you seen any .sav files where the character encoding name
>      provides more information than the codepage number?
> Based on discussions I've had with SPSS users it seems that the
> datum provided by 7.3 determines the encoding for strings in
> the dictionary (ie, Variable names, variable labels, value
> label keys AND value label values), whereas the string provided
> by 7.20 determines the encoding of string data in the file
> records.  At least this is what recent SPSS versions appear to
> do.
> Now I have never seen a system file generated by a recent SPSS
> where the two data did not correspond.  However, when I crafted
> such a file and asked my friendly SPSS user to run it, they
> reported inconsistencies in the way strings (especially value
> label keys) were displayed.

I think you've told me all of this before.  It's time to write it
down.  Here's what I have as an update to
system-file-format.texi.  Can you look it over and verify that it
looks accurate?  Also, if you have any system files locally that
have other codepage numbers not already mentioned, please let me
know which ones and I'll add them to the list.
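
To make sure we mean the same thing, here is a minimal Python sketch
of the rule as I understand it from your description.  It is only an
illustration, not PSPP code: the function name pick_encodings, the
CODE_PAGES table, and the "cp<number>" fallback for unlisted code
pages are all my own assumptions.  It takes the code page number from
record 7 subtype 3 and the optional name from record 7 subtype 20, and
returns one codec for the dictionary strings and one for the string
data.

```python
import codecs

# character_code values seen in real system files, mapped to Python
# codec names.  Illustrative table only; other code pages exist.
CODE_PAGES = {
    2: "ascii",
    1250: "cp1250",
    1252: "cp1252",
    28591: "latin-1",
    65001: "utf-8",
}


def _canonical(name):
    """Return Python's canonical codec name, or None if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def pick_encodings(character_code, encoding_name=None):
    """Return (dictionary_codec, data_codec) for a system file.

    character_code is the integer from record 7 subtype 3.
    encoding_name is the string from record 7 subtype 20, or None when
    that record is absent (older files).
    """
    # Assumption: unlisted values are Windows code page numbers.
    name = CODE_PAGES.get(character_code, "cp%d" % character_code)
    dict_codec = _canonical(name)

    # The encoding name, when present, governs the string data;
    # otherwise fall back to the code page number.
    if encoding_name is not None:
        data_codec = _canonical(encoding_name)
    else:
        data_codec = dict_codec
    return dict_codec, data_codec
```

For every file I have seen in the wild the two results agree, e.g.
pick_encodings(1252, "windows-1252") gives the same codec twice.  They
differ only for a deliberately crafted file such as
pick_encodings(1252, "UTF-8").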

--8<--------------------------cut here-------------------------->8--

From: Ben Pfaff <address@hidden>
Date: Mon, 3 Jan 2011 10:43:21 -0800
Subject: [PATCH] doc: Update description of character encoding information in system files.

Based on information provided by John Darrington and on system files
obtained freely from the Internet.
 doc/dev/system-file-format.texi |   66 +++++++++++++++++++++++++++++++++------
 1 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index 972b133..bf376b5 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -549,14 +549,46 @@ Compression code.  Always set to 1.
 Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
 @item int32 character_code;
-Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
-indicates 8-bit ASCII, 4 indicates DEC Kanji.
-Windows code page numbers are also valid.
-Experience has shown that in many files, this field is ignored or incorrect.
-For a more reliable indication of the file's character encoding
-see @ref{Character Encoding Record}.
+@anchor{character-code} Character code.  The following values have
+been actually observed in system files:
+@table @asis
+@item 2
+7-bit ASCII.
+@item 1250
+The @code{windows-1250} code page for Central European and Eastern
+European languages.
+@item 1252
+The @code{windows-1252} code page for Western European languages, a
+superset of ISO 8859-1.
+@item 28591
+ISO 8859-1.
+@item 65001
+UTF-8.
+@end table
+The following additional values are known to be defined:
+@table @asis
+@item 1
+EBCDIC.
+@item 3
+8-bit ``ASCII''.
+@item 4
+DEC Kanji.
+@end table
+Other Windows code page numbers are known to be generally valid.
+Old versions of SPSS always wrote value 2 in this field, regardless of
+the encoding in use.  Newer versions also write the character encoding
+as a string (see @ref{Character Encoding Record}).
 @end table
 @node Machine Floating-Point Info Record
@@ -959,8 +991,22 @@ The name of the character encoding.  Normally this will be an official IANA char
 See @url{}.
 @end table
-This record is not present in files generated by older software.
-See also @ref{character-code}.
+This record is not present in files generated by older software.  See
+also the @code{character_code} field in the machine integer info
+record (@pxref{character-code}).
+When the character encoding record and the machine integer info record
+are both present, all system files observed in practice indicate the
+same character encoding, e.g.@: 1252 as @code{character_code} and
+@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
+If, for testing purposes, a file is crafted with different
+@code{character_code} and @code{encoding}, it seems that
+@code{character_code} controls the encoding for all strings in the
+system file before the dictionary termination record, including
+strings in data (e.g.@: string missing values), and @code{encoding}
+controls the encoding for strings following the dictionary termination
+record.
 @node Long String Value Labels Record
 @section Long String Value Labels Record

Ben Pfaff