Re: even more about character encoding names
From: Ben Pfaff
Subject: Re: even more about character encoding names
Date: Sat, 05 Feb 2011 13:19:34 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
I finally pushed this change to "master", along with a few other
minor patches.
I deleted the clause that called windows-1252 a superset of
ISO-8859-1. Thanks for that comment.
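
For anyone wondering why "superset" was the wrong word, here is a quick
check, sketched in Python. It is not part of the patch, and it relies on
Python's own codec tables (an assumption on my part, nothing from PSPP)
to show where the two encodings differ in the 0x80..0x9F range.

```python
# Compare how windows-1252 and ISO 8859-1 decode the bytes 0x80..0x9F,
# using Python's built-in codecs (an assumption: other implementations
# may treat the unassigned windows-1252 bytes differently).
differing = []   # bytes that decode to different characters
undefined = []   # bytes that Python's windows-1252 codec cannot decode

for value in range(0x80, 0xA0):
    raw = bytes([value])
    latin1 = raw.decode("iso-8859-1")  # always succeeds: C1 control characters
    try:
        cp1252 = raw.decode("windows-1252")
    except UnicodeDecodeError:
        undefined.append(value)
        continue
    if cp1252 != latin1:
        differing.append(value)

# From 0xA0 upward the two encodings decode every byte identically.
same_upper = all(
    bytes([v]).decode("windows-1252") == bytes([v]).decode("iso-8859-1")
    for v in range(0xA0, 0x100)
)

print("differ:", [hex(v) for v in differing])
print("undefined in windows-1252:", [hex(v) for v in undefined])
print("identical from 0xA0 to 0xFF:", same_upper)
```

The two encodings agree on 0xA0..0xFF. In 0x80..0x9F, windows-1252
assigns printable characters where ISO 8859-1 has C1 controls, so
neither is a superset of the other.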
John Darrington <address@hidden> writes:
> This seems to cover everything.
>
> A purist might object to calling windows-1252 a "superset" of iso-8859-1 ...
> they are just two different encodings, which happen to have large parts of
> their mappings identical.
>
> J'
>
> On Mon, Jan 03, 2011 at 10:45:12AM -0800, Ben Pfaff wrote:
>
> I think you've told me all of this before. It's time to write it
> down. Here's what I have as an update to
> system-file-format.texi. Can you look it over and verify that it
> looks accurate? Also, if you have any system files locally that
> have other codepage numbers not already mentioned, please let me
> know which ones and I'll add them to the list.
>
> --8<--------------------------cut here-------------------------->8--
>
> From: Ben Pfaff <address@hidden>
> Date: Mon, 3 Jan 2011 10:43:21 -0800
> Subject: [PATCH] doc: Update description of character encoding
> information in system files.
>
> Based on information provided by John Darrington and on system files
> obtained freely from the Internet.
> ---
> doc/dev/system-file-format.texi | 66 +++++++++++++++++++++++++++++++++------
> 1 files changed, 56 insertions(+), 10 deletions(-)
>
> diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
> index 972b133..bf376b5 100644
> --- a/doc/dev/system-file-format.texi
> +++ b/doc/dev/system-file-format.texi
> @@ -549,14 +549,46 @@ Compression code. Always set to 1.
> Machine endianness. 1 indicates big-endian, 2 indicates little-endian.
>
> @item int32 character_code;
> -@anchor{character-code}
> -Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
> -indicates 8-bit ASCII, 4 indicates DEC Kanji.
> -Windows code page numbers are also valid.
> -
> -Experience has shown that in many files, this field is ignored or incorrect.
> -For a more reliable indication of the file's character encoding
> -see @ref{Character Encoding Record}.
> +@anchor{character-code} Character code. The following values have
> +been actually observed in system files:
> +
> +@table @asis
> +@item 2
> +7-bit ASCII.
> +
> +@item 1250
> +The @code{windows-1250} code page for Central European and Eastern
> +European languages.
> +
> +@item 1252
> +The @code{windows-1252} code page for Western European languages, a
> +superset of ISO 8859-1.
> +
> +@item 28591
> +ISO 8859-1.
> +
> +@item 65001
> +UTF-8.
> +@end table
> +
> +The following additional values are known to be defined:
> +
> +@table @asis
> +@item 1
> +EBCDIC.
> +
> +@item 3
> +8-bit ``ASCII''.
> +
> +@item 4
> +DEC Kanji.
> +@end table
> +
> +Other Windows code page numbers are known to be generally valid.
> +
> +Old versions of SPSS always wrote value 2 in this field, regardless of
> +the encoding in use. Newer versions also write the character encoding
> +as a string (see @ref{Character Encoding Record}).
> @end table
>
> @node Machine Floating-Point Info Record
> @@ -959,8 +991,22 @@ The name of the character encoding. Normally this will be an official IANA char
> See @url{http://www.iana.org/assignments/character-sets}.
> @end table
>
> -This record is not present in files generated by older software.
> -See also @ref{character-code}.
> +This record is not present in files generated by older software. See
> +also the @code{character_code} field in the machine integer info
> +record (@pxref{character-code}).
> +
> +When the character encoding record and the machine integer info record
> +are both present, all system files observed in practice indicate the
> +same character encoding, e.g.@: 1252 as @code{character_code} and
> +@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
> +
> +If, for testing purposes, a file is crafted with different
> +@code{character_code} and @code{encoding}, it seems that
> +@code{character_code} controls the encoding for all strings in the
> +system file before the dictionary termination record, including
> +strings in data (e.g.@: string missing values), and @code{encoding}
> +controls the encoding for strings following the dictionary termination
> +record.
>
> @node Long String Value Labels Record
> @section Long String Value Labels Record
> --
> 1.7.1
>
>
> --
> Ben Pfaff
> http://benpfaff.org
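
As a rough illustration of the character_code table in the quoted patch,
the lookup below shows how a reader might turn the field into a label.
It is only a sketch in Python: the names CHARACTER_CODE_NAMES,
describe_character_code, and pick_encoding are hypothetical, and the
policy of preferring the character encoding record when it is present is
an assumption, not a description of what PSPP or SPSS does.

```python
# Values taken from the tables in the quoted patch; the dictionary name
# and both helper functions are hypothetical, for illustration only.
CHARACTER_CODE_NAMES = {
    1: "EBCDIC",
    2: "7-bit ASCII",
    3: '8-bit "ASCII"',
    4: "DEC Kanji",
    1250: "windows-1250",
    1252: "windows-1252",
    28591: "ISO-8859-1",
    65001: "UTF-8",
}


def describe_character_code(code):
    """Return a label for a character_code value from the machine integer info record."""
    if code in CHARACTER_CODE_NAMES:
        return CHARACTER_CODE_NAMES[code]
    # The patch only says other Windows code page numbers are generally
    # valid, so anything else is reported without guessing a name.
    return "code page %d (not in the patch's list)" % code


def pick_encoding(character_code, encoding_record=None):
    """Choose a label, preferring the character encoding record (assumed policy).

    Old SPSS versions wrote 2 in character_code regardless of the real
    encoding, so the string record is treated as more trustworthy here.
    """
    if encoding_record is not None:
        return encoding_record
    return describe_character_code(character_code)


print(pick_encoding(1252))          # no encoding record available
print(pick_encoding(2, "UTF-8"))    # record present, character_code ignored
print(describe_character_code(437))
```

Note that this ignores the crafted-file case described in the patch,
where character_code and encoding disagree and each seems to govern a
different part of the file.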
--
Ben Pfaff
http://benpfaff.org