guile-user

Re: guile can't find a chinese named file


From: David Kastrup
Subject: Re: guile can't find a chinese named file
Date: Mon, 30 Jan 2017 19:32:14 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1.50 (gnu/linux)

Marko Rauhamaa <address@hidden> writes:

> David Kastrup <address@hidden>:
>
>> But at any rate, this cannot easily be fixed since Guile uses libraries
>> for encoding/decoding that cannot deal reproducibly with improper byte
>> patterns.
>
> Guile's mistake was to move to Unicode strings in the operating system
> interface.

Emacs uses a UTF-8-based encoding internally: basically, valid UTF-8 is
represented as itself, there are a number of code points beyond the
actual limit of UTF-8 that are used for non-Unicode character sets, and
single bytes not properly belonging to the read encoding are represented
with 0x00...0x7f, 0xc0 0x80 ... 0xc0 0xbf and 0xc1 0x80 ... 0xc1 0xbf (the
latter two ranges are "overlong" encodings of 0x00...0x7f and
consequently also not valid UTF-8).

The result is that random binary files read as "utf-8" grow by less than
50% in the internal representation (0x00-0x7f gets represented as
itself, and 0x80-0xff gets encoded with two bytes only when not being a
part of a valid utf-8 sequence).  The internal representation has
several guarantees for processing.  And when reencoding to utf-8 as
output encoding, the input gets reconstructed perfectly even when it
wasn't actually utf-8 to start with.
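
To make the round-trip property concrete, here is a minimal Python sketch
of the scheme described above.  It is not Emacs's actual code: the function
names are made up for illustration, validity checking is delegated to
Python's strict UTF-8 decoder, and the extra code points for non-Unicode
character sets are left out entirely.

```python
def _valid_seq_len(data: bytes, i: int) -> int:
    """Length of the valid multi-byte UTF-8 sequence at data[i], or 0."""
    b = data[i]
    if 0xC2 <= b <= 0xDF:
        k = 2
    elif 0xE0 <= b <= 0xEF:
        k = 3
    elif 0xF0 <= b <= 0xF4:
        k = 4
    else:
        return 0  # continuation byte or a byte that never leads valid UTF-8
    chunk = data[i:i + k]
    if len(chunk) < k:
        return 0  # truncated sequence
    try:
        chunk.decode("utf-8")  # strict: rejects overlong forms and surrogates
    except UnicodeDecodeError:
        return 0
    return k


def to_internal(data: bytes) -> bytes:
    """Keep valid UTF-8 as is; map stray bytes 0x80-0xff to 0xc0/0xc1 pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
            continue
        k = _valid_seq_len(data, i)
        if k:
            out += data[i:i + k]
            i += k
        else:
            # Overlong two-byte form: 0x80-0xbf -> 0xc0 xx, 0xc0-0xff -> 0xc1 xx
            out.append(0xC0 | ((b >> 6) & 1))
            out.append(0x80 | (b & 0x3F))
            i += 1
    return bytes(out)


def from_internal(buf: bytes) -> bytes:
    """Invert to_internal; assumes buf was produced by to_internal."""
    out = bytearray()
    i = 0
    while i < len(buf):
        b = buf[i]
        if b in (0xC0, 0xC1):  # these lead bytes never occur in valid UTF-8
            out.append(0x80 | ((b & 1) << 6) | (buf[i + 1] & 0x3F))
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

Since 0xc0 and 0xc1 never occur in valid UTF-8, the decoder can tell an
escaped stray byte from real text without any additional bookkeeping,
which is what makes the reconstruction exact.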

Emacs does not use "Unicode strings in the operating system interface"
but rather has a number of explicit encodings:

file-name-coding-system is a variable defined in ‘C source code’.
Its value is nil

Documentation:
Coding system for encoding file names.
If it is nil, ‘default-file-name-coding-system’ (which see) is used.

On MS-Windows, the value of this variable is largely ignored if
‘w32-unicode-filenames’ (which see) is non-nil.  Emacs on Windows
behaves as if file names were encoded in ‘utf-8’.
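
The nil fallback described in this docstring amounts to something like
the following Python sketch.  The function and parameter names simply
mirror the Emacs variables and are otherwise hypothetical; the "utf-8"
default is an assumption for illustration, not what Emacs necessarily
picks on a given system.

```python
def encode_file_name(name: str,
                     file_name_coding_system=None,
                     default_file_name_coding_system="utf-8") -> bytes:
    """Encode a file name explicitly before handing it to the OS.

    A value of None (nil) means "use the default coding system".
    """
    codec = file_name_coding_system or default_file_name_coding_system
    return name.encode(codec)
```

With these assumptions, encode_file_name("中文.txt") yields UTF-8 bytes,
while encode_file_name("中文.txt", "gb2312") yields the GB2312 bytes a
legacy Chinese locale would expect for the same name.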



Coding system for saving this buffer:
  U -- utf-8-emacs-unix (alias: emacs-internal)

Default coding system (for new files):
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for keyboard input:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for inter-client cut and paste:
  nil

Defaults for subprocess I/O:
  decoding: U -- utf-8-unix (alias: mule-utf-8-unix)
  encoding: U -- utf-8-unix (alias: mule-utf-8-unix)


Priority order for recognizing coding systems when reading files:
  1. utf-8 (alias: mule-utf-8)
  2. iso-2022-7bit 
  3. iso-latin-1 (alias: iso-8859-1 latin-1)
  4. iso-2022-7bit-lock (alias: iso-2022-int-1)
  5. iso-2022-8bit-ss2 
  6. emacs-mule 
  7. raw-text 
  8. iso-2022-jp (alias: junet)
  9. in-is13194-devanagari (alias: devanagari)
  10. chinese-iso-8bit (alias: cn-gb-2312 euc-china euc-cn cn-gb gb2312)
  11. utf-8-auto 
  12. utf-8-with-signature 
  13. utf-16 
  14. utf-16be-with-signature (alias: utf-16-be)
  15. utf-16le-with-signature (alias: utf-16-le)
  16. utf-16be 
  17. utf-16le 
  18. japanese-shift-jis (alias: shift_jis sjis)
  19. chinese-big5 (alias: big5 cn-big5 cp950)
  20. undecided 

  Other coding systems cannot be distinguished automatically
  from these, and therefore cannot be recognized automatically
  with the present coding system priorities.
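
As a much simplified illustration of such a priority list, the following
Python sketch tries candidate codecs in order and reports the first one
that decodes the data without error.  Emacs's real detection is more
involved (it also looks at escape sequences, signatures and so on); the
function name and the two-entry list are assumptions made for this
example only.

```python
# Hypothetical, shortened priority list; the ISO-2022 entries are omitted
# because recognizing them requires inspecting escape sequences.
PRIORITY = ["utf-8", "latin-1"]


def detect_coding_system(data: bytes, priority=PRIORITY):
    """Return the first codec in priority order that decodes data cleanly."""
    for name in priority:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None  # nothing in the list matched
```

Because latin-1 accepts every byte sequence, anything placed after it in
such a list can never be chosen, which is the same effect the note above
describes for coding systems that cannot be distinguished automatically.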

Particular coding systems specified for certain file names:

  OPERATION     TARGET PATTERN          CODING SYSTEM(s)
  ---------     --------------          ----------------
  File I/O      "\\.dz\\'"              (no-conversion . no-conversion)
                "\\.txz\\'"             (no-conversion . no-conversion)
                "\\.xz\\'"              (no-conversion . no-conversion)
                "\\.lzma\\'"            (no-conversion . no-conversion)
                "\\.lz\\'"              (no-conversion . no-conversion)
                "\\.g?z\\'"             (no-conversion . no-conversion)
                "\\.\\(?:tgz\\|svgz\\|sifz\\)\\'"
                                        (no-conversion . no-conversion)
                "\\.tbz2?\\'"           (no-conversion . no-conversion)
                "\\.bz2\\'"             (no-conversion . no-conversion)
                "\\.Z\\'"               (no-conversion . no-conversion)
                "\\.elc\\'"             utf-8-emacs
                "\\.el\\'"              prefer-utf-8
                "\\.utf\\(-8\\)?\\'"    utf-8
                "\\.xml\\'"             xml-find-file-coding-system
                "\\(\\`\\|/\\)loaddefs.el\\'"
                                        (raw-text . raw-text-unix)
                "\\.tar\\'"             (no-conversion . no-conversion)
                "\\.po[tx]?\\'\\|\\.po\\."
                                        po-find-file-coding-system
                "\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'"
                                        latexenc-find-file-coding-system
                ""                      (undecided)
  Process I/O   nothing specified
  Network I/O   nothing specified



So in short: this is a rather complex domain.  And Elisp, as a
text-manipulating platform, has a whole lot of tools and bells and
whistles to deal with it well enough that you usually won't even notice.

It took a number of years to arrive there, and the transition caused the
last large migration to XEmacs.

-- 
David Kastrup


