bug#20623: XML and HTML files with encoding/charset="utf-8" declaration
From: Simon Ledergerber
Subject: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Thu, 21 May 2015 20:50:58 +0200
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
Hi
When I was editing XHTML and HTML files, I wanted to make sure the BOM
was written out to the file in order to make it easier for the browser
to detect the UTF-8 encoding. Therefore I changed the coding system for
the file buffer to utf-8-with-signature-dos (since I am working on a
Windows system) before saving the file.
After some time I was surprised that the browser (IE11) didn't
report UTF-8 as the file's encoding. Having checked the hexdump of my
(X)HTML file, I saw the BOM was definitely missing.
Apparently, when a "UTF-8" string appears in the <meta charset="utf-8">
(even if commented out, see below) or <?xml version="1.0"
encoding="utf-8"?> declaration, Emacs switches the file coding system to
utf-8 when it saves the file, even if utf-8-with-signature was
specified explicitly before. This looks like a bug to me, because there
is then no way to restore the BOM using Emacs.
I was not sure whether my bug is related to bug #8282, so I decided to
report it (again).
My Emacs version is: 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10 on
Windows 8.1 x64.
I am running Emacs in text mode only, inside a Cygwin console.
This is my .emacs.d/init.el:
(line-number-mode)
(column-number-mode)
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)
With XML, the problem can be reproduced in the most basic way with the
following steps:
- Create a new file with C-x C-f in the current directory. Name it
test.txt for example.
- Switch to fundamental mode with M-x fundamental-mode.
- Type the text '<?xml version="1.0"' (without the surrounding single
quotes).
- Switch the coding system to one that includes the BOM: C-x RET f
utf-8-with-signature-dos.
- Verify the current coding system with C-h Shift-c RET: Yes, the
coding system for the file buffer is as specified before.
- Type C-x k to kill the help buffer if necessary and save the file with
C-x C-s.
- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax
-t xCaz test.txt' will also do it: The UTF-8 BOM 'EF BB BF' was written
at the beginning of the file.
- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'
- Now save the file and check again: The coding system for the buffer
has changed to utf-8-dos and the BOM has disappeared from the file!
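For anyone repeating the hex-dump check in the steps above without od, the same test can be sketched in a few lines of Python. This is only an illustration: the function name describe_encoding_markers is made up for this example, and it looks at nothing but the raw bytes (the three BOM bytes EF BB BF at the start, and whether any CRLF line ending is present).

```python
import codecs
from pathlib import Path


def describe_encoding_markers(path):
    """Return (has_bom, has_crlf) for the file at `path`.

    Hypothetical helper for illustration; it inspects raw bytes only
    and does not try to decode the file.
    """
    data = Path(path).read_bytes()
    # codecs.BOM_UTF8 is the byte string b'\xef\xbb\xbf'.
    has_bom = data.startswith(codecs.BOM_UTF8)
    # DOS-style line endings show up as the byte pair CR LF.
    has_crlf = b"\r\n" in data
    return has_bom, has_crlf
```

Calling describe_encoding_markers("test.txt") after each save should show the first element flip from True to False once the encoding="utf-8" part of the declaration has been completed, if the behaviour described above is reproduced.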
Now the steps for HTML:
- Create a new file test1.txt in the current directory.
- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>
<html>
<head>
<title>Test</title>
</head>
<body>
</body>
</html>
- Change the coding system to utf-8-with-signature-dos and save the file.
- Verify that the coding system for the buffer is correct and the BOM is
really written: Yes, it is.
- Insert the following *comment* between <head> and <title>: <!-- <meta
charset="utf-8"> -->
- Save the file and verify: The coding system has changed to utf-8-dos
and the BOM has vanished, even though it is just a comment and has no effect!
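To illustrate why a commented-out declaration could still be picked up, here is a small Python sketch of a purely textual scan. It is NOT taken from Emacs; the pattern DECL and the function declared_charset are assumptions made up for this example. It only shows that a scan which looks for charset="..." or encoding="..." without parsing the markup cannot tell a comment from a live declaration.

```python
import re

# Hypothetical pattern (an assumption, not Emacs's actual code): find
# charset=... or encoding=... anywhere in the text, with optional quotes.
DECL = re.compile(r'(?:charset|encoding)\s*=\s*["\']?([A-Za-z0-9_-]+)',
                  re.IGNORECASE)


def declared_charset(text):
    """Return the first declared charset name in lower case, or None."""
    match = DECL.search(text)
    return match.group(1).lower() if match else None
```

With this sketch, declared_charset('<!-- <meta charset="utf-8"> -->') returns 'utf-8' just as it does for the uncommented tag, while the incomplete declaration '<?xml version="1.0"' yields None.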
Regards
Simon
P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
of 2015-04-10 on desktop-new
Configured using:
`configure
--srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
--prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
--docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
--with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration
-fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
-fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
CPPFLAGS= LDFLAGS='
Important settings:
value of $LANG: en_US.UTF-8
locale-coding-system: utf-8-unix
Major mode: Help
Minor modes in effect:
tooltip-mode: t
electric-indent-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
buffer-read-only: t
column-number-mode: t
line-number-mode: t
transient-mark-mode: t
Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish
Load-path shadows:
None found.
Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)
Memory information:
((conses 16 81797 4691)
(symbols 48 17091 0)
(miscs 40 73 387)
(strings 32 11233 4887)
(string-bytes 1 291872)
(vectors 16 7587)
(vector-slots 8 342125 27930)
(floats 8 57 393)
(intervals 56 834 26)
(buffers 960 21))