bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
From: Ruijie Yu
Subject: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Thu, 26 Oct 2023 19:43:54 +0800
Hello,
I have noticed that in GB18030 encoding, certain ranges of characters
have incorrect encodings.
One example is U+217A (SMALL ROMAN NUMERAL ELEVEN). The expected
encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
and verified with other programs such as iconv and MySQL), whereas
Emacs produces 81 36 C4 39, which is one position earlier in the
four-byte code space.
This behavior can be reproduced by the following recipe under both
GNU/Linux and Windows:
--8<---------------cut here---------------start------------->8---
$ emacs
C-x h DEL
C-x C-m f gb18030 RET
C-x 8 RET 217a RET
M-<
C-u C-x =
;; observe the "file code":
;; file code: #x81 #x36 #xC4 #x39 (encoded by coding system chinese-gb18030-dos)
--8<---------------cut here---------------end--------------->8---
In contrast, this is what I get on MySQL (which I have also verified
against the GB18030 standard):
--8<---------------cut here---------------start------------->8---
> CREATE TABLE gb (id INT, c TEXT CHARACTER SET GB18030);
> INSERT INTO gb VALUES (0, 'ⅺ');
> SELECT HEX(c) FROM gb;
+----------+
| hex(c) |
+----------+
| 8136C530 |
+----------+
--8<---------------cut here---------------end--------------->8---
Beyond this, I also noticed that U+A642 (CYRILLIC CAPITAL LETTER DZELO)
is encoded as 82 36 B9 36 in Emacs, whereas MySQL produces 82 36 BA 35.
Here the Emacs result is 9 positions earlier in the four-byte code space.
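For reference, the two offsets quoted above (1 and 9) can be checked with
a small Python sketch. It assumes the usual linear ordering of GB18030
four-byte sequences (first and third bytes in 81..FE, second and fourth
bytes in 30..39); the function names are my own and purely illustrative.
It only does arithmetic on the byte sequences quoted in this report and
does not call any codec.

```python
def gb18030_linear(seq: bytes) -> int:
    """Return the linear index of a four-byte GB18030 sequence.

    Assumes the standard four-byte layout: bytes 1 and 3 range over
    0x81..0xFE (126 values), bytes 2 and 4 over 0x30..0x39 (10 values).
    """
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def offset(expected_hex: str, observed_hex: str) -> int:
    """Distance (expected minus observed) between two four-byte sequences."""
    expected = gb18030_linear(bytes.fromhex(expected_hex))
    observed = gb18030_linear(bytes.fromhex(observed_hex))
    return expected - observed


# U+217A: standard/MySQL/iconv result vs. the Emacs "file code"
print(offset("8136C530", "8136C439"))
# U+A642: MySQL result vs. the Emacs "file code"
print(offset("8236BA35", "8236B936"))
```

A positive result means the Emacs byte sequence sits that many positions
before the expected one.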
Could someone with more expertise and time look into why there is a
mismatch between Emacs' GB18030 data and the standard?
[1]: https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=A1931A578FE14957104988029B0833D3
(200+MB PDF. Unfortunately, this is the only official source I can
find, and it requires a captcha.)
--
Best,
RY
In GNU Emacs 29.1 (build 2, x86_64-w64-mingw32) of 2023-08-02 built on AVALON
Windowing system distributor 'Microsoft Corp.', version 10.0.19045
System Description: Microsoft Windows 10 Enterprise (v10.0.2009.19045.3086)
Configured using:
'configure --with-modules --without-dbus --with-native-compilation=aot
--without-compress-install --with-tree-sitter CFLAGS=-O2'
Configured features:
ACL GIF GMP GNUTLS HARFBUZZ JPEG JSON LCMS2 LIBXML2 MODULES NATIVE_COMP
NOTIFY W32NOTIFY PDUMPER PNG RSVG SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS TREE_SITTER WEBP XPM ZLIB
(NATIVE_COMP present but libgccjit not available)
Important settings:
value of $LANG: CHS
locale-coding-system: cp936
Major mode: Lisp Interaction