bug-gnu-emacs
bug#50391: 28.0.50; json-read non-ascii data results in malformed string


From: Zhiwei Chen
Subject: bug#50391: 28.0.50; json-read non-ascii data results in malformed string
Date: Sun, 05 Sep 2021 12:19:56 +0800

The problem occurs when fetching JSON from Youdao (a dictionary service in China).

#+begin_src elisp
(url-retrieve
  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
  (lambda (_status)
    (goto-char (1+ url-http-end-of-headers))
    (write-region (point) (point-max) "/tmp/acc1.json")))
#+end_src

Then C-x C-f "/tmp/acc1.json" shows that the file is correctly encoded.

But if the data goes through `json-read' and then `json-insert', the
resulting file is malformed, even though uchardet reports the encoding
of the file as utf-8.

#+begin_src elisp
(url-retrieve
  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
  (lambda (_status)
    ;; Skip the HTTP headers, parse the body, then serialize it again.
    (goto-char (1+ url-http-end-of-headers))
    (let ((j (json-read)))
      (with-temp-buffer
        (json-insert j)
        (write-region (point-min) (point-max) "/tmp/acc2.json")))))
#+end_src

#+begin_src shell
diff -u <(hexdump -C /tmp/acc1.json | head -n10) <(hexdump -C /tmp/acc2.json | head -n10) | diff-so-fancy
#+end_src

Screenshot: https://pb.nichi.co/jazz-estate-brave

The diff shows that the first word, "累积", is encoded incorrectly in
"/tmp/acc2.json" (the file contains `c3 a7 c2 b4 c2 af').

Actually, the output of

#+begin_src shell
echo -n "累积" | hexdump -C
#+end_src

should be `e7 b4 af e7 a7 af' in utf-8, where "累" is encoded as
`e7 b4 af' and "积" is encoded as `e7 a7 af'.
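
The bytes seen in "/tmp/acc2.json" match what results when the three
utf-8 bytes of "累" are each treated as a separate Latin-1 character and
then encoded to utf-8 a second time.  This is only a reading of the
hexdump, not a confirmed diagnosis of what `json-read' or `json-insert'
does internally.  A minimal sketch, assuming `iconv' and `od' are
available; the helper names `lei_utf8' and `hex' are made up for this
illustration:

#+begin_src shell
# utf-8 bytes of "累" (e7 b4 af), written as octal escapes so the
# result does not depend on the terminal or locale encoding.
lei_utf8() { printf '\347\264\257'; }

# Print stdin as space-separated hex bytes on a single line.
hex() { od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'; }

# The bytes as they should appear in the file.
correct=$(lei_utf8 | hex)

# Reinterpret each byte as a Latin-1 character and encode again as utf-8.
double=$(lei_utf8 | iconv -f ISO-8859-1 -t UTF-8 | hex)

echo "correct:        $correct"   # e7 b4 af
echo "double-encoded: $double"    # c3 a7 c2 b4 c2 af
#+end_src

The second line reproduces the `c3 a7 c2 b4 c2 af' sequence from the
diff, which suggests the response body may have been read as raw bytes
rather than as decoded utf-8 text.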

The environment variable LANG is `en_US.UTF-8', and everything was
tested with `emacs -Q'.

-- 
Zhiwei Chen




