bug#12291: [rev 109796] wrong UTF-8 handling
From: Lars Ingebrigtsen
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Thu, 27 Jan 2022 17:32:53 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Werner LEMBERG <wl@gnu.org> writes:
> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues). It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE). If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':
>
> position: 1 of 2 (0%), column: 0
> character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
> preferred charset: unicode (Unicode (ISO10646))
(I'm going through old bug reports that unfortunately weren't resolved
at the time.)
This has changed at some point between when this was reported and now:
position: 1 of 2 (0%), column: 0
character: (displayed as ) (codepoint 1266142, #o4650736, #x1351de)
charset: emacs (Full Emacs charset (excluding eight bit chars))
code point in charset: 0x1351DE
syntax: w which means: word
category: L:Strong L2R
to input: type "C-x 8 RET 1351de"
So Emacs now displays more accurate information about the UTF-8
sequence: it reports the decoded value #x1351de instead of U+4E8C.
It was pointed out that this sequence is outside the Unicode range,
which only extends up to U+10FFFF, and that Emacs should perhaps display
this as a sequence of raw bytes instead. Is that something we still want
to pursue, or is Emacs behaving the way we want here? Eli?
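For reference, here is a minimal Python sketch of the arithmetic involved. It is not Emacs code; the helper name `decode_four_byte` and the constant `UNICODE_MAX` are made up for illustration. It decodes the four bytes by hand, without any range check, and compares the result with the Unicode limit:

```python
UNICODE_MAX = 0x10FFFF  # highest code point defined by Unicode


def decode_four_byte(seq: bytes) -> int:
    """Decode a four-byte UTF-8-style sequence with no range check.

    Hypothetical helper for illustration only; it mirrors the bit
    layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    """
    if len(seq) != 4 or seq[0] & 0xF8 != 0xF0:
        raise ValueError("not a four-byte lead byte")
    if any(b & 0xC0 != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    cp = seq[0] & 0x07              # 3 payload bits from the lead byte
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)  # 6 payload bits per continuation byte
    return cp


raw = bytes([0xF4, 0xB5, 0x87, 0x9E])
cp = decode_four_byte(raw)
print(hex(cp))           # 0x1351de
print(cp > UNICODE_MAX)  # True: outside the Unicode range

# A strict UTF-8 decoder rejects the sequence outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)         # False
```

The permissive decode gives the same #x1351de that Emacs now reports, while a strict decoder treats the bytes as invalid, which is roughly the "raw bytes" behaviour asked about above.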
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no