bug#12291: [rev 109796] wrong UTF-8 handling

From:	Lars Ingebrigtsen
Subject:	bug#12291: [rev 109796] wrong UTF-8 handling
Date:	Thu, 27 Jan 2022 17:32:53 +0100
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)

Werner LEMBERG <wl@gnu.org> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':
>
>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
>       preferred charset: unicode (Unicode (ISO10646))

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

This has changed at some point between when this was reported and now:

             position: 1 of 2 (0%), column: 0
            character:  (displayed as ) (codepoint 1266142, #o4650736, #x1351de)
              charset: emacs (Full Emacs charset (excluding eight bit chars))
code point in charset: 0x1351DE
               syntax: w        which means: word
             category: L:Strong L2R
             to input: type "C-x 8 RET 1351de"

So Emacs now reports the code point that the UTF-8 sequence actually
encodes (#x1351de) instead of an unrelated character.

It was pointed out that this sequence is outside the Unicode range,
which only extends up to U+10FFFF, and that Emacs should perhaps display
it as a sequence of raw bytes instead.  Is that something we still want
to pursue, or is Emacs behaving the way we want here?  Eli?
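
For reference, here is a small Python sketch of the arithmetic behind
the numbers above.  It is only an illustration, not how Emacs decodes
text; the helper name decode_four_byte is made up for this example, and
Python's surrogateescape handler is just a rough analogue of a "raw
bytes" display.

```python
UNICODE_MAX = 0x10FFFF  # highest code point defined by Unicode


def decode_four_byte(seq: bytes) -> int:
    """Decode a four-byte UTF-8-style sequence by bit arithmetic only.

    Hypothetical helper: it deliberately skips the range check that a
    strict UTF-8 decoder performs, so out-of-range values come through.
    """
    if len(seq) != 4 or seq[0] & 0xF8 != 0xF0:
        raise ValueError("expected a four-byte sequence with a 11110xxx lead byte")
    if any(b & 0xC0 != 0x80 for b in seq[1:]):
        raise ValueError("expected 10xxxxxx continuation bytes")
    # 3 payload bits from the lead byte, 6 from each continuation byte.
    return (
        (seq[0] & 0x07) << 18
        | (seq[1] & 0x3F) << 12
        | (seq[2] & 0x3F) << 6
        | (seq[3] & 0x3F)
    )


raw = bytes([0xF4, 0xB5, 0x87, 0x9E])  # the bytes from the report
code_point = decode_four_byte(raw)
print(hex(code_point), code_point > UNICODE_MAX)  # 0x1351de True

# A strict decoder rejects the sequence outright ...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... while surrogateescape keeps each undecodable byte as its own
# placeholder, so the original bytes can be recovered unchanged.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
```

The lenient arithmetic gives the same value that `C-u C-x =' shows
now, while the strict decode fails, which is the distinction the
question above is about.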

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
