Re: Fwd: Re: Inadequate documentation of silly characters on screen.

emacs-devel

From:	David Kastrup
Subject:	Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date:	Thu, 19 Nov 2009 23:31:45 +0100
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)

Alan Mackenzie <address@hidden> writes:

> OK - so what's happening is that ?ñ is unambiguously 241.  But Emacs
> cannot say whether that is unibyte 241 or multibyte 241, which it
> encodes as 4194289.  Despite not knowing, Emacs is determined never to
> confuse a 4194289 type of 241 with a 241 type of 241.  So, despite the
> fact that the character 4194289 probably originated as a unibyte ?ñ,

?ñ is the code point of a character.  Unibyte strings contain bytes, not
characters.  ?ñ is a confusing way of writing 241 in the context of
unibyte, just like '\n' may be a confusing way of writing 10 in the
context of number bases.

> Why couldn't Emacs have simply displayed the character as "ñ"?

Because there is no character with a byte representation of 241.  You
are apparently demanding that Emacs display this "wild byte" as if it
were really encoded in latin-1.  What is so special about latin-1?
Latin-1 characters have a byte representation in utf-8, but it is not
241.
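To make the byte-level point concrete, here is a minimal sketch in Python (not Emacs Lisp, and not anything from the Emacs sources). It only shows that the code point 241 and the byte 241 coincide in latin-1 but not in utf-8. The helper name is_valid_utf8 is made up for this illustration.

```python
# U+00F1 (ñ) has code point 241, but its bytes depend on the encoding.
n_tilde = "\u00f1"
code_point = ord(n_tilde)                 # the character's code point
latin1_bytes = n_tilde.encode("latin-1")  # one byte in latin-1
utf8_bytes = n_tilde.encode("utf-8")      # two bytes in utf-8


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone byte 241 is not a complete utf-8 sequence, so it is no character.
lone_byte_ok = is_valid_utf8(bytes([241]))

print(code_point, list(latin1_bytes), list(utf8_bytes), lone_byte_ok)
```

The byte 241 is "ñ" only under the assumption that the data is latin-1; taken as utf-8 it is an incomplete sequence.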

> Why does it have to enforce its internal dirty linen on an
> unsuspecting hacker?

It doesn't.  And since we are talking about a non-character isolated
byte, Emacs displays it as a non-character isolated byte rather than
throwing it out on the terminal and confusing the user with whatever the
terminal may make of it.

> That meaning is an artificial one imposed by Emacs itself.  Is there
> any pressing reason to distinguish 4194289 from 241 when displaying
> them as characters on a screen?

4194289 is the Emacs code point for "invalid raw byte with value 241";
241 is the Emacs code point for "Unicode character 241, part of the
latin-1 range".  If you throw them to encode-coding-region, the
resulting unibyte string will contain 241 for the first, but whatever
external representation is proper for the specified encoding for the
second.  If you encode to latin-1, the distinction will get lost.  If
you encode to other encodings, it won't.
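The encoding behaviour just described can be modelled roughly as follows. This is a Python sketch, not Emacs code. The only mapping stated in this thread is 241 -> 4194289; the offset 0x3FFF00 and the range 128..255 are assumptions inferred from that pair, and the function names are invented for the illustration.

```python
# Assumed offset: 0x3FFF00 + 241 == 4194289, the number quoted in the thread.
RAW_BYTE_BASE = 0x3FFF00


def raw_byte_code_point(byte: int) -> int:
    """Hypothetical: code point standing for 'raw byte with this value'."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 128..255 get a raw-byte code point")
    return RAW_BYTE_BASE + byte


def is_raw_byte(cp: int) -> bool:
    """True if cp lies in the assumed raw-byte range."""
    return RAW_BYTE_BASE + 0x80 <= cp <= RAW_BYTE_BASE + 0xFF


def encode(code_points, encoding: str) -> bytes:
    """Model of encoding: raw bytes pass through, characters get encoded."""
    out = bytearray()
    for cp in code_points:
        if is_raw_byte(cp):
            out.append(cp - RAW_BYTE_BASE)     # emit the byte unchanged
        else:
            out += chr(cp).encode(encoding)    # proper external representation
    return bytes(out)


raw_241 = raw_byte_code_point(241)
print(raw_241)
print(encode([raw_241], "latin-1"), encode([241], "latin-1"))
print(encode([raw_241], "utf-8"), encode([241], "utf-8"))
```

Under these assumptions, encoding to latin-1 yields the same single byte for both code points, while encoding to utf-8 keeps them apart.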

> Sorry, what the heck is "the byte with value 241"?  Does this concept
> have any meaning, any utility beyond the machiavellian one of
> confusing me?  How would one use "the byte with value 241", and why
> does it need to be kept distinct from "ñ"?

You can use Emacs to load an executable, change some string inside of it
(make sure that it contains the same number of bytes afterwards!) and
save; everything you did not edit stays byte-for-byte the same.

That's a very fine thing.  To have this work, Emacs needs an internal
representation for "byte with code x that is not valid as part of a
character".
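As a loose analogy only: Python has a comparable mechanism, the "surrogateescape" error handler, which gives each undecodable byte a reserved code point so that it survives a decode/edit/encode round trip. The sketch below uses that mechanism, not Emacs's. The function name edit_binary and the sample bytes are invented for the illustration.

```python
def edit_binary(data: bytes, old: str, new: str) -> bytes:
    """Replace old with new in binary data, leaving all other bytes intact."""
    if len(old.encode("utf-8")) != len(new.encode("utf-8")):
        raise ValueError("replacement must have the same number of bytes")
    # Undecodable bytes become reserved code points instead of errors.
    text = data.decode("utf-8", errors="surrogateescape")
    text = text.replace(old, new)
    # The reserved code points turn back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


# Made-up sample: text mixed with bytes that are not valid utf-8.
blob = b"\x7fELF\xf1\x00hello\xff\xfe"
patched = edit_binary(blob, "hello", "HELLO")
print(patched)
```

Without a distinct internal stand-in for each invalid byte, the bytes 241, 255 and 254 in the sample would have to be dropped or replaced, and the saved file would differ from the original outside the edited string.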

-- 
David Kastrup
