guile-user

Re: Running script from directory with UTF-8 characters


From: David Kastrup
Subject: Re: Running script from directory with UTF-8 characters
Date: Wed, 23 Dec 2015 22:53:14 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1.50 (gnu/linux)

Eli Zaretskii <address@hidden> writes:

> From: Marko Rauhamaa <address@hidden>
>
>> Why don't you tell me already what emacs does?
>
> I did, you elided that.  It represents text as a superset of UTF-8, and
> uses high codepoints above the Unicode space for raw bytes.

Incorrect.  It uses overlong encodings of 0x00-0x7f for raw bytes in the
0x80-0xff range (0x00-0x7f are always represented as themselves).  Those
are not allowed in properly encoded UTF-8 and take only two bytes (byte
patterns 0xc0 0x80–0xbf and 0xc1 0x80–0xbf), so random byte patterns get
inflated by somewhat less than 50% on average (every pattern allowed in
properly encoded UTF-8 is left unchanged, of course).
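
To make the byte patterns concrete, here is a minimal Python sketch of the
scheme as described above.  It is not Emacs's actual code: the function name
emacs_style_encode is made up for illustration, and it relies on Python's
own UTF-8 decoder to decide which bytes count as invalid, which may differ
from Emacs in corner cases.

```python
def emacs_style_encode(data: bytes) -> bytes:
    """Re-encode arbitrary bytes: valid UTF-8 passes through unchanged,
    every other byte in 0x80-0xff becomes a two-byte overlong sequence
    (0xc0 or 0xc1 followed by a continuation byte 0x80-0xbf)."""
    # Assumption: Python's decoder marks each undecodable byte b as the
    # lone surrogate U+DC00 + b, which lets us find the raw bytes.
    text = data.decode("utf-8", errors="surrogateescape")
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            raw = cp - 0xDC00                    # the original byte, 0x80-0xff
            out.append(0xC0 | ((raw >> 6) & 1))  # lead byte: 0xc0 or 0xc1
            out.append(0x80 | (raw & 0x3F))      # continuation: 0x80-0xbf
        else:
            out += ch.encode("utf-8")            # valid text is left unchanged
    return bytes(out)


print(emacs_style_encode(b"\x80").hex())  # c080
print(emacs_style_encode(b"\xff").hex())  # c1bf
```

An input made up entirely of stray high bytes doubles in size, while ASCII
and well-formed UTF-8 come out byte for byte identical.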

That's more economical than Python's method, which uses the encodings of
surrogate code points not allowed in properly encoded UTF-8, taking 3
bytes rather than the 2 Emacs makes do with.  Using high codepoints above
the Unicode space would even take 4 bytes.
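
The size comparison can be checked with a short Python sketch.  The
"surrogateescape" and "surrogatepass" error handlers are standard Python;
the helper utf8_style_4byte and the sample code point 0x110000 are
hypothetical, chosen only to show what a four-byte UTF-8-style sequence just
above the Unicode range would look like.

```python
def utf8_style_4byte(cp: int) -> bytes:
    """Hypothetical helper: apply the four-byte UTF-8 bit layout to a
    value that may lie above U+10FFFF (so the result is not valid UTF-8)."""
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("value does not fit the four-byte layout")
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


raw = b"\xff"

# Python maps the undecodable byte to the lone surrogate U+DCFF.
escaped = raw.decode("utf-8", errors="surrogateescape")
as_surrogate = escaped.encode("utf-8", errors="surrogatepass")

# Two-byte overlong form for the same raw byte, as described for Emacs.
as_overlong = bytes([0xC0 | ((raw[0] >> 6) & 1), 0x80 | (raw[0] & 0x3F)])

# An arbitrary value just past the end of the Unicode code space.
as_high = utf8_style_4byte(0x110000)

print(len(as_overlong), len(as_surrogate), len(as_high))  # 2 3 4
```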

-- 
David Kastrup



