guile-user

Re: guile can't find a chinese named file


From: David Kastrup
Subject: Re: guile can't find a chinese named file
Date: Wed, 15 Feb 2017 12:18:21 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)

Marko Rauhamaa <address@hidden> writes:

> David Kastrup <address@hidden>:
>
>> If you tell Emacs that some external entity is in UTF-8, it will
>> represent all valid UTF-8 sequences as properly decoded characters,
>> and it has special codes for all bytes not part of valid UTF-8.
>>
>> As a result, it works with valid UTF-8 perfectly as expected but will
>> reproduce arbitrary byte streams thrown at it perfectly when decoding
>> as UTF-8 and then reencoding into UTF-8 again.
>>
>> Guile is lacking this byte stream reproducibility when
>> decoding/reencoding. That makes it a whole lot less robust for dealing
>> with externally provided material.
>
> Python3 supports this by abusing the surrogate code points. I don't
> recommend following Python's lead.
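
For reference, the Python 3 mechanism mentioned above is the
"surrogateescape" error handler.  A minimal sketch of what it does; the
helper name `roundtrip` and the sample bytes are made up for
illustration:

```python
def roundtrip(raw: bytes):
    """Decode with surrogateescape, then encode back the same way."""
    # Each byte that is not part of valid UTF-8 becomes a lone low
    # surrogate U+DC80..U+DCFF in the resulting str.
    text = raw.decode("utf-8", errors="surrogateescape")
    back = text.encode("utf-8", errors="surrogateescape")
    return text, back

# A stray Latin-1 byte 0xe9, then '-', then valid UTF-8 for U+4E2D.
raw = b"caf\xe9-\xe4\xb8\xad"
text, back = roundtrip(raw)
```

The bytes survive the round trip, but the intermediate string contains a
lone surrogate, so encoding it as strict UTF-8 fails.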

Emacs uses the overlong two-byte encodings of 0x00 to 0x7f to represent
bytes with values 0x80 to 0xff that are not part of valid UTF-8
sequences.  Those cannot occur in valid UTF-8, but they behave nicely
internally with regard to detecting character boundaries in
string/character handling.  Basically, those are the patterns 0xc0 0x80
... 0xc0 0xbf and 0xc1 0x80 ... 0xc1 0xbf, representing 0x80 ... 0xbf
and 0xc0 ... 0xff respectively when the latter are not part of proper
(and consequently uniquely encoded) UTF-8.
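
A rough sketch of that mapping in Python, not Emacs code: the function
names `to_internal` and `from_internal` are hypothetical, and Python's
strict UTF-8 decoder is assumed as the judge of which bytes are "not
part of valid UTF-8".

```python
def to_internal(data: bytes) -> bytes:
    """Map external bytes to an Emacs-like internal byte form."""
    out = bytearray()
    # surrogateescape marks every stray byte b as the code point 0xDC00 + b.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            b = cp - 0xDC00                    # the stray byte, 0x80..0xff
            out.append(0xC0 | ((b >> 6) & 1))  # 0xc0 for 0x80..0xbf, 0xc1 for 0xc0..0xff
            out.append(0x80 | (b & 0x3F))
        else:
            out += ch.encode("utf-8")          # valid UTF-8 passes through unchanged
    return bytes(out)


def from_internal(internal: bytes) -> bytes:
    """Inverse mapping: turn 0xc0/0xc1 pairs back into single raw bytes."""
    out = bytearray()
    i = 0
    while i < len(internal):
        lead = internal[i]
        if lead in (0xC0, 0xC1):
            out.append(0x80 | ((lead & 1) << 6) | (internal[i + 1] & 0x3F))
            i += 2
        else:
            out.append(lead)
            i += 1
    return bytes(out)
```

With these definitions, `from_internal(to_internal(data)) == data` holds
for arbitrary input, and a properly encoded file name comes through
`to_internal` byte for byte unchanged.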

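Which means that random byte sequences get blown up by less than 50%
internally: about half of all random bytes fall into 0x80...0xff, and
each isolated one of them takes two bytes instead of one.  It is less
than 50% because some bytes 0x80...0xff end up in combinations
constituting valid UTF-8 sequences and thus will pass transparently.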
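
A quick way to check that estimate, sketched in Python.  The helper
names `internal_size` and `blowup` are made up, and Python's strict
UTF-8 decoder stands in for the decision about which bytes count as
valid:

```python
import random


def internal_size(data: bytes) -> int:
    """Size of data in a form where each stray byte takes two bytes."""
    text = data.decode("utf-8", errors="surrogateescape")
    # Stray bytes show up as lone surrogates U+DC80..U+DCFF.
    stray = sum(1 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)
    return len(data) + stray


def blowup(data: bytes) -> float:
    """Ratio of internal size to external size."""
    return internal_size(data) / len(data)


rng = random.Random(0)                 # fixed seed, only for repeatability
sample = bytes(rng.getrandbits(8) for _ in range(100_000))
ratio = blowup(sample)                 # expected to stay below 1.5
```

Pure ASCII gives a ratio of exactly 1.0, and the worst case, a run of
nothing but stray bytes such as 0xff, gives 2.0.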

> Instead, when decoding a byte string into Unicode, the application
> should be returned a list:
>
>    ( chars bytes chars bytes ... chars )
>
> or some similar mechanism.

This would seriously inflate random byte sequences and require string
handling to special-case the counters.  The Emacs way is comparatively
modest, and the internal representation meets most of the UTF-8
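invariants important for fast string processing.  Perhaps the most
astonishing thing is that this reencoding results in a sensible sort
order when the internal form is compared bytewise: isolated bytes
0x80...0xff sort right after 0x00...0x7f, since their lead bytes 0xc0
and 0xc1 are lower than the lead byte of any valid multibyte UTF-8
sequence.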
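
The ordering can be illustrated with a small Python sketch.  It assumes
a plain bytewise comparison of the internal form, and the helper
`internal` is hypothetical: a Python int stands for an isolated raw
byte, a str for a properly decoded character.

```python
def internal(item) -> bytes:
    """Internal byte form of one item (int = stray byte, str = character)."""
    if isinstance(item, int):
        # Stray byte 0x80..0xff as an overlong pair with lead 0xc0 or 0xc1.
        return bytes([0xC0 | ((item >> 6) & 1), 0x80 | (item & 0x3F)])
    return item.encode("utf-8")


items = ["\u00e9", 0xFF, "a", 0x80, "\x7f", "\u4e2d"]
ordered = sorted(items, key=internal)
# ASCII first, then the stray bytes, then the valid multibyte characters.
```

The stray bytes land between "\x7f" and U+00E9 because their lead bytes
0xc0 and 0xc1 are below 0xc2, the smallest lead byte that occurs in
valid UTF-8.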

-- 
David Kastrup
