bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Lars Ingebrigtsen
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Thu, 12 Aug 2021 15:51:28 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Eli Zaretskii <eliz@gnu.org> writes:
>> My use-case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible. While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there. And Firefox seems to like UTF-16, even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.
I ran into the same issue when dealing with X selections -- but there are
even more peculiarities in that area (some selections add a spurious nul
to the end, and some don't), so you have to write a bit of code around
this: `decode-coding-string' in itself can't be expected to deal with or
guess all these oddities (as you say).
>> I have tried to debug the C routines that implement this (s.a.), but the
>> code is somewhat hairy. I guess I'll have another look to see if I can
>> understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text and sees whether there's a null byte there
> for each non-null byte, and try UTF-16 if so. Assuming that we want
> to improve the chances of having UTF-16 detected for a small penalty,
> that is.
I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16. For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16. (And I think
that would be easy enough to implement?)
On the other hand, as you point out, there's a performance penalty that
may not be worth it.
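To make the heuristic above concrete, here is a rough standalone sketch in C.
It is not the code in detect_coding_system and not a proposed patch: the
function name guess_utf16, the enum, and the max_prefix parameter are all
made up for illustration. It scans at most max_prefix bytes, counts
non-nul/nul pairs (little-endian) and nul/non-nul pairs (big-endian), and
reports a guess only if more than 90% of the pairs agree. Capping the scan
at a short prefix is what keeps the penalty small.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch; not part of Emacs.  */
enum utf16_guess { UTF16_GUESS_NONE, UTF16_GUESS_LE, UTF16_GUESS_BE };

/* Guess whether BUF (LEN bytes) looks like UTF-16 by inspecting at
   most MAX_PREFIX bytes.  Assumes mostly Latin-range text, where one
   byte of each 16-bit unit is nul.  */
enum utf16_guess
guess_utf16 (const unsigned char *buf, size_t len, size_t max_prefix)
{
  size_t n = len < max_prefix ? len : max_prefix;
  n -= n % 2;                   /* Only look at whole byte pairs.  */

  size_t pairs = n / 2;
  size_t le = 0, be = 0;

  if (pairs == 0)
    return UTF16_GUESS_NONE;

  for (size_t i = 0; i < n; i += 2)
    {
      if (buf[i] != 0 && buf[i + 1] == 0)
        le++;                   /* non-nul/nul: little-endian shape */
      else if (buf[i] == 0 && buf[i + 1] != 0)
        be++;                   /* nul/non-nul: big-endian shape */
    }

  /* "More than 90%" of the pairs, using integer arithmetic.  */
  if (le * 10 > pairs * 9)
    return UTF16_GUESS_LE;
  if (be * 10 > pairs * 9)
    return UTF16_GUESS_BE;
  return UTF16_GUESS_NONE;
}
```

A check like this would miss UTF-16 text that is mostly non-Latin (CJK, for
instance), since few of those byte pairs contain a nul, so it can only raise
the chances of detection, not guarantee it.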
So... uhm... does anybody have an opinion here? Try harder for utf-16
or just leave it as it is?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no