emacs-devel

Re: fixing url-unhex-string for unicode/multi-byte charsets


From: Eli Zaretskii
Subject: Re: fixing url-unhex-string for unicode/multi-byte charsets
Date: Fri, 06 Nov 2020 10:02:55 +0200

> Date: Fri, 6 Nov 2020 02:47:42 -0500
> From: Boruch Baum <boruch_baum@gmx.com>
> 
> In the thread "Friendlier dired experience", Michael Albinus noted that
> the new Emacs feature that places remote files in the local trash
> performs hex-encoding on remote file names as if they were URLs, which
> led me to discover that the same was happening for local file names in
> multi-byte (e.g., Unicode) character-set encodings. Neither of these
> cases was being handled properly by the current Emacs function
> `url-unhex-string'. We noticed this when restoring a trashed file, but
> it can be expected to show up in other cases as well.

I see no problem in url-unhex-string, because its job is very simple:
convert hex codes into bytes with the same value.  It doesn't know
what to do with the result because it has no idea what the string
stands for: it could be a file name, or some text, or anything else.
The details of the rules for decoding each kind of string vary a
little, so for optimal results the caller should apply the rules that
are relevant.
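
A minimal illustration of that division of labor (the percent-encoded
sample string here is just an example, the UTF-8 encoding of a Hebrew
word):

  (require 'url-util)

  ;; url-unhex-string only turns %XY escapes into raw bytes; the
  ;; result is a unibyte string, with no character decoding applied:
  (url-unhex-string "%D7%A9%D7%9C%D7%95%D7%9D")
  ;; => a unibyte string of eight raw bytes

  ;; Interpreting those bytes is up to the caller, e.g.:
  (decode-coding-string
   (url-unhex-string "%D7%A9%D7%9C%D7%95%D7%9D") 'utf-8)
  ;; => "שלום"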

> I've solved the problem for diredc, using code from the emacs-w3m
> project (thanks). Whether, for the general Emacs case, it should be
> handled by altering the function `url-unhex-string', or whether a
> second function should be created, isn't for me to decide, so here's
> my fix for you to discuss, decide on, and apply.

I made some suggestions in that discussion, and I will repeat some of
them here:

>     (with-temp-buffer
>       (set-buffer-multibyte nil)
>       (while (string-match regexp str start)
>         (insert (substring str start (match-beginning 0))
>                 (if (match-beginning 1)
>                     (string-to-number (match-string 1 str) 16)
>                   ?\n))
>         (setq start (match-end 0)))
>       (insert (substring str start))
>       (decode-coding-string
>        (buffer-string)
>        (with-coding-priority nil
>          (car (detect-coding-region (point-min) (point-max))))))

There's no need to insert the string into a buffer and then decode it.
It sounds like you did that because you wanted to invoke
detect-coding-region?  But we have detect-coding-string as well.  Or
maybe it was because you wanted to make sure you work with unibyte
text?  But url-unhex-string already returns a unibyte string.
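
For reference, a sketch of the same approach without the buffer
round-trip, operating directly on the unibyte string that
url-unhex-string returns (the sample string is illustrative, and the
detection caveat below still applies):

  (require 'url-util)

  (let ((bytes (url-unhex-string "%D7%A9%D7%9C%D7%95%D7%9D")))
    (decode-coding-string bytes
                          (with-coding-priority nil
                            (car (detect-coding-string bytes)))))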

The use of detect-coding-region/string in this case is also
sub-optimal: depending on the exact contents of the string, it can
fail to detect the correct encoding when more than one encoding could
produce those bytes.  By contrast, variables like
file-name-coding-system already tell us how to decode file names, and
they are used all the time in Emacs, so they are almost certainly
correct (if they weren't, lots of stuff in Emacs would break).

So, for file names, something like the code below should do the job
more simply:

  (decode-coding-string (url-unhex-string STR)
                        (or file-name-coding-system
                            (default-value 'file-name-coding-system)))
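
Wrapped in a helper, that suggestion might look like the sketch below
(the function name is purely illustrative, not an existing Emacs
function):

  (require 'url-util)

  (defun my-unhex-file-name (str)
    "Decode percent-encoded file name STR via the file-name coding system."
    (decode-coding-string (url-unhex-string str)
                          (or file-name-coding-system
                              (default-value 'file-name-coding-system))))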


