Re: dired-do-find-regexp failure with latin-1 encoding
From: Dmitry Gutov
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Mon, 30 Nov 2020 03:08:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0
On 29.11.2020 21:37, Juri Linkov wrote:
Do we want to search the "binary" files at all? Right now we simply filter
such matches out (see the definition of xref-matches-in-files), and I have
seen no complaints.
There are two cases: a really binary file, and a legit ascii file
with an occasional ^@ char. And grep can't distinguish one from another.
There is an option --binary-files=binary, but unfortunately it doesn't help,
it still outputs "Binary file matches".
Makes sense.
So xref parser needs to be smart enough to detect whether the matched line
contains binary garbage when '-a' is used, or it's purely ascii.
I guess we can do that, but then some people might be a bit unhappy
about not being able to search inside such files? It could be useful on
occasion, too (TBC below *).
Moreover, I think we should apply the same heuristics to the grep output
in grep.el and add '-a' to the grep command by default.
I guess we should. Or do the LC_ALL thing. I'm still unclear on the
difference in effect between the two.
Then grep.el
should prettify the lines with real binary garbage e.g. by hiding groups of
bytes between 0 and 32, or adding a 'display' property with ellipsis.
Why not. xref could also do something like that.
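To make the prettifying idea concrete, here is a rough sketch in Python rather than Emacs Lisp, just to show the heuristic. It treats a run of control bytes as "binary garbage" and collapses each run to an ellipsis. The function names, the choice to leave tab alone, and the `max_len` cutoff for overlong lines are all assumptions for illustration; none of this is what xref or grep.el currently does.

```python
import re

# Runs of control bytes (values 0-31). Tab (9) is deliberately excluded,
# which is an assumption: tabs are common in ordinary text files.
_CONTROL_RUN = re.compile(rb"[\x00-\x08\x0a-\x1f]+")


def has_control_bytes(line: bytes) -> bool:
    """Return True if the matched line contains any control-byte run."""
    return _CONTROL_RUN.search(line) is not None


def prettify(line: bytes, max_len: int = 200) -> bytes:
    """Collapse each control-byte run to '...' and cap the line length.

    max_len is a hypothetical limit for lines that grep emits from
    binary data, which can be very long.
    """
    shown = _CONTROL_RUN.sub(b"...", line)
    if len(shown) > max_len:
        shown = shown[:max_len] + b"..."
    return shown
```

In Emacs the same effect would more likely come from a 'display' property on the matched region than from rewriting the text, so the buffer contents stay intact.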
Our interpreter is our regexp with which we parse. But I suppose as long as
Grep doesn't insert unexpected newlines, the parser will be fine.
For grep output a bigger problem is that grep on binary data
might output too long lines before the terminating newline.
(*) We already have this kind of problem with "normal" files which
contain minified assets (JS or CSS). The file contents are usually
normal ASCII, but it's just one line which can reach several MBs in length.
The usual way to deal with that is with project-ignores and
grep-find-ignored-files. That works for both cases.
I actually don't think I understand why we need -a in this case, since
Grep looks for null bytes to decide this is a binary file, and encoded
non-ASCII characters don't have null bytes (except if they are in
UTF-16).
Good question.
The grep manual says that binary data are either output bytes that
are improperly encoded for the current locale, or null input bytes.
So... if we add LC_ALL=C but not '-a' we will allow the "improperly
encoded" case but not the "null input bytes" one?
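To spell out the two criteria from the manual, here is a toy model in Python. It is not grep's implementation: `binary_reasons` is a made-up name, it checks the whole input rather than just the output bytes, and "latin-1" only stands in for a locale in which every byte counts as a valid character, which is my assumption about how LC_ALL=C behaves.

```python
def binary_reasons(data: bytes, locale_encoding: str) -> set:
    """Return which of the two 'binary' criteria the data would trigger.

    locale_encoding is a stand-in for the current locale's encoding.
    """
    reasons = set()
    # Criterion 1: null input bytes.
    if b"\x00" in data:
        reasons.add("null byte")
    # Criterion 2: bytes that are improperly encoded for the locale.
    try:
        data.decode(locale_encoding)
    except UnicodeDecodeError:
        reasons.add("improperly encoded")
    return reasons


latin1_line = "caf\u00e9\n".encode("latin-1")  # b"caf\xe9\n"
nul_line = b"abc\x00def\n"

print(binary_reasons(latin1_line, "utf-8"))    # flagged under a UTF-8 locale
print(binary_reasons(latin1_line, "latin-1"))  # not flagged when every byte is valid
print(binary_reasons(nul_line, "latin-1"))     # still flagged because of the null byte
```

If that model is right, changing the locale would only affect the second criterion, and only '-a' would let files with null bytes through.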