Re: dired-do-find-regexp failure with latin-1 encoding


From: Dmitry Gutov
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Mon, 30 Nov 2020 03:08:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

On 29.11.2020 21:37, Juri Linkov wrote:

Do we want to search the "binary" files at all? Right now we simply filter
such matches out (see the definition of xref-matches-in-files), and I have
seen no complaints.

There are two cases: a really binary file, and a legit ascii file
with an occasional ^@ char.  And grep can't distinguish one from another.
There is an option --binary-files=binary, but unfortunately it doesn't help,
it still outputs "Binary file matches".

Makes sense.

So the xref parser needs to be smart enough to detect whether the matched line
contains binary garbage when '-a' is used, or whether it's purely ASCII.

I guess we can do that, but then some people might be a bit unhappy about not being able to search inside such files? It could be useful on occasion, too (TBC below *).
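
To make the kind of check being discussed concrete, here is a minimal sketch in Python (not the Emacs Lisp that xref itself would use). The function name and the choice of which control bytes count as "garbage" are assumptions made for illustration; nothing here is taken from xref or grep.

```python
import re

# Assumption: control bytes other than tab, LF and CR, plus DEL,
# are what we treat as "binary garbage" in a matched line.
_CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def classify_match_text(text: bytes) -> str:
    """Classify the text part of one grep match line.

    Returns 'binary' if it contains control bytes, 'ascii' if every
    byte is below 128, and 'non-ascii' otherwise.
    """
    if _CONTROL.search(text):
        return "binary"
    if text.isascii():
        return "ascii"
    return "non-ascii"
```

With a check like this, a parser could keep matches classified as 'ascii' or 'non-ascii' and decide separately what to do with the 'binary' ones.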

Moreover, I think we should apply the same heuristics to the grep output
in grep.el and add '-a' to the grep command by default.

I guess we should. Or do the LC_ALL thing. I'm still unclear on the difference in effect between the two.

Then grep.el
should prettify the lines with real binary garbage e.g. by hiding groups of
bytes between 0 and 32, or adding a 'display' property with ellipsis.

Why not. xref could also do something like that.
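
A rough sketch of the "hide groups of bytes" idea, again in Python for illustration only. It replaces each run of control bytes with a literal marker, whereas grep.el would presumably use a text property instead. Leaving tab out of the byte class and using "..." as the marker are assumptions of this sketch.

```python
import re

# Runs of bytes below 32, except tab (9), which is left alone.
# Assumption: space (32) is ordinary text and is not collapsed.
_CONTROL_RUN = re.compile(rb"[\x00-\x08\x0a-\x1f]+")


def collapse_control_runs(line: bytes, marker: bytes = b"...") -> bytes:
    """Replace every run of control bytes in LINE with MARKER."""
    return _CONTROL_RUN.sub(marker, line)
```

For example, `collapse_control_runs(b"abc\x00\x01\x02def")` returns `b"abc...def"`, while a line containing only text and tabs is returned unchanged.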

Our interpreter is our regexp with which we parse. But I suppose as long as
Grep doesn't insert unexpected newlines, the parser will be fine.

For grep output, a bigger problem is that grep on binary data
might output overly long lines before the terminating newline.

(*) We already have this kind of problem with "normal" files which contain minified assets (JS or CSS). The file contents are usually normal ASCII, but it's just one line which can reach several MBs in length.

The usual way to deal with that is with project-ignores and grep-find-ignored-files. That works for both cases.

I actually don't think I understand why we need -a in this case, since
Grep looks for null bytes to decide this is a binary file, and encoded
non-ASCII characters don't have null bytes (except if they are in
UTF-16).

Good question.

The grep manual says that binary data are either output bytes that
are improperly encoded for the current locale, or null input bytes.

So... if we add LC_ALL=C but not '-a' we will allow the "improperly encoded" case but not the "null input bytes" one?
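
The two criteria quoted from the manual can be modelled separately, which may help when reasoning about the question above. This is a toy model in Python, not grep's implementation: the function name is made up, and Python codecs stand in for locales ("latin-1" as an encoding in which every byte is valid, "utf-8" as one in which it is not). It makes no claim about what a particular grep version does under LC_ALL=C.

```python
def binary_reasons(data: bytes, encoding: str) -> set:
    """Return which of the two 'binary' criteria DATA would meet.

    'null-byte'    -> the data contains a NUL byte
    'bad-encoding' -> the data does not decode in ENCODING
    """
    reasons = set()
    if b"\x00" in data:
        reasons.add("null-byte")
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        reasons.add("bad-encoding")
    return reasons
```

In this model, Latin-1 text such as `b"caf\xe9"` meets only the 'bad-encoding' criterion when checked against "utf-8" and neither criterion against "latin-1", while UTF-16 text meets the 'null-byte' criterion whichever encoding is chosen.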


