Re: dired-do-find-regexp failure with latin-1 encoding

emacs-devel


From:	Dmitry Gutov
Subject:	Re: dired-do-find-regexp failure with latin-1 encoding
Date:	Mon, 30 Nov 2020 03:08:40 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

On 29.11.2020 21:37, Juri Linkov wrote:

Do we want to search the "binary" files at all? Right now we simply filter
such matches out (see the definition of xref-matches-in-files), and I have
seen no complaints.


There are two cases: a really binary file, and a legit ascii file
with an occasional ^@ char.  And grep can't distinguish one from another.
There is an option --binary-files=binary, but unfortunately it doesn't help,
it still outputs "Binary file matches".


Makes sense.

So xref parser needs to be smart enough to detect whether the matched line
contains binary garbage when '-a' is used, or it's purely ascii.

I guess we can do that, but then some people might be a bit unhappy about not being able to search inside such files? It could be useful on occasion, too (TBC below *).
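To make the per-line check concrete, here is a minimal sketch in Python (not xref's actual code). It assumes "binary garbage" means any control byte below 32 other than tab, newline, or carriage return; the function name and that definition are both illustrative assumptions. Bytes of 128 and above are left alone, since they may be validly encoded non-ASCII text.

```python
def line_has_binary_garbage(line: bytes) -> bool:
    """Return True if a matched line contains bytes that look binary.

    Assumption: "binary" means a control byte below 32, excluding
    tab (0x09), newline (0x0A) and carriage return (0x0D).
    """
    allowed = {0x09, 0x0A, 0x0D}
    return any(b < 32 and b not in allowed for b in line)
```

Under this definition, a line from a latin-1 file such as b"caf\xe9" passes as text, while a line containing a ^@ byte is flagged.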

Moreover, I think we should apply the same heuristics to the grep output
in grep.el and add '-a' to the grep command by default.

I guess we should. Or do the LC_ALL thing. I'm still unclear on the difference in effect between the two.

Then grep.el
should prettify the lines with real binary garbage e.g. by hiding groups of
bytes between 0 and 32, or adding a 'display' property with ellipsis.


Why not. xref could also do something like that.
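The "hide groups of bytes" idea could look roughly like the following Python sketch. It is only an illustration of the transformation, not how grep.el or xref would implement it (there it would more likely be a 'display' property than a string replacement). The function name, the "..." placeholder, and the choice to leave tab, newline, and space untouched are assumptions.

```python
import re

# Runs of control characters in the range 0..31, excluding tab (0x09)
# and newline (0x0A). Space (32) is also left alone so that ordinary
# text keeps its spacing.
_CONTROL_RUN = re.compile(r"[\x00-\x08\x0b-\x1f]+")


def prettify(line: str, placeholder: str = "...") -> str:
    """Replace each run of control characters with a single placeholder."""
    return _CONTROL_RUN.sub(placeholder, line)
```

For example, prettify("abc\x00\x01\x02def") collapses the three control characters into one ellipsis and leaves the surrounding text intact.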

Our interpreter is our regexp with which we parse. But I suppose as long as
Grep doesn't insert unexpected newlines, the parser will be fine.


For grep output a bigger problem is that grep on binary data
might output too long lines before the terminating newline.

(*) We already have this kind of problem with "normal" files which contain minified assets (JS or CSS). The file contents are usually normal ASCII, but it's just one line which can reach several MBs in length.

The usual way to deal with that is with project-ignores and grep-find-ignored-files. That works for both cases.

I actually don't think I understand why we need -a in this case, since
Grep looks for null bytes to decide this is a binary file, and encoded
non-ASCII characters don't have null bytes (except if they are in
UTF-16).


Good question.


The grep manual says that binary data are either output bytes that
are improperly encoded for the current locale, or null input bytes.

So... if we add LC_ALL=C but not '-a' we will allow the "improperly encoded" case but not the "null input bytes" one?
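The two criteria quoted from the manual can be modelled in a few lines of Python. This is only a model of that one sentence, not grep's implementation. The function name is hypothetical, and using the "latin-1" codec as a stand-in for a locale in which every byte is a valid character is an assumption; whether LC_ALL=C behaves that way for a given grep version is exactly the open question.

```python
def classify(data: bytes, encoding: str) -> str:
    """Model the manual's two "binary" criteria for a chunk of data.

    `encoding` stands in for the current locale's character encoding.
    """
    # Criterion 1: null bytes mark the data as binary in any locale.
    if b"\x00" in data:
        return "null bytes"
    # Criterion 2: bytes that do not decode in the locale's encoding.
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return "improperly encoded"
    return "text"
```

In this model a latin-1 encoded "café" (b"caf\xe9") is "improperly encoded" when checked as UTF-8 and "text" when checked against an encoding that accepts every byte, while data containing a null byte is flagged in both cases.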
