Re: [h-e-w] Processing chars above \200
From: Eli Zaretskii
Subject: Re: [h-e-w] Processing chars above \200
Date: Fri, 21 Sep 2018 21:22:40 +0300
> Date: Fri, 21 Sep 2018 13:32:43 -0400
> From: John J. Xenakis <address@hidden>
> Cc: address@hidden
>
> (defun 8bit ()
> "Test 8-bit characters"
> (let* (
> (pos (point)) (NL "\n")
> (char1 "\235") (char2 "\220")
> (pat1 "\235") (pat2 "[\230-\237]")
> )
> (insert "This is a char: " char1 NL)
> (insert "This is another char: " char2 NL)
> (goto-char pos)
> (query-replace-regexp pat1 "x") ; replaces
> (goto-char pos)
> (query-replace-regexp pat2 "y") ; does not work
> ))
>
> Now, open a brand new empty file, and execute this macro. The first
> replace works, but the second replace does not. I don't know whether
> this is what's supposed to happen, but at least it doesn't work as I
> would expect.
After you execute this macro, if you go to the \235 or \220 characters
and type "C-x =", what do you see? Does what Emacs says about these
raw bytes give you a hint regarding what is going on?
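
For reference, what "C-x =" reports can also be inspected from Lisp. The following is only a minimal sketch: my-describe-inserted-byte is a hypothetical helper name, not an Emacs function. It assumes a multibyte buffer, where a byte of 128 or above inserted from a unibyte string is stored as a special raw-byte character rather than as the plain number 157 (octal 235).

```elisp
;; Hypothetical helper for illustration; not part of Emacs.
(defun my-describe-inserted-byte (byte)
  "Insert BYTE from a unibyte string into a multibyte temp buffer.
Return (CODE RAW-BYTE-P), where CODE is the character code that
actually ends up in the buffer."
  (with-temp-buffer
    (set-buffer-multibyte t)
    (insert (unibyte-string byte))
    (let ((code (char-before)))
      ;; Raw 8-bit bytes occupy the codes #x3FFF80..#x3FFFFF.
      (list code (and (>= code #x3fff80) (<= code #x3fffff))))))

(my-describe-inserted-byte #o235)
```

The code returned for octal 235 should fall in the raw-byte range, which is not the same character as U+009D.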
> OK, so here's the overall problem. In the process of writing books
> and articles, I create text files with text from a variety of sources.
> The sources can include copy and paste from web sites, doc files, pdf
> files, and application windows, and can also include text generated by
> my scripts, usually in Perl or Java.
On what OS are you doing all that? I assume Windows, but what
versions? And what applications do you copy text from?
> I should mention that when I open a file, I use the coding system
> "windows-1252-dos."
That is probably wrong nowadays. Since you seem to say your files are
full of raw bytes, you should use raw-text, not cp1252. (That is the
fallback if we cannot resolve your problem in a better way, one where
what you get in the buffer before saving it is actual non-ASCII
characters rather than raw bytes. Depending on your answers to some of
my questions, maybe we could make that happen, unless you are working
with very old applications.)
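
As an illustration of the raw-text suggestion, one way to visit a single file that way without changing any defaults is to bind coding-system-for-read around the visit (interactively, "C-x RET c raw-text RET" followed by "C-x C-f" has a similar effect). This is a sketch only; my-find-file-raw-text is a hypothetical command name, not something Emacs provides.

```elisp
;; Hypothetical command for illustration; not part of Emacs.
(defun my-find-file-raw-text (file)
  "Visit FILE with the `raw-text' coding system, so bytes are not decoded."
  (interactive "fVisit file as raw-text: ")
  ;; Binding this variable overrides Emacs's encoding guesswork
  ;; for the duration of this visit only.
  (let ((coding-system-for-read 'raw-text))
    (find-file file)))
```

If the file is already being visited in some buffer, find-file just switches to that buffer, so kill it first to see the effect.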
> Sometimes emacs opens one of these text files, and magically decides
> that it's a "(Unix)" file. This is a nightmare because then I have "^M"
> at the end of each line, and I can't get rid of them. I've written a macro
> that replaces all ^M's with "", and that gets rid of them for a while,
> but they come back. I've tried using utility programs to convert files
> to windows or unix or mac formats, and back again, but the problem is never
> fixed.
These are all signs of working with files with inconsistent encoding.
Emacs employs some guesswork to decide what the encoding is, but it
only examines a small portion of the file before it makes the guess,
so inconsistent encoding can dupe it into making the wrong decision.
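
One way to check whether a file really mixes line-ending styles is to count them in a buffer where the ^M characters are visible. The sketch below assumes the buffer was read without end-of-line conversion (the mode line shows "(Unix)" and carriage returns appear as ^M); my-count-line-endings is a hypothetical helper name.

```elisp
;; Hypothetical helper for illustration; not part of Emacs.
(defun my-count-line-endings ()
  "Return (CRLF . LF) counts of line endings in the current buffer.
CRLF counts newlines preceded by a carriage return; LF counts bare
newlines."
  (save-excursion
    (goto-char (point-min))
    (let ((crlf 0) (lf 0))
      (while (search-forward "\n" nil t)
        ;; Point is now just after the newline; look at the
        ;; character immediately before the newline itself.
        (if (eq (char-before (1- (point))) ?\r)
            (setq crlf (1+ crlf))
          (setq lf (1+ lf))))
      (cons crlf lf))))
```

If both counts are non-zero, the file has mixed line endings, which would be consistent with the ^M symptoms described above.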
> OK, you may be sorry you asked, but that's what I'm trying to do.
I'm not sorry, I actually guessed you have something like that on your
hands.
> What's the solution?
I'd start at "emacs -Q", and upgrade to Emacs 26 if you haven't
already. I think you may have accumulated quite a few semi-correct
hacks trying to solve these problems, and those hacks are now biting
you.
In "emacs -Q", try copy/pasting text from the applications you care
about, and see which apps give you which problems, if any. Then we
will try to solve those problems one at a time.
Your first problem with the kind of solution you are used to is that
you assume \220 etc. are raw 8-bit bytes everywhere you see them in
Emacs. That assumption is false, as "C-x =" above shows you. I
actually hope that you won't need any such replacements at all, but if
you do, we will get to how one should go about doing this safely.
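
Purely as an illustration, and not necessarily the approach meant above, here is a sketch that replaces raw bytes in a multibyte buffer by comparing character codes directly instead of writing a byte range into a regexp. It relies on the assumption that raw byte B is stored as character code #x3FFF00 + B; my-replace-raw-bytes is a hypothetical name.

```elisp
;; Hypothetical helper for illustration; not part of Emacs.
(defun my-replace-raw-bytes (lo hi replacement)
  "Replace raw bytes in the range LO..HI (each 128..255) with REPLACEMENT.
Works on the current buffer, which is assumed to be multibyte.
Return the number of replacements made."
  (let ((first (+ #x3fff00 lo))   ; raw-byte character for LO
        (last (+ #x3fff00 hi))    ; raw-byte character for HI
        (count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (let ((ch (char-after)))
          (if (and (>= ch first) (<= ch last))
              (progn
                (delete-char 1)
                (insert replacement)
                (setq count (1+ count)))
            (forward-char 1)))))
    count))
```

For example, (my-replace-raw-bytes #o230 #o237 "y") would cover the range the second pattern in the quoted function was meant to match.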