bug-texinfo

Re: (gnu)sed in texinfo


From: Mihai Moldovan
Subject: Re: (gnu)sed in texinfo
Date: Fri, 04 Jul 2014 19:11:02 +0200
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

* On 04.07.2014 06:27 pm, Mojca Miklavec wrote:
> Doesn't "C" (I forgot which one of the LC-* variables exactly needs to
> be set) mean "process as is"?

It forces bytewise processing. It's equivalent to 7-bit ASCII and also
incompatible with (reading) UTF-8, as far as I understand.


> Also, UTF-8 is just "a special case" of
> a generic 8-bit encoding. In principle UTF-8-encoded text could be
> treated as / mistaken for ISO Latin 1 (it would give wrong output of
> course), so it should work in principle, but see the other recent
> emails about environmental encoding on the tex-live mailing list (from
> today).

Yes, and this "wrong output" is exactly the problem. LaTeX may refuse to render
a file containing control codes (or whatever multi-byte UTF-8 characters turn
into when read as 8-bit Latin-1 or some other encoding). I may be wrong, though:
it's entirely possible that LaTeX renders "garbage" but doesn't fail. I
probably need to check this.


> I just checked and in LuaTeX sources the following two are used at
> several places:
>     setlocale(LC_ALL, "C");
>     export LC_ALL=C
>
> Feeding sed with ISO Latin 1 text when using UTF-8 should indeed fail.
> (In a way I agree that a tool operating in "UTF-8 mode" should start
> screaming when being fed with invalid UTF-8 input.)

Yes, as soon as the input contains any non-ASCII character (or certain
well-known control characters), it may fail.


> But from what I understood sed isn't used in texinfo to do anything
> special with "real text" (non-ascii)? (That is: the transformations
> probably aren't of a type that would insert newlines, whitespace or
> any other characters *in the middle of* UTF-8 characters to make UTF-8
> invalid?)

Well, ironically, sed is failing in run_recode(), which searches for a
"@documentencoding" tag, removes it, and then, with the help of the 'recode'
tool, recodes from whatever documentencoding was given to 7-bit 'texinfo
encoding' (whatever this may mean... 7-bit ASCII?)... or from any given
encoding to the encoding specified in the file?

You know, taking a look at run_recode(), it doesn't use the "$from" local
variable at all for encoding, which looks like a bug.

Shouldn't line 1485 rather read
'if recode "$from..$to" <"$in_input" >"$in_rcd" \'
instead of
'if recode "$encoding..$to" <"$in_input" >"$in_rcd" \'?

Anyway, this is a different problem and not related to our failing sed bug. I
only noticed it just now.

sed as such is not being used to convert anything at all. It just searches for
the "@documentencoding" tag and fetches its value.
That said, couldn't we safely assume that the value of documentencoding is
always 7-bit ASCII?
In that case, forcing LC_ALL=C for the sed call would always work.
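
To illustrate the idea, here is a minimal sketch of such a call. It is not the
actual texi2dvi code: the function name get_documentencoding and the exact
regular expression are made up for this example, and it assumes the tag starts
at the beginning of a line and that its value is plain ASCII without
whitespace.

```shell
# Hypothetical helper (not taken from texi2dvi): print the value of the
# first @documentencoding line found in the file given as $1.
get_documentencoding () {
  # LC_ALL=C is set only for this one sed invocation, so sed matches
  # bytewise and never has to decode the file as UTF-8 (or anything else).
  LC_ALL=C sed -n \
    's/^@documentencoding[[:space:]]\{1,\}\([^[:space:]]\{1,\}\).*$/\1/p' \
    "$1" | head -n 1
}

# Example use (file name is illustrative):
#   encoding=$(get_documentencoding manual.texi)
```

Because the variable assignment is prefixed to the command, the rest of the
script keeps running in whatever locale the user has configured.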


Still, I'd rather have bsdsed replaced in all places by gnused instead of
working around this issue by putting LC_ALL=C in front of the failing sed
call... just to see it fail at some other point afterwards.

(Also don't forget that testing full code coverage is a pain.)



Mihai


