[bug #59402] discordance between preconv and groff
From: Dave
Subject: [bug #59402] discordance between preconv and groff
Date: Mon, 2 Nov 2020 20:18:46 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0
URL:
<https://savannah.gnu.org/bugs/?59402>
Summary: discordance between preconv and groff
Project: GNU troff
Submitted by: barx
Submitted on: Mon 02 Nov 2020 07:18:45 PM CST
Category: Preprocessor preconv
Severity: 3 - Normal
Item Group: Incorrect behaviour
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
Planned Release: None
_______________________________________________________
Details:
This bug report is the spawn of bug #57618; the relevant discussion begins at
comment 8 in that bug report. This example illustrates the problem:
$ printf "co\xF6perate\n" | uconv -f latin1 -x Any-NFD | groff -Kutf8 > /dev/null
<standard input>:1: warning: can't find special character `u0308'
Unpacking the above: the "printf" hard-codes a Latin-1 character in its
otherwise plain ASCII output; the "uconv" converts its Latin-1 input to UTF-8,
specifically requesting Normalization Form D; groff's -K option implicitly
runs preconv first and tells it to expect UTF-8 input; and finally groff
processes preconv's output, whereupon it doesn't understand the non-ASCII
part.
(No error is generated if the above is changed to have uconv emit
Normalization Form C (NFC) instead of NFD.)
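The difference between the two normalization forms can be seen without groff at all. The following is a minimal Python sketch using only the standard library; the helper name `codepoints` is made up for illustration and is not part of any groff tool.

```python
import unicodedata

def codepoints(text, form):
    # Normalize to the requested form, then list each code point as U+XXXX.
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, text)]

# U+00F6 is the character that the printf command writes as byte F6 in Latin-1.
word = "co\u00f6perate"

for form in ("NFC", "NFD"):
    print(form, " ".join(codepoints(word, form)))
```

In NFC the o-with-diaeresis stays a single code point, U+00F6. In NFD it becomes a plain "o" (U+006F) followed by the combining diaeresis U+0308, which is the character named in the warning above.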
The problem, then, is that preconv emits a Unicode sequence that groff doesn't
understand. Whether that's a problem with preconv, with groff, or with both,
is unclear (to me), but if both, then this bug can fork into two bugs to track
each problem separately.
Branden's initial analysis was that this is "a bug in (1) preconv for not
emitting composite Unicode glyph escapes and (2) troff for failing to accept
them," though if the antecedent to "them" is "composite Unicode glyph
escapes," there's a logic error there: troff can't be failing to accept
something that it's not even being given. I read the "them" as referring to
the sequence that preconv does in fact emit from the above input:
* coo\[u0308]perate
So this bug could be fixed by making groff as liberal as possible with the
forms of Unicode escapes it recognizes. It already knows three equivalent
forms of this string:
* co\[:o]perate (historical roff style)
* co\[u00F6]perate (NFC)
* co\[u006F_0308]perate (NFD, but "spelled" differently)
(A fix for bug #58796 would enable preconv to emit the first of the above.)
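To show how the emitted sequence and the NFC spelling arise from the same word, here is a simplified model in Python. It is not preconv's code: `per_codepoint_escapes` is a hypothetical helper that assumes, as the output quoted above suggests, that each non-ASCII code point is written as its own \[uXXXX] escape with no regard to what precedes it.

```python
import unicodedata

def per_codepoint_escapes(text):
    # Assumption: ASCII passes through unchanged, and every other code point
    # becomes a separate \[uXXXX] escape (uppercase hex, at least 4 digits).
    return "".join(
        ch if ord(ch) < 0x80 else f"\\[u{ord(ch):04X}]"
        for ch in text
    )

word = "co\u00f6perate"
nfc_spelling = per_codepoint_escapes(unicodedata.normalize("NFC", word))
nfd_spelling = per_codepoint_escapes(unicodedata.normalize("NFD", word))

print(nfc_spelling)  # the NFC form listed above
print(nfd_spelling)  # the sequence groff warns about
```

Under this model the only difference between the spelling groff accepts and the one it rejects is the normalization form of the input.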
The advantages of making groff grok the "o\[u0308]" sequence are:
1. it follows the "Be liberal in what you accept" principle
2. preconv would not have to change
3. groff could handle arbitrary combinations for which the other three forms do
not exist, such as
* Spin\[u0308]al Tap
The drawback, as Branden elucidates it, is that it's significantly easier to
change preconv to output something groff already recognizes.
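For comparison, the preconv-side alternative might look roughly like the following Python sketch. It is an outline under stated assumptions, not a patch: `composite_escapes` is a hypothetical name, and the code assumes the composite spelling is the base code point followed by its combining marks, joined with underscores, as in the \[u006F_0308] form listed earlier. It says nothing about whether groff has a glyph for any particular composite.

```python
import unicodedata

def composite_escapes(text):
    # Fold each base character and the combining marks that follow it
    # into a single escape; lone ASCII characters pass through unchanged.
    out = []
    cluster = []

    def flush():
        if not cluster:
            return
        if len(cluster) == 1 and ord(cluster[0]) < 0x80:
            out.append(cluster[0])
        else:
            out.append("\\[u" + "_".join(f"{ord(c):04X}" for c in cluster) + "]")
        cluster.clear()

    # Decompose first so precomposed and decomposed input give the same output.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and cluster:
            cluster.append(ch)   # a combining mark joins the current base
        else:
            flush()              # a new base character starts a new cluster
            cluster.append(ch)
    flush()
    return "".join(out)

print(composite_escapes("co\u00f6perate"))
print(composite_escapes("coo\u0308perate"))
```

Both calls print the same string, since the input is decomposed before clustering. The sketch does not escape backslashes already present in the input, which a real filter would need to handle.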
In any case, the first step seems to be to define exactly what groff
should--and, if relevant, should not--accept as valid forms of Unicode-escape
input.
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?59402>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/