[bug #59402] discordance between preconv and groff

From:	Dave
Subject:	[bug #59402] discordance between preconv and groff
Date:	Mon, 2 Nov 2020 20:18:46 -0500 (EST)
User-agent:	Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0

URL:
  <https://savannah.gnu.org/bugs/?59402>

                 Summary: discordance between preconv and groff
                 Project: GNU troff
            Submitted by: barx
            Submitted on: Mon 02 Nov 2020 07:18:45 PM CST
                Category: Preprocessor preconv
                Severity: 3 - Normal
              Item Group: Incorrect behaviour
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
         Planned Release: None

    _______________________________________________________

Details:

This bug report is the spawn of bug #57618; the relevant discussion begins at
comment 8 in that bug report.  This example illustrates the problem:


$ printf "co\xF6perate\n" | uconv -f latin1 -x Any-NFD | groff -Kutf8 > /dev/null
<standard input>:1: warning: can't find special character `u0308'


Unpacking the above: the "printf" hard-codes a Latin-1 character in its
otherwise plain ASCII output; the "uconv" converts its Latin-1 input to UTF-8,
specifically requesting Normalization Form D; groff's -K option implicitly
runs preconv first and tells it to expect UTF-8 input; and finally groff
processes preconv's output, whereupon it doesn't understand the non-ASCII
part.

(No error is generated if the above is changed to have uconv emit
Normalization Form C (NFC) instead of NFD.)
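
To see concretely what differs between the two normalization forms, here is a
small Python sketch.  It is not part of groff or uconv; it only uses Python's
standard unicodedata module to list the code points each form contains, and the
helper name codepoints() is made up for this illustration.

```python
import unicodedata

def codepoints(s):
    """Return the code points of s as a list of U+XXXX strings."""
    return ["U+%04X" % ord(c) for c in s]

# Same input as the printf in the example: Latin-1 bytes with 0xF6 (o-diaeresis).
text = b"co\xf6perate".decode("latin-1")

nfc = unicodedata.normalize("NFC", text)  # precomposed: one code point, U+00F6
nfd = unicodedata.normalize("NFD", text)  # decomposed: U+006F followed by U+0308

print(codepoints(nfc))
print(codepoints(nfd))
```

The NFD string is one code point longer than the NFC one, because the
precomposed U+00F6 becomes a plain "o" followed by the combining diaeresis
U+0308, and that lone U+0308 is what the warning above complains about.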

The problem, then, is that preconv emits a Unicode sequence that groff doesn't
understand.  Whether that's a problem with preconv, with groff, or with both,
is unclear (to me), but if both, then this bug can fork into two bugs to track
each problem separately.

Branden's initial analysis was that this is "a bug in (1) preconv for not
emitting composite Unicode glyph escapes and (2) troff for failing to accept
them," though if the antecedent to "them" is "composite Unicode glyph
escapes," there's a logic error there: troff can't be failing to accept
something that it's not even being given.  I read the "them" as referring to
the sequence that preconv does in fact emit from the above input:

* coo\[u0308]perate

So this bug could be fixed by making groff as liberal as possible with the
forms of Unicode escapes it recognizes.  It already knows three equivalent
forms of this string:

* co\[:o]perate (historical roff style)
* co\[u00F6]perate (NFC)
* co\[u006F_0308]perate (NFD, but "spelled" differently)

(A fix for bug #58796 would enable preconv to emit the first of the above.)
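
For comparison with the forms listed above, here is a rough Python model of how
the unrecognized sequence can arise.  It assumes a converter that passes ASCII
through and escapes every other code point separately; that is a simplification
for illustration, not preconv's actual source, and naive_escape() is a
hypothetical name.

```python
import unicodedata

def naive_escape(text):
    """Pass ASCII through; escape every other code point on its own as \\[uXXXX]."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append("\\[u%04X]" % ord(ch))
    return "".join(out)

word = "co\u00f6perate"
print(naive_escape(unicodedata.normalize("NFC", word)))  # co\[u00F6]perate
print(naive_escape(unicodedata.normalize("NFD", word)))  # coo\[u0308]perate
```

With NFC input this model yields the second form in the list; with NFD input
the base letter stays a bare "o" and the combining mark gets an escape of its
own, which matches the sequence quoted earlier in this report.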

The advantages of making groff grok the "o\[u0308]" sequence are:

1. it follows the "Be liberal in what you accept" principle
2. preconv would not have to change
3. groff could handle arbitrary combinations for which the other three forms do
   not exist, such as
   * Spin\[u0308]al Tap

The drawback, as Branden elucidates it, is that it's significantly easier to
change preconv to output something groff already recognizes.
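
As a sketch of what that preconv-side change might amount to (an assumption on
my part, not a proposed patch), the Python below groups each base character with
the combining marks that follow it and emits a single \[uXXXX_YYYY] escape for
the group.  The function name composite_escape() is hypothetical, and it leans
on Python's unicodedata.combining() to decide what counts as a combining mark.

```python
import unicodedata

def composite_escape(text):
    """Emit one \\[uXXXX_YYYY...] escape per base character plus trailing marks."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        # Extend the cluster over any combining marks that follow text[i].
        j = i + 1
        while j < n and unicodedata.combining(text[j]):
            j += 1
        cluster = text[i:j]
        if len(cluster) == 1 and ord(cluster) < 0x80:
            out.append(cluster)  # lone ASCII character: pass through unchanged
        else:
            names = "_".join("%04X" % ord(c) for c in cluster)
            out.append("\\[u" + names + "]")
        i = j
    return "".join(out)

nfd = unicodedata.normalize("NFD", "co\u00f6perate")
print(composite_escape(nfd))  # co\[u006F_0308]perate
```

For the example input this produces the third of the forms listed earlier, which
groff already accepts, so groff itself would not need to change.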

In any case, the first step seems to be to define exactly what groff
should--and, if relevant, should not--accept as valid forms of Unicode-escape
input.




    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?59402>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/
