z/OS porting issues, UTF-8 support, and the groff man(1) page


From: G. Branden Robinson
Subject: z/OS porting issues, UTF-8 support, and the groff man(1) page
Date: Fri, 31 Mar 2023 10:57:16 -0500

[let me know if you're subscribed to the list or if you'd prefer not to
be CCed]

[also, if you want to break any of the several subjects arising in this
message into a separate thread, please feel free]

Hi Mike,

At 2023-03-31T07:29:16-0700, Mike Fulton wrote:
> Over the last year, we have been working hard in the z/OS Open Tools
> community (https://zosopentools.github.io/meta/#/) to not only port
> the fundamental tools to z/OS, but also to do it completely in the
> open.

This is good news!  Knowing that you're a software developer might also
make communications easier.  :)

> We create one 'port' repo for each Open Source package and the repo
> contains information on compiler options, dependencies, and so forth
> so that anyone can (relatively easily) build the software.

> We also have a special repo (meta) that has a rudimentary package
> manager and build tool that we use (e.g. _zopen install_ to install
> binaries, _zopen build_ to build from source, etc.).

Much as with GNU/Linux distributions; this is a pleasure to hear.

As a groff developer, I'm interested in minimizing the number of patches
you have to carry "downstream" to support groff.

I assume the change here:

https://github.com/ZOSOpenTools/groffport/blob/main/patches/makevarescape.sed.patch

is due to a limitation of the system's sed(1)?

If the problem is the '\+' part of the pattern, I see that POSIX says
that the interpretation of that is "implementation-defined", though the
latest draft of Issue 8 (just out in the past 24 hours or so) says that
"a future version of this standard may require "\?", "\+", and "\|" to
behave as described for the ERE special characters '?', '+', and '|',
respectively." (IEEE P1003.1™-202x/D3, March 2023, p. 181).

A workaround would be:

-s|[^ ]/\+|&\\\\:|g
+s|[^ ]//*|&\\\\:|g

If you also want to steal a slight improvement from groff 1.23, you can
do this instead:

-s|[^ ]/\+|&\\\\:|g
+s|[^ ]//*|&\\\\:\\\\%|g

> We have indeed moved to a 'UTF-8 first' model, which for the most part
> is a 'ISO8859-1 first' model

Interestingly, this meshes closely with groff's assumptions.  Due to its
chronological origins ca. 1990, it does not accept UTF-8 input, but it
is aware of UTF-8 and can produce it as output.  The formatter, troff(1),
accepts ISO Latin-1 input, except on systems where the C preprocessor
macro "IS_EBCDIC_HOST" evaluates true; it then assumes that its input is
encoded using code page 1047.

I reckon you've already dealt with this if necessary, and ensured that
your groff 1.22.4 build does not define that symbol.

Is code page 1047 deprecated or obsolescent on z/OS?  If groff dropped
support for it, do you suspect any z/OS users would be inconvenienced?

> and we have a special OS library that takes care of edge case
> conversions to EBCDIC (and provides a couple functions that are
> missing).  This is also Open Source (zoslib).

This is really good stuff to hear about; thanks for bringing this
initiative to my attention.

> We have about 80 packages we are porting / have ported. Some are very
> far along like gnu make and Perl with many fixes upstreamed. Some are
> just barely building - htop is probably a good example of one we have
> just started on.

I'm glad groff is a member of the first 100!  :D

> I am also not sure if we want to work in UTF-8 or in ISO-8859-1. My
> goal would be UTF-8 across the board, but I expect there are things we
> still need to fix to get there. Our vim port seems to work well with
> UTF-8 but I'll be honest that the testing of that is sparse still.

My suggestion would be to back the UTF-8 horse.  groff already has
machinery in place for accommodating input in UTF-8 via the preconv(1)
preprocessor.
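
As a rough sketch of what that machinery does, the following feeds one
UTF-8 encoded word through preconv(1).  It assumes preconv is on your PATH
and that it accepts "utf8" as an encoding name (the same spelling commonly
passed to groff's -K option); the function name is hypothetical.

```sh
# Hypothetical helper: turn UTF-8 input into text that troff(1) can read,
# with non-ASCII characters rewritten as groff special character escapes.
utf8_to_roff() {
    preconv -e utf8
}

# Only run the demonstration where preconv is actually installed.
if command -v preconv >/dev/null 2>&1; then
    # "café" in UTF-8, with the é written as octal byte escapes.
    printf 'caf\303\251\n' | utf8_to_roff
fi
```

I would expect the é to come out as a \[uXXXX]-style escape.  In normal
use you would not call preconv directly; "groff -k" runs it for you.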

If there is no longer an audience for code page 1047, several aspects of
groff could be simplified, and it might make the transition of GNU
troff's internal type to int32_t easier.  (I started down this road once
before.)

> With all that background, I'm wondering if 'both' is the right answer?

I don't feel qualified to answer this question in general; for groff,
it's a pickle because the original implementer (James Clark) used many
C0 and C1 control code points for internal purposes, to encode "node
types" that could be encountered internally by the formatter when
processing diversions (a Unix nroff/troff feature that usually only
authors of macro packages mess with).

You can see these assignments in the "input.h" header file.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h

Use of these codes for internal purposes isn't necessarily incompatible
with UTF-8 input; GNU troff already rejects them upon input, and almost
none of them are meaningful for a "plain text" document that is going to
achieve format control mostly via roff language features rather than
control characters.  Input processing could be made more sophisticated
(and more stateful when reading the input byte stream to keep track of
UTF-8 sequences).

> Would others also find it valuable to be able to have the mathematical
> angle brackets in UTF-8 be transliterated to angle brackets in
> ISO8859-1?

Unless you mean degradation to basic Latin less than and greater than
signs, U+003C and U+003E, then I don't think there are any valid
transliteration targets in ISO Latin-1.  The "left-" and "right-pointing
double angle quotation mark"s (U+00AB and U+00BB) are indeed visually
similar but semantically pretty distinct.  I don't think I'd want to
impose such a fallback in general.  (There are multiple ways groff users
could provide fallbacks for themselves.)

> If so, perhaps a 'starter fix' would be if I worked with the libiconv
> folks to see if that can be added (I opened a similar question in the
> libiconv channel since honestly I'm not sure the best way to fix
> this).

You can pursue both lines of attack independently, especially if the
iconv developers have a good reason for not performing this fallback
already.

I'm not sure groff has a good reason for not performing this fallback.
At this point I think I will tap Dave Kemper, another groff developer
who has a fairly strong interest in the fallback issue.

> In parallel, I think I need to understand how I could change the way I
> build man so that it operates in UTF-8 mode.

I think that is a good idea.  It looks like your man is man-db, which is
really good news because that's developed by Colin Watson who has also
been groff's package maintainer for Debian for a long time.

Probably the first thing to do is make sure we know what groff is
producing in your environment.

Here is how to (mostly) bypass man(1) and render the groff(1) man page
much as man(1) itself would do.

$ zcat $(man -w groff) | groff -man -Tutf8 | less -R

(If less(1) is not available, try "more", "more -b", or this:

$ zcat $(man -w groff) | groff -man -Tutf8 -P -c | ul | more

FYI: The version of "more" on my Debian system breaks lines at incorrect
places when given the above.)

Here, we are using man(1) only as a librarian, to tell us where the
groff(1) man page is.  We are directing formatting ourselves.

If this looks fine and you get the angle brackets you're expecting, then
something is running in the pipeline man-db man(1) constructs, _after_
grotty(1) produces the output, and doing violence to the angle brackets;
that would be where the bug lies.
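
If eyeballing the output is unreliable (a terminal font can mislead), a
byte-level check may help.  This is only a sketch: it assumes the brackets
in question are U+27E8 and U+27E9 and that your shell and grep(1) work on
ASCII/UTF-8 bytes; the function name is made up.

```sh
# Hypothetical helper: count input lines that contain U+27E8 or U+27E9,
# matched as raw UTF-8 bytes so the check does not depend on the locale.
count_math_angle_brackets() {
    lab=$(printf '\342\237\250')   # U+27E8 as UTF-8 (E2 9F A8)
    rab=$(printf '\342\237\251')   # U+27E9 as UTF-8 (E2 9F A9)
    LC_ALL=C grep -c -e "$lab" -e "$rab"
}

# Demonstration on canned input: the first line has both brackets, the
# second has only ASCII '<' and '>'.
printf '\342\237\250http://www.gnu.org\342\237\251\nplain <text>\n' |
    count_math_angle_brackets
```

You could then compare the count from the direct pipeline with the count
from man(1) itself:

$ zcat $(man -w groff) | groff -man -Tutf8 | count_math_angle_brackets
$ man groff | count_math_angle_brackets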

To cut out yet another source of trouble, if your terminal emulator has
more than 765 lines of scrollback buffer, you can omit paging the
groff(1) document entirely.

But if it _doesn't_ look fine, then we need to find out why.

I would next inspect groff's device-independent output (which I call
"grout" for short) to see what's being handed to groff's terminal output
driver (grotty(1)).

$ zcat $(man -w groff) | groff -man -Tutf8 -Z | less

Around line 459 you should see a sequence of lines like this.

tGNU
wh24
Cla
h24
thttp://www.gnu.org
Cra
h24
t.

Those "Cla" and "Cra" lines are key.  If they are absent, then you
have almost certainly found a bug in groff.
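
Rather than paging through the whole thing, you could count those lines
mechanically.  A minimal sketch, with a hypothetical function name; it
reads grout on standard input, which groff emits when given -Z (that
option tells it not to run the postprocessor).

```sh
# Hypothetical helper: count grout lines that are exactly "Cla" or "Cra",
# that is, 'C' commands naming the la and ra special characters.
count_angle_glyph_commands() {
    grep -c -x -e 'Cla' -e 'Cra'
}

# Demonstration on the excerpt shown above.
printf '%s\n' tGNU wh24 Cla h24 'thttp://www.gnu.org' Cra h24 t. |
    count_angle_glyph_commands
```

Against the real page, that would be:

$ zcat $(man -w groff) | groff -man -Tutf8 -Z | count_angle_glyph_commands

A count of zero would point at the formatter rather than at grotty(1) or
anything after it.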

Another thing I would do is to view the groff_char(7) man page.

$ man groff_char

On my system, code point coverage is complete except for three
characters.

troff: <standard input>:1051: warning: can't find special character 'bs'
troff: <standard input>:1192: warning: can't find special character 'radicalex'
troff: <standard input>:1195: warning: can't find special character 'sqrtex'

These problems are expected everywhere[1] for historical and technical
reasons I won't get into unless asked.
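
To compare your list of missing characters with mine, something like the
following sketch could reduce the warnings to bare glyph names.  The
function name is hypothetical, and the pattern assumes the warning wording
shown above.

```sh
# Hypothetical helper: pull the special character name out of each
# "can't find special character" warning and list each name once.
missing_glyphs() {
    sed -n "s/.*can't find special character '\([^']*\)'.*/\1/p" | sort -u
}

# Demonstration on the three warnings quoted above.
printf '%s\n' \
    "troff: <standard input>:1051: warning: can't find special character 'bs'" \
    "troff: <standard input>:1192: warning: can't find special character 'radicalex'" \
    "troff: <standard input>:1195: warning: can't find special character 'sqrtex'" |
    missing_glyphs
```

Since the warnings go to the standard error stream, a direct run would need
to redirect them into the pipe:

$ zcat $(man -w groff_char) | groff -man -Tutf8 2>&1 >/dev/null | missing_glyphs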

Let me know what you find and we'll see if we can narrow this down.

Regards,
Branden

[1] the first everywhere, the last two on all terminal devices
