Re: z/OS porting issues, UTF-8 support, and the groff man(1) page


From: Mike Fulton
Subject: Re: z/OS porting issues, UTF-8 support, and the groff man(1) page
Date: Fri, 31 Mar 2023 10:46:12 -0700

Hi

I think I should probably respond in the channel so the other folks in the
z/OS Open Tools community can see. I think I may have botched this a little.
Do you typically respond through email or do you use the web interface? I'm
probably not doing this quite right on my end...
I am not subscribed yet but I can - is that a better way to respond?

thanks, mike

On Fri, Mar 31, 2023 at 8:57 AM G. Branden Robinson <g.branden.robinson@gmail.com> wrote:

> [let me know if you're subscribed to the list or if you'd prefer not to
> be CCed]
>
> [also, if you want to break any of the several subjects arising in this
> message into a separate thread, please feel free]
>
> Hi Mike,
>
> At 2023-03-31T07:29:16-0700, Mike Fulton wrote:
> > Over the last year, we have been working hard in the z/OS Open Tools
> > community (https://zosopentools.github.io/meta/#/) to not only port
> > the fundamental tools to z/OS, but also to do it completely in the
> > open.
>
> This is good news!  Knowing that you're a software developer might also
> make communications easier.  :)
>
> > We create one 'port' repo for each Open Source package and the repo
> > contains information on compiler options, dependencies, and so forth
> > so that anyone can (relatively easily) build the software.
>
> > We also have a special repo (meta) that has a rudimentary package
> > manager and build tool that we use (e.g. _zopen install_ to install
> > binaries, _zopen build_ to build from source, etc.).
>
> Much as with GNU/Linux distributions; this is a pleasure to hear.
>
> As a groff developer, I'm interested in minimizing the number of patches
> you have to carry "downstream" to support groff.
>
> I assume the change here:
>
> https://github.com/ZOSOpenTools/groffport/blob/main/patches/makevarescape.sed.patch
>
> is due to a limitation of the system's sed(1)?
>
> If the problem is the '\+' part of the pattern, I see that POSIX says
> that the interpretation of that is "implementation-defined", though the
> latest draft of Issue 8 (just out in the past 24 hours or so) says that
> "a future version of this standard may require "\?", "\+", and "\|" to
> behave as described for the ERE special characters '?', '+', and '|',
> respectively." (IEEE P1003.1™-202x/D3, March 2023, p. 181).
>
> A workaround would be:
>
> -s|[^ ]/\+|&\\\\:|g
> +s|[^ ]//*|&\\\\:|g
>
> If you also want to steal a slight improvement from groff 1.23, you can
> do this instead:
>
> -s|[^ ]/\+|&\\\\:|g
> +s|[^ ]//*|&\\\\:\\\\%|g
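
A quick way to check that the portable pattern behaves as intended is to run it over a sample string. This is only a sketch: the function name mark_breaks and the sample path are made up for illustration, and the replacement uses a single level of escaping (the real script lives in a makefile and needs more backslashes).

```shell
# Hypothetical helper: append the groff break-point escape \: after each
# run of slashes that follows a non-space character.
# "//*" is plain BRE for "one slash, then zero or more slashes", which is
# what GNU sed accepts as "/\+".
mark_breaks() {
    sed 's|[^ ]//*|&\\:|g'
}

# Sample input (an assumption, not taken from the groff build).
printf '%s\n' '/usr//local/share/groff' | mark_breaks
```

With any POSIX sed this should print /usr//\:local/\:share/\:groff; the leading slash is left alone because no non-space character precedes it.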
>
> > We have indeed moved to a 'UTF-8 first' model, which for the most part
> > is an 'ISO8859-1 first' model
>
> Interestingly, this meshes closely with groff's assumptions.  Because it
> dates from ca. 1990, it does not accept UTF-8 input, but it is aware of
> UTF-8 and can produce it as output.  The formatter, troff(1),
> accepts ISO Latin-1 input, except on systems where the C preprocessor
> macro "IS_EBCDIC_HOST" evaluates true; it then assumes that its input is
> encoded using code page 1047.
>
> I reckon you've already dealt with this if necessary, and ensured that
> your groff 1.22.4 build does not define that symbol.
>
> Is code page 1047 deprecated or obsolescent on z/OS?  If groff dropped
> support for it, do you suspect any z/OS users would be inconvenienced?
>
> > and we have a special OS library that takes care of edge case
> > conversions to EBCDIC (and provides a couple functions that are
> > missing).  This is also Open Source (zoslib).
>
> This is really good stuff to hear about; thanks for bringing this
> initiative to my attention.
>
> > We have about 80 packages we are porting / have ported. Some are very
> > far along like GNU make and Perl with many fixes upstreamed. Some are
> > just barely building - htop is probably a good example of one we have
> > just started on.
>
> I'm glad groff is a member of the first 100!  :D
>
> > I am also not sure if we want to work in UTF-8 or in ISO-8859-1. My
> > goal would be UTF-8 across the board, but I expect there are things we
> > still need to fix to get there. Our vim port seems to work well with
> > UTF-8 but I'll be honest that the testing of that is sparse still.
>
> My suggestion would be to back the UTF-8 horse.  groff already has
> machinery in place for accommodating input in UTF-8 via the preconv(1)
> preprocessor.
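
To see what preconv(1) hands to the formatter, something like the following may help. It is a sketch under assumptions: the function name utf8_to_groff_escapes is invented here, and it presumes preconv is installed alongside groff and that the input really is UTF-8.

```shell
# Hypothetical helper: convert UTF-8 input (stdin or named files) into
# the ASCII form that troff accepts, with non-ASCII characters rewritten
# as \[uXXXX] special character escapes.
utf8_to_groff_escapes() {
    preconv -e UTF-8 "$@"
}

# Typical use (commented out so the block has no side effects):
#   utf8_to_groff_escapes page.1 | groff -man -Tutf8
# groff can also run preconv itself when given -k, or -K with an
# encoding name:
#   groff -K UTF-8 -man -Tutf8 page.1
```

Piping a line containing "é" through the helper should show the character replaced by an escape such as \[u00E9], which the Latin-1-only formatter can then read.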
>
> If there is no longer an audience for code page 1047, several aspects of
> groff could be simplified, and it might make the transition of GNU
> troff's internal type to int32_t easier.  (I started down this road once
> before.)
>
> > With all that background, I'm wondering if 'both' is the right answer?
>
> I don't feel qualified to answer this question in general; for groff,
> it's a pickle because the original implementer (James Clark) used many
> C0 and C1 control code points for internal purposes, to encode "node
> types" that could be encountered internally by the formatter when
> processing diversions (a Unix nroff/troff feature that usually only
> authors of macro packages mess with).
>
> You can see these assignments in the "input.h" header file.
>
> https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h
>
> Use of these codes for internal purposes isn't necessarily incompatible
> with UTF-8 input; GNU troff already rejects them upon input, and almost
> none of them are meaningful for a "plain text" document that is going to
> achieve format control mostly via roff language features rather than
> control characters.  Input processing could be made more sophisticated
> (and more stateful when reading the input byte stream to keep track of
> UTF-8 sequences).
>
> > Would others also find it valuable to be able to have the mathematical
> > angle brackets in UTF-8 be transliterated to angle brackets in
> > ISO8859-1?
>
> Unless you mean degradation to basic Latin less than and greater than
> signs, U+003C and U+003E, then I don't think there are any valid
> transliteration targets in ISO Latin-1.  The "left-" and "right-pointing
> double angle quotation mark"s (U+00AB and U+00BB) are indeed visually
> similar but semantically pretty distinct.  I don't think I'd want to
> impose such a fallback in general.  (There are multiple ways groff users
> could provide fallbacks for themselves.)
>
> > If so, perhaps a 'starter fix' would be if I worked with the libiconv
> > folks to see if that can be added (I opened a similar question in the
> > libiconv channel since honestly I'm not sure the best way to fix
> > this).
>
> You can pursue both lines of attack independently, especially if the
> iconv developers have a good reason for not performing this fallback
> already.
>
> I'm not sure groff has a good reason for not performing this fallback.
> At this point I think I will tap Dave Kemper, another groff developer
> who has a fairly strong interest in the fallback issue.
>
> > In parallel, I think I need to understand how I could change the way I
> > build man so that it operates in UTF-8 mode.
>
> I think that is a good idea.  It looks like your man is man-db, which is
> really good news because that's developed by Colin Watson who has also
> been groff's package maintainer for Debian for a long time.
>
> Probably the first thing to do is make sure we know what groff is
> producing in your environment.
>
> Here is how to (mostly) bypass man(1) and render the groff(1) man page
> much as man(1) itself would do.
>
> $ zcat $(man -w groff) | groff -man -Tutf8 | less -R
>
> (If less(1) is not available, try "more", "more -b", or this:
>
> $ zcat $(man -w groff) | groff -man -Tutf8 -P -c | ul | more
>
> FYI: The version of "more" on my Debian system breaks lines at incorrect
> places when given the above.)
>
> Here, we are using man(1) only as a librarian, to tell us where the
> groff(1) man page is.  We are directing formatting ourselves.
>
> If this looks fine and you get the angle brackets you're expecting, then
> something is running in the pipeline man-db man(1) constructs, _after_
> grotty(1) produces the output, and doing violence to the angle brackets;
> that would be where the bug lies.
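
If eyeballing the pager output is unreliable, the comparison can be made mechanical. The sketch below counts output lines that contain the mathematical angle brackets; the function name is invented, and it assumes the utf8 device renders the brackets as U+27E8 and U+27E9.

```shell
# Hypothetical helper: count the lines on stdin that contain U+27E8 or
# U+27E9.  The patterns are built from octal bytes so that this script
# itself stays pure ASCII.
count_angle_bracket_lines() {
    la=$(printf '\342\237\250')   # U+27E8 encoded as UTF-8
    ra=$(printf '\342\237\251')   # U+27E9 encoded as UTF-8
    grep -c -F -e "$la" -e "$ra"
}

# Compare the direct rendering with what man(1) produces
# (commented out; both depend on the local installation):
#   zcat "$(man -w groff)" | groff -man -Tutf8 | count_angle_bracket_lines
#   man groff | count_angle_bracket_lines
```

A nonzero count from the first pipeline and zero from the second would point at something man(1) runs after grotty(1).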
>
> To cut out yet another source of trouble, if your terminal emulator has
> more than 765 lines of scrollback buffer, you can omit paging the
> groff(1) document entirely.
>
> But if it _doesn't_ look fine, then we need to find out why.
>
> I would next inspect groff's device-independent output (which I call
> "grout" for short) to see what's being handed to groff's terminal output
> driver (grotty(1)).
>
> $ zcat $(man -w groff) | groff -man -Tutf8 -Z | less
>
> Around line 459 you should see a sequence of lines like this.
>
> tGNU
> wh24
> Cla
> h24
> thttp://www.gnu.org
> Cra
> h24
> t.
>
> Those "Cla" and "Cra" lines are key.  If they are absent, then you
> have almost certainly found a bug in groff.
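
Rather than paging to the right spot by hand, the two commands can be searched for directly. This is a sketch; the function name find_bracket_glyphs is invented, and the line numbers it reports will differ between groff versions.

```shell
# Hypothetical helper: list, with line numbers, every line of
# device-independent output on stdin that is exactly "Cla" or "Cra",
# i.e. a request to typeset the special character "la" or "ra".
find_bracket_glyphs() {
    grep -n -x -e 'Cla' -e 'Cra'
}

# Typical use (commented out; -Z makes groff stop before the output
# driver so that the intermediate output is what gets searched):
#   zcat "$(man -w groff)" | groff -man -Tutf8 -Z | find_bracket_glyphs
```

Fed the eight sample lines shown above, the helper reports 3:Cla and 6:Cra; no output at all means the formatter never emitted the glyphs.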
>
> Another thing I would do is to view the groff_char(7) man page.
>
> $ man groff_char
>
> On my system, code point coverage is complete except for three
> characters.
>
> troff: <standard input>:1051: warning: can't find special character 'bs'
> troff: <standard input>:1192: warning: can't find special character 'radicalex'
> troff: <standard input>:1195: warning: can't find special character 'sqrtex'
>
> These problems are expected everywhere[1] for historical and technical
> reasons I won't get into unless asked.
>
> Let me know what you find and we'll see if we can narrow this down.
>
> Regards,
> Branden
>
> [1] the first everywhere, the last two on all terminal devices
>