Re: z/OS porting issues, UTF-8 support, and the groff man(1) page


From: G. Branden Robinson
Subject: Re: z/OS porting issues, UTF-8 support, and the groff man(1) page
Date: Fri, 31 Mar 2023 16:55:40 -0500

[adding Dave to CC; seek your name below for my magical summons]

At 2023-03-31T13:05:09-0700, Mike Fulton wrote:
> On Fri, Mar 31, 2023 at 8:57 AM G. Branden Robinson <
> > As a groff developer, I'm interested in minimizing the number of
> > patches you have to carry "downstream" to support groff.
> >
> Definitely - I have not yet been able to build with the 'git' dev
> build but instead have been building from the tarball. I was planning
> to work to upstream changes once I had the 'git' build working (we are
> getting there now that we have more tools in place - it's a circuitous
> process!)

When you're ready to make that shift, be sure to read the "INSTALL.REPO"
file in the root of the repository or distribution archive.

> > I assume the change here:
> >
> > https://github.com/ZOSOpenTools/groffport/blob/main/patches/makevarescape.sed.patch
> >
> > is due to a limitation of the system's sed(1)?
> >
> Yes - that is the change. No - it's not because of sed. We have ported
> sed and could rely on it as a dependency. The issue we hit is a bit
> ugly.  Because z/OS is a 'multi-tenant' operating system, we want
> people to be able to install into a particular location of their
> choice (either as developer _or_ as a consumer of the binary).

...without a recompile, I assume?

> To make that work, we run a post-process on the files when someone
> downloads them to change the install 'root' location from where we
> built the code to the target location they want to install into.  It's
> ugly and we end up doing a find across files to do this trick. If that
> 'sed' change is in there, we end up 'missing' some particular updates
> because the string gets changed on us for the 'root' and so I took out
> that sed update (a complete hack that I need to do better).

Ah.  Hmm.  I can think of a better way, although it won't (completely)
help groff 1.22.4.

For groff 1.23, I revised our man pages to be much more careful about
documenting full file specifications to groff-installed files and to
compute their values based on the build's configuration
parameters--stuff like "./configure --prefix=/home/foobar".

Something I think you could do starting with the 1.23.0 release
candidates--if you keep the groff build tree around somewhere--is to
perform your sed operation on all the *.man files in the source tree
(and build tree, if it is separate), sniping any of the existing fodder
for sed replacement that you find appropriate.

To be concrete, I'm talking about this stuff:

https://git.savannah.gnu.org/cgit/groff.git/tree/Makefile.am?id=e3824d611be904bad22176f4f4eb282a5352509d#n864

So your multi-tenancy assistance script could do something like this:

MANS=$(find groff-source-dir groff-build-dir -name "*.man")
sed -i 's#@BINDIR@#'"$TENANT_HOME"'/bin#g' $MANS
cd groff-build-dir
make man-all # You can thank Keith Marshall for suggesting this.

...and as Emeril Lagasse would say, "bam!"  The pages will be
regenerated with correct file specifications with no cumbersome
workarounds.  And thanks to makevarescape.sed, if the file names wind up
being long, they'll break in pleasant locations and won't be hyphenated.

Or so I predict, not having actually done this concretely.
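
A slightly more defensive variant of the substitution step, as a sketch
only: it avoids "sed -i" (not every sed(1) has it) and copes with
spaces in file names.  The scratch tree, the demo file, and the
TENANT_HOME value below are hypothetical stand-ins for a real groff
source or build tree, and it assumes TENANT_HOME contains no "#" or "&"
characters, which would confuse the sed expression.

```shell
#!/bin/sh
# Sketch only: a throwaway tree stands in for groff-source-dir.
work=$(mktemp -d)
TENANT_HOME=/opt/tenant1   # hypothetical install root

# Fake a *.man source carrying a placeholder, under a directory
# whose name contains a space.
mkdir -p "$work/src/man dir"
printf '.I @BINDIR@/groff\n' > "$work/src/man dir/demo.1.man"

# Rewrite each file through a temporary copy instead of "sed -i".
find "$work/src" -name '*.man' -type f | while IFS= read -r f; do
    sed "s#@BINDIR@#$TENANT_HOME/bin#g" "$f" > "$f.tmp" &&
        mv "$f.tmp" "$f"
done

result=$(cat "$work/src/man dir/demo.1.man")
printf '%s\n' "$result"
rm -rf "$work"
```

The same loop could be extended with further "-e" expressions for
whichever other placeholders from Makefile.am you need to relocate.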

If you're wondering why you need to search both the build and source
directories for .man documents, that's my fault.

https://git.savannah.gnu.org/cgit/groff.git/commit/?id=31536c517dfe49b4e4a715a732f76b701531e90a

> > Interestingly, this meshes closely with groff's assumptions.  Due to
> > its chronological origins ca. 1990, it does not accept UTF-8 input,
> > but it is aware of UTF-8 and can produce it as output.  The formatter,
> > troff(1), accepts ISO Latin-1 input, except on systems where the C
> > preprocessor macro "IS_EBCDIC_HOST" evaluates true; it then assumes
> > that its input is encoded using code page 1047.
> >
> From my perspective, we can drop support for 1047 altogether. However,
> I don't know if someone else has done their own 'separate' port. I
> haven't seen it if there is one.  Correct. I don't set that symbol.

Ooh, this is tempting.  Can you tell me if "OS/390 Unix" is the same
product as "z/OS"?  Or, if not, if such a thing as "OS/390 Unix" is
still supported?  I apologize for not knowing much about IBM operating
systems.  (I've heard wonderful stories about SMIT, though...)

> > I reckon you've already dealt with this if necessary, and ensured
> > that your groff 1.22.4 build does not define that symbol.
> >
> > Is code page 1047 deprecated or obsolescent on z/OS?  If groff
> > dropped support for it, do you suspect any z/OS users would be
> > inconvenienced?
> >
> I would say neither. An application can choose whether it wants to work in
> UTF-8/ASCII or whether it wants to work in EBCDIC (or both if it's careful).
> I wrote a blog post on this a while back:
> https://makingdeveloperslivesbetter.wordpress.com/2022/01/07/is-z-os-ascii-or-ebcdic-yes/

It looks like what's going on here is that z/OS has metadata available
for any file of interest to a Unix-like environment that tags a given
file as ISO 8859-1- or EBCDIC-encoded (if it has to be interpreted as a
character stream encoded using a single byte).

I presume there are facilities to permute the encodings (since ISO
8859-1 and code page 1047 are equivalent except for ordering)
dynamically as well as statically; for the latter you recommend iconv.

So, instead of maintaining groff's own facilities to interpret code page
1047 input, we would simply advise affected users to (convert and) tag
their input files with z/OS's "chtag" command.

This would indeed make possible a nice simplification to GNU troff's
input processing.
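
For illustration, the convert-and-tag step might look roughly like the
following.  This is a sketch under assumptions: the file names are
made up, the codeset spellings "IBM-1047" and "ISO8859-1" are the ones
I believe z/OS uses but other iconv implementations may spell them
differently (or lack code page 1047 entirely), and the chtag
invocation is my understanding of its "-t" (text) and "-c" (codeset)
options rather than something I have tested.

```shell
#!/bin/sh
# Sketch only: input.1047 and output.latin1 are hypothetical names.
work=$(mktemp -d)

# Manufacture a one-byte code page 1047 file, if iconv knows how.
if printf 'A' | iconv -f ISO8859-1 -t IBM-1047 \
        > "$work/input.1047" 2>/dev/null
then
    supported=yes
    # 'A' is byte 0xC1 in code page 1047.
    hex=$(od -An -tx1 "$work/input.1047" | tr -d ' \n')
    # Convert to ISO 8859-1, which troff accepts as input.
    iconv -f IBM-1047 -t ISO8859-1 "$work/input.1047" \
        > "$work/output.latin1"
    text=$(cat "$work/output.latin1")
    # z/OS only: record the encoding in the file's tag.
    if command -v chtag >/dev/null 2>&1; then
        chtag -tc ISO8859-1 "$work/output.latin1"
    fi
else
    supported=no
fi
echo "IBM-1047 conversion supported: $supported"
rm -rf "$work"
```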

I do not yet assume it would be wise to kill off grotty(1)'s support for
generating code page 1047 _output_...but maybe we can.  Is it possible
to configure the environment on z/OS such that that is the case?  How do
you spell the standard C locale variables for this scenario?

"LC_ALL=en_US.EBCDIC"?

This may be important for ensuring that we keep nroff(1) working.

> > If there is no longer an audience for code page 1047, several
> > aspects of groff could be simplified, and it might make the
> > transition of GNU troff's internal type to int32_t easier.  (I
> > started down this road once before.)
>
> This makes sense to me. I know for Perl, we made sure to keep EBCDIC
> there, but the z/OS Open Tools community doesn't build with EBCDIC.

I think for groff the main win will be to make it easier for people to
learn and contribute to the project without this additional layer of
translation in input processing (at least).  The significant challenges
of coping nicely with UTF-8 input were going to be there anyway, arising
from the narrow-character architecture.

> > > Would others also find it valuable to be able to have the
> > > mathematical angle brackets in UTF-8 be transliterated to angle
> > > brackets in ISO8859-1?
> >
> > Unless you mean degradation to basic Latin less than and greater
> > than signs, U+003C and U+003E, then I don't think there are any
> > valid transliteration targets in ISO Latin-1.  The "left-" and
> > "right-pointing double angle quotation mark"s (U+00AB and U+00BB)
> > are indeed visually similar but semantically pretty distinct.  I
> > don't think I'd want to impose such a fallback in general.  (There
> > are multiple ways groff users could provide fallbacks for
> > themselves.)
>
> Fair enough!
> 
> > > If so, perhaps a 'starter fix' would be if I worked with the
> > > libiconv folks to see if that can be added (I opened a similar
> > > question in the libiconv channel since honestly I'm not sure the
> > > best way to fix this).
> >
> > You can pursue both lines of attack independently, especially if the
> > iconv developers have a good reason for not performing this fallback
> > already.
> >
> > I'm not sure groff has a good reason for not performing this
> > fallback.  At this point I think I will tap Dave Kemper, another
> > groff developer who has a fairly strong interest in the fallback
> > issue.
>
> Thank you.

Dave, what do you think about fallbacks for \(la and \(ra?
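
For anyone following along, one user-level approach is sketched below.
It uses the "char" request to redefine \(la and \(ra as the basic
Latin less-than and greater-than signs for a single document.  Strictly
speaking this is an override rather than a fallback: "fchar" is the
fallback request, but it takes effect only when the font lacks the
glyph, which is not the case on the utf8 device.  The input text is
just an example.

```shell
#!/bin/sh
# Sketch only: overrides \(la and \(ra with ASCII < and > in one
# document; nothing here changes groff's installed defaults.
if command -v groff >/dev/null 2>&1; then
    have_groff=yes
    out=$(printf '%s\n' \
        '.char \[la] <' \
        '.char \[ra] >' \
        '\[la]http://www.gnu.org\[ra]' | groff -Tutf8)
else
    have_groff=no
    out=
fi
# Show the formatted line containing the bracketed URL, if any.
printf '%s\n' "$out" | grep '<http' || true
```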

> > To cut out yet another source of trouble, if your terminal emulator
> > has more than 765 lines of scrollback buffer, you can omit paging
> > the groff(1) document entirely.
>
> I did this and it _does_ look good! When I ran it through less -R I
> did hit problems with the angled brackets - that may be an issue with
> less.

Okay--let us know if the problem returns to the groff court.

> > I would next inspect groff's device-independent output (which I call
> > "grout" for short) to see what's being handed to groff's terminal
> > output driver (grotty(1)).
> >
> > $ zcat $(man -w groff) | groff -man -Tutf8 | less

I forgot an important part here.

$ zcat $(man -w groff) | groff -Z -man -Tutf8 | less

Gotta have that "-Z" flag.

> > Around line 459 you should see a sequence of lines like this.
> >
> > tGNU
> > wh24
> > Cla
> > h24
> > thttp://www.gnu.org
> > Cra
> > h24
> > t.
> >
> > Those "Cla" and "Cra" lines are key.  If they are absent, then you
> > have almost certainly found a bug in groff.
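
A quick way to look for those lines without paging through the whole
document is to count them with grep.  The sketch below feeds a
two-glyph test input to groff instead of the groff(1) page, so it is
only a stand-in for the real check; substitute the zcat pipeline (with
"-Z" and "-man") to inspect the actual man page.

```shell
#!/bin/sh
# Sketch only: a tiny test input replaces the groff(1) man page.
if command -v groff >/dev/null 2>&1; then
    have_groff=yes
    # -Z stops after troff, so we see "grout", not terminal output.
    count=$(printf '%s\n' '\(la http://www.gnu.org \(ra' |
        groff -Z -Tutf8 | grep -c '^C[lr]a$')
else
    have_groff=no
    count=0
fi
echo "groff available: $have_groff; Cla/Cra lines found: $count"
```

A nonzero count means troff is handing the angle brackets to the
output driver as special characters, so any remaining trouble lies
further down the pipeline.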

> > Another thing I would do is to view the groff_char(7) man page.
> >
> > $ man groff_char
>
> I don't get warnings here, but the Output and Input columns under:
> 8-bit Character Codes 160 to 255
> are all
> �        �

Don't worry about that.  The man page in groff 1.22.4 is wrong in that
respect.  It's fixed in the groff 1.23.0 release candidates.

https://git.savannah.gnu.org/cgit/groff.git/commit/?id=3e583c9541e4f764c175d7507a9aea1f8eeaaa55

Regards,
Branden
