Greek and Russian in Groff


From: G. Branden Robinson
Subject: Greek and Russian in Groff
Date: Sat, 25 Mar 2023 03:04:27 -0500

At 2023-03-24T19:31:45+0100, Oliver Corff wrote:
> Hi Branden,
> 
> thank you for your detailed reply. I'll try your examples over the
> weekend.

Looking forward to it.  Remember, as Deri pointed out, you'll need good
fonts for Greek, which the URW faces aren't, so you will likely want to
employ Peter Schaffter's "install-font.sh" script.

https://www.schaffter.ca/mom/momdoc/appendices.html#install-font

> The main reasons why I thought of suggesting the ISO 8859-7 encoding
> instead of native Unicode were twofold:
> 
> 1. The example seen on (was it) reddit (?)

Yes, I think so.

https://www.reddit.com/r/groff/comments/112tfqv/support_for_greek_in_groff/

> looked botched, like
> typical 8 bit output to a system unaware of the specific encoding.
> Certainly it looked "greek", but was not Greek.

Well, yes, but I think it would take an attentive reader of mojibake to
distinguish UTF-8 rendered as ISO Latin-1 vs. ISO Latin/Greek rendered
as Latin-1.

To the eye expecting familiar scripts, both look pretty "baked".

> 2. This may be due to a fault in my understanding of groff: I am well
> aware of the existence of preconv(1), I was believing that the
> character commands which preconv uses in its output (instead of native
> Unicode codepoints) would cause hiccups for the hyphenation algorithm.

Only in theory.  What will happen is that groff simply won't hyphenate
Greek text, since none of the characters in the script have hyphenation
codes.  Gory details are in our Texinfo manual, §5.10, "Manipulating
Hyphenation".

It's more important for non-English languages using the Latin script to
set up (or disable) hyphenation, because if they don't, GNU troff will
hyphenate their language as if it were English, pleasing no one.
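
A minimal sketch of those two options, assuming a document in French and
an installation that ships the "fr.tmac" localization file (French is
chosen purely for illustration):

```groff
.\" Option 1: load the localization macro file, which selects the
.\" hyphenation language and its pattern files.  Passing -mfr on the
.\" command line has much the same effect.
.mso fr.tmac
.
.\" Option 2 (use instead of option 1 when no patterns exist for the
.\" language): switch automatic hyphenation off, so that English
.\" rules are not applied to the text.
.nh
```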

With the Greek script, GNU troff will make no such attempt, until we
add real support with el.tmac and hyphen.el and so forth; the recent
commits of Russian and Spanish localization to the post-1.23.0 branch in
Git may be instructive examples.

https://git.savannah.gnu.org/cgit/groff.git/log/?h=post-1.23.0

> I'll search for a suitable font, and then try again, because once
> there is a path for Greek (preferably, of course, in Unicode), then
> there will also be a path for all languages using the Cyrillic
> alphabet (which occurs frequently in my work).

Some of the Cyrillic work has been done, though there is still the
matter of supporting proper typesetting by recommending appropriate
fonts.

> So far, I've been using XeLaTeX, but it would be really nice to have
> clean straightforward document processing for these character sets in
> groff.

I entirely agree.

The font issue will continue to bedevil us.  Even with Russian language
support, typeset output is poor when the fonts lack the needed glyphs,
and we have no control over which fonts the user has available.

For instance, I constructed the following document.

[UTF-8 ahead!]

$ cat ATTIC/makhnovshchina.groff
Гуляйпольщина (укр. Гуляйпольщина) или Махновщина, также Вольная
Территория — повстанческий район в Северном Приазовье в период
Гражданской войны 1918—1921 гг.

Обладала фактической самостоятельностью в период с ноября 1918 года до
начала операции по ликвидации «партизанщины» советским правительством в
1921 году. Оборону территории осуществляли повстанческие отряды Нестора
Махно, с февраля 1919 года входившие в регулярные части Украинской
советской армии.

Центром махновского движения было село Гуляйполе Александровского уезда
Екатеринославской губернии, где располагались штаб Махно и
Военно-революционный совет Гуляйпольского района.

В результате весенне-летнего (1919 год) наступления Деникина на Москву
махновцы были вынуждены либо уйти из родных мест с Красной армией, либо
перейти на нелегальное положение. В начале сентября Махно объявил о
создании Революционной повстанческой армии Украины (РПАУ) и возобновил
самостоятельные боевые действия.
.pl \n[nl]u

Using groff Git's "post-1.23.0" branch, I formatted it for the terminal.

$ ./build/test-groff -k -mru -Tutf8 ATTIC/makhnovshchina.groff
Гуляйпольщина  (укр. Гуляйпольщина) или Махновщина, также Вольная
Территория — повстанческий район в Северном  Приазовье  в  период
Гражданской войны 1918—1921 гг.

Обладала  фактической  самостоятельностью  в период с ноября 1918
года до начала операции по  ликвидации  «партизанщины»  советским
правительством  в  1921 году. Оборону территории осуществляли по‐
встанческие отряды Нестора Махно, с февраля 1919 года входившие в
регулярные части Украинской советской армии.

Центром махновского движения было село Гуляйполе Александровского
уезда Екатеринославской губернии, где располагались штаб Махно  и
Военно‐революционный совет Гуляйпольского района.

В  результате  весенне‐летнего (1919 год) наступления Деникина на
Москву махновцы были вынуждены либо уйти из родных мест с Красной
армией, либо перейти на нелегальное положение. В начале  сентября
Махно объявил о создании Революционной повстанческой армии Украи‐
ны (РПАУ) и возобновил самостоятельные боевые действия.

Here we can observe that hyphenation appears to be working.  I have no
idea if it's _correct_, as I have no knowledge of Russian hyphenation
rules, but it wouldn't be happening at all if the new "hyphen.ru" file
were not being interpreted, so I will trust in the authors of the
patterns.  :)

Typesetting the foregoing is a bit dispiriting.

$ ./build/test-groff -k -mru -Tps -z ATTIC/makhnovshchina.groff
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0413' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0443' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u043B' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u044F' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0438_0306' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u043F' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u043E' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u044C' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0449' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0438' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u043D' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0430' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u043A' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0440' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u041C' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0445' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0432' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0442' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0436' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0435' not defined
troff:ATTIC/makhnovshchina.groff:1: warning: special character 'u0412' not defined
troff:ATTIC/makhnovshchina.groff:2: warning: special character 'u0422' not defined
troff:ATTIC/makhnovshchina.groff:2: warning: special character 'u0441' not defined
troff:ATTIC/makhnovshchina.groff:2: warning: special character 'u0447' not defined
troff:ATTIC/makhnovshchina.groff:2: warning: special character 'u0421' not defined
troff:ATTIC/makhnovshchina.groff:2: warning: special character 'u043C' not defined
troff:ATTIC/makhnovshchina.groff:2: warning: special character 'u041F' not defined
troff:ATTIC/makhnovshchina.groff:2: warning: special character 'u0437' not defined
troff:ATTIC/makhnovshchina.groff:2: warning: special character 'u0434' not defined
troff:ATTIC/makhnovshchina.groff:3: warning: special character 'u044B' not defined
troff:ATTIC/makhnovshchina.groff:3: warning: special character 'u0433' not defined
troff:ATTIC/makhnovshchina.groff:5: warning: special character 'u041E' not defined
troff:ATTIC/makhnovshchina.groff:5: warning: special character 'u0431' not defined
troff:ATTIC/makhnovshchina.groff:5: warning: special character 'u0444' not defined
troff:ATTIC/makhnovshchina.groff:5: warning: special character 'u044E' not defined
troff:ATTIC/makhnovshchina.groff:6: warning: special character 'u0446' not defined
troff:ATTIC/makhnovshchina.groff:7: warning: special character 'u041D' not defined
troff:ATTIC/makhnovshchina.groff:8: warning: special character 'u0448' not defined
troff:ATTIC/makhnovshchina.groff:8: warning: special character 'u0423' not defined
troff:ATTIC/makhnovshchina.groff:11: warning: special character 'u0426' not defined
troff:ATTIC/makhnovshchina.groff:11: warning: special character 'u0410' not defined
troff:ATTIC/makhnovshchina.groff:12: warning: special character 'u0415' not defined
troff:ATTIC/makhnovshchina.groff:15: warning: special character 'u0414' not defined
troff:ATTIC/makhnovshchina.groff:16: warning: special character 'u041A' not defined
troff:ATTIC/makhnovshchina.groff:17: warning: special character 'u044A' not defined
troff:ATTIC/makhnovshchina.groff:18: warning: special character 'u0420' not defined

This is the formatter stubbing its toe on every Cyrillic code point that
appears in the document (warning once for each), because we don't have
definitions for these glyphs in the font description files for the
PostScript device.  PDF doesn't fare any better.

One of the things we might do for localization packages is add some
logic to them testing for the existence of essential glyphs.

As a proof of concept, I can patch tmac/ru.tmac as follows:

$ git diff
diff --git a/tmac/ru.tmac b/tmac/ru.tmac
index 537109d84..daceaff27 100644
--- a/tmac/ru.tmac
+++ b/tmac/ru.tmac
@@ -24,6 +24,9 @@
 .do nr *groff_ru_tmac_C \n[.cp]
 .cp 0
 .
+.if !c \[u0411] \
+.  ab ru.tmac: Russian script unavailable; no glyph for U+0411
+.
 .
 .\" If changing from an existing locale, we need to preserve the state
 .\" of the "suppress hyphenation before a page location trap" bit.

...and the result is what we expect.

$ ./build/test-groff -k -mru -Tpdf -P -y -P U -z ATTIC/makhnovshchina.groff
ru.tmac: Russian script unavailable; no glyph for U+0411

There are several decisions to be made about this sort of feature test.

1.  Why check U+0411 in particular?  (It's part of the basic standard
    Cyrillic alphabet and has no homoglyph in Latin.)  The choice might
    not matter; any font that claims Cyrillic script support might claim
    to have glyphs for U+0410 ("A") and U+0412 ("B"), even if under the
    hood it aliases them to the shapes for U+0041 and U+0042.

2.  Should we test more characters?  We could have an `if` request for
    each one of interest, but that would rapidly become lengthy.  We
    might consider an extension to the 'c' conditional expression
    predicate such that it will test each ordinary or special character
    in its argument in turn, returning true only if _all_ characters are
    resolvable.  This would make checking a large set of glyphs less
    garrulous.  (At present, it simply ignores all characters after the
    first, while still recognizing special character escape sequence
    syntax.)

3.  There are limits to the foregoing approach.  It probably doesn't
    make sense to have zh.tmac check for every Han character in the BMP.
    Maybe some representative sample could be selected.  By contrast,
    for the alphabetic scripts I'm familiar with, any document of
    sufficient length will be a pangram.  Unicode defines rare and
    historical characters for a large variety of scripts; my guess is
    that it's worth encouraging individual documents to test those.
    Those who help us craft localization macro files are well-placed to
    know which glyphs from their script are truly essential to
    typography in their language.  (There's a story about ISO 8859-1
    involving a Groupe Bull delegate to the standardization conference,
    sealing the fate of the characters œ, Œ, and Ÿ.  My guess is that
    this person hated DEC and didn't want to see DEC's MCS "christened"
    as an ISO standard.  Attached for the curious.)

4.  Should we really abort in this circumstance or just diagnose?
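
To make items 2 and 4 concrete, here is a rough sketch of what a
localization file or an individual document could do today, without any
extension to the 'c' predicate: loop over a list of code points and
diagnose each missing glyph instead of aborting on the first.  The macro
name "check-glyphs", its register, and the sample code points are
hypothetical and appear nowhere in groff; this is an illustration under
those assumptions, not a proposed patch.

```groff
.\" Hypothetical helper; not part of groff or ru.tmac.
.\" Arguments: code points as four uppercase hexadecimal digits,
.\" matching groff's uXXXX special character names (e.g., 0411).
.de check-glyphs
.  nr check-glyphs*missing 0
.  while \\n[.$] \{\
.    if !c \[u\\$1] \{\
.      tm check-glyphs: no glyph for U+\\$1
.      nr check-glyphs*missing +1
.    \}
.    shift
.  \}
.  if \\n[check-glyphs*missing] \{\
.    tm check-glyphs: \\n[check-glyphs*missing] glyph(s) unavailable
.  \}
..
.\" Sample call with a few Cyrillic letters, chosen arbitrarily.
.check-glyphs 0411 0416 0444 044F
```

Replacing the final `tm` with `ab` would turn the diagnosis into an
abort, as in the ru.tmac proof of concept above.  Like that test, the
result depends on the fonts available to the selected output device.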

Those are some of the issues on my mind as we consider improving groff's
localization support.  I think helping the user figure out if they're
lacking font coverage for their script is a key element in reducing
their misery and traffic to support channels.

Regards,
Branden

Attachment: How Groupe Bull Screwed Latin-1_CG_1996___25_65_0.pdf
Description: how_groupe_bull_screwed_latin-1.pdf

