Re: Greek in Groff


From: Oliver Corff
Subject: Re: Greek in Groff
Date: Fri, 24 Mar 2023 19:31:45 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0

Hi Branden,

thank you for your detailed reply. I'll try your examples over the weekend.

The main reasons why I thought of suggesting the ISO 8859-7 encoding instead of native Unicode were twofold:

1. The example seen on Reddit (if I remember correctly) looked botched, like typical 8-bit output sent to a system unaware of the specific encoding. It certainly looked "greek", but it was not Greek.

2. This may be due to a fault in my understanding of groff: although I am well aware of preconv(1), I believed that the character escape sequences which preconv emits (instead of native Unicode code points) would cause hiccups for the hyphenation algorithm.
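
For illustration, here is a minimal Python sketch of the kind of output meant in point 2. It only imitates the documented behaviour of preconv(1), which passes ASCII through and writes other characters as \[uXXXX] entities; it is not preconv's actual code, and the function name to_groff_escapes is made up for this example.

```python
def to_groff_escapes(text):
    """Replace every non-ASCII character with a \\[uXXXX] escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                  # ASCII passes through unchanged
        else:
            out.append("\\[u%04X]" % cp)    # at least four uppercase hex digits
    return "".join(out)


if __name__ == "__main__":
    # U+03AC is GREEK SMALL LETTER ALPHA WITH TONOS (precomposed).
    print(to_groff_escapes("a\u03ac"))
```

The formatter therefore sees named special characters rather than raw code points, which is the source of my worry about hyphenation.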

I'll search for a suitable font, and then try again, because once there is a path for Greek (preferably, of course, in Unicode), then there will also be a path for all languages using the Cyrillic alphabet (which occurs frequently in my work).

So far, I've been using XeLaTeX, but it would be really nice to have clean straightforward document processing for these character sets in groff.

Best regards,

Oliver.


On 23/03/2023 22:03, G. Branden Robinson wrote:
At 2023-03-18T11:09:16+0000, Ralph Corderoy wrote:
The encoding of choice would probably be ISO 8859-7 in order to
remain within the 8 bit character encoding space.
...
4. Write your documents in ISO 8859-7 or convert them from Unicode
to ISO 8859-7
I'd recommend your second option; that Mortadelas writes in UTF-8 and
uses preconv(1).
I too second this.  We can add support for ISO 8859 Latin/Greek, but an
important question to address is whether composers of modern Greek
documents need that, or whether the false belief that "groff has no
UTF-8 support" is misleading people into thinking that using an 8-bit
character set is the only way they can get their language supported in
groff.

2. Localize necessary strings (like "abstract", "contents", days of
the week etc.)
This may not be needed, e.g. if a macro set isn't being used.
Also true.

P.S: This
<https://www.reddit.com/r/groff/comments/112tfqv/support_for_greek_in_groff/>
is the post I am referring to.
For others who may reply, this thread is worth a read to see what's
already been suggested to Mortadelas.
We seem to be missing some support for recombining characters on
typesetting devices after decomposing them.

For example, I prepared the following document.

[UTF-8 follows]

$ cat ATTIC/sample-greek.ms
.NH 1
Δεκεμβριανά
.LP
Η έναρξή τους, στις 3 Δεκεμβρίου του 1944, σηματοδοτείται από τους
πυροβολισμούς των Αστυνομικών δυνάμεων μπροστά στο μνημείο του άγνωστου
στρατιώτη ενάντια στη διαδήλωση του ΕΑΜ, που είχε οργανωθεί ως απάντηση
στο τελεσίγραφο της κυβέρνησης εθνικής ενότητας (1-12-1944) για τον
αφοπλισμό όλων των αντάρτικων ομάδων, με αποτέλεσμα το θάνατο 33
διαδηλωτών και τον τραυματισμό άλλων 148. Παράλληλα ο στρατηγός Σκόμπυ
προέβη σε διάγγελμα, ενώ άμεσες προσπάθειες για πολιτική λύση
απαγορεύτηκαν από τον Τσώρτσιλ.

This renders fine to a UTF-8 terminal with groff 1.22.4:

$ groff -k -ms -Tutf8 ATTIC/sample-greek.ms | cat -s

1.  Δεκεμβριανά

Η  έναρξή  τους,  στις 3 Δεκεμβρίου του 1944, σηματοδοτείται
από τους πυροβολισμούς των Αστυνομικών δυνάμεων μπροστά  στο
μνημείο  του  άγνωστου  στρατιώτη  ενάντια στη διαδήλωση του
ΕΑΜ, που είχε οργανωθεί  ως  απάντηση  στο  τελεσίγραφο  της
κυβέρνησης  εθνικής  ενότητας  (1‐12‐1944) για τον αφοπλισμό
όλων των αντάρτικων  ομάδων,  με  αποτέλεσμα  το  θάνατο  33
διαδηλωτών  και  τον  τραυματισμό  άλλων  148.  Παράλληλα  ο
στρατηγός Σκόμπυ προέβη σε διάγγελμα, ενώ άμεσες προσπάθειες
για πολιτική λύση απαγορεύτηκαν από τον Τσώρτσιλ.

In groff 1.23.0, you will even be able to use nroff:

$ ./build/nroff -k -ms ATTIC/sample-greek.ms | cat -s

1.  Δεκεμβριανά

Η  έναρξή  τους,  στις 3 Δεκεμβρίου του 1944, σηματοδοτείται
από τους πυροβολισμούς των Αστυνομικών δυνάμεων μπροστά  στο
μνημείο  του  άγνωστου  στρατιώτη  ενάντια στη διαδήλωση του
ΕΑΜ, που είχε οργανωθεί  ως  απάντηση  στο  τελεσίγραφο  της
κυβέρνησης  εθνικής  ενότητας  (1‐12‐1944) για τον αφοπλισμό
όλων των αντάρτικων  ομάδων,  με  αποτέλεσμα  το  θάνατο  33
διαδηλωτών  και  τον  τραυματισμό  άλλων  148.  Παράλληλα  ο
στρατηγός Σκόμπυ προέβη σε διάγγελμα, ενώ άμεσες προσπάθειες
για πολιτική λύση απαγορεύτηκαν από τον Τσώρτσιλ.

But when preparing DVI, PostScript, or PDF, we have a problem.

$ groff -k -ms -Tpdf ATTIC/sample-greek.ms >| ATTIC/sample-greek.pdf
troff: ATTIC/sample-greek.ms:2: warning: can't find special character 'u03B1_0301'
troff: ATTIC/sample-greek.ms:4: warning: can't find special character 'u03B5_0301'
troff: ATTIC/sample-greek.ms:4: warning: can't find special character 'u03B7_0301'
troff: ATTIC/sample-greek.ms:4: warning: can't find special character 'u03B9_0301'
troff: ATTIC/sample-greek.ms:4: warning: can't find special character 'u03BF_0301'
troff: ATTIC/sample-greek.ms:5: warning: can't find special character 'u03C5_0301'
troff: ATTIC/sample-greek.ms:5: warning: can't find special character 'u03C9_0301'

What is happening is that letters with the acute accent (Greek: tonos)
are getting dropped.  preconv(1) produces them in precomposed form
(Unicode Normalization Form C), which is fine for terminals, but not
necessarily the right thing to do on typesetters.  GNU troff therefore
decomposes them.
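
To make the names in those warnings less mysterious, here is a small Python sketch that derives them from the precomposed letters. It assumes that GNU troff's decomposition agrees with Unicode NFD for these letters, which the warnings above suggest; groff uses its own tables, so this is an illustration, not its implementation. The helper name groff_glyph_name is made up.

```python
import unicodedata


def groff_glyph_name(ch):
    """Return a uXXXX[_XXXX...] name built from the NFD form of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "u" + "_".join("%04X" % ord(c) for c in decomposed)


if __name__ == "__main__":
    # The seven lowercase Greek vowels with tonos, in precomposed form.
    for ch in "\u03ac\u03ad\u03ae\u03af\u03cc\u03cd\u03ce":
        print("U+%04X" % ord(ch), groff_glyph_name(ch))
```

Each name is the base letter followed by U+0301 (combining acute accent), and it is those composite names that the typesetter fonts lack.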

But it appears that some logic is missing for recombining them.  What
I'm not sure about is what component of the system has the missing
functionality.  In the old days (the 1970s and 1980s, as seen in the
accent mark support of ms(7) and me(7)), you'd just barrel forward,
formatting the base character and combining character together with the
\o escape sequence.

This approach breaks down when you need to apply multiple combining
characters, as happens perhaps most famously with Vietnamese, but also
with seemingly simpler scripts like the Pinyin romanization of Mandarin.

https://savannah.gnu.org/bugs/index.php?57524

My understanding is that modern font formats like OpenType (and
TrueType?) are supposed to be smart, and are able to handle this
situation with aplomb, though there are surely limits to this and I have
no idea how exceeding those limits is supposed to be communicated
back to typesetting software.

But as far as I know the PostScript Type 1 fonts that we work with
_aren't_ smart, which leaves the problem in groff's hands.

As an experiment I tried the following crude workaround, called
"sample-greek2.groff".

.if t \{\
.  char \[u03B1_0301] \o'\[u03B1]\[u00B4]'
.  char \[u03B5_0301] \o'\[u03B5]\[u00B4]'
.  char \[u03B7_0301] \o'\[u03B7]\[u00B4]'
.  char \[u03B9_0301] \o'\[u03B9]\[u00B4]'
.  char \[u03BF_0301] \o'\[u03BF]\[u00B4]'
.  char \[u03C5_0301] \o'\[u03C5]\[u00B4]'
.  char \[u03C9_0301] \o'\[u03C9]\[u00B4]'
.\}
\[u03B1_0301]
\[u03B5_0301]
\[u03B7_0301]
\[u03B9_0301]
\[u03BF_0301]
\[u03C5_0301]
\[u03C9_0301]
.pl \n[nl]u
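
The seven definitions above were typed by hand. A rough Python sketch like the following could generate the same kind of "char" request for whichever accented lowercase Greek letters a given text uses. It is only a convenience for the crude overstrike workaround, the function name tonos_char_requests is made up, and capitals are deliberately skipped because I would not expect a centered overstrike to place their tonos correctly.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent (tonos)


def tonos_char_requests(text):
    """Return .char lines for lowercase Greek letters carrying a tonos."""
    requests = []
    for ch in sorted(set(text)):
        d = unicodedata.normalize("NFD", ch)
        # Keep only "base letter + acute" where the base is alpha..omega.
        if len(d) == 2 and d[1] == ACUTE and "\u03b1" <= d[0] <= "\u03c9":
            base = "%04X" % ord(d[0])
            requests.append(
                ".  char \\[u%s_0301] \\o'\\[u%s]\\[u00B4]'" % (base, base)
            )
    return requests


if __name__ == "__main__":
    for line in tonos_char_requests("\u03ac\u03ad\u03ae\u03af\u03cc\u03cd\u03ce"):
        print(line)
```

The printed lines would still need to be wrapped in the ".if t" block by hand.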

This works...kind of.

It beats dropping characters entirely, but to my eyesight the acute
accents aren't truly centered over the base glyph.  This might have to
do with the glyphs being italic instead of upright; the latter is surely
implied by the code points being in the Unicode Greek and Coptic block
(U+0370-03FF).

Maybe this is a problem with the Ghostscript 9.53.3 fonts.

But this:

$ groff -Tpdf -P -y -P U ATTIC/sample-greek2.groff >| \
   ATTIC/sample-greek2.pdf

...has the same problems.  Worse, in fact, since the acute accent in
this version of URW Times roman is grazing the tops of the lowercase
Greek letters.

Do these fonts just suck?  Does someone have a good Type 1 Greek font to
recommend?

With that in hand it may be easier to decide what groff can do better
(apart from native TTF and OTF support).

Regards,
Branden

--
Dr. Oliver Corff
Mail: oliver.corff@email.de

