From: Luis Javier Merino
Subject: Re: [bug-libunistring] Hangul Jamo vowels and trailing consonants should probably be 0 width
Date: Tue, 28 Dec 2021 13:40:09 +0100

On Tue, Dec 28, 2021 at 11:36 AM Bruno Haible <bruno@clisp.org> wrote:
> I agree that U+D7B0..U+D7FF (Hangul Jamo Extended-B) should be treated like
> U+1160..U+11FF (Hangul Jamo medial and final), per Unicode standard, chapter 18
> https://www.unicode.org/versions/Unicode14.0.0/ch18.pdf .
>
> However, I don't think what people have been looking at is the right spot.

Yes. wcwidth() interfaces lack context. wcswidth()-style interfaces
are better in that regard, e.g. Perl's Unicode::GCString:

use strict;
use warnings;

binmode(STDOUT, ":utf8");

use Unicode::GCString;
use Text::CharWidth qw(mbwidth mbswidth);

sub string_info {
       my $s = shift;
       my $gc = Unicode::GCString->new($s);
       print "$s : GCString->columns: ", $gc->columns,
             " : mbswidth: ", mbswidth($s), "\n";
       for (my $i = 0; $i < length($s); $i++) {
               my $c = substr($s,$i,1);
               my $cgc = Unicode::GCString->new($c);
               print "\t$c : GCString->columns: ", $cgc->columns,
                     " : mbswidth: ", mbswidth($c),
                     " : mbwidth: ", mbwidth($c), "\n";
       }
}

string_info("\x{1100}\x{d7b0}\x{d7fb}\x{1101}\x{d7c0}\x{d7c2}\x{d7d0}");
string_info("\x{1100}\x{200b}\x{d7b0}\x{200b}\x{d7fb}\x{1101}\x{200b}\x{d7c0}\x{200b}\x{d7c2}\x{200b}\x{d7d0}");

The above script results in:


ᄀힰퟻᄁퟀퟂퟐ : GCString->columns: 4 : mbswidth: 4
ᄀ : GCString->columns: 2 : mbswidth: 2 : mbwidth: 2
ힰ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ퟻ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ᄁ : GCString->columns: 2 : mbswidth: 2 : mbwidth: 2
ퟀ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ퟂ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ퟐ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ᄀힰퟻᄁퟀퟂퟐ : GCString->columns: 14 : mbswidth: 4
ᄀ : GCString->columns: 2 : mbswidth: 2 : mbwidth: 2
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ힰ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ퟻ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ᄁ : GCString->columns: 2 : mbswidth: 2 : mbwidth: 2
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ퟀ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ퟂ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ퟐ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0

(The string on the line with GCString->columns: 14 should display as
separate Jamo.)


> 2) People argue about the use of these Hangul Jamo characters when
> they form a complete Hangul syllable, and that in this case the
> total width should be 2, and therefore 2 = 2 + medial + final the
> medial and final parts should have width 0.
>
> But in this case people would be using a precomposed Hangul syllable.

The Mac OS X filesystem stores filenames in NFD, which decomposes
precomposed syllables into their component Jamo. See:

https://github.com/neovim/neovim/issues/4476
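To make the NFD case concrete, here is a minimal Python sketch (Python
only because unicodedata ships with the standard library). naive_width
and its conjoining_width parameter are hypothetical names made up for
this illustration; the function stands in for a context-free loop over
wcwidth() and is not any library's actual implementation.

```python
import unicodedata

def naive_width(text, conjoining_width):
    """Sum per-code-point widths with no context, like a loop over wcwidth().

    conjoining_width is the width assumed for medial/final Jamo
    (U+1160..U+11FF and U+D7B0..U+D7FF). It is a hypothetical knob used
    only to compare the two policies discussed in this thread.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if 0x1160 <= cp <= 0x11FF or 0xD7B0 <= cp <= 0xD7FF:
            total += conjoining_width
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2
        else:
            total += 1
    return total

nfc = "\ud55c"                            # precomposed syllable U+D55C
nfd = unicodedata.normalize("NFD", nfc)   # what an NFD filename would contain

print([f"U+{ord(c):04X}" for c in nfd])   # ['U+1112', 'U+1161', 'U+11AB']
print(naive_width(nfc, 0))                # 2
print(naive_width(nfd, 0))                # 2: medial/final counted as 0
print(naive_width(nfd, 2))                # 6: every Jamo counted as 2
```

With medial and final Jamo at width 0, the decomposed form keeps the
same 2 columns as the precomposed syllable; with a non-zero width the
same filename grows wider once it has been through NFD.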

>
> What I am more concerned about: When you look at the code charts
> https://www.unicode.org/charts/PDF/U1100.pdf
> https://www.unicode.org/charts/PDF/UD7B0.pdf
> you see that there are glyphs.
> - In which circumstances are these characters used individually?
>   Maybe in a text book for Korean children?
> - How are they supposed to be rendered in these situations? Surely
>   as glyphs of width 2, no?

To render as separate components, there are several options:

 - Use the non-conjoining forms from the Hangul Compatibility Jamo
block (U+3130–U+318F). It covers the Jamo in modern use, from the
standard KS X 1001:1998. It doesn't cover archaic Jamo.
 - Use the filler choseong (initial) U+115F and jungseong (medial)
U+1160 Jamo as appropriate, to create a syllable with only the
required Jamo displayed. The font may still squeeze the Jamo in a
corner.
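 - Use non-Korean characters to separate Jamo, e.g. U+200B zero width
space or U+2060 word joiner. Here we have a problem: a context-free
width of 0 for the medial and final Jamo is then wrong, as in the
second string above.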
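The three options can be spelled out for a standalone vowel "A" with a
short Python sketch. The describe helper and the variable names are
made up for this illustration; the character names and East Asian
Width classes come from Python's unicodedata tables, so they reflect
whatever Unicode version the interpreter ships.

```python
import unicodedata

def describe(ch):
    """Return (character name, East Asian Width class) for one code point."""
    return unicodedata.name(ch), unicodedata.east_asian_width(ch)

# Option 1: non-conjoining compatibility Jamo.
compat_vowel = "\u314f"
# Option 2: choseong filler followed by the conjoining vowel.
filler_vowel = "\u115f\u1161"
# Option 3: a non-Korean separator (here U+200B) between conjoining Jamo.
separated = "\u1100\u200b\u1161"

for label, text in [("compat", compat_vowel),
                    ("filler", filler_vowel),
                    ("separated", separated)]:
    print(label)
    for ch in text:
        # U+200B has no character name, so only name the Hangul code points.
        if unicodedata.category(ch) == "Cf":
            print(f"\tU+{ord(ch):04X} (format character)")
        else:
            print(f"\tU+{ord(ch):04X}", *describe(ch))
```

Only the third option leaves a medial Jamo with nothing Korean in
front of it, which is the case a context-free width table cannot tell
apart from a real syllable.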

>
> In the end, it comes down to: What is the more frequent context for
> these characters?
>

Ideally, everyone would pass complete strings, or at least complete
(extended?) grapheme clusters, to functions like wcswidth() or
u32_width(), and these functions would take context into account, as
Perl's Unicode::GCString does. Since wcwidth/g_unichar_*/uc_width are
widely used, results are sometimes going to be wrong. I don't really
know whether NFD filenames causing trouble, with decomposed Hangul
taking 3 or 4 columns, is more common than people trying to use
separate Jamo in terminal emulators, though I suspect it is.
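As a rough sketch of what "taking context into account" could look
like, the Python below gives a medial or final Jamo width 0 only when
it follows something it can conjoin with, and width 2 otherwise. The
names hangul_type and string_width are hypothetical, and the rule is a
simplification of the Hangul syllable boundary rules (for instance,
consecutive leading consonants each count as 2 here). It is not how
Unicode::GCString or libunistring actually compute widths.

```python
import unicodedata

def hangul_type(cp):
    """Classify a code point as L, V, T, LV, LVT, or None (not Hangul)."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None

def string_width(text):
    """Column width of text, with medial/final Jamo judged in context."""
    total = 0
    prev = None  # Hangul type of the previous code point, if any
    for ch in text:
        kind = hangul_type(ord(ch))
        if kind == "V" and prev in ("L", "V", "LV"):
            width = 0      # vowel continuing a syllable
        elif kind == "T" and prev in ("V", "T", "LV", "LVT"):
            width = 0      # trailing consonant continuing a syllable
        elif kind is not None:
            width = 2      # syllable start, or a Jamo standing alone
        elif unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            width = 0      # combining marks and format characters
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            width = 2
        else:
            width = 1
        total += width
        prev = kind
    return total

joined = "\u1100\ud7b0\ud7fb\u1101\ud7c0\ud7c2\ud7d0"
separated = ("\u1100\u200b\ud7b0\u200b\ud7fb\u1101\u200b\ud7c0"
             "\u200b\ud7c2\u200b\ud7d0")
print(string_width(joined))      # 4
print(string_width(separated))   # 14
```

On the two test strings from the script above this gives 4 and 14, the
same totals GCString->columns reported, whereas a per-character table
has to pick one answer for both.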


