bug-hurd
Re: term, utf-8 and cooked mode, combining characters


From: Marcus Brinkmann
Subject: Re: term, utf-8 and cooked mode, combining characters
Date: Wed, 18 Sep 2002 17:03:35 +0200
User-agent: Mutt/1.4i

On Wed, Sep 18, 2002 at 04:19:15PM +0200, Niels Möller wrote:
> The unicode support I'm talking about is the ability to take the input
> stream and chop it up into units that are passed on to libtermserver
> input handling. That is support that is needed either in console or
> term, depending on how they work together.
> 
> To me it seems easier to perform the following steps:
> 
> A1. Chop the unicode stream up into graphemes.
> A2. Convert each grapheme into the local encoding, resulting in one or
>     more bytes each. (I think you can do this with iconv).
> A3. Pass each grapheme to the term input handling (libtermserver),
>     using the local encoding.
> 
> than
> 
> B1. Convert stream into local encoding.
> B2. Chop up the stream into graphemes, using rules that depend on
>     the local encoding. (I don't think iconv can do this easily).
> B3. Pass the graphemes on to the term input handling.

That's a language I understand! :)  I did not understand you earlier.

Yes, these are our two choices.  This is a very clear presentation, which
I want to thank you for.  I think there are a couple of issues that are
worth considering:

A1: How hard is that?  It is basically the same job as supporting combining
characters in the output direction, because we have to notice a base
character and the combining characters belonging to it.  So, as you said, it
seems we need to implement this anyway.
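
Just to make A1 concrete, here is a minimal sketch in C of what the grouping
itself could look like; is_combining () is hypothetical and would have to be
backed by the Unicode combining class tables we need for the output direction
anyway:

  /* Sketch: IN points to a UCS-4 input stream of LEN characters.  Return
     how many of them (at least 1) form the grapheme starting at IN[0],
     i.e. the base character plus the combining characters following it.  */
  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical predicate: nonzero for combining characters.  */
  extern int is_combining (uint32_t uc);

  size_t
  grapheme_length (const uint32_t *in, size_t len)
  {
    size_t n = 1;

    while (n < len && is_combining (in[n]))
      n++;
    return n;
  }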

A2: I wonder if it is really true that one Unicode grapheme always encodes
as at most one grapheme in the local encoding, but I guess that is a
somewhat reasonable assumption.  Nevertheless, this feels a bit unpleasant.
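
For A2, a minimal sketch of the per-grapheme conversion with iconv; the
descriptor would come from something like iconv_open (nl_langinfo (CODESET),
"UTF-8"), and error handling here is reduced to the bare minimum:

  /* Sketch: convert one UTF-8 grapheme (IN, INLEN bytes) into the local
     encoding, writing into OUT (OUTLEN bytes).  Return the number of bytes
     written, or (size_t) -1 if the grapheme has no representation in the
     local encoding or the output buffer is too small.  */
  #include <iconv.h>

  size_t
  grapheme_to_local (iconv_t cd, const char *in, size_t inlen,
                     char *out, size_t outlen)
  {
    char *inp = (char *) in, *outp = out;
    size_t inleft = inlen, outleft = outlen;

    if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
      return (size_t) -1;
    return outlen - outleft;
  }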

A3: This shows that this option is only available if we use the builtin
term.  I have some aesthetic objections to binding the console and term
together this way.  I think it should at least in principle be possible to
use the console with a separate term process, and including term in the
console should be just an optimization.  This is because we also want to
stack terms, etc.

B2: This looks ugly, but it isn't so bad, as long as you don't have to do it
for multiple locales in one program, because this can be done with the
standard conversion functions rather than iconv.  We already have a working
implementation of this (it is of course libreadline), and even if we don't
use that we could rip it off.  In particular, this can be done inside of
term.
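
For comparison, a minimal sketch of the B2 chopping with the standard
conversion functions (mbrtowc) instead of iconv; this only finds multibyte
character boundaries in the local encoding, which for the usual local
encodings should amount to the grapheme boundaries:

  /* Sketch: return the number of bytes making up the next character in
     BUF (at most LEN bytes available), or (size_t) -1 for an invalid or
     incomplete sequence.  STATE carries the conversion state across calls.  */
  #include <wchar.h>

  size_t
  next_mbchar (const char *buf, size_t len, mbstate_t *state)
  {
    wchar_t wc;
    size_t n = mbrtowc (&wc, buf, len, state);

    if (n == (size_t) -1 || n == (size_t) -2)
      return (size_t) -1;
    if (n == 0)
      return 1;  /* Treat an embedded null byte as one character.  */
    return n;
  }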

> In particular, I'm afraid that to do B2 you either have to support the
> rules of a bunch of strange multibyte charsets, or convert the stream
> back to unicode, chop it up into units of base char + combining chars,
> and then convert it back to the local encoding.

No, neither would be acceptable, nor does libreadline do it this way.  After
all, any program that wants to handle multibyte encodings cleanly has to do
it; term is not really special in that regard.  Maybe this is the other
reason why I aesthetically object to solution A.  It special-cases term,
which means that we will have to do a custom implementation with our own
bugs and debugging, while in the B model we can just reuse the code
everybody else is using.

> As I understand you, the current code performs B1 in the console, and
> B2 and B3 have to be done by term (but aren't yet implemented). But the
> work could be divided differently, either all of A1-A3 + term handling
> could be done by the console (Roland's suggestion), or perhaps the
> console could do A1-A3 and use some new interface to communicate the
> stream of graphemes (in local encoding) to term. One could also move
> some of the work even further away from term, into the input client.

Ideally, the input drivers don't need to be configured and adapted to your
local encoding environment.  I would like to keep it that way if at all
possible.  Either unicode is as universal as we want it to be, in which case
it should be possible, or we have to take a closer look at this whole area
anyway.

Thinking about random reasons not to have it in the console: I originally
was only looking for a fix for term's cooked mode.  This is because in raw
mode, all these extra loops are not necessary; they will be done by the
application (like readline, or emacs).  So putting it in the console can be
an unnecessary bottleneck for throughput (not that _inputting_ characters is
performance critical, but I also want to watch CPU usage in general).

I hope I am not appearing dogmatic on this in any way.  I am still at the
beginning of thinking about this, and am trying to gather arguments for both.
Evaluation has to come later.  So far I have spent more time thinking about
the B model than the A model; that's probably why I am biased and have the
arguments in favor of B more readily at hand.  (In the end, it usually turns
out that Roland's intuition was correct anyway ;)

Thanks,
Marcus


-- 
`Rhubarb is no Egyptian god.' GNU      http://www.gnu.org    marcus@gnu.org
Marcus Brinkmann              The Hurd http://www.gnu.org/software/hurd/
Marcus.Brinkmann@ruhr-uni-bochum.de
http://www.marcus-brinkmann.de/



