bug-gnulib

Re: Bytewise u??_conv_from_encoding


From: Bruno Haible
Subject: Re: Bytewise u??_conv_from_encoding
Date: Sat, 01 Jan 2022 13:57:29 +0100

Hi Marc,

> The demand to read a file (in local encoding) and to decode it
> incrementally seems a typical one.

There are five ways to satisfy this demand.

(A) Using a pipe at the shell level:
      iconv -t UTF-8 | my-program

(B) Using a programming language that has a coroutines concept.
    This way, both the decoder and the consumer can be programmed in
    a straightforward manner.

(C) In C, with multiple threads.

(D) In C, with a decoder programmed in a straightforward manner
    and a consumer that is written as a callback with state.

(E) In C, with a decoder written as a callback with state
    and a consumer programmed in a straightforward manner.

> Thus, I am wondering whether it makes sense to offer a stateful
> decoder that takes byte by byte and signals as soon as a decoded byte
> sequence is ready.

It seems that you are thinking of approach (D).

I think (D) is the worst, because writing application code in a callback
style with state is hard and error-prone. I would favour (E) instead,
if (A) is not possible.
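
A minimal sketch of what approach (E) could look like, assuming UTF-8
input read from a FILE. The names utf8_reader, utf8_reader_init and
utf8_reader_next are hypothetical and not part of gnulib. Everything the
decoder has to remember between calls lives in its state struct, so the
consumer stays an ordinary loop. The decoder is deliberately simplistic:
it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical decoder state, kept between calls. */
struct utf8_reader {
  FILE *fp;
  unsigned char buf[1024];
  size_t pos;                   /* next unread byte in buf */
  size_t len;                   /* number of valid bytes in buf */
};

static void utf8_reader_init(struct utf8_reader *r, FILE *fp)
{
  assert(r != NULL && fp != NULL);
  r->fp = fp;
  r->pos = 0;
  r->len = 0;
}

/* Returns the next input byte, or -1 at end of input or on a read error.
   Refills the buffer when it is exhausted. */
static int next_byte(struct utf8_reader *r)
{
  if (r->pos == r->len) {
    r->len = fread(r->buf, 1, sizeof r->buf, r->fp);
    r->pos = 0;
    if (r->len == 0)
      return -1;
  }
  return r->buf[r->pos++];
}

/* Returns the next code point, -1 at end of input, or U+FFFD for a
   malformed or truncated sequence. */
static long utf8_reader_next(struct utf8_reader *r)
{
  int b = next_byte(r);
  if (b < 0)
    return -1;
  if (b < 0x80)
    return b;

  int n;                        /* continuation bytes still expected */
  long cp;
  if ((b & 0xE0) == 0xC0)      { n = 1; cp = b & 0x1F; }
  else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; }
  else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; }
  else
    return 0xFFFD;

  while (n-- > 0) {
    int c = next_byte(r);
    if (c < 0)
      return 0xFFFD;
    if ((c & 0xC0) != 0x80) {
      r->pos--;                 /* not a continuation byte: un-read it */
      return 0xFFFD;
    }
    cp = (cp << 6) | (c & 0x3F);
  }
  return cp;
}

/* The consumer, written in a straightforward manner. */
static size_t count_code_points(FILE *fp)
{
  struct utf8_reader r;
  size_t count = 0;
  utf8_reader_init(&r, fp);
  while (utf8_reader_next(&r) >= 0)
    count++;
  return count;
}
```

With approach (D) the roles would be swapped: the decoder would own the
loop, and count_code_points would have to become a callback that keeps
its counter in a state struct.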

(B) means using a different programming language. I can't recommend C++ [1].

(C) is possible, but complex. See e.g. gnulib's pipe-filter-ii.c or
pipe-filter-gi.c. Generally, threads are overkill when all you need are
coroutines.

Now, when implementing (E), it will be useful to have some kind of "abstract
input stream" data type. Such a thing does not exist in C, for historical
reasons. But it can be done similarly to the "abstract output stream" data
type that is at the heart of GNU libtextstyle [2][3][4].
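
To illustrate the idea, here is a rough sketch of an "abstract input
stream" in plain C: a table of operations plus per-implementation state.
All names (istream, istream_ops, mem_istream) are made up for this
example. This is not libtextstyle's API, only one way such a type might
be laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct istream;

/* Operations every input stream implementation has to provide. */
struct istream_ops {
  /* Reads up to n bytes into buf.  Returns the number of bytes read,
     or 0 at end of input. */
  size_t (*read)(struct istream *s, void *buf, size_t n);
};

/* The abstract type: callers only see the operations table. */
struct istream {
  const struct istream_ops *ops;
};

static size_t istream_read(struct istream *s, void *buf, size_t n)
{
  return s->ops->read(s, buf, n);
}

/* One concrete implementation: an input stream over a memory region. */
struct mem_istream {
  struct istream base;          /* must come first, see the cast below */
  const char *data;
  size_t remaining;
};

static size_t mem_read(struct istream *s, void *buf, size_t n)
{
  struct mem_istream *m = (struct mem_istream *) s;
  if (n > m->remaining)
    n = m->remaining;
  memcpy(buf, m->data, n);
  m->data += n;
  m->remaining -= n;
  return n;
}

static const struct istream_ops mem_ops = { mem_read };

static void mem_istream_init(struct mem_istream *m,
                             const char *data, size_t len)
{
  assert(m != NULL && data != NULL);
  m->base.ops = &mem_ops;
  m->data = data;
  m->remaining = len;
}
```

A decoding stream would then be one more implementation of istream_ops
that wraps an underlying istream and converts what it reads, so the
consumer never needs to know where the bytes come from.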

> On top of that, a decoding Unicode mbfile interface can be built, say ucfile.

One of the problems of byte-by-byte decoding is that it's inefficient. It's
way more efficient to do the same task (decoding, consuming) on an entire
buffer of, say, at least 1 KiB. Buffering minimizes the context switches and
time spent in function entry/exit. That needs to be considered in the design.
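
As a sketch of buffer-at-a-time decoding, the following uses the POSIX
iconv() interface to convert a whole FILE to UTF-8 in chunks of 1 KiB.
The function name convert_stream and the buffer sizes are assumptions
made for this example, and it is not how gnulib's conversion functions
are implemented. The only state carried from one buffer to the next,
besides the conversion descriptor, is the tail of an incomplete
multibyte sequence.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Converts all of 'in' (encoded in from_code) to UTF-8 on 'out'.
   Returns 0 on success, -1 on failure. */
static int convert_stream(const char *from_code, FILE *in, FILE *out)
{
  assert(from_code != NULL && in != NULL && out != NULL);

  iconv_t cd = iconv_open("UTF-8", from_code);
  if (cd == (iconv_t) -1)
    return -1;

  char inbuf[1024];
  char outbuf[4096];
  size_t pending = 0;   /* bytes of an incomplete sequence, kept over */
  int result = 0;

  for (;;) {
    size_t got = fread(inbuf + pending, 1, sizeof inbuf - pending, in);
    int at_eof = (got == 0);
    char *src = inbuf;
    size_t src_left = pending + got;

    /* Convert this buffer, emptying outbuf as often as needed. */
    while (src_left > 0) {
      char *dst = outbuf;
      size_t dst_left = sizeof outbuf;
      size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
      int err = (rc == (size_t) -1) ? errno : 0;

      fwrite(outbuf, 1, sizeof outbuf - dst_left, out);

      if (err == 0 || err == E2BIG)
        continue;       /* all consumed, or outbuf was full: go on */
      if (err == EINVAL && !at_eof)
        break;          /* sequence cut off at buffer end: read more */
      result = -1;      /* invalid input, or input truncated at EOF */
      break;
    }

    /* Move the unconverted tail to the front for the next round. */
    pending = src_left;
    memmove(inbuf, src, pending);

    if (at_eof || result != 0)
      break;
  }

  iconv_close(cd);
  return result;
}
```

Since the target here is UTF-8, which has no shift state, the sketch
skips the final iconv() call with a NULL input that a stateful target
encoding would need.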

Bruno

[1] https://en.cppreference.com/w/cpp/language/coroutines
[2] https://www.gnu.org/software/gettext/libtextstyle/manual/html_node/The-output-stream-hierarchy.html
[3] https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=libtextstyle/gnulib-local/lib/iconv-ostream.oo.h
[4] https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=libtextstyle/gnulib-local/lib/iconv-ostream.oo.c
