From: Stefan Monnier
Subject: Re: unibyte<->multibyte conversion [Re: Emacs-diffs Digest, Vol 2, Issue 28]
Date: Tue, 21 Jan 2003 12:44:50 -0500

> In article <address@hidden>, "Stefan Monnier" <monnier+gnu/address@hidden> 
> writes:
> >>  unibyte sequence (hex): 81    81    C0    C0
> >>                          result of conversion    display in multibyte buffer
> >>  string-as-multibyte:    9E A1 81    C0    C0    \201À\300
> >>  string-make-multibyte:  9E A1 9E A1 81 C0 81 C0 \201\201ÀÀ
> >>  string-to-multibyte:    9E A1 9E A1 C0    C0    \201\201\300\300
> > I find the terminology and the concepts confusing.
> I agree that those names are not that intuitive, but the
> first two were there before I noticed it.  :-p
> But in what sense are the concepts confusing?

The concept of string-as-multibyte made some sense in Emacs-20
when it was really "look under the hood: take the same bytes but
interpret them differently".  In Emacs-21, this is no longer the case,
so I don't really understand what the intent behind it is,
other than emacs-mule decoding (that the bytes might happen to come out of
some other decoding step rather than out of a file is not really
relevant, I think).

I think what I find confusing is that the name of those functions
implicitly says "take the string and give me the same one, but
just multibyte instead of unibyte", even though there's no unambiguous
way to have "the same one".  So there has to be a choice of how
the conversion between unibyte and multibyte takes place, but this choice
is not clearly described by the functions' names.

> Please note that decode-coding-string also does eol
> conversion.  Using 'internal-unix, 'default-unix,

Sorry for my sloppiness.

> 'raw-text-unix will make them more equivalent.

This should probably be `no-conversion' (or `binary').  Admittedly, it's
the same, but I think it carries the intent a bit better.

> But, as we now have eight-bit-XXXX, I agree that
> string-as-multibyte is not that useful, string-to-multibyte
> is better.

But they do different things and the name-difference does not
explain clearly the subtle distinction, so I think it's more
confusing than anything else.

> > 2 - there is no `default' coding-system either.  Or maybe
> >     locale-coding-system is this default: if your locale is
> >     latin-1 then that's latin-1.
> If one does not do set-language-environment,
> locale-coding-system can be used as `default'.

And otherwise?  The mere fact that I don't know the answer to this
question seems like a good indication that pretty much nobody knows what
`string-make-multibyte' does, so anyone who uses it is most likely
using it wrong.
Luckily, it seems only ps-mule.el uses it (although much more
code uses the underlying nonascii-translation-table functionality).

> > 3 - when called with a `raw-text' coding-system, decode-coding-string
> >     returns a unibyte string, which is obviously not what we want here.
> >     It might make sense for internal operations to return unibyte
> >     strings for the `raw-text' case, but I was really surprised that
> >     decode-coding-string would ever return a unibyte string.
> I tend to agree that it is better that decode-coding-string
> always return a multibyte string now.

If it can be fixed, we can recommend (decode-coding-string str 'no-conversion)
rather than introducing a new function string-to-multibyte.

> I think string-FOO-multibyte (and also string-FOO-unibyte)
> are conceptually different from decoding (and encoding)
> operations.  It's difficult for me to explain it clearly,
> but I'll try.
> Decoding and encoding are interface between Emacs and the
> outer world.
> Decoding is for converting an external byte sequence
> (i.e. belonging to a world out of Emacs) into Emacs'
> representation.
> Encoding is for converting Emacs' representation to a byte
> sequence that is used out of Emacs.

But the `emacs-mule' coding-system is used both inside and outside,
and same goes for `binary', so the distinction between inside and
outside is not very clear-cut.

I find it more helpful to think in terms of bytes and chars: unibyte
strings are sequences of bytes while multibyte strings are sequences
of chars.  Converting between bytes and chars is the purpose of
coding-systems.  In such a context, string-FOO-multibyte are obviously
just various forms of decoding, but the names don't give a good sense
of which decoding is used.
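
A minimal sketch of the bytes-versus-chars view, assuming a recent
Unicode-based Emacs rather than the Emacs-21 discussed here (character
codes differ between the two).  The `my-' names are hypothetical and
only serve the illustration: the same single byte becomes a different
character depending on which coding-system is named.

```elisp
;; A unibyte string holding one byte, 0xC0.  An octal escape in a
;; literal without other non-ASCII characters keeps the string unibyte.
(defvar my-bytes "\300")                ; hypothetical name

(multibyte-string-p my-bytes)           ; nil: a sequence of bytes

;; Decoding turns bytes into chars; the coding-system picks which chars.
(defvar my-latin-1 (decode-coding-string my-bytes 'iso-8859-1))
(defvar my-latin-2 (decode-coding-string my-bytes 'iso-8859-2))

(aref my-latin-1 0)                     ; #xC0  on a Unicode-based Emacs
(aref my-latin-2 0)                     ; #x154 on a Unicode-based Emacs
```

The byte is identical in both calls; only the explicitly named
coding-system decides what "the same string, but multibyte" means.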

> And, if one wants to insert a result of encode-coding-string
> in a multibyte buffer (perhaps for some post-processing),
> what should he do?  If we have string-to-multibyte, we can
> do this:
>    (insert (string-to-multibyte
>              (encode-coding-string MULTIBYTE-STRING CODING)))
> If we don't have it, and provided that decode-coding-string
> always returns a multibyte string, we must do:
>    (insert (decode-coding-string
>              (encode-coding-string MULTIBYTE-STRING CODING) 'raw-text-unix))
> Isn't it very funny?

Obviously, I agree with Miles that the second is much clearer, especially
if you replace `raw-text-unix' with `no-conversion'.  (Well, I prefer `binary'
myself, since `no-conversion' is also a misnomer given that a conversion
does take place.)
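
For reference, a small sketch of what the first quoted form does to
encoded bytes, assuming a recent Emacs in which both
`string-to-multibyte' and `string-to-unibyte' are available (neither
can be assumed for the Emacs-21 under discussion).  The `my-' variable
names are hypothetical.

```elisp
;; One character, U+00E9.  A \u escape forces a multibyte literal.
(defvar my-text "\u00e9")

;; Encoding yields a unibyte string holding the two bytes C3 A9.
(defvar my-encoded (encode-coding-string my-text 'utf-8))

;; Each byte >= 0x80 becomes one eight-bit character; nothing is
;; reinterpreted as a multibyte sequence.
(defvar my-as-chars (string-to-multibyte my-encoded))

(multibyte-string-p my-encoded)         ; nil
(multibyte-string-p my-as-chars)        ; t
(length my-as-chars)                    ; 2, one char per original byte

;; Going back to unibyte recovers the original bytes.
(equal (string-to-unibyte my-as-chars) my-encoded)  ; t
```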

> By the way, I think the culprit of the current problem is
> this Emacs' doctrine:
>     Do unibyte<->multibyte conversion by "MAKE" by default.

Since MAKE uses some kind of "default" related to the current language
environment, I think it's OK, except that it's not clear in what way
it's "related".
But of course, there should simply never be such a thing as "guess
what this unibyte stream translates into".  The coding-system used to
decode unibyte into multibyte should always be "clearly" defined
(by the process's coding-system, the keyboard's coding-system, ...).

I.e. it is simply a bug to insert a unibyte string into a multibyte buffer
(and vice versa).
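
A sketch of that discipline, assuming a recent Unicode-based Emacs:
bytes are decoded with an explicitly named coding-system before they
reach a multibyte buffer, so no default guess is involved.  The helper
`my-insert-decoded' and the variable `my-result' are hypothetical
names, not existing Emacs functions.

```elisp
(defun my-insert-decoded (bytes coding)
  "Insert BYTES (a unibyte string) at point after decoding with CODING."
  (insert (decode-coding-string bytes coding)))

;; Temporary buffers are multibyte by default.  The two bytes C3 A9
;; are decoded as UTF-8 before insertion, giving one character.
(defvar my-result
  (with-temp-buffer
    (my-insert-decoded "\303\251" 'utf-8)
    (buffer-string)))

(length my-result)                      ; 1
(aref my-result 0)                      ; #xE9
```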

As for inserting a char between 128 and 256 into a multibyte buffer...
it should ideally always be treated as an eight-bit-foo char,
but I think that making such a change right now would not be wise
because there is still too much code which forgets to decode
its bytes into chars (and instead relies on the MAKE default to
turn those chars into latin-1 chars).

> Although this doctrine surely works for handling unibyte and
> multibyte representation transparently, it makes Elisp
> programmers very very confused.  And it is useful only for
> people whose main charset is single-byte.
> I am seriously considering changing it in emacs-unicode.

Might be a good idea for emacs-unicode indeed.
