help-gnu-emacs

Re: `write-region' writes different bytes than passed to it?


From: Philipp Stephani
Subject: Re: `write-region' writes different bytes than passed to it?
Date: Sat, 22 Dec 2018 23:58:07 +0100

On Tue, 11 Dec 2018 at 16:52, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Tue, 11 Dec 2018 13:30:07 +0100
> >
> > Usually `write-region' uses the coding system bound to
> > `coding-system-for-write'. However, I've found a case where this
> > doesn't seem to hold:
> >
> > $ emacs -Q -batch -eval '(let ((coding-system-for-write (quote
> > utf-8-unix))) (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd
> > /tmp/test.txt
> > 00000000  f2                                                |.|
> > 00000001
> >
> > That is, instead of the byte sequence C1 B2 it writes the single byte
> > F2, which is an invalid UTF-8 sequence. Is that expected?
>
> Yes, because "\xC1\xB2" just happens to be the internal multibyte
> representation of a raw-byte F2.  Raw bytes are always converted to
> their single-byte values on output, regardless of the encoding you
> request.
>

Is that documented somewhere?
Or, in other words, what are the semantics of

(let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))

?
There are two easy cases:
1. STRING is a unibyte string containing only bytes within the ASCII range
2. STRING is a multibyte string containing only Unicode scalar values
In those cases the answer is simple: The form writes the UTF-8
representation of STRING.
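
The two easy cases can be checked empirically by writing a string to a
temporary file and reading the file back as raw bytes. The following is
only a minimal sketch: the helper name `my-written-bytes' is made up for
illustration and is not part of Emacs.

```elisp
;; Hypothetical helper, not an Emacs function.
(defun my-written-bytes (string)
  "Write STRING with `utf-8-unix' and return the file's bytes as a list."
  (let ((file (make-temp-file "write-region-test")))
    (unwind-protect
        (progn
          ;; Same form as in the question, with a temporary file name.
          (let ((coding-system-for-write 'utf-8-unix))
            (write-region string nil file nil 'silent))
          ;; Read the file back without any decoding.
          (with-temp-buffer
            (set-buffer-multibyte nil)
            (insert-file-contents-literally file)
            (append (buffer-string) nil)))
      (delete-file file))))

;; (my-written-bytes "abc")     ; case 1: (97 98 99)
;; (my-written-bytes "\u00E4")  ; case 2: (195 164), UTF-8 for U+00E4
```

Passing other strings to the same helper is one way to probe cases that
are less clear-cut.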
However, the interesting cases are as follows:
3. STRING is a unibyte string with at least one byte outside the ASCII range
4. STRING is a multibyte string with at least one element that is not
a Unicode scalar value
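
The four cases can be told apart mechanically. The following is a rough
sketch; the names `my-unicode-scalar-value-p' and `my-classify-string'
are invented for illustration, and the check assumes that "Unicode
scalar value" means a code point in 0..#xD7FF or #xE000..#x10FFFF.

```elisp
(require 'seq)

;; Hypothetical helpers, not part of Emacs.
(defun my-unicode-scalar-value-p (char)
  "Return non-nil if CHAR is a Unicode scalar value (no surrogates)."
  (or (<= 0 char #xD7FF)
      (<= #xE000 char #x10FFFF)))

(defun my-classify-string (string)
  "Return the number (1 to 4) of the case that STRING falls into."
  (if (multibyte-string-p string)
      ;; Multibyte: case 2 if every element is a scalar value, else 4.
      (if (seq-every-p #'my-unicode-scalar-value-p string) 2 4)
    ;; Unibyte: case 1 if every byte is ASCII, else 3.
    (if (seq-every-p (lambda (byte) (< byte #x80)) string) 1 3)))

;; (my-classify-string "abc")                        ; 1
;; (my-classify-string "\u00E4")                     ; 2
;; (my-classify-string "\xC1\xB2")                   ; 3
;; (my-classify-string (string-to-multibyte "\xF2")) ; 4
```

The third call relies on the reader treating a literal that contains
only hex escapes below 256 as a unibyte string.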
My example is an instance of (3). I admit I haven't read the entire
Emacs Lisp reference manual, but I have read large parts of it, and I
couldn't find a description of cases (3) and (4). Naively, there
are a couple of options:
- Signal an error. That would seem appropriate as such strings can't
be encoded as UTF-8. However, evidently Emacs doesn't do this.
- For case 3, write the bytes in STRING, ignoring the coding system. I
had expected this to happen, but apparently that isn't what happens either.


