bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `stri

bug-gnu-emacs

From:	Richard Hansen
Subject:	bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
Date:	Sat, 4 Jun 2022 20:16:47 -0400
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

On 6/4/22 03:09, Eli Zaretskii wrote:

If there was some situation where you needed these details for some
Lisp program, please describe that situation.


I'm trying to understand some inconsistent behavior I'm observing
while writing code to process binary data, and I found the existing
documentation lacking.


You are digging into low-level details of how Emacs keeps strings in
memory, and the higher-level context of _why_ you need to understand
these details is left untold.


Readers either think the documentation is confusing or they don't; why
they need to understand the documentation is mostly irrelevant. I
find the documentation to be confusing, and I suspect I am not the
only one.

In general, Lisp programs are well advised to stay away from
manipulating unibyte strings, and definitely to refrain from comparing
unibyte and multibyte strings -- because these are supposed to be
never needed in Lisp applications, and because doing TRT with those
requires non-trivial knowledge of the Emacs internals.


I disagree with "well advised". The documentation in 34.1 and 34.3
makes it sound like the representation is merely an internal elisp
implementation detail that programmers don't need to worry about,
unless they are doing something unusually low-level.

I consider binary data processing to be somewhat common, not
"unusually low-level". Yet manipulating byte values 128-255 in unibyte
strings, and characters with Unicode codepoints 128-255 in multibyte
strings, is fraught with peril. For example, it is risky to use `aref'
to read a character or `aset' to write a character unless you either
know the string representation or know that the character is not in
#x80-#xff or #x3fff80-#x3fffff.
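To make the hazard concrete, here is a minimal sketch. The function
name `my-demo-aref-values' is hypothetical and exists only for this
illustration; the values in the comments reflect my understanding of
how `aref' treats the two representations.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-demo-aref-values ()
  "Return what `aref' reads from three one-element strings."
  (let* ((raw (unibyte-string #xff))        ; unibyte string holding byte #xff
         (multi (string-to-multibyte raw))  ; same byte in a multibyte string
         (latin (string #xff)))             ; multibyte string holding U+00FF
    (list (aref raw 0)       ; 255: the byte value
          (aref multi 0)     ; 4194303 (#x3fffff): the eight-bit charset codepoint
          (aref latin 0))))  ; 255: the character U+00FF, not a byte
```

The first and third results are both 255 even though one is a raw byte
and the other is a Unicode character, while the second is the same raw
byte reported under a different number. Code that does not know the
representation cannot tell these cases apart from the `aref' result
alone.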


I see no reason to complicate the documentation for the very rare
occasions where these issues unfortunately leak to
higher-than-expected levels.


I don't think the occasions are all that rare.  But even if they are,
the precise behavior should be documented somewhere so that
programmers who need low-level string manipulation can do so
correctly.  I would argue that programmers using `string-to-unibyte'
or `string-to-multibyte' fall into that category.

@@ -271,20 +271,19 @@ Converting Representations
  @defun string-to-multibyte string
  This function returns a multibyte string containing the same sequence
  of characters as @var{string}.  If @var{string} is a multibyte string,
-it is returned unchanged.  The function assumes that @var{string}
-includes only @acronym{ASCII} characters and raw 8-bit bytes; the
-latter are converted to their multibyte representation corresponding
-to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
-(@pxref{Text Representations, codepoints}).
+it is returned unchanged.  Otherwise, byte values are transformed to
+their corresponding multibyte codepoints (@acronym{ASCII} characters
+and characters in the @code{eight-bit} charset).  @xref{Text
+Representations, codepoints}.


This loses information, so I don't think we should make this change.
It might be trivially clear to you that a unibyte string can only
contain ASCII and raw bytes, but it isn't necessarily clear to
everyone.


I still find the current wording to be confusing. To me, all bytes
have 8 bits so "raw 8-bit bytes" sounds bizarrely redundant. Also,
ASCII characters are encoded to bytes, yet "raw 8-bit bytes" is meant
to refer only to non-ASCII values. I have attached another revision
that I think is complete, correct, and easier to understand.
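
As a rough illustration of the mapping under discussion, the sketch
below converts a list of byte values with `string-to-multibyte' and
returns the resulting character codes. The helper name
`my-demo-to-multibyte-codes' is hypothetical, not part of Emacs or of
the attached patch.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-demo-to-multibyte-codes (bytes)
  "Return the character codes of BYTES after `string-to-multibyte'.
BYTES is a list of integers in the range 0-255."
  (append (string-to-multibyte (apply #'unibyte-string bytes)) nil))

;; ASCII bytes keep their value; bytes #x80-#xff are mapped to
;; #x3fff80-#x3fffff:
;; (my-demo-to-multibyte-codes '(#x41 #x80 #xff))
;; => (65 4194176 4194303)
```

Here 4194176 is #x3fff80 and 4194303 is #x3fffff, which are the
endpoints of the codepoint range named in the current wording.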

Thanks,
Richard

0001-Clarify-documentation-of-string-to-multibyte.patch
Description: Text Data
