Re: [PATCH] Improve error reporting when serializing non-Unicode strings

From: Philipp Stephani
Subject: Re: [PATCH] Improve error reporting when serializing non-Unicode strings to JSON
Date: Sat, 23 Dec 2017 15:19:17 +0000

On Sat, 23 Dec 2017 at 15:52, Eli Zaretskii <address@hidden> wrote:

> From: Philipp Stephani <address@hidden>
> Date: Sat, 23 Dec 2017 14:29:56 +0000
> Cc: address@hidden, address@hidden
>
> OK, but why do we need external functions for doing that? What is
> missing in our own code to detect such a situation?
>
> Not much I think, it's just easiest to use Gnulib functions because they are well-documented, have a clean
> interface, and are probably bug-free.
> coding.c has check_utf_8, which is quite similar, but has an incompatible interface (it takes struct
> coding_system objects) and also checks for embedded newlines, which isn't necessary here.

> So let's use check_utf_8, as its downsides don't sound serious to me,

Well, it would need to be rewritten significantly to take a char * and a length argument instead of the coding_system struct.

> and OTOH using unistring functions will bloat Emacs

u8-check.c is just 77 LoC (including all boilerplate, comments, and empty lines), so I don't think it blows up Emacs in any significant way.

> for the benefit of
> a single use case, not to mention create two different methods for
> doing the same job, which IMO is even more confusing to any newcomer
> to the Emacs internals.

Agreed, it's somewhat confusing, but I don't think excessively so. The two functions have quite different use cases: check_utf_8 is a specialized function that requires a coding system with significant set-up and is only used once (in decode_coding_gap), while u8_check is a general-purpose function.

Not having much experience with coding.c, I find the functions in that file much more confusing and harder to understand than the ones from libunistring. The libunistring functions tend to have a single, clear purpose, while the coding.c functions often do many different things at once.

> Btw, doesn't find_charsets_in_text do the same job cleaner and
> quicker? AFAIU, all you need is make sure there are no characters
> from the 2 eight-bit-* charsets in the text, or did I miss something?

What I need to check is one of the following:

- Is the initial string either a well-formed UTF-8 unibyte string, or a multibyte string that represents a Unicode scalar value sequence?

- Is the encoded string a well-formed UTF-8 unibyte string?

Given my understanding of the implementation of coding.c, these two criteria should be equivalent. (Unfortunately that doesn't seem to be documented.) So I chose to implement the second check, which is easier and allows delaying the check until we know we have to signal an error.
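
For illustration only, here is a minimal sketch of the second check with the (pointer, length) interface discussed above. It is not the patch and not Gnulib's code: first_invalid_utf8, classify_failure, and enum failure_kind are hypothetical names, and first_invalid_utf8 merely stands in for u8_check. It assumes the serializer has already failed on the encoded bytes, so the scan runs only on the error path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for u8_check.  Return NULL if the N bytes at S
   are well-formed UTF-8, otherwise a pointer to the first byte of the
   first ill-formed sequence.  */
static const uint8_t *
first_invalid_utf8 (const uint8_t *s, size_t n)
{
  assert (s != NULL || n == 0);

  size_t i = 0;
  while (i < n)
    {
      uint8_t b = s[i];
      size_t len;                   /* Total length of this sequence.  */
      uint8_t lo = 0x80, hi = 0xBF; /* Allowed range of the second byte.  */

      if (b < 0x80)
        {
          i++;                      /* ASCII.  */
          continue;
        }
      else if (b >= 0xC2 && b <= 0xDF)
        len = 2;
      else if (b >= 0xE0 && b <= 0xEF)
        {
          len = 3;
          if (b == 0xE0)
            lo = 0xA0;              /* Reject overlong forms.  */
          else if (b == 0xED)
            hi = 0x9F;              /* Reject surrogates.  */
        }
      else if (b >= 0xF0 && b <= 0xF4)
        {
          len = 4;
          if (b == 0xF0)
            lo = 0x90;              /* Reject overlong forms.  */
          else if (b == 0xF4)
            hi = 0x8F;              /* Reject values above U+10FFFF.  */
        }
      else
        return s + i;  /* Stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF.  */

      if (n - i < len)
        return s + i;               /* Truncated sequence.  */
      if (s[i + 1] < lo || s[i + 1] > hi)
        return s + i;
      for (size_t k = 2; k < len; k++)
        if ((s[i + k] & 0xC0) != 0x80)
          return s + i;
      i += len;
    }
  return NULL;
}

/* Hypothetical error kinds for a failed serialization.  */
enum failure_kind { FAILURE_NOT_UTF8, FAILURE_OTHER };

/* Meant to be called only after serializing the N encoded bytes has
   already failed; the successful path never pays for the scan.  */
static enum failure_kind
classify_failure (const uint8_t *encoded, size_t n)
{
  return (first_invalid_utf8 (encoded, n) != NULL
          ? FAILURE_NOT_UTF8 : FAILURE_OTHER);
}
```

A caller would then signal a wrong-type error for FAILURE_NOT_UTF8 and fall back to a generic error (for example, out of memory) for FAILURE_OTHER; which errors are actually signaled is up to the patch, not this sketch.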