Re: Unicode confusables and reordering characters considered harmful

From: Reini Urban

Subject: Re: Unicode confusables and reordering characters considered harmful

Date: Wed, 3 Nov 2021 16:07:51 +0100

On Tue, Nov 2, 2021 at 4:08 PM Clément Pit-Claudel <cpitclaudel@gmail.com> wrote:

> There is a good summary of the issue and relevant mitigations at https://research.swtch.com/trojan (it argues against compiler fixes and in favor of IDE enhancements).

No, this summary is awful.

The issue is that libc, the C standard committee, Linux and most others are ignoring the Unicode identifier security guidelines.

Identifiers must be identifiable, but strings should not be touched.

Identifiers are all names, pathnames, variable names, user names, ... but not arbitrary strings.

IDEs are just one place to fix it (that's why glib does it), but the core is more important.

The ones who do care, like Java (the compiler), my cperl (the compiler and runtime, because it is dynamic), Rust (the compiler) and glib (the library), do follow these guidelines.

All C compilers and most others are insecure. Linux filesystems are insecure. The old Apple filesystem was secure, the new one is insecure again.

Also, the libcs cannot deal with de-normalized characters at all. grep, sed and coreutils all have outstanding unorm patches, because libunicode is too slow: it iterates over the string via callbacks.
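To illustrate what "de-normalized" means here, a minimal sketch in Python (chosen only because its standard library ships a normalizer; the helper name same_name is made up for this example). The same visible name can be stored as two different code point sequences, and a plain comparison treats them as different:

```python
import unicodedata

composed = "caf\u00e9"       # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A naive comparison, like a bytewise strcmp, sees two different strings.
raw_equal = composed == decomposed

# After normalization the two spellings compare equal.
normalized_equal = same_name(composed, decomposed)
```

Both strings render identically, which is why tools that compare raw bytes can miss or duplicate such names.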

In short, you need to normalize each identifier, check for proper XID_Start/XID_Continue, check your document for mixed scripts (several combinations are allowed, several disallowed; Han unification did a good job, but Greek vs. Cyrillic is the worst), and forbid bidi changes.
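The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in cperl, safeclib or any compiler. Python's str.isidentifier() stands in for the XID_Start/XID_Continue check (it additionally accepts a leading underscore). The script of a character is guessed from the first word of its Unicode name, because the Python standard library does not expose the Script property. Any mix of scripts is flagged, whereas the real guidelines allow several combinations. The names check_identifier and guess_script are made up here.

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def guess_script(ch):
    # ASSUMPTION: crude stand-in for the real Script property, using the
    # first word of the character name ("LATIN", "GREEK", "CYRILLIC", ...).
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def check_identifier(name):
    """Return (normalized name, list of problems found)."""
    problems = []
    # 1. Forbid bidi changes.
    if any(unicodedata.bidirectional(ch) in BIDI_CONTROLS for ch in name):
        problems.append("bidi control")
    # 2. Normalize the identifier.
    normalized = unicodedata.normalize("NFC", name)
    # 3. Check XID_Start followed by XID_Continue characters.
    if not normalized.isidentifier():
        problems.append("not XID_Start XID_Continue*")
    # 4. Check for mixed scripts (simplified: any mix is reported).
    scripts = {guess_script(ch) for ch in normalized if ch.isalpha()}
    if len(scripts) > 1:
        problems.append("mixed scripts: " + ", ".join(sorted(scripts)))
    return normalized, problems
```

For example, check_identifier("p\u0430ypal"), where the second letter is CYRILLIC SMALL LETTER A, reports mixed Cyrillic and Latin, although the name passes the XID check.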

The C standard committee recently complained that making identifiers secure would require the full Unicode database, which is wrong.

You need the normalization code (one or two tiny tables), the script lists (tiny), and the XID_Start/Continue lists (small).

Further, you need an API to start a document (to initialize the scripts), with an optional script parameter (the language).
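One possible shape for such a per-document API, as a hedged sketch only: the class name IdentifierContext, the method accept and the policy itself are assumptions made for illustration. Here Latin is always allowed; without a declared script, the first other script seen becomes the document's script; with a declared script, only Latin plus that script pass. Script detection again uses the first word of the Unicode character name as a stand-in for the real Script property.

```python
import unicodedata

def script_of(ch):
    # ASSUMPTION: approximates the Script property via the character name.
    return unicodedata.name(ch, "UNKNOWN").split()[0]

class IdentifierContext:
    """Per-document state: the set of scripts identifiers may use."""

    def __init__(self, script=None):
        self.allowed = {"LATIN"}
        if script is not None:
            self.allowed.add(script.upper())
        # Once locked, no further script may be added to the document.
        self.locked = script is not None

    def accept(self, name):
        scripts = {script_of(ch) for ch in name if ch.isalpha()}
        new = scripts - self.allowed
        if not new:
            return True
        if self.locked or len(new) > 1:
            return False
        # First non-Latin script seen: adopt it for the whole document.
        self.allowed |= new
        self.locked = True
        return True
```

A caller would create one context per source file and pass every identifier through accept(); a file that already uses Greek names would then reject a later Cyrillic one.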

Scripts need just one byte each, the Start/Continue flags two bits. Sorted lists are the best representation (musl does it unsorted, glibc uses an insecure table lookup).
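A minimal sketch of the sorted-list representation with two flag bits per range, looked up by binary search. The table below is a tiny hand-picked excerpt for illustration, not the real XID_Start/XID_Continue data, and the names xid_flags and is_identifier are made up for this example.

```python
from bisect import bisect_right

START, CONT = 0b01, 0b10   # the two bits: XID_Start, XID_Continue

# (first, last, flags), sorted by first code point.
# ASSUMPTION: small excerpt only; a real table is generated from the
# Unicode data files.
RANGES = [
    (0x0030, 0x0039, CONT),          # 0-9
    (0x0041, 0x005A, START | CONT),  # A-Z
    (0x005F, 0x005F, CONT),          # _ (not XID_Start in Unicode itself)
    (0x0061, 0x007A, START | CONT),  # a-z
    (0x00B7, 0x00B7, CONT),          # MIDDLE DOT
    (0x00C0, 0x00D6, START | CONT),  # Latin-1 capital letters
    (0x0300, 0x0304, CONT),          # a few combining marks
    (0x0391, 0x03A1, START | CONT),  # Greek capital Alpha..Rho
]
FIRSTS = [r[0] for r in RANGES]

def xid_flags(cp):
    """Binary search for the range containing cp; 0 if there is none."""
    i = bisect_right(FIRSTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return 0

def is_identifier(s):
    """True if s is XID_Start followed by XID_Continue characters."""
    if not s or not xid_flags(ord(s[0])) & START:
        return False
    return all(xid_flags(ord(ch)) & CONT for ch in s[1:])
```

The lookup costs O(log n) comparisons and needs no per-code-point table. Languages that allow a leading underscore add that rule on top of the Unicode properties.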

gnulib is really the best place to add these features, even if libunicode is too slow.

I started adding u8id support to my safeclib and my ctl two years ago, but have been too busy lately. It works fine and fast enough in Rust, Java and cperl.

I have good support in the wchar_t part of safeclib (wcsnorm, wcsfc, but no scripts), but not in the u8 part yet. glibc and musl don't care about u8.