emacs-devel

Re: Unicode confusables and reordering characters considered harmful


From: Eli Zaretskii
Subject: Re: Unicode confusables and reordering characters considered harmful
Date: Wed, 03 Nov 2021 19:24:29 +0200

> From: Reini Urban <reini.urban@gmail.com>
> Date: Wed, 3 Nov 2021 16:07:51 +0100
> 
> The issue is that libc, the C standard committee, Linux, and most
> others are ignoring the Unicode identifier security guidelines.
> Identifiers must be identifiable, but strings should not be touched.
>
> Identifiers are all names, pathnames, variable names, user names, ...
> but not arbitrary strings.  IDEs are just one place to fix it (that's
> why glib does it), but the core is more important.
>
> The ones who do care, like Java (the compiler), my cperl (the
> compiler and runtime, because it is dynamic), Rust (the compiler),
> and glib (the library), do follow these guidelines.  All C compilers
> and most others are insecure.  Linux filesystems are insecure.  The
> old Apple filesystem was secure; the new one is again insecure.
> Also, the libcs cannot deal with de-normalized characters at all.
> grep, sed, and coreutils all have outstanding unorm patches, because
> libunicode is too slow: it iterates over the string via callbacks.
>
> In short, you need to normalize each identifier, check for proper
> XID_Start/XID_Continue, check your document for mixed scripts
> (several combinations are allowed, several disallowed; Han
> unification did a good job, but Greek vs. Cyrillic is the worst),
> and forbid bidi changes.
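
For concreteness, the four checks listed in the quoted paragraph could
be sketched roughly as below.  This is only an illustration in Python,
not what any of the projects named above actually do.  The names
check_identifier and scripts_of are made up for this sketch, NFKC is
an assumed choice of normalization form, and the script test is a
crude heuristic based on character names rather than the real Unicode
Script property.  It relies on Python's str.isidentifier, which is
based on XID_Start/XID_Continue plus the underscore.

```python
import unicodedata

# Explicit bidi format controls: embeddings/overrides, isolates, marks.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e"   # LRE RLE PDF LRO RLO
    "\u2066\u2067\u2068\u2069"         # LRI RLI FSI PDI
    "\u200e\u200f\u061c"               # LRM RLM ALM
)


def scripts_of(identifier):
    """Guess scripts from character names (heuristic, three scripts only)."""
    found = set()
    for ch in identifier:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "GREEK", "CYRILLIC"):
            if name.startswith(script):
                found.add(script)
    return found


def check_identifier(raw):
    """Return a list of problems found in one identifier (empty if none)."""
    problems = []
    if any(ch in BIDI_CONTROLS for ch in raw):
        problems.append("bidi control")
    # Assumption: NFKC is the normalization form wanted here.
    normalized = unicodedata.normalize("NFKC", raw)
    if normalized != raw:
        problems.append("not normalized")
    if not normalized.isidentifier():
        problems.append("not XID_Start/XID_Continue")
    if len(scripts_of(normalized)) > 1:
        problems.append("mixed scripts")
    return problems
```

With this sketch, a plain ASCII name yields no problems, while a name
that mixes a Cyrillic letter into Latin text (for example "p\u0430yload")
is reported as mixed-script.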

I'm not sure I follow: the examples in the original paper which
sparked all this brouhaha didn't touch any identifiers.  All the
identifiers in those examples were perfectly compliant with the
Unicode guidelines, AFAIR.  What the examples did was insert
directional format controls so as to reorder _punctuation_ characters,
in a way that changes the visual appearance and the interpreted
semantics of the code.  All of the format controls were inserted
within whitespace, not inside any identifiers.

So I'm not sure how what you describe is relevant to the issue at
hand; could you perhaps explain?
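
To illustrate the distinction, here is a minimal sketch in Python.
The sample line is invented for this message and is not taken from
the paper; the helper name find_bidi_controls and the simple
identifier regex are likewise assumptions of the sketch.  It shows a
line whose identifiers are all plain ASCII, yet which still carries
directional format controls in the whitespace between tokens.

```python
import re

# Explicit directional embeddings, overrides, and isolates.
BIDI_CONTROLS = "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"

# Simplified identifier pattern: a letter or underscore, then word chars.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def find_bidi_controls(source):
    """Return (line, column, code point) for each bidi control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits


# Hypothetical source line: the controls sit only in whitespace.
sample = "if (is_admin) \u202e\u2066 { grant(); } \u2069\u202c\n"

identifiers = IDENTIFIER.findall(sample)
hits = find_bidi_controls(sample)
```

Every name in identifiers passes an ASCII-only test, so a check that
looks only at identifiers has nothing to object to, while the scan
over the whole line reports four format controls.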


