emacs-devel
Re: Unicode confusables and reordering characters considered harmful


From: Reini Urban
Subject: Re: Unicode confusables and reordering characters considered harmful
Date: Thu, 4 Nov 2021 08:50:14 +0100



On Wed, Nov 3, 2021 at 4:43 PM Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> No, this summary is awful.
>> The issue is that libc, the C standard committee, linux and most others are
>> ignoring the unicode identifier security guidelines.
>> Identifiers must be identifiable, but strings should not be touched.
>
> What do those rules say about code like:
>
>     int hi = 5;
>     int שָׁלוֹם = hi;
>     int hello = 10;
>     int السّلامعليك = hello;
>     myfun(שָׁלוֹם ,السّلامعليكم)
>
> IMO this code is fundamentally valid: we should allow
> programmers to write identifiers in their native tongue.

Sure, nobody wants to forbid Unicode identifiers. The rules only ensure that identifiers remain identifiable.
I converted it to Perl (because I dislike Java and Rust), and ran it through cperl.
The problem is that at an innocent glance, or in a code review, you won't see any problem, hence the security risk.
You need to adjust your tools.

But the very first RTL identifier, שָׁלוֹם, already contains non-identifier characters.
So I cannot tell you whether this code violates any of the 4 Unicode mixed-script profiles (http://www.unicode.org/reports/tr39/#Mixed_Script_Detection 2-5),
or whether all of the unreadable characters belong to the recommended scripts
(https://www.unicode.org/reports/tr31/#Table_Recommended_Scripts), i.e. no exotic or antique scripts.
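
To make the mixed-script idea concrete, here is a minimal sketch in Python. It is not an implementation of the TR39 profiles: Python's standard library does not expose the Unicode Script property, so the hypothetical helper `script_guess` approximates the script by the first word of the character name. That is a crude assumption which only works for simple cases, and the function names are made up for illustration.

```python
import unicodedata


def script_guess(ch):
    """Crude stand-in for the Script property: first word of the character name.

    Assumption for illustration only; real TR39 checks use Scripts.txt and
    ScriptExtensions.txt, which the standard library does not provide.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(identifier):
    """Return the set of guessed scripts for the letters and marks in identifier."""
    found = set()
    for ch in identifier:
        # Only letters (L*) and marks (M*) carry a script here; digits,
        # underscores and other punctuation are treated as script-neutral.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        guess = script_guess(ch)
        if guess == "COMBINING":
            # Generic combining marks inherit the script of their base.
            continue
        found.add(guess)
    return found


def is_single_script(identifier):
    """True if all letters and marks appear to come from one script."""
    return len(scripts_in(identifier)) <= 1


if __name__ == "__main__":
    for ident in ("hello", "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd", "p\u0430ypal"):
        print(ascii(ident), sorted(scripts_in(ident)), is_single_script(ident))
```

The last sample mixes a Cyrillic "а" (U+0430) into a Latin word, which is the kind of confusable a single-script rule is meant to catch. Both RTL identifiers from the example are single-script under this heuristic, which says nothing yet about whether each individual character is allowed in an identifier.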

http://perl11.github.io/cperl/perldata.html#Identifier-parsing


$hi = 5;
$שָׁלוֹם = $hi;
$hello = 10;
$السّلامعليك = $hello;
myfun($שָׁלוֹם, $السّلامعليك);

=> od -c
0000000   $   h   i       =       5   ;  \n   $ 327 251 326 270 327 201
0000020 327 234 327 225 326 271 327 235       =       $   h   i   ;  \n
0000040   $   h   e   l   l   o       =       1   0   ;  \n   $ 330 247
0000060 331 204 330 263 331 221 331 204 330 247 331 205 330 271 331 204
0000100 331 212 331 203       =       $   h   e   l   l   o   ;  \n   m
0000120   y   f   u   n   (   $ 327 251 326 270 327 201 327 234 327 225
0000140 326 271 327 235   ,       $ 330 247 331 204 330 263 331 221 331
0000160 204 330 247 331 205 330 271 331 204 331 212 331 203   )   ;  \n
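
The dump is easier to judge once the bytes are turned back into code points. A small Python sketch, assuming the octal values of the first RTL identifier are copied by hand from the dump above (the names `OCTAL`, `decode_octal` and `describe` are hypothetical):

```python
import unicodedata

# Octal bytes of the first RTL identifier, copied from the od -c dump
# (everything between the first `$` and the following ` = `).
OCTAL = "327 251 326 270 327 201 327 234 327 225 326 271 327 235"


def decode_octal(dump):
    """Turn a whitespace-separated list of octal byte values into a str (UTF-8)."""
    raw = bytes(int(tok, 8) for tok in dump.split())
    return raw.decode("utf-8")


def describe(text):
    """List (code point, general category, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "?"))
        for ch in text
    ]


if __name__ == "__main__":
    for row in describe(decode_octal(OCTAL)):
        print(*row)
```

The 14 bytes decode to seven code points: four letters (category Lo) and three nonspacing marks (category Mn). Whether those marks are acceptable inside an identifier depends on the profile the tool enforces. The sketch only makes them visible, which a rendered view of the source does not.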


> Do the security guidelines require override chars to force the
> `, ` to be in LTR, so as to fix the ordering problem (and would the
> result be more or less clear to someone familiar with those RTL
> scripts ;-0 )?
>
>
>         Stefan
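
The ordering question comes down to the bidirectional classes of the characters involved. A hedged Python sketch that shows those classes and scans a line for explicit bidi formatting characters; the set `BIDI_FORMAT_CHARS` and both function names are assumptions made for illustration, not part of any guideline:

```python
import unicodedata

# Explicit bidirectional formatting characters (embeddings, overrides,
# isolates and directional marks).
BIDI_FORMAT_CHARS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
}


def bidi_classes(text):
    """Return the Bidi_Class of every character, e.g. 'R', 'AL', 'CS', 'WS'."""
    return [unicodedata.bidirectional(ch) for ch in text]


def find_bidi_controls(text):
    """Return (index, code point) for each explicit bidi formatting character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_FORMAT_CHARS
    ]


if __name__ == "__main__":
    # Last Hebrew letter, the separator, first Arabic letter.
    print(bidi_classes("\u05dd, \u0627"))
    print(find_bidi_controls("a\u202eb"))
```

The comma and the space have weak and neutral classes (CS and WS), so when they sit between a Hebrew letter (R) and an Arabic letter (AL) a bidi-aware display typically treats the whole stretch as one right-to-left run. That is why the two arguments appear swapped on screen even though the source contains no override characters at all.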



--
Reini Urban
