From: L A Walsh
Subject: Re: Corrupted multibyte characters in command substitutions fixes may be worse than problem.
Date: Sat, 05 Feb 2022 18:41:00 -0800
User-agent: Thunderbird

On 2022/01/02 17:43, Frank Heckenbach wrote:
> Chet Ramey wrote:
>
>>> After all, we're talking about silent data corruption, and now I
>>> learn the bug is known for almost a year, the fix is known and still
>>> hasn't been released, not even as an official patch.
>> If you use the number of bug reports as an indication of urgency,
>
> Why would you? Aren't you able to assess the severity of a bug
> yourself? Silent data corruption is certainly one of the most severe
> kind of bugs ...

That's debatable, BTW.  I was reminded of a similar case: bash passed
through what one might call 'invalid input' w/o warning, and code
relied on that in a specific circumstance.  A later change made bash
issue a warning instead, which broke code that, up to that point, had
worked as expected.

Specifically, it involved reading a value, typically in the range
50 <= x <= 150, from an active file (like a value from /proc that varies
based on OS internal values) where the data was stored in a
quad, or little-endian DWORD value, so the value was in the
2 least significant bytes with the most significant bytes following
(in a higher position) in memory, like:
Byte# => 00 01 02 03, for value 100 decimal:
hex   => 64 00 00 00

The working code expected to see 0x64 followed by 0x00 which it
used as string terminator.
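
To make the byte layout concrete, here is a small Python sketch (not the
original program; the variable names are made up for illustration, and the
buffer is built in memory rather than read from a file).  It packs 100 as
a little-endian 32-bit value and shows what a reader sees if it treats the
first 0x00 as a string terminator:

```python
import struct

# Pack 100 as a 32-bit little-endian unsigned integer (a "DWORD").
raw = struct.pack("<I", 100)

# View the buffer as a C-style string: everything before the first NUL.
as_c_string = raw.split(b"\x00", 1)[0]

# Decode the whole buffer as the integer it actually stores.
(value,) = struct.unpack("<I", raw)

print(raw.hex(" "))   # 64 00 00 00
print(as_c_string)    # b'd'
print(value)          # 100
```

The single byte 0x64 is the ASCII character 'd', which is why a
NUL-terminated read of this buffer yields a one-character string.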

Chet "fixed" this silent use of 0x00 as a string terminator so that bash no
longer ignores it but issues a warning message, which caused the
"read < fn" to fail and return 0 instead of the ASCII character 'd', which
the program had interpreted as the DPI value of the user's screen.

It took some debugging and hack-arounds to find another way to access
the data.  So what some might have called silent data corruption, because
bash silently treated the NUL byte as a string terminator, my program
took as logical behavior.  I complained about the change, remarking that
if bash was going to sanitize returned values (in that case, checking what
should have been an ASCII value, with NUL not being among the allowed
string characters), then bash might also be saddled with checking for
invalid Unicode sequences and warning about them as well, regardless of
the source of the corruption.  Some programs might expect to get a raw
byte sequence rather than some encoded form, with the difference in
interpretation causing noticeable bugs.

For example, the name part of an email address
that Chet replied to was "Ángel", where the first char in an 8-bit
Latin code page is a "Latin Capital Letter A with Acute".
While this worked and was passed through as a binary 0xc1 in perl 5.8.0,
it was "fixed" in 5.8.1 and later to result in the binary being
translated into perl's internal form as U+00C1.  On output, that
0xc1 gets translated to a binary 0xC181, which is invalid
Unicode (it should have been 0xc381), but it's written to the first byte
position, so the error is propagated throughout the field as

In the 5.8.0 version perl's non-conversion of the 8-bit latin input
resulted in a working filter.  The fixed version resulted in
the widely touted "perl-unicode bug", which exists to this day (for
backwards compatibility).
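
For reference, the byte values involved can be checked with a few lines of
Python.  This is only a sketch of the encodings themselves; it does not
reproduce perl's behavior, and the variable names are mine:

```python
# The 8-bit Latin-1 byte for "Latin Capital Letter A with Acute".
latin1_byte = bytes([0xC1])

# Decoding it as Latin-1 yields the code point U+00C1.
char = latin1_byte.decode("latin-1")

# Encoding that code point as UTF-8 yields two bytes.
utf8_bytes = char.encode("utf-8")

print(f"U+{ord(char):04X}")   # U+00C1
print(utf8_bytes.hex(" "))    # c3 81

# A strict UTF-8 decoder rejects the sequence C1 81.
try:
    bytes([0xC1, 0x81]).decode("utf-8")
    c1_81_valid = True
except UnicodeDecodeError:
    c1_81_valid = False
print(c1_81_valid)            # False
```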

So silently returning values as-is, without modifying them, may result
in working code, but modifying the returned values after programs have
been written that already depend on the literal byte stream can cause a
different set of annoying problems.  In that conversation, the idea of
sanitizing UTF-8 input was raised, but as a costly endeavor for existing code.
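
As a rough illustration of the two options (checking input versus passing
bytes through untouched), here is a hypothetical Python sketch.  The
helper names is_valid_utf8 and roundtrip_raw are invented for this example
and have nothing to do with how bash or perl handle it internally:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def roundtrip_raw(data: bytes) -> bytes:
    """Decode and re-encode while preserving undecodable bytes as-is."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# First sample is UTF-8 "Ángel"; second is the same name in 8-bit Latin-1.
samples = [b"\xc3\x81ngel", b"\xc1ngel"]
for s in samples:
    print(s, is_valid_utf8(s), roundtrip_raw(s) == s)
```

The check flags the Latin-1 form as invalid, while the pass-through path
returns both inputs byte-for-byte, which is what a program that depends on
the literal byte stream would need.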
