bug-bash


From: L A Walsh
Subject: Re: Corrupted multibyte characters in command substitutions fixes may be worse than problem.
Date: Sun, 06 Feb 2022 16:46:13 -0800
User-agent: Thunderbird




On 2022/02/06 09:26, Frank Heckenbach wrote:
>> On 2022/01/02 17:43, Frank Heckenbach wrote:
>>>
>>> Why would you? Aren't you able to assess the severity of a bug
>>> yourself? Silent data corruption is certainly one of the most severe
>>> kind of bugs ...
>> ---
>> That's debatable, BTW, as I was reminded of a similar
>> passthrough of what one might call 'invalid input' w/o warning,
>
> I think you misunderstood the bug. It was not about passing through
> invalid input or fixing it. It was about bash corrupting valid input
> (if an internal buffer boundary happened to fall within a UTF-8
> sequence)
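
To make the class of bug described in that quote concrete, here is a minimal Python sketch. It is not bash's code; the two-chunk split and the variable names are assumptions chosen only to put a buffer boundary inside a two-byte UTF-8 sequence.

```python
import codecs

data = "aé".encode("utf-8")        # b'a\xc3\xa9'
chunks = [data[:2], data[2:]]      # boundary falls between 0xC3 and 0xA9

# Naive: each buffer is decoded on its own, so both halves of 'é'
# look invalid and are replaced with U+FFFD.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# Stateful: an incremental decoder carries the partial sequence
# over to the next buffer, so valid input survives intact.
decoder = codecs.getincrementaldecoder("utf-8")()
stateful = "".join(decoder.decode(c) for c in chunks)
stateful += decoder.decode(b"", final=True)

print(repr(naive))     # 'a\ufffd\ufffd'
print(repr(stateful))  # 'aé'
```

The input is valid UTF-8 in both cases; only the per-buffer handling differs.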

I see that the bug you reported had an entirely different cause.
The question I have is: if bash is returning input, should bash
scan that input for validity?  For example, suppose bash read these
values from a 'read' (spaces separate individual bytes):
1) The first case is relatively clear:
   read (len=2): 0x41 0x31
   returned:     A1
2) With trailing NUL bytes:
   read (len=4): 0x41 0x31 0x00 0x00
   returned:     ???  A1 or nothing?
                 error or warning message?
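
The candidate answers for case 2 can be written down as explicit policies. This is only a sketch in Python: the function name `handle_nul` and the policy names are hypothetical, and none of this is taken from bash's implementation.

```python
def handle_nul(data: bytes, policy: str) -> bytes:
    """Apply one hypothetical NUL-handling policy to raw input bytes."""
    if policy == "strip":
        # Drop every NUL byte and keep everything else.
        return data.replace(b"\x00", b"")
    if policy == "truncate":
        # C-string view: stop at the first NUL byte.
        return data.split(b"\x00", 1)[0]
    if policy == "reject":
        # Treat any NUL byte as an error and return nothing.
        if b"\x00" in data:
            raise ValueError("NUL byte in input")
        return data
    raise ValueError(f"unknown policy: {policy}")


raw = bytes([0x41, 0x31, 0x00, 0x00])
print(handle_nul(raw, "strip"))     # b'A1'
print(handle_nul(raw, "truncate"))  # b'A1'
```

For this input "strip" and "truncate" agree; they differ only when a NUL sits in the middle of the data, while "reject" corresponds to the "nothing, plus an error" outcome.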

Now take bash running with LC_CTYPE set to C.UTF-8 or en_US.UTF-8:
   read (len=1): 0xC3
i.e. Ã ('A' w/tilde in a legacy 8-bit Latin-compatible charset),
but invalid if bash honors the environment setting of en_US.UTF-8.

Should bash process it as legacy input or as invalid UTF-8?
Either way, what should it return? A UTF-8 char
(hex 0xC3 0x83) transcoded from the Latin value of A-tilde, or
the binary value unchanged (0xC3)?
Should it issue a warning message?  If it does, should
it return NUL for the value because the input was erroneous?

I.e. should bash try to scan input for validity? Should it treat such
input as a legacy ANSI or 8-bit charset, or should it try to decode
legacy input into Unicode if the environment indicates it should be
using Unicode values? On decode errors, should it issue a warning
message? If so, should it return the original unencoded value, NUL,
or a decoded Unicode value?
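
The options for the lone 0xC3 byte can be spelled out side by side. This is a Python sketch of the alternatives, not a description of what bash does; the variable names are illustrative.

```python
raw = b"\xc3"  # incomplete as UTF-8, but 'Ã' in ISO-8859-1 (Latin-1)

# Option 1: strict UTF-8 validation rejects the input.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Option 2: treat it as legacy Latin-1 and transcode to UTF-8.
transcoded = raw.decode("latin-1").encode("utf-8")

# Option 3: pass the byte through unchanged (lossless round trip).
passthrough = raw.decode("utf-8", errors="surrogateescape").encode(
    "utf-8", errors="surrogateescape"
)

# Option 4: sanitize, replacing the bad byte with U+FFFD.
sanitized = raw.decode("utf-8", errors="replace")

print(strict_ok)        # False
print(transcoded)       # b'\xc3\x83'
print(passthrough)      # b'\xc3'
print(repr(sanitized))  # '\ufffd'
```

Each option returns something different for the same input byte, so the choice has to be made consistently, whichever one it is.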

If bash is returning a value corrupted by a memory overlap (overlapping
stack values), should it be testing the returned value for validity
(especially if the environment suggests it should be returning Unicode
values)?

I.e. if there was corruption, either from reading a NUL unexpectedly
or from incorrectly encoded Unicode values, and warnings were "on",
the corruption might be noticed. Even if it is noticed, what should
bash return? A binary DWORD value that makes no sense as a string,
either ASCII or Unicode, like
   0x00 0x41 0x00 0xC1
-- maybe an attempt at 'AÁ' in UTF-16 on Windows? That is where my
original bug occurred: reading a registry value that could easily be
UTF-16 encoded, with the user shell running under Cygwin in a
Unicode C.UTF-8 environment.
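
How those four bytes read depends entirely on the encoding assumed for them. Here is a short Python sketch; trying both UTF-16 byte orders is my own assumption, since the original registry data is not available.

```python
raw = bytes([0x00, 0x41, 0x00, 0xC1])

# Big-endian UTF-16: U+0041 'A' followed by U+00C1 'Á'.
as_utf16_be = raw.decode("utf-16-be")

# Little-endian UTF-16: the same bytes become U+4100 and U+C100.
as_utf16_le = raw.decode("utf-16-le")

# UTF-8: 0xC1 can never appear in valid UTF-8, so strict decoding fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(as_utf16_be))  # 'AÁ'
print(repr(as_utf16_le))  # '䄀섀'
print(valid_utf8)         # False
```

Under a UTF-8 locale the value is both invalid and full of NUL bytes, so a validity check and a NUL check would each flag it.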

I.e. bash might be expected to return different results based on
the environment it was running in, the encoding that environment
specified, or whether bash was expecting the reduced ASCII character set.

What one thinks bash 'should do', and what environment it was running
in, can lead to very different results. That is why I balked at bash
issuing warnings in some cases and not others, and at the uncertainty
over whether it returned the original binary values or some sanitized
version.

At the time, due to the warning being issued, the read 'failed' and
a sanitized version was returned -- both responses preventing reading
the desired value.  If bash detected invalid Unicode sequences it might
help detect memory-based corruption, or might sanitize such sequences before
returning them -- either way possibly causing harm due to silence or due
to breaking compatibility.

Just thought it might be desirable to be consistent about what is done, or
to have it controlled via an option (be strict+warn or ignore+don't warn).

If it's decided that the default action is to ignore (don't test for
validity) and not issue a warning, then the warning for null bytes seems
like it should be removed -- in keeping with the idea of bash not testing
read input for validity.


which was very unhelpful.

Or more basically should
             based character set -- as in legacy input)
returned: ???  should bash return Ã (U+00C3) or hex bytes 0xC3 0x83
if




