bug-bash

GNU Bash 4.4 Test Discrepancy on OpenVMS


From: Eric W. Robertson
Subject: GNU Bash 4.4 Test Discrepancy on OpenVMS
Date: Fri, 7 Oct 2016 12:54:32 -0400

While building and testing GNU Bash 4.4 on OpenVMS, the Bash test suite reported a difference between the output produced by Bash on OpenVMS and the reference output for the test sub-script tests/exp8.sub (lines 28-31):

unset array
declare -A array
array=( [$'x\001y\177z']=$'a\242b\002c' )
echo ${array[@]@A}

Currently, the reference result expected for ALL platform implementations for the above sequence of Bash test commands is embodied in tests/exp.right (line 236):

declare -A array=([$'x\001y\177z']=$'a\242b\002c' )

On OpenVMS, the following output is generated instead:

declare -A array=([$'x\001y\177z']=$'a¢b\002c' )

After studying the applicable sections of the relevant ISO and POSIX standards and inspecting Bash's execution within the OpenVMS Debugger, I have come to the conclusion that this difference arises from an implementation-dependent difference in the locale-dependent characteristics of characters in the C/POSIX locale. The relevant ISO and POSIX standards explicitly DO NOT specify any particular requirements of the C/POSIX locale regarding locale-dependent characteristics for character codes outside of the Portable Character Set (PCS). Therefore, any programmed behavior that relies on locale-dependent characteristics is subject to implementation differences for character codes outside of PCS in the C/POSIX locale.

Using the OpenVMS Debugger, it became apparent that the expansion of the shell variable "array" ultimately results in a call to the function ansic_quote() (located in the source module lib/sh/strtrans.c). The relevant excerpt from this function is:

  for (s = str; c = *s; s++)
    {
      b = l = 1;        /* 1 == add backslash; 0 == no backslash */
      clen = 1;

      switch (c)
        {
        case ESC: c = 'E'; break;
#ifdef __STDC__
        case '\a': c = 'a'; break;
        case '\v': c = 'v'; break;
#else
        case 0x07: c = 'a'; break;
        case 0x0b: c = 'v'; break;
#endif

        case '\b': c = 'b'; break;
        case '\f': c = 'f'; break;
        case '\n': c = 'n'; break;
        case '\r': c = 'r'; break;
        case '\t': c = 't'; break;
        case '\\':
        case '\'':
          break;
        default:
#if defined (HANDLE_MULTIBYTE)
          b = is_basic (c);
          /* XXX - clen comparison to 0 is dicey */
          if ((b == 0 && ((clen = mbrtowc (&wc, s, MB_CUR_MAX, 0)) < 0 || MB_INVALIDCH (clen) || iswprint (wc) == 0)) ||
              (b == 1 && ISPRINT (c) == 0))
#else
          if (ISPRINT (c) == 0)
#endif
            {
              *r++ = '\\';
              *r++ = TOCHAR ((c >> 6) & 07);
              *r++ = TOCHAR ((c >> 3) & 07);
              *r++ = TOCHAR (c & 07);
              continue;
            }
          l = 0;
          break;
        }
      if (b == 0 && clen == 0)
        break;

      if (l)
        *r++ = '\\';

      if (clen == 1)
        *r++ = c;
      else
        {
          for (b = 0; b < (int)clen; b++)
            *r++ = (unsigned char)s[b];
          s += clen - 1;    /* -1 because of the increment above */
        }
    }

In the Bash build for OpenVMS, the macro HANDLE_MULTIBYTE is defined by the Bash configure script. Given that, the excerpt above shows that the decision whether to quote a particular character code in the expanded string is determined by the results of is_basic(), mbrtowc(), iswprint(), and isprint() (the last indirectly, through expansion of the ISPRINT() function macro). The is_basic() function appears to be coded so that it returns homogeneous results across platform implementations. The results of the remaining functions, however, are locale dependent. Therefore, for character codes outside of PCS, the ANSI C quoting of the expanded string is ultimately implementation dependent.
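To make the escaping branch concrete, here is a minimal sketch of the three-digit octal encoding that the excerpt emits when a character is judged non-printable. The helper name octal_escape is hypothetical and not part of Bash, and TOCHAR is assumed to map a digit value to its character (written here as '0' + digit).

```c
/* Hypothetical helper, not part of Bash: writes the four-character
   sequence \ooo for byte c into out and NUL-terminates it.
   out must have room for at least 5 chars. */
void
octal_escape (unsigned char c, char *out)
{
  out[0] = '\\';
  out[1] = (char) ('0' + ((c >> 6) & 07));   /* high octal digit */
  out[2] = (char) ('0' + ((c >> 3) & 07));   /* middle octal digit */
  out[3] = (char) ('0' + (c & 07));          /* low octal digit */
  out[4] = '\0';
}
```

For the byte 0xA2 this produces \242, the form that appears in tests/exp.right. A platform whose C/POSIX locale classifies that byte as printable never reaches this branch and copies the raw byte instead, which matches the output seen on OpenVMS.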

Since the octal character code 242 used in defining the value of the "array" shell variable is clearly outside of PCS, the result of expanding the shell variable in this case cannot be guaranteed to be homogeneous across all platform implementations. Yet that is how both the test script and the reference results are currently posed.
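One way to check how a given platform classifies such a byte is a small probe along the following lines. This is only a sketch: the function names are made up for illustration, and it inspects the single-byte isprint() path only, not the mbrtowc()/iswprint() path.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical probe: returns 1 if byte c is classified as printable
   in the "C" locale on this platform, 0 otherwise. */
int
printable_in_c_locale (unsigned char c)
{
  setlocale (LC_ALL, "C");
  return isprint (c) != 0;
}

/* Prints the classification of one byte in octal notation. */
void
report_byte (unsigned char c)
{
  printf ("\\%03o: %s\n", (unsigned int) c,
          printable_in_c_locale (c) ? "printable" : "not printable");
}
```

Calling report_byte (0242) on two platforms and comparing the lines is enough to see whether they would disagree on this test case. The answer for bytes inside PCS, such as 'a', should not vary.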

This naturally prompts a couple of questions: Is this in fact a bug? And if so, precisely where is the bug? Given what I know at the moment, my own answer is that if it is a bug, it is in the test script and its corresponding reference results, which are not posed to handle the platform implementation differences that the applicable standards explicitly permit for character codes outside of PCS in the C/POSIX locale. However, I cannot be entirely certain of this conclusion, because exp8.sub contains no commentary on the motivation behind the above sequence of Bash test commands, or on what significance (if any) the octal character code 242 has for the goal of this test. So I will leave it to the Bash experts to make a final, authoritative determination with respect to this Bash test discrepancy.

While investigating this test discrepancy with Bash 4.4 on OpenVMS I came across another potential source code bug relating to the expansion of the ISPRINT() function macro. The expansion of the ISPRINT() function macro is, in turn, partially dependent on the expansion of the IN_CTYPE_DOMAIN() function macro. In the source code module include/chartypes.h, the function macro IN_CTYPE_DOMAIN() does not seem to be correctly defined for platforms not providing the isascii() function. Given the normative definition of the isascii() function in "The Open Group Base Specifications Issue 7 (IEEE Std 1003.1-2008) 2016 Edition", the current definition of the IN_CTYPE_DOMAIN() function macro (as the literal constant expression 1) is unlikely to result in any close approximation of correct behavior for most platforms not implementing the isascii() function. Instead, I believe the IN_CTYPE_DOMAIN() function macro would be better defined as follows:

#if STDC_HEADERS || (!defined (isascii) && !HAVE_ISASCII)
#  define IN_CTYPE_DOMAIN(c) ((c & (((int)-1)<<7)) == 0)
#else
#  define IN_CTYPE_DOMAIN(c) isascii(c)
#endif

For platforms that do not implement the isascii() function the above definition for the IN_CTYPE_DOMAIN() function macro is more likely to produce correct behavior than its current definition in the Bash 4.4 release.
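As an illustration of what the mask-based test accepts, the following sketch expresses the same 0..127 range check as a function. The names in_ascii_domain and safe_isprint are hypothetical, and the mask is spelled ~0x7f rather than ((int)-1)<<7 because left-shifting a negative value is undefined behavior in C; on two's complement platforms the two spellings are intended to select the same bits.

```c
#include <ctype.h>

/* Hypothetical stand-in for IN_CTYPE_DOMAIN: nonzero only when c lies
   in the 7-bit range 0..127, i.e. no bit above the low seven is set. */
int
in_ascii_domain (int c)
{
  return (c & ~0x7f) == 0;
}

/* Sketch of an ISPRINT-style guard (assumption: the real macro combines
   the domain check with isprint in a similar way): isprint is consulted
   only for values inside the 7-bit domain. */
int
safe_isprint (int c)
{
  return in_ascii_domain (c) && isprint (c) != 0;
}
```

With this check, a value such as 0xA2 is rejected before isprint() is ever called, whereas a definition of the domain test as the constant 1 passes every value through to isprint().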

As always, any additional wisdom or feedback regarding the above is greatly appreciated.

Thanks,

Eric




