Re: gawk regex stuff you may want

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: gawk regex stuff you may want

From:	Paul Eggert
Subject:	Re: gawk regex stuff you may want
Date:	Thu, 21 Jan 2016 09:08:37 -0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0

On 01/20/2016 11:59 AM, Aharon Robbins wrote:


I don't see that. In regex.h:

| struct re_pattern_buffer
| {
|   /* Space that holds the compiled pattern.  It is declared as
|      `unsigned char *' because its elements are sometimes used as
|      array indexes.  */
|   unsigned char *__REPB_PREFIX(buffer);

Ah, that's probably the problem, and is why your version needs extracasts between unsigned char * and re_dfa_t *. In gnulib, that last linereads 'struct re_dfa_t *__REPB_PREFIX(buffer);', which is what thebuffer really is, and avoids the need for casts later.

Since malloc (0) is well-defined to return either NULL or a valid pointer
to a zero-byte object that can be freed, the code is working as-is.

Not on older systems. Although yes, I'm not sure I still support such
systems.  I don't know which checker complained. I think it was valgrind
at some point in the past.

Those older systems are long dead, as this requirement was standardizedin C89 and even the oldest systems we formerly supported (SunOS 4, if Irecall) conformed to the standard long ago. valgrind 3.11 does notcomplain about simple uses like free (malloc (0)) on my platform (Fedora23 x86-64). Perhaps older versions of valgrind complain on someplatforms, but I'd rather not worry about that; it's really a valgrind bug.

I'd rather look for a solution that involves silencing the checking
without making the code bulkier and typically slower.

This is in compilation, not execution; not sure the difference
is really noticable.

There is some runtime slowdown and code bloat. You're right, it's notsignificant, but it's the principle of the thing: I'd rather not bloatthe code just to pacify a busted checker.

If a byte value given inside [...] is greater than 127, it should
be left alone to be matched as-is (no arbitrary limits).

Hah! We recently ran into a similar problem with 'grep'. Previously,part of the grep code assumed that unibyte locales cannot have encodingerrors, so in a unibyte locale one can just use the traditionalbyte-oriented algorithm without worrying about mbrlen and the like.Other parts of the grep code checked for encoding errors in the usualway, e.g., with mbrlen. Unfortunately, the assumption about unibytelocales is incorrect, and the inconsistencies between the differentparts of 'grep' caused confusing and arguably incorrect behavior whichtook a bit of hackery to resolve; see the thread starting at<http://bugs.gnu.org/20526#86>.

As far as 'grep' is concerned, it'll trust what regcomp does here, so wedo have some freedom to change the code in this area. However, it looksto me like your patch would do the wrong thing for unibyte locales wherebtowc (b) returns a value that neither b nor WEOF. Also, the rest thecode assumes that if btowc returns WEOF in a multibyte locale then therewon't be a match (see the setup code in init_dfa, and I have the naggingfeeling that this assumption is embedded elsewhere). So, how about theattached more-conservative patch instead?

        /* Build single byte matching table for this equivalence class.  */
+      char_buf[1] = (unsigned char) '\0';

This should be unnecessary, as the rest of the code shouldn't be looking
at that byte. Is this something to pacify a lint checker?

Or valgrind at some point. I think a static checker that someone submitted
a report from.

It might be helpful to get at the bottom of this. If it's some buggystatic checker then I would rather not worry about it.

-       wc = wc2;
+       wc = (wint_t) wc2;

wc and wc2 are both integers so the cast is unnecessary. Similarly elsewhere.

There was a compiler (maybe VMS) for which this was necessary.

Again, it'd be helpful to know what the problem actually was. I expectit was just a warning, which is fine to ignore. It's valid standard Ccode, for what it's worth.

0001-regex-treat-x-as-x-if-x-is-a-unibyte-encoding-error.patch
Description: Source code patch

[Prev in Thread]

Current Thread

[Next in Thread]

gawk regex stuff you may want, Aharon Robbins, 2016/01/18
- Re: gawk regex stuff you may want, Paul Eggert, 2016/01/18
  - Re: gawk regex stuff you may want, Aharon Robbins, 2016/01/20
    - Re: gawk regex stuff you may want, Paul Eggert <=
    - Re: gawk regex stuff you may want, Aharon Robbins, 2016/01/23
    - Re: gawk regex stuff you may want, Paul Eggert, 2016/01/24
    - Re: gawk regex stuff you may want, Aharon Robbins, 2016/01/24

Prev by Date: Re: comment fix in hash-pjw-bare
Next by Date: Re: wcwidth on OpenBSD 5.8
Previous by thread: Re: gawk regex stuff you may want
Next by thread: Re: gawk regex stuff you may want
Index(es):
- Date
- Thread