[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: gawk regex stuff you may want
From: |
Paul Eggert |
Subject: |
Re: gawk regex stuff you may want |
Date: |
Thu, 21 Jan 2016 09:08:37 -0800 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0 |
On 01/20/2016 11:59 AM, Aharon Robbins wrote:
I don't see that. In regex.h:
| struct re_pattern_buffer
| {
| /* Space that holds the compiled pattern. It is declared as
| `unsigned char *' because its elements are sometimes used as
| array indexes. */
| unsigned char *__REPB_PREFIX(buffer);
Ah, that's probably the problem, and is why your version needs extra
casts between unsigned char * and re_dfa_t *. In gnulib, that last line
reads 'struct re_dfa_t *__REPB_PREFIX(buffer);', which is what the
buffer really is, and avoids the need for casts later.
Since malloc (0) is well-defined to return either NULL or a valid pointer
to a zero-byte object that can be freed, the code is working as-is.
Not on older systems. Although yes, I'm not sure I still support such
systems. I don't know which checker complained. I think it was valgrind
at some point in the past.
Those older systems are long dead, as this requirement was standardized
in C89 and even the oldest systems we formerly supported (SunOS 4, if I
recall) conformed to the standard long ago. valgrind 3.11 does not
complain about simple uses like free (malloc (0)) on my platform (Fedora
23 x86-64). Perhaps older versions of valgrind complain on some
platforms, but I'd rather not worry about that; it's really a valgrind bug.
I'd rather look for a solution that involves silencing the checking
without making the code bulkier and typically slower.
This is in compilation, not execution; not sure the difference
is really noticable.
There is some runtime slowdown and code bloat. You're right, it's not
significant, but it's the principle of the thing: I'd rather not bloat
the code just to pacify a busted checker.
If a byte value given inside [...] is greater than 127, it should
be left alone to be matched as-is (no arbitrary limits).
Hah! We recently ran into a similar problem with 'grep'. Previously,
part of the grep code assumed that unibyte locales cannot have encoding
errors, so in a unibyte locale one can just use the traditional
byte-oriented algorithm without worrying about mbrlen and the like.
Other parts of the grep code checked for encoding errors in the usual
way, e.g., with mbrlen. Unfortunately, the assumption about unibyte
locales is incorrect, and the inconsistencies between the different
parts of 'grep' caused confusing and arguably incorrect behavior which
took a bit of hackery to resolve; see the thread starting at
<http://bugs.gnu.org/20526#86>.
As far as 'grep' is concerned, it'll trust what regcomp does here, so we
do have some freedom to change the code in this area. However, it looks
to me like your patch would do the wrong thing for unibyte locales where
btowc (b) returns a value that neither b nor WEOF. Also, the rest the
code assumes that if btowc returns WEOF in a multibyte locale then there
won't be a match (see the setup code in init_dfa, and I have the nagging
feeling that this assumption is embedded elsewhere). So, how about the
attached more-conservative patch instead?
/* Build single byte matching table for this equivalence class. */
+ char_buf[1] = (unsigned char) '\0';
This should be unnecessary, as the rest of the code shouldn't be looking
at that byte. Is this something to pacify a lint checker?
Or valgrind at some point. I think a static checker that someone submitted
a report from.
It might be helpful to get at the bottom of this. If it's some buggy
static checker then I would rather not worry about it.
- wc = wc2;
+ wc = (wint_t) wc2;
wc and wc2 are both integers so the cast is unnecessary. Similarly elsewhere.
There was a compiler (maybe VMS) for which this was necessary.
Again, it'd be helpful to know what the problem actually was. I expect
it was just a warning, which is fine to ignore. It's valid standard C
code, for what it's worth.
0001-regex-treat-x-as-x-if-x-is-a-unibyte-encoding-error.patch
Description: Source code patch