help-gawk

Re: pcre in gawk?


From: Miriam English
Subject: Re: pcre in gawk?
Date: Fri, 28 Jul 2023 11:36:24 +1000

Thanks for your reply. Apologies for my lack of response.

I have been trying to get pcre working with gawk, but so far I've been
unsuccessful.

Instead, I've fallen back on some pretty simple hacks to do the job.
For example, to get the effect of a non-greedy match like .*?, I break
it into a two-step process. First I replace the first instance of the
target string with a character that's extremely unlikely to be found in
any text. I use hexadecimal character 1b, the old ASCII escape control
character (the ASCII "substitute" character is actually 1a, which would
be just as fitting). Then I do a normal greedy replace up to that
character.

echo "1 and 2 and 3" | awk '{sub(/and/,"\x1b");sub(/.*\x1b/,"")}1'

results in:
 2 and 3
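
For comparison, here is a minimal sketch of the same effect without a
placeholder character. It relies on awk's match() reporting the
leftmost match through RSTART and RLENGTH, so everything up to and
including the first occurrence can be cut off with substr(). The
function name strip_through_first is made up for illustration, and the
result only equals what .*? would give when the target itself is a
plain string such as "and", since awk still matches the target
leftmost-longest.

```shell
# Hypothetical helper: delete everything up to and including the first
# match of the regex given as $1 (the effect of PCRE's s/.*?REGEX//).
strip_through_first() {
  awk -v re="$1" '
    {
      # match() sets RSTART and RLENGTH for the leftmost match
      if (match($0, re))
        $0 = substr($0, RSTART + RLENGTH)
      print
    }'
}

printf '1 and 2 and 3\n' | strip_through_first 'and'
# prints " 2 and 3"
```

Lines without a match are printed unchanged, which is also what the
two sub() calls do.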

The only potential problem I see is in text that contains multi-byte
characters where one of the bytes is x1b. I'm not sure how to get
around that, or if there is a way to protect against it.
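
One possible safeguard, sketched below under some assumptions: check
each line for the placeholder byte with index() before using it, and
leave any line that already contains it alone. In UTF-8 this should
never trigger for multi-byte characters, because every byte of a
multi-byte UTF-8 sequence has its high bit set and 1b does not; other
encodings (ISO-2022-JP or UTF-16, for example) can contain that byte.
The function name strip_with_sentinel and the choice to pass such
lines through untouched are mine, not anything built into awk.

```shell
# Hypothetical helper: the two-step placeholder trick, except that lines
# already containing the placeholder byte are passed through untouched
# and reported on stderr instead of being silently mangled.
strip_with_sentinel() {
  awk -v target="$1" '
    BEGIN { sentinel = "\033" }   # ESC, hexadecimal 1b
    index($0, sentinel) {
      print "line " NR ": placeholder already present, skipped" > "/dev/stderr"
      print
      next
    }
    {
      sub(target, sentinel)       # mark the first occurrence of the target
      sub(".*" sentinel, "")      # greedy delete up to the single marker
      print
    }'
}

printf '1 and 2 and 3\n' | strip_with_sentinel 'and'
```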

Alternatively, a rare Unicode character, for example a cuneiform
character, could be used for the substitute character.

Thanks again,

 - Miriam


On Mon, 10 Jul 2023 22:25:25 +0000
Z <zilog@rawtext.club> wrote:

> Miriam English <mim@miriam-english.org> wrote:
> 
> > Does anybody know if there is any way to use the pcre library to
> > get a more extensive regex with gawk? In particular I want to use
> > non-greedy matches, such as:
> > 
> > *?        Match 0 or more times, not greedily
> > +?        Match 1 or more times, not greedily
> > ??        Match 0 or 1 time, not greedily
> > {n}?      Match exactly n times, not greedily (redundant)
> > {n,}?     Match at least n times, not greedily
> > {n,m}?    Match at least n but not more than m times, not greedily  
> 
> I don't think there is (yet) a gawk extension library for PCRE but
> there is for TRE, sort of a fuzzy match:
> https://laurikari.net/tre/faq/
> 
> GNU grep has a '-P' option that I guess mostly provides PCRE matching
> if PCRE is available on the system.  You could make a user-defined
> function to use it:
> 
> --
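> # pcre_grep.awk -- use grep -P for pcre-type regex
> {
>   # match 'b' at least twice but no more than 3 times:
>   pcre("b{2,3}?", $0)
> }
> # quote S for the shell so embedded single quotes are safe
> function shq(S) {
>   gsub(/'/, "'\\''", S)
>   return "'" S "'"
> }
> function pcre(PAT, REC,   CMD, LINE) {
>   CMD = "printf '%s\\n' " shq(REC) " | grep -P -e " shq(PAT)
>   # read into LINE so the current record $0 is not overwritten
>   while ((CMD | getline LINE) > 0)
>     print LINE
>   close(CMD)
> }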
> --
> 
> This is probably not the best approach but it seems to work.
> 
> The pcre2-utils package comes with a 'pcre2test' tool that might be
> used similarly.  Note that matching is greedy by default; the
> PCRE2_UNGREEDY option (PCRE_UNGREEDY in the original PCRE) inverts
> that.
> 



-- 
There are two wolves and they're always fighting.
One is darkness and despair. The other is light and hope.
Which wolf wins?
Whichever one you feed.
  -- Casey in Brad Bird's movie "Tomorrowland"


