help-gawk

Re: How Is \B Supposed to Work in Regexps?


From: Wolfgang Laun
Subject: Re: How Is \B Supposed to Work in Regexps?
Date: Wed, 7 Sep 2022 17:18:54 +0200

\B matches the empty string at any position that is not a word boundary,
including between two non-word characters.

It might become clearer when you run the slightly modified matching
operation:

  echo "///"    | awk '{h = $0; p = gsub( /\B/, "!" ); print p " >" h "< >" $0 "<";}'
  4 >///< >!/!/!/!<

Where there is no word, there cannot be a word boundary.
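
The rule can be modelled in portable awk without using \B at all. The
sketch below is only a model, not gawk's regexp code: it assumes ASCII
word-constituents (letters, digits, underscore), treats the empty space
beyond either end of the string as non-word, and not_boundary_positions
is a made-up helper name. It lists every position where the characters
on both sides are of the same kind, which is where \B would fit.

```shell
# Portable-awk model (an assumption, not gawk's implementation) of the
# positions at which \B can match. The helper name is hypothetical.
not_boundary_positions() {
  printf '%s\n' "$1" | awk '
    # Letters, digits and underscore count as word-constituent.
    function is_word(c) { return c ~ /^[A-Za-z0-9_]$/ }
    {
      out = ""
      n = length($0)
      for (i = 1; i <= n + 1; i++) {
        # Beyond either end of the input counts as non-word (0).
        left  = (i > 1)  ? is_word(substr($0, i - 1, 1)) : 0
        right = (i <= n) ? is_word(substr($0, i, 1))     : 0
        # Same kind on both sides: no word boundary here, so \B fits.
        if (left == right) out = out (out == "" ? "" : " ") i
      }
      print out
    }'
}

not_boundary_positions '///'      # prints: 1 2 3 4
not_boundary_positions 'aaa'      # prints: 2 3
not_boundary_positions 'a/a/a/'   # prints: 7
```

The first position listed for each input agrees with the match()
results quoted below: 1 for "///", 2 for "aaa", and 7 for "a/a/a/".
For "a" the model lists nothing, which corresponds to match()
returning 0.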

Regards
Wolfgang


On Tue, 6 Sept 2022 at 22:35, Neil R. Ormos <ormos-gnulists17@ormos.org>
wrote:

> The \B regexp operator doesn't appear to work as described in the manual.
>
> In manual Section "3.7 gawk-Specific Regexp Operators", \B is said to match
>
> | the empty string that occurs between two
> | word-constituent characters. For example,
> | /\Brat\B/ matches 'crate', but it does not match
> | 'dirty rat'. '\B' is essentially the opposite of
> | '\y'.
>
> \B seems to match even strings that contain no word-constituent characters.
>
> The little test program in the examples below tries to match() a
> one-element regexp, either \w, \y, or \B, against various test strings in
> $0.  The first print displays the value from match(), followed by $0
> sandwiched between two "|" characters.  The second print places a caret
> ("^") under $0, as printed above, at the position of the regexp identified
> by match().
>
> Output lines 9, 12, and 15 show that " " and "/" are not word-constituent
> characters, while "a" is a word-constituent.
>
> In output lines 19, 25, and 31, searching for \y, the empty string at the
> beginning or end of a word, the results are as expected: match() returns
> the position of the first "a" in $0.  Likewise, in output lines 22 and 28,
> there are no word-constituent characters in the corresponding $0, and
> match() returns 0.
>
> Lines 34-51 involve matching \B, and I don't understand some of those
> results.
>
> For output line 35, the input string "aaa" starts with a run of
> word-constituent characters.  The result is 2, as expected.
>
> For output line 47, the input string "a" is exactly one word-constituent
> character.  The result is 0, as expected, because there is no "empty string
> between two word-constituent characters".
>
> For output line 41, using input string "   aaa", the run of
> word-constituent characters begins in position 4, yet match() returns 1.  I
> would have expected 5.
>
> For output lines 38 and 44, the input strings have no word-constituent
> characters, yet match() again returns 1.  I would have expected 0.
>
> For output line 50, input string "a/a/a/", there are no pairs of adjacent
> word-constituent characters, and therefore, there should be no "empty
> string between two word-constituent characters".  Here, match() returns 7,
> identifying a position outside the six-character input string.  Again, I
> would have expected 0.
>
> I'm not sure whether these results show a bug in Gawk (or the regexp
> library or libraries it uses), a bug in the manual, my error in
> interpretation, or some other PBCAK error.  Any insights?
>
> (The gawk executable referenced in the examples was built from the most
> recent release, but I think I get the same results from an ancient gawk.)
>
> ############################################################
> ############################################################
>
>  8      echo " "      | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
>  9      0 | |
> 10        ^
> 11      echo "a"      | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 12      1 |a|
> 13         ^
> 14      echo "/"      | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 15      0 |/|
> 16        ^
> 17
> 18      echo "aaa"    | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 19      1 |aaa|
> 20         ^
> 21      echo "///"    | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 22      0 |///|
> 23        ^
> 24      echo "   aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 25      4 |   aaa|
> 26            ^
> 27      echo "   ///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 28      0 |   ///|
> 29        ^
> 30      echo "a/a/a/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 31      1 |a/a/a/|
> 32         ^
> 33
> 34      echo "aaa"    | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 35      2 |aaa|
> 36          ^
> 37      echo "///"    | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 38      1 |///|
> 39         ^
> 40      echo "   aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 41      1 |   aaa|
> 42         ^
> 43      echo "   ///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 44      1 |   ///|
> 45         ^
> 46      echo "a"      | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 47      0 |a|
> 48        ^
> 49      echo "a/a/a/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/);
>           print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s;
>           print substr(s, 1, length(p)+2+p-1) "^";}'
> 50      7 |a/a/a/|
> 51               ^
>
> ############################################################
>
>

-- 
Wolfgang Laun

