help-gawk


Re: How to represent NBSP in gawk regex?


From: Neil R. Ormos
Subject: Re: How to represent NBSP in gawk regex?
Date: Mon, 21 Feb 2022 11:04:14 -0600 (CST)

david kerns wrote:
> Eli Zaretskii <eliz@gnu.org> wrote:
>> [david kerns wrote:]

>>> from the gawk user manual, my interpretation
>>> is that gawk only accepts UTF-8 encodings...

>> That's not true, AFAIK.

> Thus the sheepish wording... I was not able to
> get UTF-16 encoding to work, so I read the
> manual...  I couldn't find it clearly stated
> either way, but I did read this:

> | With the increasing popularity of the Unicode
> | character standard <http://www.unicode.org/>,
> | there is an additional wrinkle to consider.
> | Octal and hexadecimal escape sequences inside
> | bracket expressions are taken to represent
> | only single-byte characters (characters whose
> | values fit within the range 0-256). To
> | match a range of characters where the
> | endpoints of the range are larger than 256,
> | enter the multibyte encodings of the
> | characters directly.

> which is what Wolfgang did.

I think the lesson that should be drawn from that manual excerpt is limited to 
the specific context of escaped representations of characters in bracket 
expressions--i.e., within bracket expressions, as a special case, Gawk does not 
form multibyte characters from runs of escaped byte values, even if those runs 
of byte values are equivalent to a code point in the current locale.

But that's not true everywhere.  As an example, 

  gawk 'BEGIN{print length("\xc2\xa0") }'

prints 1 in a UTF-8 locale, showing that Gawk recognizes the run of bytes as a 
single character.

> Perhaps my real issue is that I live in an
> "LC_ALL=C" bubble

Although both David's and Wolfgang's solutions work, I wonder if there is a 
more portable way to represent the character, one that is not tied to a 
specific character set.  As a wishful-thinking example, if iconv accepted 
"html" as one of the /character sets/ that could be specified using the 
--from-code option, it might be used at run-time to translate "&nbsp;" to 
the equivalent character in the current locale.  Surely a UTF-256 is on the horizon.
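
Short of an "html" character set, something in that spirit can be approximated with iconv as it exists today: start from the UTF-8 bytes of U+00A0 and convert them to whatever character set `locale charmap` reports. This is only a sketch. The helper name nbsp_for is made up for illustration, and I am assuming an iconv that accepts the name `locale charmap` prints, which may not hold on every system.

```shell
# Hypothetical helper: emit NBSP (U+00A0) encoded for the given character
# set, defaulting to the character set of the current locale.
nbsp_for() {
    printf '\302\240' | iconv -f UTF-8 -t "${1:-$(locale charmap)}"
}

if nbsp=$(nbsp_for); then
    # Hand the locale-encoded NBSP to gawk and use it as a dynamic regex.
    printf 'a%sb\n' "$nbsp" | gawk -v nbsp="$nbsp" '{ gsub(nbsp, " "); print }'
else
    echo "NBSP is not representable in $(locale charmap)" >&2
fi
```

In a locale whose character set has no NBSP at all (plain ASCII, for instance) the conversion should fail, which is arguably the right answer there. The gawk program itself never mentions a byte value, so it is not tied to one encoding.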


