bug-gnu-emacs

bug#33205: 26.1; unibyte/multibyte missing in rx.el


From: Eli Zaretskii
Subject: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Mon, 05 Nov 2018 18:49:07 +0200

> Date: Wed, 31 Oct 2018 17:55:08 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 33205@debbugs.gnu.org
> 
> > From: Mattias Engdegård <mattiase@acm.org>
> > Cc: 33205@debbugs.gnu.org
> > Date: Wed, 31 Oct 2018 16:27:53 +0100
> > 
> > tis 2018-10-30 klockan 19:27 +0200 skrev Eli Zaretskii:
> > > I think it's a documentation bug: [:unibyte:] matches only ASCII
> > > characters.  IOW, it tests "unibyteness" in the internal
> > > representation (which might be surprising, I know).
> > > 
> > > And [:nonascii:] is only defined for multibyte characters.
> > 
> > Thus [:ascii:]/[:nonascii:] cannot be distinguished from
> > [:unibyte:]/[:multibyte:]. Surely this cannot have been the intention?
> 
> I actually looked into this some more, and I think my original
> conclusion was wrong.  Let me dwell on that a bit more, and I will
> report what I found.  We can then revisit the questions you ask above.

After looking into this, my conclusion is that what I wrote above was
not too wrong.  Indeed, currently [:ascii:]/[:nonascii:] cannot be
distinguished from [:unibyte:]/[:multibyte:].  In a nutshell, it turns
out that [:unibyte:] is not what one might think it is; you can see
that in re_wctype_to_bit, for example.

Thinking about this and looking at the code, I'd say that support of
named character classes is heavily biased in favor of multibyte text,
not to say supports _only_ multibyte text.  So searching unibyte
strings and unibyte buffers for the likes of [:unibyte:] will only
find ASCII characters.

In multibyte buffers and strings, unibyte characters are stored in
their multibyte representation, so it is no longer trivial to define
what [:unibyte:] means.  However, I discovered that there's a
workaround for what you are trying to do: use ^[:multibyte:] instead
of [:unibyte:].  Observe:

  (setq s "A\310") => "A\310"
  (string-match-p "A[[:ascii:]]" s) => nil
  (string-match-p "A[[:nonascii:]]" s) => nil
  (string-match-p "A[^[:ascii:]]" s) => 0      ;; !!!
  (string-match-p "A[[:unibyte:]]" s) => nil
  (string-match-p "A[^[:multibyte:]]" s) => 0  ;; !!!
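
As background for the example above, here is a minimal sketch of how
the same raw byte looks in a unibyte string and in a multibyte string.
It uses only the standard functions unibyte-string, string-to-multibyte
and aref; the helper name my-byte-representations is made up for
illustration.

```elisp
(defun my-byte-representations (byte)
  "Return (UNIBYTE-CODE . MULTIBYTE-CODE) for the raw byte BYTE.
BYTE is an integer in the range 128..255.  This is an illustrative
helper, not part of Emacs."
  (let* ((u (unibyte-string byte))      ; one-byte unibyte string
         (m (string-to-multibyte u)))   ; same byte in a multibyte string
    (cons (aref u 0) (aref m 0))))

;; In a unibyte string the element is the byte itself; in a multibyte
;; string it is the corresponding raw-byte character.
(my-byte-representations #o310)   ; => (200 . 4194248), i.e. (#xC8 . #x3FFFC8)
```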

That ^[:ascii:] is not the same as [:nonascii:], and likewise
[:unibyte:] vs ^[:multibyte:], is arguably a bug.  The reason for that
becomes clear if you look at how we generate the fastmap in each of
these cases and how we set the bits in the work-area of the range
table, but I don't know enough to say how easy it would be to fix
that.
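
To check the class/negated-class asymmetry on a given Emacs build, a
sketch along the following lines may help.  The helper name
my-class-report is hypothetical, and no expected results are shown
because they depend on the Emacs version and on whether the string is
unibyte or multibyte.

```elisp
(defun my-class-report (string)
  "Return an alist of (REGEXP . MATCH-INDEX) for STRING.
Illustrative helper; each regexp is one of the bracket expressions
discussed above, preceded by a literal \"A\"."
  (mapcar (lambda (re) (cons re (string-match-p re string)))
          '("A[[:ascii:]]"     "A[^[:ascii:]]"    "A[[:nonascii:]]"
            "A[[:unibyte:]]"   "A[^[:unibyte:]]"
            "A[[:multibyte:]]" "A[^[:multibyte:]]")))

;; Print one line per regexp for the unibyte string used above.
(dolist (entry (my-class-report "A\310"))
  (message "%-20s => %S" (car entry) (cdr entry)))
```

Running the same report on (string-to-multibyte "A\310") shows whether
the multibyte representation of the byte is classified differently.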

An alternative is to use an explicit character class, as in \000-\377,
which works as you'd expect.
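
A minimal sketch of that alternative, assuming a unibyte string as in
the example above; the helper name my-A-then-any-byte is made up for
illustration.

```elisp
(defun my-A-then-any-byte (string)
  "Return the index of the first \"A\" followed by one more byte in STRING.
Return nil if there is none.  STRING is assumed to be unibyte.
Illustrative helper, not part of Emacs."
  ;; The octal escapes put the literal bytes 0 and 255 into the regexp,
  ;; so the bracket expression is an explicit range over all byte values.
  (string-match-p "A[\000-\377]" string))

(my-A-then-any-byte "A\310")
```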

> > Taking a step back: Do you agree that the missing unibyte/multibyte
> > should be added to rx
> 
> I think it depends on what we find regarding the functionality.  It's
> possible that it makes no real sense in the context of rx, for example
> (although it indeed sounds like an omission).
> 
> > If there is a useful interpretation of [:unibyte:]/[:multibyte:] today,
> > perhaps we could make them behave that way. 
> 
> Right.  Stay tuned, and thanks for pointing out this surprising
> behavior.

Well, what do you think now?  Is it worth adding those to rx.el?  I'm
not sure.  How important is it to find unibyte characters in a string,
anyway?




