bug#29871: 25.3; ZWJ word-boundaries in regexps


From: Stefan Kangas
Subject: bug#29871: 25.3; ZWJ word-boundaries in regexps
Date: Sun, 29 Sep 2019 01:28:02 +0200

tags 29871 + notabug
close 29871
quit

Eli Zaretskii <eliz@gnu.org> writes:

>> From: "Mark Shoulson" <mark@nagas.meson.org>
>> Date: Wed, 27 Dec 2017 14:07:40 -0500
>>
>> According to http://unicode.org/reports/tr29/#Word_Boundaries rule WB4,
>> it would seem that a ZWJ character (U+200D ZERO WIDTH JOINER) between
>> two "word" characters should not constitute a word boundary.  And yet:
>>
>> (string-match "\\<" "foo\u200Dfbar" 1)
>>
>> evaluates to 4 (the 1 is to skip the word-beginning at the start of the
>> string).  Or you can search for "\\b" or "\\>" and get 3.  Either way,
>> indicative of a word-break at the ZWJ character.  Is this correct?
>
> Emacs considers a change of script as a word break, and U+200D's
> script is 'symbol', which is different from 'latin', the script of the
> ASCII characters.

According to the above explanation, this behaviour is expected.  I'm
therefore closing this as notabug.
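
For anyone who wants to check this locally, here is a minimal sketch.
The helper name `my-char-script' is made up for illustration and is
not part of Emacs; it only looks a character up in `char-script-table'.
The script reported for U+200D is taken from the explanation above and
may differ in other Emacs versions.

```elisp
;; Hypothetical helper (not part of Emacs): return the script that
;; Emacs assigns to character CH, as recorded in `char-script-table'.
(defun my-char-script (ch)
  (aref char-script-table ch))

;; ASCII letters belong to the `latin' script.
(my-char-script ?o)

;; U+200D ZERO WIDTH JOINER.  Per the explanation quoted above, this
;; is `symbol', so "o" followed by ZWJ counts as a change of script.
(my-char-script #x200D)

;; The search from the original report, skipping the word start at
;; the beginning of the string.
(string-match "\\<" "foo\u200Dfbar" 1)
```

If the two lookups return different scripts, a word boundary on either
side of the ZWJ is consistent with the rule described above.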

Best regards,
Stefan Kangas




