bug-gnulib

[PATCH] dfa: fix bug with ‘.’ and UTF-8 Hangul Syllables


From: Paul Eggert
Subject: [PATCH] dfa: fix bug with ‘.’ and UTF-8 Hangul Syllables
Date: Fri, 13 May 2022 23:26:10 -0700

This fixes a bug introduced in 2019-12-18T05:41:27Z!eggert@cs.ucla.edu,
an earlier patch that fixed dfa.c to not match invalid UTF-8.
Unfortunately that patch had two typos that affect dfa.c when it
matches the regular expression ‘.’ (dot).  One typo
caused dfa.c to incorrectly reject the valid UTF-8 sequences
(ED)(90-9F)(80-BF) corresponding to U+D400 through U+D7FF, which
are some Hangul Syllables and Hangul Jamo Extended-B.  The other
typo caused dfa.c to incorrectly reject the valid sequences
(F4)(88-8F)(80-BF)(80-BF), which correspond to U+108000 through
U+10FFFF (the upper half of Supplementary Private Use Area-B).
* lib/dfa.c (utf8_classes): Fix typos.
* tests/test-dfa-match.sh: Test the fix.
---
 ChangeLog               | 16 ++++++++++++++++
 lib/dfa.c               |  4 ++--
 tests/test-dfa-match.sh | 11 +++++++++++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 6ed8a50735..fe26d37618 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,19 @@
+2022-05-13  Paul Eggert  <eggert@cs.ucla.edu>
+
+       dfa: fix bug with ‘.’ and UTF-8 Hangul Syllables
+       This fixes a bug introduced in 2019-12-18T05:41:27Z!eggert@cs.ucla.edu,
+       an earlier patch that fixed dfa.c to not match invalid UTF-8.
+       Unfortunately that patch had two typos that affect dfa.c when it
+       matches the regular expression ‘.’ (dot).  One typo
+       caused dfa.c to incorrectly reject the valid UTF-8 sequences
+       (ED)(90-9F)(80-BF) corresponding to U+D400 through U+D7FF, which
+       are some Hangul Syllables and Hangul Jamo Extended-B.  The other
+       typo caused dfa.c to incorrectly reject the valid sequences
+       (F4)(88-8F)(80-BF)(80-BF), which correspond to U+108000 through
+       U+10FFFF (the upper half of Supplementary Private Use Area-B).
+       * lib/dfa.c (utf8_classes): Fix typos.
+       * tests/test-dfa-match.sh: Test the fix.
+
 2022-05-12  Paul Eggert  <eggert@cs.ucla.edu>
 
        manywarnings: update C warnings for GCC 12
diff --git a/lib/dfa.c b/lib/dfa.c
index a27d096f73..e88fabb442 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -1704,7 +1704,7 @@ add_utf8_anychar (struct dfa *dfa)
     /* G. ed (just a token).  */
 
     /* H. 80-9f: 2nd byte of a "GHC" sequence.  */
-    CHARCLASS_INIT (0, 0, 0, 0, 0xffff, 0, 0, 0),
+    CHARCLASS_INIT (0, 0, 0, 0, 0xffffffff, 0, 0, 0),
 
     /* I. f0 (just a token).  */
 
@@ -1717,7 +1717,7 @@ add_utf8_anychar (struct dfa *dfa)
     /* L. f4 (just a token).  */
 
     /* M. 80-8f: 2nd byte of a "LMCC" sequence.  */
-    CHARCLASS_INIT (0, 0, 0, 0, 0xff, 0, 0, 0),
+    CHARCLASS_INIT (0, 0, 0, 0, 0xffff, 0, 0, 0),
   };
 
   /* Define the character classes that are needed below.  */
diff --git a/tests/test-dfa-match.sh b/tests/test-dfa-match.sh
index b23851b8c0..4561584c4c 100755
--- a/tests/test-dfa-match.sh
+++ b/tests/test-dfa-match.sh
@@ -42,4 +42,15 @@ in=$(printf "bb\nbb")
 $timeout_10 ${CHECKER} test-dfa-match-aux a "$in" 1 > out || fail=1
 compare /dev/null out || fail=1
 
+# If the platform supports U+00E9 LATIN SMALL LETTER E WITH ACUTE,
+# test U+D45C HANGUL SYLLABLE PYO.
+U_00E9=$(printf '\303\251\n')
+U_D45C=$(printf '\355\221\234\n')
+if testout=$(LC_ALL=en_US.UTF-8 $CHECKER test-dfa-match-aux '^.$' "$U_00E9") &&
+   test "$testout" = 2
+then
+  testout=$(LC_ALL=en_US.UTF-8 $CHECKER test-dfa-match-aux '^.$' "$U_D45C") &&
+  test "$testout" = 3 || fail=1
+fi
+
 Exit $fail
-- 
2.34.1



