bug#56351: LC_CTYPE=C.UTF-8 causes an matching error on Sed

bug-sed

From:	KIM Taeyeob
Subject:	bug#56351: LC_CTYPE=C.UTF-8 causes an matching error on Sed
Date:	Sat, 02 Jul 2022 14:03:10 +0900

Sed (and also Grep) cannot match a certain range of Korean characters when it operates under LC_CTYPE=C.UTF-8 (or any other locale with UTF-8 encoding, including en_US.UTF-8, ko_KR.UTF-8, ja_JP.UTF-8, etc.).


reproducing the bug on Sed:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | sed -e 's/./a/'
a                           <-- matched and replaced without an issue
$ echo 퐀 | sed -e 's/./a/'
퐀                          <-- FAILED to match so it doesn't replace

In detail, a character in the range [가-폿] (<UAC00>~<UD3FF>) is matched without any issue, but a character in the range [퐀-힣] (<UD400>~<UD7A3>) CANNOT be matched, although it IS SUPPOSED TO be.
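
For reference, the two characters on either side of the boundary can be dumped as UTF-8 bytes. The sketch below is only an inspection aid: utf8_hex is a hypothetical helper name, and it assumes a POSIX shell with od and tr available. It shows that the second byte changes from 0x8f to 0x90 at the boundary; whether that has anything to do with the failure is not established here.

```shell
# Hypothetical helper (not part of sed or grep): print the UTF-8 bytes
# of its argument as one lowercase hex string.
utf8_hex() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

utf8_hex 폿   # <UD3FF>, last character that matches  -> ed8fbf
utf8_hex 퐀   # <UD400>, first character that fails   -> ed9080
```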


Grep has the same issue with the period regex too.

reproducing the bug on Grep:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿                   <-- matched successfully
$ echo 퐀 | grep .
$                    <-- failed to match
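
To check more characters than the two above, the same test can be wrapped in a loop. This is a rough sketch, assuming a POSIX shell and a system where the C.UTF-8 locale is installed; probe_dot is a hypothetical helper name, and the result for each character depends on the installed Grep and Glibc versions, so no output is shown.

```shell
# Hypothetical helper: for each argument, report whether the regex "."
# matches it when grep runs under the C.UTF-8 locale.
probe_dot() {
  for s in "$@"; do
    if printf '%s\n' "$s" | LC_ALL=C.UTF-8 grep -q .; then
      printf '%s: matched\n' "$s"
    else
      printf '%s: NOT matched\n' "$s"
    fi
  done
}

# Characters taken from both ranges described in this report.
probe_dot 가 폿 퐀 힣
```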

I think it is related to <regex.h> or <iconv.h> in Glibc, but I couldn't find a way to reproduce the bug with those, so I am reporting it against Sed instead.


I have also reported this issue on the bug-grep list.

Current Thread

bug#56351: LC_CTYPE=C.UTF-8 causes an matching error on Sed, KIM Taeyeob <=
- bug#56351: LC_CTYPE=C.UTF-8 causes an matching error on Sed, Paul Eggert, 2022/07/02
