[bug-gnulib] Handling of invalid multibyte character sequences in fnmatc
From: James Youngman
Subject: [bug-gnulib] Handling of invalid multibyte character sequences in fnmatch()
Date: Mon, 6 Jun 2005 06:47:42 +0100
User-agent: Mutt/1.3.28i
Hello,
I have filenames on my system that are in latin1; these are installed
as part of my distribution. However, I have my environment set up for
UTF-8.
This appears to cause gnulib's fnmatch() function to fail to match
some characters with '?' and '*'. The problem appears to affect both
current gnulib and coreutils 5.2.1, but not bash 3.00.16(1).
Here's an example with coreutils:-
$ ls | od -c
0000000 c a r r 351 . l o g o \n e n r o u
0000020 l 351 . l o g o \n e x e m p l e 1
0000040 . l o g o \n t r i a n g l e . l
0000060 o g o \n
0000064
$ ls -1 --ignore='*'
carr?.logo
enroul?.logo
$ ls -1
carr?.logo
enroul?.logo
exemple1.logo
triangle.logo
In the above example, you will see that the filenames containing byte
0351 (octal), "LATIN SMALL LETTER E WITH ACUTE" in latin1, don't match
the glob character '*'. Here's an example with current gnulib:-
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*'
.
./triangle.logo
./exemple1.logo
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find .
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo
However, bash does not seem to be affected:-
$ ls -1 *
carr?.logo
enroul?.logo
exemple1.logo
triangle.logo
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
Perhaps bash either isn't sensitive to whatever configuration error I
have made, or it uses glob() or similar, instead of gnulib's
fnmatch().
If I switch back to the C locale, the problem does not occur:-
$ unset LANG ; locale
LANG=POSIX
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*'
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo
At this point it dawns on me that 0351 is a valid Latin-1 character,
and indeed is a valid Unicode code point (representing the same glyph).
However, on its own it is not a valid UTF-8 sequence. The value 0351 is
11101001 in binary, which in UTF-8 is the lead byte of a three-byte
sequence:-
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
The filenames above don't have a 10xxxxxx byte following the accented
e, and so I suppose the explanation is that in my locale, those
filenames contain invalid multibyte character sequences. The
current "locate" of findutils is also affected because it also uses
fnmatch(); however, if I recode the input of fnmatch() to be in UTF-8
instead of Latin-1 (by using iconv on find's output before feeding it
to frcode), then the glob characters now match the filenames.
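The byte-level reasoning above can be checked with a short Python sketch. This is only an illustration of the encoding rules, not code from gnulib or findutils; the helper names (utf8_expected_length, is_valid_utf8, latin1_to_utf8) are made up for this example, and the recoding step assumes the names really are Latin-1, as iconv would be told.

```python
def utf8_expected_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte,
    or 0 if the byte cannot start a sequence."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx 10xxxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx 10xxxxxx 10xxxxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return 0      # continuation byte or a byte never used in UTF-8


def is_valid_utf8(name: bytes) -> bool:
    """True if the whole byte string decodes as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(name: bytes) -> bytes:
    """Recode a Latin-1 filename to UTF-8 (assumes the input is Latin-1)."""
    return name.decode("latin-1").encode("utf-8")


latin1_name = b"carr\xe9.logo"  # 0xE9 is 0351 in octal

print(utf8_expected_length(0o351))            # lead byte of a 3-byte sequence
print(is_valid_utf8(latin1_name))             # '.' follows, not 10xxxxxx
print(is_valid_utf8(latin1_to_utf8(latin1_name)))
```

The byte 0351 announces a three-byte sequence, but the next byte is '.', so the name is rejected as UTF-8; after recoding, the accented character becomes the two bytes 0303 0251 and the name is accepted.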
The same problem appears not to affect the gnulib regex module:-
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*'
.
./triangle.logo
./exemple1.logo
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -regex '.*'
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo
Any ideas/suggestions? Is this problem unavoidable?
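One conceivable direction, offered only as a sketch and not as a description of what gnulib, glibc or bash actually do, is to fall back to byte-by-byte matching whenever the name cannot be decoded in the current encoding. The Python below illustrates the idea with the standard fnmatch module, whose semantics differ from C fnmatch() (for instance it has no FNM_PATHNAME or FNM_PERIOD handling); match_name is a hypothetical helper.

```python
import fnmatch


def match_name(pattern: str, name: bytes, encoding: str = "utf-8") -> bool:
    """Match a glob pattern against a filename given as raw bytes.

    If the name decodes cleanly, match character by character.
    Otherwise fall back to matching byte by byte, so that an invalid
    sequence is treated as a run of single-byte characters.
    """
    try:
        text = name.decode(encoding)
    except UnicodeDecodeError:
        # Fallback: both arguments as bytes, so '?' matches one byte.
        return fnmatch.fnmatchcase(name, pattern.encode(encoding))
    return fnmatch.fnmatchcase(text, pattern)


for raw in (b"carr\xe9.logo", b"carr\xc3\xa9.logo", b"triangle.logo"):
    print(raw, match_name("*", raw), match_name("carr?.logo", raw))
```

With this fallback, '*' matches every name, and 'carr?.logo' matches both the Latin-1 and the UTF-8 spelling: '?' consumes one byte in the first case and one decoded character in the second.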
Regards,
James Youngman.