Re: faster fnmatch
Sat, 18 Apr 2009 20:53:13 +0200
Ondrej Bilka wrote:
> I looked more into source and discovered fnmatch doesn't work as I imagined.
> By default it converts strings into widechars and match there.
> utf8 allows searching be done bitwise. Its in most cases faster.
fnmatch converts to wide characters because it often makes several passes
across many characters of the string, and at each pass it has to call mbrtowc
for looking up the extent of that character. And while UTF-8 is the most
common encoding, there are other ones, such as ISO-8859-2 or GB18030, for
which mbrtowc is really expensive.
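The convert-once approach described above can be sketched roughly as follows. This is only an illustration, not the gnulib code; the helper name to_wide is made up, and it assumes the string is valid in the current locale's encoding.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string to a freshly
   allocated wide string in a single pass, so that later matching
   passes never need to call mbrtowc again.
   Returns NULL on an invalid sequence or on allocation failure.  */
static wchar_t *
to_wide (const char *s)
{
  mbstate_t state;
  const char *p = s;
  size_t n;
  wchar_t *w;

  /* First call: only count the wide characters (dst == NULL).  */
  memset (&state, 0, sizeof state);
  n = mbsrtowcs (NULL, &p, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  w = malloc ((n + 1) * sizeof *w);
  if (w == NULL)
    return NULL;

  /* Second call: do the actual conversion, including the
     terminating null wide character.  */
  p = s;
  memset (&state, 0, sizeof state);
  mbsrtowcs (w, &p, n + 1, &state);
  return w;
}
```

After this, the matcher can step backwards and forwards through w[] by plain indexing, whatever the locale encoding is.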
> Is ok just use original fnmatch if pattern contains extended wildcard or 
> with nonascii symbol?
No. If the encoding is GB18030 and the pattern is "*5*", and you attempt
to search for the '5' byte for byte, you will find a match where there
is actually none - because four-byte characters in GB18030 contain
values in the range 0x30..0x39 in bytes 2 and 4.
Similarly for the BIG5, BIG5-HKSCS, GBK, and SHIFT_JIS encodings.
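To make the false match concrete, here is a small sketch. It does not use the locale machinery; instead it walks the bytes according to the GB18030 byte structure (1-byte, 2-byte, and 4-byte forms). The function names are made up for this example, and the length function does only minimal validation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the GB18030 character that
   starts at s, with n bytes available.  Invalid or truncated input
   is treated as a single byte.  */
static size_t
gb18030_char_len (const unsigned char *s, size_t n)
{
  if (s[0] < 0x81 || s[0] == 0xFF || n < 2)
    return 1;
  if (s[1] >= 0x30 && s[1] <= 0x39)
    return n >= 4 ? 4 : 1;      /* four-byte form */
  return 2;                     /* two-byte form */
}

/* Byte-wise search: what a naive "fast path" would do.  */
static int
contains_byte (const char *str, char c)
{
  return strchr (str, c) != NULL;
}

/* Character-wise search: c matches only as a complete 1-byte
   character, never as a trailing byte of a longer character.  */
static int
contains_char (const char *str, char c)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t n = strlen (str);
  size_t i = 0;

  while (i < n)
    {
      size_t len = gb18030_char_len (s + i, n - i);
      if (len == 1 && s[i] == (unsigned char) c)
        return 1;
      i += len;
    }
  return 0;
}

/* 'a', one four-byte character (0x81 0x35 0x81 0x30), 'z'.
   There is no '5' character here, but there is a 0x35 byte.  */
static const char sample[] =
  { 'a', (char) 0x81, 0x35, (char) 0x81, 0x30, 'z', '\0' };
```

With this sample, contains_byte (sample, '5') reports a match while contains_char (sample, '5') does not, which is the wrong result a byte-wise "*5*" search would produce.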
> Here is casefold patch for fnmatch. (abusing wchar=u32)
wchar_t == ucs4_t is only generally true on glibc systems, not on
Solaris, FreeBSD, AIX, etc.
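One way code can check this assumption instead of relying on it is the standard C99 macro __STDC_ISO_10646__, which an implementation defines when wchar_t values are ISO 10646 code points. A minimal sketch (the function name is made up, and a platform may leave the macro undefined even where the assumption happens to hold):

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if wchar_t can be treated as a
   UCS-4 code point on this platform, 0 otherwise.  */
static int
wchar_is_ucs4 (void)
{
#ifdef __STDC_ISO_10646__
  return sizeof (wchar_t) >= 4;
#else
  return 0;
#endif
}
```

Where this returns 0, case folding has to go through towlower/towupper or an explicit conversion rather than treating wchar_t values as Unicode.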
> +#ifdef _LIBC
The symbol _LIBC is only defined when compiling glibc. It is not defined
when compiling gnulib source code on any system.
> - res = internal_fnwmatch (wpattern, wstring, wstring + strsize - 1,
> + wchar_t *wfoldpattern,*wfoldstring;
> + wfoldpattern=wpattern;wfoldstring=wstring;
If you want me to review some code, please present it with the same friendly
indentation, space-after-comma, space-around-operators, one-variable-per-
declaration, one-statement-per-line, GNU-style brace placement, max line length
of 80, etc. that you find in the rest of the gnulib source code.
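For illustration, the two quoted lines would look roughly like this in that style. The enclosing function is made up just so the fragment compiles on its own; only the declarations and assignments come from the patch.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical wrapper around the two lines from the patch.  */
static size_t
folded_length (wchar_t *wpattern, wchar_t *wstring)
{
  wchar_t *wfoldpattern;
  wchar_t *wfoldstring;

  wfoldpattern = wpattern;
  wfoldstring = wstring;

  return wcslen (wfoldpattern) + wcslen (wfoldstring);
}
```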
Regarding indentation: a tab's width is 8 columns. It looks like you're using
a different tab width. If that is so, and you cannot change it, please try
to avoid tabs altogether.