Re: [Bug-tar] Wildcards do not match invalid characters

Micah Cowan
Thu, 07 Feb 2008 16:33:45 -0800
Bruno Haible wrote:
> Any volunteer wants to write a 'mbsfnmatch' function that works like fnmatch
> but supports invalid byte sequences?

(I've removed bug-tar from the Cc list but left everyone else; I hope
that's as it should be.)

Wget is in need of such a facility as well.

Or, possibly, a "c-fnmatch" would suit our needs more. Wget is currently
locale/character set unaware; while we'd like to change that in the
future, in the meantime we need things to "work" :) ... in any case, it
could be a challenge to figure out the encoding used for remote
filenames on an FTP server.

What would be involved in writing such a facility? I might be interested
in doing so, but need a clearer picture of what it would be.

It appears, from looking at the current code, that the current
mbs-handling fnmatch() simply converts the strings to wcs format, and
then passes them to internal_fnwmatch().
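
A minimal sketch of that convert-then-match shape, for illustration only.
It is not the gnulib or glibc code: convert_then_match() and toy_wmatch()
are hypothetical names, and toy_wmatch() is a stand-in for
internal_fnwmatch() that understands only * and ?. The conversion uses
the standard mbstowcs(), which reports failure when it meets a byte
sequence that is invalid in the current locale, so matching never starts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Toy stand-in for the internal wide-character matcher: supports only
   '*' and '?', no bracket expressions, no flags.  Returns 1 on match.  */
static int
toy_wmatch (const wchar_t *pat, const wchar_t *str)
{
  for (; *pat != L'\0'; pat++, str++)
    {
      if (*pat == L'*')
        {
          /* Try every possible length for the text matched by '*'.  */
          for (;; str++)
            {
              if (toy_wmatch (pat + 1, str))
                return 1;
              if (*str == L'\0')
                return 0;
            }
        }
      if (*str == L'\0' || (*pat != L'?' && *pat != *str))
        return 0;
    }
  return *str == L'\0';
}

/* Hypothetical wrapper: 0 = match, 1 = no match, -1 = conversion failed
   (an invalid byte sequence, or a string too long for the buffers).  */
int
convert_then_match (const char *pattern, const char *string)
{
  enum { MAXLEN = 256 };
  wchar_t wpat[MAXLEN], wstr[MAXLEN];

  assert (pattern != NULL && string != NULL);

  /* Step 1: convert both strings to wide characters.  */
  size_t np = mbstowcs (wpat, pattern, MAXLEN);
  size_t ns = mbstowcs (wstr, string, MAXLEN);
  if (np == (size_t) -1 || ns == (size_t) -1
      || np == MAXLEN || ns == MAXLEN)
    return -1;

  /* Step 2: hand the wide strings to the wide-character matcher.  */
  return toy_wmatch (wpat, wstr) ? 0 : 1;
}
```

With this shape, a single undecodable byte anywhere in the file name
makes the whole call fail, which is the behaviour under discussion.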

One dead-simple approach would be that whenever an unrecognized byte is
found, it is simply expanded to the wide character with the same numeric
value. This would
end up doing the right thing if the locale is UTF-8 but the input string
is in ISO-8859-1. It would be less functional for other encodings,
including the other ISO-8859-* ones: character classification would be
unreliable for them.
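
A rough sketch of the byte-widening idea, assuming a per-character
conversion loop built on the standard mbrtowc(). The name
lenient_mbstowcs() is made up for this example and exists in neither
gnulib nor glibc. The widened value is only the "right" character when
the byte is ISO-8859-1 and wchar_t holds Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper.  Converts the NUL-terminated multibyte string
   SRC into DST (room for DSTLEN wide characters, terminator included)
   using the current locale.  A byte that does not start a valid
   sequence is widened to the wide character with the same numeric
   value.  Returns the number of wide characters written, or
   (size_t) -1 if DST is too small.  */
size_t
lenient_mbstowcs (wchar_t *dst, size_t dstlen, const char *src)
{
  assert (dst != NULL && src != NULL && dstlen > 0);

  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t srclen = strlen (src);
  size_t i = 0, n = 0;

  while (i < srclen)
    {
      if (n + 1 >= dstlen)
        return (size_t) -1;

      wchar_t wc;
      size_t r = mbrtowc (&wc, src + i, srclen - i, &state);
      if (r == (size_t) -1 || r == (size_t) -2)
        {
          /* Invalid or truncated sequence: widen one byte, then reset
             the conversion state so decoding can resynchronize.  */
          wc = (wchar_t) (unsigned char) src[i];
          r = 1;
          memset (&state, 0, sizeof state);
        }
      else if (r == 0)
        break;                  /* defensive: embedded NUL */

      dst[n++] = wc;
      i += r;
    }
  dst[n] = L'\0';
  return n;
}
```

The result can be handed to a wide-character matcher unchanged, so * and
? see exactly one element per undecodable byte.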

OTOH, perhaps it's better not to let such characters be mapped to real
wide characters at all, so that they'll work fine for * and ?, but fail
all character-classification tests (or perhaps succeed at one specific
one we've chosen for such cases). Perhaps the WEOF value (where
available) could be used for this purpose (but care might be needed to
ensure we don't pass it on to standard library functions).
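
A sketch of the sentinel variant, under the assumption that the decoded
string is kept in a wint_t array so that WEOF is representable. All three
function names are hypothetical. The wrapper shows one way to keep the
sentinel away from the C library: it answers "no" itself instead of
calling iswalpha() on it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: decode SRC into DST (room for DSTLEN elements,
   terminator included).  Each undecodable byte becomes one WEOF
   element.  Returns the element count, or (size_t) -1 if DST is too
   small.  */
size_t
mbs_to_wints (wint_t *dst, size_t dstlen, const char *src)
{
  assert (dst != NULL && src != NULL && dstlen > 0);

  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t srclen = strlen (src);
  size_t i = 0, n = 0;

  while (i < srclen)
    {
      if (n + 1 >= dstlen)
        return (size_t) -1;

      wchar_t wc;
      size_t r = mbrtowc (&wc, src + i, srclen - i, &state);
      if (r == (size_t) -1 || r == (size_t) -2)
        {
          dst[n++] = WEOF;      /* sentinel, not a real character */
          i += 1;
          memset (&state, 0, sizeof state);
        }
      else if (r == 0)
        break;                  /* defensive: embedded NUL */
      else
        {
          dst[n++] = (wint_t) wc;
          i += r;
        }
    }
  dst[n] = 0;
  return n;
}

/* '?' accepts any single element, sentinel included.  */
int
matches_question_mark (wint_t wc)
{
  (void) wc;
  return 1;
}

/* Classification wrapper: the sentinel fails the test and is never
   handed to the C library.  */
int
sentinel_iswalpha (wint_t wc)
{
  return wc != WEOF && iswalpha (wc) != 0;
}
```

A bracket expression such as [[:alpha:]] would go through wrappers like
sentinel_iswalpha(), while * and ? would only count elements.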

Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
