bug-gnulib

Re: uuencode: multi-bytes char in remote file name contains bytes >0x80


From: Eric Blake
Subject: Re: uuencode: multi-bytes char in remote file name contains bytes >0x80
Date: Tue, 05 Jul 2011 09:06:12 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110428 Fedora/3.1.10-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.10

On 07/05/2011 08:40 AM, Bruce Korb wrote:
> 2. Assuming that you want a localized file name for this archive file,
>    you thus still want to encode the file name for transmission.
>    To do this, you would use code like this:
>       dst = malloc(2 * strlen(p) + 1);
>       while (*p) {
>         if (*p == '/') // if I am not mistaken, '/' is always a '/' char

The next version of POSIX will be enforcing that '/' and '.' are
unambiguous across all POSIX encodings supported by all locales on a
system (it was a happy accident that no POSIX system has attempted to do
otherwise), as well as further clarifying that yes, filenames are not
necessarily character strings in all locales, unless those filenames are
drawn solely from the portable filename character set.

See http://austingroupbugs.net/view.php?id=291

There are, however, some non-POSIX encodings, such as the shift-state
encoding ISO-2022-JP-2, where the byte for '/' can appear as the second
byte of a multi-byte sequence, although they are rare in practice these
days.

Also, if you worry about systems where backslash is a directory
separator, there are encodings such as Shift_JIS where '\\' can appear
as a second byte within a multi-byte character (hence, '\\' is
ambiguous, even though '/' is not).
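
To make the Shift_JIS point concrete, here is a minimal sketch of a
separator search that skips trail bytes. The helper names are
hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are the
usual ones for two-byte Shift_JIS characters; variants of the encoding
may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Lead bytes of two-byte Shift_JIS characters (assumed ranges). */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: return a pointer to the first byte equal to
   `target` that is a character of its own, skipping the trail byte of
   every two-byte character.  Returns NULL when there is none. */
static const char *sjis_find_char(const char *s, char target)
{
    assert(s != NULL);
    while (*s) {
        if (sjis_is_lead((unsigned char)*s) && s[1] != '\0') {
            s += 2;             /* skip lead and trail byte together */
            continue;
        }
        if (*s == target)
            return s;
        s++;
    }
    return NULL;
}
```

For instance, the two bytes 0x83 0x5C form a single katakana character
in Shift_JIS: a byte-wise search reports a backslash at offset 1, while
sjis_find_char() on the same bytes returns NULL.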

> 3. Any uuencode-ed file with an encoded file name in it would need to
>    be marked so that uudecode could cope (translate the encoded name).
>    This format change should be compatible with POSIX specifications
>    for the uuencode output.  e.g. a preamble to the "begin"
>    line and not be part of that begin line?  Maybe a prefix line:
>       puts("encoded-file-name\n");
>    Eric Blake would be a better person for suggesting ways to "extend"
>    the POSIX format.  If this is worth the bother, then adding options
>    after the file name on the begin line would surely be "more
>    convenient"....

I'm not quite sure what you are asking me to do here.  Maybe it helps to
read the current POSIX requirements on uuencode output:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html

Note this statement:

"The standard output shall be a text file"

but if filename is _not_ a character string in the current locale, then
the output would _not_ be a text file (among other things, a text file
has the property that at least one locale can interpret every byte
sequence in the file as valid characters).  At which point, we are no
longer constrained by POSIX, and can arguably do whatever we want!  That
is, supporting file names that consist of characters outside of the
portable file name character set (a-z, A-Z, 0-9, ., _, /, and -) is
already outside the realm of what POSIX requires uuencode to support,
and it would be just as reasonable for uuencode to refuse to operate on
such file names as it would be for uuencode to emit some sort of header
that tells uudecode how to try and decode a string back into characters
appropriate for the current locale.
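
As a rough sketch of the "refuse to operate" option, a check against
the character list given above could look like this. The function name
is hypothetical, and this is not what uuencode currently does.

```c
#include <assert.h>
#include <string.h>

/* The characters listed above: a-z, A-Z, 0-9, '.', '_', '/', and '-'. */
static const char portable_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789._/-";

/* Hypothetical helper: return 1 when every byte of name is drawn from
   portable_chars, 0 otherwise.  An empty name trivially passes. */
static int name_is_portable(const char *name)
{
    assert(name != NULL);
    /* strspn() gives the length of the leading run of allowed bytes;
       the name is portable when that run reaches the terminating NUL. */
    return name[strspn(name, portable_chars)] == '\0';
}
```

Since all of these characters are single bytes in every POSIX locale,
the check does not depend on the current locale.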

>> 1. strlen may be wrong to count how many bytes in argv[optind].

No, strlen is _always_ the way to count how many bytes are in an element
of argv, since each argv entry is always a NUL-terminated sequence of
bytes (that might also, but is not required to, have meaning when
interpreted as multi-byte characters under the current locale).
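
A small sketch of the difference between bytes and characters. The
helper is hypothetical and assumes its input is valid UTF-8; it is only
here for contrast with strlen, which is what a buffer size needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count characters in a valid UTF-8 string by
   counting every byte that is not a continuation byte (10xxxxxx). */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the five bytes C3 A9 74 C3 A9, strlen() reports 5 while
utf8_char_count() reports 3. A size such as malloc(2 * strlen(p) + 1)
depends only on the byte count.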

-- 
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


