
Re: [Arx-users] Re: [Fwd: i18n and file systems]

From: Kevin Smith
Subject: Re: [Arx-users] Re: [Fwd: i18n and file systems]
Date: Wed, 21 Dec 2005 21:26:53 -0500
User-agent: Mozilla Thunderbird 1.0.7 (X11/20051011)

Walter Landry wrote:

Thanks.  I had seen this when it came out.  I think that bzr has more
problems with filenames because it stores the weave in file names that
match the file in the working copy.  ArX stores everything as
changesets, so this does not come up as much.  As for dealing with
cases where a file can not be created on a particular filesystem, I
think ArX will already give a (cryptic) error message.  It most likely
will die when unpacking the tarball or when creating a file during

I was thinking in particular of the discussion we had earlier this month where I proposed encoding branch names such that they could be stored on, and moved to, any filing system. The information in this email reinforces my belief that storing user-chosen text as filenames is a bad idea.


-------- Original Message --------
Subject: i18n and file systems
Date: Tue, 13 Dec 2005 16:46:58 +1100
From: Robert Collins <address@hidden>
To: address@hidden <address@hidden>
CC: Alexander Belchenko <address@hidden>

Hi, in debugging a recent problem with jbailey's automatic debian
packages we found an interesting problem.

When LANG=C, the test suite fails to pass:

ERROR: test_commit_template (bzrlib.tests.test_msgeditor.MsgEditorTest)

log from this test:

Traceback (most recent call last):
"/home/robertc/source/baz/run_tests_twice_for_i18n/bzrlib/tests/", line 40, in test_commit_template
    working_tree = self.make_uncommitted_tree()
TypeError: make_uncommitted_tree() takes no arguments (1 given)


----- should trigger
this on everyone's system.

Martin is currently disabling the specific test when it can't run (which
is appropriate here).

But it raises an interesting discussion we've kind of ignored. Firstly
the background:

Some file systems/platforms are unicode through and through - no matter
what your terminal encoding is, the file system can still represent and
return a unicode path. (Whether python figures this out and uses the
appropriate apis is a good question.) Examples are NTFS (on win32)
(IIRC), and HFS+ (with MacOSX). Let's call this unicode safe.

Other file systems are 'code page' file systems - they essentially store
just a byte string, and your user-space translation rule determines what
that looks like. For instance linux's apis are all just byte-strings,
the actual meaning of any file path segment is all in the eye of the
user - on linux, try creating a unicode file name in a utf16 locale, or
utf8 locale, and then switching to the other (or to something not even
unicode, like one of the iso8859-x locales). All linux mounted fs's, and
VFAT, are all I know about offhand. Let's call this unicode-sometimes.
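To make the byte-string point concrete, here is a minimal sketch in modern Python 3 (an assumption; it is not code from this thread). The sample name `café.txt` and the variable names are illustrative only. It shows that the bytes stored for one name depend on the writer's encoding, and that reading them back under another encoding either garbles the name or fails.

```python
# Illustrative only: the same logical name produces different bytes
# depending on the encoding in effect when the file was created.
name = "caf\u00e9.txt"  # 'café.txt'

utf8_bytes = name.encode("utf-8")        # what a utf8 locale would store
latin1_bytes = name.encode("iso8859-1")  # what an iso8859-1 locale would store

# A byte-string file system keeps only the bytes, and these differ.
print(utf8_bytes)    # b'caf\xc3\xa9.txt'
print(latin1_bytes)  # b'caf\xe9.txt'

# utf8 bytes read back under iso8859-1 decode "successfully" to a garbled name.
misread = utf8_bytes.decode("iso8859-1")

# iso8859-1 bytes read back under utf8 do not decode at all.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

Note that neither failure mode is reported by the file system itself; it only ever sees bytes.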

There's a final category, which is platforms that cannot represent
unicode in file paths at all - where the locale is non-unicode and you
have a code page style file system api. Buggah. Let's call this non-unicode.

Now, when you access a URL or use something like FTP, it gets even
trickier, because the encoding of the file being served by (say) apache
may not match the one used by the user who wrote the file. This
leads to URLs that cannot be predicted, and other such fun.
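A small sketch of why such URLs are hard to predict, assuming modern Python 3 and its standard `urllib.parse.quote` (the file name is made up for illustration): the percent-encoded form of a path depends on which byte encoding was applied first.

```python
from urllib.parse import quote

# Illustrative file name containing non-ascii characters.
path = "r\u00e9sum\u00e9.txt"  # 'résumé.txt'

# The same name percent-encodes differently under different byte encodings.
utf8_url = quote(path, encoding="utf-8")
latin1_url = quote(path, encoding="iso8859-1")

print(utf8_url)    # r%C3%A9sum%C3%A9.txt
print(latin1_url)  # r%E9sum%E9.txt
```

A client that guesses the wrong encoding asks for a URL the server does not have.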

Now for the interesting bits :).

Firstly, I think we should be aiming to ensure that *no matter what*,
files that bzr creates are named such that all such environments
described above pun the filename as having the same value. That's
essentially 7 bit ascii (the places this breaks are sufficiently far
between IME that we can ignore them).

At the moment we *may* do that but we should go further:
 * We should write tests that check that regardless of revision-id
value, or file-id value, the stores do not request paths containing
non-ascii characters from the transport layer. (Volunteers sought!) This involves
teaching the stores to escape for the transport as part of the
id->filename mapping *before* the url encoding is put on.

That means that no matter where it is, a .bzr dir and its contents will
look the same to us, so we are insulated from the coding effects.
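One way the id->filename escaping could look is sketched below. This is a hypothetical scheme for illustration, not bzr's actual store code: `escape_id`, `unescape_id`, and the choice of safe characters are all assumptions. The id is encoded as utf8, and every byte outside a small ascii-safe set is written as `%xx`, so the resulting file name is pure 7 bit ascii. URL encoding is then applied on top by the transport, as a separate step.

```python
import string
from urllib.parse import quote

# Hypothetical safe set: bytes allowed to appear unescaped in a file name.
SAFE = frozenset((string.ascii_letters + string.digits + "-_.").encode("ascii"))


def escape_id(ident: str) -> str:
    """Map an arbitrary id to a 7 bit ascii file name (illustrative scheme)."""
    out = []
    for byte in ident.encode("utf-8"):
        if byte in SAFE:
            out.append(chr(byte))
        else:
            # '%' itself is not in SAFE, so the mapping stays reversible.
            out.append("%{:02x}".format(byte))
    return "".join(out)


def unescape_id(filename: str) -> str:
    """Invert escape_id."""
    raw = bytearray()
    i = 0
    while i < len(filename):
        if filename[i] == "%":
            raw.append(int(filename[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(filename[i]))
            i += 1
    return raw.decode("utf-8")


revision_id = "r\u00e9v-1"             # id containing a non-ascii character
filename = escape_id(revision_id)      # escaped *before* any url encoding
url_segment = quote(filename)          # the transport's url encoding comes last
```

Because the escaping happens first, the transport only ever sees ascii, and the URL layer merely re-quotes the `%` signs.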

Secondly, the working tree is controlled by the user's content, and there
are many ways this can be broken: they can change their locale between
runs of bzr; they can try to branch a branch that has unicode file names
on a non-unicode platform. I think we can catch most of these errors and
For instance, if status sees some big % of files disappear, and a large
number of unknowns, it could try a couple of the unknowns and recode the
relative path - that might just become valid known paths. Likewise if
you branch a branch that needs unicode support on a non unicode platform
we should give a good error.
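The status heuristic above might be sketched roughly as follows. This is a guess at one possible shape, in modern Python 3, and not anything bzr implements: `recode_unknowns`, `CANDIDATE_PAIRS`, and the sample paths are hypothetical. For each candidate pair of encodings, it re-encodes each unknown path as it was (mis)read and decodes it the other way, keeping the pair that turns the most unknowns into missing known paths.

```python
# Hypothetical (seen_as, really) encoding pairs to try.
CANDIDATE_PAIRS = (("iso8859-1", "utf-8"), ("utf-8", "iso8859-1"))


def recode_unknowns(missing, unknown, pairs=CANDIDATE_PAIRS):
    """Return {unknown_path: known_path} for the best-matching recoding."""
    missing = set(missing)
    best = {}
    for seen_as, really in pairs:
        matches = {}
        for path in unknown:
            try:
                fixed = path.encode(seen_as).decode(really)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # this pair cannot explain this path
            if fixed != path and fixed in missing:
                matches[path] = fixed
        if len(matches) > len(best):
            best = matches
    return best


# Known paths that status reports as missing.
missing = ["caf\u00e9.txt", "na\u00efve.c"]
# Unknown paths: utf8 names misread under iso8859-1, plus an unrelated file.
unknown = ["caf\u00c3\u00a9.txt", "na\u00c3\u00afve.c", "notes.txt"]

recovered = recode_unknowns(missing, unknown)
```

A real implementation would still need a threshold for "some big % of files" and a way to report the suspected locale change to the user.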

If someone wants to do up a wiki page and track the status of this, or
even better start some tests, that would rock!

Alexander - I explicitly copied you because I think you probably have
the most complex setup of a bzr contributor at the moment, and are ideal
to provide input/testing into this.


