Re: [Arx-users] The Future (long)

From: Walter Landry
Subject: Re: [Arx-users] The Future (long)
Date: Thu, 08 Dec 2005 16:08:41 -0800 (PST)

Kevin Smith <address@hidden> wrote:
> Walter Landry wrote:
> > Kevin Smith <address@hidden> wrote:
> > 
> >>Walter Landry wrote:
> >>
> >>>For every branch, there is a directory with the same name as the
> >>>branch, but with a period "." appended.  The period "." makes it easy
> >>>to distinguish branch names (which can be almost anything) from
> >>>everything else.
> >>
> > >>This might cause problems with certain tools on MS Windows (where 
> > >>"empty" extensions are unusual). Otherwise, seems reasonable.
> > 
> > 
> > What kind of trouble?  I can add an extension easily enough (.arx?
> > .bra? .brc?).
> I have memories of Notepad really disliking files without extensions, 
> and vague memories (perhaps false) of running into a case where Windows 
> Explorer couldn't tell the difference between "foo." and "foo". That may 
> have been with Windows 3.1 and 8.3 filenames, though.
> If it's easy to add an extension, I would probably do so, just to be 
> more conventional. My first thought would be ".d" to reaffirm that it's 
> a directory.

Sounds good.

> > I think you are overestimating the complexity of serialization.  If I
> > recall correctly, for a list of strings, the serialization library
> > would write a header, the length of the list as an ascii string
> > (e.g. "12"), and then the elements of the list.  Each element is again
> > a length and then the string itself.  You are not going to get any
> > simpler than that and cover all of the corner cases with embedded
> > nulls etc.  So a list with the elements "crate" and "barrel" would be
> > serialized as
> > 
> >   22 serialization::archive 2 5 crate 6 barrel
> > 
> > The serialization format is not complicated.  What you are probably
> > complaining about is that the _arx/++manifest file has some binary
> > elements.  Those are sha256's of files, and I put them in that format
> > for efficiency (though it may be premature optimization).
> I have two wishes:
> 1) That the file is plain text so I can look at it with any tool. I know 
> I shouldn't need to, but when I'm coding it's a pain to not be able to 
> look at important data easily.

With the exception of the manifest file, I don't think any of the
formats uses raw binary.
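
As a rough illustration of the length-prefixed layout described in the
quoted text above, here is a minimal Python sketch.  It is not ArX's or
boost's actual code: the function names are made up for this example,
and it counts characters where a real archive would count bytes.

```python
# The leading "22" is itself the length of "serialization::archive".
HEADER = "22 serialization::archive"


def serialize_strings(items):
    """Write the header, the list length, then each string as '<length> <text>'."""
    parts = [HEADER, str(len(items))]
    for s in items:
        parts.append(f"{len(s)} {s}")
    return " ".join(parts)


def deserialize_strings(data):
    """Inverse of serialize_strings.

    Each string is read by its declared length, so embedded spaces,
    newlines, or nulls need no escaping.
    """
    if data != HEADER and not data.startswith(HEADER + " "):
        raise ValueError("missing header")
    pos = len(HEADER) + 1

    def read_int(pos):
        # An integer runs up to the next space (or the end of the data).
        end = data.find(" ", pos)
        if end == -1:
            end = len(data)
        return int(data[pos:end]), end + 1

    count, pos = read_int(pos)
    items = []
    for _ in range(count):
        length, pos = read_int(pos)
        items.append(data[pos:pos + length])
        pos += length + 1  # skip the separating space
    return items


print(serialize_strings(["crate", "barrel"]))
# 22 serialization::archive 2 5 crate 6 barrel
```

Because the reader trusts the declared lengths rather than delimiters,
a name such as "x\0y" or "a b" round-trips unchanged.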

> 2) That the full file spec is documented. I want to be able to write a 
> tool in any language to access the data, and reverse-engineering binary 
> data is a royal pain. I think this is more important to me than #1, and 
> I would shy away from a format that might change at any moment due to 
> the whims of the boost developers. Unless they provide assurances about 
> stability.

Documenting the file spec is not hard.  Once it is all implemented, I
can certainly do that.  Also, the format only changes when I decide to
incorporate a new version of boost.

> Your rationale was to allow special characters in revision names. You 
> must have meant branch names.

Yes.  There are also filenames.

> Even so, disallowing newlines and nul bytes doesn't seem like a
> severe limitation on branch names.

It is not just newlines.  Some encodings like to use nulls, and ArX
should not be getting in the way.

> >>You mentioned several drawbacks of skip-deltas. What are the big 
> >>benefits they bring, and what alternatives did you consider?
> > 
> > It only takes O(log(Number of revisions)) to get a particular
> > revision.  So revision 63222 takes about 16 patches.  Currently, it
> > would take 63222 patches.  ArX gets around this somewhat with repo
> > caches.  But that requires repo maintenance, which I really want to
> > get rid of.  Even I don't update cached revisions as much as I should.
> Ah. You mentioned that svn uses skip-deltas. How do the other tools 
> solve that problem? It seems like most systems are, at their core, 
> either a vector or a linked list of revisions. I suppose the speed 
> optimizations would be snapshots (ArX caches, darcs has something 
> similar, maybe GIT bundles?), or some kind of b-tree index, or ???

No one else tries to create revisions over the network.  They all
first sync the repos and then construct the project tree.  With that
said, they have various schemes.

Darcs is actually similar to ArX 2, in that you have ordinary patches
and you can decide to make a cached revision.

Bzr uses per-file weaves.  So the time to retrieve any revision of a
file is proportional to the number of revisions of that file.  I have
not benchmarked it with a large repository.  Bzr itself has a few
thousand revisions and seems relatively snappy.

Hg also stores history on a per-file basis, but it uses a heuristic to
occasionally cache full revisions of that file.  It actually seems to
work pretty well.
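
To make the idea of "occasionally cache a full revision" concrete, here
is a small Python sketch of one plausible heuristic of this kind.  The
rule used (store a full copy once the deltas since the last full copy
would exceed some multiple of the file size) and the threshold of 2.0
are assumptions for illustration, not a statement of Mercurial's exact
rule.

```python
def store_revision(chain, full_text, delta, max_ratio=2.0):
    """Append one revision of a file to `chain` and report how it was stored.

    `chain` is a list of ("full" | "delta", bytes) entries, oldest first.
    `full_text` is the complete new file; `delta` is its diff against the
    previous revision.  `max_ratio` is a hypothetical tuning knob.
    """
    # Total size of the deltas stored since the most recent full copy.
    pending = 0
    for kind, payload in reversed(chain):
        if kind == "full":
            break
        pending += len(payload)

    # Store a full copy for the first revision, or when replaying the
    # deltas would cost more than max_ratio times reading the file itself.
    if not chain or pending + len(delta) > max_ratio * len(full_text):
        chain.append(("full", full_text))
    else:
        chain.append(("delta", delta))
    return chain[-1][0]
```

With a rule like this, the work needed to rebuild any revision stays
bounded by a constant factor of the file size, without anyone having to
maintain caches by hand.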

Vesta and unpacked Git just store every version of every file as plain
text.  Great for speed, terrible for space.  I don't know what packed
git does.

I think monotone stores history on a per-file basis.  They work
exclusively with a local database, so they can quickly reconstruct
files.  It is certainly fast [1], and it seems constant time.

The main problem with per-file history like that is that you have to
write many files for each revision.  Also, you have to download the
entire repo just to get the latest revision.

However, it does seem that I should at least optimize the application
of patches.  Right now, I use gnu tar, diff and patch.  If I instead
stored the patch as an uncompressed binary file with xdeltas, I would
not have to spawn tar, gzip, or patch.  I might store forward and
backward patches separately.  I should do some profiling and maybe
some experimentation to see what kind of difference it would all make.

Still, to get revision 60000, I would have to download 60000 patches.
Also, Subversion has already done this, and they decided that they
needed skip-deltas.  Moreover, I would have to convert from xdeltas to
gnu diffs for the user interface (e.g. patch-report, diff) and to get
fuzzy patching to work.  Implementing annotate would be more
difficult.  Finally, patches would no longer be browseable with
standard tools.
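
For comparison, here is a minimal Python sketch of how short a
skip-delta chain is.  It assumes the rule Subversion documents for its
skip-deltas (the delta base of revision N is N with its lowest set bit
cleared); whatever ArX 3 ends up doing may differ, and the function
names are made up for this example.

```python
def skip_delta_chain(rev):
    """Return the revisions visited when rebuilding `rev`, ending at revision 0."""
    chain = [rev]
    while rev > 0:
        rev &= rev - 1  # clear the lowest set bit to find the delta base
        chain.append(rev)
    return chain


def patches_needed(rev):
    """Number of deltas applied to reach `rev` under the rule above."""
    return len(skip_delta_chain(rev)) - 1


print(skip_delta_chain(54))       # [54, 52, 48, 32, 0]
print(patches_needed(63222))      # 12
print((63222).bit_length())       # 16
```

Under this rule the chain length is the number of set bits in the
revision number, so it can never exceed the bit length.  That bound is
16 for revision 63222, compared with 63222 patches for a plain linear
chain.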

> >>* Cheap branching, even on systems without hardlinks or symlinks. A 
> >>FAT32 user should be able to create a new branch of a large project as 
> >>quickly and using only as much disk space as someone on an ext3 system. 
> >>Multiple branches on a web server should not consume excess space.
> > 
> > 
> > Would these be microbranches or no-history branches?  Microbranches do
> > not consume excess space.  No-history branches do take up some space.
> > This is all independent of what file system you are using.
> My concern is simply that MS-Windows FAT32 users should not be 
> second-class citizens. They should be able to work as efficiently, using 
> the same processes, as other folks. That's not the case right now with 
> darcs or mercurial. Or bzr, but the bzr folks are working on it.

I see what you're thinking. I think you don't have to worry about it.

> If I want to work on ten features a day, each in its own branch that 
> might last an hour or two, what ArX 3 mechanism would I use?


> >>* Will be able to support quilt/bzr-shelve/mq functionality.
> > 
> > If I understand this functionality correctly, this is just selectively
> > reverting files and putting them into a changeset?  Storing revisions
> > as patches against complete trees (as opposed to weaves) makes this
> > pretty trivial.
> > 
> > However, I get the feeling that there is more to it than that.
> It seems that the primary use of bzr shelve is:
> I have made several changes to my working tree, but they really should 
> be two different revisions/changesets. I can "shelve" some of my 
> changes, leaving me with a single changeset that I can test and commit. 
> Then I can unshelve those changes, test the full result, and commit the 
> second revision. It includes darcs-style per-hunk selection.

This is supported through "arx undo".  "bzr shelve" has the option of
per-hunk undos.  That is not present in "arx undo", but could be added.

> It seems that the primary use of quilt is:
> I am tracking an upstream repo. I am maintaining several of my own 
> patches on top of that repo. Every time I sync with the upstream repo, I 
> can push my patches (changesets) aside, sync with upstream, and then 
> re-apply my patches on top. The unit of work is changesets, not files or 
> hunks.
> Further, I can (or at least theoretically could) do patch refactoring:
> - Combine small patches into a single large patch
> - Split a large patch into several smaller patches
> - Reorder patches
> - Modify the patch description or other metadata

This is only partly supported.  Combining and splitting patches is
supported.  Having multiple patches that can only apply in a certain
order is more tricky.  ArX wants a tree to diff against, and the only
one available is the original pristine.  I could see how it could be
done, but I think it would be best as a separate tool (like mq).

> I think mq is very similar to quilt, except that since it is integrated 
> with mercurial, it actually stores my patches in the repo itself. When 
> necessary, those patches are ripped out of the repo, and then reapplied 
> after the upstream sync.
> There are some concerns that mq is dangerous because it can remove 
> changesets from a repo that may already have been published. Darned 
> handy, though.

I don't see how there could be any danger in the context of ArX.  It
is as if you published some microbranches that are no longer being
developed.  That is one of the benefits of separating the project tree
from the repo.

> >>It might be worth storing branch names in a table, rather than exposing 
> >>them as raw filenames. The bzr folks are discussing something similar at 
> >>the moment. I believe that if you burn a backup on one system, and 
> >>restore it on another system, it should just work. That means you can't 
> >>allow just any character, nor can you escape only the characters that 
> >>won't work on the particular file system you are writing to at the moment.
> > 
> > 
> > That introduces another place where things can fail, leaving your repo
> > in an inconsistent state.  Any time you update a file, you have to be
> > prepared to deal with it missing or corrupted.  Bzr, hg, git, etc. all
> > deal with local filesystems where the window for wedging your repo is
> > small.
> So if you use any unusual characters anywhere in your repo, it becomes 
> non-portable. That would include: Repos stored on Samba shares, on plain 
> http servers, and burnt onto CD-ROMs.

This is not a complete solution, but you are allowed to rename
branches.  So the branch named "foo" in one place can be named "bar"
in another.  Two branches "up" and "down" can be combined into one
branch "updown".  So if you are mirroring a repo onto a CD-ROM, you
can choose a legal name.

> I understand the hassle of using indirection to avoid using branch names 
> as filenames, but that still seems like a significant problem to me.

There are basic problems with reliability that I don't want to expose
ArX to.

> >>* Reasonable support for archiving large binary files
> > 
> > 
> > That is another good one.  Basically, you need a streamy binary diff.
> > ArX has a binary diff, but it is not streamy.  I think Subversion (and
> > thus SVK) is the only one with this.
> As long as big binary files can be stored, retrieved, and updated to a 
> new copy, that's sufficient. Better diffing is a plus.

But streamy diffs are required for big (larger than memory) files.

> > ArX stores dates using boost's to_simple_string, which gives you dates like
> > 
> >   2002-Jan-01 10:00:01.123456789Z
> > 
> > instead of ISO
> > 
> >   20020131T100001,123456789
> I'm not thrilled that English text is part of the stored format, but 
> otherwise that seems sane. It's a known set of twelve short strings, 
> appearing at a fixed location, so it would be easy to localize in the UI.

That is a good point.  I had not thought of localization.  You have
convinced me to store the date in ISO format.
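
For anyone who needs to read dates already written in the old style, a
conversion is straightforward.  The Python sketch below is only an
illustration (the function name is made up, and it is not ArX code).
It uses a fixed table of the twelve English abbreviations rather than a
locale-dependent parser, and it keeps the fractional seconds as text so
that all nine digits survive.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def simple_to_iso(stamp):
    """Convert '2002-Jan-01 10:00:01.123456789Z' to '20020101T100001,123456789'."""
    date_part, time_part = stamp.rstrip("Z").split(" ")
    year, month_name, day = date_part.split("-")
    month = MONTHS.index(month_name) + 1  # raises ValueError on an unknown month

    # Split off the optional fractional seconds without converting to float.
    hms, _, fraction = time_part.partition(".")
    hour, minute, second = hms.split(":")

    iso = f"{year}{month:02d}{day}T{hour}{minute}{second}"
    if fraction:
        iso += "," + fraction
    return iso


print(simple_to_iso("2002-Jan-01 10:00:01.123456789Z"))
# 20020101T100001,123456789
```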

> >>* Repo and/or branch "nicknames" or "aliases"
> > 
> > 
> > Are you thinking of multiple names for the same branch?  So that 
> > 
> >   arx get,address@hidden
> Nope. I'm thinking of:
>      arx get walter

That would be simple enough to add with an alias command.  However, I
want to thoroughly implement what I have, and then we can see if this
sort of thing is required.

> >>* Facilities to mitigate newline conversions when a project is shared by 
> >>people using different workstation OS's.
> > 
> > 
> > I recognize the difficulties that people have, but this is such a bag
> > of worms that I have been unmotivated to think about it.
> Yup. For a while, I advocated SCM tools not doing newline conversions, 
> but enough people seem to still be using brain-dead tools that it must 
> be supported to be a mainstream cross-platform tool. I think monotone 
> pioneered the use of hooks for this, and I haven't heard bad things 
> about it.

The standard problem is that the file you have on disk is not the file
you commit to the repo.  Automatic conversion just gives me the
heebie-jeebies, especially since ArX goes to so much effort to ensure
file integrity.  Also, Subversion is _still_ having problems from

> >>* Support for plugins (see bzr and hg), because it makes it far easier 
> >>for non-core developers to experiment with cool stuff, and to prototype 
> >>potential new features before adding them to the core.
> > 
> > 
> > This is, indeed, nice.  ArX has python bindings, but you can't create
> > new commands with it.
> Would it be possible to support C++ plugins? That would be better than 
> nothing, and perhaps a framework could be built on top of that which 
> would actually allow plugins to be written in python, ruby or other 
> languages.

Doing C++ plugins portably would be a bit of a pain.  I am more
inclined to get everything else working first.

> Oh, the repo format should also handle file attributes, as ArX 2 does. 
> Handy for executable bits and other stuff.

Of course.


