Re: [Arx-users] The Future (long)

arx-users



From:	Kevin Smith
Subject:	Re: [Arx-users] The Future (long)
Date:	Wed, 07 Dec 2005 21:34:44 -0500
User-agent:	Mozilla Thunderbird 1.0.7 (X11/20051011)

Walter Landry wrote:

Kevin Smith <address@hidden> wrote:

Walter Landry wrote:
For every branch, there is a directory with the same name as the
branch, but with a period "." appended.  The period "." makes it easy
to distinguish branch names (which can be almost anything) from
everything else.
This might cause problems with certain tools on MS Windows (where "empty" extensions are unusual). Otherwise, seems reasonable.



What kind of trouble?  I can add an extension easily enough (.arx?
.bra? .brc?).

I have memories of Notepad really disliking files without extensions, and vague memories (perhaps false) of running into a case where Windows Explorer couldn't tell the difference between "foo." and "foo". That may have been with Windows 3.1 and 8.3 filenames, though.

If it's easy to add an extension, I would probably do so, just to be more conventional. My first thought would be ".d" to reaffirm that it's a directory.

I think that's a bit of an overstatement. It's true that an attacker couldn't just drop a fake revision in to replace one that you had signed. However, someone could disrupt the system by signing two different revisions that share the same hash but have different contents. Just something to consider as a corner case.
Could you be more specific?  I don't see how what you are describing
is different from just making a directory with the same 60 bit name
and putting junk in it.  Yes, it is disruptive, but allowing people to
modify the repository opens you up to that kind of thing.  In either
case, whatever you get won't be signed or won't validate to the
correct 256 bit hash.

Also, when you say "share the same hash", I presume you are talking
about the first 60 bits, not the entire 256 bits.  It is infeasible
to create different files with the same 256 bits of hash.


Yes, I was referring to sharing the "first 60 bits".

You are correct that it's not a real issue, as long as you never do a sha check of the contents against the abbreviated directory name. As long as you also check the full hash at the same time, you would catch any problems. So: never mind.
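To make the point concrete, here is a minimal sketch of that double check. It assumes the abbreviated directory name is simply the first 60 bits (15 hex digits) of the SHA-256; the helper names are hypothetical and not ArX code.

```python
import hashlib

PREFIX_BITS = 60               # abbreviated name length discussed above
PREFIX_HEX = PREFIX_BITS // 4  # 60 bits == 15 hex digits

def abbreviated_name(data):
    """Hypothetical directory name: leading 60 bits of the SHA-256."""
    return hashlib.sha256(data).hexdigest()[:PREFIX_HEX]

def validate(data, dir_name, signed_full_hash):
    """Accept contents only if they match both the short name and the
    full 256-bit hash taken from the signed patch log."""
    full = hashlib.sha256(data).hexdigest()
    return full.startswith(dir_name) and full == signed_full_hash
```

Junk dropped into a directory with a matching 60-bit name still fails the comparison against the signed full hash.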

That sounds like a good idea, although I think "index" is the wrong word. It's more of a cached hash.
How about "dirhash"?


Sounds fine.

I think you are overestimating the complexity of serialization.  If I
recall correctly, for a list of strings, the serialization library
would write a header, the length of the list as an ascii string
(e.g. "12"), and then the elements of the list.  Each element is again
a length and then the string itself.  You are not going to get any
simpler than that and cover all of the corner cases with embedded
nulls etc.  So a list with the elements "crate" and "barrel" would be
serialized as

  22 serialization::archive 2 5 crate 6 barrel

The serialization format is not complicated.  What you are probably
complaining about is that the _arx/++manifest file has some binary
elements.  Those are sha256's of files, and I put them in that format
for efficiency (though it may be premature optimization).
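For readers who want to poke at the idea, here is a small sketch of the length-prefixed layout described above. It only mimics the quoted example line; it is not the Boost library itself, and a real Boost text archive may carry extra fields (such as version numbers) that this sketch ignores.

```python
HEADER = "serialization::archive"  # header string from the example above

def _put(s):
    # Each string is written as "<length> <string>".
    return "%d %s" % (len(s), s)

def serialize(items):
    parts = [_put(HEADER), str(len(items))]
    parts.extend(_put(s) for s in items)
    return " ".join(parts)

def _read_int(text, pos):
    end = text.find(" ", pos)
    if end == -1:              # number is the last token
        end = len(text)
    return int(text[pos:end]), end + 1

def _read_str(text, pos):
    n, pos = _read_int(text, pos)
    # The length tells us where the string ends, so embedded spaces
    # and nul bytes need no escaping.
    return text[pos:pos + n], pos + n + 1

def parse(text):
    header, pos = _read_str(text, 0)
    if header != HEADER:
        raise ValueError("unexpected header: %r" % header)
    count, pos = _read_int(text, pos)
    items = []
    for _ in range(count):
        s, pos = _read_str(text, pos)
        items.append(s)
    return items
```

With this sketch, `serialize(["crate", "barrel"])` reproduces the example line above, and `parse` reads it back.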


I have two wishes:

1) That the file is plain text so I can look at it with any tool. I know I shouldn't need to, but when I'm coding it's a pain to not be able to look at important data easily.

2) That the full file spec is documented. I want to be able to write a tool in any language to access the data, and reverse-engineering binary data is a royal pain. I think this is more important to me than #1, and I would shy away from a format that might change at any moment due to the whims of the boost developers. Unless they provide assurances about stability.

Your rationale was to allow special characters in revision names. You must have meant branch names. Even so, disallowing newlines and nul bytes doesn't seem like a severe limitation on branch names.

It is fairly simple to go from there to using the branch==repo
paradigm that hg, bzr, darcs, etc. have.  My thought right now is that
that paradigm is sufficiently different from the separate repo and
tree paradigm that I would want a different command for it.
I'm not quite sure what you're saying, but I think the repo == branches paradigm of ArX is one of its strengths.



I assume you mean branch!=repo here?  In any case, I am just saying
that, for those who prefer branch==repo, it would be simple to create
a tool to cater to them.  Everyone would use the same master repo.

Yes, I struggled with the wording, which is why I said repo==branchES, as opposed to repo==branch. I have not yet seen a branch==repo SCM app that allows cheap branching on non-hardlink file systems, so I remain happy about ArX repos.

You mentioned several drawbacks of skip-deltas. What are the big benefits they bring, and what alternatives did you consider?


It only takes O(log(Number of revisions)) to get a particular
revision.  So revision 63222 takes about 16 patches.  Currently, it
would take 63222 patches.  ArX gets around this somewhat with repo
caches.  But that requires repo maintenance, which I really want to
get rid of.  Even I don't update cached revisions as much as I should.
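As a rough illustration of where the logarithm comes from, the sketch below assumes the Subversion-style rule in which revision N is stored as a delta against N with its lowest set bit cleared. The thread does not say which rule ArX would use, so treat the rule and the function name as assumptions.

```python
def skip_delta_chain(rev):
    """Return the revisions whose deltas must be applied, newest first,
    to rebuild `rev` starting from the empty revision 0."""
    chain = []
    while rev > 0:
        chain.append(rev)
        rev &= rev - 1  # clear the lowest set bit to find the delta base
    return chain
```

Under this rule the chain length is the number of set bits in the revision number, so it never exceeds the bit length: at most 16 deltas for revision 63222, compared with 63222 for a plain linear chain.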

Ah. You mentioned that svn uses skip-deltas. How do the other tools solve that problem? It seems like most systems are, at their core, either a vector or a linked list of revisions. I suppose the speed optimizations would be snapshots (ArX caches, darcs has something similar, maybe GIT bundles?), or some kind of b-tree index, or ???

At this point, I
am still waiting for someone else to figure out the best merging
strategy ;)


Smart.

The bzr folks keep talking about "knits", which are some variant of weaves. I think those are both part of a more generic strategy of doing merges based on annotated lines, regardless of how those are stored.
I have seen mention of knits, but I don't really know what they are.

Me neither. I think I half-understood them a few weeks ago, but it's gone now.

The bzr folks are almost talking as if bzr will have multiple back ends. One might store weaves, another knits, and another might store "delta histories".

* Signatures on patches and revisions:
 -
 The signature on the patch log covers the sha256 of the revision and
 patch.  Sha256 should be good for the next 50 years or so, barring
 unforeseen developments.  The same can not be said for sha1.


As long as this is fast enough, I think it's a good choice.



It is the _only_ choice if you actually care about security.  Don't
get me started.

Well, I could get into a whole thing about how SHA-1 might be good enough for most purposes for a while, or about how there might be some legitimate competitors to SHA-256, but I won't. SHA-256 makes sense.

* Cheap branching, even on systems without hardlinks or symlinks. A FAT32 user should be able to create a new branch of a large project as quickly and using only as much disk space as someone on an ext3 system. Multiple branches on a web server should not consume excess space.
Would these be microbranches or no-history branches?  Microbranches do
not consume excess space.  No-history branches do take up some space.
This is all independent of what file system you are using.

My concern is simply that MS-Windows FAT32 users should not be second-class citizens. They should be able to work as efficiently, using the same processes, as other folks. That's not the case right now with darcs or mercurial. Or bzr, but the bzr folks are working on it.

If I want to work on ten features a day, each in its own branch that might last an hour or two, what ArX 3 mechanism would I use?

* Will be able to support quilt/bzr-shelve/mq functionality.


If I understand this functionality correctly, this is just selectively
reverting files and putting them into a changeset?  Storing revisions
as patches against complete trees (as opposed to weaves) makes this
pretty trivial.

However, I get the feeling that there is more to it than that.


It seems that the primary use of bzr shelve is:

I have made several changes to my working tree, but they really should be two different revisions/changesets. I can "shelve" some of my changes, leaving me with a single changeset that I can test and commit. Then I can unshelve those changes, test the full result, and commit the second revision. It includes darcs-style per-hunk selection.


It seems that the primary use of quilt is:

I am tracking an upstream repo. I am maintaining several of my own patches on top of that repo. Every time I sync with the upstream repo, I can push my patches (changesets) aside, sync with upstream, and then re-apply my patches on top. The unit of work is changesets, not files or hunks.


Further, I can (or at least theoretically could) do patch refactoring:
- Combine small patches into a single large patch
- Split a large patch into several smaller patches
- Reorder patches
- Modify the patch description or other metadata

I think mq is very similar to quilt, except that since it is integrated with mercurial, it actually stores my patches in the repo itself. When necessary, those patches are ripped out of the repo, and then reapplied after the upstream sync.

There are some concerns that mq is dangerous because it can remove changesets from a repo that may already have been published. Darned handy, though.

It might be worth storing branch names in a table, rather than exposing them as raw filenames. The bzr folks are discussing something similar at the moment. I believe that if you burn a backup on one system, and restore it on another system, it should just work. That means you can't allow just any character, nor can you escape only the characters that won't work on the particular file system you are writing to at the moment.
That introduces another place where things can fail, leaving your repo
in an inconsistent state.  Any time you update a file, you have to be
prepared to deal with it missing or corrupted.  Bzr, hg, git, etc. all
deal with local filesystems where the window for wedging your repo is
small.

So if you use any unusual characters anywhere in your repo, it becomes non-portable. That would include repos stored on Samba shares, on plain http servers, and burnt onto CD-ROMs.

I understand the hassle of using indirection to avoid using branch names as filenames, but that still seems like a significant problem to me.
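One alternative to a lookup table would be to escape every character outside a small portable set, the same way on every platform, so that a repo copied between file systems keeps the same directory names. The sketch below is only an illustration: the safe set, the "=XX" escape syntax, and the function names are all assumptions, not anything ArX or bzr does.

```python
# Conservative set that should survive FAT32, Samba, http and CD-ROM.
# Uppercase letters are escaped too, because case-insensitive file
# systems cannot tell "Foo" from "foo".
SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789-_")

def encode_branch_name(name):
    """Map an arbitrary branch name to a portable file name."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in SAFE else "=%02X" % byte)
    return "".join(out)

def decode_branch_name(filename):
    """Invert encode_branch_name()."""
    out = bytearray()
    i = 0
    while i < len(filename):
        if filename[i] == "=":
            out.append(int(filename[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(filename[i]))
            i += 1
    return out.decode("utf-8")
```

Because the mapping is a pure function of the branch name, there is no extra table to update, and so no extra file that can go missing or get corrupted. The cost is that unusual names become unreadable on disk.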

* Works with write-once media. (No one really has this, although
 tla/baz/arx and subversion (with fsfs) could be modified to do so.
 We just need a place to put the lock files.)
 -
 No
I thought GIT had this. I don't think supporting write-once media is a critical feature. Requiring only append access could help in certain high-security cases.
Doesn't git have a file which tells you what HEAD is?  You need
something to serve over http.

You're right, although I believe HEAD is/was an optional convention. Linus resisted adding tags for a long time, instead just announcing the hash value of the latest release.

Correct.  I call what hg has a microbranch.  What this new repo format
really gives us is microbranches.


Ok. I like that term.

I still prefer the term "distributed branch" for the "no history"
case.



Actually, I like the term "no history".  "Truncated history" would also
work, although that is a bit longer.


Any of those work for me. Just not lightweight :-)

* Reasonable support for archiving large binary files



That is another good one.  Basically, you need a streamy binary diff.
ArX has a binary diff, but it is not streamy.  I think Subversion (and
thus SVK) are the only ones with this.

As long as big binary files can be stored, retrieved, and updated to a new copy, that's sufficient. Better diffing is a plus.

ArX stores dates using boost's to_simple_string, which gives you dates like

  2002-Jan-01 10:00:01.123456789Z

instead of ISO

  20020131T100001,123456789

I'm not thrilled that English text is part of the stored format, but otherwise that seems sane. It's a known set of twelve short strings, appearing at a fixed location, so it would be easy to localize in the UI.
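Since the month names are a fixed table, a tool in any language can read the stored form without relying on locale settings. A minimal sketch, assuming the stored layout is exactly the one in the example above (the function name is made up):

```python
import re

# The twelve English abbreviations that appear in the stored format.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

_SIMPLE = re.compile(
    r"^(\d{4})-([A-Z][a-z]{2})-(\d{2}) "
    r"(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?Z?$")

def simple_to_iso(text):
    """Convert '2002-Jan-01 10:00:01.123456789Z' into the compact ISO
    layout shown above, keeping every fractional digit."""
    m = _SIMPLE.match(text)
    if m is None:
        raise ValueError("unrecognized date: %r" % text)
    year, mon, day, hh, mm, ss, frac = m.groups()
    month = MONTHS.index(mon) + 1  # raises ValueError for unknown names
    iso = "%s%02d%sT%s%s%s" % (year, month, day, hh, mm, ss)
    if frac:
        iso += "," + frac
    return iso
```

The fraction is handled as a string because Python's datetime type only keeps microseconds and would drop the last three digits.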

* Repo and/or branch "nicknames" or "aliases"
Are you thinking of multiple names for the same branch? So that
  arx get http://foo.com,address@hidden


Nope. I'm thinking of:

    arx get walter

instead of whatever long URL happens to contain the latest official ArX tree. I guess it depends on how often I have to type the URL. If it's once a month, I don't really care. If it's several times a day (as it seemed to be with ArX 2), it's important.

This could just be a lookup table stored in a .conf file. Bzr has "branch nicks" (nicknames) which seem similar, although I haven't actually used them. In ArX, I'm not sure whether having aliases for branches would be valuable or not.
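Something like the following is all I have in mind. The file layout, the "aliases" section name, and the example URL are invented for illustration; nothing like this exists in ArX today.

```python
import configparser

# Hypothetical contents of an aliases .conf file.
EXAMPLE_CONF = """
[aliases]
walter = http://example.org/arx/main
"""

def load_aliases(text):
    # Interpolation is disabled so a '%' in a URL is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("aliases"):
        return {}
    return dict(parser["aliases"])

def resolve(name, aliases):
    """Return the URL for a nickname, or the argument unchanged."""
    return aliases.get(name, name)
```

With that table loaded, "arx get walter" would resolve "walter" to the long URL before doing anything else, and any argument that is not a known nickname passes through untouched.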

* Facilities to mitigate newline conversions when a project is shared by people using different workstation OS's.
I recognize the difficulties that people have, but this is such a bag
of worms that I have been unmotivated to think about it.

Yup. For a while, I advocated SCM tools not doing newline conversions, but enough people seem to still be using brain-dead tools that it must be supported to be a mainstream cross-platform tool. I think monotone pioneered the use of hooks for this, and I haven't heard bad things about it.

* Support for plugins (see bzr and hg), because it makes it far easier for non-core developers to experiment with cool stuff, and to prototype potential new features before adding them to the core.
This is, indeed, nice.  ArX has python bindings, but you can't create
new commands with it.

Would it be possible to support C++ plugins? That would be better than nothing, and perhaps a framework could be built on top of that which would actually allow plugins to be written in python, ruby or other languages.

Oh, the repo format should also handle file attributes, as ArX 2 does. Handy for executable bits and other stuff.

Thanks for the comments.


You're welcome. Fun stuff.

Kevin
