gnu-arch-users

Re: [Gnu-arch-users] Storage efficiency of revlibs


From: Mikhael Goikhman
Subject: Re: [Gnu-arch-users] Storage efficiency of revlibs
Date: Fri, 9 Dec 2005 13:04:23 +0000
User-agent: Mutt/1.4.2.1i

On 08 Dec 2005 09:08:47 +0100, Ludovic Courtès wrote:
> 
> Mikhael Goikhman <address@hidden> writes:
> 
> >   % revision=archzoom--devel--0--patch-300
> >   % cd `tla library-find $revision`/..
> >   % tar cf - --exclude $revision/,,patch-set --exclude $revision/,,index \
> >     --exclude $revision/,,index-by-name $revision | gzip -9 >$revision.tar.gz
> >   % du -s --block-size=1 $revision
> >   % ls -s --block-size=1 $revision.tar.gz
> >   3403776 archzoom--devel--0--patch-300
> >   163840 archzoom--devel--0--patch-300.tar.gz
> >
> > The ratio is 21. There is a small, but increasing gain when compared with
> > earlier revisions (18), in particular because {arch} contains a lot of
> > small files that are compressed nicely. Probably better than hardlinking.
> 
> You're comparing the size of a *single* revision directory against
> tar+gz.  This doesn't make much sense since, by definition, the hard
> link trick compresses data *across* several revisions.

It makes perfect sense to me. Only if you show that this ratio is lower
than the revlib compression ratio (du -s against du -sl) can you come
to your previously stated conclusion. This math is more correct,
because it also accounts for the extra files stored only in the revlib.
So, what are the results of these commands for your project?
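
For reference, a minimal sketch of that comparison. It assumes GNU du
(for --block-size and -l, i.e. --count-links); the function names are
made up for this illustration and are not part of tla.

```shell
#!/bin/sh
# Sketch only: assumes GNU du (--block-size, -l / --count-links).
# The function names are hypothetical helpers, not tla commands.

# Bytes on disk with each hardlinked file counted once
# (what the revlib really occupies).
du_shared() {
    du -s --block-size=1 "$1" | cut -f1
}

# Bytes on disk with every hardlink counted again
# (as if nothing were shared between revisions).
du_unshared() {
    du -sl --block-size=1 "$1" | cut -f1
}

# Print the revlib "compression ratio": unshared size / shared size.
revlib_ratio() {
    awk -v u="$(du_unshared "$1")" -v s="$(du_shared "$1")" \
        'BEGIN { if (s > 0) printf "%.1f\n", u / s; else print "n/a" }'
}
```

Run revlib_ratio on the directory that holds all revisions of a version
(the parent of `tla library-find $revision`) and compare the number with
the tar+gzip ratio quoted above.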

> > Please don't forget that a hardlink costs more than 0,
> 
> Can you elaborate on that?

The actual implementation of hardlinks is filesystem dependent. There is
usually an entry in the directory listing for each hardlink (just like
for any filename) with a pointer to the inode. But without measuring,
you can't tell whether it is several bytes or several kilobytes per
hardlink.

Subdirectories cannot be hardlinked. Each one occupies its own inode
and often many directory blocks as well, since a revlib has large and
ever-growing subdirs.
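
One way to put numbers on this, as a rough sketch: count names, distinct
inodes and directories in a revlib tree, and sum the sizes of the
directories themselves. It assumes GNU find (for -printf) and file names
without embedded newlines; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU find (-printf) and no newlines in file names.
# The function names are hypothetical helpers.

# Print how many names, distinct inodes and directories a tree contains.
# names - inodes = number of names that are extra hardlinks.
revlib_counts() {
    dir=$1
    names=$(find "$dir" | wc -l | tr -d ' ')
    inodes=$(find "$dir" -printf '%i\n' | sort -u | wc -l | tr -d ' ')
    dirs=$(find "$dir" -type d | wc -l | tr -d ' ')
    echo "names=$names inodes=$inodes dirs=$dirs"
}

# Sum of the sizes (st_size) of the directories themselves, in bytes.
# This is where the per-hardlink directory entries end up.
dir_bytes() {
    find "$1" -type d -printf '%s\n' | awk '{ s += $1 } END { print s + 0 }'
}
```

Dividing the growth of dir_bytes by the number of added hardlinks between
two states of the revlib gives an estimate of the per-hardlink cost on a
particular filesystem.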

> > For me (and for du/rm) it is not the size, but number of inodes that is
> > more important, so this very CPU expensive solution would not solve much.
> 
> There are several good papers on the topic [0,1,2].  I'm pretty
> confident that hard link + gzip of individual files would yield a better
> compression ratio than keeping several whole revision tarballs, *when*
> several subsequent revisions are kept.

It may sound correct theoretically, but I would not be surprised if even
this is not always true. Remember, a revlib includes ever-growing
,,index* files that may easily reach 200 KB per revision. It includes
the changeset too, and neither the indexes nor the changeset diffs are
shareable at all. So any theory is just words without actual
verification on real projects.
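
A small sketch for checking this on a real revlib: sum the disk usage of
the files that can never be shared in one revision directory. It assumes
GNU du and the ,,index, ,,index-by-name and ,,patch-set names used in
the commands quoted above; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU du (--block-size, -c) and the revlib file
# names ,,index* and ,,patch-set. The function name is hypothetical.

# Print the bytes on disk used by the per-revision files that cannot be
# hardlinked to other revisions (the indexes and the changeset).
unshared_bytes() {
    rev=$1
    du -sc --block-size=1 "$rev"/,,index* "$rev"/,,patch-set 2>/dev/null |
        tail -n 1 | cut -f1
}
```

Looping over all revision directories and adding up the results shows
how much of the revlib no hardlinking scheme can save.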

And again, although individually gzipped files may reduce disk usage,
they produce new problems (a busy CPU) and do not solve the file count
problem.

Regards,
Mikhael.



