From: Bulgrien, Kevin
Subject: FYI: Large file actual performance report; cvs use of ,v header is sometimes non-optimal.
Date: Thu, 17 Jan 2008 11:45:23 -0600




  * This is not the latest cvs revision.  A Mandriva Corporate Server 4.0
    distribution is in use, and this is the stock build included with it.

  * While the situation that first illustrated the problem resulted from
    improper use of arcana (cvs admin -o; a recognized hazard), the
    following is intended to point out a situation where cvs fails to use
    the revision information in the ,v file optimally.  No attempt is made
    here to determine whether the suboptimal performance provides some
    measure of additional robustness.

  * It is recognized that most traditional "source" files are not as large
    as the ones used to illustrate this situation, and that prior Info-CVS
    list posts exist saying it is unwise to use CVS as a general
    configuration management tool.  If anything, this post should
    underscore those statements by giving a rough order of magnitude to
    the cost of processing large files.


  * Server is old (Dell PowerEdge 2300; Dual Pentium II 400 MHz; 1 GB RAM,
    disks are SCSI RAID) but server idles with very little CPU usage.

  * The source file is a UTF-16LE XML file; even after recoding to UTF-8
    reduces its size by 50%, it is on the order of 50 MB.
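The 50% reduction follows directly from the encodings: ASCII-range
characters occupy two bytes in UTF-16LE but only one in UTF-8.  A minimal
Python illustration (the sample XML content is made up, not the author's
file):

```python
# ASCII-heavy XML doubles in size when stored as UTF-16LE, since every
# ASCII character occupies 2 bytes there versus 1 byte in UTF-8.
# (The sample content below is hypothetical, not the actual data file.)
text = "<node attr='value'>data</node>\n" * 1000

utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

print(len(utf16), len(utf8))
assert len(utf16) == 2 * len(utf8)  # exactly 2x for pure-ASCII content
```

Non-ASCII characters shrink the ratio somewhat, which is why the observed
reduction is "on the order of" 50% rather than exactly half.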

To set the stage, a large text file (-kkv / -ko, not -kb) is committed.
The case that illustrated the author's situation involved files on the
order of 50M, with about 2.5 million lines of text in the largest
instances.  For the example, 22 commits of varying sizes were made, with
some diffs being very few lines and others large (up to 800,000 lines).
Commits to HEAD are slow but bearable, on the order of some minutes.

At 1.25, the repository size of the file is on the order of 315 MB.

Consider that in this case, files are being committed after the fact rather
than during development, so what happens is not typical of in-line CVS use.
A long stream of tarred revisions was being untarred and committed to CVS.

At the revision that would be 1.26, it was realized that 1.23, 1.24, and
1.25 were a dead end and that 1.26 was a rework from 1.22.  It seemed easy
enough to delete 1.23, 1.24, and 1.25 and recommit them to a branch so that
the HEAD stream would show the useful progression of changes in actual use
without the noise of the dropped revisions.

cvs admin -o 1.23:1.25 was done.  This operation took quite a bit longer
than the sum of the three commit operations, which starts to make it more
obvious why CVS use with large files is discouraged on this list.  While
work on HEAD is reasonable, working anywhere but HEAD is very costly.

At this point, creating a branch and committing to it takes hours, with
each subsequent commit to the branch taking longer and longer due to the
way diffs are handled in the ,v file.  This is not indicative of a failure
to use ,v file information optimally, however; commit times of multiple
hours are fairly easy to understand by looking at the structure of the ,v
file.  Nor is it a complaint: CVS' natural development did not really need
to anticipate this type of use.  Anyone considering control of large text
files should consider whether these performance issues are acceptable.
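The growth in branch-commit time follows from how RCS-format ,v files
store revisions: the trunk is kept as reverse deltas from HEAD, while
branches are kept as forward deltas from their branch point, so
reconstructing a branch tip walks back down the trunk and then forward
along the branch.  A rough delta-counting model in Python (an
illustration of the idea, not CVS code):

```python
# Hypothetical cost model: reconstructing a branch-tip revision requires
# applying reverse deltas from HEAD back to the branch point, then
# forward deltas along the branch.  Each new branch commit must first
# reconstruct the current tip, so the walk grows by one every time.

def deltas_to_reconstruct(trunk_len, branch_point, branch_len):
    """Number of delta applications to rebuild a branch-tip revision."""
    trunk_walk = trunk_len - branch_point  # reverse deltas HEAD -> branch point
    return trunk_walk + branch_len         # forward deltas along the branch

# 25 trunk revisions, branch rooted at 1.22, five successive branch commits:
costs = [deltas_to_reconstruct(25, 22, n) for n in range(1, 6)]
print(costs)  # [4, 5, 6, 7, 8]
```

With 50 MB revisions and diffs approaching 800,000 lines, each of those
delta applications is expensive, which is consistent with the hours-long
commits observed above.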

For additional consideration, during the CVS operation, one CPU sits at
100% usage by the cvs server for the duration of the commit (hours of
time).  If multiple developers were attempting this type of work, the
server would be brought to its knees for hours on end, with all of its
processors loaded at 100%.

After the branch commits, the next file is to be committed to HEAD, but
the developer did not run `cvs update` after the `cvs admin -o` command,
so the next commit was destined to fail: the working directory's
CVS/Entries file is now erroneous and marks the working directory file as
the now-deleted revision 1.25.  See the prior post "FYI: cvs can break a
checked out working directory" for details.

The commit of the new 1.23 should have been quite fast (on the order of
minutes) because it took place at the top of HEAD, but instead it takes
perhaps 8-12 hours and, in fact, fails with an error saying 1.25 cannot
be found.  This is the situation where the title "cvs use of ,v header is
sometimes non-optimal" comes into play.  The only plausible way to have
consumed this much time before failing is that cvs diffed each and every
revision in the file from the top of HEAD to the origin, checking each
revision as it went, until it reached 1.1 having found no match.

Oddly, the revision record was entirely in the ,v file header, which could
have allowed an error to be returned almost instantaneously had cvs tried
to use this information optimally.  cvs could have errored out simply by
reporting that the header contained no record of the revision it was
looking for; instead, it insisted on traversing the diffs.
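To illustrate the point: the admin section at the top of a ,v file lists
every revision number (each immediately followed by its "date" line), so
a lookup there could reject a nonexistent revision without replaying any
diffs.  A sketch in Python over a hand-written ,v fragment (not CVS's
actual parser, and the dates/author are invented):

```python
import re

# A hand-written fragment of a ,v admin section, as it might look after
# cvs admin -o has outdated 1.23 through 1.25 and HEAD is back at 1.22.
RCS_HEADER = """head\t1.22;
access;
symbols;
locks; strict;
comment\t@# @;

1.22
date\t2008.01.15.17.00.00;\tauthor kevin;\tstate Exp;
branches;
next\t1.21;

1.21
date\t2008.01.14.17.00.00;\tauthor kevin;\tstate Exp;
branches;
next\t1.20;
"""

def revisions_in_header(text):
    # A revision entry is a bare revision number on its own line,
    # immediately followed by its "date" line.
    return re.findall(r"^([0-9.]+)\n(?=date\b)", text, flags=re.MULTILINE)

revs = revisions_in_header(RCS_HEADER)
print(revs)  # ['1.22', '1.21']
assert "1.25" not in revs  # a fast-fail lookup; no diff traversal needed
```

A check like this costs one pass over the header, versus hours of delta
reconstruction for a 315 MB ,v file.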

The diff traversal would not be terribly expensive with small files and
short diff histories, but with these "3XL" source files and diffs the hit
is terrible, and it makes it more obvious why CVS may not be the ideal
tool for controlling files of this size.

The FYI in the subject line is intended to communicate that this is not a
judgment against CVS, especially as there is a possibility that this
behavior adds a measure of robustness in that the header need not be
trusted (shrug?).  Further, the problem situation was created through use
of the non-recommended cvs admin -o as a root cause, so this behavior
should not occur otherwise.

Nevertheless, this post is written in the spirit of archiving data that
was costly to obtain through actual experience.  It may be useful only to
the author, but it might also help others decide whether or not to use
CVS for large files, should they happen to stumble across it.

Kevin R. Bulgrien

