CVS soft-tagging + other problems & solutions
Wed, 15 Aug 2007 10:56:07 -0000
Following is a rather lengthy discussion of the problems that we at Jungo
faced in using CVS, and the solutions we used. We go into particular detail
on a new concept we call 'soft-tagging'. This concept was such a success
with our users that we feel obliged to spread the idea.
This first mail generally describes the issues encountered and the
Jungo is using CVS for its SCM, handling ~40 GB of data, in ~70 K
~300K files, and ~3000 tags/branches. We have ~100 developers using
automated build-and-test hosts using CVS, and all our infrastructure
using CVS as well to automatically fetch their configuration.
Jungo also uses an automated merging tool to merge bug fixes and
between product branches.
We have encountered the following problems with CVS:
- Slow checkout times: over 1.5 hours.
- Slow update times: over 40 minutes.
- Slow tagging and branching: over 12 hours!!!
- Automated merging failed on some delicate cases due to subtle bugs
- Binary files
- adding a file to MAIN trunk, then to a branch.
- A full checkout of the repository takes 5 GB, while a typical
needs a 1GB subset of the repository.
We reviewed the options of moving to another opensource SCM system
subversion), or to a commercial tool (such as clearcase). In the end
decided that the internal design of CVS is actually the best design,
preferred to just improve the (good) basic design of CVS, rather than move to
a completely new tool which has a lot of design problems.
All the solutions implemented by Jungo are 100% server side, thus
to CVS clients, so no need to update CVS clients working with the
to make use of the new features.
The only feature which requires a patched client in order to work is
Modules feature (which we will describe later on), but also for this
have now come up with a new design that, if implemented, will enable the
Modules feature to be transparent to the CVS clients. A volunteer from
community would be most welcome to receive from us the design
Also, all the features have configuration flags to enable/disable
keep behavior compatible with previous versions of CVS.
In addition, many sanity tests were added to validate the behavior of
After we implemented all the improvements, we ended with a CVS server
that performs very fast and in a highly reliable manner.
Following are the solutions used by Jungo to handle these problems:
Problem: slow tagging and branching
Solution: immediate tagging and branching
Looking at the RCS files on the server, we found that on average the
symbols info accounted for over 70% of the contents of the RCS files!
3000 tags take a lot of room compared to a small typical C source
The soft-tagging/soft-branching feature allows one to define tags/
using a BRANCH+DATE setting. This is a tremendous step forward for CVS
allowing it to outperform any other SCM system in tagging/branching
also allows for the first time in CVS to keep accurate track of when
was a TAG
logically done. Before soft-tagging, one could only try to guess the
from looking at all the files where the tag was included.
This method works, since over 95% of tags/branches in typical project
tag/branch is for the same BRANCH+DATE for all its files.
The soft-tagging/soft-branching still leaves intact the regular
tagging (now referred to as 'hard-tags' and 'hard-branches'...).
Soft-tagging is designed to live well together side-by-side with
hard-tagging. It even allows using mixed-mode tags: if you added a
the whole repository (300K files) and now you want to manually move the tag on
2 of the files, you can simply do a hard-tag on these two files to
The implementation originates in code from the newtags2 patch available on the
Savannah.gnu.org CVS server, extending the concept of built-in
This is also related to the slow check-out problem, discussed below.
The design outline is at the end of this email.
TESTS: sanity.sh test name: soft-tags.
CONFIG: CVSROOT/stags: define here the soft-tags and soft-branches.
COMPATIBILITY: all clients
Problem: binary files merging complains on conflict even when
Solution: corrected behavior in merging binary identical files
We understand that binary files cannot be merged when there are REAL
conflicts. The problem is that CVS does not handle binary file merges
when the files are 100% identical and there is no real conflict! In
files, if you do 'cvs update' and the server has a new revision which
identical to your locally modified revision, cvs will say: "Changes
So we also made binary file merge check if the locally modified file
identical to the newly merged file. If so - it will allow the merge.
TESTS: sanity.sh test name: binarymerge
CONFIG: CVSROOT/config: MergeBinaryIdentical=y MergeAddToNonDead=y
COMPATIBILITY: all clients
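
The identical-content check described above can be sketched in a few lines.
This is only an illustration in Python, not the actual C patch: the function
name merge_binary and the "merged"/"conflict" return values are hypothetical,
and the real check runs inside the CVS server's merge code.

```python
import filecmp

def merge_binary(local_path, incoming_path):
    """Decide the outcome of merging a binary file (hypothetical helper).

    If the locally modified file is byte-for-byte identical to the newly
    merged revision there is no real conflict, so the merge is allowed.
    Anything else is still reported as a conflict, as before.
    """
    # shallow=False forces a comparison of file contents, not just stat data.
    if filecmp.cmp(local_path, incoming_path, shallow=False):
        return "merged"
    return "conflict"
```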
Problem: 'cvs up -j -j' incorrect behavior when adding a file to
and later on to branch.
Solution: fixed logical bug related to the way CVS creates revisions
for new files on trunk and branches.
Jungo's automated merging system requires 'cvs up -j -j' to always
correctly. With plain CVS it worked correctly 98% of the time, leaving
issues handled incorrectly.
This issue is fixed by always starting with a dead revision - possibly
revision 1.1 dead. CVS 1.12.13 added dead revision on branch, which
solves this issue, but still left some cases uncovered.
'cvs update -rBRANCH:DATE' and 'cvs up -j -j' do not work correctly in
Jan-07: Before this file was ever added, we tag the whole tree:
cvs tag -b Jan-branch
Feb-07: We add a new file (cvs add) to trunk:
1.1 Feb-07 alive
Now we do 'cvs tag -b Feb-branch':
126.96.36.199 alive (Feb-branch)
Mar-07: Then continue work on trunk:
1.2 Mar-07 alive
Apr-07: modify on Feb-branch:
188.8.131.52 Apr-07 alive (Feb-branch)
and on Jan-branch (we add the file: cvs add):
184.108.40.206 Apr-07 alive (Jan-branch)
revision 1.2 is not really the parent of 220.127.116.11. The real parent
should be 1.0
(not that such a revision exists...). A simple way to see this
problem: 'cvs up
-rJan-branch:1-Mar-07' will retrieve revision 1.2 (live), instead of a
Solution: Always branch out pre-existing branches from BEFORE the
first live revision on the trunk: So always make sure 1.1 is dead
(CreateTrunkDeadRev) and always branch out pre-existing branches from
Since 'cvs up' gives incorrect results, 'cvs up -j -j' will give
results as well.
And what about cases that the RCS file already has 1.1 live since the
file was created in older versions of CVS server before this issue has
The solution is to branch from 1.1, and create a DEAD revision on the
(18.104.22.168 dead and 22.214.171.124 live) and to put the date of 126.96.36.199
same as the 1.1 revision. This 'backwards compatibility' hack solves
with pre-patch existing CVS repositories.
TESTS: sanity.sh test name: createdeadrev
COMPATIBILITY: all clients
Problem: doing 'cvs ci' after 'cvs rm' does not print out the revision
added to the RCS file
Solution: allow printing out the newly added revision for remove
be compatible with the rest of the operations.
It is hard to guess the version for a file when it is removed,
the case when removing a file that had no real revision on the branch.
The 'cvs ci' message for a typical commit is:
cvs_repository/dir/f1,v <-- f1
new revision: 188.8.131.52; previous revision: 1.3
The 'cvs ci' message for 'cvs rm' is:
cvs_repository/dir/f1,v <-- f1
new revision: delete; previous revision: 184.108.40.206
Jungo's automated merging system relies on the output of 'cvs ci' to
the revision numbers added to the RCS file, so it is critical to have
'cvs ci' print out the revision for ALL operations. The modified
cvs_repository/dir/f1,v <-- f1
new revision: delete 220.127.116.11; previous revision: 18.104.22.168
COMPATIBILITY: all clients
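
To illustrate why a tool that reads 'cvs ci' output needs the revision on
every line, here is a minimal parsing sketch in Python. It is not Jungo's
merging tool: the function name parse_commit_line and the revision numbers in
the usage note are made up for the example, and only the three line shapes
quoted above are assumed.

```python
import re

# Matches the three shapes of the 'new revision' line:
#   new revision: <rev>; previous revision: <rev>          (normal commit)
#   new revision: delete; previous revision: <rev>         (plain CVS remove)
#   new revision: delete <rev>; previous revision: <rev>   (modified output)
REVISION_LINE = re.compile(
    r"^new revision: (delete)?\s*(\d+(?:\.\d+)*)?; "
    r"previous revision: (\d+(?:\.\d+)*)$"
)

def parse_commit_line(line):
    """Return (is_delete, new_revision, previous_revision), or None.

    new_revision is None when plain CVS prints only 'delete', which is
    exactly the case a tool reading this output cannot handle.
    """
    match = REVISION_LINE.match(line.strip())
    if match is None:
        return None
    return (match.group(1) is not None, match.group(2), match.group(3))
```

For example, parse_commit_line("new revision: delete 1.4; previous revision:
1.3") yields (True, "1.4", "1.3"), while the plain CVS form without a revision
yields (True, None, "1.3").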
Problem: modules only work in checkout: cannot work effectively and
Solution: keep sticky modules setting
CVS implementation of modules works for the checkout. However, once
the module setting does not persist. Developers want to get any new
added to the same module with 'cvs up -d', but doing this with plain
also bring directories excluded from the module.
This feature is very important for very large projects, such as BSD or
where one would typically want to check out only a subset of the
Jungo, a full checkout is around 5GB. A typical partial check out is
1GB. This means that this feature speeds up 'cvs checkout' (and
subsequently 'cvs update') about fivefold!
It also saves disk space! - Many engineers have multiple checked out
their PCs.... Since "sticky modules" is such an important feature for
projects, Mozilla project wrote a whole wrapper ABOVE cvs that does
There is a Makefile script called "client.mk" that is a wrapper ABOVE
checks out and updates subsets of Mozilla CVS trees.
The required behavior for this feature needs to be similar to CVSNT's
feature. The problem with CVSNT's implementation is that modules2 was
designed to enable subset checkouts and updates, but it also has a
capability of virtually changing names of sub-directories.
This is NOT a good thing. It is similar to the "smart" feature of many
systems that allow to rename and move files and subdirectories. All
of "smart" features have a bad impact on the logical correctness of an
The current implementation is not client-transparent. It requires
#define STICKY_MODULES 1
Older CVS clients will still be able to work with the CVS server, but
modules behavior will be as with vanilla CVS: non sticky.
The sticky-modules feature keeps the module name in a file named
the CVS directory together with Root, Repository, Entries, and
like "Tag", it is there only when sticky module is used.
We already published the code for this specific feature as a patch to
We have come up with a new design which we did not yet implement - of
allow the sticky modules feature to be client transparent (thus not
the Client/Server protocol, and not requiring patched CVS clients to
this feature). We believe the implementation of this would be no more
100-200 lines of patch to the CVS source code. If anyone would like
volunteer to implement this - please contact us and we will provide
TESTS: no tests were added.
src/cvs.h: #define STICKY_MODULES 1
CVSROOT/config: StatusShowModule=y --> makes 'cvs status' print
out the sticky module name.
COMPATIBILITY: patched CVS clients compiled with STICKY_MODULES
PROTOCOL: a Module command was added to the CVS Client/Server
Problem: 'cvs import' command does not work correctly:
Solution: alternative application...
We added (see later on in the email) an external application that
"cvs import" differently (we call it "jcvs import").
Since we wanted people NOT to use the built-in "cvs import" and only
import" (since the logic of the built in import is totally wrong), we
cvs an "import disable" feature, to prevent users from mistakenly
built-in "cvs import".
Beyond the improvements inside CVS source code, we also implemented
improvements as an external application above CVS (a wrapper). We
jcvs (Jungo CVS), and it calls CVS when needed (for low level CVS
The source code for this application is completely separate from CVS,
not part of this patch, and we did not "productise" it (sanity tests,
documentation etc...), since we use it only internally in Jungo in the
It is somewhat in the spirit of the "cvsu" application, which is a kind of
off-line client. "jcvs" complements CVS with commands that do not exist in CVS,
can be improved.
If anyone is interested in productizing this, please contact us, and
provide the source code under GPL.
Problem: 'cvs import' works incorrectly
Solution: alternative application: 'jcvs import'
'cvs import' implementation has many logical bugs. It should really
exactly like "cvs commit", where you can first review what is going to get
into the repository (what files will be added, removed, or changed, and
the diff of the changes), then you "approve" this by doing a
operation (i.e. "cvs commit"), and then the server must pass it through all the
regular validations that regular commits go through.
There is no reason that imported code will not pass the same
regular committed code. Then comes the question to what branch will
be imported: Why should imported code not have a symbolic name of a
which the user can select? Why does it have to be this cryptic
And what if the user wants to import directly into main branch (HEAD) or into
a different branch of his selection?
All RCS revision numbers follow a very straightforward, simple logic:
revisions are built like a tree, 1.x being the trunk, then 1.1.2.x
1.1.6.x etc for example being branched out of 1.1 (in this example).
means that if we want the newest revision on a branch, we need to find
branch's number (1.x for HEAD or 1.1.2.x for example for a certain
find the highest "x" value of an available revision. Simple? Yes. And
"cvs import" breaks all this: for the newest revision of HEAD you need
latest 1.x, but if 1.2 does not exist and 1.1.1.x does exist, then you
exception where you need to take the newest revision of 1.1.1.x!
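
The numbering rule and the import exception can be sketched as follows. This
is a simplified Python model, not CVS code: the helper names are hypothetical,
a file is represented only by its list of revision numbers, and the vendor
branch is assumed to be 1.1.1 as in the text above.

```python
def newest_on_branch(revisions, branch_prefix):
    """Return the revision with the highest last number on one branch.

    branch_prefix is "1" for the trunk or e.g. "1.1.2" for a branch.
    Numbers are compared as integers, so 1.10 is newer than 1.9.
    """
    prefix = tuple(int(part) for part in branch_prefix.split("."))
    best = None
    for rev in revisions:
        parts = tuple(int(part) for part in rev.split("."))
        on_branch = len(parts) == len(prefix) + 1 and parts[:-1] == prefix
        if on_branch and (best is None or parts[-1] > best[-1]):
            best = parts
    return ".".join(str(part) for part in best) if best else None

def head_revision(revisions):
    """Newest revision of HEAD, including the 'cvs import' special case."""
    trunk = newest_on_branch(revisions, "1")
    vendor = newest_on_branch(revisions, "1.1.1")
    # The exception: 1.2 does not exist but 1.1.1.x does, so the
    # newest 1.1.1.x revision has to be taken instead of 1.1.
    if trunk == "1.1" and vendor is not None:
        return vendor
    return trunk
```

The extra branch in head_revision is the whole point: without 'cvs import',
newest_on_branch alone would be enough for every branch, HEAD included.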
There are many bugs documented relating to "cvs import". Another
the official CVS manual:
"WARNING: If you use a release tag that already exists in one of the
archives, files removed by an import may not be detected. "
Why do all these problems exist? Because the "cvs import" concept is wrong.
What is needed is that the "cvs import" feature will behave as a tool to add
controlled files for a given subdirectory that is not version
The procedure for importing a package (let's assume linux-2.6.18.tgz)
something like this:
$ tar xvzf linux-2.6.18.tgz
$ cd linux
$ jcvs import -b linux-original project/os/linux
$ cvs commit -m "import Linux 2.6.18 from kernel.org"
$ cvs tag linux-2_6_18
and if we want this to be merged to HEAD, we will then also do:
$ cvs up -A
$ cvs up -j HEAD -j linux-2_6_18
(or may do "cvs up -j HEAD -j linux-original")
$ cvs commit -m "merge Linux 2.6.18"
This is simple logic, that fixes all the problems with the regular
import". So: what is the behavior required from this new "jcvs import"
make the above sequence work?
It is required to add CVS control files (CVS/Root, CVS/Entries,
CVS/Repository) in a given sub tree that DOES NOT have any CVS control
directory. It needs to mark all files as "A" (add) if they exist on
(the -b option supplied to the "cvs import" command), or "M" (modified) for all
files that already exist on the branch - but their contents do not
"R" (remove) for all files that do not exist in the sub-tree, but DO
the repository on that branch. In order to be able to prepare the
"A", "M" and "R" for "cvs commit", notice 'jcvs import' may also need
empty directories on the CVS server (by doing "cvs add" to the
This means that doing the combination of "jcvs import" and then "cvs
brings the CVS repository in-sync with a tarball!
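
The marking step can be sketched like this. It is a Python illustration of
the idea, not the jcvs source: the function name classify_import is
hypothetical, both trees are modelled as plain dictionaries mapping a path to
its contents, and "A" is assumed to mean a file present in the sub-tree but
not yet on the branch.

```python
def classify_import(local_files, branch_files):
    """Map each path to "A", "M" or "R" for the following 'cvs commit'.

    local_files:  path -> contents found in the unpacked sub-tree
    branch_files: path -> contents currently on the target branch
    Files that are identical on both sides need no action and are omitted.
    """
    status = {}
    for path, contents in local_files.items():
        if path not in branch_files:
            status[path] = "A"          # new file, must be added
        elif branch_files[path] != contents:
            status[path] = "M"          # exists on the branch, contents differ
    for path in branch_files:
        if path not in local_files:
            status[path] = "R"          # gone from the sub-tree, must be removed
    return status
```

Committing exactly this set of changes is what leaves the branch with the
same files and contents as the sub-tree.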
To sum up this feature: if you have a subtree of files and you want a
the repository on a certain branch to be 100% in sync with this
subtree, you do "jcvs import". We also believe that in the long term,
original "cvs import" code should be removed, and replaced with code
behaves exactly like Jungo's "jcvs import" - since this is how
"cvs import" should behave!
Problem: slow 'cvs update'
Solution: 'jcvs update': DB-based delta fetching using external
'cvs update' of Jungo main CVS tree involves going over ~250K files.
The average number of commit sessions (commitid) per branch per day is
than 10. The number of whole tree updates on this tree is in the
In Jungo, like in most CVS server setups, we have ViewCVS installed.
So each commit is recorded in the ViewCVS DB (lately renamed to ViewVC).
We have created a wrapper application that keeps a timestamp of last
tries to get the delta from the DB when feasible.
So when you do 'jcvs update' it sends an SQL query to ViewCVS to get
of files modified on the required branch since the last 'jcvs update' (or since
'cvs checkout' of the tree, if 'jcvs update' was not run yet on this
But what about locally modified files? Well, it also runs "cvsu"
get the list of local changes.
It merges the list of server changes (ViewCVS) and local changes
then it calls 'cvs update' with the SPECIFIC list of files.
The result: 'jcvs update' for 250K files takes under 1 minute, instead
40 minutes that regular 'cvs update' calls take. The 1 minute is mainly due to
the time it takes 'cvsu' to scan all the local files for detecting
modifications. This means the load on the CVS server and the network
are very low!
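
The final step, merging the two lists and calling 'cvs update' on specific
files only, can be sketched as below. This is an illustrative Python fragment,
not the jcvs source: the function name build_update_command is hypothetical,
and the two input lists stand in for the result of the ViewCVS DB query and
the output of 'cvsu', neither of which is reproduced here.

```python
def build_update_command(server_changed, locally_modified):
    """Build the 'cvs update' argument list for the changed files only.

    server_changed:   paths changed on the branch since the last update
    locally_modified: paths with local modifications in the working tree
    Returns None when nothing changed, so no CVS call is needed at all.
    """
    # A file changed on both sides must appear only once in the command.
    targets = sorted(set(server_changed) | set(locally_modified))
    if not targets:
        return None
    return ["cvs", "update"] + targets
```

The returned list could then be handed to something like subprocess.run from
the top of the working tree, so the server only looks at the named files
instead of walking the whole tree.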
Soft Tagging design details
The operation of tagging in CVS is using the file's RCS file to store
and branches defined for the file.
- Currently we have in our main CVS repository ~250,000 files.
operation needs to write to all these files. This takes hours to
complete. During this time CVS performance is seriously degraded.
- Having all the tags in the files bloats the files by orders of
magnitude. We found out that a typical file has 50KB worth of tags in it,
and on average 5KB of real content.
- We use 'soft tagging'
- We define soft tags and branches by writing in file CVSROOT/stags
name with branch & date.
- We avoid writing of tag in RCS files, so tag time is reduced to
- CVS is updated to look for symbolic tag names in soft tags list as well as in
the RCS list of symbols.
- Legacy hard tags can later be converted to soft tags at our leisure.
requires analyzing the tags on the files to make sure that they have
point in time on a specific branch (or, for that matter, a
of points, i.e. a specific period of the branch lifetime), that
revisions tagged with the tag share.
- Keep in file CVSROOT/stags list of soft tags:
- format of line:
T <tag name> <branch>:<date UTC in RCS format> <repository>
B <tag name> <branch>:<date UTC in RCS format> <repository>
- example: tag-4_0_5 branch-4_0 2007.05.28.11.04.19 # version 4.0.5
- This defines the tag in terms of date on the branch. This allows
replace the writing to all the files with a single update in this
Soft tagging takes seconds instead of hours for hard tagging.
- instead of branch+date the soft-tag can use the name of another tag - this
serves as an aliasing mechanism.
- repository field is used to make sure that the tag is used only
supposed to be used. It can contain either the name of a top-level
'seamonkey', or a path within them, like
- blocking users from doing 'cvs tag' outside the destined
be done by adding a filter in CVSROOT/taginfo.
- Users can define the tags as soft tags by adding a line for the tag
CVSROOT/stags. optionally, one can choose (as done in Jungo) to auto-
this file based on contents of an external system defining the tags
versions used (Jungo uses the same system to define names of tags
branches shown in our Bugzilla-based issue-tracking system).
- CVS update is modified to handle selection by soft tags - see
version by tag" below.
- CVS commit is changed to do hard-branch only where necessary.
See "Doing soft branching".
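
As an illustration of the file format described above, here is a small Python
sketch that parses one line of CVSROOT/stags. It is not the parser from the
patch: the names SoftTag and parse_stags_line are hypothetical, and it assumes
the four-field 'T'/'B' line layout shown above, a '#' starting a trailing
comment, and an alias entry being a target field without a ':<date>' part.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class SoftTag:
    kind: str                   # "T" for a tag, "B" for a branch
    name: str
    branch: Optional[str]       # None when the entry aliases another tag
    date: Optional[datetime]    # UTC; None for an alias
    alias_of: Optional[str]
    repository: str

def parse_stags_line(line: str) -> Optional[SoftTag]:
    """Parse one stags line; return None for blank or comment-only lines."""
    text = line.split("#", 1)[0].strip()    # assumed comment syntax
    if not text:
        return None
    fields = text.split()
    if len(fields) != 4 or fields[0] not in ("T", "B"):
        raise ValueError(f"unrecognized stags line: {line!r}")
    kind, name, target, repository = fields
    if ":" in target:
        branch, stamp = target.split(":", 1)
        # RCS date format: YYYY.MM.DD.hh.mm.ss, always in UTC.
        date = datetime.strptime(stamp, "%Y.%m.%d.%H.%M.%S")
        date = date.replace(tzinfo=timezone.utc)
        return SoftTag(kind, name, branch, date, None, repository)
    return SoftTag(kind, name, None, None, target, repository)
```

A line such as "T tag-4_0_5 branch-4_0:2007.05.28.11.04.19 project" (the
repository name "project" is made up here) parses into a tag on branch-4_0 at
that UTC timestamp. Defining a tag is then a one-line append to this file,
which is why it takes seconds rather than hours.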
Selecting file version by tag
For each file, we select the version during update using the logic
below. Since checkout is based on update, this covers it as well.
This is implemented in CVS code in file src/rcs.c, function
- if parameter to "-r" matches 'dots separated numbers' format:
- return the version it specifies
- if symbolic tag is in RCS - use the definition from RCS
- if the tag is in CVSROOT/stags, look up matching version on branch
- if found matching version, return that version
- return NULL
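
The lookup order above can be sketched in Python as follows. This is a
simplified model and not the code in src/rcs.c: the function name
select_revision and the dictionary-based inputs are hypothetical, and
"matching version on branch" is assumed to mean the latest revision on that
branch dated at or before the soft tag's date.

```python
import re
from datetime import datetime

NUMERIC_REVISION = re.compile(r"^\d+(\.\d+)*$")

def select_revision(tag, rcs_symbols, soft_tags, branch_revisions):
    """Resolve the argument of '-r' for one file, or return None.

    rcs_symbols:      hard tags of this file, name -> revision
    soft_tags:        CVSROOT/stags entries, name -> (branch, date)
    branch_revisions: this file's history, branch -> [(date, revision), ...]
    """
    # 1. A dot-separated number is already a revision.
    if NUMERIC_REVISION.match(tag):
        return tag
    # 2. A hard tag in the RCS file wins over a soft tag of the same name,
    #    which is what makes mixed-mode tags possible.
    if tag in rcs_symbols:
        return rcs_symbols[tag]
    # 3. A soft tag: newest revision on the branch not after the tag date.
    if tag in soft_tags:
        branch, cutoff = soft_tags[tag]
        candidates = [
            (date, rev)
            for date, rev in branch_revisions.get(branch, [])
            if date <= cutoff
        ]
        if candidates:
            return max(candidates, key=lambda item: item[0])[1]
    # 4. Nothing matched.
    return None
```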
Doing soft branching
- On commit:
- if the file has sticky tag set which is a soft tag of type branch,
is not found in the file's RCS data:
- do hard tagging of file, using the version specified in the
i.e. add branch for that soft-branch to RCS file.
- continue as usual
- This assures that branch tags are added to RCS files only when
This way not only soft tagging is immediate but also soft-branching.
The RCS file needs to be written to only when a commit to a file
real revision on the branch to be created.
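
The commit-time rule can be sketched as below. Again this is a Python model
rather than the patch itself: ensure_hard_branch and resolve_branch_point are
hypothetical names, the RCS symbols are a plain dictionary, and the branch is
recorded simply by its branch-point revision instead of a real RCS branch
number.

```python
def ensure_hard_branch(sticky_tag, rcs_symbols, soft_branches,
                       resolve_branch_point):
    """Make a soft branch 'hard' in one file just before committing to it.

    sticky_tag:           the working file's sticky tag, or None
    rcs_symbols:          this file's RCS symbols, name -> revision (mutated)
    soft_branches:        names of soft tags of type branch
    resolve_branch_point: callable giving the revision the soft branch
                          selects for this file (assumed to exist elsewhere)
    Returns True when a symbol was added, i.e. the RCS file must be written.
    """
    if sticky_tag is None or sticky_tag not in soft_branches:
        return False        # not on a soft branch: continue as usual
    if sticky_tag in rcs_symbols:
        return False        # branch already exists in this RCS file
    rcs_symbols[sticky_tag] = resolve_branch_point(sticky_tag)
    return True
```

Only files that are actually committed on the branch ever pass the second
check, so all other RCS files stay untouched.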
Any comments and feedback are welcome.
- I will post a patch with all our CVS changes (diff vs. GNU CVS
1.12.13) to the CVS project in
GNU Savannah site (https://savannah.nongnu.org/patch/?group=cvs).
- Work on CVS improvements in Jungo began in 2003.
People involved, apart from me, included Derry Shribman & Or Tal.
Yaron Yogev <firstname.lastname@example.org>