cons-discuss

pcons on Mosix - Lessons Learned


From: Nolish, Kevin
Subject: pcons on Mosix - Lessons Learned
Date: Thu, 13 Sep 2001 11:41:29 -0400

This is really a note for the list archives in case somebody else wants to
try this.

If all you want to do is run pcons on a single SMP machine, then there
isn't much to do.  Find and fix the multiple-line command problem (see the
earlier e-mail on the list) and things will run properly.

I also modified pcons to use a custom Perl extension that calls the mexec
system call instead of exec.  mexec is Mosix-specific and can be thought of
as "migrate and exec".

If you want to run pcons on a clustering system, like Mosix or Beowulf, then
the biggest obstacle is finding a suitable networked file system.  What you
are looking for is cache coherency.  NFS doesn't have it.

What goes wrong is that your build will fail because a target generated on
one node doesn't make its way out of that node's local cache and back to the
server before that target is needed by a subsequent build step running on a
different node in the cluster.  For SMP, it doesn't matter.  All of the
processors are executing on top of the same local cache, so coherency is
guaranteed.

We found that by turning off NFS caching, our builds ran to completion.
Unfortunately, the performance when you do this is absolutely lousy.  A
forking build without NFS caching was 2-3 times faster than a non-forking
build with NFS caching, but that speedup really doesn't justify the expense
of setting up the cluster.  The experiment also verified our suspicions
about file system cache coherency as a troublesome aspect of clustering.

We added sleep statements to what appear to be the proper places in pcons
and the build jobs got more reliable, but they aren't perfect.  There's
still the occasional build failure caused by a missing, or otherwise
out-of-sync, dependency file.  If you start pcons over, it picks up where it
left off and continues on to the next failure.  The '-k' option helps, as
pcons then tries to do the maximum work that it can in each job.
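
For the archives, here is a rough sketch of the two workarounds described
above: waiting for a dependency file to show up before using it, and
restarting the build until it runs clean.  This is not the code we put into
pcons (that is Perl); it is a Python illustration, and the function names,
timeouts and the exact pcons command line in the usage note are assumptions
made for the example.  Note also that on NFS the stat() result can itself
come from the client's attribute cache, so polling like this only narrows
the window, it does not give you coherency.

```python
import os
import subprocess
import time


def wait_for_file(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists and is non-empty, or `timeout` seconds pass.

    Returns True if the file showed up, False on timeout.
    (Hypothetical helper; stands in for the sleep statements.)
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if os.stat(path).st_size > 0:
                return True
        except FileNotFoundError:
            pass  # not visible on this node yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def run_until_success(cmd, max_attempts=5):
    """Rerun `cmd` until it exits with status 0 or attempts run out.

    Returns the number of attempts used, or None if every attempt failed.
    (Hypothetical helper; stands in for restarting the build by hand.)
    """
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt
    return None
```

Something like run_until_success(["pcons", "-k"]) would then keep
restarting the build, with each pass picking up where the last one failed.
The argument list is only an example; substitute however you normally
invoke pcons.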

What we DID get with the delay hacks was enough of a speedup to justify the
time and expense of looking at alternative networked file systems that
support coherency.  We are looking at AFS and GFS.  There's some GFS support
being worked on by some of the Mosix developers.  We have some local
expertise with AFS in-house.  Coda is probably another feasible alternative,
although for a build cluster the disconnected operation support of Coda
really isn't a necessary feature.  All three of these are open-source to an
extent.  AFS and GFS both use custom licenses, and I haven't studied them
well enough to figure out if there are any gotchas buried in them.  I don't
know how Coda is licensed.

So, if you are thinking of some sort of clustered build engine, your
biggest problem is going to be the networked file system.

Kevin Nolish
724-742-6989
<mailto:address@hidden> 


