Re: blocked jobs
Thu, 24 Apr 2008 10:41:23 -0400
Please always reply to the list, unless asked to do otherwise.
Jeevesh Kaul wrote, On 04/23/2008 07:58 PM:
thanks Todd for your questions and I will try to answer them all. Hope it
On Wed, Apr 23, 2008 at 5:34 AM, Todd Denniston <
Jeevesh Kaul wrote, On 04/21/2008 03:29 PM:
we have a situation where if we run ps on the state we see blocked jobs
are high in number around 8 ( vmstat )
what options are you passing to ps?
ps -ef S | grep cvs
what options are you passing to vmstat?
what makes you think _cvs_, instead of something else, may be causing this
high blocked number?
the output from ps above
do you have any cvs jobs that are not being blocked?
when you run top, what are the 4 items at the top of the list and how
much cpu are they pulling?
top - 16:30:02 up 94 days, 22:06, 2 users, load average: 10.41, 10.81,
Tasks: 223 total, 4 running, 219 sleeping, 0 stopped, 0 zombie
Cpu(s): 18.9% us, 15.9% sy, 0.0% ni, 35.4% id, 29.8% wa, 0.0% hi, 0.0% si
Mem: 4149144k total, 4086096k used, 63048k free, 91724k buffers
Swap: 2040244k total, 202476k used, 1837768k free, 3308392k cached
OK, I was not clear here...
I meant when you run top, what are the 4 PROCESSES at the top of the list
and how much cpu are they pulling?
top -bS -n 1 |grep -A5 %MEM
The above info was still somewhat useful though.
what does the output of `vmstat 5 5` look like?
procs -----------memory---------- ---swap-- -----io---- --system--
r b swpd free buff cache si so bi bo in cs us sy id
10 5 202476 17188 107576 3378600 0 0 22 33 1 0 7 8 66
7 2 202476 16984 110992 3368684 0 0 29621 21129 4752 30331 25 27 30
6 7 202476 82968 100916 3297120 0 0 29859 15482 4845 27388 33 26 26
4 7 202476 67672 94880 3160416 0 0 58607 18979 4273 19166 36 24 21
13 3 202476 145096 98244 3236612 0 0 43240 16982 5020 27587 38 28 15
1) the machine is 200 Megs into swap space, which for a file server is NOT
2) 3.1 GB in the cache... so you have A LOT of ram. 32 or 64 bit or PAE kernel?
3) only 17 to 140 MB of ram for program operation.
4) waiting on the IO system ~20% of the time.
5) The disk subsystem is peaking out at ~46MB/s read and ~14MB/s write. (I
don't think NFS is included in that, and NFS peaks out at less than your
6) you are seeing a HUGE number of context switches
7) your CPU is not getting to idle much (20%idle).
Has someone messed around with the kernel config such that the cache/process
memory balance has been changed?
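
For anyone who wants to re-derive the numbers in the list above, here is a minimal sketch that parses the sample rows from the `vmstat 5 5` output and reports the peaks. The function names are mine, the "procs ..." banner line is left out of the sample, and the 1 KiB block size is an assumption, so the MB/s values it prints may not match my rough estimates above.

```python
# Sample rows copied from the `vmstat 5 5` output quoted above
# (the "procs ---memory--- ..." banner line is omitted).
VMSTAT_TEXT = """\
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
10  5 202476  17188 107576 3378600    0    0    22    33    1     0  7  8 66
 7  2 202476  16984 110992 3368684    0    0 29621 21129 4752 30331 25 27 30
 6  7 202476  82968 100916 3297120    0    0 29859 15482 4845 27388 33 26 26
 4  7 202476  67672  94880 3160416    0    0 58607 18979 4273 19166 36 24 21
13  3 202476 145096  98244 3236612    0    0 43240 16982 5020 27587 38 28 15
"""


def parse_vmstat(text):
    """Return one dict per sample row, keyed by the header column names."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(int, row))) for row in rows]


def summarize(samples, block_kib=1):
    """Report peaks over the interval samples.

    The first vmstat row is an average since boot, so it is skipped.
    block_kib=1 is an ASSUMPTION (1 KiB blocks); adjust for your system.
    """
    live = samples[1:]
    return {
        "peak_read_mib_s": max(s["bi"] for s in live) * block_kib / 1024,
        "peak_write_mib_s": max(s["bo"] for s in live) * block_kib / 1024,
        "peak_blocked": max(s["b"] for s in live),
        "peak_ctx_switches": max(s["cs"] for s in live),
        "min_idle_pct": min(s["id"] for s in live),
    }


if __name__ == "__main__":
    for key, value in summarize(parse_vmstat(VMSTAT_TEXT)).items():
        print(f"{key}: {value:.1f}")
```

Skipping the first row matters: it is the since-boot average and would hide how busy the box is right now.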
how many cvs processes?
depends.. 15 - 84 sometimes
for 84 cvs processes accessing at the same time... load level 10 does not seem
all that bad.
how many different users own those processes?
should be on the order of 15 - 84 :)
and what I was wanting to make sure of, is that no one user had more than 1
cvs process running.
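
One way to check that, as a rough sketch: count cvs processes per user from the output of `ps -eo user,comm`. The function names and the sample user names below are made up for illustration; feed it the real ps output from your server.

```python
from collections import Counter


def cvs_processes_per_user(ps_output, command="cvs"):
    """Count processes named `command` per user.

    Expects the text produced by `ps -eo user,comm`, header line included.
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the "USER COMMAND" header
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == command:
            counts[fields[0]] += 1
    return counts


def users_with_multiple(counts):
    """Return the users that own more than one matching process."""
    return sorted(user for user, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical sample; replace with the captured output of `ps -eo user,comm`.
    sample = (
        "USER     COMMAND\n"
        "alice    cvs\n"
        "bob      cvs\n"
        "alice    cvs\n"
        "root     sshd\n"
    )
    counts = cvs_processes_per_user(sample)
    print(dict(counts))
    print("more than one cvs process:", users_with_multiple(counts))
```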
Are _all_ of those users physically at their terminals right now? (i.e.,
has someone started a commit or other operation that locked the repo, but
either left it hanging or somehow killed the controlling process?
not always. Folks have scripts as cron to update their code.
`cvs update` should be OK, as it is read only and should not create long lived
locks, but you might want to make sure folks stagger their cron starts so
their read locks don't get in each other's way.
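
A minimal sketch of the staggering idea, assuming you have a list of the users who run cron updates: spread their start minutes evenly over an hour and print a crontab line for each. The user names, the 02:00 hour, and the `$HOME/sandbox` path are placeholders, not anything from your setup.

```python
def stagger(users, window_minutes=60):
    """Assign each user a start minute, spread evenly across the window.

    With more users than minutes, some users share a minute.
    """
    ordered = sorted(set(users))
    if not ordered:
        return {}
    step = window_minutes / len(ordered)
    return {user: int(i * step) for i, user in enumerate(ordered)}


def crontab_line(minute, hour, command):
    """Format a crontab entry that runs `command` daily at hour:minute."""
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # Hypothetical users and sandbox path, for illustration only.
    command = "cd $HOME/sandbox && cvs -q update -d"
    for user, minute in stagger(["carol", "alice", "bob"]).items():
        print(f"# {user}")
        print(crontab_line(minute, 2, command))
```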
The cvs server is run on a linux box RH AS release 4.
using Nagios to monitor server load we don't find any underlying problems
with NFS or memory or disk, yet the cvs app is slow in response.
What do you mean slow in response?
appreciable delay in response. Haven't timed it.
Does the same operation take nearly the same time to do if only ONE user
is accessing the server machine?
no it varies.
you need to characterize it with actual timings.
and with how many MBytes were transferred in the operations.
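
To get those timings, something like the following sketch would do: run the same command a few times and record the wall-clock spread. The helper names are mine, and the command shown is only a stand-in; substitute the cvs operation that feels slow. It does not measure bytes transferred, so note the sandbox size separately.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


def report(durations):
    """Summarize a list of durations in seconds."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in command; replace with e.g. ["cvs", "-q", "update"] run in a sandbox.
    argv = [sys.executable, "-c", "pass"]
    for name, seconds in report(time_command(argv)).items():
        print(f"{name}: {seconds:.3f} s")
```

Run it once when the server is quiet and once at a busy time, and compare the medians.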
how should we go about debugging what puts the cvs app into the sleep
There are probably high cvs reads happening and there is nothing obvious
that leaps up.
cvs server version used 1.11.17-9.
What is the cvs connection method? (:ext:, :ext: with ssh, pserver, NFS)?
are developers running cvs at their local workstations or on the CVS server?
i.e., are they double loading the server with cvs processes, and are they
writing their sandboxes to a disk on the cvs server?
Are all of your clients (developers) using the same connection method?
This worries me! (because you are not sure)
you are OK if only pserver and ext are being used....but
If ANYONE is accessing the repository over NFS or SMB (Samba/CIFS) have your
boss inform them that they are endangering the company's data, as it is known
that accessing a repository over those methods has caused much corruption over
search for: "DON'T use CVS in :local: mode with a server on a network drive!!!"
Are you sure that something else on the server is not slowing things down?
i.e., did the admin make the mistake of 1) leaving the RH install booting in
runlevel 5 instead of 3 and 2) logging in and then locking the screen with
the 3D Gears screen saver running, or even just staying logged in and letting
one of the gnome applets go crazy? (I have seen both on THE SAME machine; it is
a real drag even with quad processors)
no not at all.
Is the repository on a local disk or NFS mounted?
PLEASE tell me that NO OTHER MACHINE mounts that share!
Do you have a dedicated Ethernet card & line to the nfs server?
what is the speed of the Ethernet to the nfs server?
What file system is the repository on? any non-default options used in the
creation or mounting of that file system?
How much memory?
How much VM?
vmstat and top indicate 2GB, with 200MB in use.
How much disk space in /tmp?
Any IO errors showing up in /var/log/messages or dmesg
any in the logs of the NFS machine?
How large are the four largest files in the repository and are any of them
in the same directory?
250M and not in the same directory
OK, IIRC that means any diffs, or commits are going to require ~500M ram and
considerable space in /tmp/
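
If you want to answer the largest-files question mechanically, here is a small sketch that walks a directory tree and reports the biggest files and whether any of them share a directory. The function names are mine, and the 2x RAM figure it prints is only the rule of thumb I recalled above, not a measured value.

```python
import heapq
import os


def largest_files(root, count=4):
    """Return the `count` largest files under root as (size_bytes, path) pairs."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return heapq.nlargest(count, sizes)


def share_directory(entries):
    """True if at least two of the listed files live in the same directory."""
    dirs = [os.path.dirname(path) for _size, path in entries]
    return len(set(dirs)) < len(dirs)


if __name__ == "__main__":
    root = "."  # replace with the path to the repository
    top = largest_files(root)
    for size, path in top:
        # 2x is the recalled rule of thumb from the message, not a measurement.
        print(f"{size / 2**20:8.1f} MiB  (~{2 * size / 2**20:.0f} MiB RAM?)  {path}")
    print("some share a directory:", share_directory(top))
```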
Are many of your developers working on branches instead of the trunk?
Long lived branches tend to slow cvs down, because any work at the tip of a
branch requires building the file from deltas. Trunk access is MUCH faster.
search for: "What is the best branching practice to use with CVS?"
does /usr/share/cvs*/contrib/ check_cvs  or validate_repo  indicate
i.e., not nearly enough info to make an educated guess.
thanks for asking the right questions; we have been making educated guesses to
fix it and nothing seems to work.
I would suspect you are being slowed down by:
1) accessing the repository over NFS,
a) uses a lot of CPU to do the transfers
b) tops out at less than network speed
10Mb/s = ~1.2 MByte/second
100Mb/s = ~12 MByte/second
1000Mb/s= ~120 MByte/second
and those are only if the NFS share and the cvs server are the ONLY computers
on the network.
c) has the risk of developers attempting to use the NFS share directly with
CVS (very bad consequences).
2) your CPU is overloaded (though this may be due to NFS use).
a dual/quad processor would probably handle the load better, if it had fast
access to the disk.
3) If someone has changed the cache/process ram balance, such that more cache
is in use, they may be causing the machine to take longer to process cvs
a) it pushed the processes into swap.
b) it causes more context switches due to being in swap.
Changing the balance to favor cache would be an OK thing if the machine was
JUST acting as a file server, but for cvs the normal balance is better.
I suspect you could speed the whole system up by an order of magnitude by
putting the repository on a large, fast local disk. Even a USB 2.0 connected
disk (assuming the machine supports USB 2.0) at ~30MBytes/second could be
faster than a network connection.
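
The arithmetic behind those link speeds, as a short sketch: dividing the line rate in megabits by 8 gives the theoretical ceiling in MBytes/second. My ~1.2/12/120 figures above sit a little below that ceiling, presumably because of protocol overhead. The 250 MB file size is the one reported earlier in this thread; the function names are mine.

```python
USB2_DISK_MBYTES_PER_S = 30  # rough figure quoted in the message


def line_rate_mbytes_per_s(megabits_per_s):
    """Theoretical ceiling: 8 bits per byte, no protocol overhead."""
    return megabits_per_s / 8


def transfer_seconds(size_mbytes, megabits_per_s):
    """Best-case time to move size_mbytes over an otherwise idle link."""
    return size_mbytes / line_rate_mbytes_per_s(megabits_per_s)


if __name__ == "__main__":
    for link in (10, 100, 1000):
        rate = line_rate_mbytes_per_s(link)
        verdict = "slower" if rate < USB2_DISK_MBYTES_PER_S else "faster"
        print(
            f"{link:>4} Mb/s: at most {rate:6.2f} MByte/s, "
            f"250 MB file in >= {transfer_seconds(250, link):6.1f} s, "
            f"{verdict} than a ~{USB2_DISK_MBYTES_PER_S} MByte/s USB 2.0 disk"
        )
```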
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter