Re: blocked jobs

From: Todd Denniston
Subject: Re: blocked jobs
Date: Thu, 24 Apr 2008 10:41:23 -0400
User-agent: Thunderbird (X11/20080213)

Please always reply to the list, unless asked to do otherwise.

Jeevesh Kaul wrote, On 04/23/2008 07:58 PM:
thanks Todd for your questions and I will try to answer them all. Hope it

On Wed, Apr 23, 2008 at 5:34 AM, Todd Denniston <
address@hidden> wrote:

Jeevesh Kaul wrote, On 04/21/2008 03:29 PM:

we have a situation where if we run ps on the state we see blocked jobs
are high in number around 8 ( vmstat )

what options are you passing to ps?

 ps -ef S | grep cvs

what options are you passing to vmstat?

vmstat 2

what makes you think _cvs_, instead of something else, may be causing this
high blocked number?

 the  output from ps above

do you have any cvs jobs that are not being blocked?


when you run top, what are the 4 items at the top of the list and how
much cpu are they pulling?

top - 16:30:02 up 94 days, 22:06,  2 users,  load average: 10.41, 10.81,
Tasks: 223 total,   4 running, 219 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.9% us, 15.9% sy,  0.0% ni, 35.4% id, 29.8% wa,  0.0% hi,  0.0% si
Mem:   4149144k total,  4086096k used,    63048k free,    91724k buffers
Swap:  2040244k total,   202476k used,  1837768k free,  3308392k cached

OK, I was not clear here...
I meant when you run top, what are the 4 PROCESSES at the top of the list and how much cpu are they pulling?
top -bS -n 1 |grep -A5 %MEM

The above info was still somewhat useful though.

what does the output of `vmstat 5 5` look like?

procs -----------memory---------- ---swap-- -----io---- --system--
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
10  5 202476  17188 107576 3378600    0    0    22    33    1     0  7  8 66
 7  2 202476  16984 110992 3368684    0    0 29621 21129 4752 30331 25 27 30
 6  7 202476  82968 100916 3297120    0    0 29859 15482 4845 27388 33 26 26
 4  7 202476  67672  94880 3160416    0    0 58607 18979 4273 19166 36 24 21
13  3 202476 145096  98244 3236612    0    0 43240 16982 5020 27587 38 28 15

1) the machine is 200 Megs into swap space, which is usually NOT good for a file server.
2) 3.1 GB in the cache... so you have A LOT of ram. 32 or 64 bit or PAE kernel?
3) only 17 to 140 MB of ram free for program operation.
4) waiting on the IO system ~20% of the time.
5) The disk subsystem is peaking out at ~46MB/s read and ~14MB/s write. (I don't think NFS is included in that, and NFS peaks out at less than your network bandwidth.)
6) you are seeing a HUGE number of context switches.
7) your CPU is not getting to idle much (~20% idle).
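
If you want to redo the arithmetic behind the list above, here is a minimal Python sketch that averages a few columns of the `vmstat 5 5` sample. It skips the first data row, which vmstat reports as an average since boot rather than for the sampling interval. The name `summarize` is made up for illustration, and the sample is truncated the same way the pasted output was.

```python
# Sketch only: average selected columns of the vmstat sample quoted above.
VMSTAT_SAMPLE = """\
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
10  5 202476  17188 107576 3378600    0    0    22    33    1     0  7  8 66
 7  2 202476  16984 110992 3368684    0    0 29621 21129 4752 30331 25 27 30
 6  7 202476  82968 100916 3297120    0    0 29859 15482 4845 27388 33 26 26
 4  7 202476  67672  94880 3160416    0    0 58607 18979 4273 19166 36 24 21
13  3 202476 145096  98244 3236612    0    0 43240 16982 5020 27587 38 28 15
"""


def summarize(text, columns=("b", "bi", "bo", "cs", "id")):
    """Return the mean of each named column over the interval rows."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    interval_rows = rows[1:]  # drop the since-boot average in the first row
    result = {}
    for name in columns:
        idx = header.index(name)
        values = [int(row[idx]) for row in interval_rows]
        result[name] = sum(values) / len(values)
    return result


if __name__ == "__main__":
    for name, mean in summarize(VMSTAT_SAMPLE).items():
        print(f"{name:>3}: {mean:.2f}")
```

The `bi` and `bo` means come out in whatever block unit vmstat uses on that kernel, so check the vmstat man page before turning them into MB/s.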

Has someone messed around with the kernel config such that the cache/process memory balance has been changed?

how many cvs processes?

depends..  15 - 84 sometimes

for 84 cvs processes accessing at the same time... load level 10 does not seem all that bad.

how many different users own those processes?

probably hundreds.

should be on the order of 15 - 84 :)
and what I was wanting to make sure of, is that no one user had more than 1 cvs process running.
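
One possible way to check that, sketched in Python: count cvs processes per owner from the output of `ps -eo user,comm`. The sample text and the user names in it are invented, and the function names are hypothetical.

```python
from collections import Counter

# Invented sample of `ps -eo user,comm` output; user names are placeholders.
PS_SAMPLE = """\
USER     COMMAND
alice    cvs
bob      cvs
alice    cvs
root     sshd
carol    bash
"""


def cvs_counts(ps_text):
    """Count processes whose command name is exactly 'cvs', per user."""
    counts = Counter()
    for line in ps_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == "cvs":
            counts[fields[0]] += 1
    return counts


def users_with_multiple(ps_text):
    """Return the sorted list of users owning more than one cvs process."""
    return sorted(user for user, n in cvs_counts(ps_text).items() if n > 1)


if __name__ == "__main__":
    print(users_with_multiple(PS_SAMPLE))
```

On the real server you would feed it the live `ps -eo user,comm` output instead of the sample.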

Are _all_ of those users physically at their terminals right now? (i.e.,
has someone started a commit or other operation that locked the repo, but
either left it hanging or somehow killed the controlling process?)

not always. Folks have scripts as cron to update their code.

`cvs update` should be OK, as it is read only and should not create long lived locks, but you might want to make sure folks stagger their cron starts so their read locks don't get in each others way.
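
As a rough illustration of staggering, the Python sketch below gives each user a stable minute offset derived from a hash of the user name, so the cron jobs do not all fire at the top of the hour. The hourly schedule, the user names, and the function name `cron_minute` are assumptions for the example, not anything from your setup.

```python
import hashlib


def cron_minute(username, window=60):
    """Map a user name to a stable minute offset in [0, window)."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window


if __name__ == "__main__":
    # Placeholder user names; prints one hourly crontab line per user.
    for user in ("alice", "bob", "carol"):
        print(f"{cron_minute(user)} * * * * cvs -q update   # {user}")
```

Two users can still land on the same minute, so this only spreads the load; it does not guarantee that read locks never overlap.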

 The cvs server is run on a linux box RH AS release 4.
using Nagios to monitor server load we don't find any underlying problems
with NFS or memory or disk, yet the cvs app is slow in response.

What do you mean slow in response?

appreciable delay in response. Haven't timed it.

Does the same operation take nearly the same time to do if only ONE user
is accessing the server machine?

no it varies.

You need to characterize it with actual timings,
and with how many MBytes were transferred in the operations.
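
A minimal Python sketch of collecting such timings, assuming you run it from inside a sandbox. The command in the example is only a placeholder that runs anywhere; you would substitute the cvs operation you want to measure. The function name `time_command` is made up.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Placeholder command; replace with e.g. ["cvs", "-q", "update"].
    argv = [sys.executable, "-c", "pass"]
    for i, secs in enumerate(time_command(argv), 1):
        print(f"run {i}: {secs:.3f} s")
```

Run it once when the server is quiet and once under normal load, and note the size of the tree being updated, so the numbers can be compared.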

 how should we go about debugging what puts the cvs app into the sleep
There are probably high cvs reads happening and there is nothing obvious
that leaps up.
cvs server version used 1.11.17-9.

What is the cvs connection method? (:ext:, :ext: with ssh, pserver, NFS)?


are developers running cvs at their local workstations or on the CVS server?
i.e., are they double loading the server with cvs processes, and are they writing their sandboxes to a disk on the cvs server?

Are all of your clients (developers) using the same connection method?

typically yes

This worries me! (because you are not sure)
you are OK if only pserver and ext are being used....but
If ANYONE is accessing the repository over NFS or SMB (Samba/CIFS), have your boss inform them that they are endangering the company's data, as it is known that accessing a repository over those methods has caused much corruption over the years.
search for: "DON'T use CVS in :local: mode with a server on a network drive!!!"

Are you sure that something else on the server is not slowing things down?
i.e., did the admin make the mistake of 1) leaving the RH install booting in
runlevel 5 instead of 3 and 2) logging in and then locking the screen while
running the 3D Gears screen saver, or even just staying logged in and letting
one of the gnome applets go crazy? (I have seen both on THE SAME machine, it
is a real drag even with quad processors)

no not at all.

Is the repository on a local disk or NFS mounted?


PLEASE tell me that NO OTHER MACHINE mounts that share!

Do you have a dedicated Ethernet card & line to the nfs server?
what is the speed of the Ethernet to the nfs server?

What file system is the repository on? any non-default options used in the
creation or mounting of that file system?


How much memory?

 4 G

How much VM?


vmstat and top indicate 2GB, with 200MB in use.

How much disk space in /tmp?

132 gigs


Any IO errors showing up in /var/log/messages or dmesg?


any in the logs of the NFS machine?

How large are the four largest files in the repository and are any of them
in the same directory?

 250M and not in the same directory

OK, IIRC that means any diffs or commits are going to require ~500M of ram and considerable space in /tmp/

Are many of your developers working on branches instead of the trunk?

about 50%

Long lived branches tend to slow cvs down, because any work at the tip of a branch requires building the file from deltas. Trunk access is MUCH faster.
search for: "What is the best branching practice to use with CVS?"

does /usr/share/cvs*/contrib/ check_cvs [2] or validate_repo [1] indicate
any problems?


i.e., not nearly enough info to make an educated guess.

thanks for asking the right questions, we have been making educated guesses to
fix it and nothing seems to work.

I would suspect you are being slowed down by:
1) accessing the repository over NFS,
        a) uses a lot of CPU to do the transfers
        b) tops out at less than network speed
                10Mb/s  = ~1.2 MByte/second
                100Mb/s = ~12  MByte/second
                1000Mb/s= ~120 MByte/second
and those are only if the NFS share and the cvs server are the ONLY computers on the network.
        c) has the risk of developers attempting to use the NFS share directly with CVS (very bad consequences).
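
To put the line-rate figures in 1b into perspective, here is a small Python sketch that converts link speed to MByte/second and estimates how long one of the 250M files mentioned earlier would take to cross the wire. It assumes the raw line rate with no protocol overhead and no other traffic, so real NFS numbers will be worse. The function names are made up.

```python
def mbytes_per_second(megabits_per_second):
    """Raw line rate in MByte/s: 8 bits per byte, no protocol overhead."""
    return megabits_per_second / 8.0


def transfer_seconds(size_mbytes, megabits_per_second):
    """Best-case seconds to move size_mbytes over the given link."""
    return size_mbytes / mbytes_per_second(megabits_per_second)


if __name__ == "__main__":
    for link in (10, 100, 1000):
        rate = mbytes_per_second(link)
        secs = transfer_seconds(250, link)
        print(f"{link:>4} Mb/s = {rate:6.2f} MByte/s, 250 MByte file: {secs:6.1f} s")
```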

2) your CPU is overloaded (though this may be due to NFS use).
a dual/quad processor would probably handle the load better, if it had fast access to the disk.

3) If someone has changed the cache/process ram balance, such that more cache is in use, they may be causing the machine to take longer to process cvs actions because
        a) it pushed the processes into swap.
        b) it causes more context switches due to being in swap.
Changing the balance to favor cache would be an OK thing if the machine was JUST acting as a file server, but for cvs the normal balance is better.

I suspect you could speed the whole system up by an order of magnitude by putting the repository on a large, fast local disk. Even a USB 2.0 connected disk (assuming the machine supports USB 2.0) at ~30MBytes/second could be faster than a network connection.


Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter
