Re: blocked jobs

info-cvs

From:	Todd Denniston
Subject:	Re: blocked jobs
Date:	Thu, 24 Apr 2008 10:41:23 -0400
User-agent:	Thunderbird 2.0.0.12 (X11/20080213)


Please always reply to the list, unless asked to do otherwise.


Jeevesh Kaul wrote, On 04/23/2008 07:58 PM:

thanks Todd for your questions and I will try to answer them all. Hope it
helps

On Wed, Apr 23, 2008 at 5:34 AM, Todd Denniston <
address@hidden> wrote:

Jeevesh Kaul wrote, On 04/21/2008 03:29 PM:

we have a situation where if we run ps on the state we see blocked jobs
that
are high in number around 8 ( vmstat )

what options are you passing to ps?


 ps -ef S | grep cvs

what options are you passing to vmstat?


vmstat 2

what makes you think _cvs_, instead of something else, may be causing this
high blocked number?


 the  output from ps above

do you have any cvs jobs that are not being blocked?

yes

when you run top, what are the 4 items at the top of the list and how
much cpu are they pulling?


top - 16:30:02 up 94 days, 22:06,  2 users,  load average: 10.41, 10.81, 12.49
Tasks: 223 total,   4 running, 219 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.9% us, 15.9% sy,  0.0% ni, 35.4% id, 29.8% wa,  0.0% hi,  0.0% si
Mem:   4149144k total,  4086096k used,    63048k free,    91724k buffers
Swap:  2040244k total,   202476k used,  1837768k free,  3308392k cached


OK, I was not clear here...

I meant when you run top, what are the 4 PROCESSES at the top of the list and how much cpu are they pulling?

top -bS -n 1 |grep -A5 %MEM

The above info was still somewhat useful though.

what does the output of `vmstat 5 5` look like?

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
10  5 202476  17188 107576 3378600    0    0    22    33    1     0  7  8 66 19
 7  2 202476  16984 110992 3368684    0    0 29621 21129 4752 30331 25 27 30 18
 6  7 202476  82968 100916 3297120    0    0 29859 15482 4845 27388 33 26 26 15
 4  7 202476  67672  94880 3160416    0    0 58607 18979 4273 19166 36 24 21 19
13  3 202476 145096  98244 3236612    0    0 43240 16982 5020 27587 38 28 15 20

OK

1) the machine is 200 Megs into swap space, which for a file server is usually NOT good.

2) 3.1 GB in the cache... so you have A LOT of ram. 32 or 64 bit or PAE kernel?
3) only 17 to 140 MB of ram for program operation.
4) waiting on the IO system ~20% of the time.

5) The disk subsystem is peaking out at ~46MB/s read and ~14MB/s write. (I don't think NFS is included in that, and NFS peaks out at less than your network bandwidth.)

6) you are seeing a HUGE number of context switches
7) your CPU is not getting to idle much (20% idle).
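
When summarizing vmstat output, keep in mind that the first sample line is an average since boot, so it is usually left out. A minimal sketch of averaging the remaining samples, assuming the 16-column layout shown above (b in column 2, cs in column 12, wa in column 16) and one sample per line; vmstat_avg is a hypothetical helper name, not part of any package:

```shell
# Hypothetical helper: average the b, cs and wa columns of vmstat output
# read from stdin, skipping the first sample (averages since boot).
vmstat_avg() {
    awk '$1 ~ /^[0-9]+$/ {
             if (++seen == 1) next          # first numeric line: since-boot averages
             b += $2; cs += $12; wa += $16; n++
         }
         END { if (n) printf "b=%.2f cs=%.0f wa=%.2f\n", b / n, cs / n, wa / n }'
}

# Usage: vmstat 5 5 | vmstat_avg
```

Header lines are skipped because their first field is not a number, so the output of `vmstat 5 5` can be piped in unchanged.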

Has someone messed around with the kernel config such that the cache/process memory balance has been changed?

how many cvs processes?


depends..  15 - 84 sometimes

for 84 cvs processes accessing at the same time... load level 10 does not seem all that bad.

how many different users own those processes?



probably hundreds.


should be on the order of 15 - 84 :)

and what I was wanting to make sure of, is that no one user had more than 1 cvs process running.
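
A quick way to check this is to count cvs processes per owner. A rough sketch, assuming each developer maps to their own system account (with pserver, several CVS users can share one system account, in which case the counts say less); cvs_per_user is a hypothetical helper name:

```shell
# Hypothetical helper: reads "user command" pairs on stdin and prints
# "count user" for every user that owns more than one cvs process.
cvs_per_user() {
    awk '$2 == "cvs" { n[$1]++ }
         END { for (u in n) if (n[u] > 1) print n[u], u }' | sort -rn
}

# Usage: ps -eo user,comm | cvs_per_user
```

No output means nobody owns more than one cvs process at that moment.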

Are _all_ of those users physically at their terminals right now? (i.e.,
has someone started a commit or other operation that locked the repo, but
either left it hanging or somehow killed the controlling process?)

not always. Folks have scripts as cron to update their code.

`cvs update` should be OK, as it is read only and should not create long lived locks, but you might want to make sure folks stagger their cron starts so their read locks don't get in each other's way.
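
One way to stagger the starts without coordinating every crontab by hand is to have each cron script sleep for a delay derived from the user and host name. This is only a sketch; stagger_delay is a hypothetical helper, and the sandbox path and the 600-second window in the usage comment are placeholders:

```shell
# Hypothetical cron helper: derive a stable per-user, per-host delay so that
# jobs scheduled for the same minute do not all hit the server at once.
stagger_delay() {
    # $1 = key (e.g. user@host), $2 = maximum delay in seconds
    sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( sum % $2 ))
}

# Usage inside the cron script:
#   sleep "$(stagger_delay "$(id -un)@$(hostname)" 600)"
#   cd "$HOME/sandbox" && cvs -q update -d
```

The delay is computed from a checksum and not from a random number, so a given user on a given host always starts at the same offset, while different users land at different offsets.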

 The cvs server is run on a linux box RH AS release 4.

using Nagios to monitor server load we don't find any underlying problems
with NFS or memory or disk, yet the cvs app is slow in response.

What do you mean slow in response?



appreciable delay in response.  Haven't timed it.


Does the same operation take nearly the same time to do if only ONE user
is accessing the server machine?



no it varies.


you need to characterize it with actual timings,
and with how many MBytes were transferred in the operations.
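
A minimal sketch of collecting such timings, assuming a `date` that supports `+%s` (GNU date does); timed is a hypothetical wrapper and the log file name in the usage comment is a placeholder. It records wall-clock time and exit status only; the number of bytes transferred would have to be measured separately.

```shell
# Hypothetical wrapper: run a command, discard its output, and print one
# log line with a timestamp, elapsed wall-clock seconds and exit status.
timed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') elapsed=$(( end - start ))s rc=$rc cmd=$*"
}

# Usage: timed cvs -q update -d >> "$HOME/cvs-timings.log"
```

Running the same operation at a quiet time and at a busy time gives two numbers that can actually be compared.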


 how should we go about debugging what puts the cvs app into the sleep
state.
There are probably high cvs reads happening and there is nothing obvious
that leaps up.
cvs server version used 1.11.17-9.
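
The "b" column in vmstat counts processes in uninterruptible sleep, which ps shows as state D. A small sketch for listing the cvs processes in that state together with the kernel wait channel (WCHAN) they are sleeping in; blocked_cvs is a hypothetical helper name, and the column selection in the usage comment is just one reasonable choice:

```shell
# Hypothetical filter: keep only lines whose state field starts with D
# (uninterruptible sleep) and that mention cvs.
blocked_cvs() {
    awk '$1 ~ /^D/ && /cvs/'
}

# Usage: ps -eo stat,pid,user,wchan:32,args | blocked_cvs
```

If most of the blocked processes show an NFS or disk related wait channel, that points at the storage path instead of cvs itself.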

What is the cvs connection method? (:ext:, :ext: with ssh, pserver, NFS)?


pserver


are developers running cvs at their local workstations or on the CVS server?

i.e., are they double loading the server with cvs processes, and are they writing their sandboxes to a disk on the cvs server?


Are all of your clients (developers) using the same connection method?



typically yes


This worries me! (because you are not sure)
you are OK if only pserver and ext are being used....but

If ANYONE is accessing the repository over NFS or SMB (Samba/CIFS), have your boss inform them that they are endangering the company's data, as it is known that accessing a repository over those methods has caused much corruption over the years.

http://ximbiot.com/cvs/wiki/CVS%20FAQ
search for: "DON'T use CVS in :local: mode with a server on a network drive!!!"


Are you sure that something else on the server is not slowing things down?
i.e., did the admin make the mistake of 1) leaving the RH install booting in
runlevel 5 instead of 3 and 2) logging in and then locking the screen while
running the 3D Gears screen saver, or even just staying logged in and letting
one of the gnome applets go crazy? (I have seen both on THE SAME machine, it
is a real drag even with quad processors)


no not at all.


Is the repository on a local disk or NFS mounted?

nfs


PLEASE tell me that NO OTHER MACHINE mounts that share!

Do you have a dedicated Ethernet card & line to the nfs server?
what is the speed of the Ethernet to the nfs server?


What file system is the repository on? any non-default options used in the
creation or mounting of that file system?

Vxfs

How much memory?

4 G


How much VM?


 none


vmstat and top indicate 2GB, with 200MB in use.


How much disk space in /tmp?


132 gigs


free?


Any IO errors showing up in /var/log/messages or dmesg

no


any in the logs of the NFS machine?


How large are the four largest files in the repository and are any of them
in the same directory?


 250M and not in the same directory

OK, IIRC that means any diffs or commits are going to require ~500M ram and considerable space in /tmp/


Are many of your developers working on branches instead of the trunk?

about 50%

Long lived branches tend to slow cvs down, because any work at the tip of a branch requires building the file from deltas. Trunk access is MUCH faster.

http://ximbiot.com/cvs/wiki/CVS%20FAQ
search for: "What is the best branching practice to use with CVS?"


does /usr/share/cvs*/contrib/ check_cvs [2] or validate_repo [1] indicate
any problems?

no


i.e., not nearly enough info to make an educated guess.



thanks for asking the right questions, we have been making educated guesses to
fix it and nothing seems to work.


I would suspect you are being slowed down by:
1) accessing the repository over NFS,
        a) uses a lot of CPU to do the transfers
        b) tops out at less than network speed
                10Mb/s  = ~1.2 MByte/second
                100Mb/s = ~12  MByte/second
                1000Mb/s= ~120 MByte/second

                and those are only if the NFS share and the cvs server are the ONLY computers on the network.
        c) has the risk of developers attempting to use the NFS share directly with CVS (very bad consequences).
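
The ceilings under b) come from dividing the link speed in megabits by 8; the figures quoted are rounded down a little to allow for overhead. A small sketch of that arithmetic, with hypothetical helper names, which also estimates the minimum time to move a file of a given size over such a link:

```shell
# Hypothetical helpers: link speed in Mbit/s -> theoretical MByte/s ceiling,
# and the minimum seconds needed to move a file of a given size (in MBytes).
link_mbytes() {
    awk -v mbit="$1" 'BEGIN { printf "%.2f\n", mbit / 8 }'
}

transfer_seconds() {
    awk -v mb="$1" -v mbit="$2" 'BEGIN { printf "%.1f\n", mb / (mbit / 8) }'
}

# Usage:
#   link_mbytes 100            # prints 12.50
#   transfer_seconds 250 100   # prints 20.0
```

So a single read of one of the 250M files over a 100Mb/s link cannot take less than about 20 seconds, even with the network otherwise idle.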


2) your CPU is overloaded (though this may be due to NFS use).

a dual/quad processor would probably handle the load better, if it had fast access to the disk.

3) If someone has changed the cache/process ram balance, such that more cache is in use, they may be causing the machine to take longer to process cvs actions because

        a) it pushed the processes into swap.
        b) it causes more context switches due to being in swap.

Changing the balance to favor cache would be an OK thing if the machine was JUST acting as a file server, but for cvs the normal balance is better.
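
One tunable that shifts this balance on 2.6 kernels is vm.swappiness (kernel default 60; higher values make the kernel more willing to swap out process memory in favor of cache). Whether that is what was changed on this machine is only a guess. A minimal sketch for checking it, with a hypothetical helper name:

```shell
# Hypothetical helper: report the current vm.swappiness value and whether
# it differs from the kernel default of 60. The file argument is optional
# and defaults to the usual /proc location.
swappiness_report() {
    file=${1:-/proc/sys/vm/swappiness}
    val=$(cat "$file") || return 1
    if [ "$val" -eq 60 ]; then
        echo "vm.swappiness=$val (kernel default)"
    else
        echo "vm.swappiness=$val (kernel default is 60)"
    fi
}

# Usage: swappiness_report
```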

I suspect you could speed the whole system up by an order of magnitude by putting the repository on a large fast local disk. Even a USB 2.0 connected (assuming the machine supports USB 2.0) disk at ~30MBytes/second could be faster than a network connection.

[1]
http://cvs.savannah.nongnu.org/viewvc/*checkout*/ccvs/contrib/validate_repo.pl?root=cvs&content-type=text%2Fplain
[2]
http://cvs.savannah.nongnu.org/viewvc/ccvs/contrib/check_cvs.in?revision=1.17&root=cvs&view=markup



--
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter
