monit-dev

Re: <service> start Generates email noise


From: Martin Pala
Subject: Re: <service> start Generates email noise
Date: Thu, 21 Dec 2006 12:20:58 +0100
User-agent: Mozilla/5.0 (X11; U; SunOS i86pc; en-US; rv:1.7) Gecko/20060627

Thanks :)

I have reproduced the problem - it is now fixed in CVS. It was caused by the order of the s->def_every and s->visited tests in validate.c:check_skip(). The correct order is:

--8<--
  if(s->visited) {
    DEBUG("'%s' check skipped -- service already handled "
          "in a dependency chain\n", s->name);
    return TRUE;
  }

  if(!s->def_every)
    return FALSE;
--8<--

When no 'every' statement was used, check_skip() returned FALSE, so monit performed the service check in the same cycle in which the user-requested start action ran. Because the just-started process was not running yet, the process existence test failed and monit performed the restart action in that same cycle as well.
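
To make the effect of the ordering concrete, here is a minimal sketch of the two variants. It assumes a stripped-down service struct holding only the visited and def_every fields; check_skip_old() is reconstructed from the description above rather than copied from the monit source, and the real handling of the 'every' statement is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for monit's service record.
 * Only the two fields named in the fix are modeled. */
typedef struct {
  const char *name;
  bool visited;    /* service was already handled in this cycle */
  bool def_every;  /* an 'every' statement is configured */
} service_t;

/* Assumed old order: the 'every' test came first, so a service
 * without 'every' was never skipped, even if already handled. */
bool check_skip_old(const service_t *s) {
  assert(s != NULL);
  if (!s->def_every)
    return false;
  if (s->visited)
    return true;
  return false;  /* real 'every' handling omitted in this sketch */
}

/* Corrected order: the visited test wins regardless of 'every'. */
bool check_skip_fixed(const service_t *s) {
  assert(s != NULL);
  if (s->visited)
    return true;
  if (!s->def_every)
    return false;
  return false;  /* real 'every' handling omitted in this sketch */
}
```

For a service that was just started in this cycle and has no 'every' statement (visited set, def_every unset), check_skip_old() returns false, so the existence check runs right after the start action, while check_skip_fixed() returns true and the check is skipped until the next cycle.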

I had trouble reproducing it, since I used the 'every' statement in my test configuration, which masked the bug.

Thanks for the help :)

Martin



Aaron Scamehorn wrote:
Hi Martin,

Thanks for the detailed description...

I've attached my monitrc file.  Obviously the executable (logClient) is
an in-house exe, but that shouldn't matter, should it?

I wonder if it is due to the amount of time it takes for the exe to
update its pid file...

Seems as though it has something to do with wait_start starting its
own thread to wait???

The scenario is rather simple.  I can reproduce by stopping the service,
then issuing a monit <service> start via the CLI.

If this is not enough detail, or I can help out more, please let me
know.

Thanks,
Aaron
-----Original Message-----
From: address@hidden
[mailto:address@hidden On
Behalf Of Martin Pala
Sent: Wednesday, December 06, 2006 8:50 AM
To: The monit developer list
Subject: Re: <service> start Generates email noise

I have looked into it ...

I will first explain how it works in monit 4.8.2:

Two threads come into play:
- http thread
- monitoring thread

The http thread processes the user-requested actions (posted via either the CLI or the HTML interface). The action is scheduled in http/cervlet.c:handle_action() by setting the s->doaction flag for the appropriate service. When no action is scheduled, the s->doaction flag is set to ACTION_IGNORE (in p.y during service initialization, or in validate.c after the action was handled). In addition, Run.doaction is set to TRUE to signal that some action is scheduled in the service tree. The main monitoring thread is then woken up by the http thread to speed up the action handling.

The main thread then checks in validate.c:validate() whether the Run.doaction flag is set, since user actions take priority. If it is set, it walks the service tree and for each service performs the scheduled s->doaction using control_service(), then resets the s->doaction flag to ACTION_IGNORE. This is all done under mutex and signal protection, so it cannot be interrupted and no race condition can occur. The only thread that can call control_service() and physically start/restart/etc. the service is the main thread. control_service() also sets the s->visited flag.

The second service loop is then evaluated: monit walks the service tree and, for each service, locks the mutex and blocks signals. If the service was not already handled in the same cycle (the s->visited flag is tested in check_skip()), it checks the s->doaction flag again, to improve the response time for services that had an action scheduled between the first and second loop of the same cycle. If the flag is set, it performs the action; otherwise it checks the service.


The design is similar to signal handling: the http thread just sets the flag, whereas the monitoring thread handles the action. From a theoretical point of view, I think no race condition can occur.
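
The flag handoff described above can be sketched roughly as follows. This is a simplified model, not monit's code: the names doaction, visited, ACTION_IGNORE, ACTION_START and ACTION_RESTART come from this thread, while service_t, schedule_action(), handle_actions(), the single global mutex and the run_doaction variable (standing in for Run.doaction) are assumptions made for illustration. Signal blocking and the wakeup of the monitoring thread are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ACTION_IGNORE, ACTION_START, ACTION_RESTART } action_t;

/* Hypothetical, simplified service record. */
typedef struct service {
  const char *name;
  action_t doaction;      /* action scheduled by the http thread */
  bool visited;           /* handled in the current cycle */
  struct service *next;
} service_t;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static bool run_doaction = false;  /* stands in for Run.doaction */

/* http thread side: only records what should be done. */
void schedule_action(service_t *s, action_t a) {
  assert(s != NULL);
  pthread_mutex_lock(&lock);
  s->doaction = a;
  run_doaction = true;
  pthread_mutex_unlock(&lock);
  /* the real daemon would wake the monitoring thread here */
}

/* Monitoring thread side: the only place where actions are performed.
 * Returns the number of actions handled in this pass. */
int handle_actions(service_t *list) {
  int done = 0;
  pthread_mutex_lock(&lock);
  if (run_doaction) {
    for (service_t *s = list; s != NULL; s = s->next) {
      if (s->doaction != ACTION_IGNORE) {
        /* stand-in for control_service(): mark as handled */
        s->visited = true;
        s->doaction = ACTION_IGNORE;
        done++;
      }
    }
    run_doaction = false;
  }
  pthread_mutex_unlock(&lock);
  return done;
}
```

Because the http side never performs an action itself, a service can only be started or restarted by whichever thread calls handle_actions(), which mirrors the single-writer design described above.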

I tried to reproduce the problem (official monit-4.8.2 release) without success.

Can you prepare a simple monit configuration and a procedure for reproducing the problem?

Thanks,
Martin



Aaron Scamehorn wrote:

Hi Martin,

Actually I think you've now got one thread doing an ACTION_START, and
another doing an ACTION_RESTART on the exact same service.

It is the ACTION_RESTART that is generating what I perceived to be
extraneous emails.
It looks like the do_wakeupcall that you added to
http/cervlet.c:handle_action() is the culprit.  Without it, I don't get
the ACTION_RESTART problem.

Of course you need this now, or else it takes Poll Time to actually
respond to the HTTP events, which is what you were trying to speed up in
the first place.

Here is the log output, with a bunch of extra messages, including
pthread_t.

3086927552 [CST Dec  1 14:53:25] debug    : 'data_dir' filesystem flags has not changed since last cycle
3086927552 [CST Dec  1 14:53:25] debug    : 'data_dir' space usage check passed [current space usage=10.6%]
3086924720 [CST Dec  1 14:53:26] info     : monit daemon at 24175 awakened
3086927552 [CST Dec  1 14:53:26] info     : Awakened by User defined signal 1
3086927552 [CST Dec  1 14:53:26] debug    : control_service: ACTION_START for 'LogClient'
3086927552 [CST Dec  1 14:53:26] debug    : control_service: ACTION_START Util_isProcessRunning for 'LogClient'
3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
3086927552 [CST Dec  1 14:53:26] debug    : do_start: Util_isProcessRunning for 'LogClient'
3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
3086927552 [CST Dec  1 14:53:26] info     : 'LogClient' start: /cogcap/ccts/bin/logclnt
3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
3086927552 [CST Dec  1 14:53:26] debug    : Monitoring enabled -- service LogClient
3086927552 [CST Dec  1 14:53:26] debug    : check_process: calling Util_isProcessRunning for 'LogClient'
3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
3086927552 [CST Dec  1 14:53:26] error    : 'LogClient' process is not running
3086927552 [CST Dec  1 14:53:26] debug    : Does not exist notification is NOT sent to address@hidden
3086927552 [CST Dec  1 14:53:26] debug    : Does not exist notification is sent to address@hidden
3076434864 [CST Dec  1 14:53:26] debug    : static void* wait_start for 'LogClient'
3076434864 [CST Dec  1 14:53:26] debug    : 1) wait_start: calling Util_isProcessRunning for 'LogClient', max_tries= 29
3076434864 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
3086927552 [CST Dec  1 14:53:26] debug    : control_service: ACTION_RESTART for 'LogClient'
3086927552 [CST Dec  1 14:53:26] info     : 'LogClient' trying to restart
3086927552 [CST Dec  1 14:53:26] debug    : Monitoring disabled -- service LogClient (stop)
3086927552 [CST Dec  1 14:53:26] debug    : do_stop: Util_isProcessRunning for 'LogClient'
3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
3086927552 [CST Dec  1 14:53:26] debug    : 'data_dir' filesystem flags has not changed since last cycle
3086927552 [CST Dec  1 14:53:26] debug    : 'data_dir' space usage check passed [current space usage=10.6%]
3076434864 [CST Dec  1 14:53:27] debug    : 1) wait_start: calling Util_isProcessRunning for 'LogClient', max_tries= 28
3076434864 [CST Dec  1 14:53:27] debug    : 2) wait_start: calling Util_isProcessRunning for 'LogClient'
3086927552 [CST Dec  1 14:53:56] debug    : check_process: calling Util_isProcessRunning for 'LogClient'
3086927552 [CST Dec  1 14:53:56] info     : 'LogClient' process is running with pid 24375
3086927552 [CST Dec  1 14:53:56] debug    : Exists notification is NOT sent to address@hidden
3086927552 [CST Dec  1 14:53:56] debug    : Exists notification is sent to address@hidden
3086927552 [CST Dec  1 14:53:56] debug    : 'LogClient' zombie check passed [status_flag=0000]
3086927552 [CST Dec  1 14:53:56] debug    : 'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.2]
3086927552 [CST Dec  1 14:53:56] debug    : 'LogClient' cpu usage check passed [current cpu usage=0.0%]
3086927552 [CST Dec  1 14:53:56] debug    : 'LogClient' mem amount check passed [current mem amount=2764kB]
3086927552 [CST Dec  1 14:53:56] debug    : 'data_dir' filesystem flags has not changed since last cycle
3086927552 [CST Dec  1 14:53:56] debug    : 'data_dir' space usage check passed [current space usage=10.6%]


-----Original Message-----
From: address@hidden
[mailto:address@hidden On
Behalf Of Martin Pala
Sent: Thursday, November 30, 2006 4:20 PM
To: The monit developer list
Subject: Re: <service> start Generates email noise

Hello,

this behavior isn't a bug - the 'nonexist' event type has positive and negative variants:

  Does not exist (positive 'nonexist')

    vs.

  Exists (negative 'nonexist')

The alert statement only lets you filter on the general event type, not on a particular polarity (there is no 'exist' option).

=> when you have registered for the 'nonexist' event, you should get two alerts, one reporting the beginning of the problem and one reporting its end.
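
As a rough illustration of why both mails arrive, the filter can be modeled as a bitmask test on the event type alone. This is only a sketch under assumed names: the EVENT_* constants, event_t and alert_matches() are hypothetical and are not taken from the monit source; only the event names nonexist, exec and connection appear in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event type bits, one per general event type. */
enum {
  EVENT_NONEXIST   = 1 << 0,
  EVENT_EXEC       = 1 << 1,
  EVENT_CONNECTION = 1 << 2,
  EVENT_TIMEOUT    = 1 << 3
};

/* Hypothetical event record: a type plus its polarity. */
typedef struct {
  unsigned type;   /* one of the EVENT_* bits */
  bool positive;   /* true: "Does not exist", false: "Exists" */
} event_t;

/* The filter looks only at the type; the polarity is deliberately
 * ignored, which is why both variants pass the same mask. */
bool alert_matches(unsigned mask, const event_t *e) {
  assert(e != NULL);
  return (mask & e->type) != 0;
}
```

With a mask built from nonexist, exec and connection, both the positive and the negative 'nonexist' event match, while an event of a type outside the mask does not.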

Martin


Aaron Scamehorn wrote:

Hello,

From version 4.8 to 4.8.2, the following bug has been introduced:

When we issue a monit <service> start command, we get a "Does not exist"
email and a corresponding "Exists" email.

Here is the debug output showing this behavior in 4.8.2:
'LogClient' Error testing process id [11034] -- No such process
'LogClient' Error testing process id [11034] -- No such process
'LogClient' start: /cogcap/ccts/bin/logclnt
'LogClient' Error testing process id [11034] -- No such process
Monitoring enabled -- service LogClient
'LogClient' Error testing process id [11034] -- No such process
'LogClient' process is not running
Does not exist notification is sent to address@hidden
'LogClient' Error testing process id [11034] -- No such process
'LogClient' trying to restart
Monitoring disabled -- service LogClient (stop)
'LogClient' Error testing process id [11034] -- No such process
'LogClient' process is running with pid 11189
Exists notification is sent to address@hidden
'LogClient' zombie check passed [status_flag=0000]
'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.2]
'LogClient' cpu usage check passed [current cpu usage=0.0%]
'LogClient' mem amount check passed [current mem amount=2776kB]


Under version 4.8, we don't get the annoying "Does not exist" and corresponding "Exists" emails:

'LogClient' Error testing process id [10970] -- No such process
'LogClient' Error testing process id [10970] -- No such process
'LogClient' start: /cogcap/ccts/bin/logclnt
'LogClient' Error testing process id [10970] -- No such process
Monitoring enabled -- service LogClient
'LogClient' Error testing process id [10970] -- No such process
'LogClient' Error testing process id [10970] -- No such process
'LogClient' zombie check passed [status_flag=0000]
'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.1]
'LogClient' cpu usage check passed [current cpu usage=0.0%]
'LogClient' mem amount check passed [current mem amount=2776kB]



Additionally, in our config file, we have the following set:
set alert address@hidden only on { nonexist, exec, connection }

We shouldn't be getting an "Exists" email under any circumstance, should
we?

Thanks,
Aaron




------------------------------------------------------------------------

_______________________________________________
monit-dev mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/monit-dev

