monit-dev

RE: <service> start Generates email noise


From: Aaron Scamehorn
Subject: RE: <service> start Generates email noise
Date: Thu, 21 Dec 2006 13:13:21 -0600

Glad to hear it!  Glad to help.

Also, thanks for pointing out the "every" directive.  I think I might
want to start using it...


Thanks,
Aaron

-----Original Message-----
From: address@hidden
[mailto:address@hidden On
Behalf Of Martin Pala
Sent: Thursday, December 21, 2006 5:21 AM
To: The monit developer list
Subject: Re: <service> start Generates email noise

Thanks :)

I have reproduced the problem - it is fixed in cvs now. It was caused in
validate.c:check_skip() by the order of the s->def_every vs. s->visited
tests; the correct order is:

--8<--
   if(s->visited) {
     DEBUG("'%s' check skipped -- service already handled "
           "in a dependency chain\n", s->name);
     return TRUE;
   }

   if(!s->def_every)
     return FALSE;
--8<--

When no 'every' statement was used, check_skip() returned FALSE and
monit performed the service check in the same cycle in which the
user-requested start action was called. Because the just-started process
was not running yet, the process existence test failed and monit
performed the restart action in the same cycle as well.

I had trouble reproducing it, since I used an 'every' statement in the
testing configuration, which masked the bug.
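
To make the effect of the test order concrete, here is a minimal
sketch. It is not the monit source: the Service struct is a cut-down,
hypothetical stand-in that models only the two flags named above, the
function names check_skip_old/check_skip_fixed are made up for the
comparison, and the cycle counting that a real 'every' statement would
need is left out.

```c
#include <stdbool.h>

/* Hypothetical, cut-down service record; only the two flags
 * discussed in this thread are modelled. */
typedef struct {
    const char *name;
    bool visited;    /* already handled earlier in this cycle */
    bool def_every;  /* an 'every' statement was configured   */
} Service;

/* Old order: without 'every' the function returns before it ever
 * looks at the visited flag. */
static bool check_skip_old(const Service *s) {
    if (!s->def_every)
        return false;
    if (s->visited)
        return true;
    return false; /* 'every' cycle counting omitted in this sketch */
}

/* Fixed order: a service already handled in this cycle is always
 * skipped, with or without 'every'. */
static bool check_skip_fixed(const Service *s) {
    if (s->visited)
        return true;
    if (!s->def_every)
        return false;
    return false; /* 'every' cycle counting omitted in this sketch */
}
```

For a service with visited set and no 'every' statement, the old order
reports "do not skip" while the fixed order reports "skip". With an
'every' statement both orders skip, which is consistent with the bug
being hidden in a test configuration that used 'every'.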

Thanks for help :)

Martin



Aaron Scamehorn wrote:
> Hi Martin,
> 
> Thanks for the detailed description...
> 
> I've attached my monitrc file.  Obviously the executable (logClient) 
> is an in-house exe, but that shouldn't matter, should it?
> 
> I wonder if it is due to the amount of time it takes for the exe to 
> update its pid file...
> 
> Seems as though it has something to do with the wait_start starting 
> its own thread to wait???
> 
> The scenario is rather simple.  I can reproduce by stopping the 
> service, then issuing a monit <service> start via the CLI.
> 
> If this is not enough detail, or I can help out more, please let me 
> know.
> 
> Thanks,
> Aaron
>  
> 
> -----Original Message-----
> From: address@hidden
> [mailto:address@hidden On 
> Behalf Of Martin Pala
> Sent: Wednesday, December 06, 2006 8:50 AM
> To: The monit developer list
> Subject: Re: <service> start Generates email noise
> 
> I have looked into it ...
> 
> I will first explain how it works in monit 4.8.2:
> 
> Two threads come into play:
> - http thread
> - monitoring thread
> 
> The http thread processes the user-requested actions (posted via
> either the CLI or the HTML interface). The action to be done is
> scheduled in http/cervlet.c:handle_action() by setting the
> s->doaction flag for the appropriate service. When there is no action
> scheduled, the s->doaction flag is set to ACTION_IGNORE (in p.y during
> service initialization, or in validate.c after the action was
> handled). In addition, Run.doaction is set to TRUE to signal that
> there is some scheduled action in the service tree. The main
> monitoring thread is then woken up by the http thread to speed up the
> action handling.
> 
> The main thread then checks in validate.c:validate() whether the
> Run.doaction flag is set, since user actions take priority. If it is
> set, it walks the service tree and, for each service, performs the
> scheduled s->doaction using control_service() and then resets the
> s->doaction flag to ACTION_IGNORE. This is all done under mutex and
> signal protection, so it cannot be interrupted and no race condition
> can occur. The only thread which can call control_service() and
> physically start/restart/etc. the service is the main thread.
> control_service() also sets the s->visited flag.
> 
> The second service loop is then evaluated - monit walks the service
> tree and, for each service, locks the mutex and blocks signals. If
> the service was not already handled in the same cycle (the s->visited
> flag is tested in check_skip()), it checks the s->doaction flag
> again (to improve the response time for services that had an action
> scheduled between the first and second loop of the same cycle). If
> the flag is set, it performs the action; otherwise it checks the
> service.
> 
> 
> The design is similar to signal handling. The http thread just sets
> the flag, whereas the monitoring thread handles the action. From a
> theoretical point of view, I think no race condition can occur.
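
The two-loop cycle described above can be sketched roughly as follows.
This is a single-threaded illustration under stated assumptions, not
monit's code: the Service struct, the counters and the function
signatures are hypothetical, the mutex and signal blocking are omitted,
and both the clearing of the Run.doaction stand-in and the reset of
visited at the end of the cycle are assumptions made so the sketch can
run for more than one cycle.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { ACTION_IGNORE, ACTION_START } Action;

/* Hypothetical, simplified service record. */
typedef struct {
    const char *name;
    Action doaction;  /* set by the http thread, consumed here    */
    bool visited;     /* handled earlier in the current cycle     */
    int actions_run;  /* counters for illustration only           */
    int checks_run;
} Service;

/* Stand-in for control_service(): performs the action and marks
 * the service as visited. */
static void control_service(Service *s, Action a) {
    (void)a;
    s->actions_run++;
    s->visited = true;
}

/* One monitoring cycle; run_doaction stands in for Run.doaction. */
static void validate(Service *list, size_t n, bool *run_doaction) {
    size_t i;

    /* First loop: user-requested actions take priority. */
    if (*run_doaction) {
        for (i = 0; i < n; i++) {
            if (list[i].doaction != ACTION_IGNORE) {
                control_service(&list[i], list[i].doaction);
                list[i].doaction = ACTION_IGNORE;
            }
        }
        *run_doaction = false; /* assumption: cleared once handled */
    }

    /* Second loop: skip visited services, handle late actions,
     * otherwise run the normal service check. */
    for (i = 0; i < n; i++) {
        Service *s = &list[i];
        if (s->visited)
            continue; /* the result check_skip() should report */
        if (s->doaction != ACTION_IGNORE) {
            control_service(s, s->doaction);
            s->doaction = ACTION_IGNORE;
        } else {
            s->checks_run++;
        }
    }

    /* Assumption: visited is cleared before the next cycle. */
    for (i = 0; i < n; i++)
        list[i].visited = false;
}
```

With one service that has a pending ACTION_START and one that has
none, a single call to validate() runs the action for the first
service without checking it in the same cycle, and runs the ordinary
check for the second.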
> 
> I tried to reproduce the problem (official monit-4.8.2 release) 
> without success.
> 
> Can you prepare simple monit configuration and procedure for problem 
> reproduction?
> 
> Thanks,
> Martin
> 
> 
> 
> Aaron Scamehorn wrote:
> 
>>Hi Martin,
>>
>>Actually I think you've now got one thread doing an ACTION_START, and 
>>another doing an ACTION_RESTART on the exact same service.
>>
>>It is the ACTION_RESTART that is generating what I perceived to be 
>>extraneous emails.
>>
>>It looks like the do_wakeupcall that you added to
>>http/cervlet.c:handle_action() is the culprit.  Without it, I don't get
>>the ACTION_RESTART problem.
>>
>>Of course you need this now, or else it takes Poll Time to actually 
>>respond to the HTTP events, which is what you were trying to speed up in
>>the first place.
>>
>>Here is the log output, with a bunch of extra messages, including 
>>pthread_t.
>>
>>3086927552 [CST Dec  1 14:53:25] debug    : 'data_dir' filesystem flags has not changed since last cycle
>>3086927552 [CST Dec  1 14:53:25] debug    : 'data_dir' space usage check passed [current space usage=10.6%]
>>3086924720 [CST Dec  1 14:53:26] info     : monit daemon at 24175 awakened
>>3086927552 [CST Dec  1 14:53:26] info     : Awakened by User defined signal 1
>>3086927552 [CST Dec  1 14:53:26] debug    : control_service: ACTION_START for 'LogClient'
>>3086927552 [CST Dec  1 14:53:26] debug    : control_service: ACTION_START Util_isProcessRunning for 'LogClient'
>>3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
>>3086927552 [CST Dec  1 14:53:26] debug    : do_start: Util_isProcessRunning for 'LogClient'
>>3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
>>3086927552 [CST Dec  1 14:53:26] info     : 'LogClient' start: /cogcap/ccts/bin/logclnt
>>3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
>>3086927552 [CST Dec  1 14:53:26] debug    : Monitoring enabled -- service LogClient
>>3086927552 [CST Dec  1 14:53:26] debug    : check_process: calling Util_isProcessRunning for 'LogClient'
>>3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
>>3086927552 [CST Dec  1 14:53:26] error    : 'LogClient' process is not running
>>3086927552 [CST Dec  1 14:53:26] debug    : Does not exist notification is NOT sent to address@hidden
>>3086927552 [CST Dec  1 14:53:26] debug    : Does not exist notification is sent to address@hidden
>>3076434864 [CST Dec  1 14:53:26] debug    : static void* wait_start for 'LogClient'
>>3076434864 [CST Dec  1 14:53:26] debug    : 1) wait_start: calling Util_isProcessRunning for 'LogClient', max_tries= 29
>>3076434864 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
>>3086927552 [CST Dec  1 14:53:26] debug    : control_service: ACTION_RESTART for 'LogClient'
>>3086927552 [CST Dec  1 14:53:26] info     : 'LogClient' trying to restart
>>3086927552 [CST Dec  1 14:53:26] debug    : Monitoring disabled -- service LogClient (stop)
>>3086927552 [CST Dec  1 14:53:26] debug    : do_stop: Util_isProcessRunning for 'LogClient'
>>3086927552 [CST Dec  1 14:53:26] debug    : 'LogClient' Error testing process id [24220] -- No such process
>>3086927552 [CST Dec  1 14:53:26] debug    : 'data_dir' filesystem flags has not changed since last cycle
>>3086927552 [CST Dec  1 14:53:26] debug    : 'data_dir' space usage check passed [current space usage=10.6%]
>>3076434864 [CST Dec  1 14:53:27] debug    : 1) wait_start: calling Util_isProcessRunning for 'LogClient', max_tries= 28
>>3076434864 [CST Dec  1 14:53:27] debug    : 2) wait_start: calling Util_isProcessRunning for 'LogClient'
>>3086927552 [CST Dec  1 14:53:56] debug    : check_process: calling Util_isProcessRunning for 'LogClient'
>>3086927552 [CST Dec  1 14:53:56] info     : 'LogClient' process is running with pid 24375
>>3086927552 [CST Dec  1 14:53:56] debug    : Exists notification is NOT sent to address@hidden
>>3086927552 [CST Dec  1 14:53:56] debug    : Exists notification is sent to address@hidden
>>3086927552 [CST Dec  1 14:53:56] debug    : 'LogClient' zombie check passed [status_flag=0000]
>>3086927552 [CST Dec  1 14:53:56] debug    : 'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.2]
>>3086927552 [CST Dec  1 14:53:56] debug    : 'LogClient' cpu usage check passed [current cpu usage=0.0%]
>>3086927552 [CST Dec  1 14:53:56] debug    : 'LogClient' mem amount check passed [current mem amount=2764kB]
>>3086927552 [CST Dec  1 14:53:56] debug    : 'data_dir' filesystem flags has not changed since last cycle
>>3086927552 [CST Dec  1 14:53:56] debug    : 'data_dir' space usage check passed [current space usage=10.6%]
>>
>>
>>-----Original Message-----
>>From: address@hidden
>>[mailto:address@hidden On 
>>Behalf Of Martin Pala
>>Sent: Thursday, November 30, 2006 4:20 PM
>>To: The monit developer list
>>Subject: Re: <service> start Generates email noise
>>
>>Hello,
>>
>>this behavior isn't a bug - the 'nonexist' event type has positive and 
>>negative variants:
>>
>>   Does not exist (positive 'nonexist')
>>
>>     vs.
>>
>>   Exists (negative 'nonexist')
>>
>>The alert statement only allows filtering on the general event type, 
>>not on the particular polarity (there is no 'exist' option).
>>
>>=> when you have registered the 'nonexist' event, you should get two 
>>alerts, informing you about the beginning and the end of the problem.
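
One way to picture filtering on event type but not on polarity is the
sketch below. Everything in it is hypothetical and chosen only for
illustration (the EVENT_* bit values, the Event struct, and both
function names); it is not taken from the monit source. The point it
shows is that a filter which matches on the type alone lets both the
"Does not exist" and the "Exists" notification through.

```c
#include <stdbool.h>

/* Hypothetical event-type bits; names and values are illustrative. */
enum {
    EVENT_NONEXIST   = 1 << 0,
    EVENT_EXEC       = 1 << 1,
    EVENT_CONNECTION = 1 << 2,
    EVENT_TIMEOUT    = 1 << 3
};

typedef struct {
    unsigned type;  /* general event type, e.g. EVENT_NONEXIST      */
    bool failed;    /* true: problem began; false: problem ended    */
} Event;

/* The filter looks only at the event type; polarity is ignored. */
static bool alert_matches(unsigned filter_mask, const Event *e) {
    return (filter_mask & e->type) != 0;
}

/* Text for the two polarities of a 'nonexist' event. */
static const char *nonexist_description(const Event *e) {
    return e->failed ? "Does not exist" : "Exists";
}
```

With a mask built from the nonexist, exec and connection bits, both a
failed and a recovered EVENT_NONEXIST event match, while an
EVENT_TIMEOUT event does not.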
>>
>>Martin
>>
>>
>>Aaron Scamehorn wrote:
>>
>>>Hello,
>>>
>>> From version 4.8 to 4.8.2, the following bug has been introduced:
>>>
>>>When we issue a monit <service> start command, we get both a "Does not
>>>exist" email and a corresponding "Exists" email.
>>>
>>>Here is the debug output showing this behavior in 4.8.2:
>>>'LogClient' Error testing process id [11034] -- No such process
>>>'LogClient' Error testing process id [11034] -- No such process
>>>'LogClient' start: /cogcap/ccts/bin/logclnt
>>>'LogClient' Error testing process id [11034] -- No such process
>>>Monitoring enabled -- service LogClient
>>>'LogClient' Error testing process id [11034] -- No such process
>>>'LogClient' process is not running
>>>Does not exist notification is sent to address@hidden
>>>'LogClient' Error testing process id [11034] -- No such process
>>>'LogClient' trying to restart
>>>Monitoring disabled -- service LogClient (stop)
>>>'LogClient' Error testing process id [11034] -- No such process
>>>'LogClient' process is running with pid 11189
>>>Exists notification is sent to address@hidden
>>>'LogClient' zombie check passed [status_flag=0000]
>>>'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.2]
>>>'LogClient' cpu usage check passed [current cpu usage=0.0%]
>>>'LogClient' mem amount check passed [current mem amount=2776kB]
>>>
>>>
>>>Under version 4.8, we don't get the annoying "Does not exist" and a 
>>>corresponding "Exists" emails:
>>>
>>>'LogClient' Error testing process id [10970] -- No such process
>>>'LogClient' Error testing process id [10970] -- No such process
>>>'LogClient' start: /cogcap/ccts/bin/logclnt
>>>'LogClient' Error testing process id [10970] -- No such process
>>>Monitoring enabled -- service LogClient
>>>'LogClient' Error testing process id [10970] -- No such process
>>>'LogClient' Error testing process id [10970] -- No such process
>>>'LogClient' zombie check passed [status_flag=0000]
>>>'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.1]
>>>'LogClient' cpu usage check passed [current cpu usage=0.0%]
>>>'LogClient' mem amount check passed [current mem amount=2776kB]
>>>
>>>
>>>
>>>Additionally, in our config file, we have the following set:
>>>set alert address@hidden only on { nonexist, exec, connection }
>>>
>>>We shouldn't be getting an "Exists" email under any circumstance,
>>>should we?
>>>
>>>Thanks,
>>>Aaron
>>>
>>>
>>>
>>
> ----------------------------------------------------------------------
> --
> 
>>>_______________________________________________
>>>monit-dev mailing list
>>>address@hidden
>>>http://lists.nongnu.org/mailman/listinfo/monit-dev
>>
>>


_______________________________________________
monit-dev mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/monit-dev



