Re: testsuite failure - 193 parallel execution

bug-autoconf


From:	Ralf Wildenhues
Subject:	Re: testsuite failure - 193 parallel execution
Date:	Tue, 20 Jul 2010 19:55:23 +0200
User-agent:	Mutt/1.5.20 (2010-04-22)

Hi Eric,

thanks for analyzing and tracking this down!

* Eric Blake wrote on Tue, Jul 20, 2010 at 07:20:52PM CEST:
> I'm seeing failures about 3 out of 10 times on my moderately-loaded
> machine on test 193; invariably, the failures are due to unexpected
> output on stderr, such as:
> 
> $ sh ./micro-suite -j4
> ## -------------------------------------------------------------- ##
> ## GNU Nonsense 1.0 test suite: suite to test parallel execution. ##
> ## -------------------------------------------------------------- ##
> 
>   1: test number 1                                   ok
>   2: test number 2                                   ok
>   3: test number 3                                   ok
>   4: test number 4                                   ok
>   7: test number 7                                   ok
>   8: test number 8                                   ok
>   5: test number 5                                   ok
>   6: test number 6                                   ok
> ./micro-suite: line 1726: echo: write error: Broken pipe
> ./micro-suite: line 4: echo: write error: Broken pipe
[...]
> In looking closer, those two line numbers correspond to
>  echo token >&6
> lines (one occurs inside the trap at line 1711; bash reports $LINENO in
> a trap relative to the start of the trap rather than the overall script).
> 
> It seems like a race in parallel tests - we are closing fd 6 prior to
> the last few subshells being permitted to finish writing 'token' into fd
> 6, and bash warns about the EPIPE failure to write in that case,
> followed by triggering the PIPE trap, where the second echo is attempted
> and also warns about the EPIPE failure to write.

I'm not sure I follow this reasoning completely.  By the time the master
closes the fd, it should have read back all tokens.  Why would any of
the workers still try to write to the fd after that?  And if they don't
need to write any more data, why should the close generate a SIGPIPE?
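
For reference, the symptom quoted above can be reproduced in isolation.
The following is only a minimal sketch under stated assumptions, not the
Autotest code: demo_dir, fds 6 and 7, and late_writer_err are
hypothetical names, and SIGPIPE is ignored here, whereas the generated
testsuite installs a PIPE trap.  It shows what a writer sees when the
reader has already closed a FIFO; it does not show whether the
testsuite's workers actually write that late.

```shell
#!/bin/sh
# Stand-alone illustration only; none of this is taken from Autotest.
# demo_dir, fds 6/7 and late_writer_err are hypothetical names.
demo_dir=$(mktemp -d) || exit 1
mkfifo "$demo_dir/fifo" || exit 1

# Worker: ignore SIGPIPE so a failed write is reported instead of
# killing the subshell, open the write end, then write a token late.
(
  trap '' PIPE
  exec 6>"$demo_dir/fifo"   # blocks until a reader opens the FIFO
  sleep 1                   # give the master time to close its end
  echo token >&6            # no reader is left: the write fails (EPIPE)
) 2>"$demo_dir/stderr" &

# Master: open the read end, then close it without waiting for the worker.
exec 7<"$demo_dir/fifo"
exec 7<&-

wait
late_writer_err=$(cat "$demo_dir/stderr")
rm -rf "$demo_dir"
printf '%s\n' "$late_writer_err"
```

The exact wording of the diagnostic depends on the shell and the locale,
so the sketch only captures whatever the worker printed on stderr.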

> I'm thinking about the following patch, but am not comfortable pushing
> it without some review.  The idea is that we should not close the token
> collector fd until we know that no subshells will try to write into the fd.

The patch seems fairly safe in that it shouldn't hurt.  How do multiple
runs of the test fare on your moderately-loaded system with it?

Thanks,
Ralf

> --- i/lib/autotest/general.m4
> +++ w/lib/autotest/general.m4
> @@ -1425,8 +1425,8 @@ dnl         kill -13 $$
>        read at_token
>      done <&AT_JOB_FIFO_FD
>    fi
> -  exec AT_JOB_FIFO_FD<&-
>    wait
> +  exec AT_JOB_FIFO_FD<&-
>  else
>    # Run serially, avoid forks and other potential surprises.
>    for at_group in $at_groups; do
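
The ordering the patch establishes (collect the tokens, wait for the
workers, and only then close the collector fd) can be sketched outside
of Autotest as follows.  This is an illustration under assumptions, not
the code generated from general.m4: fd 6, njobs, tokens and worker_err
are hypothetical names, and opening the FIFO read-write with `<>` relies
on behavior that POSIX leaves undefined but that Linux supports.

```shell
#!/bin/sh
# Stand-alone illustration only; names and fd numbers are hypothetical.
demo_dir=$(mktemp -d) || exit 1
mkfifo "$demo_dir/fifo" || exit 1

# Assumption: a read-write open of a FIFO does not block (true on Linux).
exec 6<>"$demo_dir/fifo"

# Start the workers; each one hands back a token when it is done.
njobs=3
i=0
while [ "$i" -lt "$njobs" ]; do
  ( echo token >&6 ) 2>>"$demo_dir/stderr" &
  i=$((i + 1))
done

# Master: read back one token per worker.
tokens=0
while [ "$tokens" -lt "$njobs" ]; do
  read at_token <&6 || break
  tokens=$((tokens + 1))
done

wait        # first let every worker exit ...
exec 6<&-   # ... and only then close the collector fd

worker_err=$(cat "$demo_dir/stderr")
rm -rf "$demo_dir"
echo "tokens=$tokens"
```

With the close placed after `wait`, no worker can still be holding or
writing to the fd when the master gives it up, which is the property the
patch is after.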
