autoconf-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: parallel autotest [2/3]: Implement 'testsuite --jobs'.


From: Ralf Wildenhues
Subject: Re: parallel autotest [2/3]: Implement 'testsuite --jobs'.
Date: Mon, 29 Sep 2008 23:35:40 +0200
User-agent: Mutt/1.5.18 (2008-05-17)

Hi Eric,

* Eric Blake wrote on Fri, Sep 26, 2008 at 03:05:32PM CEST:
> 
> Hi Ralf, and (finally) reviewing this...

Thank you!

> According to Ralf Wildenhues on 5/25/2008 11:50 PM:
> >  * Major changes in Autoconf 2.62a (2008-??-??)
> >  
> > +** Autotest testsuites accept an option --jobs[=N] for parallel testing.
> 
> Make sure you adjust this to account for 2.63

Sure; thanks for the reminder.

> (hmm.  that means I should
> try to resync from coreutils' maint.mk, which has since added logic to
> commit the md5sum of the previous release's NEWS snippet in cfg.mk, to
> make it easier to identify inadvertent changes to the wrong NEWS row when
> doing 'make maintainer-distcheck').

Sounds interesting.

> > address@hidden address@hidden@address@hidden
> > address@hidden address@hidden
> > address@hidden -j
> 
> This second line is redundant, since you used @ovar on the line above.

The first line addresses --jobs=N, the second line addresses -jN.
What is redundant there?  (see below at --help output, for more info)

> > +Parallel mode requires the @command{mkfifo} command to work, and will be
> > +silently disabled otherwise.
> 
> Good - the option always exists, but is a no-op where we can't get it to
> work, because we have a sane fallback of linear execution.  Thus, if my
> testing on cygwin shows problems (I still haven't had time to apply your
> patch), it merely becomes a matter of adding cygwin to the list of reasons
> to disable -j, alongside the existing reason of missing mkfifo.  Thus, I
> think your patch is safe - even if it causes failures on some platforms
> during our testing phase, we have some time before the next release to fix
> them.  Please go ahead and apply, once you've addressed the remaining nits
> in this mail.

OK, will do.  Let's finish discussing this first, though.

> > +   case $at_jobs in *[[!0-9]]*)
> > +     AS_ERROR([non-numeric argument to -j/--jobs: $at_jobs]) ;;
> 
> Is there any way to feed back the user's spelling to this message, rather
> than listing both spellings?  If you don't spot an easy way, then I guess
> this is okay.

Easy:
          at_optname=`echo " $at_option" | sed 's/^ //; s/[[0-9=]].*//'`
          AS_ERROR([non-numeric argument to $at_optname: $at_jobs]) ;;

I just found another issue, though: AS_ERROR wants to write to fd 5,
which is opened only much later.  Since this concerns a number of
different places, too, I consider it an independent issue, to be fixed
separately.

> >  Execution tuning:
> >    -C, --directory=DIR
> >  [                 change to directory DIR before starting]
> > +  -j[[N]], --jobs[[=N]]
> > +[                 Allow N jobs at once; infinite jobs with no arg]
> 
> Don't we have the convention that in --help output, optional arguments for
> long options are also optional for short args?  In other words, is it
> better to try to write this, for compactness (after m4 processing)?
> 
>   -j, --jobs[=N]  Allow N jobs at once; ...

Yes, I think GNU software typically does this.  However, one crucial
difference to other programs' semantics is that
  ./testsuite -j 3 4 5

runs the tests number 3, 4, and 5, with unlimited jobs, rather than 3
jobs for tests number 4 and 5.  That's the primary reason I listed -j[N]
explicitly, to make it clear that no space may follow.

Also, even with that '[N]' removed we need two lines, unless we add one
more character of indentation to all of the help.  Preferences?

> If there's room, should we mention that this defaults to 1?

Yes, sure.

> > +if test $at_jobs -ne 1 &&
> > +     rm -f "$at_job_fifo" &&
> > +     ( mkfifo "$at_job_fifo" ) 2>/dev/null &&
> > +     exec AT_JOB_FIFO_FD<> "$at_job_fifo"
> 
> Is exec n<> portable?  Yes, POSIX specifies it, but I haven't seen it used
> anywhere else in autoconf.  I guess we commit it now, and if it is not
> portable, we wrap it inside an eval alongside the mkfifo check as a reason
> to skip the parallel testing.

According to <http://www.unix.org/whitepapers/shdiffs.html>, they were
not in System V shell, but:
| The new <> operator has been supported on many implementations, but not
| documented. It should cause no compatibility problems, but its use is
| rather specialised.

Then <http://www.in-ulm.de/~mascheck/bourne/common.html> documents:
| All Bourne shells accept the redirection <> (undocumented) but don't have
| implemented it correctly (exception: fixed on SunOS 5.6 ff). 

I haven't found a problem in practice yet (I have tested this patch with
all systems I could get hold of).  I would vote for fixing this when it
shows up in practice.

> > +  # Turn jobs into a list of numbers, starting from 1.
> > +  at_joblist=`AS_ECHO([" $at_groups_all "]) | \
> > +    sed -e 's/\( '$at_jobs'\) .*/\1/'`
> 
> In general, we've been omitting the -e.

Sure.  I copied the code from the -k range digestion which uses -e in a
couple of places, too.

> Looks sane, on a first reading.

Just to be sure, you did notice that commented bit:

> > +      . "$at_test_source" # AT_JOB_FIFO_FD<&-

that would be nice to have without the '#' but doesn't work this way?

> > +mkdir serial
> > +AT_CHECK([$CONFIG_SHELL ./micro-suite --jobs=[]AT_PARALLEL_TESTS_RUN & ]dnl
> > +         [sleep 3 && ]dnl
> > +         [cd serial && $CONFIG_SHELL ../micro-suite -3 >/dev/null && ]dnl
> 
> Should that -3 instead be written as -m4_decr(AT_PARALLEL_TESTS_RUN), in
> case someone plays with tuning the parameters?

Hmm.  I'm starting to think this whole test is a bit too fragile; if the
load is too high (for example because you run ./testsuite -jN), then it
can fail sometimes.

But if we keep it, the 'sleep 3' would need to be adjusted as well,
and the computation is different.  The idea is this: a test run takes

  startup + ntests * (serial_overhead + 1 / njobs)

time.  Further, the assumption is that several 'sleep 1' jobs can be run
in parallel with low enough overhead.  And that startup overhead is
similar for serial and parallel runs.

Making it robust enough will probably cause the test to take an unduly
long time,

> > +         [{ kill $! && exit 1; :; }], [], [stdout], [ignore])

Cheers,
Ralf




reply via email to

[Prev in Thread] Current Thread [Next in Thread]