bug-guix

bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections


From: Ludovic Courtès
Subject: bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
Date: Mon, 16 May 2022 11:13:17 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I noticed this error: https://ci.guix.gnu.org/eval/320682/log/raw.
>
> The following warning can occur because (cuirass scripts evaluate) is
> multi-threaded: its ‘par-for-each’ procedure ends up calling
> "primitive-fork" via ‘open-bidirectional-pipe’ of (guix inferior):
>
> warning: call to primitive-fork while multiple threads are running;
>          further behavior unspecified.  See "Processes" in the
>          manual, for more information.
>
> I'm not sure whether it could cause the observed hang; Ludo, what do you think?

As you write, the warning comes from ‘open-bidirectional-pipe’, added in
commit bd86bbd300474204878e927f6cd3f0defa1662a5.  I noticed it back then
but as far as I could see, it was harmless (but definitely suboptimal):

  https://issues.guix.gnu.org/48007#11

Indeed, “make cuirass-jobs” works for me.

I decided to go ahead and run ‘cuirass evaluate’ by hand to gather more
info:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo su - cuirass -s /bin/sh -c \
    "/gnu/store/qamrzl942khb7w0cb4csyflnxfkwvvxq-cuirass-1.1.0-11.9f08035/bin/cuirass evaluate dbname=cuirass 323183"
Computing Guix derivation for 'x86_64-linux'... |
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
2022-05-16T09:56:43 Registering builds for evaluation 323183.
2022-05-16T10:08:47 Registering builds for evaluation 323183.
--8<---------------cut here---------------end--------------->8---

I straced it in another terminal.  It did its work for a long while, and
eventually activity stopped:

--8<---------------cut here---------------start------------->8---
read(228, "nu/store/7gi4qsakz775z5wcmddg8xan7q806sj1-findutils-4.8.0\")) (#:nix-name . \"findutils-4.8.0\") (#:sys"..., 4096) = 4096
read(228, "nrd-gcc-cross-x86_64-w64-mingw32-10.3.0.drv\" \"/gnu/store/ykrksw421rd22zh4w3pc751r90wc881n-patch-2.7."..., 4096) = 4096
read(228, "wm-binutils-cross-x86_64-w64-mingw32-2.37.drv\" \"/gnu/store/xfh4965w2amrsl7x0h3dslrhndzn5bbz-linux-li"..., 4096) = 4096
read(228, "1ir-gawk-5.1.0.drv\" \"/gnu/store/p1d2w4fg1ljbyfs0ayv854snahcimamk-libunistring-0.9.10.drv\" \"/gnu/stor"..., 4096) = 4096
read(228, "3ysc-guile-2.2.7.drv\" \"/gnu/store/b0ma6l61ad4dfb3030sl1ngvdlicc523-glibc-utf8-locales-2.33.drv\" \"/gn"..., 4096) = 4096
read(228, "j6a5mc7w13xiw8136y4brv7w-xz-5.2.5.drv\" \"/gnu/store/w1bxsf02c4y0hj0nd39awv7i55ych641-libffi-3.3.drv\" "..., 4096) = 4096
read(228, "tore/qbqgamcpwgs2sw8lcr6pxl5y38fpyz0d-sed-4.8.drv\" \"/gnu/store/qmsbxbgb758h6f7018snxlwn69w9w9i4-diff"..., 4096) = 4096
read(228, "oconf-2.69.drv\" \"/gnu/store/6n5mn2vkl4x4fab37sx1rmqyqxqcwd4y-libidn2-2.3.1.drv\" \"/gnu/store/6q5cp0m2"..., 4096) = 4096
mmap(0x7fddd26af000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fddd268f000
read(228, "-binutils-cross-x86_64-w64-mingw32-2.37.drv\" \"/gnu/store/wdmk9llhdzx5y10mziss1i5z6q037a7z-guile-3.0."..., 4096) = 954
futex(0x7fdf015381f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01546b00, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdf015381f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01546b00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPWR {si_signo=SIGPWR, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigsuspend(~[INT QUIT ABRT BUS SEGV TERM XCPU RTMIN RT_1], 8) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]})                 = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPWR {si_signo=SIGPWR, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigsuspend(~[INT QUIT ABRT BUS SEGV TERM XCPU RTMIN RT_1], 8) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]})                 = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[…]
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]})                 = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdf01546c00, FUTEX_WAKE_PRIVATE, 1) = 0
close(224)                              = 0
futex(0x7fdf01538f70, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
[stops here]
--8<---------------cut here---------------end--------------->8---

The ‘read’ calls correspond to fetching the jobset alists; FD 228 is a
socket (to a ‘guix repl’ subprocess, I presume).  The SIGPWR/SIGXCPU
signals suggest attempts at a stop-the-world GC.

GDB shows that the process has 120 threads (it’s a 96-core machine, but
still…) and most threads are stuck in pthread_cond_wait:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fdf01d62afa in __futex_abstimed_wait_common64 
(futex_word=futex_word@entry=0x7fded20cd8b4, 
    expected=expected@entry=0, clockid=clockid@entry=0, 
abstime=abstime@entry=0x0, private=private@entry=0, 
    cancel=cancel@entry=true) at ../sysdeps/nptl/futex-internal.c:74
#1  0x00007fdf01d62b5b in __GI___futex_abstimed_wait_cancelable64 
(futex_word=futex_word@entry=0x7fded20cd8b4, 
    expected=expected@entry=0, clockid=clockid@entry=0, 
abstime=abstime@entry=0x0, private=private@entry=0)
    at ../sysdeps/nptl/futex-internal.c:123
#2  0x00007fdf01d5cab0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, 
mutex=0x7fdef9c15fc0, 
    cond=cond@entry=0x7fded20cd888) at pthread_cond_wait.c:504
#3  __pthread_cond_wait (cond=cond@entry=0x7fded20cd888, 
mutex=mutex@entry=0x7fdef9c15fc0) at pthread_cond_wait.c:619
#4  0x00007fdf01ea2a55 in scm_pthread_cond_wait 
(cond=cond@entry=0x7fded20cd888, mutex=mutex@entry=0x7fdef9c15fc0)
    at threads.c:1605
#5  0x00007fdf01ea7b6b in block_self (
    queue=((#<smob thread 7fdedfd05eb0>) #<smob thread 7fdedfd05af0> #<smob 
thread 7fdedfd05950> #<smob thread 7fdedfd05870> #<smob thread 7fdedfd05790> 
#<smob thread 7fdedfd056b0> #<smob thread 7fdedfd05570> #<smob thread 
7fdedfd05490> #<smob thread 7fdedfd05350> #<smob thread 7fdedfd05150> #<smob 
thread 7fdedfd05070> #<smob thread 7fdedfd18f50> #<smob thread 7fdedfd18d70> 
#<smob thread 7fdedfd189a0> #<smob thread 7fdedfd18730> #<smob thread 
7fdedfd185e0> #<smob thread 7fdedfd182b0> #<smob thread 7fdedfd18160> #<smob 
thread 7fdedfd21f50> #<smob thread 7fdedfd21e00> #<smob thread 7fdedfd21cb0> 
#<smob thread 7fdedfd217a0> #<smob thread 7fdedfd21650> #<smob thread 
7fdedfd21500> #<smob thread 7fdedfd213b0> #<smob thread 7fdedfd21260> #<smob 
thread 7fdedfd2ff90> #<smob thread 7fdedfd2fe60> #<smob thread 7fdedfd2fcf0> 
#<smob thread 7fdedfd2fba0> #<smob thread 7fdedfd2fa50> #<smob thread 
7fdedfd2f900> #<smob thread 7fdedfd2f650> #<smob thread 7fdedfd2f570> #<smob 
thread 7fdedfd2f430> #<smob thread 7fdedfd2f310> #<smob thread 7fdedfd2f1a0> 
#<smob thread 7fdedfd2f0b0> #<smob thread 7fdedfd4efd0> #<smob thread 
7fdedfd4ee90> #<smob thread 7fdedfd4ebd0> #<smob thread 7fdedfd4e940> #<smob 
thread 7fdedfd4e760> #<smob thread 7fdedfd4e660> #<smob thread 7fdedfd4e270> 
#<smob thread 7fdedfdbefc0> #<smob thread 7fdedfdbeec0> #<smob thread 
7fdedfdbede0> #<smob thread 7fdedfdbed00> #<smob thread 7fdedfdbeb00> #<smob 
thread 7fdedfdbe8f0> #<smob thread 7fdedfdbe810> #<smob thread 7fdedfdbe730> 
#<smob thread 7fdedfdbe650> #<smob thread 7fdedfdbe3f0> #<smob thread 
7fdedfdbe220> #<smob thread 7fdedfdbe120> #<smob thread 7fdedfdbe020> #<smob 
thread 7fdedfa34610> #<smob thread 7fdedfa344d0> #<smob thread 7fdedfa34110> 
#<smob thread 7fdedfa52e90> #<smob thread 7fdedfa52d30> #<smob thread 
7fdedfa52c40> #<smob thread 7fdedfa737e0> #<smob thread 7fdedfa73400> #<smob 
thread 7fdedfa73000> #<smob thread 7fdedfa88ef0> #<smob thread 7fdedfa88cc0> 
#<smob thread 7fdedfa888e0> #<smob thread 7fdedfa88800> #<smob thread 
7fdedfa88720> #<smob thread 7fdedfa88620> #<smob thread 7fdedfa884b0> #<smob 
thread 7fdedfa88370> #<smob thread 7fdedfa880b0> #<smob thread 7fdedfb06fa0> 
#<smob thread 7fdedfb06b90> #<smob thread 7fdedfb066a0> #<smob thread 
7fdedfb06470> #<smob thread 7fdedfb062e0> #<smob thread 7fdedfb06140> #<smob 
thread 7fdedfb58f80> #<smob thread 7fdedfb58af0> #<smob thread 7fdedfb58830> 
#<smob thread 7fdedfb93cd0> #<smob thread 7fdedfb93ae0> #<smob thread 
7fdedfb936c0> #<smob thread 7fdedfb935c0> #<smob thread 7fdedfb93400> #<smob 
thread 7fdedfb932c0> #<smob thread 7fdedfd05ff0> #<smob thread 7fdedfd05c20> 
#<smob thread 7fdedfd05eb0>), mutex=mutex@entry=0x7fdef9c15fc0, 
waittime=waittime@entry=0x0) at threads.c:312
#6  0x00007fdf01eaa1cd in timed_wait (waittime=0x0, 
current_thread=0x7fded20cd6c0, c=0x7fdef9c10d20, 
    m=0x7fdef9c15fc0, kind=SCM_MUTEX_STANDARD) at threads.c:1331
#7  scm_timed_wait_condition_variable (cond=#<smob condition-variable 
7fdef9c122e0>, 
    mutex=#<smob mutex 7fdef9c12320>, timeout=#<undefined 904>) at 
threads.c:1424
#8  0x00007fded169a7f3 in ?? ()
#9  0x0000000000b83688 in ?? ()
#10 0x00007fded169a750 in ?? ()
#11 0x00007fdeec009e98 in ?? ()
#12 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded20cd6c0, 
mcode=0xb8369c "\034\330\003") at jit.c:6038
#13 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded20cd6c0) at 
vm-engine.c:360
#14 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized 
out>, nargs=0) at vm.c:1608
#15 0x00007fdf01e1ea0e in scm_call_with_unblocked_asyncs (proc=#<program 
7fdef732e880>) at async.c:406
#16 0x00007fdf01ead336 in vm_regular_engine (thread=0x7fded20cd6c0) at 
vm-engine.c:972
#17 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized 
out>, nargs=0) at vm.c:1608
#18 0x00007fdf01ea9be6 in really_launch (d=0x7fdef730c860) at threads.c:778
#19 0x00007fdf01e2085a in c_body (d=0x7fdec8c44d80) at continuations.c:430
#20 0x00007fded2874532 in ?? ()
#21 0x00007fdef9a227e0 in ?? ()
#22 0x00007fdef7d8c370 in ?? ()
#23 0x0000000000000048 in ?? ()
#24 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded20cd6c0, 
mcode=0xb81874 "\034<\003") at jit.c:6038
#25 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded20cd6c0) at 
vm-engine.c:360
#26 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized 
out>, nargs=2) at vm.c:1608
#27 0x00007fdf01e2209a in scm_call_2 (proc=<optimized out>, arg1=<optimized 
out>, arg2=<optimized out>) at eval.c:503
#28 0x00007fdf01ed8752 in scm_c_with_exception_handler.constprop.0 (type=#t, 
    handler_data=handler_data@entry=0x7fdec8c44d10, 
thunk_data=thunk_data@entry=0x7fdec8c44d10, 
    thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#29 0x00007fdf01eaa88f in scm_c_catch (tag=<optimized out>, body=<optimized 
out>, body_data=<optimized out>, 
    handler=<optimized out>, handler_data=<optimized out>, 
pre_unwind_handler=<optimized out>, 
    pre_unwind_handler_data=0x7fdef9b97040) at throw.c:168
#30 0x00007fdf01e22e66 in scm_i_with_continuation_barrier 
(pre_unwind_handler=0x7fdf01e22b80 <pre_unwind_handler>, 
    pre_unwind_handler_data=0x7fdef9b97040, handler_data=0x7fdec8c44d80, 
handler=0x7fdf01e298b0 <c_handler>, 
    body_data=0x7fdec8c44d80, body=0x7fdf01e20850 <c_body>) at 
continuations.c:368
#31 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized 
out>) at continuations.c:464
#32 0x00007fdf01ea9b39 in with_guile (base=0x7fdec8c44e08, data=0x7fdec8c44e30) 
at threads.c:645
#33 0x00007fdf01d800ba in GC_call_with_stack_base ()
   from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#34 0x00007fdf01ea216d in scm_i_with_guile (dynamic_state=<optimized out>, 
data=0x7fdef730c860, 
    func=0x7fdf01ea9b70 <really_launch>) at threads.c:688
#35 launch_thread (d=0x7fdef730c860) at threads.c:787
#36 0x00007fdf01d56d7e in start_thread (arg=0x7fdec8c45640) at 
pthread_create.c:473
#37 0x00007fdf01954eff in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
--8<---------------cut here---------------end--------------->8---

There’s also one thread stuck in read(2):

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  __libc_read (nbytes=1, buf=0x7fdedf97bdd0, fd=7) at 
../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=7, buf=buf@entry=0x7fdedf97bdd0, nbytes=nbytes@entry=1) at 
../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007fdf01e382e8 in fport_read (port=<optimized out>, dst=<optimized 
out>, start=<optimized out>, count=1)
    at fports.c:597
#3  0x00007fdf01e78d22 in scm_i_read_bytes (port=port@entry=#<port #<port-type 
file 7fdef9b5fb40> 7fdef59b7180>, 
    dst="#<vu8vector>" = {...}, start=start@entry=0, count=1) at ports.c:1566
#4  0x00007fdf01e7b1c7 in scm_fill_input (port=port@entry=#<port #<port-type 
file 7fdef9b5fb40> 7fdef59b7180>, 
    minimum_size=1, minimum_size@entry=0, cur_out=cur_out@entry=0x7fde86fcc3e8, 
    avail_out=avail_out@entry=0x7fde86fcc3e0) at ports.c:2693
#5  0x00007fdf01e7b434 in peek_byte_or_eof (port=port@entry=#<port #<port-type 
file 7fdef9b5fb40> 7fdef59b7180>, 
    buf_out=buf_out@entry=0x7fde86fcc448, cur_out=cur_out@entry=0x7fde86fcc440) 
at ports.c:1526
#6  0x00007fdf01e7c1b5 in maybe_consume_bom (port=port@entry=#<port #<port-type 
file 7fdef9b5fb40> 7fdef59b7180>, 
    bom=bom@entry=0x7fdf01ee56ad <scm_utf8_bom> "\357\273\277scm_i_read_bytes", 
bom_len=bom_len@entry=3)
    at ports.c:2468
#7  0x00007fdf01e7b324 in port_clear_stream_start_for_bom_read 
(io_mode=BOM_IO_TEXT, 
    port=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>) at ports.c:2504
#8  scm_fill_input (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 
7fdef59b7180>, minimum_size=1, 
    minimum_size@entry=0, cur_out=cur_out@entry=0x7fde86fcc4e8, 
avail_out=avail_out@entry=0x7fde86fcc4e0)
    at ports.c:2648
#9  0x00007fdf01e7b434 in peek_byte_or_eof (port=port@entry=#<port #<port-type 
file 7fdef9b5fb40> 7fdef59b7180>, 
    buf_out=buf_out@entry=0x7fde86fcc548, cur_out=cur_out@entry=0x7fde86fcc540) 
at ports.c:1526
#10 0x00007fdf01e7b97c in peek_utf8_codepoint (port=#<port #<port-type file 
7fdef9b5fb40> 7fdef59b7180>, 
    buf_out=buf_out@entry=0x7fde86fcc5a8, cur_out=cur_out@entry=0x7fde86fcc5a0, 
len_out=len_out@entry=0x7fde86fcc598)
    at ports.c:1790
#11 0x00007fdf01e7bee4 in peek_codepoint (len=0x7fde86fcc598, 
cur=0x7fde86fcc5a0, buf=0x7fde86fcc5a8, 
    port=<optimized out>) at ports.c:1984
#12 scm_peek_char (port=<optimized out>) at ports.c:2202
#13 0x00007fdef997b1ab in ?? ()
#14 0x0000000000b82ae0 in ?? ()
#15 0x00007fdef997b120 in ?? ()
#16 0x00007fdec4292c48 in ?? ()
#17 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded1366000, 
mcode=0xb82af0 "\034\234\003") at jit.c:6038
#18 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded1366000) at 
vm-engine.c:360
#19 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized 
out>, nargs=0) at vm.c:1608
#20 0x00007fdf01e1ea0e in scm_call_with_unblocked_asyncs (proc=#<program 
7fdef59a0f60>) at async.c:406
--8<---------------cut here---------------end--------------->8---

A plausible scenario is a synchronization issue: ‘cuirass evaluate’ got
all the data it asked for from ‘guix repl’, but it’s still waiting on
read(2) even though there’s nothing left.  (There are two child ‘guix
repl’ processes and both are waiting on read(0, …).)

Thoughts?

Ludo’.




