guile-user
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Now crashing [was Re: guile-2.9.2 and threading


From: Linas Vepstas
Subject: Re: Now crashing [was Re: guile-2.9.2 and threading
Date: Wed, 17 Jul 2019 11:27:47 -0500

Investigating the crash with good-old printf's in libguile/vm.c produces
a vast ocean of prints ... that should have not been printed, and/or should
have been actual errors, but somehow were not handled by scm_error.
Using today's git pull of master, here's the diff containing a printf:

--- a/libguile/vm.c
+++ b/libguile/vm.c
@@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }

       proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
       SCM_FRAME_LOCAL (vp->fp, 0) = proc;
       return SCM_PROGRAM_CODE (proc);
     }

+printf("duuude wrong type to apply!\n"
+"proc=%lx\n"
+"ip=%p\n"
+"sp=%p\n"
+"fp=%p\n"
+"sp_min=%p\n"
+"stack_lim=%p\n",
+SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
+vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
+fflush(stdout);
+
   vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);

   scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
              scm_list_1 (proc), scm_list_1 (proc));
 }

As you can see, shortly after my printf, there should have been an error
report.
There is no error report... until 5-10 minutes later, when the error report
itself
causes a crash.  Before then, I get an endless high-speed spew of prints:

duuude wrong type to apply!
proc=2e
ip=0x7efcbbd5b69c
sp=0x7efcd14589b8
fp=0x7efcd14589c0
sp_min=0x7efcd1458708
stack_lim=0x7efcd1458000
duuude wrong type to apply!
proc=36
ip=0x7efcbbd5b69c
sp=0x7efcd15ffad0
fp=0x7efcd15ffad8
sp_min=0x7efcd15ff528
stack_lim=0x7efcd15ff000
duuude wrong type to apply!
proc=36
ip=0x7efcbbd5b69c
sp=0x7efcd14568a0
fp=0x7efcd14568a8
sp_min=0x7efcd1456318
stack_lim=0x7efcd1456000
duuude wrong type to apply!
proc=2a
ip=0x7efcbbd5b69c
sp=0x7efcd15ffb40
fp=0x7efcd15ffb48
sp_min=0x7efcd15ff3d8
stack_lim=0x7efcd15ff000
duuude wrong type to apply!
proc=26
ip=0x7efcbbd5b69c
sp=0x7efcd15fdc90
fp=0x7efcd15fdc98
sp_min=0x7efcd15fd798
stack_lim=0x7efcd15fd000
duuude wrong type to apply!
proc=22
ip=0x7efcbbd5b69c
sp=0x7efcd152ba98
fp=0x7efcd152baa0
sp_min=0x7efcd152b888
stack_lim=0x7efcd152b000
duuude wrong type to apply!

The addresses all look healthy. I also added a guard-word, looking
for memory corruption, but none was found. Next up: decoding the
SCM by hand, and figuring out why it's there.

-- Linas

On Sun, Jul 14, 2019 at 10:03 PM Linas Vepstas <address@hidden>
wrote:

> Exactly the same crash, same stack trace (slightly different line numbers),
> with a fresh pull today:
>     commit 89e28df1c9069dcb65188fe7b3973c333d87d7e2
>     Author: Andy Wingo <address@hidden>
>     Date:   Thu Jun 20 14:02:05 2019 +0200
> which is the current HEAD on master.
>
> FWIW, 60-odd guile threads waiting here:
>
> #0  __lll_lock_wait () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f45ff4fbdbd in __GI___pthread_mutex_lock (mutex=mutex@entry
> =0x16e1f68)
>     at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f45ff7e085f in scm_c_weak_set_lookup (set=<optimized out>,
>     raw_hash=raw_hash@entry=551753256168943069,
>     pred=pred@entry=0x7f45ff7d1660 <string_lookup_predicate_fn>,
>     closure=closure@entry=0x7f3e44ff7ac0, dflt=dflt@entry=0x4)
>     at ../../libguile/weak-set.c:760
> #3  0x00007f45ff7d11e9 in lookup_interned_symbol
> (raw_hash=551753256168943069,
>     name=0x2965ca0) at ../../libguile/symbols.c:112
> #4  scm_i_str2symbol (str=0x2965ca0) at ../../libguile/symbols.c:244
>
> The parallelism is low because of this one lock.  This appears to be the
> primary bottleneck for my workload.
>
> -- Linas
>
>
> On Sun, Jul 14, 2019 at 5:03 PM Linas Vepstas <address@hidden>
> wrote:
>
>> Below was for
>> guile (GNU Guile) 2.9.2.14-1fb399
>>
>> --linas
>>
>> On Sun, Jul 14, 2019 at 4:59 PM Linas Vepstas <address@hidden>
>> wrote:
>>
>>>
>>> So, here's my next installment on using guile-2.9.2. The first
>>> installment said that I'd piled up CPU-months of guile 2.9.2 experience
>>> without any crashes. Well, now, a different workload crashes in minutes.
>>> Below is a highly simplified, edited gdb session -- it crashes because it
>>> unexpectedly aborts, during an abort(!) because `get_callee_vcode()`
>>> failed. Harrumpf.
>>>
>>> Background: there are 140 threads, half in guile, the other half waiting
>>> for guile to finish. Yes, that's too many, but anyways ... 70 threads in
>>> guile and one crashed:
>>>
>>> #2  0x00007f85d3f6ecdb in capture_delimited_continuation (
>>>     current_registers=<optimized out>, dynstack=<optimized out>,
>>>     saved_registers=<optimized out>, saved_mra=<optimized out>,
>>>     saved_fp=<optimized out>, vp=<optimized out>) at
>>> ../../libguile/vm.c:1327
>>> #3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
>>>     at ../../libguile/vm.c:1454
>>>
>>> Both frames are interesting, because libguile/vm.c:1327 shows
>>>   if (SCM_FRAME_DYNAMIC_LINK (base_fp) != saved_fp)
>>>      abort();
>>> hey!? who called this? line 1454 is in the middle of abort_to_prompt ()
>>> Yow! an unexpected abort during an abort...
>>>
>>> How did we get here?
>>> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
>>>     subr=<optimized out>, message=message@entry=0x1607c9380,
>>>     args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
>>>     at ../../libguile/error.c:90
>>> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
>>>     message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
>>> args=0x15af130e0,
>>>     rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
>>> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
>>>     at ../../libguile/vm.c:1527
>>>
>>> and libguile/vm.c:1527 tells me that get_callee_vcode () is very
>>> unhappy. But why? I cannot tell .. after that, things peter out in boring
>>> stack frames that started with my call scm_c_catch() ... the same seemingly
>>> harmless call that is pending in 70 other threads. (the same call that has
>>> survived several CPU month of pounding with a different collection of
>>> scheme code)
>>>
>>> My best guess is that the current workload, by unintentionally launching
>>> gobs of threads is exposing a race condition that has been hithertho
>>> hidden.  I don't know how to debug any further.  I will try a slightly
>>> newer guile shortly, to see if I get lucky.
>>>
>>> -- Linas
>>>
>>> p.s. here's the whole stack trace. But really, its boring, except for
>>> the above highlights.
>>>
>>>
>>>
>>> (gdb) bt
>>> #0  0x00007f85d38ef428 in __GI_raise (sig=sig@entry=6)
>>>     at ../sysdeps/unix/sysv/linux/raise.c:54
>>> #1  0x00007f85d38f102a in __GI_abort () at abort.c:89
>>> #2  0x00007f85d3f6ecdb in capture_delimited_continuation (
>>>     current_registers=<optimized out>, dynstack=<optimized out>,
>>>     saved_registers=<optimized out>, saved_mra=<optimized out>,
>>>     saved_fp=<optimized out>, vp=<optimized out>) at
>>> ../../libguile/vm.c:1327
>>> #3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
>>>     at ../../libguile/vm.c:1454
>>> #4  0x00007f85ac539041 in ?? ()
>>> #5  0x00007f85ac37f040 in ?? ()
>>> #6  0x00007f85d41e10c0 in jump_table_ () from
>>> /usr/local/lib/libguile-3.0.so.0
>>> #7  0x000000015f692dc0 in ?? ()
>>> #8  0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>>> #9  0x00007f85d3f70600 in vm_debug_engine (thread=0x7f85ac539000)
>>>     at ../../libguile/vm-engine.c:370
>>> #10 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0xe45a20,
>>> argv=<optimized out>,
>>>     nargs=5) at ../../libguile/vm.c:1605
>>> #11 0x00007f85d3eefdcb in scm_apply_0 (proc=0xe45a20, args=0x304)
>>>     at ../../libguile/eval.c:603
>>> #12 0x00007f85d3ef0a0d in scm_apply_1 (proc=<optimized out>,
>>>     arg1=arg1@entry=0xdc5420, args=args@entry=0x15af130a0)
>>>     at ../../libguile/eval.c:609
>>> #13 0x00007f85d3f6c546 in scm_throw (key=key@entry=0xdc5420,
>>> args=0x15af130a0)
>>>     at ../../libguile/throw.c:272
>>> #14 0x00007f85d3f6caf9 in scm_ithrow (key=key@entry=0xdc5420,
>>> args=<optimized out>,
>>>     no_return=no_return@entry=1) at ../../libguile/throw.c:619
>>> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
>>>     subr=<optimized out>, message=message@entry=0x1607c9380,
>>>     args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
>>>     at ../../libguile/error.c:90
>>> ---Type <return> to continue, or q <return> to quit---
>>> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
>>>     message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
>>> args=0x15af130e0,
>>>     rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
>>> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
>>>     at ../../libguile/vm.c:1527
>>> #18 0x00007f85b4314805 in ?? ()
>>> #19 0x00007f85b428a000 in ?? ()
>>> #20 0x00007f85d41e10c0 in jump_table_ () from
>>> /usr/local/lib/libguile-3.0.so.0
>>> #21 0x000000015f692dc0 in ?? ()
>>> #22 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>>> #23 0x00007f85d3f70600 in vm_debug_engine (thread=0x2)
>>>     at ../../libguile/vm-engine.c:370
>>> #24 0x00007f85d3f76db2 in scm_call_n (proc=<optimized out>,
>>>     argv=argv@entry=0x7f7e85fe2600, nargs=nargs@entry=3) at
>>> ../../libguile/vm.c:1605
>>> #25 0x00007f85d3eef97f in scm_call_3 (proc=<optimized out>,
>>> arg1=<optimized out>,
>>>     arg2=<optimized out>, arg3=<optimized out>) at
>>> ../../libguile/eval.c:510
>>> #26 0x00007f85d4262b6f in ?? ()
>>> #27 0x00007f85d4262a80 in ?? ()
>>> #28 0x00007f85d41e10c0 in jump_table_ () from
>>> /usr/local/lib/libguile-3.0.so.0
>>> #29 0x000000015f692dc0 in ?? ()
>>> #30 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>>> #31 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
>>>     at ../../libguile/vm-engine.c:370
>>> #32 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341ee0,
>>>     argv=argv@entry=0x0, nargs=nargs@entry=0) at
>>> ../../libguile/vm.c:1605
>>> #33 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341ee0)
>>>     at ../../libguile/eval.c:490
>>> #34 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404,
>>> thunk=0x15b341ee0,
>>>     handler=0x15b341ec0, pre_unwind_handler=0x15b341ea0) at
>>> ../../libguile/throw.c:146
>>> #35 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler
>>> (key=key@entry=0x404,
>>> ---Type <return> to continue, or q <return> to quit---
>>>     thunk=<optimized out>, handler=<optimized out>,
>>>     pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
>>> #36 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
>>> body=<optimized out>,
>>>     body_data=<optimized out>,
>>>     handler=handler@entry=0x7f85c95d0f00
>>> <opencog::SchemeEval::catch_handler_wrapper(void*, scm_unused_struct*,
>>> scm_unused_struct*)>,
>>>     handler_data=handler_data@entry=0x7f7e60000980,
>>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85c95d0c40
>>> <opencog::SchemeEval::preunwind_handler_wrapper(void*, scm_unused_struct*,
>>> scm_unused_struct*)>,
>>>     pre_unwind_handler_data=0x7f7e60000980) at ../../libguile/throw.c:385
>>> #37 0x00007f85c95d122a in opencog::SchemeEval::do_eval
>>> (this=0x7f7e60000980,
>>>     expr="(observe-mpg \"The countess, with her loving heart, felt that
>>> her children were being ruined, that it was not the count's fault for he
>>> could not help being what he was -- that (though he tried to hide "...)
>>>     at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:590
>>> #38 0x00007f85c95d12aa in opencog::SchemeEval::c_wrap_eval
>>> (p=0x7f7e60000980)
>>>     at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:507
>>> #39 0x00007f85d3eeb47a in c_body (d=0x7f7e85fe2d40)
>>>     at ../../libguile/continuations.c:430
>>> #40 0x00007f85d4262b6f in ?? ()
>>> #41 0x00007f85d4262a80 in ?? ()
>>> #42 0x00007f85d41e10c0 in jump_table_ () from
>>> /usr/local/lib/libguile-3.0.so.0
>>> #43 0x000000015f692dc0 in ?? ()
>>> #44 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>>> #45 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
>>>     at ../../libguile/vm-engine.c:370
>>> #46 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341fe0,
>>>     argv=argv@entry=0x0, nargs=nargs@entry=0) at
>>> ../../libguile/vm.c:1605
>>> #47 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341fe0)
>>>     at ../../libguile/eval.c:490
>>> #48 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404,
>>> thunk=0x15b341fe0,
>>> ---Type <return> to continue, or q <return> to quit---
>>>     handler=0x15b341fc0, pre_unwind_handler=0x15b341fa0) at
>>> ../../libguile/throw.c:146
>>> #49 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler
>>> (key=key@entry=0x404,
>>>     thunk=<optimized out>, handler=<optimized out>,
>>>     pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
>>> #50 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
>>>     body=body@entry=0x7f85d3eeb470 <c_body>,
>>>     body_data=body_data@entry=0x7f7e85fe2d40,
>>>     handler=handler@entry=0x7f85d3eeb720 <c_handler>,
>>>     handler_data=handler_data@entry=0x7f7e85fe2d40,
>>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
>>> <pre_unwind_handler>,
>>>     pre_unwind_handler_data=0xe174a0) at ../../libguile/throw.c:385
>>> #51 0x00007f85d3eeb9e3 in scm_i_with_continuation_barrier (
>>>     body=body@entry=0x7f85d3eeb470 <c_body>,
>>>     body_data=body_data@entry=0x7f7e85fe2d40,
>>>     handler=handler@entry=0x7f85d3eeb720 <c_handler>,
>>>     handler_data=handler_data@entry=0x7f7e85fe2d40,
>>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
>>> <pre_unwind_handler>,
>>>     pre_unwind_handler_data=0xe174a0) at
>>> ../../libguile/continuations.c:368
>>> #52 0x00007f85d3eebac5 in scm_c_with_continuation_barrier
>>> (func=<optimized out>,
>>>     data=<optimized out>) at ../../libguile/continuations.c:464
>>> #53 0x00007f85d3575127 in GC_call_with_gc_active (
>>>     fn=fn@entry=0x7f85d3f6a070 <with_guile_trampoline>,
>>>     client_data=client_data@entry=0x7f7e85fe2e20) at
>>> ../pthread_support.c:1343
>>> #54 0x00007f85d3f6ac4f in with_guile (base=base@entry=0x7f7e85fe2df0,
>>>     data=data@entry=0x7f7e85fe2e20) at ../../libguile/threads.c:683
>>> #55 0x00007f85d356f132 in GC_call_with_stack_base (
>>>     fn=fn@entry=0x7f85d3f6abb0 <with_guile>, arg=arg@entry
>>> =0x7f7e85fe2e20)
>>>     at ../misc.c:1941
>>> #56 0x00007f85d3f6aff8 in scm_i_with_guile (dynamic_state=<optimized
>>> out>,
>>>     data=0x7f7e60000980,
>>>     func=0x7f85c95d1290 <opencog::SchemeEval::c_wrap_eval(void*)>)
>>>     at ../../libguile/threads.c:698
>>> ---Type <return> to continue, or q <return> to quit---
>>> #57 scm_with_guile (
>>>     func=func@entry=0x7f85c95d1290
>>> <opencog::SchemeEval::c_wrap_eval(void*)>,
>>>     data=data@entry=0x7f7e60000980) at ../../libguile/threads.c:704
>>> #58 0x00007f85c95d126e in opencog::SchemeEval::eval_expr
>>> (this=0x7f7e60000980,
>>>     expr=...) at
>>> /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:479
>>> #59 0x00007f85bc783439 in opencog::GenericShell::eval_loop
>>> (this=0x7f7ef0001e90)
>>>     at
>>> /home/ubuntu/src/opencog/opencog/cogserver/shell/GenericShell.cc:588
>>> #60 0x00007f85c6e5ac80 in ?? () from
>>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #61 0x00007f85d3c916ba in start_thread (arg=0x7f7e85fe3700) at
>>> pthread_create.c:333
>>> #62 0x00007f85d39c141d in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>> (gdb)
>>>
>>>
>>>
>>> --
>>> cassette tapes - analog TV - film cameras - you
>>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>


-- 
cassette tapes - analog TV - film cameras - you


reply via email to

[Prev in Thread] Current Thread [Next in Thread]