Misleading error message from lt_dlopen()
From: Jeff Squyres
Subject: Misleading error message from lt_dlopen()
Date: Thu, 23 Oct 2008 09:35:04 -0400
Greetings!
We have run across a misleading error message from lt_dlerror() after
a failed lt_dlopenadvise() in LT 2.2.6a that caused considerable
confusion for some developers in the Open MPI project for a while; I
had to step through lt_dlopen() to figure out what was going on.
Open MPI uses lots of DSO plugins, and we use lt_dlopenadvise() to
open them. One of our developers is working in a temp branch creating
some new functionality, including some new DSOs. However,
lt_dlopenadvise() was returning NULL for one of his new DSOs, and
lt_dlerror() was returning "file not found". We could clearly see
that the .la and .so files for the DSO were in the right place in the
filesystem, and the string filename we were passing to lt_dlopen() was
correct. The DSO has no obscure library dependencies; ldd showed that
all of them are present. So how could the error be "file not found"?
(I was doing my testing on an RHEL4U4 system, but I think the problem
is a bit more generic)
I stepped through lt_dlopen() and discovered the real error: his DSO
was referencing a symbol that didn't exist, and therefore the
underlying dlopen() failed. dlopen.c:198 correctly called dlerror()
and LT__SETERRORSTR() to set the error string to
"/home/jsquyres/bogus/lib/openmpi/mca_routed_binomial.so: undefined
symbol: orte_routed_tree_t_class" (this is the real error), and returned
NULL for the module.
So far, so good.
But then tryall_dlopen() advances on to the next loader --
lt_preopen. It [predictably] fails because we have not preopened this
DSO. But then preopen.c:188 calls LT__SETERROR(FILE_NOT_FOUND). This
is now the last error reported, but it really isn't accurate.
Later, as the stack is unwinding, lt_dlopenadvise(), at ltdl.c:1664,
*also* calls:
    /* Still here?  Then we really did fail to locate any of the file
       names we tried.  */
    LT__SETERROR (FILE_NOT_FOUND);
    return 0;
This also sets the last reported error string to "file not found".
But this seems clearly wrong (at least in this case): the fact that
we're falling out through the error case in lt_dlopenadvise() does
*not* indicate that the file was not found -- it just means that
nothing was successfully loaded. The real error can be (and is, in
this case) something else.
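To make the sequence concrete, here is a small standalone model of the "last error wins" behavior described above. It is only a sketch: every sim_* name is hypothetical and nothing here is actual libltdl code; the two stand-in loaders simply replay what I observed in the debugger.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for libltdl internals; none of these names
   exist in libltdl itself. */
static const char *sim_last_error = NULL;

typedef int (*sim_loader_fn)(const char *filename);

/* Models the dlopen loader: the file is found, but symbol resolution
   fails, so the specific dlerror() text is recorded. */
static int sim_dlopen_loader(const char *filename) {
    (void)filename;
    sim_last_error = "undefined symbol: orte_routed_tree_t_class";
    return 0; /* failure */
}

/* Models the preopen loader: the module was never preopened, so it
   records the generic "file not found" error. */
static int sim_preopen_loader(const char *filename) {
    (void)filename;
    sim_last_error = "file not found";
    return 0; /* failure */
}

/* Tries each loader in order. Returns 1 on the first success, 0 if all
   loaders fail. Each failing loader overwrites the previous error. */
static int sim_tryall(const char *filename) {
    sim_loader_fn loaders[] = { sim_dlopen_loader, sim_preopen_loader };
    size_t n = sizeof loaders / sizeof loaders[0];
    size_t i;

    assert(filename != NULL);
    for (i = 0; i < n; i++) {
        if (loaders[i](filename)) {
            return 1;
        }
    }
    /* Models the final fall-through in the caller, which sets the
       generic error once more. */
    sim_last_error = "file not found";
    return 0;
}

/* Returns 1 if the last recorded error equals `expected`, else 0. */
static int sim_error_is(const char *expected) {
    return sim_last_error != NULL && strcmp(sim_last_error, expected) == 0;
}
```

Calling sim_tryall() with any filename returns 0, and sim_error_is("file not found") is then true: the "undefined symbol" text recorded by the first loader has been lost by the time the caller asks for the error.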
I realize that this is a somewhat complex issue because libltdl has a
generic loader engine and it's just reporting the "last" error. So I
don't know what the right solution is, but from the perspective of
someone who is using libltdl, I would much rather have the "missing
symbol" error reported than the misleading "file not found"
[non-]error.
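One possible policy, purely as an illustration and not a claim about how libltdl should or does implement it: rank errors by how informative they are, and never let a generic "file not found" replace a more specific error that was already recorded. The keep_* names below are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h> /* for strcmp() in callers that compare the text */

/* Hypothetical ranking of errors; higher values are more informative. */
enum keep_kind { KEEP_NONE = 0, KEEP_GENERIC, KEEP_SPECIFIC };

struct keep_error {
    enum keep_kind kind;
    const char *text;
};

/* Records an error only if it is at least as informative as the one
   already stored, so a generic error cannot hide a specific one. */
static void keep_set(struct keep_error *e, enum keep_kind kind,
                     const char *text) {
    assert(e != NULL && text != NULL);
    if (kind >= e->kind) {
        e->kind = kind;
        e->text = text;
    }
}

/* Replays the failure sequence from this report and returns the error
   text that survives under the policy above. */
static const char *keep_replay(struct keep_error *err) {
    assert(err != NULL);
    err->kind = KEEP_NONE;
    err->text = NULL;

    /* dlopen loader: real failure reported by dlerror() */
    keep_set(err, KEEP_SPECIFIC,
             "undefined symbol: orte_routed_tree_t_class");
    /* preopen loader: module was not preopened */
    keep_set(err, KEEP_GENERIC, "file not found");
    /* final fall-through after all loaders failed */
    keep_set(err, KEEP_GENERIC, "file not found");

    return err->text;
}
```

With this policy, keep_replay() returns the "undefined symbol" text instead of "file not found". If no loader ever records a specific error, the generic one is still reported, so the genuine "file not found" case is unchanged.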
FWIW: prior versions of libltdl *did* report the "missing symbol"
error properly, so one could actually consider this a regression
against prior behavior.
--
Jeff Squyres
Cisco Systems