bug-hurd

Re: bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)


From: Ludovic Courtès
Subject: Re: bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)
Date: Mon, 10 Oct 2022 23:14:15 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)

Ludovic Courtès <ludo@gnu.org> wrote:

> Through a dichotomy I tried to see how far it goes.  The info I have so
> far is that ld.so errors out from elf/rtld.c:563 (line 565 is not
> reached):
>
> 558:  if (bootstrap_map.l_addr || ! bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)])
> 559:    {
> 560:      /* Relocate ourselves so we can do normal function calls and
> 561:         data access using the global offset table.  */
> 562:
> 563:      ELF_DYNAMIC_RELOCATE (&bootstrap_map, 0, 0, 0);
> 564:    }
> 565:  bootstrap_map.l_relocated = 1;
> ...
> 578:  __rtld_malloc_init_stubs ();

Via brute force¹, I found that ‘__assert_fail’ is hit, with its first
argument in $eax being:

--8<---------------cut here---------------start------------->8---
db> x/c 0x28604,80
                ELF32_R_TYPE (reloc->r_info) == R_386_RELATIVE\000\000map->l_in
                fo[VERSYMIDX (DT_VERSYM)] != NULL\000\000Fatal glibc error: Too
                 many audit mo
--8<---------------cut here---------------end--------------->8---

This comes from i386/dl-machine.h:

--8<---------------cut here---------------start------------->8---
auto inline void
__attribute ((always_inline))
elf_machine_rel_relative (Elf32_Addr l_addr, const Elf32_Rel *reloc,
                          void *const reloc_addr_arg)
{
  Elf32_Addr *const reloc_addr = reloc_addr_arg;
  assert (ELF32_R_TYPE (reloc->r_info) == R_386_RELATIVE);
  *reloc_addr += l_addr;
}
--8<---------------cut here---------------end--------------->8---

How can we get there?  Looking at ‘_dl_start’, it could be that
‘elf_machine_load_address’ returns a bogus value and we end up reading
wrong ELF data?  Or it could be memory corruption somewhere.  Or…?
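
To make the "bogus load address" hypothesis concrete, here is a rough sketch (not glibc code) of what reading a relocation table from the wrong place would look like.  The names `Rel32`, `REL32_TYPE`, `REL_386_RELATIVE` and `count_non_relative_at` are made up for illustration; they only mirror the layout of `Elf32_Rel`, the formula of `ELF32_R_TYPE`, and the value of `R_386_RELATIVE` from <elf.h>.  The assumption is that a wrong load address shifts where the entries are read from, so the type byte no longer matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins for the <elf.h> definitions (assumed names).  */
typedef struct { uint32_t r_offset; uint32_t r_info; } Rel32;
#define REL32_TYPE(info) ((info) & 0xff)  /* same formula as ELF32_R_TYPE */
#define REL_386_RELATIVE 8                /* same value as R_386_RELATIVE */

/* Hypothetical helper: read N relocation entries from IMAGE starting at
   byte OFFSET and count those whose type is not the relative type, i.e.
   the entries that would trip the assertion quoted above.  */
static size_t
count_non_relative_at (const unsigned char *image, size_t offset, size_t n)
{
  assert (image != NULL);
  size_t bad = 0;
  for (size_t i = 0; i < n; i++)
    {
      Rel32 rel;
      /* memcpy avoids unaligned access when OFFSET is off by a few bytes.  */
      memcpy (&rel, image + offset + i * sizeof rel, sizeof rel);
      if (REL32_TYPE (rel.r_info) != REL_386_RELATIVE)
        bad++;
    }
  return bad;
}
```

With two valid entries stored at byte 16 of a zeroed 32-byte buffer, reading at offset 16 finds no mismatch, whereas reading 4 bytes too early interprets each `r_offset` as an `r_info` and every entry fails the type check.  This only models the wrong-address case; it says nothing about the memory-corruption possibility.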

Thing is, it’s not fully deterministic (happens 9 times out of 10 with
KVM, never happens without KVM).

Ideas?  :-)

Ludo’.

¹ Building with ‘-fno-optimize-sibling-calls’ didn’t help get nicer
  backtraces, but that’s prolly because all that early relocation code
  is inlined.


