grub-devel

grub.pxe, ARP-after-boot, DMA, and trouble


From: Daniel Kahn Gillmor
Subject: grub.pxe, ARP-after-boot, DMA, and trouble
Date: Sat, 07 Apr 2012 01:36:44 -0400
User-agent: Notmuch/0.12 (http://notmuchmail.org) Emacs/23.4.1 (i486-pc-linux-gnu)

I've recently been using grub.pxe (from debian's version 1.99-17)
according to the instructions at [0] to boot memtest86+ [1] (from
debian's version 4.20-1.1) over the network on x86 machines.  Due to the
problems described below, i'm using a serial console.

The grub configuration is very simple:

--------------------------
serial --speed=115200
terminal_input console serial
terminal_output console serial
menuentry 'memtest86+ serial console' {
  set root='(pxe)'
  echo 'loading memory tester...'
  linux16 /memtest86+.bin console=ttyS0,115200n8
}
--------------------------

On some of the machines i've tried this on, memtest86+ reports memory
failures very early in the run, and the failures happen even with brand
new sticks of RAM, placed in any combination and order in the hardware.
The errors are transient -- sometimes i'd get as many as ~300 32-bit
words of RAM failing, other times memtest could complete a full pass
with no errors.

The failures came during an early test where memtest86+ writes each
address's value to its own memory location, and then re-reads the memory
to verify.
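
To make that test concrete, here is a minimal Python sketch of the idea.
It is an illustration only, not memtest86+'s actual code; the function
names, the 256-byte buffer, and its base address are assumptions chosen
for the example.

```python
import struct

WORD = 4  # the test works on 32-bit words


def fill_with_own_address(mem, base):
    """Write each word's own (hypothetical) physical address into it."""
    for off in range(0, len(mem), WORD):
        struct.pack_into("<I", mem, off, base + off)


def find_bad_words(mem, base):
    """Re-read every word and return the addresses whose contents changed."""
    bad = []
    for off in range(0, len(mem), WORD):
        (value,) = struct.unpack_from("<I", mem, off)
        if value != base + off:
            bad.append(base + off)
    return bad


# Simulate something else writing 64 bytes into the region between the
# write pass and the verify pass.
base = 0x95D00
mem = bytearray(256)
fill_with_own_address(mem, base)
mem[0x30:0x30 + 64] = b"\x55" * 64
bad = find_bad_words(mem, base)
print(len(bad), hex(bad[0]))
```

Any write that lands in the region between the two passes shows up as a
run of "failed" words, even though the RAM itself is fine.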

Using the serial line, i was able to record the memory failures from a
run that had 24 words fail.  I was able to transcribe them and convert
them to a hexdump format.  These are the 24 words that failed (the
memory address indices are in the left-hand column):

*
00095d30  9c c8 e3 71 dc ff 00 65  32 4b a0 29 08 06 00 01
00095d40  08 00 06 04 00 01 00 65  32 4b a0 29 c0 a8 17 54
00095d50  00 00 00 00 00 00 c0 a8  17 86 55 55 55 55 55 55
00095d60  55 55 55 55 55 55 55 55  55 55 55 55 9a a2 8c 53
*
00097590  00 00 00 00 30 5d 09 00  40 00 00 00 04 00 00 00
000975a0  4d 4d 00 00 00 00 00 00  00 00 00 00 10 38 6a 94
*

The first block (of 16 words) appears to be an ARP request packet
from the local network's DHCP server to the failing machine (the MAC
addresses have been obfuscated here, and i didn't bother updating the
checksum to match).
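
As a rough check of that reading, the 64 bytes can be unpacked as an
Ethernet frame carrying an ARP payload. This is a sketch under the
assumption that the block starts at the Ethernet destination address and
that the last four bytes are the frame check sequence; the helper name
and dictionary keys are made up for the example.

```python
import ipaddress
import struct

# The 64 bytes of the first block, copied from the hexdump above
# (with the obfuscated MAC addresses).
FRAME = bytes.fromhex(
    "9c c8 e3 71 dc ff 00 65 32 4b a0 29 08 06 00 01 "
    "08 00 06 04 00 01 00 65 32 4b a0 29 c0 a8 17 54 "
    "00 00 00 00 00 00 c0 a8 17 86 55 55 55 55 55 55 "
    "55 55 55 55 55 55 55 55 55 55 55 55 9a a2 8c 53"
)


def mac(raw):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_arp_frame(frame):
    """Split a frame into Ethernet header, ARP fields, padding and trailer."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    (htype, ptype, hlen, plen, oper,
     sha, spa, tha, tpa) = struct.unpack("!HHBBH6s4s6s4s", frame[14:42])
    return {
        "dst_mac": mac(dst),
        "src_mac": mac(src),
        "ethertype": ethertype,            # 0x0806 is ARP
        "oper": oper,                      # 1 is an ARP request
        "sender_mac": mac(sha),
        "sender_ip": str(ipaddress.IPv4Address(spa)),
        "target_mac": mac(tha),
        "target_ip": str(ipaddress.IPv4Address(tpa)),
        "padding": frame[42:-4],           # pad up to the minimum frame size
        "trailer": frame[-4:],             # assumed to be the checksum
    }


info = parse_arp_frame(FRAME)
for key, value in info.items():
    print(key, value)
```

The EtherType and operation fields come out as 0x0806 and 1, which is
what an ARP request looks like, and the sender and target addresses are
both in 192.168.23.0/24. The checksum is not verified here, since it no
longer matches the obfuscated addresses.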

The second block (of 8 words) appears to contain a pointer to the
first block, a size indicator, and some other stuff i don't recognize.
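
The pointer and size are easy to see once the second block is read as
little-endian 32-bit words, as x86 stores them. This is a small sketch;
treating the block as eight independent words is an assumption, since i
don't know the real layout of this structure.

```python
import struct

# The 32 bytes of the second block, copied from the hexdump above.
DESCRIPTOR = bytes.fromhex(
    "00 00 00 00 30 5d 09 00 40 00 00 00 04 00 00 00 "
    "4d 4d 00 00 00 00 00 00 00 00 00 00 10 38 6a 94"
)

# Eight little-endian unsigned 32-bit words.
words = struct.unpack("<8I", DESCRIPTOR)
for index, word in enumerate(words):
    print(index, f"0x{word:08x}")

pointer = words[1]  # address of the first block
length = words[2]   # 0x40 = 64 bytes = 16 words
```

Word 1 is 0x00095d30, the address of the first block, and word 2 is
0x40, which matches that block's 64-byte length. The remaining words are
the part i don't recognize.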

So i think what's happening is something like Matthew Garrett describes
in his recent work with UEFI [2], although i'm using BIOS and not UEFI.

In particular, i suspect that *after* the bootloader has turned over
control to the kernel (memtest in this case), the PXE-driven NIC is
continuing to DMA received packets into active RAM.

This seems pretty dangerous!

Would using pxe_unload before the close of the stanza prevent this
situation from happening?  (I regret that i haven't been able to test it
myself, because i haven't had access to the failing hardware since i
completed this diagnosis.)  If so, it seems like that should be clearly
documented and strongly recommended in grub.texi.

Or, should grub be marking certain sections of memory as unavailable
somehow before handoff to the kernel?

Or is there some other way to avoid this sort of corruption?

I've seen similar failures now on pretty different hardware (a fairly
old Dell Optiplex GX260 SFF and a new Lenovo ThinkCentre M77).

Any ideas?

        --dkg

[0] https://www.gnu.org/software/grub/manual/grub.html#Network
[1] http://www.memtest.org/
[2] http://mjg59.dreamwidth.org/11235.html


