GRUB hangs before menu, after a HDD upgrade. How to debug?

I'm going to answer the third part of my question, about a procedure to install GRUB with debugging enabled. I'd still appreciate informed suggestions about where the trouble may lie, or strategies to solve it with minimal downtime and maximum information as to the cause.


Some general points: GRUB provides other methods of debugging. grub-mkrescue produces an .iso with every module you might need built in, so, like a live USB, it can be used to navigate a RAID array and try to load the .cfg file or even the kernel. The grub-emu emulator is available in most distros, but is oriented more towards checking how the menu will look. More advanced is the standard GRUB module for debugging with gdb over a serial cable.

Procedure to install GRUB with debugging enabled

The procedure to get debug messages is referred to in section 6 of the GRUB manual, but not in detail. First, consider doing the debugging over a serial console, running script before screen to record the debug messages. You need root privileges for everything below. Note that the drive layout in this answer is just an example and does not necessarily match the question. Assume that the normal (non-debug) GRUB is installed to the other drives as appropriate: this is just the procedure for installing a debug GRUB to the drive that you expect to boot. (That way the debug messages make it obvious which drive is booting. When installing to a RAID partition, the prefix is likely to be the same in both cases, so you can just run the same command for /dev/sdb as for /dev/sda.)

Firstly, check where the existing GRUB files are: /boot/grub or, more likely, /boot/grub/<platform>. In this case assume they are in /boot/grub/i386-pc/. We'll not modify the files already there, but add an additional core image with debug enabled. If the .cfg files are missing or have been modified, regenerate them as standard with grub-mkconfig -o /boot/grub/grub.cfg.

Checking installed modules and prefix

The quick and dirty way to show which modules are already compiled into your core image is just to run grub-install again. This works in GRUB 2.02:

grub-install -v /dev/sda 2>&1 | grep '\(mkimage\|setup\)'

In a simple case without RAID or LVM this might reveal a list like ext2 part_gpt biosdisk. GRUB 1.99, however, does not use -v for verbose, so use --debug instead. We'll combine this with a trick to avoid actually installing the image, which saves a little time:

grub-install --debug --grub-setup=/bin/true /dev/sda 2>&1 | grep '\(-mkimage\|-setup\|true\)'

Note that grub-install can run shell scripts in place of the programs it calls, so instead we could have done something like:

# create grub-mkimage wrapper
cat > /usr/local/bin/grub-mkimage.sh <<"EOF"
#!/bin/bash
echo "Arguments to grub-mkimage: $*"
exec /usr/bin/grub-mkimage "$@"
EOF
# create a dummy grub-setup
cat > /usr/local/bin/grub-setup.sh <<"EOF"
#!/bin/bash
echo Arguments are: $*
EOF
# run grub-install using the above
chmod u+x /usr/local/bin/grub-*.sh
grub-install --grub-mkimage=/usr/local/bin/grub-mkimage.sh \
  --grub-setup=/usr/local/bin/grub-setup.sh /dev/sda 2>&1 \
  | grep 'Arguments' | tee grub-args.txt

Paths of course may vary according to your distribution and chosen shell.
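The prefix is the part of that output that matters most later on. As a rough sketch, assuming the wrapper's "Arguments to grub-mkimage:" line format from above, a small helper can pull it out of the saved line. The function name extract_prefix and the sample argument values are made up for illustration; the real arguments depend on your GRUB version.

```shell
# Hypothetical helper: print the value given to --prefix or -p in an argument line.
extract_prefix() {
    # Split the line on whitespace into positional parameters, with globbing off.
    set -f
    set -- $1
    set +f
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --prefix=*) printf '%s\n' "${1#--prefix=}"; return 0 ;;
            --prefix|-p) printf '%s\n' "$2"; return 0 ;;
        esac
        shift
    done
    return 1
}

# Sample line in the wrapper's format; the argument values are invented.
sample='Arguments to grub-mkimage: -d /usr/lib/grub/i386-pc -O i386-pc --output=/boot/grub/i386-pc/core.img --prefix=(,msdos3)/boot/grub ext2 part_msdos biosdisk'
extract_prefix "$sample"
```

On a real system you would feed it the matching line from grub-args.txt instead of the sample.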

Setting the debug variable

We now create a file we can call debug.cfg with the debug settings. (The core generates a non-fatal error if it encounters a comment at this stage, so we won't use any.)

set pager=1
set debug='init modules disk ata,scsi,linuxefi,efi,badram,drivemap linux,fs,elf,dl,chain serial,usb,usb_keyboard,video'
set
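One way to write that file from a shell, with a guard against stray comment lines, might look like this. It is only a sketch: the file is written to the current directory, and the guard is my own addition rather than anything GRUB requires.

```shell
cfg=debug.cfg

# Quoted heredoc delimiter: the shell copies the lines verbatim, no expansion.
cat > "$cfg" <<'EOF'
set pager=1
set debug='init modules disk ata,scsi,linuxefi,efi,badram,drivemap linux,fs,elf,dl,chain serial,usb,usb_keyboard,video'
set
EOF

# Warn if any line starts with '#', since comments cause an error at this stage.
if grep -q '^[[:space:]]*#' "$cfg"; then
    echo "comment lines found in $cfg" >&2
else
    echo "$cfg written without comment lines"
fi
```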

Any combination of whitespace, commas (,), semicolons (;) or pipes (|) can be used to separate the facility names within the string.

I extracted the list of debug facilities from the GRUB 2.02 source and ordered them semantically. 'all' produces too much memory information from the scripting interpreter. There are additional facilities for particular filesystems like 'xfs' and 'reiserfs', as well as 'net', 'partition' and 'loader'. ('loader' comes too late for what we're interested in, which happens before the menu. If we can get a menu, we can set the debug variable there.) Unfortunately there are no debug messages in the 'mdraid_linux' source, but 'disk' shows the most important operations.
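To make the separator rule concrete, here is a shell imitation of the lookup: a facility counts as enabled when it appears as a whole word in the string, or when 'all' does. This is an illustration written for this answer, not GRUB's code, and debug_enabled is a made-up name.

```shell
# Illustration only: mimic how a facility name is looked up in the debug string.
debug_enabled() {
    # $1 = value of the debug variable, $2 = facility name to test
    for word in $(printf '%s' "$1" | tr ',;|' '   '); do
        if [ "$word" = "all" ] || [ "$word" = "$2" ]; then
            return 0
        fi
    done
    return 1
}

debug='init modules disk ata,scsi,linuxefi,efi,badram,drivemap linux,fs,elf,dl,chain serial,usb,usb_keyboard,video'
debug_enabled "$debug" disk && echo "disk: on"
debug_enabled "$debug" usb_keyboard && echo "usb_keyboard: on"
debug_enabled "$debug" net || echo "net: off"
```

Matching is on whole words, so listing usb_keyboard does not by itself enable a facility called keyboard.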

The pager variable is needed to read the debug messages if you are not capturing them over a console (for instance with script). I've found that pager doesn't work without including an additional module like sleep or configfile, which more than doubles the size of the image. The debug environment variable takes effect regardless.

Installing

Now make a variant image of the one you want to debug:

grub-mkimage -p '(,msdos3)/boot/grub' -c debug.cfg \
   -O i386-pc -o dcore.img -C auto ext2 part_msdos biosdisk

The list of modules is the one reported by grub-install for the image you want to debug, plus sleep or anything else you need. The prefix (-p) should be copied from the output of grub-install too, as it has a huge effect on what happens after the GRUB banner. You may, however, want to experiment with using a GRUB device code (as in this case) rather than the standard UUID. You can show UUIDs with lsblk -o NAME,TYPE,FSTYPE,LABEL,SIZE,STATE,UUID or ls -l /dev/disk/by-uuid/, and on RAID member drives with mdadm --examine /dev/sda (or mdadm --detail on the assembled md device).
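To keep the prefix and module list in one place while experimenting, a dry-run helper that only prints the command can be handy. This is a sketch using the example values from this answer; mkimage_cmd is a made-up name, and nothing is executed.

```shell
# Values below are the example ones from this answer; substitute your own.
prefix='(,msdos3)/boot/grub'
modules='ext2 part_msdos biosdisk'   # as reported by grub-install
extra='sleep'                        # optional additions, e.g. for the pager

# Print (do not run) the grub-mkimage command line.
mkimage_cmd() {
    printf "grub-mkimage -p '%s' -c debug.cfg -O i386-pc -o dcore.img -C auto %s\n" "$1" "$2"
}

mkimage_cmd "$prefix" "$modules $extra"
```

Once the printed line looks right, copy it and run it as root.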

Now install the core that has just been created to whichever disk is normally booted:

cp dcore.img /boot/grub/i386-pc
grub-bios-setup -d /boot/grub/i386-pc -c dcore.img /dev/sda
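Since extra modules such as sleep enlarge the image, a quick size check on the copied dcore.img may help explain a grub-bios-setup failure. The sketch below assumes an embedding area of 62 sectors of 512 bytes, which is what you get when the first partition starts at sector 63. Disks partitioned with a 1 MiB alignment have far more room, so treat the number as a placeholder. fits_in_gap is a made-up helper.

```shell
# Hypothetical helper: succeed if FILE fits in GAP_SECTORS sectors of 512 bytes.
fits_in_gap() {
    [ -f "$1" ] || return 2
    # Arithmetic expansion strips any padding that wc may print.
    size=$(( $(wc -c < "$1") ))
    [ "$size" -le $(( $2 * 512 )) ]
}

# 62 sectors is an assumption (first partition starting at sector 63).
img=/boot/grub/i386-pc/dcore.img
if [ -f "$img" ]; then
    if fits_in_gap "$img" 62; then
        echo "$img fits in 62 sectors"
    else
        echo "$img is larger than 62 sectors"
    fi
fi
```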

For versions of GRUB before 2.0, the grub-bios-setup command may still be called grub-setup as in the manual.

Reboot. You should see the Welcome to GRUB! banner followed by several pages of debug messages before the menu is shown (or not, as the case may be).