Friday, November 29, 2019

HPE ILO 4 SD card issues

I've been a big fan of HPE server hardware for years.  I've been using them for as long as I have been in IT.  We switched to blades very early in their development.  It hasn't always been a fun ride, but not all server hardware is perfect through the years.  Most recently we had a mix of HPE blade servers.  Gen 8 and Gen 9 blades as our main VMware hosts in our data centers.  Then we purchased some DL160 Gen 9 servers as a very cheap vSAN 2-node ROBO solution for some of our remote locations.  They worked great.  All of these servers were built without disk for the OS, for the blades, there were no disks in them at all.  They all booted from SD Cards.  We knew with our redundancy, we could withstand a failure and rebuild quickly if necessary.

Then we started to see the ESXi errors on writing to hardware.  When we rebooted the hosts, they would fail to find the boot disk or SD card or it would look like a corrupted ESXi instance.  After messing with support for a while, we tried replacing the SD Cards, but they still wouldn't work.  Replacing the Motherboard did work.

Turns out there was an issue were the NAND memory on the ilo (which is where the SD card controller is located) would become corrupted.

We had only a few issues for a while but once we had a project to upgrade ESXi from 6.0 to 6.5 and upgrade firmware (BIOS requirement for Spectre\Meltdown) we started to see these issues en masse.  Almost every one of the hosts we tried to upgrade saw these issues.  It was a huge pain and slowed our migration down significantly.

HPE released many different advisories and ilo firmware in an attempt to fix the issue.

This is the latest version of that Advisory.

The Simple Procedure for blades:

  1. Upgrade the firmware to 2.61+. (When I started this it was 2.50.  There were many versions that changed the behavior throughout the year.  Some version were better than others.)
  2. Run ilo command to Format the NAND Memory.  You can get the Force_Format.xml details from the advisory.
    1. You can run it from a Windows host with the Ilo configuration utility.
    2. You can run it from SSH session from the Enclosure OA.
    3. NOTE:  This will “format” the NAND memory.  It will not erase anything.  Just resets the memory on the ilo that the SD card data runs from.  This can be run while ESXi is online or not, but it is preferable to shut the server off.  It will not format the SD-Card.
  3. Then reset the bay via the e-fuse command. 
    1. From SSH session on OA, run “Show server list”  To view the blade statuses.  Confirm the bay that you want to reset is correct.  This will reset the bay you input, very easy to make a mistake.
    2. Run “Reset Bay XX” Change XX to the Bay number.  Then Type Yes to continue.  Can’t stress enough that the blade will be reset immediately.
    3. Run “Show server list”  to monitor the status of the reset.
  4. After the blade is back up, it should boot automatically.  This usually fixes it.  Sometimes you need to reset it again.
In firmware version 2.51 (I think) HPE added a GUI button for this procedure.  Only if the SD-card controller experiences the error.  Once the error disappears, the GUI buttons for this procedure disappears.  See the advisory for details.

For DL class servers, not blades.  You do not have the ability to reset the e-fuse in order to reset the ilo.  You have to use an AUX Power cycle command.  This is only available via HPE Restful utility.

For the most part this worked.  Sometimes, a simple power off and power on fixes some boot issues.

Recently, with ilo version 2.70, we've had a couple of failures where the SD-card actually fails.  I don't believe that the card is actually failed, but it seems that the NAND format fixes the ilo, but the ilo fails to recognize the SD card.  Switching Motherboard did not resolve it, but swapping SD-cards does.  But we had to rebuild the host.  I do not have a solution for this yet.  Maybe we'll figure something out.

Suffice to say, this was one main reason we did not re-buy HPE hardware.  There were others though, but this contributed to a couple of really stressful years of upgrades.

1 comment:

  1. If resetting the nand fails, you needyo disconnect the power (either efuse or )