Restarting while expanding a RAID can get your server to get stuck at boot

The ability to do an online RAID capacity expansion is something we have had for some time. Still, every time I do this I feel like I am doing something amazing and state-of-the-art. The idea that you can expand an existing volume without losing any data and without downtime is almost magical and requires some clever algorithms to accomplish. Unfortunately, that much complexity introduces many places for errors and bugs. I am planning to write an article to cover some of the theory behind RAID expansion, but this time I just want to issue a general warning about a possible bug in Adaptec RAID expansion on Windows Server.

Normally, the RAID controller keeps track of the progress of the reconfiguration and hides the intermediate state from the OS. The progress is kept in persistent storage on the controller (or disks) so that the reconfiguration can resume correctly after a shutdown or restart. Recently, I helped troubleshoot a case where a server was restarted and didn’t boot properly to the OS. IT was finally able to bring the server back online by replacing the RAID controller. One of the logical devices on that server was in the process of reconfiguration due to an online expansion. That logical device was lost because of the replacement, but the server did boot correctly afterwards. Fortunately, this device only had an empty partition and no data was lost.

I started investigating whether there was a connection between the expansion and the failed boot. Having performed many such reconfigurations in the past, including ones across reboots, I was skeptical at first that the two were connected, but there was nothing else special about this server, and everything went back to normal once the reconfiguring logical device was gone. Eventually I succeeded in reproducing the issue.

Setup:

  • Windows 2012 R2 Standard
  • Adaptec ASR 71605 (7.2 – 7.5)
  • OS on a RAID1 logical device
  • 10 x 600GB disks unassigned.

Sequence to reproduce:

  • Add a new RAID10 logical device using 8 of the disks. This will form a logical device larger than 2TB.
  • Set up this device as a GPT disk and create a partition spanning all of the available space.
  • Start an expansion of the device from 8 to 10 disks using the 2 remaining disks (keep the RAID level).
  • Restart.
  • Result: the system does not boot and stays stuck at the Windows logo.

Windows Stuck at loading

I am speculating that the issue is in the controller's driver; drivers are loaded during the boot stage, which would explain why the OS doesn’t boot. It is unlikely that the issue is in Windows itself, as the driver and controller should hide the expansion process from the OS. It is also unlikely that the issue is in the controller BIOS or firmware, because even with the bricked OS it is still possible to hit Ctrl-A at boot time and enter the controller BIOS, where the proper status is shown. What is certain is that this only happens in the presence of a large GPT partition. I have not been able to reproduce this with MBR or with a partition smaller than 2TB. Partitions over 2TB in size have more than 2^32 512-byte sectors, which is the main reason to use GPT in the first place. I can only speculate that some code in the driver uses 32-bit arithmetic where 64-bit arithmetic is required.
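To make the 2TB boundary concrete, here is a small Python sketch of the arithmetic for the setup above. It assumes 512-byte sectors, decimal gigabytes (600GB = 600 × 10^9 bytes) and no space lost to RAID metadata. The helper names are made up for illustration, and the 32-bit truncation only illustrates the kind of bug I am speculating about; it is not the actual driver code.

```python
SECTOR_SIZE = 512          # bytes per sector (assumed)
DISK_BYTES = 600 * 10**9   # one 600GB disk, assuming decimal gigabytes


def raid10_sectors(disk_count):
    """Usable 512-byte sectors of a RAID10 built from disk_count equal disks."""
    usable_bytes = (disk_count // 2) * DISK_BYTES  # half the disks hold mirror copies
    return usable_bytes // SECTOR_SIZE


def as_uint32(value):
    """What remains of a sector number stored in a 32-bit unsigned field."""
    return value & 0xFFFFFFFF


before = raid10_sectors(8)    # logical device before the expansion
after = raid10_sectors(10)    # logical device after the expansion

for label, sectors in (("8 disks", before), ("10 disks", after)):
    print(label, sectors, "sectors,",
          "fits in 32 bits:", sectors < 2**32, ",",
          "truncated to 32 bits:", as_uint32(sectors))
```

Under these assumptions both the 8-disk and the 10-disk device have more than 2^32 sectors, so a sector number squeezed into a 32-bit field would silently wrap around to a much smaller value.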

If something like this happens to you, there are a few things you can do:

  1. You can boot to the controller BIOS and just wait till the reconfiguration is done. Then you will be able to boot properly to Windows as before.
  2. You can delete the expanding device to break the reconfiguration. This will allow you to boot to the OS, but, for some strange reason, the other RAID1 device becomes degraded and needs to be rebuilt.
  3. You can move the non-reconfiguring RAIDs to another compatible server and let the reconfiguring one finish rebuilding in the background.

I am hoping to hear from Adaptec and Microsoft as to whether they can confirm the issue. I will post an update if they do. Meanwhile I suggest the following:

  • Before doing a RAID expansion, have a proper backup that you can restore from quickly. The vendors give the same advice;
  • If possible, do an offline expansion to another device;
  • Don’t restart during expansions;
  • Read my blog! 🙂

Next time, how a RAID expansion can cause your GPT partition to disappear and how to get it back. Stay tuned.

3 thoughts on “Restarting while expanding a RAID can get your server to get stuck at boot”

  1. Antoine May 7, 2016 2:09 pm

    Highly interesting article. I am a regular consumer (no particular IT skills) and I am experiencing this issue.

    Hardware: Adaptec 6805 (+ AFM600) w/ 6 WD Red 4TB hdds ; ASRock Z97M Formula (also tested w/ MSI Z97M G43); Windows 10 Pro installed on a SSD.

    What happened : I had a 5-hdd Raid-5 array configured and working well, added a 6th one to the array, rebooted while reconfiguring (had lost contact at that point w/ the controller through the Maxview Storage software).

    I am now blocked at boot: I cannot access the OS or the motherboard BIOS. I can access the controller Configuration Utility though. If I disconnect the controller from the motherboard, everything goes back to normal.

    Antoine

  2. Scott Whitlock Jul 19, 2016 2:49 pm

    “Next time, how a RAID expansion can cause your GPT partition to disappear and how to get it back. Stay tuned.”

    Hi, I have this problem where the GPT partition disappeared during the RAID expansion. I know this post was from 2014, but is there any chance you can remember what you did to get it back? 🙂

  3. Arik Yavilevich Jul 20, 2016 5:40 am

    Hi Scott, sorry to hear.

    Is this also with an Adaptec card? What model and firmware version?

    I recall that in my case, the partition disappeared at the end of the resize. You shouldn’t make any other changes to the disk for now. You should wait for the resize to finish and see what happens. Perhaps the partition will re-appear.

    The following helped in my case; you might need to adapt it if your case is not exactly like mine.

    What I did:
    Used http://www.runtime.org/diskexplorer.htm to confirm that the data is still on the disk, that only the first sectors are zeroed, and that the primary GPT is lost.
    Knowing the previous disk size, found the “secondary GPT” table that was at the previous end of the disk. See https://en.wikipedia.org/wiki/GUID_Partition_Table for parameters. You can use Disk Explorer to search for the GPT header around the previously known location.
    Backed it up and wrote it to the new end of the disk (again using Disk Explorer).
    Used http://www.rodsbooks.com/gdisk/repairing.html to make the recovered secondary GPT valid for the new disk size.
    Used gdisk to reconstruct the primary GPT based on the secondary GPT.

    This should work and make Windows recognize the partition table, assuming that the RAID resize itself was valid and that nothing destroyed the secondary copy of the partition table.

    Good luck.
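The search for the secondary GPT header described in the last comment can be roughly sketched in Python. This is only an illustration under stated assumptions: 512-byte sectors, a read-only file-like object for the disk or a disk image, and a hypothetical helper name `find_gpt_headers`. It relies on the GPT header layout, where the header starts with the signature `EFI PART` and stores its own LBA and the LBA of the other header copy at byte offsets 24 and 32. The demo at the end runs against a small in-memory image, not a real disk.

```python
import io
import struct

SECTOR_SIZE = 512              # assumed logical sector size
GPT_SIGNATURE = b"EFI PART"    # first 8 bytes of every GPT header


def find_gpt_headers(dev, expected_lba, window=2048):
    """Scan sectors around expected_lba and return (lba, my_lba, alternate_lba)
    for every sector that starts with the GPT signature."""
    start = max(expected_lba - window, 0)
    hits = []
    dev.seek(start * SECTOR_SIZE)
    for lba in range(start, expected_lba + window + 1):
        sector = dev.read(SECTOR_SIZE)
        if len(sector) < SECTOR_SIZE:
            break  # reached the end of the device or image
        if sector[:8] == GPT_SIGNATURE:
            # Offsets 24 and 32: LBA of this header and LBA of the other copy
            my_lba, alternate_lba = struct.unpack_from("<QQ", sector, 24)
            hits.append((lba, my_lba, alternate_lba))
    return hits


# Demo: a 100-sector in-memory image with a fake backup header at LBA 50
header = bytearray(SECTOR_SIZE)
header[:8] = GPT_SIGNATURE
struct.pack_into("<QQ", header, 24, 50, 1)   # my_lba=50, alternate_lba=1
image = io.BytesIO(bytes(100 * SECTOR_SIZE))
image.seek(50 * SECTOR_SIZE)
image.write(header)

hits = find_gpt_headers(image, expected_lba=50, window=10)
print(hits)
```

For a real recovery, `expected_lba` would be the last sector of the disk at its previous size (previous size in bytes divided by 512, minus 1). A hit whose stored `my_lba` equals that value is a strong hint that it is the old secondary header. The sketch only reads; writing the header to the new end of the disk and repairing it is still a job for the tools mentioned in the comment.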
