The ability to do an online RAID capacity expansion is something we have had for some time. Still, every time I do this I feel like I am doing something amazing and state-of-the-art. The idea that you can expand an existing volume without losing any data and without downtime is almost magical, and it requires some clever algorithms to accomplish. Unfortunately, that much complexity leaves many places for errors and bugs. I am planning to write an article covering some of the theory behind RAID expansion, but this time I just want to issue a general warning about a possible bug in Adaptec RAID expansion on Windows Server.
Normally, the RAID controller keeps track of the progress of the reconfiguration and hides the intermediary state from the OS. The progress is kept in persistent storage on the controller (or disks) so that the reconfiguration can resume correctly after a shutdown or restart. Recently, I helped troubleshoot a case where a server was restarted and didn’t boot properly to the OS. IT was finally able to bring the server back online after replacing the RAID controller. One of the logical devices on that server was in the process of reconfiguration due to an online expansion. That logical device was lost due to the replacement, but the server did boot correctly afterwards. Fortunately, this device only had an empty partition and no data was lost.
I started investigating whether there is a connection between the expansion and the failed boot. Having performed many such reconfigurations in the past, including ones across reboots, I was skeptical at first that those two are connected, but there was nothing else special about this server, and everything went back to normal once the reconfiguring partition was gone. Eventually I was successful at reproducing the issue.
Test configuration:
- Windows Server 2012 R2 Standard
- Adaptec ASR 71605 (7.2 – 7.5)
- OS on a RAID1 logical device
- 10 x 600GB disks unassigned.
Sequence to reproduce:
- Add a new RAID10 logical device using 8 of the disks. This will form a logical device larger than 2TB
- Set up this device as a GPT disk and create a partition on the whole available space
- Start an expansion of the device from 8 to 10 disks using the 2 available disks (keep the RAID level)
- Restart the server while the expansion is still in progress. The system will not boot and gets stuck at the Windows logo.
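The "larger than 2TB" condition in the steps above can be checked with some quick arithmetic. This is only a rough sketch: it assumes each disk is exactly 600 decimal GB, that the controller reserves no space for metadata, and that sectors are 512 bytes. The function and constant names are made up for illustration and do not come from any Adaptec tool.

```python
SECTOR_BYTES = 512
DISK_BYTES = 600 * 10**9      # assumption: 600 GB in decimal units, no overhead
MAX_U32 = 0xFFFFFFFF          # largest sector count that fits in a 32-bit field


def raid10_usable_bytes(disk_count: int, disk_bytes: int = DISK_BYTES) -> int:
    """RAID10 mirrors pairs of disks, so half of the raw capacity is usable."""
    if disk_count < 4 or disk_count % 2:
        raise ValueError("RAID10 needs an even number of disks (at least 4)")
    return (disk_count // 2) * disk_bytes


def needs_gpt(usable_bytes: int) -> bool:
    """True when the sector count no longer fits in MBR's 32-bit fields."""
    return usable_bytes // SECTOR_BYTES > MAX_U32


for disks in (8, 10):
    usable = raid10_usable_bytes(disks)
    print(disks, "disks:", usable // SECTOR_BYTES, "sectors, GPT needed:", needs_gpt(usable))
```

Under these assumptions both the 8-disk and the 10-disk layouts exceed what a 32-bit sector count can describe, which is why the logical device has to be set up as GPT.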
I am speculating that the issue is in the controller driver; drivers are loaded during the boot stage, which would explain why the OS doesn’t boot. It is unlikely that the issue is in Windows itself, as the driver and controller should hide the expansion process from the OS. It is also unlikely that the issue is in the controller BIOS or firmware, because even with the bricked OS it is still possible to hit Ctrl-A at boot time and enter the controller BIOS, where the proper status is shown. What is certain is that this only happens in the presence of a large GPT partition. I have not been able to reproduce it with MBR or with a partition smaller than 2TB. Partitions over 2TB in size have more than 2^32 512-byte sectors, which is the main reason to use GPT in the first place. I can only speculate that some code in the driver uses 32-bit arithmetic where 64-bit is required.
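To show what that kind of mistake would look like, here is a small sketch of a sector count being squeezed into an unsigned 32-bit integer. It is purely illustrative of the class of bug I am guessing at, not actual driver code, and it assumes an 8-disk RAID10 of 600 GB (decimal) disks with 512-byte sectors. The helper name `as_u32` is my own.

```python
MASK32 = 0xFFFFFFFF
SECTOR_BYTES = 512


def as_u32(value: int) -> int:
    """Emulate storing a value in an unsigned 32-bit integer (wraps modulo 2**32)."""
    return value & MASK32


# Assumption: 8 x 600 GB in RAID10 gives 4 x 600 GB of usable space.
usable_bytes = 4 * 600 * 10**9
true_sectors = usable_bytes // SECTOR_BYTES      # needs more than 32 bits
truncated_sectors = as_u32(true_sectors)         # what 32-bit code would see

print("true sector count:     ", true_sectors)
print("32-bit truncated count:", truncated_sectors)
print("fits in 32 bits:       ", true_sectors == truncated_sectors)
```

With 64-bit arithmetic the count stays intact; with 32-bit arithmetic the upper bits are silently dropped and the device appears far smaller than it is. A device under 2TB would pass through `as_u32` unchanged, which would match the fact that I could not reproduce the problem with smaller partitions.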
If something like this happens to you, there are a few things you can do:
- You can boot to the controller BIOS and just wait till the reconfiguration is done. Then you will be able to boot properly to Windows as before.
- You can delete the expanding device to break the reconfiguration. This will allow you to boot to the OS, but, for some strange reason, the other RAID1 device becomes degraded and needs to be rebuilt.
- You can move the non-reconfiguring RAIDs to another compatible server and let the reconfiguring one finish rebuilding in the background.
I am looking to hear from Adaptec and Microsoft as to whether they can confirm the issue. I will post an update if they do. Meanwhile I suggest the following:
- Have a proper backup that you can restore from quickly before doing a RAID expansion. The vendors give the same advice;
- If possible, do an offline expansion to another device;
- Don’t restart during expansions;
- Read my blog! 🙂
Next time, how a RAID expansion can cause your GPT partition to disappear and how to get it back. Stay tuned.