Horrors in RAID-ville
The past few days have motivated me to build a new computer from scratch. I mean, I’ve been meaning to build one for the last year; this was just the straw that broke the camel’s back. This is an embarrassing story given how boneheaded I was in assessing the problem, but here it is nonetheless.
It all started when I moved to Long Beach. For some reason, my monitor would go to sleep immediately after booting up. A fix I saw online had me reset the CMOS, to no avail, and it wasn’t until I was finally able to boot into safe mode (using DVI from the onboard video rather than the Nvidia video card) that I realized the issue was something deeper. So I decided a clean install of Windows was in order. Hey, I’ll finally take the plunge to Windows 10, I thought. After $200 on a clean license, the USB tool that Microsoft provides to create a bootable USB drive wouldn’t work in safe mode. So I downloaded Rufus and then scoured the web for a clean, uncracked Windows 10 ISO. I have the license; I just need the ISO file.
After finding a clean ISO online, writing it to the USB drive, and installing from that onto my computer, the monitor issue was fixed (at least with the onboard video) and I could boot into Windows. After about an hour of installing the usual apps (Steam, Discord, WinRAR), I noticed that my RAID 10 wasn’t being recognized anymore. OH SHIT, and I mean it. My RAID (a volume called Library) doesn’t just have terabytes of books, music, shows, games, and movies; it has almost 2 terabytes of photos from my childhood, my high school football career, and everything I shot when I was a professional photographer. Even old photos I scanned: my family’s entire Polaroid collection. It’s my entire life digitized, and there is no other place where it’s stored. The files in my library are priceless. When this snowball of data started growing about a decade ago (some songs still remain from Napster in 1999), it was too big for any cloud storage solution.
It got to the point where I bought four 3TB drives so I could set up a RAID 10 and have not just size but redundancy. I didn’t want to mess with RAID 5 and its parity either. So I pondered for a moment while the hairs on my neck stood on end. Wait a second, it’s probably just a Windows 10 issue, I thought. I’m going to take this as a sign that I should just stick with Windows 7.
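For anyone who hasn’t set one up, the rough math behind “size and redundancy” looks something like the Python sketch below. It’s a simplified model, not how Intel’s controller actually lays things out: it assumes RAID 10 means two mirrored pairs striped together, uses a made-up drive numbering (drives 0 and 1 mirror each other, as do 2 and 3), and the helper names are hypothetical.

```python
# A simplified model of RAID 10 (a stripe of mirrored pairs).
# Assumption: drives are numbered so that 0-1, 2-3, ... are mirror pairs.

TB = 10**12   # "marketing" terabyte printed on the box
TIB = 2**40   # binary terabyte, which Windows labels "TB"


def raid10_usable_bytes(drive_bytes: int, n_drives: int) -> int:
    """Usable capacity is half the raw space, since every block is mirrored."""
    if n_drives < 4 or n_drives % 2:
        raise ValueError("this model needs an even number of drives, at least 4")
    return drive_bytes * n_drives // 2


def survives(failed: set[int], n_drives: int) -> bool:
    """Data survives as long as no mirror pair has lost both of its drives."""
    pairs = [(i, i + 1) for i in range(0, n_drives, 2)]
    return all(not (a in failed and b in failed) for a, b in pairs)


usable = raid10_usable_bytes(3 * TB, 4)
print(f"raw: {4 * 3 * TB / TIB:.2f} TiB, usable: {usable / TIB:.2f} TiB")
print("one drive lost:", survives({2}, 4))
print("both drives of one mirror lost:", survives({0, 1}, 4))
print("one drive from each mirror lost:", survives({0, 3}, 4))
```

The 3TB on the box is decimal, so each drive shows up as roughly 2.73 TiB and the whole array as roughly 5.46 TiB, which is the “just under 6TB” I was expecting. The catch is in the last two cases: losing both halves of the same mirror is fatal, while losing one drive from each pair isn’t. The model also refuses an odd drive count, since in this layout there’s nothing to mirror the leftover disk against.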
So I decided to find a cracked copy of Windows 7 online (just when I thought I was out, they pull me back in). I overwrote my Windows 10 USB drive and reinstalled. I quickly remembered that the Wi-Fi USB adapter wasn’t going to work on Windows 7 without its software, so it looked like I’d have to haul my computer into the living room, plug it into the Ethernet cable, and download the software. So I haul my computer into the living room and set it up next to the TV, where the Ethernet cord to the router will reach. Then I discover that this Windows PC has literally no drivers, not even for the onboard Ethernet. So even with the cable plugged in, I still have no internet connection. This computer has practically become a paperweight. I don’t have any other computer except my work laptop, so it looks like I’m going to have to use that to save the drivers onto a USB drive and then transfer them over.
Nope.
Murphy’s law is in full effect here. My work’s endpoint protection doesn’t allow any writing to removable drives, at all. So I have no way to get drivers onto this computer. Looks like I’m running down to the FedEx Office about 15 blocks away.
About 45 minutes, a Jimmy John’s sub, and a lazy Uber ride back later, I finally install the drivers onto my PC and can get back to work. But wait, my RAID isn’t being detected here either. Instead there are two drives (D: and E:). Clicking on either brings up a dialog box declaring that the drive needs to be formatted before it can be used.
Umm yeah, that’s gonna be a firm no on that one there, professor.
These drives might as well be priceless Egyptian relics with coordinates to the Stargate, as far as I’m concerned. So I sat and thought for a while. This has to be a BIOS issue, I just feel it, because resetting the CMOS early on probably turned RAID off somewhere. So when I finally reboot, turn on SATA RAID mode, and go into the Intel RAID controller, what do I see? The official status of the RAID says FAILED, and suddenly my brain starts doing gymnastics with semantics.
By “failed,” do they mean there’s a slight issue, or that all hope is lost and everything is gone? I have 2 striped discs, mirrored. I should be able to recover the data from a single drive failure. Right? Worst case scenario, I bag these drives up and take ’em down to someplace with a clean room or something. To add even more head scratching, when I try to boot normally with RAID on, Windows doesn’t even start. Despite my boot order, it says it cannot find Windows, so I have to turn RAID back off. When I get back into Windows, I decide to use Data Essentials Raid Recovery. With it, I was finally able to piece together the RAID and identify folders from my library. Being so spooked by the idea of having no place to put terabytes of data if I somehow recovered it, I quickly found an 8TB external hard drive on Newegg for $200. It was delivered the next day, which is about how long the deep scan to rebuild the NTFS took (8 hours) before the files could be transferred someplace else. After I selected the main folders in my library, it took about 4 hours to transfer all the data to the external hard drive. However, after a moment I noticed that I was not out of the woods yet.
In my library I organize media by type: folders for images, movies, music, etc. My Images folder wasn’t there, and neither were a couple of others. When I went to play music I had transferred, the songs were distorted every few seconds. Alt-J sounded like Metallica’s Kill ’Em All every 6 or so seconds. I quickly realized that the folder structure and files had been recovered for the most part, but the data within the files themselves was completely corrupted. I’m a newbie at RAID recovery, so I figured I had probably used the wizard wrong. Still, I’m really starting to sweat a little here.
For the life of me I couldn’t defend my reasoning, but I decided to hop back onto Windows 10 and piece it together there. I’d already bought a license, so I might as well use it.
Over the next day I tried every driver I could find and googled Intel’s Smart Storage tutorials, with no results. I started seeing that a lot of this Intel software was super buggy and had compatibility issues with Windows 10, and the chipset support was also extremely vague. I ran into multiple installs that ended due to platform incompatibility. It was right about then that I stopped focusing on getting the drives recognized and shifted my focus to recovering the data. After all, I now have a new drive where it’ll all fit. Once I can confirm the transfer, I’ll just wipe the whole RAID and remake it, or set it up so the configuration is clearer, before transferring the data back. After 3 more deep scans to rebuild the NTFS, all had the same result: partial files, with all the internals corrupted. Home videos of me playing with action figures looked like Kanye West’s Welcome to Heartbreak music video. I need to either get another tool (I got one and then realized it wasn’t meant for RAID 10), or do some deeper digging.
I found a thread on HardForum.com dealing with what appears to be the exact same issue on the same Z77X Intel chipset. On that page, there was a link to an Intel thread discussing this same “issue” with a guy who has the exact same size HDs and a similar setup. The guy at Intel said he was doomed.
It was at this moment I felt like Aragorn confronting the Mouth of Sauron in The Return of the King, when they are faced with what appears to be the death of Frodo. “I do not believe it.”
God, I hope allan_intel stumbles on my blog. If you’re reading this, Allan, fuck you. For about 15 minutes, I felt like my entire memory of life was lost forever. Gigabytes of family photos going back decades, where I don’t know if the physical copies even exist anymore: gone. That’s like a game-breaking bug in Final Fantasy. How on earth would I know that letting it boot into IDE mode just once destroys all the data? That seems like an absurdly gigantic design flaw. It would be like trying to open a locked car door and having that destroy the engine. After a half hour of cooling down, and even pondering acceptance at losing all my data, I kept digging and saw that this guy’s issue was close to mine, but not exact. A little more scrolling and I saw that the guys on HardForum WERE able to recover their data, but the steps provided were prefaced with warnings about how they come extremely close to physically wiping the whole drive.
Apparently, I just need to “unmake” the RAID group, which does a logical wipe of the metadata and resets all the disks to non-members, and then recreate the group. From there, I can recover the data using RAID Recovery. So here I go: I unmade the group, which destroyed the metadata, but for some reason, the big bold flashing letters warning of data loss scared me away from making a new group. It was at this moment I noticed the new RAID size would be about 2.7TB smaller than it originally was. It should be just under 6TB out of 12TB of total space in a RAID 10. That didn’t make sense until I finally used the preschool skill that was long overdue: I counted. There were only 3 physical disks that could be chosen to be part of the new RAID. One hard drive was disconnected. Suddenly I wondered if it was just a power or cable issue, so I opened the side of the case and saw that one of the drives’ power cables was indeed loose. Oh God, kill me now, I thought to myself as I plugged it in and saw it appear on screen.
That’s it. That’s all it was.
I could have booted into Windows 10 and it should have worked natively, since this RAID is on the chipset level, if I hadn’t JUST DELETED THE GOD DAMN METADATA ONLY 10 SECONDS EARLIER. IF ONLY I HAD COUNTED AND CHECKED THE CABLES FIRST, THIS WOULD HAVE WORKED DAYS AGO WITHOUT ANY ISSUE. OH MY GOD. I haven’t felt rage like this in years. Now that I’ve deleted the metadata, I’m forced to recover the RAID through deep scanning.
So, back in RAID Recovery after 12 hours of scanning, I see a new folder list for the RAID that has Images and other folders that are indeed correct. So far so good. Another 6 hours of transferring the most precious image files later, I’m only about 30% done. It appears the images are all there, uncorrupted. But this is going to take at least 20 or so more hours of transferring.
I somehow turned a 5-second issue into 6 days of misery. So yes, life is meaningless, we’re all going to be dead soon, and that’s why I’m building a new PC. And fuck RAIDs, especially when the support and documentation for them are sparse. I can afford to store all this in the cloud now, so I’m done managing my own infrastructure beyond a single internal and external HD. Give me the thinnest infrastructure I can get. I’m going for a micro-ATX Ryzen build with M.2 storage for about $1200, storing everything else in the cloud, with this 8TB external as an easy second backup.
I hate computers sometimes.