System Recovery Status

This article will be updated.

Status

Due to the defective boot drive the system is not able to boot on its own. It needs extensive kicking from the outside to start up lmo's virtual machine from the secondary drive. The defective drive has been removed from the LVM array, so system performance is no longer directly affected by the broken disk. New hardware is about to be ordered.

Email service is still suspended pending further work and testing. This also affects other systems that rely on email, such as patchwork, wiki user registration and ecartis. The temporary arrangements put in place while the machine was unserviceable are still in effect.

Upgrades

  • The OS has been upgraded to Fedora 34.
  • MediaWiki has been upgraded from the dusty 1.33.0 to the latest 1.36.1 release (a rough sketch of the usual upgrade steps follows this list).
  • XCVS' URL scheme is less than obvious, to say the least. A redirect has now been provided for the sake of a user that may no longer exist ;-)
  • The nameserver configuration has been reworked. All the issues fixed were pre-existing, but this seemed like the right time to do it.
  • The wiki is now spam-free.
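
For those curious about the MediaWiki jump: it boils down to unpacking the new release next to the old one, carrying over LocalSettings.php and the uploads, and letting the maintenance script migrate the database. A rough sketch only - the paths and database name below are placeholders, not what actually lives on lmo:

    # back up the wiki database first (database name is a placeholder)
    mysqldump wikidb > wikidb-pre-1.36-upgrade.sql

    # unpack the new release and carry over configuration and uploads
    tar xzf mediawiki-1.36.1.tar.gz -C /srv/www/
    cp /srv/www/mediawiki-1.33.0/LocalSettings.php /srv/www/mediawiki-1.36.1/
    rsync -a /srv/www/mediawiki-1.33.0/images/ /srv/www/mediawiki-1.36.1/images/

    # run the schema and content migrations shipped with the new release
    cd /srv/www/mediawiki-1.36.1
    php maintenance/update.php --quick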

Temporary website

When it became clear the outage wasn't going to be resolved as quickly as desirable, a temporary site was set up. That site has since been removed.

What happened

linux-mips.org runs as the sole virtual machine on a reasonably beefy system hosted at a data centre. The hardware, the Fedora 18 system installed on it, and the hoster all turned out to be exceptionally reliable picks.

Anyway, as they say, paranoid people live longer, so I logged into the host system to take an LVM snapshot and perform an offsite backup - which is when the fun started. I was able to perform an offsite backup of the host system with the exception of a few syslog files which won't be missed. But the host system only contains the Linux distribution and system configuration files.
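
The host backup itself is the usual LVM snapshot dance. Something along these lines - volume group, mount point and backup target are placeholders, not the real setup:

    # temporary snapshot of the host's root LV (names are hypothetical)
    lvcreate --snapshot --size 5G --name rootsnap /dev/vg_host/root

    # mount it read-only and push it offsite
    mkdir -p /mnt/rootsnap
    mount -o ro /dev/vg_host/rootsnap /mnt/rootsnap
    rsync -aHAX /mnt/rootsnap/ backup@offsite.example.org:/backups/lmo-host/

    # clean up
    umount /mnt/rootsnap
    lvremove -f /dev/vg_host/rootsnap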

Less luck with the backup of lmo. lmo's entire data resides on an LVM RAID1. Rsync started to misbehave, so I started to analyze. Only at this point did I notice that /dev/sda had failed. Logged in via SSH, I performed a surface scan of the drive. Over the next 8 hours the scan got 25% of the way through and found just under 1000 bad blocks. Then it died due to an unrelated network issue.
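
For reference, a read-only surface scan of that sort can be done with badblocks; running it inside tmux or screen would have let it survive the dropped SSH session. The exact invocation used isn't recorded, so treat this as a sketch:

    # keep the long-running scan alive across SSH disconnects
    tmux new -s scan

    # non-destructive read-only surface scan with progress output
    badblocks -sv -o /root/sda-badblocks.txt /dev/sda

    # SMART long self-test as a cross-check
    smartctl -t long /dev/sda
    smartctl -a /dev/sda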

Worst of all, the drive had switched itself to read-only mode. This meant it was no longer possible to change the configuration of the RAID in any way - not even to drop the broken drive from the RAID1.

It was at this point that I decided that all data was still safe on the surviving half of the RAID and that the best thing to do was to power down the system so the hoster could swap the drive. And so I shut down the system after 2338 days, 4 hours and 23 minutes of uptime.

First thing next morning I called the hoster, listening to the reassuring sound of big air conditioners in the background. Wonderful - I was talking straight to the machine room; this was going to be a breeze!

I couldn't have been more wrong. The account with the hoster is split into two: a master account, which is for billing, and an admin account, which I hold and which can perform all sorts of technical tasks.

Except opening a bloody support ticket!

You'd think that's one of the most basic things an admin account could do, but no - no ticket for the admin account. So I had to contact whoever holds the master account.

At this point the hunt for the responsible person at MIPS started. I didn't even know who the previous owner of the master account was. It turned out he had retired from his job due to health issues. The somewhat turbulent company history of MIPS in recent years meant all my contact information had gone stale. I hit a dead end. No way to open a support ticket ... No hardware to run on ...

A few days ago things finally started moving again. Not only is there the repeated commitment to finance some new hardware - the old hardware has somehow changed its behaviour. Until now the failed disk had switched itself to read-only mode, meaning it was not possible to change anything at all; suddenly that was no longer the case, which made it possible to retire the failed drive from the LVM RAID configuration. Sounds quick and easy - but it actually was an endless series of work to make the data fit on the remaining drive. The second surprise was that the broken disk drive cooperated through the entire operation.
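
For the record, retiring a dead leg from an LVM raid1 volume roughly looks like this - VG/LV names and the failed device are placeholders, and the long slog of shuffling data around so everything fits on one drive isn't shown:

    # see which physical volume the raid1 legs live on
    pvs
    lvs -a -o name,segtype,devices vg_lmo

    # reduce the mirrored LV to a single image, freeing the bad drive
    lvconvert -m0 vg_lmo/data /dev/sda2

    # then drop the now-unused PV from the volume group
    vgreduce vg_lmo /dev/sda2        # or: vgreduce --removemissing vg_lmo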

It also sent the original plan for recovering the RAID array - developed under great pain - down the drain. Oh well ...

With the boot drive lost, the system had to be booted from a rescue image. This rescue image has now been taken to the extreme: virtualization software has been installed into it and the linux-mips.org VM brought online again.
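
Booting an existing guest straight from a rescue environment with plain QEMU/KVM goes roughly like this - memory size, disk path and bridge name are assumptions, not lmo's real configuration:

    # pull the virtualization bits into the rescue system (Fedora-style)
    dnf install -y qemu-kvm

    # boot the existing guest disk sitting on the surviving drive
    qemu-kvm -enable-kvm -m 8192 -smp 4 \
        -drive file=/dev/vg_lmo/lmo-root,format=raw,if=virtio \
        -netdev bridge,id=net0,br=br0 \
        -device virtio-net-pci,netdev=net0 \
        -nographic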

So technically at this point the system was up and open for business!

Sorta. IPv4 only at that point. And a reboot would have destroyed the temporary setup, so a script to redo all of that had to be developed. Also, much of the OS and installed software was pretty dated, so all of that was updated - which turned out to be the roughest upgrade in a lifetime. And while lmo was sleeping on a disk drive, the world around it changed, which in turn required yet more changes. All in all this turned into weeks of solid work.
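
The redo-after-reboot script is essentially those manual steps written down. Something along these lines, with every path, address and interface name being a placeholder:

    #!/bin/bash
    # re-create the temporary setup after a reboot of the rescue system
    set -euo pipefail

    # activate the LVM volumes on the surviving drive
    vgchange -ay vg_lmo

    # IPv4-only bridge for the guest; the uplink interface name is a guess
    ip link add br0 type bridge
    ip link set eth0 master br0
    ip link set br0 up

    # move the host's address onto the bridge (addresses are placeholders)
    ip addr add 192.0.2.10/24 dev br0
    ip route add default via 192.0.2.1

    # finally bring the lmo guest back up (same qemu-kvm invocation as above)
    /root/start-lmo-vm.sh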