Fri, 28 May 2010

Server Crash


On Tuesday, 25 May 2010, this server crashed hard. It has an ext3 file system on a RAID-1 mirrored pair using Linux Software RAID. The disks were OK, but we must have hit a file system bug: dmesg started spewing ext3 errors, and the kernel remounted the root file system read-only.

The system failed to reboot. A manual fsck threw hundreds of errors and resulted in a non-bootable system.

Our hosting provider put in new disks and reinstalled Debian Lenny. They attached the original disk via USB.

This server hosts Roaring Penguin's corporate web site, my sister's framing store site, the MIMEDefang site and the OMJS site. It also hosts our mail filter.

What Went Right

We had offsite backups of most of the important things, including an almost-live backup of our web site on another colocated server. For a few hours, we redirected our web traffic there.

We moved our mail filtering quite seamlessly to our hosted filtering service. The dead server was a secondary MX for that service, but the primary MX machine just kept chugging along.

What Went Wrong

We didn't have a quick way to restore from "bare metal". Recovering the server took me several hours late at night. Not fun.

The hosting provider was way too slow to react. Our server was down for over 12 hours.

We didn't have offsite backups of everything. There were a few crusty little scripts in /usr/local/bin that weren't backed up or version-controlled; they made their absence known in annoying cron messages.

We'll back up and version-control everything from now on!
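
One way to catch stragglers like those /usr/local/bin scripts is to compare a directory against its backup copy and report anything missing or out of date. What follows is only a minimal sketch of that idea, not what we actually run: the function names and the backup path in the comment are hypothetical, and it assumes the backup is a plain directory tree mirroring the original.

```python
import hashlib
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def find_unbacked_files(source_dir: str, backup_dir: str) -> List[str]:
    """List files under source_dir that are missing from, or differ in, backup_dir.

    Paths are returned relative to source_dir, in sorted order.
    """
    source = Path(source_dir)
    backup = Path(backup_dir)
    problems = []
    for path in source.rglob("*"):
        if not path.is_file():
            continue  # skip directories and broken symlinks
        rel = path.relative_to(source)
        copy = backup / rel
        if not copy.is_file() or file_digest(path) != file_digest(copy):
            problems.append(str(rel))
    return sorted(problems)


# Example call (the backup location is an assumption, not our real layout):
#   for name in find_unbacked_files("/usr/local/bin", "/backup/usr/local/bin"):
#       print("not backed up:", name)
```

Run from cron, a check like this would complain about an unprotected script before a crash does, rather than after.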
