Tag Archives: open source

All posts relating to Open Source software, mostly but not exclusively UNIX focused.

Adventures in I/O Hell

Earlier this year I had a disk fail in my NZ-based file & VM server. This isn’t something unexpected, the server has 12x hard disks, which are over 2 years old, some failures are to be expected over time. I anticipated just needing to arrange the replacement of one disk and being able to move on with my life.

Of course computers are never as simple as this and it triggered a delightful adventure into how my RAID configuration is built, in which I encountered many lovely headaches and problems, both hardware and software.

My beautiful baby

Servers are beautiful, but also such hard work sometimes.

 

The RAID

Of the 12x disks in my server, 2x are for the OS boot drives and the remaining 10x are pooled together into one whopping RAID 6 which is then split using LVM into volumes for all my virtual machines and also file server storage.

I could have taken the approach of splitting the data and virtual machine volumes apart, maybe giving the virtual machines a RAID 1 and the data a RAID 5, but I settled on this single RAID 6 approach for two reasons:

  1. The bulk of my data is infrequently accessed files but most of the activity is from the virtual machines – by mixing them both together on the same volume and spreading them across the entire RAID 6, I get to take advantage of the additional number of spindles. If I had separate disk pools, the VM drives would be very busy, the file server drives very idle, which is a bit of a waste of potential performance.
  2. The RAID 6 redundancy has a bit of overhead when writing, but it provides the handy ability to tolerate the failure and loss of any 2 disks in the array, offering a bit more protection than RAID 5 and much more usable space than RAID 10.

In my case I used Linux’s Software RAID – whilst many dislike software RAID, the fact is that unless I was to spend serious amounts of money on a good hardware RAID card with a large cache and go to the effort of getting all the vendor management software installed and monitored, I’m going to get little advantage over Linux Software RAID, which consumes relatively little CPU. Plus my use of disk encryption destroys any minor performance advantages obtained by different RAID approaches anyway, making the debate somewhat moot.

The Hardware

To connect all these drives, I have 3x SATA controllers installed in the server:

# lspci | grep -i sata
00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
02:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
03:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)

The onboard AMD/ATI controller uses the standard “ahci” Linux kernel driver and the Marvell controllers use the standard “sata_mv” Linux kernel driver. In both cases, the controllers are configured purely in JBOD mode, which exposes each drive as-is to the OS, with no hardware RAID or abstraction taking place.

You can see how the disks are allocated to the different PCI controllers using the sysfs filesystem on the server:

# ls -l /sys/block/sd*
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sda -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sdb -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host1/target1:0:0/1:0:0:0/block/sdb
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdc -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host2/target2:0:0/2:0:0:0/block/sdc
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sdd -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host3/target3:0:0/3:0:0:0/block/sdd
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sde -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host4/target4:0:0/4:0:0:0/block/sde
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdf -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host5/target5:0:0/5:0:0:0/block/sdf
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdg -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdg
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdh -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdh
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdi -> ../devices/pci0000:00/0000:00:11.0/host8/target8:0:0/8:0:0:0/block/sdi
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdj -> ../devices/pci0000:00/0000:00:11.0/host9/target9:0:0/9:0:0:0/block/sdj
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdk -> ../devices/pci0000:00/0000:00:11.0/host10/target10:0:0/10:0:0:0/block/sdk
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdl -> ../devices/pci0000:00/0000:00:11.0/host11/target11:0:0/11:0:0:0/block/sdl

There is a single high-wattage PSU driving the host and all the disks, in a well designed Lian-Li case with proper space and cooling for all these disks.

The Failure

In March my first disk died. In this case, /dev/sde attached to the host AMD controller failed and I decided to hold off replacing it for a couple of months as it’s internal to the case and I would be visiting home in May in just a month or so. Meanwhile the fact that RAID 6 has 2x parity disks would ensure that I had at least a RAID-5 level of protection until then.

However a few weeks later, a second disk in the same array also failed. This become much more worrying, since two bad disks in the array removed any redundancy that the RAID array offered me and would mean that any further disk failures would cause fatal data corruption of the array.

Naturally due to Murphy’s Law, a third disk then promptly failed a few hours later triggering a fatal collapse of my array, leaving it in a state like the following:

md2 : active raid6 sdl1[9] sdj1[8] sda1[4] sdb1[5] sdd1[0] sdc1[2] sdh1[3](F) sdg1[1] sdf1[7](F) sde1[6](F)
      7814070272 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/7] [UUU_U__UUU]

Whilst 90% of my array is regularly backed up, there was some data that I hadn’t been backing up due to excessive size that wasn’t fatal, but nice to have. I tried to recover the array by clearing the faulty flag on the last-failed disk (which should still have been consistent/in-line with the other disks) in order to bring it up so I could pull the latest data from it.

mdadm --assemble /dev/md2 --force --run /dev/sdl1 /dev/sdj1 /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sdc1 /dev/sdh1 /dev/sdg1
mdadm: forcing event count in /dev/sdh1(3) from 6613460 upto 6613616
mdadm: clearing FAULTY flag for device 6 in /dev/md2 for /dev/sdh1
mdadm: Marking array /dev/md2 as 'clean'

mdadm: failed to add /dev/sdh1 to /dev/md2: Invalid argument
mdadm: failed to RUN_ARRAY /dev/md2: Input/output error

Sadly this failed in a weird fashion, where the array state was cleared to OK, but /dev/sdh was still failed and missing from the array, leading to corrupted data and all attempts to read the array successfully being hopeless. At this stage the RAID array was toast and I had no other option than to leave it broken for a few weeks till I could fix it on a scheduled trip home. :-(

Debugging in the murk of the I/O layer

Having lost 3x disks, it was unclear as to what the root cause of the failure in the array was at this time. Considering I was living in another country I decided to take the precaution of ordering a few different spare parts for my trip, on the basis that spending a couple hundred dollars on parts I didn’t need would be more useful than not having the part I did need when I got back home to fix it.

At this time, I had lost sde, sdf and sdh – all three disks belonging to a single RAID controller, as shown by /sys/block/sd, and all of them Seagate Barracuda disks.

lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sde -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host4/target4:0:0/4:0:0:0/block/sde
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdf -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host5/target5:0:0/5:0:0:0/block/sdf
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdg -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdg
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdh -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdh

Hence I decided to order 3x 1TB replacement Western Digital Black edition disks to replace the failed drives, deliberately choosing another vendor/generation of disk to ensure that if it was some weird/common fault with the Seagate drives it wouldn’t occur with the new disks.

I also placed an order of 2x new SATA controllers, since I had suspicions about the fact that numerous disks attached to one particular controller had failed. The two add in PCI-e controllers were somewhat simplistic ST-Labs cards, providing 4x SATA II ports in a PCI-e 4x format, using a Marvell 88SX7042 chipset. I had used their SiI-based PCI-e 1x controllers successfully before and found them good budget-friendly JBOD controllers, but being a budget no-name vendor card, I didn’t hold high hopes for them.

To replace them, I ordered 2x Highpoint RocketRaid 640L cards, one of the few 4x SATA port options I could get in NZ at the time. The only problem with these cards was that they were also based on the Marvell 88SX7042 SATA chipset, but I figured it was unlikely for a chipset itself to be the cause of the issue.

I considered getting a new power supply as well, but the fact that I had not experienced issues with any disks attached to the server’s onboard AMD SATA controller and that all my graphing of the power supply showed the voltages solid, I didn’t have reason to doubt the PSU.

When back in Wellington in May, I removed all three disks that had previously exhibited problems and installed the three new replacements. I also replaced the 2x PCI-e ST-Labs controllers with the new Highpoint RocketRaid 640L controllers. Depressingly almost as soon as the hardware was changed and I started my RAID rebuild, I started experiencing problems with the system.

May  4 16:39:30 phobos kernel: md/raid:md2: raid level 5 active with 7 out of 8 devices, algorithm 2
May  4 16:39:30 phobos kernel: created bitmap (8 pages) for device md2
May  4 16:39:30 phobos kernel: md2: bitmap initialized from disk: read 1 pages, set 14903 of 14903 bits
May  4 16:39:30 phobos kernel: md2: detected capacity change from 0 to 7000493129728
May  4 16:39:30 phobos kernel: md2:
May  4 16:39:30 phobos kernel: md: recovery of RAID array md2
May  4 16:39:30 phobos kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
May  4 16:39:30 phobos kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May  4 16:39:30 phobos kernel: md: using 128k window, over a total of 976631296k.
May  4 16:39:30 phobos kernel: unknown partition table
...
May  4 16:50:52 phobos kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:50:52 phobos kernel: ata7.00: failed command: IDENTIFY DEVICE
May  4 16:50:52 phobos kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
May  4 16:50:52 phobos kernel:         res 40/00:24:d8:1b:c6/00:00:02:00:00/40 Emask 0x4 (timeout)
May  4 16:50:52 phobos kernel: ata7.00: status: { DRDY }
May  4 16:50:52 phobos kernel: ata7: hard resetting link
May  4 16:50:52 phobos kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:50:52 phobos kernel: ata7.00: configured for UDMA/133
May  4 16:50:52 phobos kernel: ata7: EH complete
May  4 16:51:13 phobos kernel: ata16.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:51:13 phobos kernel: ata16.00: failed command: IDENTIFY DEVICE
May  4 16:51:13 phobos kernel: ata16.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
May  4 16:51:13 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May  4 16:51:13 phobos kernel: ata16.00: status: { DRDY }
May  4 16:51:13 phobos kernel: ata16: hard resetting link
May  4 16:51:18 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:51:23 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:51:23 phobos kernel: ata16: hard resetting link
May  4 16:51:28 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:51:33 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:51:33 phobos kernel: ata16: hard resetting link
May  4 16:51:39 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:52:08 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:52:08 phobos kernel: ata16: limiting SATA link speed to 3.0 Gbps
May  4 16:52:08 phobos kernel: ata16: hard resetting link
May  4 16:52:13 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:52:13 phobos kernel: ata16: reset failed, giving up
May  4 16:52:13 phobos kernel: ata16.00: disabled
May  4 16:52:13 phobos kernel: ata16: EH complete
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 c8 00 00 01 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: Disk failure on sdk, disabling device.
May  4 16:52:13 phobos kernel: md/raid:md2: Operation continuing on 6 devices.
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396112 on sdk).
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396280 on sdk).
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 c9 00 00 04 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396288 on sdk)
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84397304 on sdk).
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 cd 00 00 03 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84397312 on sdk).
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84398072 on sdk).
May  4 16:53:14 phobos kernel: ata18.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata15.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata18.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata18.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata18.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata18: hard resetting link
May  4 16:53:14 phobos kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata17.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata17.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata17.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata17: hard resetting link
May  4 16:53:14 phobos kernel: ata15.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata15.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata15.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata15: hard resetting link
May  4 16:53:15 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:15 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:53:15 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:20 phobos kernel: ata15.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata17.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata18.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata17.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata17: hard resetting link
May  4 16:53:20 phobos kernel: ata15.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata15.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata15: hard resetting link
May  4 16:53:20 phobos kernel: ata18.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata18.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata18: hard resetting link
May  4 16:53:21 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:21 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:53:21 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:31 phobos kernel: ata15.00: qc timeout (cmd 0xec)
May  4 16:53:31 phobos kernel: ata17.00: qc timeout (cmd 0xec)
May  4 16:53:31 phobos kernel: ata18.00: qc timeout (cmd 0xec)
May  4 16:53:32 phobos kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata15.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata15.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata15: limiting SATA link speed to 1.5 Gbps
May  4 16:53:32 phobos kernel: ata15: hard resetting link
May  4 16:53:32 phobos kernel: ata17.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata17: limiting SATA link speed to 3.0 Gbps
May  4 16:53:32 phobos kernel: ata17: hard resetting link
May  4 16:53:32 phobos kernel: ata18.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata18.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata18: limiting SATA link speed to 1.5 Gbps
May  4 16:53:32 phobos kernel: ata18: hard resetting link
May  4 16:53:32 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
May  4 16:53:33 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
May  4 16:53:33 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
May  4 16:54:05 phobos kernel: ata15: EH complete
May  4 16:54:05 phobos kernel: ata17: EH complete
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] Unhandled error code
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdj, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 5 devices.
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] Unhandled error code
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdl, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 4 devices.
May  4 16:54:05 phobos kernel: ata18: EH complete
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] Unhandled error code
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdm, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 4 devices.
May  4 16:54:05 phobos kernel: md: md2: recovery done.

The above logs demonstrate my problem perfectly. After rebuilding the array, a disk would fail and lead to a cascading failure where other disks would throw up “failed to IDENTIFY” errors until they eventually all failed one-by-one until all the disks attached to that controller had vanished from the server.

At this stage I knew the issue wasn’t caused by all the hard disk themselves, but there was the annoying fact that at the same time, I couldn’t trust all the disks – there were certainly at least a couple bad hard drives in my server triggering a fault, but many of the disks marked as bad by the server were perfectly fine when tested in a different system.

Untrustworthy Disks

Untrustworthy Disks

By watching the kernel logs and experimenting with disk combinations and rebuilds, I was able to determine that the first disk that always failed was usually faulty and did need replacing. However the subsequent disk failures were false alarms and seemed to be part of a cascading triggered failure on the controller that the bad disk was attached to.

Using this information, it would be possible for me to eliminate all the bad disks by removing the first one to fail each time till I had no more failing disks. But this would be a false fix, if I was to rebuild the RAID array in this fashion, everything would be great until the very next disk failure occurs and leads to the whole array dying again.

Therefore I had to find the root cause and make a permanent fix for the solution. Looking at what I knew about the issue, I could identify the following possible causes:

  1. A bug in the Linux kernel was triggering cascading failures of RAID rebuilds (bug in either software RAID or in AHCI driver for my Marvell controller).
  2. The power supply on the server was glitching under the load/strain of a RAID rebuild and causing brownouts to the drives.
  3. The RAID controller had some low level hardware issue and/or an issue exhibited when one failed disk sent a bad signal to the controller.

Frustratingly I didn’t have any further time to work on it whilst I was in NZ in May, so I rebuilt the minimum array possible using the onboard AMD controller to restore my critical VMs and had to revisit the issue again in October when back in the country for another visit.

Debugging hardware issues and cursing the lower levels of the OSI layers.

Debugging disks and cursing hardware.

When I got back in October, I decided to eliminate all the possible issues, no matter how unlikely…. I was getting pretty tired of having to squeeze all my stuff into only a few TB, so wanted a fix no matter what.

Kernel – My server was running the stock CentOS 6 2.6.32 kernel which was unlikely to have any major storage subsystem bugs, however as a precaution I upgraded to the latest upstream kernel at the time of 3.11.4.

This failed to make any impact, so I decided that it would be unlikely I’m the first person to discover a bug in the Linux kernel AHCI/Software-RAID layer and marked the kernel as an unlikely source of the problem. The fact that when I added a bad disk to the onboard AMD server controller I didn’t see the same cascading failure, also helped eliminate a kernel RAID bug as the cause, leaving only the possibility of a bug in the “sata_mv” kernel driver that was specific to the controllers.

Power Supply – I replaced the few year old power supply in the server with a newer higher spec good quality model. Sadly this failed to resolve anything, but it proved that the issue wasn’t power related.

SATA Enclosure – 4x of my disks were in a 5.25″ to 3.5″ hotswap enclosure. It seemed unlikely that a low level physical device could be the cause, but I wanted to eliminate all common infrastructure possibility, and all the bad disks tended to be in this enclosure. By juggling the bad disks around, I was able to confirm that the issue wasn’t specific to this one enclosure.

At this time the only hardware in common left over was the two PCI-e RAID controllers – and of course the motherboard they were connected via. I really didn’t want to replace the motherboard and half the system with it, so I decided to consider the possibility that the fact the old ST-Lab and new RocketRaid controllers both used the same Marvell 88SX7042 chipset meant that I hadn’t properly eliminated the controllers as being the cause of the issue.

After doing a fair bit of research around the Marvell 88SX7042 online, I did find a number of references to similar issues with other Marvell chipsets, such as issues with the 9123 chipset series, issues with the 6145 chipset and even reports of issues with Marvell chipsets on FreeBSD.

The Solution

Having traced something dodgy back to the Marvell controllers I decided the only course of action left was to source some new controller cards. Annoyingly I couldn’t get anything that was 4x SATA ports in NZ that wasn’t either Marvell chipset based or extremely expensive hardware RAID controllers offering many more features than I required.

I ended up ordering 2x Syba PCI Express SATA II 4-Port RAID Controller Card SY-PEX40008 which are SiI 3124 chipset based controllers that provide 4x SATA II ports on a single PCIe-1x card. It’s worth noting that these controllers are actually PCI-X cards with an onboard PCI-X to PCIe-1x converter chip which is more than fast enough for hard disks, but could be a limitation for SSDs.

Marvell Chipset-based controller (left) and replacement SiI based controller (right).

Marvell Chipset-based controller (left) and replacement SiI based controller (right). The SiI controller is much larger thanks to the PCI-X to PCI-e converter chip on it’s PCB.

I selected these cards in particular, as they’re used by the team at Backblaze (a massive online back service provider) for their storage pods, which are built entirely around giant JBOD disk arrays. I figured with the experience and testing this team does, any kit they recommend is likely to be pretty solid (check out their design/decision blog post).

# lspci | grep ATA
00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
03:04.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
05:04.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)

These controllers use the “sata_sil24” kernel driver and are configured as simple JBOD controllers in the exact same fashion as the Marvell cards.

To properly test them, I rebuilt an array with a known bad disk in it. As expected, at some point during the rebuild, I got a failure of this disk:

Nov 13 10:15:19 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x3fe SErr 0x0 action 0x6
Nov 13 10:15:19 phobos kernel: ata14.00: irq_stat 0x00020002, device error via SDB FIS
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:08:00:c5:c7/01:00:02:00:00/40 tag 1 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 41/40:00:2d:c5:c7/00:01:02:00:00/00 Emask 0x409 (media error) <F>
Nov 13 10:15:19 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:19 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x3fe SErr 0x0 action 0x6
Nov 13 10:15:19 phobos kernel: ata14.00: irq_stat 0x00020002, device error via SDB FIS
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:08:00:c5:c7/01:00:02:00:00/40 tag 1 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 41/40:00:2d:c5:c7/00:01:02:00:00/00 Emask 0x409 (media error) <F>
Nov 13 10:15:19 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:19 phobos kernel: ata14.00: error: { UNC }
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:10:00:c6:c7/01:00:02:00:00/40 tag 2 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 9c/13:04:04:00:00/00:00:00:20:9c/00 Emask 0x2 (HSM violation)
Nov 13 10:15:19 phobos kernel: ata14.00: status: { Busy }
Nov 13 10:15:19 phobos kernel: ata14.00: error: { IDNF }
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/c0:18:00:c7:c7/00:00:02:00:00/40 tag 3 ncq 98304 in
Nov 13 10:15:19 phobos kernel:         res 9c/13:04:04:00:00/00:00:00:30:9c/00 Emask 0x2 (HSM violation)
...
Nov 13 10:15:41 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 13 10:15:41 phobos kernel: ata14.00: irq_stat 0x00060002, device error via D2H FIS
Nov 13 10:15:41 phobos kernel: ata14.00: failed command: READ DMA
Nov 13 10:15:41 phobos kernel: ata14.00: cmd c8/00:00:00:c5:c7/00:00:00:00:00/e2 tag 0 dma 131072 in
Nov 13 10:15:41 phobos kernel:         res 51/40:00:2d:c5:c7/00:00:02:00:00/02 Emask 0x9 (media error)
Nov 13 10:15:41 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:41 phobos kernel: ata14.00: error: { UNC }
Nov 13 10:15:41 phobos kernel: ata14.00: configured for UDMA/100
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Unhandled sense code
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Sense Key : Medium Error [current] [descriptor]
Nov 13 10:15:41 phobos kernel: Descriptor sense data with sense descriptors (in hex):
Nov 13 10:15:41 phobos kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Nov 13 10:15:41 phobos kernel:        02 c7 c5 2d 
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Add. Sense: Unrecovered read error - auto reallocate failed
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] CDB: Read(10): 28 00 02 c7 c5 00 00 01 00 00
Nov 13 10:15:41 phobos kernel: ata14: EH complete
Nov 13 10:15:41 phobos kernel: md/raid:md5: read error corrected (8 sectors at 46646648 on sdk)

The known faulty disk failed as expected, but rather than suddenly vanishing from the SATA controller, the disk remained attached and proceeded to spew out seek errors for the next few hours, slowing down the rebuild process.

Most importantly, no other disks failed at any stage following the move from the Marvell to the SiI controller. Having assured myself with the stability and reliability of the controller, I was able to add in other disks I weren’t sure had or had not failed and quickly eliminate the truly faulty disks.

My RAID array rebuild completed successfully and I was then able to restore the LVM volume and all my data and restore full services on my system.

With the data restored, time for a delicious IPA.

With the data restored successfully at last, time for a delicious local IPA to celebrate victory at long last.

Having gone through the pain and effort to recover and rebuild this system, I had to ask myself a few questions about my design decisions and what I could do to avoid a fault like this again in future.

Are the Marvell controllers shit?

Yes. Next question?

In seriousness, the fact that two completely different PCI-e card vendors exhibited the exact same problem with this controller shows that it’s not going to be the fault of the manufacturer of the controller card and must be something with the controller itself or the kernel driver operating it.

It is *possible* that the Linux kernel has a bug in the sata_mv controller, but I don’t believe that’s the likely cause due to the way the fault manifests. Whenever a disk failed, the disks would slowly disappear from the kernel entirely, with the kernel showing surprise at suddenly having lost a disk connection. Without digging into the Marvell driver code, it looks very much like the hardware is failing in some fashion and dropping the disks. The kernel just gets what the hardware reports, so when the disks vanish, it just tries to cope the best that it can.

I also consider a kernel bug to be a vendor fault anyway. If you’re a major storage hardware vendor, you have an obligation to thoroughly test the behaviour of your hardware with the world’s leading server operating system and find/fix any bugs, even if the driver has been written by the open source community, rather than your own in house developers.

Finally posts on FreeBSD mailing lists also show users reporting strange issues with failing disks on Marvell controllers, so that adds a cross-platform dimension to the problem. Further testing with the RAID vendor drivers rather than sata_mv would have been useful, but they only had binaries for old kernel versions that didn’t match the CentOS 6 kernel I was using.

Are the Seagate disks shit?

Maybe – it’s hard to be sure, since my server has been driven up in my car from Wellington to Auckland (~ 900km) and back which would not have been great for the disks (especially with my driving) and has had a couple of times that it’s gotten warmer than I would have liked thanks to the wonders of running a server in a suburban home.

I am distrustful of my Seagate drives, but I’ve had bad experienced with Western Digital disks in the past as well, which brings me to the simple conclusion that all spinning rust platters are crap and the price drops of SSDs to a level that kills hard disks can’t come soon enough.

And even if I plug in the worst possible, spec-violating SATA drive to a controller, that controller should be able to handle it properly and keep the disk isolated, even if it means disconnecting one bad disk. The Marvell 88SX7042 controller is advertised as being a 4-channel controller, I therefore do not expect issues on one channel to impact the activities on the other channels. In the same way that software developers code assuming malicious input, hardware vendors need to design for the worst possible signals from other devices.

What about Enterprise vs Consumer disks

I’ve looked into the differences between Enterprise and Consumer grade disks before.

Enterprise disks would certainly help the failures occur in a cleaner fashion. Whether it would result in correct behaviour of the Marvell controllers, I am unsure… the cleaner death *may* result in the controller glitching less, but I still wouldn’t trust it after what I’ve seen.

I also found rebuilding my RAID array with one of my bad sector disks would take around 12 hours when the disk was failing. After kicking out the disk from the array, that rebuild time dropped to 5 hours, which is a pretty compelling argument for using enterprise disks to have them die quicker and cleaner.

Lessons & Recommendations

Whilst slow, painful and frustrating, this particular experience has not been a total waste of time, there are certainly some useful lessons I’ve learnt from the exercise and things I’d do different in future. In particular:

  1. Anyone involved in IT ends up with a few bad disks after a while. Keep some of these bad disks around rather than destroying them. When you build a new storage array, install a known bad disk and run disk benchmarks to trigger a fault to ensure that all your new infrastructure handles the failure of one disk correctly. Doing this would have allowed me to identify the Marvell controllers as crap before the server went into production, as I would have seen that a bad disks triggers a cascading fault across all the good disks.
  2. Record the serial number of all your disks once you build the server. I had annoyances where bad disks would disappear from the controller entirely and I wouldn’t know what the serial number of the disk was, so had to get the serial numbers from all the good disks and remove the one disk that wasn’t on the list in a slow/annoying process of elimination.
  3. RAID is not a replacement for backup. This is a common recommendation, but I still know numerous people who treat RAID as an infallible storage solution and don’t take adequate precautions for failure of the storage array. The same applies to RAID-like filesystem technologies such as ZFS. I had backups for almost all my data, but there were a few things I hadn’t properly covered, so it was a good reminder not to succumb to hubris in one’s data redundancy setup.
  4. Don’t touch anything with a Marvell chipset in it. I’m pretty bitter about their chipsets following this whole exercise and will be staying well away from their products for the foreseeable future.
  5. Hardware can be time consuming and expensive. As a company, I’d stick to using hosted/cloud providers for almost anything these days. Sure the hourly cost can seem expensive, but not needing to deal with hardware failures is of immense value in itself. Imagine if this RAID issue had been with a colocated server, rather than a personal machine that wasn’t mission critical, the cost/time to the company would have been a large unwanted investment.

 

Restoring LVM volumes

LVM is a fantastic technology that allows block devices on Linux to be sliced into volumes(partitions) and allows for easy resizing and changing of the volumes whilst running, in a much easier fashion than traditional MSDOS or GPT partition tables. LVM is commonly used as the default at install of most popular Linux distributions due to the ease of management.

I use LVM for providing logical volumes for each virtual machine on my KVM server. This makes it easy to add, delete, resize and snapshot my virtual machines and also reduces I/O overhead compared to having virtual machines inside files ontop of a file system.

I recently had the misfortune of needing to rebuild this array after an underlying storage failure in my RAID array. Doing so manually would be time consuming and error prone, I would have to try and determine what the sizes of my different VMs used to be before the failure and then issue lvcreate commands for them all to restore the configuration.

 

LVM Configuration Backup & Restore

Thankfully I don’t need to restore configuration manually – LVM keeps a backup of all configured volumes/groups, which can be used to restore the original settings of the LVM volume group, including the physical volume settings, the volume group and all the logical volumes inside it.

There is a copy of the latest/current configuration in /etc/lvm/backup/, but also archived versions in /etc/lvm/archive/, which is useful if you’ve broken something badly and actually want an older version of the configuration.

Restoring is quite simple, assuming that firstly your server still has /etc/lvm/ in a good state, or you’ve backed it up to external media and secondly that the physical volume(s) that you are restoring to are at least as large as they were at the time of the configuration backup.

Firstly, you need to determine which configuration you want to restore. In my case, I want to restore my “vg_storage” volume group.

# ls -l /etc/lvm/archive/vg_storage_00*
-rw-------. 1 root root 13722 Oct 28 23:45 /etc/lvm/archive/vg_storage_00419-1760023262.vg
-rw-------. 1 root root 14571 Oct 28 23:52 /etc/lvm/archive/vg_storage_00420-94024216.vg
...
-rw-------. 1 root root 14749 Nov 23 15:11 /etc/lvm/archive/vg_storage_00676-394223172.vg
-rw-------. 1 root root 14733 Nov 23 15:29 /etc/lvm/archive/vg_storage_00677-187019982.vg
#

Once you’ve chosen which configuration to restore, you need to fetch the UUID of the old physical volume(s) from the LVM backup file:

# less /etc/lvm/archive/vg_storage_00677-187019982.vg
...
   physical_volumes {
      pv0 {
         id = "BgR0KJ-JClh-T2gS-k6yK-9RGn-B8Ls-LYPQP0"
...

Now we need to re-create the physical volume with pvcreate. By specifying both the UUID and the configuration file, pvcreate rebuilds the physical volume with all the original values, including extent size and other details to ensure that the volume group configuration applies correctly onto it.

# pvcreate --uuid "BgR0KJ-JClh-T2gS-k6yK-9RGn-B8Ls-LYPQP0" \
  --restorefile /etc/lvm/archive/vg_storage_00677-187019982.vg

If your volume group is made up of multiple physical volumes, you’ll need to do this for each physical volume – make sure you give the right volumes the right UUIDs if the sizes of the physical volumes differ.

Now that the physical volume(s) are created, we can restore the volume group with vgcfgrestore. Because we’ve created the physical volume(s) with the same UUIDs, it will correctly rebuild using the right volumes.

# vgcfgrestore -f /etc/lvm/archive/vg_storage_00677-187019982.vg vg_storage

At this time, you should be able to display the newly restored volume group:

# vgdisplay
  --- Volume group ---
  VG Name               vg_storage
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1044
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                30
  Open LV               11
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               4.55 TiB
  PE Size               4.00 MiB
  Total PE              1192176
  Alloc PE / Size       828160 / 3.16 TiB
  Free  PE / Size       364016 / 1.39 TiB
  VG UUID               OIs8bs-nIE1-bMTj-vSt2-ue80-jf1X-oZCYiO
#

Finally make sure you activate all logical volumes – sometimes LVM restores with volumes set to inactive so they can’t be modified/used – use vgchange -a y to activate all volumes for the volume group:

# vgchange -a y vg_storage

The array is now restored and ready to use. Run lvdisplay and you’ll see all the restored logical volumes. Of course, this is only the configuration – there’s no data in those volumes, so you still have to restore the volumes from backups (you DO have backups right?).

 

Data Backup & Restore

Once you have your volume configuration restored, you need restore your data. The way you do this is really going to depend on the nature of your data and the volume of it, however I use a combination of two of the most common approaches to backups:

  1. A regular data-only level backup to protect all the data I care about. However in the event of a storage failure, I’m left having to rebuild all my virtual machines OS images, before being able to load in the data-only backups into the created VMs.
  2. An infrequent block-level backup of all the LVM volumes, where images of each volume are dumped onto external media, which can be loaded back in once the LVM configuration has been restored.

I have a small Perl script I’ve written which does this snapshot & backup process for me. It makes use of LVM snapshots to ensure data consistency during the backup process and compresses with gzip to reduce the on disk backup size without too much CPU overhead.

#!/usr/bin/perl -w
#
# Run through a particular LVM volume group and perform a snapshot
# and compressed file backup of each volume.
#
# This script is intended for use with backing up complete system
# images of VMs, in addition to data level backups.
#

my $source_lvm_volgroup = 'vg_storage';
my $source_lvm_snapsize = '5G';
my @source_lvm_excludes = ('lv_unwanted', 'lv_tmpfiles');

my $dest_dir='/mnt/backup/snapshots';

foreach $volume (glob("/dev/$source_lvm_volgroup/*"))
{
    $volume            =~ /\/dev\/$source_lvm_volgroup\/(\S*)$/;
    my $volume_short    = $1;

    if ("$volume_short" ~~ @source_lvm_excludes)
    {
        # Excluded volume, we skip it
        print "[info] Skipping excluded volume $volume_short ($volume)\n";
        next;
    }

    print "[info] Processing volume $volume_short ($volume)\n";

    # Snapshot volume
    print "[info] Creating snapshot...\n";
    system("lvcreate -n ${volume_short}_snapshot --snapshot $volume -L $source_lvm_snapsize");

    # Write compressed backup file from snapshot, but only replace existing one once finished
    print "[info] Creating compressed snapshot file...\n";
    system("dd if=${volume}_snapshot | gzip --fast > $dest_dir/$volume_short.temp.gz");
    system("mv $dest_dir/$volume_short.temp.gz $dest_dir/$volume_short.gz");

    # Delete snapshot
    print "[info] Removing snapshot...\n";
    system("lvremove --force ${volume}_snapshot");

    print "[info] Volume $volume_short backup completed.\n";
}

print "[info] FINISHED! VolumeGroup backup completed\n";

exit 0;

Since I was doing these backups before my storage failure, by restoring the LVM volume group configuration, it was an easy process to restore all my LVM volume data with a simple matching restore script:

#!/usr/bin/perl -w
#
# Runs through LVM snapshots taken by backup_snapshot_lvm.pl and
# restores them to the LVM volume in question.
#

my $dest_lvm_volgroup = 'vg_storage';
my $source_dir = '/mnt/backup/snapshots';

print "[WARNING] Beginning restore process in 5 SECONDS!!\n";
sleep(5);

foreach $volume (glob("$source_dir/*"))
{
    $volume            =~ /$source_dir\/(\S*).gz$/;
    my $volume_short    = $1;

    print "[info] Processing volume $volume_short ($volume)\n";

    # Just need to decompress & write into LVM volume
    system("zcat $source_dir/$volume_short.gz > /dev/$dest_lvm_volgroup/$volume_short");

    print "[info] Volume $volume_short restore completed.\n";
}

print "[info] FINISHED! VolumeGroup restore completed\n";

exit 0;

Because the block-level backups are less frequent than the data ones, the way I do a restore is to restore these block level backups and then apply the latest data backup over the top of the VMs once they’re booted up again.

Using both these techniques, it can be very fast and hassles free to rebuild a large LVM volume group after a serious storage failure – I’ve used these exact scripts a few times and can restore a several TB volume with 30+ VMs in a matter of hours simply by kicking off a few commands and going out for a beer to wait for the data to transfer from the backup device.

Resize ext4 on RHEL/CentOS/EL

I use an ext4 filesystem for a couple large volumes on my virtual machines. I know that btrfs and ZFS are the new cool kids in town for filesystems, but when I’m already running RAID and LVM on the VM server underneath the guest, these newer filesystems with their fancy features don’t add a lot of value for me, so a good, simple, traditional, reliable filesystem is just what I want.

There’s actually a bit of confusion and misinformation online about using ext4 filesystems with RHEL/CentOS (EL) 5 & 6 systems, so just wanted to clarify some things for people.

ext4 uses the same tools as ext3 and ext2 before it – this means still using the e2fsprogs package which provides utilities such as e2fsck and resize2fs, which handle ext2, ext3 and ext4 filesystems with the same command. But there are a few catches with EL…

On an EL 5 system, the version of e2fsprogs is too old and doesn’t support ext4. Therefore there is an additional package called “e4fsprogs” which is actually just a newer version of e2fsprogs that provides renamed tools such as “resize4fs”. This tends to get confusing for users who then try to find these tools on EL 6 or other distributions, so it’s important to be aware that it’s a non-standard thing. Why they didn’t just upgrade the tools/backport the ext4 support is a bit weird to me, especially considering the fact these tools are extremely backwards compatible and should meet even Red Hat’s API compatibility requirements, but it-is-what-it-is sadly.

On an EL 6 system, the version of e2fsprogs is modern enough to support ext4, which means you need to use the normal “resize2fs” command to do an ext4 filesystem resize (as per the docs). Generally this works fine, but I think I may have found a bug with the stock EL 6 kernel and e2fsprog tools.

One of my filesystems almost reached 100% full, so I expanded the underlying block volume and then issued an offline e2fsck -f /dev/volume & resize2fs /dev/volume which completed as normal without error, however it left the filesystem size unchanged. Repeated checks and resize attempts made no difference. The block volume correctly reported it’s new size, so it wasn’t a case of the kernel not having seen the changed size.

However by mounting the ext4 filesystem and doing an online resize, the resize works correctly, although quite slowly, as the online resize seems to trickle-resize the disk, gradually increasing it’s size until finally complete.

This inability to resize offline is not something I’ve come across before, so may be a rare bug trigged by the full size of the filesystem that could well be fixed in newer kernels/e2fsprog packages, but figured I’d mention it for any other poor sysadmins scratching their heads over a similar issue.

Finally, be aware that you need to use either no partition table, or GPT if you want file systems over 2TB and also that the e2fsprogs package shipped with EL 6 has a 16TB limit, so you’ll have to upgrade the package manually if you need more than that.

“OpsDev”

I recently did a talk at one of the regular Fairfax “Brown Bag” lunches about tools used by the operations team and how developers can use these tools to debug some of their systems and issues.

It won’t be anything mind blowing for experienced *nix users, but it will be of interest to less experienced engineers or developers who don’t venture into server land too often.

If you’re interested, my colleague and I are both featured on the YouTube video below – my block starts at 14:00, but my colleague’s talk about R at the start may also be of interest.

Additionally, Fairfax AU has also started blogging and publishing other videos and talk like this, as well as blog posts from other people around the technology business (developers, operations, managers, etc) to try and showcase a bit more about what goes on behind the scenes in our organisation.

You can follow the Fairfax Engineering blog at engineering.fairfaxmedia.com.au or on Twitter at @FairfaxEng.

Exposing name servers with Puppet Facts

Carrying on from the last post, I needed a good reliable way to point my Nginx configuration at a DNS server to use for resolving backends. The issue is that I wanted my Puppet module to be portable across various environments, some which block outbound DNS traffic to external services and others where the networks may be redefined on a frequent basis and maintaining an accurate list of all the name servers would be difficult (eg the cloud).

I could have used dnsmasq to setup a localhost resolver, but when it comes to operational servers, simplicity is key – having yet another daemon that could crash or cause problems is never desirable if there’s a simpler way to solve the issue.

Instead I used Facter (sic), Puppet’s tool for exposing values pulled from the system into variables that can be used in your Puppet manifests or templates. The following custom fact is included in my Puppet module and is run before any configuration is applied to the host running my Nginx configuration:

#!/usr/bin/env ruby
#
# Returns a string with all the IPs of all configured nameservers on
# the server. Useful for including into applications such as Nginx.
#
# I live in mymodulenamehere/lib/facter/nameserver_list.rb
# 

Facter.add("nameserver_list") do
    setcode do
      nameserver = false

      # Find all the nameserver values in /etc/resolv.conf
      File.open("/etc/resolv.conf", "r").each_line do |line|
        if line =~ /^nameserver\s*(\S*)/
          if nameserver
            nameserver = nameserver + " " + $1
          else
            nameserver = $1
          end
        end
      end

      # If we can't get any result (bad host config?) default to a
      # public DNS server that is likely to be reachable.
      unless nameserver
        nameserver = '8.8.8.8'
      end

      nameserver
    end
end

On a system with a typically configured /etc/resolv.conf file such as:

search example.com
nameserver 192.168.0.1
nameserver 10.1.1.1

The fact will expose the nameservers in a space-delineated string such as:

# facter -p | grep 'nameserver_list'
nameserver_list => 192.168.0.1 10.1.1.1

I can then use the Fact inside my Puppet templates for Nginx to configure the resolver:

server {
    ...
    resolver <%= @nameserver_list %>;
    resolver_timeout 1s;
    ...
}

This works pretty well, but there are a couple things to watch out for:

  1. If the Fact fails to execute at all, your configuration will be broken. Having said that, it’s a very simple Fact and there’s not a lot that really could fail (eg no dependencies on other apps/non-standard resources).
  2. Linux hosts resolve DNS using the nameservers specified in the order in /etc/resolv.conf. If one fails, they move on and try the next. However Nginx differs, and just uses the list of provides nameservers in round-robin fashion. This is fine if your nameservers are all equals, but if some are more latent or less reliable than others, it could cause slight delays.
  3. You want to drop the resolver_timeout to 1 second, to ensure a failing nameserver doesn’t hold up re-resolution of DNS for too long. Remember that this re-resolution should only occur when the TTL of the DNS records for the backend has expired, so even if one DNS server is bad, it should have almost no impact to performance for your requests.
  4. Nginx isn’t going to pickup stuff in /etc/hosts using these resolvers. This should be common sense, but thought I better put that out there just-in-case.
  5. This Ruby could be better, but I’m not a dev and hacked it up in 15mins. The regex should probably also be improved to handle some of the more exotic /etc/resolv.confs that I’m sure people manage to write.

Varnish DoS vulnerability

The Varnish developers have recently announced a DoS vulnerability in Varnish (CVE-2013-4484) , if you’re using Varnish in your environment make sure you adjust your configurations to fix the vulnerability if you haven’t already.

In a test of our environment, we found many systems were protected by a default catch-all vcl_error already, but there were certainly systems that suffered. It’s a very easy issue to check for and reproduce:

# telnet failserver1 80
Trying 127.0.0.1...
Connected to failserver1.example.com.
Escape character is '^]'.
GET    
Host: foo
Connection closed by foreign host.

You will see the Varnish child dying in the system logs at the time:

Oct 31 14:11:51 failserver1 varnishd[1711]: Child (1712) died signal=6
Oct 31 14:11:51 failserver1 varnishd[1711]: child (2433) Started
Oct 31 14:11:51 failserver1 varnishd[1711]: Child (2433) said Child starts

Make sure you go and apply the fix now, upstream advise applying a particular configuration change and haven’t released a code fix yet, so distributions are unlikely to be releasing an updated package to fix this for you any time soon.

SPF with SpamAssassin

I’ve been using SpamAssassin for years, it’s a fantastic open source anti-spam tool and plugs easily into *nix operating system mail transport agents such as Sendmail and Postfix.

To stop sender address forgery, where spammers email using my domain to email either myself, or others entities, I configured SPF records for my domain some time ago. The SPF records tell other mail servers which systems are really mine, vs which ones are frauds trying to send spam pretending to be me.

SpamAssassin has a plugin that makes use of these SPF records to score incoming mail – by having strict SPF records for my domain and turning on SpamAssassin’s validation, it ensures that any spam I receive pretending to be from my domain will be blocked, as well as anyone trying to spam under the name of other domains with SPF enabled will also be blocked.

Using SpamAssassin’s scoring offers some protection against false positives – if an organisation missconfigures their mail server so that their SPF record fails, but all the other details in the email are OK, the email may still be delivered, if the content looks like ham, comes from a properly configured server, etc, even if the SPF is incorrect – generally a couple different checks need to fail in order for emails to be blacklisted.

To turn this on, you just need to ensure your SpamAssassin configuration is set to load the SPF plugin:

loadplugin Mail::SpamAssassin::Plugin::SPF

You *also* need the Perl modules Mail::SPF or Mail::SPF::Query installed – without these, SpamAssassin will silently avoid doing SPF validations and you’ll be left wondering why you’re still getting silly spam.

On CentOS/RHEL, these Perl modules are available in EPEL and you can install both with:

yum install perl-Mail-SPF perl-Mail-SPF-Query

To check if SPF validation is taking place, check the mailserver logs or the X-Spam-Status email header for SPF_PASS (or maybe SPF_FAIL!), this proves the module is loaded and running correctly.

X-Spam-Status: No, score=-1.9 required=3.5 tests=AWL,BAYES_00,SPF_PASS,
 T_RP_MATCHES_RCVD autolearn=ham version=3.3.1

Finally sit back and enjoy the quieter, spam-free(ish) inbox :-)

Puppet CRL Time Errors

Puppet is much loved for it’s clear meaningful messages when something goes wrong, made even more delightful when you combine it with the lovely error messages thrown out by OpenSSL.

Warning: SSL_connect returned=1 errno=0 state=SSLv3 read server
certificate B: certificate verify failed: [CRL is not yet valid for
/CN=host.example.com]

This error indicates that the certificate is failing to validate since the clock between the node and the puppet master differs. In my case, the clock on the node was far behind the master due to a VirtualBox clock drift issue.

In this case, it was a simple case of re-syncing the clock to resolve the issue. However if the master had been generating certs with the clock far in the future, I would have needed to re-generate my node certificates entirely as the certs would also be incorrect.

Android, the leading propietary mobile operating system

The Linux kernel has had a long history in the mobile space, with the successes and benefits of the OS in the embedded world transferring across to the smart phone and tablet market once devices evolved to a level requiring (and supporting) powerful multitasking operating systems.

But whilst there had been other Linux-based mobiles before, it wasn’t until Android was first released to the world by Google that Linux began to obtain true mass-maket consumer acceptance.  With over 1 billion devices activated by late 2013, Android is certainly the single most successful mobile Linux distribution ever and possibly even the single largest mobile OS on basis of number of devices sold.

Whilst Open Source and Free Software [By Free Software I mean software that is Libre, ie Free as in Freedom, rather than Free as in Beer] had historically succeeded strongly in the server space, it always suffered limited mass market appeal in the desktop. With the sudden emergence and success of Android, proponents of both Open Source and Free Software camps could enjoy a moment of victory and success. Sure we may not have won the desktop wars, and sure it wasn’t GNU/Linux in the traditional sense, but damnit, we had a Linux kernel in every other consumer device, something worth celebrating!

 

Whilst Android still features the Linux kernel, it differs from a conventional GNU/Linux system, as it doesn’t feature the GNU user space and applications. When building Android, Google took the Open Source Linux kernel but threw out most of the existing user space, instead building a new Apache-licensed user space designed for consumers and interaction via touch interfaces.

For Google themselves, Android was a way to prevent vendors like Microsoft or Apple getting a new monopoly in the mobile world where they could then squeeze Google out and strangle their business in the new emerging market – a world where Microsoft or Apple could dictate what browser or search engine that a user could use would not be in Google’s best financial interests and it was vital to take steps to prevent that from being possible.

The proposition to device vendors is that Android was an answer to reducing their R&D costs to compete with incumbent market players, making their devices more attractive and allowing some collaboration with their peers via means of a common application platform which would attract developers and enable a strong ecosystem, that in turn would make Android phones more attractive for consumers.

For Google and device vendors, this was a win-win relationship and it quickly began to pay off.

 

Yet even as soon as we started consuming the delicious Android desert (with maybe a slightly dubious Google advertising crust we could leave on the side), we found the taste souring with every mouthful. For whilst Google and device vendors brought into the idea of Android the operating system, they never brought into the idea of the Free Software movement which had lead to the software and community that had made this success possible in the first place.

To begin with, unlike the GNU/Linux distributions pre-dating Android which generally fostered collaboration and joint effort around a shared philosophy of working together to make a better system, Android was developed in a closed-room model, with Google and select partners developing new features in private before throwing out completed releases to coincide with new devices. It’s an approach that’s perfectly compliant with Open Source licensing, but not necessarily conducive to building a strong community.

Even the open source nature of the OS was quickly tainted, with device vendors taking Android and instead of evolving the source code as part of a community effort, they added in their own proprietary front ends and variations, shipped devices with locked boot loaders preventing OS customisation and shoved binary drivers and firmware into their device kernels.

This wasn’t the activity of just a few bad vendors either. Even Google’s own popular “Google Nexus” series targeted at developers of both applications and operating system requires proprietary blobs to get hardware such as cellular radios, WiFi, Cameras and GPUs to function. [Depending whom you ask, this is a violation of the Linux kernel’s GPLv2 license, but there is disagreement amongst kernel developers and the fears that a ban on kernel proprietary drivers will just lead to vendors moving the proprietary blobs to user space, a legally valid but still ethically dubious approach.]

Google’s main maintainer for AOSP recently departed Google over frustrations getting Qualcomm to release drivers for the 2013 revision of the popular Nexus 7 tablet, which illustrates the hurdles that developers face when getting even just the binaries from vendors.

Despite all these road blocks thrown up, a strong developer community has still managed to form around hacking on the Android source code, with particular credit to Cyanogenmod a well polished and very popular enhanced distribution of Android, Replicant which seeks to build a purely free Android OS replacing binary blobs along the way, and FDroid a popular alternative to the “Google Play” application store offering only Free Software licensed applications for download.

It’s still not perfect and there’s a lot of work left to do – projects like Cyanogenmod and Replicant still tend to need many proprietary modules if you want to make full use of the features of your device. The community is working on fixing these short comings, but it’s always much more frustrating having to play catch up to vendors, rather than working collaboratively with them.

But whilst this community effort can resolve the issue of proprietary drivers and applications and lead us to a proper Free Software Android, there is a much more tricky issue coming up which could cause far greater headaches.
In order to resolve the issue of Android version fragmentation amongst vendors causing challenges for application developers, Google has been introducing new APIs inside a package called “Google Play Services”, which is a proprietary library distributed only via the Google Play application store.

Any application that is reliant on this new library (not to mention over existing proprietary components such as Google Cloud Messaging used for push notifications) will be unable to run on pure Free Software devices that are stripped of non-free components. And whilst at the immediate moment the features offered by this API are mostly around using specific Google cloud-based APIs and features which are non-free by their very nature, there’s nothing preventing more and more features being included in this API in future, reducing the scope of applications that will run on a Free Software Android.

If Google Play Services proves to be a successful way for Google to enforce consistency and conformity on their platform to tackle the fragmentation issues they face, it’s not inconceivable that they’ll push more and more library functions into proprietary layers distributed via the Play Store like this.

 

But if Google chooses to change Android in this way, I feel that it will be inappropriate to continuing calling Android an Open Source or Free Software operating system. Instead it will be better described as a proprietary operating system with an open core – in similar fashion to that of Apple’s MacOS.

Such an evolution could lead to two distinct forks of Android being created:

  1. Propietary/Android, the version identified by the public, offered by Google and their associated vendors, a polished experience but with increasingly reduced user and developer freedoms.
  2. Free/Android, the community variations with it’s own application ecosystem that diverges away from Propietary/Android as more and more applications refuse to run on it due to Free/Android lacking libraries like Google Play Services.

Some readers will ponder why having some proprietary components is such a concern – who really wants to hack around with drivers or application compatibility APIs? Generally they’re not the most exciting part of computers [subjectively speaking of course] and on some level I can understand this mindset.

But proprietary software chunks are more than just being an annoyance to developers who want to tinker. Proprietary software makes your device opaque, obscuring what the software is doing, how it works and how it can be (ab)used.

The Google Play application has the capability to install content on your phone, a feature often used by users to install applications to their device from their browser. But does the source code of the the Google Play application ensure that it can never happen without your awareness? There’s already due cause to distrust the close association between companies like Google and the NSA, without the ability to see inside the software’s source code, you can’t be sure of it’s capabilities.

Building applications around proprietary APIs like Google Play Services removes the freedom of a user to decided to replace calls to proprietary systems to free ones. It may be preferable to use a Free Software mapping API rather than Google’s privacy lacking Maps offering for example, but without the source code, it’s not possible to make this change.

Even something as innocent as a driver or firmware of hardware such as the GSM modem could be turned into a weapon by a powerful adversary, by taking advantage of backdoors in the firmware to deliver malware to spy on an individual – whether for the “right” reasons or not, depends on your moral views and whom is doing the spying at the time.

Admittedly a pessimistic view, but I’ve laid out my personal justifications for taking this approach before and believe we need to look at how this technology could potentially (hopefully never) be used against individuals for immoral reasons.

 

I think Android illustrates the differences between Open Source and Free Software extremely well. Whilst Android is licensed under an Open Source license, it doesn’t have the same philosophy of Free Software.

It’s source code is open because it provided Google with a commercial advantage, not because Google believe that user freedom is important. Google and their partners have no qualms about making future applications and/or features proprietary, even at the detriment to developers and users by restricting their freedoms to understand and modify the software in their device.

Richard M. Stallman (RMS), the founder of the Free Software movement, wrote about the differences in Free Software vs Open Source and tells how whilst these two different ideologies have overlapping goals, at times they also differ. In some ways the terminology Open Source can be dangerous, as it lets us lose sight of the real reasons why software needs to be Free for Freedom’s sake above all.

 

Interestingly, despite how strongly I feel about Free Software,  I’ve found it somewhat personally easy to ignore concerns of proprietary software on mobiles for a prolonged period of time. In many ways, I see my mobile as  just a tool and not a serious “real” computer like my GNU/Linux laptop where I conduct most of my digital activities.  It’s possibly a result of my historical experiences with the devices, starting off using mobiles when they were just phones only and having had them slowly gain more capabilities around me, but always being seen as “phones” rather than “pocket computers”.

I’m certainly a digital native, a child of the internet generation separated from my parent’s generation by being the first to really grow up with widely available internet connectivity and computers. But to me, computers are still laptops and servers, despite having a good understanding of the mobile space and using mobile devices every day to possibly excessive amounts.

Yet for the current and next generation growing up, mobile phones and tablets are *the* computer that will define their learning experiences and interaction with the world – they may very well end up never owning a conventional computer, for the old guard of Windows, Linux and PC are gone, replaced with iOS, Android and handhelds.

It’s clear that mobiles operating systems are the platform of the future, it’s time we consider them equals with our conventional operating systems and impose the same strict demands for privacy and freedom that we have grown to expect onto the mobile space. I know that personally I don’t trust my Android mobile even one tenth as much as I trust my GNU/Linux laptop and this is unacceptable when my phone already has access to my files, my emails, my inner most private communications with others and who knows what else.

 

So the question is, how do we get from the Kinda-Propietary/Android we have now, to the Free/Android that we need?

I know there are some who will take a purist approach of running only pure Free Software Android and ignoring any applications or features that don’t run on it as-is. Unfortunately taking this approach will inevitably lead to long term discrepancies between the mass market Android OS and the Free Software purists pulling the OS feature set in different directions.

A true purist risks becoming a second class citizen – we are already at the stage where not being able to run popular applications can seriously restrict your ability to take part in our world – consider the difficulties of not being able to load applications needed to use public transport, do banking (online or NFC banking) or to communicate with friends due to all these applications requiring a freedom impacting proprietary layer.

It will be difficult to encourage users and application developers to use a Free Software Android build if they discover their existing collection of applications that rely on various proprietary APIs and library features no longer work, so we need to be somewhat pragmatic and make it easier for them to take up Free Software and still run proprietary applications on top of a free base, until such time as free alternatives arise for their applications.

I think the solution is a collection of three different (but all vital) efforts:

  1. Firstly, to support development of community Android distributions, such as Cyanogenmod and Replicant, something which has been successful so far, it’s clear that Google isn’t interested in working as equals with the community, so having a strong independent community is important for grass-roots innovation.
  2. Secondly to support the replacement of binary blobs in the core Android OS, such as the work that the Replicant project has started with writing Free Software drivers for hardware.
  3. Thirdly (and not at all least) we need to make it easy to provide the same functionality in Free/Android as Proprietary/Android by re-implementing closed source applications and libraries such as Google Play application store, Google Cloud Messaging (Push notifications) and the Google Play Services library/API.

Whether we like it or not, Google’s version of Android will be the platform than the majority of developers target long term. It doesn’t suit all developers, but it has suited most Free (as in beer and/or Freedom) and paid application developers for Android well enough for a long period already that I don’t see it being easy to de-rail that momentum.

If we can re-implement Google’s proprietary layers to a level sufficient for maintaining compatibility with the majority of these applications, it opens up some interesting possibilities. A Free/Android mobile with a Free/PlayServices API layer developed using the documented API calls published by Google is entirely possible and would allow users to run a Free/Android mobile and still maintain support for the majority of public applications being released for the Android platform, even if they use more and more proprietary API features.

Such a compatibility layer will enable users to run applications on their own terms – a user might decide to only run Free as in Freedom software, or they could decide that running proprietary software is OK sometimes -and that’s an acceptable choice, but the user is the one that should be making it, not Google or their device vendor.

Potentially we could take this idea a step further and re-implement features like contact and setting synchronisation against a Free Software server that technically capable users can choose to setup on their own servers, giving them the benefits of cloud-type technologies without loss of freedoms and privacy that takes place if using the Google proprietary features.

 

I’m not alone in these concerns – neither RMS or the Free Software Foundation (FSF) have been idle on this issue – RMS has an excellent write up on the freedom of Android here, and on a more mainstream level, the FSF is running campaigns promoting freeing Android phones and encouraging efforts to keep the platform Free as in Freedom.

I’m currently taking steps to move my Android Mobile off various proprietary dependencies to Free Software alternatives – it’s going to be slow and gradual and it will take time to determine replacements for various applications and libraries.

I haven’t done much in the way of Android application development, but I’m not afraid to pick up some Java if that’s what it takes to fill in a few gaps to get there – and if it means reverse engineering some features like Google Play Services, I’ll go down that path if need be.

Because Free Software computing is vital for privacy, vital for security and vital for a free society itself. And if the cost is a few weekends hacking at code, it’s a price well worth paying.

Ubuntu, the Windows of the Linux world

Sometimes I do wonder about if Ubuntu is actually the Windows of the Linux world, some of their design decisions, like non-closable restart windows…

Nice desktop you have there. Be a good fellow and reboot now ok?

Nice desktop you have there. Be a good fellow and reboot now ok?

Thankfully xkill closes *all* windows and leaves no survivors. It’s a shame that the general desktop environment on Ubuntu has been so cut down and over simplified over the last few years, since the server Ubuntu LTS releases are actually pretty damn good.