linux.conf.au 2014

I’ve just returned from my annual pilgrimage to linux.conf.au, which was held in Perth this year. It’s the first time I’ve been over to West Australia, it’s a whole 5 hour flight from Sydney –  longer than it takes to fly to New Zealand.

Perth’s climate is a very dry heat compared to Sydney, so although it was actually hotter than Sydney for most of the week, it didn’t feel quite as unpleasant – other than the final day which hit 45 degrees and was baking hot…

It’s also a very clean/tidy city, the well maintained nature was very noticeable with the city and gardens being immaculately trimmed – not sure if it’s always been like this, or if it’s a side effect of the mining wealth in the economy allowing the local government to afford it more effectively.

The towering metropolis of mining wealth.

The towering metropolis of mining wealth.

As usual, the conference ran for 5 full days and featured 4-5 concurrent streams of talks during the week. The quality was generally high as always, although I feel that content selection has shifted away from a lot of deep dive technical talks to more high level talks, and that OpenStack (whilst awesome) is taking up far too much of the conference and really deserves it’s own dedicated conference now.

I’ve prepared my personal shortlist of the talks I enjoyed most of all for anyone who wants to spend a bit of time watching some of the recorded sessions.

 

Interesting New(ish) Software

  1. RatticDB – A web-based password storage system written in Python written by friends in Melbourne. I’ve been trialling it and since then it’s growing in popularity and awareness, as well as getting security audits (and fixes) [video] [project homepage].
  2. MARS Light – This is an insanely awesome replacement for DRBD designed to address the issues of DRBD when replicating over slower long WAN links. Like DRBD, MARS Light is a block-level replication, so ideal for entire datacenter and VM replication. [video] [project homepage].
  3. Pettycoin – Proposal/design for an adjacent network to Bitcoin designed for microtransactions. It’s currently under development, but is an interesting idea. [video] [project homepage].
  4. Lua code in Mediawiki – the Mediawiki developers have added the ability for Wikipedia editors to write Lua code that is executed server side which is pretty insanely awesome when you think about how normally nobody wants to allow untrusted public the ability to remote execute code on systems. The developers have taken Lua and created a “safe” version that runs inside PHP with restrictions to make this possible. [video] [project homepage].
  5. OpenShift – RedHat did a demonstration on their hosted (and open source) PAAS platform, OpenShift. It’s a solution I’ve been looking at before, if you’re a developer whom doesn’t care about infrastructure management, it looks very attractive. [video] [project homepage].

 

Evolution of Linux

  1. D-Bus in the Kernel – Lennart Pottering (of Pulseaudio and SystemD fame) presented the efforts he’s been involved in to fix D-Bus’s shortcomings and move it into the kernel itself and have D-Bus as a proper high speed IPC solution for the Linux kernel. [video]
  2. The Six Stages of SystemD – Presentation by an engineer who has been moving systems to SystemD and the process he went through and his thoughts/experience with SystemD. Really showcases the value that moving to SystemD will bring to GNU/Linux distributions. [video]
  3. Development Tools & The UNIX Philosophy – Excellent talk by a Python developer on how we should stop accepting command-line only tools as being the “right” or “proper” UNIX-style tools. Some tools (eg debuggers) are just better suited for graphical interfaces, and that it still meets the UNIX philosophy of having one tool doing one thing well. I really like the argument he makes and have to agree, in some cases GUIs are just more suitable for some tasks. [video]

 

Walkthroughs and Warstories

  1. TCP Tuning for the Web – presented by one of the co-founders of Fastly showing the various techniques they use to improve the performance of TCP connections and handle issues such as DDOS attacks. Excellent talk by a very smart networking engineer. [video]
  2. Massive Scaling of Graphite – very interesting talk on the massive scaling issues involved to collect statistics with Graphite and some impressive and scary stats on the lifespans and abuse that SSDs will tolerate (which is nowhere near as much as they should!). [video]
  3. Maintaining Internal Forks – One of the FreeBSD developers spoke on how his company maintains an internal fork of FreeBSD (with various modifications for their storage product) and the challenges of keeping it synced with the current releases. Lots of common problems, such as pain of handling new upstream releases and re-merging changes. [video]
  4. Reverse engineering firmware – Mathew Garrett dug deep into vendor firmware configuration tools and explained how to reverse engineer their calls with various tools such as strace, IO and memory mapping tools. Well worth a watch purely for the fact that Matthew Garrett is an amazing speaker. [video]
  5. Android, The positronic brain – Interesting session on how to build native applications for Android devices, such as cross compiling daemons and how the internal structure of Android is laid out. [video]
  6. Rapid OpenStack Deployment – Double-length Tutorial/presentation on how to build OpenStack clusters. Very useful if you’re looking at building one. [video]
  7. Debian on AWS – Interesting talk on how the Debian project is using Amazon AWS for various serving projects and how they’re handling AMI builds. [video]
  8. A Web Page in Seven Syscalls – Excellent walk through on Varnish by one of the developers. Nothing too new for anyone who’s been using it, but a good explanation of how it works and what it’s used for. [video]

 

Other Cool Stuff

  1. Deploying software updates to ArduSat in orbit by Jonathan Oxer – Launching Arduino powered satelittes into orbit and updating them remotely to allow them to be used for educational and research purposes. What could possibly be more awesome than this? [video].
  2. HTTP/2.0 and you – Discussion of the emerging HTTP/2.0 standard. Interesting and important stuff for anyone working in the online space. [video]
  3. OpenStreetMap – Very interesting talk from the director of OpenStreetMap Team about how OpenStreetMap is used around disaster prone areas and getting the local community to assist with generating maps, which are being used by humanitarian teams to help with the disaster relief efforts. [video]
  4. Linux File Systems, Where did they come from? – A great look at the history and development cycles of the different filesytems in the Linux kernel – comparing ext1/2/3/4, XFS, ReiserFS, Btrfs and others. [video]
  5. A pseudo-random talk on entropy – Good explanation of the importance of entropy on Linux systems, but much more low level and about what tools there are for helping with it. Some cross-over with my own previous writings on this topic. [video]

Naturally there have been many other excellent talks – the above is just a selection of the ones that I got the most out from during the conference. Take a look at the full schedule to find other talks that might interest, almost all sessions got recorded during the conference.

Amberdms Billing System 2.0.1 Release

Just pushed a new stable release of the Amberdms Billing System (version 2.0.1), my open source web-based billing platform that does accounting, invoicing, ISP billing and more.

This release is mostly just a bug fix release to correct a few annoying issues, but it also has some improvements as well.
New Functionality

  • Invoices and credit notes can be downloaded via SOAP API call (thanks to Max Milaney’s contribution).
  • Database schema updater now supported hosted/multi-instance mode.

Bug Fixes

  • Service type “licenses” was missing in release 2.0.0
  • Quotes page was missing edit/delete links (issue 395)
  • Compatibility fixes for MySQL 5.6 STRICT mode.
  • Fixes to the PHP HTTPS redirect (thanks to Dmitry Smirnov)
  • Minor user interface fixes.

Other

  • Upgraded to latest Amberphplib framework.
  • Developer stats collection option provide more details about what gets sent home to developers.

The latest code and installation instructions can be found at:
https://projects.jethrocarr.com/p/oss-amberdms-bs/

You can also find the Amberdms Billing System on GitHub at:
https://github.com/jethrocarr

If you are using RHEL/CentOS 5/6, Ubuntu 12.04 LTS or Debian 7 Wheezy, you can install using your usual package manage by using my repositories at http://repos.jethrocarr.com/.

And for community support, see the mailing list at http://lists.amberdms.com/mailman/listinfo/amberdms-bs.

Encrypting disk on Android 4

Traditional computer operating systems have been around for a while, long enough that concerns around physical security have been well addressed. We understand the value and power that the information on our computers can provide to an attacker, so we have locked them down with features such as disk encryption, passphrase protected lock screens and techniques to prevent unwanted DMA attacks via high speed buses.

Yet despite the massive development of mobile devices technology in the past several years, a number of these features didn’t manage to make their way into the mobile operating systems as defaults. Whilst we take the time to setup disk encryption on our laptops and maybe desktops, we tend not to bother securing our mobile devices, possibly due to the perception of them being less risky to have exposed, or that they are less attractive targets.

Even a relatively paranoid IT geek like myself with an encrypted laptop, secure passphrases, and VPNs, still had a mobile phone that was protected with nothing more than it’s physical proximity to myself. Anyone gaining physical access to my phone could unlock it, whether it be by guessing a trivial unlock pattern, or by attaching it to another computer and reading the unencrypted filesystem.

And as these mobile devices have increased in functionality, so has the risk of an attacker getting hold of the device. When a mobile phone did nothing but phone calls and txts, having someone gain access would be more of a annoyance when they rack up a bill or prank call your contacts, than a serious risk.

But rather than leave it there, we started adding other productivity features – email, so we could keep in touch on the go. Instant messaging. Fully featured web browers that sync account details, bookmarks and history with your desktop. Banking applications. Access to shared storage solutions like Dropbox. Suddenly a mobile device is a much more attractive target.

And even if we decide that the mobile apps are too limited in scope, there’s  the risk of an attacker using the information such as credentials stored on the device to gain full access to the desktop version of these services. Having an email application that limits the phone to the inbox can reduce risk by protecting your archives, but not if the attacker can obtain your full username/passphrase from the device and then use it to gain full access with some alternative software.

Remember that obtaining credentials from a device isn’t hard – the credentials  have to be kept in some decrypted format somewhere on disk, so even if they’re hashed/obfuscated in some form, they’ll still have the key that enables them to be exposed somewhere on disk.

A quick grep through the /data/ volume on my phone revealed numerous applications that had my passphrases in plain text, extremely easy pickings for an attacker.

Mmmm plain text passwords. :-)

Mmmm plain text passwords. :-)

I was getting increasingly concerned with this hole in my security, so recently having replaced my Galaxy Nexus with a Samsung Galaxy Note II, I decided to set it up in a more secure fashion.

Android added disk encryption in Android 3, but it’s suffered two main issues that limits it’s usefulness:

  1. The disk encryption only covers the data volumes (/data, /sdcard) which is good in that it protects the data, but it still leaves the application volumes open to be exploited by anyone wanting to install malware such as key loggers.
  2. Turning on Android disk encryption then forces the user to use either a PIN or a passphrase to unlock their device as swipe or pattern unlock is disabled. For a frequent phone user this is too much of an usability issue, it makes frequent locks/unlocks much more difficult, so users may chose not to use encryption altogether, or choose a very easy/weak passphrase.

The first point I can’t do much about without digging into the low guts of Android, however the second is fixable. My personal acceptable trade-off is a weaker lock screen using a pattern, but being able to have a secure disk encryption passphrase. This ensures that if powered off, an attacker can’t exploit my data and the passphrase is long and secure, but if the phone is running, I take a compromise of security for convenience and ease of use.

There’s still the risk of an attacker installing malware on the non-encrypted OS portion of the mobile device, however if I lose physical access of my phone in an untrusted environment (eg border security confiscation) I can reload the OS from backup.

To setup disk encryption on Android 4 without losing pattern unlock, instead of adjusting via the settings interface, you need to enable it via the shell -easiest way is via the ADB shell in root mode.

Firstly you need to enable developer mode in Settings -> About Phone by tapping the build number multiple times, until it tells you that the developer mode has been unlocked. Then inside Settings -> Developer options, change the “Root Access” option to “Apps and ADB”.

Enable ADB root for all the fun stuff!

Enable ADB root for all the fun stuff!

Secondly, you need a workstation running the latest version of ADB (ships with the Android ADK under platform-tools) and to connect your phone via USB. Once done, you can enable disk encryption with the following commands (where PASSWORD is the desired encryption passphrase).

user@laptop # adb root
user@laptop # adb shell
root@phone:/ #
root@phone:/ # /system/bin/vdc cryptfs enablecrypto inplace PASSWORD

Your Android device will then restart and encrypt itself. This process takes time, factor up to an hour for it to complete it’s work.

Android phone undergoing encryption; and subsequent boot with encryption enabled.

Android phone undergoing encryption; and subsequent boot with encryption enabled.

Once rebooted, your existing pattern based unlock continues to work fine and all your private data and credentials are now secured.

Recovering SW RAID with Ubuntu on Amazon AWS

Amazon’s AWS cloud service is a very popular and generally mature offering, but it does have it’s issues at times – in particular it’s storage options and limited debug facilities.

When using AWS, you have three main storage options for your instances (virtual machine servers):

  1. Ephemeral disk , storage attached locally to your instance which is lost at shutdown or if the instance terminates unexpectedly. A fixed amount is included with your instance, the size set depending on your instance size.
  2. Elastic Block Storage (EBS) which is a network-attached block storage exposed to your Linux instance as if it was a traditional local disk.
  3. EBS with provisioned IOPs – the same as the above, but with guarantees around performance – for a price of course. ;-)

With EBS, there’s no need to use RAID from a disk reliability perspective- the EBS volume itself has it’s own underlying redundancy (although one should still perform snapshots and backups to handle end user failure or systematic EBS failure), which is the common reason for using RAID with conventional physical hosts.

So with RAID being pointless for redundancy in an Amazon world, why write about recovering hosts in AWS using software RAID? Because there are still situations where you may end up using it for purposes other than redundancy:

  1. Poor man’s performance gains – EBS provisioned IOPs are the proper way of getting performance from EBS to meet your particular requirements. But it comes with a cost attached – you pay increasingly more for faster disk, but also need proportionally larger disks minimum sizes to go with the higher speeds (10:1 ratio IOPs:size) which can quickly make a small fast volume prohibitively expensive. A software RAID array can allow you to get more performance by combining numerous small volumes together at low cost.
  2. Merging multiple EBS volumes – EBS volumes have an Amazon-imposed limit of 1TB per volume. If a single filesystem of more than 1TB is required, either LVM or software RAID is needed to merge them.
  3. Merging multiple ephemeral volumes – software RAID can be used to also merge the multiple EBS volumes that Amazon provides on some larger instances. However being ephemeral, if your RAID gets degraded, there’s no need to repair it – just destroy the instance and build a nice new one.

So whilst using software RAID with your AWS Instances can be a legitimate exercise, it can also introduce it’s own share of issues.

Firstly you can no longer use EBS snapshotting to do backups of the EBS volumes, unless you first halt the entire RAID array/freeze the filesystem writes for the duration of all the snapshots to be created – which depending on your application may or may not be feasible.

Secondly you now have the issue of increased complexity of your I/O configuration. If using automation to build your instances, you need to do additional work to handle the setup of the array which is a one-time investment, but the use of RAID also adds complexity to the maintenance (such as resizes) and increases the risk of a fault occurring.

I recently had the excitement/misfortune of such an experience. We had a pair of Ubuntu 12.04 LTS instances using GlusterFS to provide a redundant NFS mount to some of our legacy applications running in AWS (AWS unfortunately lacks a hosted NFS filer service). To provide sufficient speed to an otherwise small volume, RAID 0 had been used with a number of small EBS volumes.

The RAID array was nearly full, so a resize/grow operation was required. This is not an uncommon requirement and just involves adding an EBS volume to the instance, growing the RAID array size and expanding the filesystem on top. Unfortunately something nasty happened between Gluster and the Linux kernel, where the RAID resize operation on one of the two hosts suddenly triggered a kernel panic and failed, killing the host. I wasn’t able to get the logs for it, but at this stage it looks like gluster tried to do some operation right when the resize was active and instead of being blocked, triggered a panic.

Upon a subsequent restart, the host didn’t come back online. Connecting to the AWS Instance’s console output (ec2-get-console-output <instanceid>) showed that the RAID array failure was preventing the instance from booting back up, even through it was an auxiliary mount, not the root filesystem or anything required to boot.

The system may have suffered a hardware fault, such as a disk drive
failure.  The root device may depend on the RAID devices being online. One
or more of the following RAID devices are degraded:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : inactive xvdn[9](S) xvdm[7](S) xvdj[4](S) xvdi[3](S) xvdf[0](S) xvdh[2](S) xvdk[5](S) xvdl[6](S) xvdg[1](S)
      13630912 blocks super 1.2

unused devices: <none>
Attempting to start the RAID in degraded mode...
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
[31761224.516958] bio: create slab <bio-1> at 1
[31761224.516976] md/raid:md0: not clean -- starting background reconstruction
[31761224.516981] md/raid:md0: reshape will continue
[31761224.516996] md/raid:md0: device xvdm operational as raid disk 7
[31761224.517002] md/raid:md0: device xvdj operational as raid disk 4
[31761224.517007] md/raid:md0: device xvdi operational as raid disk 3
[31761224.517013] md/raid:md0: device xvdf operational as raid disk 0
[31761224.517018] md/raid:md0: device xvdh operational as raid disk 2
[31761224.517023] md/raid:md0: device xvdk operational as raid disk 5
[31761224.517029] md/raid:md0: device xvdl operational as raid disk 6
[31761224.517034] md/raid:md0: device xvdg operational as raid disk 1
[31761224.517683] md/raid:md0: allocated 10592kB
[31761224.517771] md/raid:md0: cannot start dirty degraded array.
[31761224.518405] md/raid:md0: failed to run raid set.
[31761224.518412] md: pers->run() failed ...
mdadm: failed to start array /dev/md0: Input/output error
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
Could not start the RAID in degraded mode.
Dropping to a shell.

BusyBox v1.18.5 (Ubuntu 1:1.18.5-1ubuntu4.1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

Dropping to a shell during bootup problems is an approach that has differing perspectives – personally I want my hosts to boot regardless of how messed up things are so I can get SSH, but others differ and prefer the safety of halting and dropping to a recovery shell for the sysadmin to resolve. Ubuntu is configured to do the latter by default.

But regardless of your views on this subject, dropping to a shell leaves you stuck when running AWS instances, since there is no way to interact with this console – Amazon doesn’t have a proper console for interacting with instances like a traditional VPS provider, you’re limited to only seeing the console log.

Ubuntu’s documentation actually advises that in the event of a failed RAID array, you can still force a boot by setting a kernel option bootdegraded=true. This helps if the array was degraded, but in this case the array had entirely failed, rather than being degraded, and Ubuntu treats that differently.

Thankfully it is possible to recover the failed instance, by attaching it’s volume to another instance, adjusting the initramfs to allow booting even whilst the RAID is failed and then once booted, you can do a repair on the host itself.

To do this repair you require an additional Linux instance to use as a recovery host and the Amazon CLI tools to be installed on your workstation.

# Set some variables with your instance IDs (eg i-abcd3)
export FAILED=setme
export RECOVERY=setme

# Fetch the root filesystem EBS volume ID and set a var with it:
export VOLUME=vol-setme

# Now stop the failed instance, so we can detach it's root volume.
# (Note: wait till status goes from "stopping" to "stopped")
ec2-stop-instances --force $FAILED
ec2-describe-instances $FAILED | grep INSTANCE | awk '{ print $5 }'

# Attach the root volume to the recovery host as /dev/sdo
ec2-detach-volume $VOLUME -i $FAILED
ec2-attach-volume $VOLUME -i $RECOVERY -d /dev/sdo

# Mount the root volume on the recovery host
ssh recoveryhost.example.com
mkdir /mnt/recovery
mount /dev/sdo /mnt/recovery

# Disable raid startup scripts for initramfs/initrd. We need to
# unpack the old file and modify the startup scripts inside it.
cp /mnt/recovery/boot/initrd.img-LATESTHERE-virtual /tmp/initrd-old.img
cd /tmp/
mkdir initrd-test
cd initrd-test
cpio --extract < ../initrd-old.img
vim scripts/local-premount/mdadm
- degraded_arrays || exit 0
- mountroot_fail || panic "Dropping to a shell."
+ #degraded_arrays || exit 0
+ #mountroot_fail || panic "Dropping to a shell."
find . | cpio -o -H newc > ../initrd-new.img
cd ..
gzip initrd-new.img
cp initrd-new.img.gz /mnt/recovery/boot/initrd.img-LATESTHERE-virtual

# Disable mounting of filesystem at boot (otherwise startup process
# will fail despite the array being skipped).
vim /mnt/recovery/etc/fstab
- /dev/md0    /mnt/myraidarray    xfs    defaults    1    2
+ #/dev/md0    /mnt/myraidarray    xfs    defaults    1    2

# Work done, umount volume.
umount /mnt/recovery

# Re-attach the root volume back to the failed instance
ec2-detach-volume $VOLUME -i $RECOVERY
ec2-attach-volume $VOLUME -i $FAILED -d /dev/sda1

# Startup the failed instance.
# (Note: Wait for status to go from pending to running)
ec2-start-instances $FAILED
ec2-describe-instances $FAILED | grep INSTANCE | awk '{ print $5 }'

# Watch the startup console. Note: java.lang.NullPointerException
# means that there is no output from the console yet.
ec2-get-console-output $FAILED

# Host should startup, you can get access via SSH and repair RAID
# array via usual means.

The above is very Ubuntu-specific, but the techniques shown are transferable to other platforms as well – just note that the scripts inside the initramfs/initrd will vary per distribution, it’s one of the components of a GNU/Linux system that is completely specific to the distribution vendor.

Route53 with NamedManager 1.8.0

Just released NamedManager 1.8.0, my open source web-based DNS management tool. This release fixes some bugs with MySQL 5.6 and internationalized domain names, but also includes support for using Amazon AWS Route53 alongside the existing Bind9 support.

Just add a name server entry with the type of Route53 and your Amazon credentials and a background process will sync all DNS changes to Route53. You can mix and match thanks to the groups feature, so if you want some zones going to both Bind9 and Route53 and others going to just Route53 or Bind9, you can do so.

NamedManager, now with cloudy goodness.

NamedManager, now with cloudy goodness.

As always, the easiest installation is from the provided RPMs, however you can also install from tarball or from Git – just refer to the installation documentation.

This feature is considered stable, however it is new, so be wary for bugs and issues – and report any issues you encounter back to me via email or the project manager issue tracker.

Jconsole to remote servers, easily

At work I support a number of different Java applications. Sadly they’re not all well behaved and it’s sometimes necessary to have to connect to the JMX port with Jconsole and take a look at what’s going on.

Jconsole isn't as pretty as newer services like Newrelic, but it's always there for you.

Jconsole isn’t as pretty as newer services like Newrelic, but it’s always there for you and information isn’t delayed.

This is usually easy enough when dealing with a local application, or an application on a LAN with very lax firewall policies, however in a hardened hosting environment with tight policies, it can be more difficult.

Even if the JMX port is permitted by your firewall policies, there’s another challange due to Jconsole needing to also connect to the RMI port, which can vary and is unlikely to be included in your firewall policies. This also makes it hard to do SSH port forwarding, since you have to find out what port is in use each time, and it just gets messy.

Thankfully there is an easier way using good old SSH – this will work with GNU/Linux and MacOS (Windows users will need to figure out the equivalent Putty configuration).

Firstly, open an SSH SOCKS proxy connection with:

ssh -D 1234 myjavapp.example.com

Secondly in a different shell, launch Jconsole, passing parameters to use the SOCKS proxy for all connections. Assuming that the JMX is listening on 7199, you’d end up with:

jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=1234 \
          service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi

And you’re in, through any firewalls and SSH ensures your connection is properly encrypted.  If opening multiple Jconsole connections, you just need to establish a different SOCKS proxy port connection each time.

If your application isn’t currently providing a JMX listening port, the following configuration will setup the JMX port. Note that this configuration has no authentication, so anyone with an account on the server could connect to the JMX, or if the server lacks a firewall, from other network locations.

-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=7199 \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false

These parameters need to get added to the Java startup parameters for your application… this varies a lot by application/platform, but look for JAVA_OPTS or a startup script of some kind and trace through form there.

Adventures in I/O Hell

Earlier this year I had a disk fail in my NZ-based file & VM server. This isn’t something unexpected, the server has 12x hard disks, which are over 2 years old, some failures are to be expected over time. I anticipated just needing to arrange the replacement of one disk and being able to move on with my life.

Of course computers are never as simple as this and it triggered a delightful adventure into how my RAID configuration is built, in which I encountered many lovely headaches and problems, both hardware and software.

My beautiful baby

Servers are beautiful, but also such hard work sometimes.

 

The RAID

Of the 12x disks in my server, 2x are for the OS boot drives and the remaining 10x are pooled together into one whopping RAID 6 which is then split using LVM into volumes for all my virtual machines and also file server storage.

I could have taken the approach of splitting the data and virtual machine volumes apart, maybe giving the virtual machines a RAID 1 and the data a RAID 5, but I settled on this single RAID 6 approach for two reasons:

  1. The bulk of my data is infrequently accessed files but most of the activity is from the virtual machines – by mixing them both together on the same volume and spreading them across the entire RAID 6, I get to take advantage of the additional number of spindles. If I had separate disk pools, the VM drives would be very busy, the file server drives very idle, which is a bit of a waste of potential performance.
  2. The RAID 6 redundancy has a bit of overhead when writing, but it provides the handy ability to tolerate the failure and loss of any 2 disks in the array, offering a bit more protection than RAID 5 and much more usable space than RAID 10.

In my case I used Linux’s Software RAID – whilst many dislike software RAID, the fact is that unless I was to spend serious amounts of money on a good hardware RAID card with a large cache and go to the effort of getting all the vendor management software installed and monitored, I’m going to get little advantage over Linux Software RAID, which consumes relatively little CPU. Plus my use of disk encryption destroys any minor performance advantages obtained by different RAID approaches anyway, making the debate somewhat moot.

The Hardware

To connect all these drives, I have 3x SATA controllers installed in the server:

# lspci | grep -i sata
00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
02:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
03:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)

The onboard AMD/ATI controller uses the standard “ahci” Linux kernel driver and the Marvell controllers use the standard “sata_mv” Linux kernel driver. In both cases, the controllers are configured purely in JBOD mode, which exposes each drive as-is to the OS, with no hardware RAID or abstraction taking place.

You can see how the disks are allocated to the different PCI controllers using the sysfs filesystem on the server:

# ls -l /sys/block/sd*
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sda -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sdb -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host1/target1:0:0/1:0:0:0/block/sdb
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdc -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host2/target2:0:0/2:0:0:0/block/sdc
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sdd -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host3/target3:0:0/3:0:0:0/block/sdd
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sde -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host4/target4:0:0/4:0:0:0/block/sde
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdf -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host5/target5:0:0/5:0:0:0/block/sdf
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdg -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdg
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdh -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdh
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdi -> ../devices/pci0000:00/0000:00:11.0/host8/target8:0:0/8:0:0:0/block/sdi
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdj -> ../devices/pci0000:00/0000:00:11.0/host9/target9:0:0/9:0:0:0/block/sdj
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdk -> ../devices/pci0000:00/0000:00:11.0/host10/target10:0:0/10:0:0:0/block/sdk
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdl -> ../devices/pci0000:00/0000:00:11.0/host11/target11:0:0/11:0:0:0/block/sdl

There is a single high-wattage PSU driving the host and all the disks, in a well designed Lian-Li case with proper space and cooling for all these disks.

The Failure

In March my first disk died. In this case, /dev/sde attached to the host AMD controller failed and I decided to hold off replacing it for a couple of months as it’s internal to the case and I would be visiting home in May in just a month or so. Meanwhile the fact that RAID 6 has 2x parity disks would ensure that I had at least a RAID-5 level of protection until then.

However a few weeks later, a second disk in the same array also failed. This become much more worrying, since two bad disks in the array removed any redundancy that the RAID array offered me and would mean that any further disk failures would cause fatal data corruption of the array.

Naturally due to Murphy’s Law, a third disk then promptly failed a few hours later triggering a fatal collapse of my array, leaving it in a state like the following:

md2 : active raid6 sdl1[9] sdj1[8] sda1[4] sdb1[5] sdd1[0] sdc1[2] sdh1[3](F) sdg1[1] sdf1[7](F) sde1[6](F)
      7814070272 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/7] [UUU_U__UUU]

Whilst 90% of my array is regularly backed up, there was some data that I hadn’t been backing up due to excessive size that wasn’t fatal, but nice to have. I tried to recover the array by clearing the faulty flag on the last-failed disk (which should still have been consistent/in-line with the other disks) in order to bring it up so I could pull the latest data from it.

mdadm --assemble /dev/md2 --force --run /dev/sdl1 /dev/sdj1 /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sdc1 /dev/sdh1 /dev/sdg1
mdadm: forcing event count in /dev/sdh1(3) from 6613460 upto 6613616
mdadm: clearing FAULTY flag for device 6 in /dev/md2 for /dev/sdh1
mdadm: Marking array /dev/md2 as 'clean'

mdadm: failed to add /dev/sdh1 to /dev/md2: Invalid argument
mdadm: failed to RUN_ARRAY /dev/md2: Input/output error

Sadly this failed in a weird fashion, where the array state was cleared to OK, but /dev/sdh was still failed and missing from the array, leading to corrupted data and all attempts to read the array successfully being hopeless. At this stage the RAID array was toast and I had no other option than to leave it broken for a few weeks till I could fix it on a scheduled trip home. :-(

Debugging in the murk of the I/O layer

Having lost 3x disks, it was unclear as to what the root cause of the failure in the array was at this time. Considering I was living in another country I decided to take the precaution of ordering a few different spare parts for my trip, on the basis that spending a couple hundred dollars on parts I didn’t need would be more useful than not having the part I did need when I got back home to fix it.

At this time, I had lost sde, sdf and sdh – all three disks belonging to a single RAID controller, as shown by /sys/block/sd, and all of them Seagate Barracuda disks.

lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sde -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host4/target4:0:0/4:0:0:0/block/sde
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdf -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host5/target5:0:0/5:0:0:0/block/sdf
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdg -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdg
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdh -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdh

Hence I decided to order 3x 1TB replacement Western Digital Black edition disks to replace the failed drives, deliberately choosing another vendor/generation of disk to ensure that if it was some weird/common fault with the Seagate drives it wouldn’t occur with the new disks.

I also placed an order of 2x new SATA controllers, since I had suspicions about the fact that numerous disks attached to one particular controller had failed. The two add in PCI-e controllers were somewhat simplistic ST-Labs cards, providing 4x SATA II ports in a PCI-e 4x format, using a Marvell 88SX7042 chipset. I had used their SiI-based PCI-e 1x controllers successfully before and found them good budget-friendly JBOD controllers, but being a budget no-name vendor card, I didn’t hold high hopes for them.

To replace them, I ordered 2x Highpoint RocketRaid 640L cards, one of the few 4x SATA port options I could get in NZ at the time. The only problem with these cards was that they were also based on the Marvell 88SX7042 SATA chipset, but I figured it was unlikely for a chipset itself to be the cause of the issue.

I considered getting a new power supply as well, but the fact that I had not experienced issues with any disks attached to the server’s onboard AMD SATA controller and that all my graphing of the power supply showed the voltages solid, I didn’t have reason to doubt the PSU.

When back in Wellington in May, I removed all three disks that had previously exhibited problems and installed the three new replacements. I also replaced the 2x PCI-e ST-Labs controllers with the new Highpoint RocketRaid 640L controllers. Depressingly almost as soon as the hardware was changed and I started my RAID rebuild, I started experiencing problems with the system.

May  4 16:39:30 phobos kernel: md/raid:md2: raid level 5 active with 7 out of 8 devices, algorithm 2
May  4 16:39:30 phobos kernel: created bitmap (8 pages) for device md2
May  4 16:39:30 phobos kernel: md2: bitmap initialized from disk: read 1 pages, set 14903 of 14903 bits
May  4 16:39:30 phobos kernel: md2: detected capacity change from 0 to 7000493129728
May  4 16:39:30 phobos kernel: md2:
May  4 16:39:30 phobos kernel: md: recovery of RAID array md2
May  4 16:39:30 phobos kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
May  4 16:39:30 phobos kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May  4 16:39:30 phobos kernel: md: using 128k window, over a total of 976631296k.
May  4 16:39:30 phobos kernel: unknown partition table
...
May  4 16:50:52 phobos kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:50:52 phobos kernel: ata7.00: failed command: IDENTIFY DEVICE
May  4 16:50:52 phobos kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
May  4 16:50:52 phobos kernel:         res 40/00:24:d8:1b:c6/00:00:02:00:00/40 Emask 0x4 (timeout)
May  4 16:50:52 phobos kernel: ata7.00: status: { DRDY }
May  4 16:50:52 phobos kernel: ata7: hard resetting link
May  4 16:50:52 phobos kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:50:52 phobos kernel: ata7.00: configured for UDMA/133
May  4 16:50:52 phobos kernel: ata7: EH complete
May  4 16:51:13 phobos kernel: ata16.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:51:13 phobos kernel: ata16.00: failed command: IDENTIFY DEVICE
May  4 16:51:13 phobos kernel: ata16.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
May  4 16:51:13 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May  4 16:51:13 phobos kernel: ata16.00: status: { DRDY }
May  4 16:51:13 phobos kernel: ata16: hard resetting link
May  4 16:51:18 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:51:23 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:51:23 phobos kernel: ata16: hard resetting link
May  4 16:51:28 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:51:33 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:51:33 phobos kernel: ata16: hard resetting link
May  4 16:51:39 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:52:08 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:52:08 phobos kernel: ata16: limiting SATA link speed to 3.0 Gbps
May  4 16:52:08 phobos kernel: ata16: hard resetting link
May  4 16:52:13 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:52:13 phobos kernel: ata16: reset failed, giving up
May  4 16:52:13 phobos kernel: ata16.00: disabled
May  4 16:52:13 phobos kernel: ata16: EH complete
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 c8 00 00 01 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: Disk failure on sdk, disabling device.
May  4 16:52:13 phobos kernel: md/raid:md2: Operation continuing on 6 devices.
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396112 on sdk).
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396280 on sdk).
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 c9 00 00 04 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396288 on sdk)
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84397304 on sdk).
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 cd 00 00 03 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84397312 on sdk).
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84398072 on sdk).
May  4 16:53:14 phobos kernel: ata18.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata15.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata18.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata18.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata18.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata18: hard resetting link
May  4 16:53:14 phobos kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata17.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata17.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata17.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata17: hard resetting link
May  4 16:53:14 phobos kernel: ata15.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata15.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata15.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata15: hard resetting link
May  4 16:53:15 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:15 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:53:15 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:20 phobos kernel: ata15.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata17.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata18.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata17.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata17: hard resetting link
May  4 16:53:20 phobos kernel: ata15.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata15.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata15: hard resetting link
May  4 16:53:20 phobos kernel: ata18.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata18.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata18: hard resetting link
May  4 16:53:21 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:21 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:53:21 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:31 phobos kernel: ata15.00: qc timeout (cmd 0xec)
May  4 16:53:31 phobos kernel: ata17.00: qc timeout (cmd 0xec)
May  4 16:53:31 phobos kernel: ata18.00: qc timeout (cmd 0xec)
May  4 16:53:32 phobos kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata15.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata15.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata15: limiting SATA link speed to 1.5 Gbps
May  4 16:53:32 phobos kernel: ata15: hard resetting link
May  4 16:53:32 phobos kernel: ata17.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata17: limiting SATA link speed to 3.0 Gbps
May  4 16:53:32 phobos kernel: ata17: hard resetting link
May  4 16:53:32 phobos kernel: ata18.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata18.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata18: limiting SATA link speed to 1.5 Gbps
May  4 16:53:32 phobos kernel: ata18: hard resetting link
May  4 16:53:32 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
May  4 16:53:33 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
May  4 16:53:33 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
May  4 16:54:05 phobos kernel: ata15: EH complete
May  4 16:54:05 phobos kernel: ata17: EH complete
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] Unhandled error code
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdj, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 5 devices.
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] Unhandled error code
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdl, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 4 devices.
May  4 16:54:05 phobos kernel: ata18: EH complete
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] Unhandled error code
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdm, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 4 devices.
May  4 16:54:05 phobos kernel: md: md2: recovery done.

The above logs demonstrate my problem perfectly. After rebuilding the array, a disk would fail and lead to a cascading failure where other disks would throw up “failed to IDENTIFY” errors until they eventually all failed one-by-one until all the disks attached to that controller had vanished from the server.

At this stage I knew the issue wasn’t caused by all the hard disk themselves, but there was the annoying fact that at the same time, I couldn’t trust all the disks – there were certainly at least a couple bad hard drives in my server triggering a fault, but many of the disks marked as bad by the server were perfectly fine when tested in a different system.

Untrustworthy Disks

Untrustworthy Disks

By watching the kernel logs and experimenting with disk combinations and rebuilds, I was able to determine that the first disk that always failed was usually faulty and did need replacing. However the subsequent disk failures were false alarms and seemed to be part of a cascading triggered failure on the controller that the bad disk was attached to.

Using this information, it would be possible for me to eliminate all the bad disks by removing the first one to fail each time till I had no more failing disks. But this would be a false fix, if I was to rebuild the RAID array in this fashion, everything would be great until the very next disk failure occurs and leads to the whole array dying again.

Therefore I had to find the root cause and make a permanent fix for the solution. Looking at what I knew about the issue, I could identify the following possible causes:

  1. A bug in the Linux kernel was triggering cascading failures of RAID rebuilds (bug in either software RAID or in AHCI driver for my Marvell controller).
  2. The power supply on the server was glitching under the load/strain of a RAID rebuild and causing brownouts to the drives.
  3. The RAID controller had some low level hardware issue and/or an issue exhibited when one failed disk sent a bad signal to the controller.

Frustratingly I didn’t have any further time to work on it whilst I was in NZ in May, so I rebuilt the minimum array possible using the onboard AMD controller to restore my critical VMs and had to revisit the issue again in October when back in the country for another visit.

Debugging hardware issues and cursing the lower levels of the OSI layers.

Debugging disks and cursing hardware.

When I got back in October, I decided to eliminate all the possible issues, no matter how unlikely…. I was getting pretty tired of having to squeeze all my stuff into only a few TB, so wanted a fix no matter what.

Kernel – My server was running the stock CentOS 6 2.6.32 kernel which was unlikely to have any major storage subsystem bugs, however as a precaution I upgraded to the latest upstream kernel at the time of 3.11.4.

This failed to make any impact, so I decided that it would be unlikely I’m the first person to discover a bug in the Linux kernel AHCI/Software-RAID layer and marked the kernel as an unlikely source of the problem. The fact that when I added a bad disk to the onboard AMD server controller I didn’t see the same cascading failure, also helped eliminate a kernel RAID bug as the cause, leaving only the possibility of a bug in the “sata_mv” kernel driver that was specific to the controllers.

Power Supply – I replaced the few year old power supply in the server with a newer higher spec good quality model. Sadly this failed to resolve anything, but it proved that the issue wasn’t power related.

SATA Enclosure – 4x of my disks were in a 5.25″ to 3.5″ hotswap enclosure. It seemed unlikely that a low level physical device could be the cause, but I wanted to eliminate all common infrastructure possibility, and all the bad disks tended to be in this enclosure. By juggling the bad disks around, I was able to confirm that the issue wasn’t specific to this one enclosure.

At this time the only hardware in common left over was the two PCI-e RAID controllers – and of course the motherboard they were connected via. I really didn’t want to replace the motherboard and half the system with it, so I decided to consider the possibility that the fact the old ST-Lab and new RocketRaid controllers both used the same Marvell 88SX7042 chipset meant that I hadn’t properly eliminated the controllers as being the cause of the issue.

After doing a fair bit of research around the Marvell 88SX7042 online, I did find a number of references to similar issues with other Marvell chipsets, such as issues with the 9123 chipset series, issues with the 6145 chipset and even reports of issues with Marvell chipsets on FreeBSD.

The Solution

Having traced something dodgy back to the Marvell controllers I decided the only course of action left was to source some new controller cards. Annoyingly I couldn’t get anything that was 4x SATA ports in NZ that wasn’t either Marvell chipset based or extremely expensive hardware RAID controllers offering many more features than I required.

I ended up ordering 2x Syba PCI Express SATA II 4-Port RAID Controller Card SY-PEX40008 which are SiI 3124 chipset based controllers that provide 4x SATA II ports on a single PCIe-1x card. It’s worth noting that these controllers are actually PCI-X cards with an onboard PCI-X to PCIe-1x converter chip which is more than fast enough for hard disks, but could be a limitation for SSDs.

Marvell Chipset-based controller (left) and replacement SiI based controller (right).

Marvell Chipset-based controller (left) and replacement SiI based controller (right). The SiI controller is much larger thanks to the PCI-X to PCI-e converter chip on it’s PCB.

I selected these cards in particular, as they’re used by the team at Backblaze (a massive online back service provider) for their storage pods, which are built entirely around giant JBOD disk arrays. I figured with the experience and testing this team does, any kit they recommend is likely to be pretty solid (check out their design/decision blog post).

# lspci | grep ATA
00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
03:04.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
05:04.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)

These controllers use the “sata_sil24” kernel driver and are configured as simple JBOD controllers in the exact same fashion as the Marvell cards.

To properly test them, I rebuilt an array with a known bad disk in it. As expected, at some point during the rebuild, I got a failure of this disk:

Nov 13 10:15:19 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x3fe SErr 0x0 action 0x6
Nov 13 10:15:19 phobos kernel: ata14.00: irq_stat 0x00020002, device error via SDB FIS
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:08:00:c5:c7/01:00:02:00:00/40 tag 1 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 41/40:00:2d:c5:c7/00:01:02:00:00/00 Emask 0x409 (media error) <F>
Nov 13 10:15:19 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:19 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x3fe SErr 0x0 action 0x6
Nov 13 10:15:19 phobos kernel: ata14.00: irq_stat 0x00020002, device error via SDB FIS
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:08:00:c5:c7/01:00:02:00:00/40 tag 1 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 41/40:00:2d:c5:c7/00:01:02:00:00/00 Emask 0x409 (media error) <F>
Nov 13 10:15:19 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:19 phobos kernel: ata14.00: error: { UNC }
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:10:00:c6:c7/01:00:02:00:00/40 tag 2 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 9c/13:04:04:00:00/00:00:00:20:9c/00 Emask 0x2 (HSM violation)
Nov 13 10:15:19 phobos kernel: ata14.00: status: { Busy }
Nov 13 10:15:19 phobos kernel: ata14.00: error: { IDNF }
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/c0:18:00:c7:c7/00:00:02:00:00/40 tag 3 ncq 98304 in
Nov 13 10:15:19 phobos kernel:         res 9c/13:04:04:00:00/00:00:00:30:9c/00 Emask 0x2 (HSM violation)
...
Nov 13 10:15:41 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 13 10:15:41 phobos kernel: ata14.00: irq_stat 0x00060002, device error via D2H FIS
Nov 13 10:15:41 phobos kernel: ata14.00: failed command: READ DMA
Nov 13 10:15:41 phobos kernel: ata14.00: cmd c8/00:00:00:c5:c7/00:00:00:00:00/e2 tag 0 dma 131072 in
Nov 13 10:15:41 phobos kernel:         res 51/40:00:2d:c5:c7/00:00:02:00:00/02 Emask 0x9 (media error)
Nov 13 10:15:41 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:41 phobos kernel: ata14.00: error: { UNC }
Nov 13 10:15:41 phobos kernel: ata14.00: configured for UDMA/100
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Unhandled sense code
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Sense Key : Medium Error [current] [descriptor]
Nov 13 10:15:41 phobos kernel: Descriptor sense data with sense descriptors (in hex):
Nov 13 10:15:41 phobos kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Nov 13 10:15:41 phobos kernel:        02 c7 c5 2d 
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Add. Sense: Unrecovered read error - auto reallocate failed
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] CDB: Read(10): 28 00 02 c7 c5 00 00 01 00 00
Nov 13 10:15:41 phobos kernel: ata14: EH complete
Nov 13 10:15:41 phobos kernel: md/raid:md5: read error corrected (8 sectors at 46646648 on sdk)

The known faulty disk failed as expected, but rather than suddenly vanishing from the SATA controller, the disk remained attached and proceeded to spew out seek errors for the next few hours, slowing down the rebuild process.

Most importantly, no other disks failed at any stage following the move from the Marvell to the SiI controller. Having assured myself with the stability and reliability of the controller, I was able to add in other disks I weren’t sure had or had not failed and quickly eliminate the truly faulty disks.

My RAID array rebuild completed successfully and I was then able to restore the LVM volume and all my data and restore full services on my system.

With the data restored, time for a delicious IPA.

With the data restored successfully at last, time for a delicious local IPA to celebrate victory at long last.

Having gone through the pain and effort to recover and rebuild this system, I had to ask myself a few questions about my design decisions and what I could do to avoid a fault like this again in future.

Are the Marvell controllers shit?

Yes. Next question?

In seriousness, the fact that two completely different PCI-e card vendors exhibited the exact same problem with this controller shows that it’s not going to be the fault of the manufacturer of the controller card and must be something with the controller itself or the kernel driver operating it.

It is *possible* that the Linux kernel has a bug in the sata_mv controller, but I don’t believe that’s the likely cause due to the way the fault manifests. Whenever a disk failed, the disks would slowly disappear from the kernel entirely, with the kernel showing surprise at suddenly having lost a disk connection. Without digging into the Marvell driver code, it looks very much like the hardware is failing in some fashion and dropping the disks. The kernel just gets what the hardware reports, so when the disks vanish, it just tries to cope the best that it can.

I also consider a kernel bug to be a vendor fault anyway. If you’re a major storage hardware vendor, you have an obligation to thoroughly test the behaviour of your hardware with the world’s leading server operating system and find/fix any bugs, even if the driver has been written by the open source community, rather than your own in house developers.

Finally posts on FreeBSD mailing lists also show users reporting strange issues with failing disks on Marvell controllers, so that adds a cross-platform dimension to the problem. Further testing with the RAID vendor drivers rather than sata_mv would have been useful, but they only had binaries for old kernel versions that didn’t match the CentOS 6 kernel I was using.

Are the Seagate disks shit?

Maybe – it’s hard to be sure, since my server has been driven up in my car from Wellington to Auckland (~ 900km) and back which would not have been great for the disks (especially with my driving) and has had a couple of times that it’s gotten warmer than I would have liked thanks to the wonders of running a server in a suburban home.

I am distrustful of my Seagate drives, but I’ve had bad experienced with Western Digital disks in the past as well, which brings me to the simple conclusion that all spinning rust platters are crap and the price drops of SSDs to a level that kills hard disks can’t come soon enough.

And even if I plug in the worst possible, spec-violating SATA drive to a controller, that controller should be able to handle it properly and keep the disk isolated, even if it means disconnecting one bad disk. The Marvell 88SX7042 controller is advertised as being a 4-channel controller, I therefore do not expect issues on one channel to impact the activities on the other channels. In the same way that software developers code assuming malicious input, hardware vendors need to design for the worst possible signals from other devices.

What about Enterprise vs Consumer disks

I’ve looked into the differences between Enterprise and Consumer grade disks before.

Enterprise disks would certainly help the failures occur in a cleaner fashion. Whether it would result in correct behaviour of the Marvell controllers, I am unsure… the cleaner death *may* result in the controller glitching less, but I still wouldn’t trust it after what I’ve seen.

I also found rebuilding my RAID array with one of my bad sector disks would take around 12 hours when the disk was failing. After kicking out the disk from the array, that rebuild time dropped to 5 hours, which is a pretty compelling argument for using enterprise disks to have them die quicker and cleaner.

Lessons & Recommendations

Whilst slow, painful and frustrating, this particular experience has not been a total waste of time, there are certainly some useful lessons I’ve learnt from the exercise and things I’d do different in future. In particular:

  1. Anyone involved in IT ends up with a few bad disks after a while. Keep some of these bad disks around rather than destroying them. When you build a new storage array, install a known bad disk and run disk benchmarks to trigger a fault to ensure that all your new infrastructure handles the failure of one disk correctly. Doing this would have allowed me to identify the Marvell controllers as crap before the server went into production, as I would have seen that a bad disks triggers a cascading fault across all the good disks.
  2. Record the serial number of all your disks once you build the server. I had annoyances where bad disks would disappear from the controller entirely and I wouldn’t know what the serial number of the disk was, so had to get the serial numbers from all the good disks and remove the one disk that wasn’t on the list in a slow/annoying process of elimination.
  3. RAID is not a replacement for backup. This is a common recommendation, but I still know numerous people who treat RAID as an infallible storage solution and don’t take adequate precautions for failure of the storage array. The same applies to RAID-like filesystem technologies such as ZFS. I had backups for almost all my data, but there were a few things I hadn’t properly covered, so it was a good reminder not to succumb to hubris in one’s data redundancy setup.
  4. Don’t touch anything with a Marvell chipset in it. I’m pretty bitter about their chipsets following this whole exercise and will be staying well away from their products for the foreseeable future.
  5. Hardware can be time consuming and expensive. As a company, I’d stick to using hosted/cloud providers for almost anything these days. Sure the hourly cost can seem expensive, but not needing to deal with hardware failures is of immense value in itself. Imagine if this RAID issue had been with a colocated server, rather than a personal machine that wasn’t mission critical, the cost/time to the company would have been a large unwanted investment.

 

Restoring LVM volumes

LVM is a fantastic technology that allows block devices on Linux to be sliced into volumes(partitions) and allows for easy resizing and changing of the volumes whilst running, in a much easier fashion than traditional MSDOS or GPT partition tables. LVM is commonly used as the default at install of most popular Linux distributions due to the ease of management.

I use LVM for providing logical volumes for each virtual machine on my KVM server. This makes it easy to add, delete, resize and snapshot my virtual machines and also reduces I/O overhead compared to having virtual machines inside files ontop of a file system.

I recently had the misfortune of needing to rebuild this array after an underlying storage failure in my RAID array. Doing so manually would be time consuming and error prone, I would have to try and determine what the sizes of my different VMs used to be before the failure and then issue lvcreate commands for them all to restore the configuration.

 

LVM Configuration Backup & Restore

Thankfully I don’t need to restore configuration manually – LVM keeps a backup of all configured volumes/groups, which can be used to restore the original settings of the LVM volume group, including the physical volume settings, the volume group and all the logical volumes inside it.

There is a copy of the latest/current configuration in /etc/lvm/backup/, but also archived versions in /etc/lvm/archive/, which is useful if you’ve broken something badly and actually want an older version of the configuration.

Restoring is quite simple, assuming that firstly your server still has /etc/lvm/ in a good state, or you’ve backed it up to external media and secondly that the physical volume(s) that you are restoring to are at least as large as they were at the time of the configuration backup.

Firstly, you need to determine which configuration you want to restore. In my case, I want to restore my “vg_storage” volume group.

# ls -l /etc/lvm/archive/vg_storage_00*
-rw-------. 1 root root 13722 Oct 28 23:45 /etc/lvm/archive/vg_storage_00419-1760023262.vg
-rw-------. 1 root root 14571 Oct 28 23:52 /etc/lvm/archive/vg_storage_00420-94024216.vg
...
-rw-------. 1 root root 14749 Nov 23 15:11 /etc/lvm/archive/vg_storage_00676-394223172.vg
-rw-------. 1 root root 14733 Nov 23 15:29 /etc/lvm/archive/vg_storage_00677-187019982.vg
#

Once you’ve chosen which configuration to restore, you need to fetch the UUID of the old physical volume(s) from the LVM backup file:

# less /etc/lvm/archive/vg_storage_00677-187019982.vg
...
   physical_volumes {
      pv0 {
         id = "BgR0KJ-JClh-T2gS-k6yK-9RGn-B8Ls-LYPQP0"
...

Now we need to re-create the physical volume with pvcreate. By specifying both the UUID and the configuration file, pvcreate rebuilds the physical volume with all the original values, including extent size and other details to ensure that the volume group configuration applies correctly onto it.

# pvcreate --uuid "BgR0KJ-JClh-T2gS-k6yK-9RGn-B8Ls-LYPQP0" \
  --restorefile /etc/lvm/archive/vg_storage_00677-187019982.vg

If your volume group is made up of multiple physical volumes, you’ll need to do this for each physical volume – make sure you give the right volumes the right UUIDs if the sizes of the physical volumes differ.

Now that the physical volume(s) are created, we can restore the volume group with vgcfgrestore. Because we’ve created the physical volume(s) with the same UUIDs, it will correctly rebuild using the right volumes.

# vgcfgrestore -f /etc/lvm/archive/vg_storage_00677-187019982.vg vg_storage

At this time, you should be able to display the newly restored volume group:

# vgdisplay
  --- Volume group ---
  VG Name               vg_storage
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1044
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                30
  Open LV               11
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               4.55 TiB
  PE Size               4.00 MiB
  Total PE              1192176
  Alloc PE / Size       828160 / 3.16 TiB
  Free  PE / Size       364016 / 1.39 TiB
  VG UUID               OIs8bs-nIE1-bMTj-vSt2-ue80-jf1X-oZCYiO
#

Finally make sure you activate all logical volumes – sometimes LVM restores with volumes set to inactive so they can’t be modified/used – use vgchange -a y to activate all volumes for the volume group:

# vgchange -a y vg_storage

The array is now restored and ready to use. Run lvdisplay and you’ll see all the restored logical volumes. Of course, this is only the configuration – there’s no data in those volumes, so you still have to restore the volumes from backups (you DO have backups right?).

 

Data Backup & Restore

Once you have your volume configuration restored, you need restore your data. The way you do this is really going to depend on the nature of your data and the volume of it, however I use a combination of two of the most common approaches to backups:

  1. A regular data-only level backup to protect all the data I care about. However in the event of a storage failure, I’m left having to rebuild all my virtual machines OS images, before being able to load in the data-only backups into the created VMs.
  2. An infrequent block-level backup of all the LVM volumes, where images of each volume are dumped onto external media, which can be loaded back in once the LVM configuration has been restored.

I have a small Perl script I’ve written which does this snapshot & backup process for me. It makes use of LVM snapshots to ensure data consistency during the backup process and compresses with gzip to reduce the on disk backup size without too much CPU overhead.

#!/usr/bin/perl -w
#
# Run through a particular LVM volume group and perform a snapshot
# and compressed file backup of each volume.
#
# This script is intended for use with backing up complete system
# images of VMs, in addition to data level backups.
#

my $source_lvm_volgroup = 'vg_storage';
my $source_lvm_snapsize = '5G';
my @source_lvm_excludes = ('lv_unwanted', 'lv_tmpfiles');

my $dest_dir='/mnt/backup/snapshots';

foreach $volume (glob("/dev/$source_lvm_volgroup/*"))
{
    $volume            =~ /\/dev\/$source_lvm_volgroup\/(\S*)$/;
    my $volume_short    = $1;

    if ("$volume_short" ~~ @source_lvm_excludes)
    {
        # Excluded volume, we skip it
        print "[info] Skipping excluded volume $volume_short ($volume)\n";
        next;
    }

    print "[info] Processing volume $volume_short ($volume)\n";

    # Snapshot volume
    print "[info] Creating snapshot...\n";
    system("lvcreate -n ${volume_short}_snapshot --snapshot $volume -L $source_lvm_snapsize");

    # Write compressed backup file from snapshot, but only replace existing one once finished
    print "[info] Creating compressed snapshot file...\n";
    system("dd if=${volume}_snapshot | gzip --fast > $dest_dir/$volume_short.temp.gz");
    system("mv $dest_dir/$volume_short.temp.gz $dest_dir/$volume_short.gz");

    # Delete snapshot
    print "[info] Removing snapshot...\n";
    system("lvremove --force ${volume}_snapshot");

    print "[info] Volume $volume_short backup completed.\n";
}

print "[info] FINISHED! VolumeGroup backup completed\n";

exit 0;

Since I was doing these backups before my storage failure, by restoring the LVM volume group configuration, it was an easy process to restore all my LVM volume data with a simple matching restore script:

#!/usr/bin/perl -w
#
# Runs through LVM snapshots taken by backup_snapshot_lvm.pl and
# restores them to the LVM volume in question.
#

my $dest_lvm_volgroup = 'vg_storage';
my $source_dir = '/mnt/backup/snapshots';

print "[WARNING] Beginning restore process in 5 SECONDS!!\n";
sleep(5);

foreach $volume (glob("$source_dir/*"))
{
    $volume            =~ /$source_dir\/(\S*).gz$/;
    my $volume_short    = $1;

    print "[info] Processing volume $volume_short ($volume)\n";

    # Just need to decompress & write into LVM volume
    system("zcat $source_dir/$volume_short.gz > /dev/$dest_lvm_volgroup/$volume_short");

    print "[info] Volume $volume_short restore completed.\n";
}

print "[info] FINISHED! VolumeGroup restore completed\n";

exit 0;

Because the block-level backups are less frequent than the data ones, the way I do a restore is to restore these block level backups and then apply the latest data backup over the top of the VMs once they’re booted up again.

Using both these techniques, it can be very fast and hassles free to rebuild a large LVM volume group after a serious storage failure – I’ve used these exact scripts a few times and can restore a several TB volume with 30+ VMs in a matter of hours simply by kicking off a few commands and going out for a beer to wait for the data to transfer from the backup device.

Resize ext4 on RHEL/CentOS/EL

I use an ext4 filesystem for a couple large volumes on my virtual machines. I know that btrfs and ZFS are the new cool kids in town for filesystems, but when I’m already running RAID and LVM on the VM server underneath the guest, these newer filesystems with their fancy features don’t add a lot of value for me, so a good, simple, traditional, reliable filesystem is just what I want.

There’s actually a bit of confusion and misinformation online about using ext4 filesystems with RHEL/CentOS (EL) 5 & 6 systems, so just wanted to clarify some things for people.

ext4 uses the same tools as ext3 and ext2 before it – this means still using the e2fsprogs package which provides utilities such as e2fsck and resize2fs, which handle ext2, ext3 and ext4 filesystems with the same command. But there are a few catches with EL…

On an EL 5 system, the version of e2fsprogs is too old and doesn’t support ext4. Therefore there is an additional package called “e4fsprogs” which is actually just a newer version of e2fsprogs that provides renamed tools such as “resize4fs”. This tends to get confusing for users who then try to find these tools on EL 6 or other distributions, so it’s important to be aware that it’s a non-standard thing. Why they didn’t just upgrade the tools/backport the ext4 support is a bit weird to me, especially considering the fact these tools are extremely backwards compatible and should meet even Red Hat’s API compatibility requirements, but it-is-what-it-is sadly.

On an EL 6 system, the version of e2fsprogs is modern enough to support ext4, which means you need to use the normal “resize2fs” command to do an ext4 filesystem resize (as per the docs). Generally this works fine, but I think I may have found a bug with the stock EL 6 kernel and e2fsprog tools.

One of my filesystems almost reached 100% full, so I expanded the underlying block volume and then issued an offline e2fsck -f /dev/volume & resize2fs /dev/volume which completed as normal without error, however it left the filesystem size unchanged. Repeated checks and resize attempts made no difference. The block volume correctly reported it’s new size, so it wasn’t a case of the kernel not having seen the changed size.

However by mounting the ext4 filesystem and doing an online resize, the resize works correctly, although quite slowly, as the online resize seems to trickle-resize the disk, gradually increasing it’s size until finally complete.

This inability to resize offline is not something I’ve come across before, so may be a rare bug trigged by the full size of the filesystem that could well be fixed in newer kernels/e2fsprog packages, but figured I’d mention it for any other poor sysadmins scratching their heads over a similar issue.

Finally, be aware that you need to use either no partition table, or GPT if you want file systems over 2TB and also that the e2fsprogs package shipped with EL 6 has a 16TB limit, so you’ll have to upgrade the package manually if you need more than that.

I’m Jethro Carr, open source geek, and this is how I work

Lifehacker regularly features a segment where they interview famous people and ask them how they work. Rather than waiting for the e-mail that would never come has yet to come, my friend Jack Scott decided to answer this set of questions on his own last week,then tapped my other friend Chris Neugebauer who then tapped me to answer them after him.

Why hello there.

Hello Good Sir/Madam, care to converse?

Location: Sydney, Australia. But my heart is still home in Wellington, New Zealand.

Current Gig: Senior Systems Engineer at Fairfax Media Australia. I’m part of a small team that looks after all of their AU websites (SMH, TheAge, Domain, RSVP, etc), responsible for building and managing the server infrastructure from hypervisor to application.

Current Mobile Device: Google/Samsung Galaxy Nexus Android phone. I have a love-hate relationship with how it enables me to be hyper-connected all the time.

Current Computer: At work I use an under-speced Lenovo X1 Carbon with Ubuntu Linux. In personal life it’s more interesting, with an aging 2010-era Lenovo Thinkpad X201i upgraded to 512GB SSD that still has it thrashing current market offerings, a 1U IBM xSeries 306m server for production websites and a large custom-built full tower server for files and dev VMs. I run a mixture of Debian and CentOS GNU/Linux distributions mostly. I did a post a while back on my setup here.

One word that best describes how you work: Hard

What apps/software/tools can’t you live without? Apps come and go over time… fundamentally for me my killer app is a GNU/Linux operating system with all the wonderful open source tools that it packs in. I’ve tried using both Windows and MacOS over the years, but have found that I’m always more productive with a GNU/Linux machine with many thousands of apps and tools just an apt-get install or yum install away.

What’s your workplace like? My “work” workplace is a pretty standard office, Fairfax has kitted it out with somewhat decent chairs and 27″ LCDs to hotdesk in. My home workspace these days is sadly just a couch or an unstable dining table. It’s very suboptimal, but as I’m only spending a limited time in AU and didn’t want to buy lots of stuff, I make do. My ideal setup looks more like what I had back in Auckland.

What’s your best time-saving trick/life hack?  Make the effort to do things right the first time. Seriously, sloppy work, designs or documentation leads to huge time losses whilst working around issues or having to fix them at a later time.

What’s your favourite to-do list manager? For work life, it’s JIRA. We use it heavily at Fairfax and I assign all my priorities different tickets, it makes delegating tasks to other team members or splitting big bits of work into easily manageable chunks possible.

Besides your phone and computer, what gadget can’t you live without?  Besides my computer?? The only gadget I care about strongly is my laptop, everything else is just “stuff” that I could take or leave. I’m pretty minimalist these days, could list all my gadgets on one hand.

What everyday thing are you better at than anyone else? What’s your secret?
Solving complex issues. I can take any problem and quickly determine the issue, the possible fixes and test and implement these. It’s not specific to IT, I’m good in a crisis and can solve issues quickly IRL as well. It comes naturally to me and it’s one of the things that allows me to be very effective in my job.

What do you listen to while you work? I don’t tend to listen so much whilst at work, but I certainly do when relaxing on personal projects. I have a wide range of tastes, my often played includes bands such as Kraftwork, Genesis, Nightwish, Eluvite, Ensiferum, Adele, The Killers, Meat Loaf, Mumford and Sons, Marillion, Muse, Rammstein, Tangerine Dream and many others that are far apart in genre.

What are you currently reading? Fairfax: The Rise And Fall. Lisa picked it up recently and it’s interesting reading a bit of the background behind the company I currently work for.

Are you more of an introvert or extrovert? An extroverted introvert. :-) I’m certainly an introvert in that I prefer small groups of close friends and staying in over going out, but at the same time I get on well with people and can happily go out and make friends without too much trouble.

What’s your sleep routine like? What’s sleep? I tend to go to sleep around 00:00-01:00 and get up at 07:30-08:00. It’s not the best, but there’s always so much to do in one day!

Fill in the blank. I’d love to see _____ answer these same questions. I’ve got many people in mind, but they tend not to have blogs. :-( So I’m going with Hamzah Khan, he has a pretty cool blog and I’m envious of his network racks. But I invite anyone else reading this to join in as well, don’t wait for an invite. :-)

What’s the best advice you’ve ever received? Invest in good tools. Whether it’s software or a hammer, either way bad tools will cause you pain, lost time, wasted money and endless frustration.

Is there anything else you want to add for readers? If you don’t currently blog, please do! It’s a fulfilling activity and I love reading blogs by my friends rather than throwaway 140char one liners and so many of you whom don’t blog have such interesting content that you could put up.