Android, the leading proprietary mobile operating system

The Linux kernel has had a long history in the mobile space, with the successes and benefits of the OS in the embedded world transferring across to the smart phone and tablet market once devices evolved to a level requiring (and supporting) powerful multitasking operating systems.

But whilst there had been other Linux-based mobiles before, it wasn’t until Android was first released to the world by Google that Linux began to obtain true mass-market consumer acceptance. With over 1 billion devices activated by late 2013, Android is certainly the single most successful mobile Linux distribution ever and possibly even the single largest mobile OS on the basis of number of devices sold.

Whilst Open Source and Free Software [By Free Software I mean software that is Libre, ie Free as in Freedom, rather than Free as in Beer] had historically succeeded strongly in the server space, it always suffered from limited mass market appeal on the desktop. With the sudden emergence and success of Android, proponents of both the Open Source and Free Software camps could enjoy a moment of victory and success. Sure we may not have won the desktop wars, and sure it wasn’t GNU/Linux in the traditional sense, but damnit, we had a Linux kernel in every other consumer device, something worth celebrating!

 

Whilst Android still features the Linux kernel, it differs from a conventional GNU/Linux system, as it doesn’t feature the GNU user space and applications. When building Android, Google took the Open Source Linux kernel but threw out most of the existing user space, instead building a new Apache-licensed user space designed for consumers and interaction via touch interfaces.

For Google themselves, Android was a way to prevent vendors like Microsoft or Apple from gaining a new monopoly in the mobile world, where they could then squeeze Google out and strangle their business in the newly emerging market – a world where Microsoft or Apple could dictate which browser or search engine a user could use would not be in Google’s best financial interests, and it was vital to take steps to prevent that from becoming possible.

The proposition to device vendors was that Android would reduce the R&D costs of competing with incumbent market players, make their devices more attractive and allow some collaboration with their peers via a common application platform, which would attract developers and enable a strong ecosystem, in turn making Android phones more attractive to consumers.

For Google and device vendors, this was a win-win relationship and it quickly began to pay off.

 

Yet even as we started consuming the delicious Android dessert (with maybe a slightly dubious Google advertising crust we could leave on the side), we found the taste souring with every mouthful. For whilst Google and device vendors bought into the idea of Android the operating system, they never bought into the idea of the Free Software movement which had led to the software and community that made this success possible in the first place.

To begin with, unlike the GNU/Linux distributions pre-dating Android which generally fostered collaboration and joint effort around a shared philosophy of working together to make a better system, Android was developed in a closed-room model, with Google and select partners developing new features in private before throwing out completed releases to coincide with new devices. It’s an approach that’s perfectly compliant with Open Source licensing, but not necessarily conducive to building a strong community.

Even the open source nature of the OS was quickly tainted. Instead of evolving the source code as part of a community effort, device vendors took Android and added their own proprietary front ends and variations, shipped devices with locked boot loaders preventing OS customisation and shoved binary drivers and firmware into their device kernels.

This wasn’t the activity of just a few bad vendors either. Even Google’s own popular “Google Nexus” series, targeted at developers of both applications and the operating system, requires proprietary blobs to get hardware such as cellular radios, WiFi, cameras and GPUs to function. [Depending on whom you ask, this is a violation of the Linux kernel’s GPLv2 license, but there is disagreement amongst kernel developers, along with fears that a ban on proprietary kernel drivers would just lead to vendors moving the proprietary blobs to user space, a legally valid but still ethically dubious approach.]

Google’s main maintainer for AOSP recently departed Google over frustrations getting Qualcomm to release drivers for the 2013 revision of the popular Nexus 7 tablet, which illustrates the hurdles developers face in obtaining even just the binaries from vendors.

Despite all these road blocks, a strong developer community has still managed to form around hacking on the Android source code, with particular credit to CyanogenMod, a well-polished and very popular enhanced distribution of Android; Replicant, which seeks to build a purely free Android OS, replacing binary blobs along the way; and F-Droid, a popular alternative to the “Google Play” application store offering only Free Software licensed applications for download.

It’s still not perfect and there’s a lot of work left to do – projects like CyanogenMod and Replicant still tend to need many proprietary modules if you want to make full use of the features of your device. The community is working on fixing these shortcomings, but it’s always much more frustrating having to play catch up with vendors, rather than working collaboratively with them.

But whilst this community effort can resolve the issue of proprietary drivers and applications and lead us to a proper Free Software Android, there is a much more tricky issue coming up which could cause far greater headaches.
In order to resolve the issue of Android version fragmentation amongst vendors causing challenges for application developers, Google has been introducing new APIs inside a package called “Google Play Services”, which is a proprietary library distributed only via the Google Play application store.

Any application that is reliant on this new library (not to mention other existing proprietary components such as Google Cloud Messaging, used for push notifications) will be unable to run on pure Free Software devices that are stripped of non-free components. And whilst at the moment the features offered by this API are mostly around using specific Google cloud-based APIs and features, which are non-free by their very nature, there’s nothing preventing more and more features being included in this API in future, reducing the scope of applications that will run on a Free Software Android.

If Google Play Services proves to be a successful way for Google to enforce consistency and conformity on their platform to tackle the fragmentation issues they face, it’s not inconceivable that they’ll push more and more library functions into proprietary layers distributed via the Play Store like this.

 

But if Google chooses to change Android in this way, I feel that it will be inappropriate to continue calling Android an Open Source or Free Software operating system. Instead it will be better described as a proprietary operating system with an open core – in a similar fashion to Apple’s MacOS.

Such an evolution could lead to two distinct forks of Android being created:

  1. Proprietary/Android, the version identified by the public, offered by Google and their associated vendors, a polished experience but with increasingly reduced user and developer freedoms.
  2. Free/Android, the community variant, with its own application ecosystem, that diverges away from Proprietary/Android as more and more applications refuse to run on it due to Free/Android lacking libraries like Google Play Services.

Some readers will ponder why having some proprietary components is such a concern – who really wants to hack around with drivers or application compatibility APIs? Generally they’re not the most exciting part of computers [subjectively speaking of course] and on some level I can understand this mindset.

But proprietary software is more than just an annoyance to developers who want to tinker. Proprietary software makes your device opaque, obscuring what the software is doing, how it works and how it can be (ab)used.

The Google Play application has the capability to install content on your phone, a feature often used to install applications onto a device from the browser. But without the source code of the Google Play application, can you be sure this never happens without your awareness? There’s already due cause to distrust the close association between companies like Google and the NSA, and without the ability to see inside the software’s source code, you can’t be sure of its capabilities.

Building applications around proprietary APIs like Google Play Services removes the freedom of a user to decide to replace calls to proprietary systems with free ones. It may be preferable to use a Free Software mapping API rather than Google’s privacy-lacking Maps offering, for example, but without the source code, it’s not possible to make this change.

Even something as innocent as a driver or the firmware for hardware such as the GSM modem could be turned into a weapon by a powerful adversary, by taking advantage of backdoors in the firmware to deliver malware to spy on an individual – whether for the “right” reasons or not depends on your moral views and who is doing the spying at the time.

Admittedly this is a pessimistic view, but I’ve laid out my personal justifications for taking this approach before and believe we need to look at how this technology could potentially (hopefully never) be used against individuals for immoral reasons.

 

I think Android illustrates the differences between Open Source and Free Software extremely well. Whilst Android is licensed under an Open Source license, it doesn’t have the same philosophy of Free Software.

Its source code is open because being open provided Google with a commercial advantage, not because Google believe that user freedom is important. Google and their partners have no qualms about making future applications and/or features proprietary, even to the detriment of developers and users, restricting their freedom to understand and modify the software in their devices.

Richard M. Stallman (RMS), the founder of the Free Software movement, has written about the differences between Free Software and Open Source, explaining how, whilst these two ideologies have overlapping goals, at times they also differ. In some ways the term Open Source can be dangerous, as it lets us lose sight of the real reason why software needs to be Free: for Freedom’s sake above all.

 

Interestingly, despite how strongly I feel about Free Software, I’ve found it somewhat easy to personally ignore concerns about proprietary software on mobiles for a prolonged period of time. In many ways, I see my mobile as just a tool and not a serious “real” computer like my GNU/Linux laptop where I conduct most of my digital activities. It’s possibly a result of my historical experiences with these devices, starting off using mobiles when they were just phones and having had them slowly gain more capabilities around me, whilst always being seen as “phones” rather than “pocket computers”.

I’m certainly a digital native, a child of the internet generation, separated from my parents’ generation by being the first to really grow up with widely available internet connectivity and computers. But to me, computers are still laptops and servers, despite having a good understanding of the mobile space and using mobile devices every day to possibly excessive amounts.

Yet for the current and next generation growing up, mobile phones and tablets are *the* computer that will define their learning experiences and interaction with the world – they may very well end up never owning a conventional computer, for the old guard of Windows, Linux and the PC is gone, replaced by iOS, Android and handhelds.

It’s clear that mobile operating systems are the platform of the future; it’s time we consider them equals with our conventional operating systems and impose on the mobile space the same strict demands for privacy and freedom that we have grown to expect. I know that personally I don’t trust my Android mobile even one tenth as much as I trust my GNU/Linux laptop, and this is unacceptable when my phone already has access to my files, my emails, my innermost private communications with others and who knows what else.

 

So the question is, how do we get from the Kinda-Proprietary/Android we have now, to the Free/Android that we need?

I know there are some who will take the purist approach of running only pure Free Software Android and ignoring any applications or features that don’t run on it as-is. Unfortunately taking this approach will inevitably lead to long term discrepancies, with the mass market Android OS and the Free Software purists pulling the OS feature set in different directions.

A true purist risks becoming a second class citizen – we are already at the stage where not being able to run popular applications can seriously restrict your ability to take part in our world. Consider the difficulties of not being able to load the applications needed to use public transport, do banking (online or NFC) or communicate with friends, because all of these applications require a freedom-impacting proprietary layer.

It will be difficult to encourage users and application developers to take up a Free Software Android build if they discover that their existing collection of applications, which rely on various proprietary APIs and library features, no longer works. So we need to be somewhat pragmatic and make it easier for them to adopt Free Software and still run proprietary applications on top of a free base, until such time as free alternatives arise for those applications.

I think the solution is a collection of three different (but all vital) efforts:

  1. Firstly, to support the development of community Android distributions such as CyanogenMod and Replicant, something which has been successful so far. It’s clear that Google isn’t interested in working as equals with the community, so having a strong independent community is important for grass-roots innovation.
  2. Secondly to support the replacement of binary blobs in the core Android OS, such as the work that the Replicant project has started with writing Free Software drivers for hardware.
  3. Thirdly (and not at all least), we need to make it easy to provide the same functionality in Free/Android as in Proprietary/Android, by re-implementing closed source applications and libraries such as the Google Play application store, Google Cloud Messaging (push notifications) and the Google Play Services library/API.

Whether we like it or not, Google’s version of Android will be the platform that the majority of developers target long term. It doesn’t suit all developers, but it has suited most Free (as in beer and/or Freedom) and paid application developers for Android well enough for long enough that I don’t see it being easy to de-rail that momentum.

If we can re-implement Google’s proprietary layers to a level sufficient for maintaining compatibility with the majority of these applications, it opens up some interesting possibilities. A Free/PlayServices API layer, developed against the documented API calls published by Google, is entirely possible and would allow users to run a Free/Android mobile whilst maintaining support for the majority of public applications being released for the Android platform, even as they use more and more proprietary API features.

Such a compatibility layer will enable users to run applications on their own terms – a user might decide to only run Free as in Freedom software, or they could decide that running proprietary software is OK sometimes – and that’s an acceptable choice, but the user is the one that should be making it, not Google or their device vendor.

Potentially we could take this idea a step further and re-implement features like contact and setting synchronisation against a Free Software server that technically capable users can choose to set up on their own infrastructure, giving them the benefits of cloud-type technologies without the loss of freedom and privacy that takes place when using the Google proprietary features.

 

I’m not alone in these concerns – neither RMS nor the Free Software Foundation (FSF) has been idle on this issue. RMS has an excellent write up on the freedom of Android here, and on a more mainstream level, the FSF is running campaigns promoting the freeing of Android phones and encouraging efforts to keep the platform Free as in Freedom.

I’m currently taking steps to move my Android mobile off various proprietary dependencies to Free Software alternatives – it’s going to be slow and gradual, and it will take time to find replacements for various applications and libraries.

I haven’t done much in the way of Android application development, but I’m not afraid to pick up some Java if that’s what it takes to fill in a few gaps to get there – and if it means reverse engineering some features like Google Play Services, I’ll go down that path if need be.

Because Free Software computing is vital for privacy, vital for security and vital for a free society itself. And if the cost is a few weekends hacking at code, it’s a price well worth paying.

O-Ring Mod

As much as I love my Das Ultimate Silent keyboard, the one thing that it fails to do is live up to its “silent” label. Whilst it’s certainly massively quieter than something like the mighty IBM Model M, it still makes a fair bit of noise due to the keycaps bottoming out when typing, making a plastic clacking noise.

With a new more squished up office layout at work my colleagues have been begging threatening bribing cursing complaining requesting that I consider the “O-Ring Mod”, where you remove all the keycaps and install little rubber rings underneath each key to reduce their noise.

The result is quite effective, about a 50% sound reduction IMHO, with little negative impact on the typing experience – just a slightly shorter travel distance and a bit more bounce in the keyboard. There’s a great YouTube video showing the difference it makes with various Cherry MX switch types – my Das Keyboard uses the brown switches, which are the second type demonstrated.

There’s a number of online stores happy to sell you the rings – although in my case, I just ordered the raw rings from Amazon rather than a keyboard shop. I also decided against spending the $10 for a keycap remover, which was a good move – a couple of paperclips were effective enough.


OK keyboard, I know it’s not quite the same, but it’s time to put a rubber on it…

Generally the replacement was easy, the biggest issue was the spreader/stabiliser bar keys, such as the Enter, Backspace, Shift and Space keys – these ones have a little metal bar which you need to stretch apart to unhook the key from and to hook it back on once the rubber ring is installed.


The tricky keys – don’t just pull them off, unless you want to break the white plastic loops. You can see the little rubber ring I’ve just added to the keycap.

Next up is working to improve my typing accuracy – I can already thrash out some insanely fast stuff, but my accuracy rate can vary a lot (partially due to bad spelling), so sitting down and forcing myself to slow down slightly for more accuracy would be a good trade off.

I’m also pondering learning a different layout like Dvorak which could be a good excuse to learn a new typing style and get some performance advantages.

Ubuntu, the Windows of the Linux world

Sometimes I do wonder if Ubuntu is actually the Windows of the Linux world, given some of their design decisions, like non-closable restart windows…


Nice desktop you have there. Be a good fellow and reboot now ok?

Thankfully xkill closes *all* windows and leaves no survivors. It’s a shame that the general desktop environment on Ubuntu has been so cut down and over simplified over the last few years, since the server Ubuntu LTS releases are actually pretty damn good.

Delicious Entropy

I run a large GNU/Linux server with KVM for running numerous virtual machine guests, including build hosts used to package and compile software for different GNU/Linux distributions and other operating systems.

I recently ran into an issue where a kernel compile hung indefinitely whilst GPG (tried) to sign kernel modules as part of the build process, due to the virtual machine guest running out of available entropy and being unable to proceed until more random data was available.


Bro, I’m stalled as bro!

On Linux there are two sources of random data – /dev/random, which provides high quality random data, and /dev/urandom, which provides an unlimited amount of pseudo-random data based on a seed value taken from the random pool initially.

Linux generates this random data by collecting entropy from somewhat-random events, such as disk activity, network activity, keyboard, mouse and other sources. When the pool of entropy is exhausted, /dev/random will block (ie force processes to freeze) until more is available, whereas /dev/urandom will continue to serve continuous pseudo-random data, although the quality of the random data is not considered as secure as /dev/random.

On a workstation or single server this tends to be enough to generate sufficient random data for most applications (although if you’re doing certain tasks you may still have an issue). Virtual machines on the other hand, lack hardware sources of entropy such as disks or keyboards and it’s very easy to quickly exhaust the available entropy pool and have some applications block until more is available.
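A quick, non-destructive way to see how much entropy a system has at any given moment is to read the counters the kernel exposes under /proc (the values are in bits):

# Amount of entropy currently available in the kernel's pool
cat /proc/sys/kernel/random/entropy_avail

# Total size of the pool, for comparison
cat /proc/sys/kernel/random/poolsize

# Watch the pool drain and refill in real time
watch -n1 cat /proc/sys/kernel/random/entropy_avail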

Applications like Apache (with mod_ssl) and OpenSSL use /dev/urandom so aren’t impacted by shortages of entropy, but some signing processes, such as GPG require /dev/random and can be impacted if the source of entropy is exhausted  – which is exactly what happened to my kernel signing process.

 

It’s pretty easy to test and see how quickly a Linux system re-fills the entropy pool by reading data from /dev/random, forcing the pool to empty and be repopulated.

# dd if=/dev/random of=/dev/null count=1000
0+1000 records in
16+1 records out
8496 bytes (8.5 kB) copied, 149.849 s, 0.1 kB/s

The host doing this test has around 12 physical hard disks, 10 active KVM virtual machines spewing out packets and an unfiltered WAN link feeding random junk – all of which is good for generating a decent amount of entropy. The numbers may look pretty bad, but compare them with the amount of entropy generated by my laptop…

# dd if=/dev/random of=/dev/null count=1000
0+1000 records in
16+1 records out
8409 bytes (8.4 kB) copied, 1389.95 s, 0.0 kB/s

The rate of entropy generation on my laptop is quite depressing – but at least my laptop has a keyboard, mouse and hardware environmental values to help add something to the entropy sources.

When I run the same test on a virtual machine guest, which lacks all these physical sources, it comes to a grinding halt:

# dd if=/dev/random of=/dev/null count=10000
0+24 records in
0+0 records out
0 bytes (0 B) copied, 1865.68 s, 0.0 kB/s

I was forced to kill the above test due to it stalling indefinitely, thanks to the virtual machine running out of available entropy and being unable to generate any more to complete the test. :-(

Even when performing an intensive activity such as compiling a large software library, it still takes considerable time to complete this test on a VM:

# dd if=/dev/random of=/dev/null count=1000
0+1000 records in
15+1 records out
8018 bytes (8.0 kB) copied, 2560.36 s, 0.0 kB/s

It seems that without the random data generated by active physical hardware, the VM guest simply can’t complete the test in any reasonable time. And whilst some applications like an HTTPS website would continue to operate fine, others, like a build host GPG-signing packages, may fail and hang indefinitely, unable to obtain the required volume of random data to complete its key generation process.

 

For times when this lack of entropy becomes an issue for your applications, it is possible to obtain additional entropy from an external source – this can be as simple as feeding in analog noise from the sound card, or as sophisticated as a dedicated hardware random number generator or the functionality built into certain CPUs, designed to be extremely random and unpredictable.

A while ago I picked up a pair of Simtec Electronics’ Entropy Keys, small USB devices which generate a truly random source of data by a clever method of abusing semiconductors, and connected one to my primary KVM server.

The device ships with an open source daemon that takes random data from the key and injects it into the Linux entropy pool for use by all applications using /dev/random. It instantly makes a huge difference to the available volume, generating almost 3.9KB/s of random data.


Gain entropy with just 1 easy repayment! Call now!

After starting the daemon and re-running the test, the performance looks much better:

# dd if=/dev/random of=/dev/null count=1000
0+1000 records in
145+1 records out
74504 bytes (75 kB) copied, 21.8926 s, 3.4 kB/s

The numbers are still low, but in reality you generally only need a few bytes at a time, rather than the massive volumes this test demands – for general signing usage, 3.4kB/s is a huge volume to have available.

So whilst this test doesn’t reflect the real way /dev/random is used, it does illustrate the difference in data volume a proper random number generator can make. And whilst this might not be a common problem thanks to the low volume of random data required for most applications to function, the increasing use of virtualisation makes it an issue that people may bump into more in future.

Now that I have my host server getting a reliable and steady flow of random data, my next step is to share that data with the virtual machines running on the host – as I’m doing all my signing in guests, it’s vital that I get that random data through to them.

I’m in the process of investigating a few different options and will cover these in a follow up blog post, as it’s a somewhat sizeable topic in its own right.

WordPress & SSL Fixes

I’ve been using WordPress for this blog for a number of years now – at some point I realised that whilst writing my own code is fun, there’s no need to reinvent yet-another-fucking-blog-platform and ended up selecting WordPress for my content, on the basis of its strong and active development and community.

Generally it’s pretty good, but there are times it disappoints, such as WordPress expecting servers to have FTP for unpacking updates and plugins (it’s 2013 guys, SFTP at least!), excessively setting cookies which makes caching layers more complex, and doing stupid stuff like storing full URLs inside the database for page links and image resources.

The latter has been impacting me in particular. Visitors to my site have had the option of using HTTP or HTTPS (SSL secured) access methods for some time, but annoyingly whenever I post an article with images, WordPress includes all the images using http://. This mixed content prevents browsers from showing the lock icon (best case) or throws up a nasty error (worst case), depending on the browser and its level of concern for user safety around mismatched content.


Dubious Firefox is dubious about this site, no lock icon of security here!


Despite having accessed the site on https://, WordPress still uses http:// for my images.

I could work around this by setting the WordPress base URL for my site to be https://www.jethrocarr.com, but then images served at the unsecured http:// site would also be served via SSL, which is just adding pointless load to the server (not that SSL termination really adds much load these days, but damnit, I’m being a purist here!).

I was hoping that it was a misconfiguration of my WordPress setup, but reading online it seems that this is a known issue with WordPress and a whole bunch of modules, hacks and themes have sprung up to fix/workaround the issue…

Of course there’s an easier way – fix it at the webserver layer! Both Nginx and Apache have modules to do substitutions in page content as it’s served; for Nginx there’s HttpSubModule and for Apache there’s mod_substitute. In my case, with stock Apache 2.2 on CentOS 5, I was able to fix the whole issue by adding the following to my SSL vhost configuration:

# Fix SSL URLs thanks to WordPress hardcoding http:// links to images :'(
<Location />
    AddOutputFilterByType SUBSTITUTE text/html
    Substitute "s|http://www.example.com|https://www.example.com|"
</Location>

Following this, things look much better:


The lock icon of browser approval!


All media files are now https://, not http://

Technically this substitution has some level of performance impact, as it has to process the generated HTML content and check for strings to replace, but the impact is so low that I wasn’t able to measure it amongst the usual variation of page response times – and it’s not going to be anywhere near as slow as mod_php and WordPress itself anyway. ;-)
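If you want to confirm that no hard-coded http:// references slipped past the substitution, a quick check is to fetch a page over HTTPS and grep the output for links back to the unsecured site (using the same example.com placeholder as the vhost config above):

# Fetch the page via HTTPS and count any remaining http:// links back to the
# site itself - a count of 0 means no WordPress-generated mixed content
curl -s https://www.example.com/ | grep -c 'http://www.example.com'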

Finally, if you haven’t already, you probably want to change the following in wp-config.php:

define('FORCE_SSL_ADMIN', true);

This forces all WordPress logins and wp-admin activities to take place under HTTPS which is a pretty good idea if you ever post to your blog from an unsecured network.

The Apache that wanted to be root

I’ve run into an issue a couple of times where web applications on my server have broken following a restart of Apache, when the application in question calls external programs.

What seems to happen is that when an administrator restarts Apache during general maintenance of that server, Apache picks up some of the unwanted environmental settings from the root user account, in particular the variable HOME ends up getting set to the home directory of the root user account (/root).

Generally it won’t be an issue for web applications, but if they call an external application (in my case, Git), that external application may use the HOME environment to try and read or write configuration files.

# tail -n1 error.log
fatal: unable to access '/root/.config/git/config': Permission denied

In my case, Git kept dying with a fatal error, which led to a very confused sysadmin wondering why a process running as Apache was trying to read from the root user’s account…
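You can reproduce the same failure without involving Apache at all, by running Git as the apache user with HOME forced to root’s home directory – a quick sketch, where /var/www/myapp is just a placeholder for wherever your web application keeps its repository:

# Run git as the apache user with HOME pointing at /root, mimicking the polluted
# environment - git tries to read root's global config and dies with the same
# "Permission denied" fatal error seen in the error log above
cd /var/www/myapp
sudo -u apache env HOME=/root git status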

By looking at the environmental settings for the Apache worker processes, we can see what’s happening. After a normal boot, the environmental variables look something like the below:

# ps aux | grep httpd
root     10173  0.0  1.6  27532  8496 ?        Ss   22:42   0:00 /usr/sbin/httpd
apache   10175  0.1  2.8  37560 14692 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10176  0.1  2.8  37836 14952 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10177  0.1  2.8  37332 14876 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10178  0.1  2.8  37560 14692 ?        S    22:42   0:01 /usr/sbin/httpd

# cat /proc/10175/environ
TERM=dumbPATH=/sbin:/usr/sbin:/bin:/usr/binPWD=/LANG=CSHLVL=2_=/usr/sbin/httpd

Because Apache has been started by init, it has a nice clean environment. But after a restart by the root user, it’s clear that some cruft from the root user account has been pulled into the application environment variables:

# cat /proc/10175/environ

HOSTNAME=localhostSHELL=/bin/bashTERM=xtermHISTSIZE=1000USER=root:
MAIL=/var/spool/mail/rootPATH=/sbin:/usr/sbin:/bin:/usr/bin
INPUTRC=/etc/inputrcPWD=/rootLANG=CSHLVL=3HOME=/rootLOGNAME=root
LESSOPEN=|/usr/bin/lesspipe.sh %sG_BROKEN_FILENAMES=1_=/usr/sbin/httpd

Because of these settings, external programs relying on the value of HOME will try to read/write to a directory that they aren’t permitted to use.

Debian-based systems fix this issue by unsetting certain environmentals (including HOME) in the bootscript for Apache, based on the rules in /etc/apache2/envvars.

To fix the issue on a RHEL/CentOS host, you can instead just append a replacement HOME setting into /etc/sysconfig/httpd. This particular configuration file is read at server startup and isn’t overwritten when Apache gets upgraded.

cat >> /etc/sysconfig/httpd << "EOF"
# Correct Apache's home directory
HOME=/var/www
EOF

Following a restart, Apache should now show the correct HOME environmental variable and your application should function as expected.
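A quick way to verify is to dump the environment of one of the worker processes again, this time splitting the null-delimited output so it’s readable (a small sketch, assuming pgrep is available):

# Show the environment of the first Apache worker, one variable per line,
# and check the HOME value picked up from /etc/sysconfig/httpd
cat /proc/$(pgrep -u apache httpd | head -n 1)/environ | tr '\0' '\n' | grep ^HOME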

Awstats 7.2 + extras RPMs

I’ve been a long term user of Awstats for reporting on visitor traffic to my websites. Whilst it’s a little dated, its simplicity and reliance only on the web server logs make it suitable for almost any application – including general websites such as blogs, but also more specialised sites such as my package repositories, which can’t make use of more sophisticated client-side Javascript tracking methods as files are being downloaded by non-browser clients.


Simple web 1.0 goodness. No fancy AJAX graphs here son!

That repository server in particular (repos.jethrocarr.com) is now pushing 20-40GB of traffic per month to around 2500-3000 servers. Unfortunately Awstats doesn’t differentiate between general purpose file grabbers and the Yum downloader for RPM-based distributions, which makes it difficult to see whether downloads are coming from real machines or from mirror scripts scanning and re-downloading files.
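Until Awstats can tell them apart, a rough way to gauge the split is to grep the raw access logs for the user agent that Yum sends – on my systems it shows up as an urlgrabber/yum string, although the exact agent string and log path will vary:

# Count requests from Yum clients vs everything else, based on the User-Agent
# field in the access log (adjust the log path and agent string to suit)
grep -c 'urlgrabber.*yum' /var/log/httpd/access_log
grep -vc 'urlgrabber.*yum' /var/log/httpd/access_log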

I also run dual-stack IPv4 and IPv6 – Awstats includes some useful GeoIP modules to look up where user traffic comes from, but it doesn’t support mixed IPv4 and IPv6 by default, and as my IPv6 traffic usage increases, this could become a problem as the “Unknown” country counter grows.

To fix this, I’ve written a patch for adding Yum user agent support and also merged in a patch by Sven Strickroth which adds a geoip6 module that does both IPv4 and IPv6 country lookups using the popular MaxMind GeoLite databases.

I’ve built packages for CentOS/RHEL/etc 5 & 6, which are available at my repositories at repos.jethrocarr.com. The awstats package I’ve built includes these two patches and also pulls in a current copy of MaxMind’s GeoIP database and required dependencies, so you’re all good to go immediately.

If you’re after the patches themselves, you can download them directly:

ELBs & Corporate Proxies

Following on from yesterday’s ELB post, it’s worth noting that there’s another common scenario where you can trigger issues when accessing ELBs – many corporates enforce the use of an HTTP proxy for all outgoing traffic, sometimes transparently, other times less so.

Having a multi-AZ ELB and accessing it from your own data center isn’t too much of an issue if each host does its own DNS lookup – your hosts should roughly end up with a 50/50 split across AZs, as each one resolves its own DNS record.

But when a proxy is added to the mix, it breaks this, since proxies tend to do their own DNS lookups and cache the results for use by other clients. Testing with Squid showed that its DNS caching would favour a particular AZ and send all traffic there, before flipping to the other AZ when the DNS cache expired and was refreshed – in my case, every 5 minutes when the TTL expired.
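You can see this for yourself by comparing fresh lookups against the ELB’s DNS record with whatever answer the proxy has cached – a quick sketch using dig and Squid’s cache manager, where the hostname is the example ELB record from yesterday’s post and the ipcache report may need the cache manager to be enabled on your proxy:

# Fresh lookups return both ELB addresses
dig +short www-example-com-elb.jws.elb.amazonaws.com

# Squid's ipcache report shows the single answer it has cached and will keep
# handing to clients until the positive DNS TTL expires
squidclient mgr:ipcache | grep elb.amazonaws.com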

I'm so sick of these motherfucking ELBs in this motherfucking cloud!

Go home Amazon, you’re drunk

If you’re using Squid, there’s little you can do to work around this – whilst you can adjust Squid’s DNS caching times and approach, short of disabling DNS caching entirely and taking the performance hit of a DNS lookup for every new outbound request, you will always end up with load jumping between the two AZs and causing havoc.

There’s a few workaround options:

  • Have multiple Squid proxies for your outbound traffic and load balance between them on your network. If you load balance outgoing traffic across 4+ different outbound Squid servers, your load should end up going to different AZs a *bit* more evenly – but it’s still not guaranteed.
  • Create an internal ELB and access via your VPC link, allowing you to bypass your company’s outbound network proxies (as traffic routes via VPN or Direct Connect) – but then you’re paying for 2x ELBs – one external for end users and one internal for your own systems.
  • Replace the ELB with something actually useful (eg a Varnish or HA-Proxy instance in Amazon).
  • Get rid of the outbound proxies please! I could write a business case for it based on the amount of money I’ve seen proxies waste at so many different companies (hint: engineers time debugging issues is much more expensive than a couple GB extra data usage).
  • Gin.

Russian roulette with ELBs and CDNs

In my day job, I look after a number of websites, all of which generally make heavy use of CDNs (Content Distribution Networks) to offload traffic to edge nodes near an end user’s device. In our case we use Akamai, one of the largest and most experienced providers in the world.

A large number of our clusters and applications now run on Amazon’s public cloud service here in Sydney, making use of EC2 instances and ELBs. Due to the important nature of our systems, we have almost all applications in active-active multi-AZ (Availability Zone) configurations. The intention of this design is that the ELB (Elastic Load Balancer) serves all incoming traffic by dividing it across each availability zone in equal proportions. If either Amazon AZ fails, the other will continue to serve requests like nothing is wrong.

It’s a nicer solution than the traditional data center approach of having an active-passive multi-site design, as with both AZs being constantly active serving requests, we know that production and “DR” are always in a functional working state, ready to handle traffic; plus your investment into DR isn’t going to waste like traditional servers sitting idle.

Unfortunately Amazon ELBs offer only the barest of no-frills features, which makes them a bit stupid at times. In particular, Amazon’s multi-AZ ELBs actually consist of two separate ELBs, one in each AZ. Incoming traffic selects an ELB by means of a DNS round robin and is then directed to a server in that particular AZ.

Thus, each availability zone has its own ELB, which adds its own IP address to the DNS round robin, and it looks something like this:

www.example.com is an alias for www-example-com-elb.jws.elb.amazonaws.com.
www-example-com-elb.jws.elb.amazonaws.com. has address 172.16.32.1
www-example-com-elb.jws.elb.amazonaws.com. has address 192.168.0.1
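Repeating the lookup a few times shows the order of those two answers rotating, which is the only mechanism spreading clients across the two AZs – a quick sketch:

# Repeat the lookup and watch the order of the two ELB addresses rotate
for i in 1 2 3 4 5; do
    dig +short www-example-com-elb.jws.elb.amazonaws.com | tr '\n' ' '; echo
    sleep 1
done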

The problem is that DNS round robin has no guarantee of balancing the load evenly across the two data centers. If a particular company’s proxy server caches one address, it may direct traffic for the whole company to AZ-A and deliver no traffic to AZ-B.

In reality, due to the large number of users getting assigned different IP addresses with round robin, users tend to be spread somewhat evenly across the different AZs, making the problem a somewhat moot point when you have sizeable visitor numbers.

But if you add Akamai to the mix, you can end up with interesting results – it turns out that Akamai Edge nodes in AU use a central source of DNS information, which can lead to them favouring a particular ELB IP address. And since *all* your traffic goes via the CDN, this in turn results in all your traffic going directly to a single AZ and ignoring the other one entirely.

In a real-world scenario of a 4 webserver cluster, we saw traffic jump between each AZ whenever Akamai’s edge servers updated DNS to a different IP address, as per the below graph:

Time to really test that your application is active-active!

Akamai decides to switch which ELB it’s using from A to B :-/

This swapping brings with it some really nasty issues. In theory your active-active setup should be large enough to handle all your usual traffic load on just one AZ, but if that’s not the case, bad things will happen to your site performance and/or reliability.

The other nasty issue comes when doing auto-scaling with Amazon: this swapping messes with the CloudWatch metrics used by your autoscale policies/triggers – one AZ is completely idle, one AZ is maxed out, the average stats show a half-busy cluster, and so nothing triggers an autoscale upwards to handle the load.

And even if you’re clever and set your autoscaling to also trigger based on ELB latency/errors/throughput, you may still end up with issues, since the new host created during the autoscale may end up in the idle AZ, instead of the active AZ where you need it.

Using a smarter system for load balancing can negate the issue – for example, using a pair of Varnish or HA-Proxy servers configured to do cross-AZ load balancing would work around the issue by spreading all the traffic coming into one AZ across all the servers in both AZs, but this does have increased costs (running EC2 instances, inter-AZ traffic). It may also have performance issues depending on the amount of traffic pouring into your instances.

Additionally, if you have a global audience, rather than a mostly single-country audience like us, you may not see the issue, since the different Akamai regions around the world will balance load somewhat equally across the two AZs.

To properly fix this behaviour with Akamai, you need to open a professional services request and have the SureRoute configuration adjusted so that Akamai forces the edge nodes to look up the origin IPs at the edge:

<!-- SR fix to handle multiple origin IP's -->
<forward:cache-parent.sureroute2.force-origin-ip-from-edge>on
</forward:cache-parent.sureroute2.force-origin-ip-from-edge>
<forward:cache-parent.sureroute2.round-robin.status>on
</forward:cache-parent.sureroute2.round-robin.status>

<!-- no host in sureroute stat-key -->
<forward:cache-parent.sureroute2.stat-key.host>off
</forward:cache-parent.sureroute2.stat-key.host>

With this fixed configuration, Akamai will correctly spread load evenly across our two AZs and our load graphs settled comfortably back into normality. I’m not entirely sure why this configuration isn’t default SureRoute behaviour, but like many things with Akamai, there are often mysterious adjustments that only professional services know about or can make.

Finally it’s worth noting that this issue isn’t unique to Amazon – you could get the same issue if you run active-active conventional data centers and use Akamai for offload. It may also be an issue with other CDNs by default, so double-check the behaviour of your particular vendor – it would be interesting to see if CloudFront (Amazon’s CDN) exhibits similar issues or not.

Credit to my colleague Andrew for spotting this issue originally and having to deal with two different vendors’ support cases at once to get to the bottom of the root cause.

SSL Intermediate CA Bundles with Amazon

When configuring SSL services, generally you need to set a certificate, a private key and the CA bundle containing the intermediate certificate(s), which is often a bundle of several different certificates.

For example, https://www.jethrocarr.com‘s configuration looks like:

SSLEngine on
SSLCertificateFile jethrocarr.com.crt
SSLCertificateKeyFile jethrocarr.com.key
SSLCertificateChainFile startssl.intermediate.ca.crt

When your browser connects, it doesn’t trust jethrocarr.com.crt directly, but it checks it against the certificates in startssl.intermediate.ca.crt that have signed it – and those certificates are signed by a CA that your browser trusts.

This means that the CAs can keep their root certificates, which are trusted by the browser, much more secure, and instead sign customer certificates with intermediates that can be revoked and easily(ish) replaced should the need arise.

Generally this works fine from the end user perspective, although there are sometimes issues when a sysadmin forgets to add the intermediate CA bundle and doesn’t immediately notice, as some browsers will work fine whilst others fail, depending on whether or not they already trust the intermediates.
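An easy way to check what chain a server is actually presenting is to connect with OpenSSL’s s_client and look at the certificates it sends during the handshake, for example:

# Show the certificate chain presented by the server during the SSL handshake
openssl s_client -connect www.jethrocarr.com:443 -showcerts < /dev/null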

 

Today I ran into a new issue, where Amazon Web Services is fussy about the order of the certificates in that bundle when adding a certificate to an Elastic Load Balancer for SSL termination.

Any attempt to upload my certificate was met with “Invalid Public Key Certificate”, which didn’t make a lot of sense as I was certain that my certificates were OK. It was easy to verify and prove this, using OpenSSL:

$ openssl rsa -noout -modulus -in example.com.key | openssl md5
(stdin)= 30e1b6cb4168117b7923392ca536c701

$ openssl x509 -noout -modulus -in example.com.crt | openssl md5
(stdin)= 30e1b6cb4168117b7923392ca536c701

$ openssl verify -verbose -CAfile cabundle.crt example.com.crt 
example.com.crt: OK

This proved that my certificates were all correct so the fault was Amazon-side. A post on their forums helped me “fix” the issue, by adjusting the order of my CA bundle, which subsequently fixed the error.

So is this a bug with Amazon? It’s tricky to say – there are several posts online which state that the order is important for some systems, but not for all. Clearly anything based around OpenSSL doesn’t care, as it was able to verify my out-of-order CA bundle happily enough.

As one does with issues like this, I dug into RFC 3280 which details how the certificate path validation should occur. Section 6.1 (Basic Path Validation) details that the path validation process is actually outside the specification, but then goes on and defines how the validation could occur, with the order of the certificates being implied, but not stated outright.

The primary goal of path validation is to verify the binding between
a subject distinguished name or a subject alternative name and
subject public key, as represented in the end entity certificate,
based on the public key of the trust anchor.  This requires obtaining
a sequence of certificates that support that binding.  The procedure
performed to obtain this sequence of certificates is outside the
scope of this specification.

To meet this goal, the path validation process verifies, among other
things, that a prospective certification path (a sequence of n
certificates) satisfies the following conditions:

   (a)  for all x in {1, ..., n-1}, the subject of certificate x is
   the issuer of certificate x+1;

   (b)  certificate 1 is issued by the trust anchor;

   (c)  certificate n is the certificate to be validated; and

   (d)  for all x in {1, ..., n}, the certificate was valid at the
   time in question.

Following the above, the specification goes on to detail different ways the path can be validated, which also imply that the certificates should be read in and then sorted by software, but it never actually states this outright.

Sadly, the way this specification is written isn’t clear on the point, which means the only 100% certain way to ensure nothing is unhappy is to put the CA bundle file in the correct order – something I would expect the SSL provider to do when they provide you with the files.
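A handy way to eyeball the order of a bundle is to print the subject and issuer of every certificate it contains – each certificate should generally be followed by the certificate that issued it:

# List the subject and issuer of each certificate in the bundle, in the order
# they appear in the file
openssl crl2pkcs7 -nocrl -certfile cabundle.crt | openssl pkcs7 -print_certs -noout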