Tag Archives: open source

All posts relating to Open Source software, mostly but not exclusively UNIX focused.

cifs, ipv6 and rhel 5

Unfortunately my recent project to enable IPv6 across my entire personal server environment has turned up a number of annoying issues – nothing that isn’t fixable, but generally frustrating things that just shouldn’t be a problem.

Particular thanks goes to my many RHEL/CentOS 5 virtual machines, which lack some pretty key stuff such as:

  • No IPv6 connection tracking, which prevents the ESTABLISHED,RELATED ip6tables rules from working.
  • Unexpected behavior of certain bootscript configuration options.
  • Lack of IPv6 support with CIFS (Samba/SMB) share mounting.
  • Some weirdness with Dovecot I still need to resolve.

(Personally, based on the number of headaches I’ve found with RHEL 5, my recommendation is to accelerate any plans to upgrade to RHEL 6 – or some other distribution – before deploying IPv6 in production.)

At the moment, CIFS IPv6 support on RHEL 5 & 6 has been causing me the most pain. My internal file server is dual stacked and has both A and AAAA DNS records – it’s a stock-standard CentOS 6 box running the distribution-shipped Samba packages; it works perfectly from the server side, and modern IPv6 hosts have no issue mounting the shares over IPv6.

Very typical dual stack configuration:

# host fileserver.example.com 
fileserver.example.com has address 192.168.0.10
fileserver.example.com has IPv6 address 2001:0DB8::10

However, when I run the following perfectly legitimate and syntactically correct command on a RHEL 5 host to mount the CIFS share provided by the Samba server, it breaks with an error message that normally indicates incorrect mount option syntax:

# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody
mount: wrong fs type, bad option, bad superblock on //fileserver.example.com/tmp,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Taking a look at the kernel log shows a non-descriptive error:

kernel:  CIFS VFS: cifs_mount failed w/return code = -22

This isn’t particularly helpful, made more infuriating by the fact that I know the command syntax is correct and should be working perfectly fine.

Seeing as a number of things broke after switching on IPv6 across the entire network, I’ve become even more of a cynical bastard and ran some tests with explicitly specified IPv6 and IPv4 addresses in the mount command.

I found that passing the IPv6 address instead of the DNS name produces a further error message which offers some insight:

kernel: CIFS: ip address too long

Huh. Looks like a textbook IPv6 support bug to me. (Even I have made this mistake in some older-generation web apps that didn’t foresee long 128-bit addresses.)

In testing, I found that the following commands are all acceptable on a dual-stack network with a RHEL 5 host:

# mount -t cifs //192.168.0.10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=192.168.0.10

However, all ways of specifying IPv6 fail, as does plain DNS resolution:

# mount -t cifs //2001:0DB8::10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=2001:0DB8::10
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody

No method of connecting via IPv6 would work, leaving stock RHEL 5 hosts only being able to work with CIFS shares via IPv4. :-(

Unfortunately this error is due to a known kernel bug in 2.6.18, which was fixed in 2.6.31, but sadly not backported to RHEL 5’s kernel (as of version 2.6.18-308.8.1.el5 anyway), leaving RHEL 5 users in a position where the stock OS is unable to mount CIFS shares on an IPv6 or dual-stacked network. :-(

The ideal solution would be to patch the kernel to resolve the issue – in fact, if you are running a native IPv6-only (not dual-stacked) network, it is the only way to get a working solution.

However, if you’re using RHEL, custom kernels typically aren’t that popular due to the impact on vendor supportability of the platform and the added headache of tracking and applying security updates, so another approach is needed.

The following methods will all work for stock RHEL/CentOS 5:

  • Use the ip=X mount option to override DNS.
  • Add an entry to /etc/hosts.
  • Have a separate DNS entry that only has an A record for your file servers (ie //fileserverv4only.example.com/)
  • Disable IPv6 entirely (and suffer the scorn of your cooler IPv6 enabled friends).

These solutions all suck – manually fixed IPs aren’t great for long-term supportability, extra DNS records are another pain to manage, and let’s not even begin to cover why disabling IPv6 entirely is wrong.
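As an illustration of the first workaround, something along these lines is enough to pin a mount to IPv4 on a stock RHEL 5 box – note the script name, share and mountpoint are just examples, and it assumes the host command from bind-utils is available:

#!/bin/sh
# mount-cifs-v4.sh - example only: force an IPv4-only CIFS mount on RHEL 5
# by resolving the A record ourselves and passing it via the ip= option.
SERVER="fileserver.example.com"
SHARE="tmp"
MOUNTPOINT="/mnt/tmpshare"

# grab the first A record only (any AAAA records are ignored)
IPV4=$(host -t A "$SERVER" | awk '/has address/ {print $4; exit}')

if [ -z "$IPV4" ]; then
    echo "Unable to resolve an IPv4 address for $SERVER" >&2
    exit 1
fi

mount -t cifs "//$SERVER/$SHARE" "$MOUNTPOINT" -o "user=nobody,ip=$IPV4"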

Of course RHEL 5 is a little outdated now, so I took a look at how RHEL 6 fared. On the plus side, it *can* mount IPv6 shares – all of the following mount commands are accepted without fault:

# mount -t cifs //192.168.0.10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //2001:0DB8::10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=192.168.0.10
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=2001:0DB8::10

However, mounting an IPv6 server using the DNS name will still fail, just as it did with RHEL 5:

# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody

The solution is to install the “cifs-utils” package, which provides the /sbin/mount.cifs binary and offers smarter handling of shares – once it’s installed, all of the mount commands work fine on RHEL 6, including the standard DNS-based one we all know and love. :-D

I had always assumed that all Linux systems able to mount CIFS shares had the /sbin/mount.cifs binary installed, but it seems that’s not the case – the standard /bin/mount command can handle mounting CIFS using just the kernel mount() function.

However, when /bin/mount detects a /sbin/mount.FILESYSTEM helper binary, it calls that instead of calling the kernel mount() directly; these helpers can apply additional logic and handling to the mount command before passing it through to the Linux kernel.
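If you want to watch this dispatch happening on your own host, something like the following should do it (assuming strace is installed – the grep is just to cut the noise down to the helper lookup and the mount() call):

# strace -f mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody 2>&1 \
    | egrep 'mount\.|mount\('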

For example, the following strace from a RHEL 5 host shows that /bin/mount checks for the existence of /sbin/mount.cifs, before going on to call the Linux kernel mount() directly with the provided arguments:

stat64("/sbin/mount.cifs", 0xbfc9dd20)  = -1 ENOENT (No such file or directory)
...
mount("//fileserver.example.com/tmp", "/mnt", "cifs", MS_MGC_VAL, "user=nobody,password=nobody") = -1 EINVAL (Invalid argument)

But a RHEL 6 host with cifs-utils installed provides /sbin/mount.cifs, which appears to do its own name resolution, establish connections to both the IPv4 and IPv6 sockets, decide which one to use and then instruct the kernel via the ip=X parameter:

stat64("/sbin/mount.cifs", {st_mode=S_IFREG|0755, st_size=29376, ...}) = 0
clone(Process 1666 attached
...
[pid  1666] mount("//fileserver.example.com/tmp/", ".", "cifs", 0, "ip=2001:0DB8::10,user=nobody,password=nobody") = 0

So I had an idea….. what if I could easily modify a version of cifs-utils to run on RHEL 5 dual-stack servers, yet only ever resolve DNS queries to IPv4 addresses to work around the kernel issue? :-D

Turns out you can – effectively I just made the nastiest hack ever by just tearing out the IPv6 name resolver. :-/

I’m going to hell for this, but damn, feels good man. ;-)

I wasn’t totally evil – I added an info-level syslog notice about the IPv4 enforcement in case some poor admin ever gets puzzled by a customised RHEL 5 box refusing to connect to CIFS shares over IPv6 – that would be a bit too cruel. ;-)

The hack is pretty crude – it simply breaks the IPv6 socket connection attempt so that mount.cifs falls back to IPv4. It throws up a couple of errors in the logs, but doesn’t actually impact the mounting at all.

mount.cifs: Warning: Using specially patched cifs-utils to ignore IPv6 address resolution - enforcing IPv4 only!
kernel:  CIFS VFS: Error connecting to socket. Aborting operation
kernel:  CIFS VFS: cifs_mount failed w/return code = -111

But wait, there’s more! I have shiny cifs-utils i386/x86_64/SRPM packages with this evil hack available for download from the amberdms-os repository (or directly from the server here).

Naturally this is a bit of a kludge – don’t trust it for mission-critical stuff, you ONLY need it for RHEL 5 (not RHEL 6), and I can’t guarantee it won’t eat all your data and bring about the end times, etc, etc.

I’ve tested it on my devel systems and it seems like the nicest fix – sure it won’t work for any hosts needing to run on native IPv6, but by the time I come to drop IPv4 addressing entirely, I will certainly have moved my last hosts from RHEL 5 onto something a bit newer. :-)

Largefiles strike again!

With modern Linux systems – hell, even systems from 5+ years ago – there’s usually very little issue with handling large files (> 2GB); in fact, files considered large a decade ago are now tiny in comparison.

However, poor sysadmins like myself sometimes have to support much older machines – in my case, a legacy accounting platform tied to the RHEL 2.1 host it was installed on – and you suddenly get to rediscover the headaches that plagued the sysadmins before us.

Recently the backup scripts for this application suddenly stopped working with the error:

cpio: standard input is closed: Value too large for defined data type

Turns out that their data had finally crept over the 2GB limit, which left cpio able to write the backup, but unable to read it for verification or restore purposes.

Thankfully cpio does support largefiles, but it’s a case of adding -D_FILE_OFFSET_BITS=64 to the gcc options at build time, so I built a patched package which fixes the problem (or at least until we hit the 16GB filesystem limits). ;-)
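For anyone wanting to roll their own rather than grab my packages, the rebuild went roughly like the below – treat it as a sketch, since the exact spec file layout and rpm tooling differ between these old releases (on RHEL 2.1 itself the build command is rpm -ba rather than rpmbuild -ba):

# rpm -ivh cpio-*.src.rpm
# cd /usr/src/redhat/SPECS
#   (edit cpio.spec so that -D_FILE_OFFSET_BITS=64 is appended to the
#    compiler flags the spec passes to the build)
# rpmbuild -ba cpio.spec
# rpm -Uvh /usr/src/redhat/RPMS/i386/cpio-*.i386.rpm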

The version of cpio on the server is ancient, dating back to 2001 (RHEL 2.1 was first released in 2002), so it’s over a decade old now. I found it quite difficult to obtain the source for the exact installed release – Red Hat seemed to be missing it (they have -23 and -28, but not -25) – so I pulled the Red Hat 8 source, which comes from around the same time period. One of the advantages of having RHN is being able to quickly pull old packages, both binary and source. :-)

If you have this exact issue with a legacy system using cpio, feel free to grab my binary or source package from my repos and save yourself some build time. :-)

mailx contains invalid character

Whilst my network is predominantly CentOS 5 hosts, I’ve started moving many of them to CentOS 6, mostly on the basis of upgrading whenever a host needs a newer version of something, since I don’t really want to spend an entire week rebuilding all 30-odd VMs.

One problem I encountered was a number of scripts failing when sending emails, throwing out messages to STDERR:

[example] contains invalid character '['
send-mail: invalid option -- 's'
send-mail: invalid option -- 's'
send-mail: fatal: usage: send-mail [options]

What I found is that on CentOS/RHEL 5, the following would work fine:

# mail root -s "[example] message"
test message content
Cc: 
#

But on CentOS/RHEL 6, it would ignore the subject field (as can be seen by it re-asking for it) and then fail with an annoying “invalid character” error:

# mail root -s "[example] message"
[example] contains invalid character '['
Subject: 
test message content
EOT
#
# send-mail: invalid option -- 's'
send-mail: invalid option -- 's'
send-mail: fatal: usage: send-mail [options]
#

Turns out that between mailx version 8.1.1 and mailx version 12.4, the mailx binary got a lot more fussy about the formatting of the command line options.

Viewing the help on both versions shows that options need to come before the destination user; however, it seems older versions of mailx were a bit more lax and tolerated some flexibility in the ordering of the command line options.

Usage: mail -eiIUdEFntBDNHRV~ -T FILE -u USER -h hops -r address \
 -s SUBJECT -a FILE -q FILE -f FILE -A ACCOUNT -b USERS -c USERS \
 -S OPTION users

The correct solution is to always have the target user as the final field, after the command line options, i.e.:

# mail -s "[example] message" root
test message content
Cc: 
#

This works happily on all versions, since it’s the correct command line syntax.
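For scripts, the same rule applies when piping the message body in – a trivial example:

# echo "test message content" | mail -s "[example] message" root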

Hopefully everyone else is smart enough to do this the right way the first time, but I figured I’d post this in case some other poor sysadmin is having the same confusion over the invalid character message. :-)

Munin Performance

Munin is a popular open source network resource monitoring tool which polls the hosts on your network for statistics for various services, resources and other attributes.

A typical deployment will see Munin being used to monitor CPU usage, memory usage, amount of traffic across network interfaces, I/O statistics and more – it’s very handy for seeing long term performance trends and for checking the impact that upgrades or adjustments to the environment have made.

Whilst having some overlap with Nagios, Munin isn’t really a replacement, more an addition – I use Nagios to do critical service and resource monitoring and use Munin to graph things in more detail – something that Nagios doesn’t natively do.

A typical Munin graph - Munin provides daily, weekly, monthly and yearly graphs (RRD powered)

Rather than running as a daemon, the Munin master runs a cronjob every 5 minutes that calls a sequence of scripts to poll the configured servers and generate new graphs.

  1. munin-update to poll configured hosts for new statistics and store the information in RRD databases.
  2. munin-limits to highlight perceived issues in the web interface and optionally to a file for Nagios integration.
  3. munin-graph to generate all the graphs for all the services and hosts.
  4. munin-html to generate the html files for the web interface (which is purely static).

The problem with this model is that it doesn’t scale particularly well – once you start monitoring a substantial number of servers, the step-by-step approach can run out of resources and time to complete within the 5 minute cron period.

For example, the following are the results for the 3 key scripts that run on my (virtualised) Munin VM monitoring 18 hosts:

sh-3.2$ time /usr/share/munin/munin-update
real    3m22.187s
user    0m5.098s
sys     0m0.712s

sh-3.2$ time /usr/share/munin/munin-graph
real    2m5.349s
user    1m27.713s
sys     0m9.388s

sh-3.2$ time /usr/share/munin/munin-html
real    0m36.931s
user    0m11.541s
sys     0m0.679s

That’s a total of around 6 minutes to run – long enough that one run is going to start clashing with the next one kicked off by cron.

So why so long?

Firstly, munin-update – munin-update’s time is mostly spent polling the munin-node daemon running on all the monitored systems and then a small amount of I/O time writing the new information to the on-disk RRD files.

The developers appear to have realised the scaling issue with munin-update and added the ability to run it in a forked mode – however this broke horribly for me in a highly virtualised environment, since polling 12+ servers all running on the one physical host caused a sudden load spike and led to service poll timeouts, with no values being returned at all. :-(

This occurs because by default Munin allows a maximum of 5 seconds for each service query to complete, and it queries all the hosts and services rapidly, ignoring any that fail to respond fast enough. When querying a large number of servers on one physical host, that host would be too loaded to respond in time.

I ended up boosting the timeouts on some servers to 60 seconds (particularly the KVM hosts themselves, as there would sometimes be 60+ LVM volumes that Munin wanted statistics for), but it still wasn’t a good solution and the load spikes would continue.
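For reference, the node-side change is just a one-liner in the munin-node configuration on the monitored host – the exact defaults and available timeout options vary a little between Munin versions, so check which applies to your own setup:

# /etc/munin/munin-node.conf
# seconds munin-node will allow each plugin to run before giving up
timeout 60

# /etc/init.d/munin-node restart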

There are some tweaks that can be used, such as adjusting the max number of forked processes, but it ended up being more reliable and easier to support to just run a single thread and make sure it completed as fast as possible – and taking 3 mins to poll all 18 servers and save to the RRD database is pretty reasonable, particularly for a staggered polling session.

 

After getting munin-update to complete in a reasonable timeframe, I took a look into munin-html and munin-graph – both these processes involve reading the RRD databases off the disk and then writing HTML and RRDTool Graphs (PNG files) to disk for the web interface.

Both processes have the same issue – they chew a solid amount of CPU whilst processing data and then they would get stuck waiting for the disk I/O to catch up when writing the graphs.

The I/O on this server isn’t the fastest at the best of times, considering it’s an AES-256 encrypted RAID 6 volume, and writing around 200MB of changed data on every run was a bit too much to do efficiently.

Munin offers some options, including on-demand graph generation using CGIs, however I found this just made the web interface unbearably slow to use – although from chats with the developer, it sounds like version 2.0 will resolve many of these issues.

I needed to fix the performance with the current batch generation model. Just watching the processes in top quickly shows the issue with the scripts, particularly with munin-graph, which runs 4 concurrent processes, all of them waiting for I/O. (Linux process crash course: S is sleeping (idle), R is running, D is performing I/O operations – or waiting for them).

Clearly this isn’t ideal – I can’t do much about the underlying performance, other than considering moving the monitoring VM onto a different, unencrypted I/O device, but then I lose all the advantages of having everything on one big LVM pool.

I do, however, have plenty of CPU and RAM (quad-core Phenom, 16GB RAM), so I decided to boost the VM from 256MB to 1024MB RAM and set up a tmpfs filesystem, which is an in-memory filesystem.

Munin has two main data sources – the RRD databases and the HTML & graph outputs:

# du -hs /var/www/html/munin/
227M    /var/www/html/munin/

# du -hs /var/lib/munin/
427M    /var/lib/munin/

I decided that putting the RRD databases in /var/lib/munin/ into tmpfs would be a waste of RAM – remember that munin-update is running single-threaded and waiting for results from network polls, meaning that I/O writes are going to be spread out and not particularly intensive.

The other problem with putting the RRD databases into tmpfs is that a server crash or power-down would lose all the data, which then requires some regular process to copy it to a safe place, etc, etc – not ideal.
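(For what it’s worth, if you did want to risk tmpfs for the RRDs, the minimum safety net would be a regular copy back to persistent disk – purely an illustration, the paths and schedule below are arbitrary:)

# /etc/cron.d/munin-rrd-flush - example only
# copy the in-memory RRDs back to persistent disk every 15 minutes,
# so a crash loses at most 15 minutes of history
*/15 * * * * root rsync -a --delete /var/lib/munin/ /var/lib/munin.persistent/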

However the HTML & graphs are generated fresh each time, so a loss of their data isn’t an issue. I set up a tmpfs filesystem for them in /etc/fstab with plenty of space:

tmpfs  /var/www/html/munin   tmpfs   rw,mode=755,uid=munin,gid=munin,size=300M   0 0
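To bring the new tmpfs up without a reboot (assuming the mountpoint already exists and Munin isn’t mid-run), it’s just a matter of:

# mount /var/www/html/munin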

And ran some performance tests:

sh-3.2$ time /usr/share/munin/munin-graph 
real    1m37.054s
user    2m49.268s
sys     0m11.307s

sh-3.2$ time /usr/share/munin/munin-html 
real    0m11.843s
user    0m10.902s
sys     0m0.288s

That’s a decrease from around 162 seconds (2.7 mins) to 109 seconds (1.8 mins). It’s a reasonable improvement, but the real difference is the massive reduction in load on the server.

For a start, we can see from watching the processes with top that the processor gets worked a bit more to complete the process, since there’s not as much waiting for I/O:

With the change, munin-graph spends almost all its time doing CPU processing, rather than creating I/O load – although there’s the occasional period of I/O as above, I suspect from the time spent reading the RRD databases off the slower disk.

Increased bursts of CPU activity are fine – it actually works out to less total CPU load, since the CPU no longer needs to do disk encryption, and hammering one core for a short period is fine; there are plenty of other cores and Linux handles resource scheduling pretty well.

We can really see the difference with Munin’s own graphs for the monitoring VM after making the change:

In addition, the host server’s load average has dropped significantly and the load time for the web interface is insanely fast – no more waiting for my browser to finish pulling all the graphs down for a page, instead it loads in a flash. Munin itself gives you an idea of the difference:

If performance continues to be a problem, there are some other options such as moving RRD databases into memory, patching Munin to do virtualisation-friendly threading for munin-update or looking at better ways to fix CGI on-demand graphing – the tmpfs changes would help a bit to start with.

find-debuginfo.sh invalid predicate

I do a lot of packaging for RHEL/CentOS 5 hosts; often this is backporting newer software versions. Typically I’ll pull Fedora’s latest package and make various adjustments for RHEL 5’s older environment – things like package name changes, downgrading from systemd to init, and correcting any missing build dependencies.

Today I came across this rather unhelpful error message:

+ /usr/lib/rpm/find-debuginfo.sh /usr/src/redhat/BUILD/my-package-1.2.3
find: invalid predicate `'

This error is due to the newer Fedora spec files often not explicitly setting the value of BuildRoot, leaving the package to install into the default build root location – which isn’t always defined on RHEL 5 hosts.

The correct fix is to define the build root in the spec file with:

BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)

This sets both %{buildroot} and $RPM_BUILD_ROOT, so whichever syntax you use, the files will be installed into the right place.

However, this error is a symptom of a bigger issue – without BuildRoot defined, the package will still compile and complete make install, but instead of the installed files going into /var/tmp/packagename…etc, they will be installed directly into the actual / filesystem, which is generally ReallyBad(tm).

Now if you were building the package as a non-privileged user, this would have failed at the install phase and you would not have gotten as far as the invalid predicate error.

But if you were naughty and building as the root user, the package would have installed into / without complaint and clobbered any existing files installed on the build host. And the first sign of something being wrong is the invalid predicate error when the find debug script gets provided with no files.

This is the key reason why you are highly recommended to build all packages as a non-privileged user, so that if the build incorrectly tries to install anything into /, the install will be denied and the user will quickly realize things aren’t installing into the right place.
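Setting up an unprivileged build environment only takes a minute – a rough example, with the username and paths being arbitrary:

$ echo '%_topdir /home/builder/rpmbuild' > ~/.rpmmacros
$ mkdir -p ~/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
$ rpmbuild -ba ~/rpmbuild/SPECS/my-package.spec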

Building as root can be even worse than just “whoops, I overwrote the installed version of mypackage whilst building a new one” or “blagh annoying invalid predicate error” – consider the following specfile line:

rm -rf $RPM_BUILD_ROOT/%{_includedir}

On a properly defined package, this would execute:

rm -rf /var/tmp/packagename/usr/include/

But on a package lacking a BuildRoot definition it becomes:

rm -rf /usr/include/

Yikes! Not exactly what you want – of course, running as a non-root user would save you, since that rm command would be refused and you’d quickly figure out the issue.

I will leave it as an exercise of the reader to determine why I commented about this specific example… ;-)

IMHO, rpmbuild should be patched to just outright refuse to compile packages as the root user so this mistake can’t happen – it seems silly to allow such a bad packaging habit when the potential damage is so severe.

acpid trickiness

Ran into an issue last night with one of my KVM VMs not registering a shutdown command from the host server.

This typically happens because the guest isn’t listening to (or is configured to ignore) ACPI power “button” presses, so the guest never gets told that it should shut down.

In the case of my CentOS (RHEL) 5 VM, the acpid daemon wasn’t installed/running so the ACPI events were being ignored and the VM would just stay running. :-(

To install, start and configure to run at boot:

# yum install -y acpid
# /etc/init.d/acpid start
# chkconfig --level 345 acpid on

If acpid wasn’t originally running, it appears that the HAL daemon can grab control of the /proc/acpi/event file and you may end up with the following error upon starting acpid:

Starting acpi daemon: acpid: can't open /proc/acpi/event: Device or resource busy

The reason can quickly be established with a ps aux:

[root@basestar ~]# ps aux | grep acpi
root        17  0.0  0.0      0     0 ?        S<   03:16   0:00 [kacpid]
68        2121  0.0  0.3   2108   812 ?        S    03:18   0:00 hald-addon-acpi: listening on acpi kernel interface /proc/acpi/event
root      3916  0.0  0.2   5136   704 pts/0    S+   03:24   0:00 grep acpi

Turns out HAL grabs the proc file for itself if acpid isn’t running, but if acpid is running, it will talk to acpid to get its information. This would self-correct on a reboot, but we can just do:

# /etc/init.d/haldaemon stop
# /etc/init.d/acpid start
# /etc/init.d/haldaemon start

And sorted:

[root@basestar ~]# ps aux | grep acpi
root        17  0.0  0.0      0     0 ?        S<   03:16   0:00 [kacpid]
root      3985  0.0  0.2   1760   544 ?        Ss   03:24   0:00 /usr/sbin/acpid
68        4014  0.0  0.3   2108   808 ?        S    03:24   0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root     16500  0.0  0.2   5136   704 pts/0    S+   13:24   0:00 grep acpi

 

A tale of two route controllers

Ever since I built a Linux 3.2.0 kernel for my Debian Stable laptop to take advantage of some of the newer kernel features, I have been experiencing occasional short periods of disconnect/reconnect on the Wi-Fi network.

This wasn’t happening heaps (maybe a couple times a day), but it was starting to get annoying, so I decided to sort it out properly and do a kernel driver and microcode update for my Intel Centrino Wireless-N 1000 card.

The firmware/microcode update was easy enough, simply a case of downloading the latest code from Intel and installing into /lib/firmware/ – the kernel driver does the rest, finding it and loading it into the Wi-Fi card at boot time.

Next step was building a new kernel for my machine. I went through and tuned the module selection very carefully, tossing out all the hardware my laptop will never use, as I was getting sick of wasting disk space on the billion+ device modules in Linux these days.

After finding that my initial kernel lacked support for my video card (turns out the Lenovo X201i laptops still use AGP-based i915 cards, I was assuming PCIe) I got a working kernel up and running.

Except that my Wi-Fi stability problem was worse than ever – instead of losing connectivity every few hours, it was now doing so every few minutes. :-(

The logs weren’t particularly helpful – NetworkManager likes to give reason numbers but I couldn’t easily find a documented explanation of these (but maybe I’m looking in the wrong place).

19:44:36 NetworkManager[1650]: <info> (wlan0): device state change: 8 -> 9 (reason 5)
19:44:36 NetworkManager[1650]: <warn> Activation (wlan0) failed for access point (b201)
19:44:36 NetworkManager[1650]: <warn> Activation (wlan0) failed.
19:44:36 NetworkManager[1650]: <info> (wlan0): device state change: 9 -> 3 (reason 0)
19:44:36 NetworkManager[1650]: <info> (wlan0): deactivating device (reason: 0).
19:44:36 NetworkManager[1650]: <info> (wlan0): canceled DHCP transaction, DHCP client pid 3354
19:44:36 kernel: [  391.070772] wlan0: deauthenticating from 00:0c:42:67:8b:bc by local choice (reason=3)
19:44:36 kernel: [  391.185461] wlan0: moving STA 00:0c:42:67:8b:bc to state 2
19:44:36 kernel: [  391.185466] wlan0: moving STA 00:0c:42:67:8b:bc to state 1
19:44:36 kernel: [  391.185470] wlan0: moving STA 00:0c:42:67:8b:bc to state 0
19:44:36 wpa_supplicant[1682]: CTRL-EVENT-DISCONNECTED - Disconnect event - remove keys
19:44:36 NetworkManager[1650]: <error> [1337240676.376011] [nm-system.c:1229] check_one_route(): (wlan0): \
         error -34 returned from rtnl_route_del(): Netlink Error (errno = Numerical result out of range)
19:44:36 kernel: [  391.233344] cfg80211: Calling CRDA to update world regulatory domain
19:44:36 avahi-daemon[1633]: Withdrawing address record for 192.168.1.11 on wlan0.
19:44:36 avahi-daemon[1633]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 192.168.1.11.
19:44:36 avahi-daemon[1633]: Interface wlan0.IPv4 no longer relevant for mDNS.
19:44:36 avahi-daemon[1633]: Withdrawing address record for 2407:1000:1003:99:226:c7ff:fe66:b822 on wlan0.
19:44:36 avahi-daemon[1633]: Leaving mDNS multicast group on interface wlan0.IPv6 with address 2407:1000:1003:99:226:c7ff:fe66:b822.
19:44:36 NetworkManager[1650]: <info> (wlan0): writing resolv.conf to /sbin/resolvconf
19:44:36 avahi-daemon[1633]: Joining mDNS multicast group on interface wlan0.IPv6 with address fe80::226:c7ff:fe66:b822.
19:44:36 avahi-daemon[1633]: Registering new address record for fe80::226:c7ff:fe66:b822 on wlan0.*.

So I proceeded to debug:

  1. Cursed and wished my 300m spool of Cat6 ethernet wasn’t in Wellington.
  2. Rolled back the microcode update – my initial thought was that the new code was making the card unstable and the result was the card dropping the connection and NetworkManager doing the clean up.
  3. Did a full power down to make sure that the microcode wasn’t remaining active on the card across reboots (had this problem with a dodgy GPU once).
  4. Verdict: Microcode upgrade was OK, must be something else.
  5. Upgraded NetworkManager from 0.8.1 to 0.8.4 from Debian Backports – 0.8.1 isn’t too recent, was tempted to try 0.9 series but would have required a lot more backporting work.
  6. Verdict: Appears not to be a NetworkManager issue in the 0.8 series – maybe something fixed in 0.9 or later?
  7. Upgraded wpasupplicant from 0.6.10 to 1.0 by manual backport from unstable – the activation error made me consider it might have been a bug with newer kernels & wpasupplicant’s AP negotiation.
  8. Verdict: No change to the issue.
  9. Built a Linux 3.3 kernel with the older less-crashy 3.2 iwlwifi driver to see if it was driver specific, or otherwise-kernel related.
  10. Verdict: Same issue continued to occur; rolling back the driver version in fact made no change – something about the 3.3 kernel itself was the problem.
  11. Got suspicious about NetworkManager – either it or the kernel had to be at fault, and one possibility was some weird API breakage given the age gap between the software versions in use. The kernel is *usually* pretty solid and something like Wi-Fi drivers dropping every couple of minutes would be a pretty serious bug to slip through, so I went back through the logs to see if I could get anything more useful out of NetworkManager.
  12. Spotted a kernel error “ICMPv6 RA: ndisc_router_discovery() failed to add default route.“. This error tended to occur shortly before any WiFi disconnection occurred, but not immediately so.
  13. Found an entry in Red Hat’s bugzilla.
  14. And then the upstream bug fix from 19th April.

Turns out that the Linux 3.3 kernel and NetworkManager fight over which one is going to control the default route for each router-advertised link – the kernel adds one, NetworkManager removes it, and then the kernel gets upset and drops all router advertisements.

In hindsight, I should have spotted it sooner, but I had initially discarded the RA message as being unrelated, as the disconnection often didn’t happen till a minute or two after the log entry occurred – eg:

19:51:40 kernel: [  814.274903] ICMPv6 RA: ndisc_router_discovery() failed to add default route.
19:52:47 NetworkManager[1650]: <info> (wlan0): device state change: 8 -> 9 (reason 5)

What’s interesting about this bug is that at first reading it explains a loss of IPv6 connectivity perfectly – however it doesn’t explain why IPv4 or the Wi-Fi connection itself was impacted.

The reason this happened is that NetworkManager was set to have IPv6 as a requirement for that connection to be established – in the event of IPv6 not working, NetworkManager would consider the interface to be down, even if IPv4 was up.

There is a good reason for this, that the developers detailed on their (excellently written) blog, explaining that by having NetworkManager check for IPv6, it allows applications to be written smarter to better understand their level of connectivity.

For users of the NetworkManager 0.9 series, there’s a patch already committed which you can grab here and I would expect the next NetworkManager update will have this fix.

If you’re on the NetworkManager 0.8 series, this patch won’t apply cleanly – I might make some time to go and backport it, but you can work around it for now by setting the connection’s IPv6 method to Ignore, so that NetworkManager does nothing and leaves it to the Linux kernel in the background to negotiate IPv6 addressing.

Breaking vs Working Network Manager Settings
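If your connection happens to be stored as a system-wide keyfile rather than managed purely through the GUI, the equivalent setting looks like this (path and syntax may vary slightly between NetworkManager versions):

# /etc/NetworkManager/system-connections/<your-connection>
[ipv6]
method=ignore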

Of course if you’re not connecting to any IPv6 capable networks, you don’t have anything to worry about (other than the fact you’re still stuck in the 20th century).

 

Initially I was a bit annoyed at NetworkManager for being so silly as to drop the whole interface when just one of the two networking stacks was broken; however, after thinking about it for a bit, the behaviour does make some sense – most interface issues can be fixed by reconnecting (maybe the AP got rebooted, maybe the laptop just moved between two of them, etc), so a reconnect can solve many of them.

But a smarter approach would be to determine whether network issues are layer 2 or layer 3 – if it’s just a layer 3 issue, there’s little need to drop the Wi-Fi connection itself; instead, attempt to re-establish IPv4 or IPv6 connectivity where appropriate, and if unable to do so, use notifications to tell the user that “IPv6 connectivity is experiencing a problem, some hosts and services may be unreachable”.

It’s actually something that Windows does semi-OK – it figures out roughly how borked a user’s connection is and then does a balloon popup stating that there’s limited connectivity or IP conflict, or some other sometimes helpful message.

This may be better in newer versions of NetworkManager, I’ll have to have a play with a more recent release and see.

Fixing Blogging

I’m finding an increasing number of friends and people using services like Tumblr or Google Plus as blogging services, or at least as a place to make posts that are more detailed and in-depth than typical micro-blogging (aka Twitter/Facebook).

The problem with both these services is that they deny interaction to external users who aren’t registered with them.

With traditional blogging platforms such as WordPress, Blogger, or other custom developed blogs, any visitor to the blog can read it and post comments – the interfaces vary, the ease of posting varies and the method of validating posts varies, but 99% of the time you can still post comments and engage with the author.

This has not been the case with social networks to date – platforms like Twitter or Facebook require a user to be logged in, in order to communicate with others – however this tends to work OK, since they’re mostly used for person-to-person messages and broadcasting, rather than detailed posts you will send to users outside of those networks (after all, 140-char tweets aren’t exactly where you’ll debate anything of real depth).

The real issue starts with half-blog, half-microblog services such as Tumblr and Google Plus, which users have started to use for anything from cat pictures to detailed Linux kernel posts, turning these tools into de-facto blogging platforms, but without the freedom for outsiders to post comments and engage in conversation.

 

Tumblr is one of the worst networks, as it’s very much designed as a glorified replacement for chain email forwards – you post some text or some pictures, all your friends “reblog” your page if they like it, and everyone pats themselves on the back at how witty and original they all are.

But to make a comment, one must reblog the post, add a comment and have it end up in the pages-long list of reblog and like statements at the bottom of the post. And if the original poster wants to comment on that, they’d have to reblog your reblog. :-/

Yo dawg, we heard you like to reblog your reblogging.

The issue is that more people are starting to use it for more than just funny cat pictures and treat it as a replacement to blogging, which makes for a terrible time engaging with anyone. I have friends who use the service to post updates about their lives, but I can’t engage back – makes me feel like some kind of outcast stalker peering through the windows at them.

And even if I was on Tumblr, I’d actually want to be able to comment on things without reblogging them – nobody else cares if Jane had a baby, but I’d like to say “Congrats Jane, you look a lot less fat now the fork()ed process is out” to let my friend know I care.

Considering most Tumblr users are going to use Facebook or Twitter as well, they might as well use the image and short statement posting features of those networks and instead use an actual blog for actual content. Really the fault is due to PEBKAC – users using a bad service in the wrong way.

 

Google Plus is a bit better than Tumblr, in that it actually has expected functionality like posts you can comment on; however, it still lacks the ability for outsiders to post comments and engage with the author – Google has been pretty persistent in trying to get people to sign up for an account, so it’s somewhat to be expected.

I’ve seen a lot of uptake with Google Plus by developers and geeks, seemingly because they don’t want the commitment of actually using a blog for detailed posts, but want somewhere to post lengthy bits of text.

Linus Torvalds is one particular user whom I might want to follow on Google Plus, but there’s not even RSS if you want to get updates on new posts! (To get RSS, you’d have to use external third-party services.)

Tumblr at least has RSS so I can still use it in my reader like everything else, even if I can’t reply to the author….

Follow Linus! Teenage fanboy Jethro squeee!

And of course, with outsiders unable to post comments, I can’t send Linus a comment requesting his hand in marriage after he merges a kernel bug fix for my laptop. :-(

 

So with all these issues, why are users adopting these services? After all, there are thousands of free blogging services, several of them well known and very good – all technically better options.

I think it’s a combination of issues:

  • Users got overwhelmed by RSS – we followed everything we loved, then got scared by the 10,000 unread posts in our readers – and responded by simply not opening the reader for fear of the queue waiting for us. The social media style approaches used by Google+, Tumblr and of course Twitter and Facebook focus less on following every single post by users, and more on what’s happening here and now – users don’t feel bad if they miss reading 1,000 posts overnight, they just go on to the next.
  • Users love copying. The MPAA & RIAA love this fact about humans, we love to copy and share stuff with others. Blogging culture tends to frown on this, but Tumblr’s reblogging style of use makes it more acceptable and maintains a credit trail.
  • Less commitment – if I started posting pictures of funny cats or one-paragraph posts on this blog, it wouldn’t be doing it justice or meeting the level of quality readers expect. However on social-network-based services this is OK – there’s no expectation of a certain level of presentation and effort going into a post. A funny cat picture followed by a post of you raging about why GNU Hurd will always be better than BSD is acceptable – on a blog, you’d drop the funny cat and be expected to write a well-detailed post explaining your reasoning. Another way to put it is that it’s “more casual” than conventional blogs.
  • Easier interactions with your readers (at least with Google+) – there are no blogging standards for notifying users about changes to your blog or replies to their comments. Even WordPress, one of the most popular platforms, doesn’t provide native email notifications for comment replies.
  • I noticed a major improvement in the level of interaction between myself and my readers after adding the Subscribe to Comments Reloaded plugin to this site, which emails users about replies to comments on my blog posts. And considering how slack many people are at checking their email, I do wonder how much better it would be if I added support for notifications of new posts and comment replies via Twitter or Facebook.
  • Conventional blogs tend to take a bit more effort to post comments on – some go overboard with captcha fields that take 10 attempts, or painful comment validation. I’ve tried to keep mine simple with basic fields, dealing with spam using Akismet rather than a captcha (which has worked very well for me).

In my opinion the biggest problem is the communication, notification and interaction issue noted above. I don’t believe we can fix the cultural side – the crap users post or their unwillingness to make the effort to read their RSS – but we can go some way towards improving the technology to reduce or eliminate some of the pain points and encourage use of these services.

There have been some attempts to address these issues already:

  • Linkback techniques such as Pingback address the issue of finding out who’s linking to your blog (although I turned this off as I found it really spammy and I get that information out of awstats anyway).
  • RSS handles getting updates of new posts on a polling basis and smarter RSS readers offer better filtering/grouping/etc.
  • Email notifications for blog comments and updates.

But it’s not good enough yet – what I’d actually like to see would be:

  • Improvement of linkback techniques to spam pages less, potentially with the addition of some AI logic to determine whether the linkback was just “check out this cool post!” or some actual useful content that readers of your post would like to read (such as a rebuttal).
  • Smarter RSS readers that act more like social network feeds, to give users who want more of a “live stream” feel what they want.
  • Live commenting technology – not all users have push email, so email notifications kind of suck for many users. A better solution would be to use the existing XMPP standard to send notifications to the user’s XMPP server (anyone using Gmail already has an XMPP service and numerous geeks run their own – like me ;-), so the user gets a chat message pop-up. If the message format was standardised, the IM client could recognise it as a blog comment reply and hand it off to the installed RSS reader for a better UX – or fall back to plain text with a link to the reply, for compatibility with any standard XMPP client. (A rough proof of concept follows this list.)
  • (I did see that there is an outdated plugin for XMPP on WordPress, as well as some commercial live-commenting packages that hook into social networks, but I really want a proper open source solution that does everything in one plugin, so there’s a more seamless UX – rather than 20 checkboxes for which method the user would like to be notified via.)
  • Whilst mentioning XMPP, we could even consider replacing RSS with XMPP-based push notifications – blog servers sending out a push message when they get an update, rather than readers polling. The advantage is near-instant notification of new posts and potentially less server load, without thousands of wasted polls when there isn’t any update to fetch.
  • Comment reply via notification support. If you send someone an XMPP IM, email, tweet, virtual sheep or whatever to alert them to a comment or blog post, they should be able to reply via that native medium and have the blog server interpret, validate and integrate that reply into the page.
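As a rough proof of concept for the XMPP notification idea, existing tools already get most of the way there – for example, a blog server could shell out to the sendxmpp CLI tool (credentials live in ~/.sendxmpprc) whenever a comment reply is saved. The hook, URL and addresses below are entirely made up for illustration:

#!/bin/sh
# hypothetical comment-reply hook on the blog server
echo "New reply to your comment on 'Fixing Blogging': http://example.com/?p=1234#comment-42" \
    | sendxmpp -t reader@example.com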

My hope is that with these upgrades, blogging platforms will extend themselves to be better placed for holding up against social networking sites, making it easier to have detailed conversations and long running threads with readers and authors.

Moving to a new-generation communication platform built around the existing blogging platforms would be as much of an improvement in real-time social responsiveness as shifting from email to Twitter, and hopefully the uptake in real-time communications will bring more users back to decentralised, open and varied platforms.

I’m tempted to give this a go by building a WordPress plugin to provide unified notifications using XMPP / Email / Social Media, but it’ll depend on time (lol who has that??) and I haven’t done much with WordPress’s codebase before. If you know of something existing, I would certainly be interested to read about it and I’ll be taking a look at options to build upon.

DAViCal 1.0.2 on RHEL 5 & 6

To follow up on my previous post about DAViCal, I’ve built and published RPMs for DAViCal itself and the php-awl dependency.

These are based on the spec files provided by the project, tweaked somewhat to be more suitable for RHEL 5 & 6.

 

RHEL 5 & PostgreSQL 8.1 Note

Whilst DAViCal is intended to (and for normal operation does) work with PostgreSQL 8.1 or later, that version is too old for the LDAP authentication module, which uses some PostgreSQL 8.4-only queries.

Fortunately RHEL & CentOS now ship with both PostgreSQL 8.1 and PostgreSQL 8.4 available, so you can fix this by installing with:

# yum install davical postgresql84-server

 

RHEL 5 & 6 Installation Instructions

These instructions assume you have configured the Amberdms RHEL 5 “amberdms-os” repository at minimum – or you can pull the specific RPM files you want (php-awl and davical) and add them to your own repository.

Once the repositories are setup, simply install with:

# yum install davical

DAViCal uses PostgreSQL; if this is a new/first PostgreSQL installation, you will need to start and possibly initialise the DB:

# service postgresql start
 /var/lib/pgsql/data is missing. Use "service postgresql initdb" to
 initialize the cluster first.     [FAILED]
# service postgresql initdb
 Initializing database:    [  OK  ]
# service postgresql start
 Starting postgresql service:     [  OK  ]

We need to edit the PostgreSQL user authentication configuration to allow local-only password-less access for the DAViCal application. Optionally you can configure MD5, ident or other desired methods. Add the two lines below to the configuration file, above any existing lines.

# vi /var/lib/pgsql/data/pg_hba.conf

 # trust davical
 local   davical davical_app     trust
 local   davical davical_dba     trust

Restart PostgreSQL for the changes to take effect:

# service postgresql restart

Install the database:

# cd /tmp/
# su postgres -c /usr/share/davical/dba/create-database.sh

 Supported locales updated.
 Updated view: dav_principal.sql applied.
 CalDAV functions updated.
 RRULE functions updated.
 Database permissions updated.
 NOTE
 ====
 *  The password for the 'admin' user has been set to 'EXAMPLE' 

 Thanks for trying DAViCal!  Check in /usr/share/doc/davical/examples/ for
 some configuration examples.  For help, visit #davical on irc.oftc.net.

Adjust the access rules for Apache & restart it:

# vi /etc/httpd/conf.d/davical.conf
# service httpd restart

Test access at http://localhost/davical/ or whatever your appropriate server URL is. Any 403 errors probably suggest a fault with the /etc/httpd/conf.d/davical.conf IP ACL configuration.
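The ACL itself is just standard Apache 2.2 access control – as a rough example only (the exact paths and layout of the shipped davical.conf may differ):

<Location /davical>
    Order allow,deny
    Allow from 127.0.0.1
    Allow from 192.168.0.0/24
</Location>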

 

RHEL 5 & 6 Upgrade Instructions

Using the packages I have provided, the DAViCal PostgreSQL DB will be updated on any new releases when installing newer RPMs.

This uses the /usr/share/davical/dba/update-davical-database script supplied with DAViCal and shouldn’t require any manual execution or options normally.

 

LDAP Authentication

To configure LDAP authentication, edit the configuration file and define the external authentication settings.

vi /etc/davical/config.php

See the notes in the file about LDAP configuration or consult the quite reliable source of documentation at the DAViCal wiki.

You will also need to have php-ldap installed – it’s not one of the default package dependencies – if it’s missing, you will get this clear message on the login screen:

"drivers_ldap : function ldap_connect not defined, check your php_ldap module"

To install, run:

# yum install php-ldap
# service httpd restart

If authentication still fails to work, try the following:

  1. Check the version of PostgreSQL used – must be 8.4 or later, not 8.1, as per my note at the start of this document.
  2. Check Apache error logs (typically /var/log/httpd/error_log)
  3. Check the LDAP server logs

 

 

 

DAViCal, awkward name, great features

A recurring theme of this blog is that I love being able to use open standards and open source for storing and accessing my information – the biggest example is of course IMAP for email, but I also use tools such as Mozilla Sync Server for self-reliant synchronization and backup of client device information, without relying on external cloud providers.

I’ve been a user of Evolution for almost a decade now – sometimes criticized as the “Outlook of Linux”, Evolution provides mail, calendaring, contacts and todo lists to the GNOME desktop, with a pretty large but sometimes slightly buggy feature set. For me personally, it’s always done a great job and it’s my key business productivity tool.

I moved all my mail onto an IMAP server years ago, which makes it easy to shift clients if I ever need to – in my case, pretty much just needing to access mail from both my laptop and smart phone.

However this hasn’t been the case for other key data such as calendaring and contacts. A few years ago, the open source calendaring solutions available weren’t that well developed, and many clients suffered limitations such as read-only functionality.

Thankfully this has been changing – most clients (*glares angrily at Microsoft Outlook*) now support CalDAV and CardDAV quite reliably, which gives us an open standard that works across different programs, platforms and device types.

  • CalDAV is an open standard for the exchange of calendaring and task/todo/memo information between a client and a server.
  • CardDAV is an open standard for the exchange of contact/address book information between a client and a server.

These two standards have a number of implementations, both open source and proprietary; of note are Apple Calendar Server, Apple’s open source implementation, and DAViCal, an open source LAMP-based server solution that is becoming quite popular.

I’ve used both solutions – my employer runs an Apple Calendar Server after getting fed up at not having free/busy between engineers. Whilst we ended up running a MacOS server, the Linux ports have improved and there are resources for setting it up on a Linux or even BSD host.

Apple Calendar Server works reasonably well with Evolution – I never have any issue booking events – however Evolution appears unable to accept or deny meeting requests, forcing me to go to the calendar server’s web interface, which is actually pretty horrific.

I decided I wouldn’t use it for my own personal calendaring – even if I went to the effort of porting it onto my Linux servers, it wouldn’t really be the solution I ideally want, as it lacks a lot of features and isn’t as easy to configure as other Linux services.

Instead I had a look at DAViCal. It’s a feature-packed calendaring and contacts application developed primarily by Morphoss in Wellington NZ, started by Andrew McMillan of ex-Catalyst IT fame.

Despite having an annoyingly tricky name to type (you try typing it for the 100th time at 3am without typoing on the capitalization!!), the software itself appears reliable and worked across a number of devices when I ran tests.

It’s not perfect – I have some issues with the user interface design. Whilst very functional and effective, it’s not that intuitive to a new user, exposing far too many options at the beginning; ideally it would have a simple/advanced toggle so a user who just wants to add user calendars and do basic stuff can do so, then dig into more detailed ACLs, tokens, shared calendars, etc as needed.

Naturally it’s open source, so I should stop complaining and hack up some code to demonstrate what I think might be better. Maybe if people would stop stealing my car I’d have time to get something done. :-/

Main Screen Turn On! (Maybe some more 1-2-3 clear setup flows here would be nice, the wall of text is kind of offputting for visual people like myself).

Options! All the options!

The web-based interface is only for administration – there isn’t a web-based calendar app provided with DAViCal; instead, choose any CalDAV client you wish to use with it, whether it’s web based or client-side.

I haven’t given DAViCal’s feature set a full workout yet – at this stage I’ve just set up my personal calendar, contacts and todo list on both Evolution and my Android ICS phone, but haven’t touched meeting requests, shared calendars or free/busy information.

Partly my testing is a bit limited since I’m only running Evolution 2.30.3 (Debian Stable) which is a little outdated and it looks like there’s some functionality missing/broken that might not be an issue any more.

On the mobile side, I’m using “aCal”, an open source Android application written by the DAViCal developer, providing a CalDAV calendar, todo list and read-only contact/address book synchronization.

This now means I can add, edit and delete calendar and task entries on either my Android phone or my Linux laptop via Evolution and have them propagate to the other device – although unfortunately this is based on polling rather than push (it looks like push is theoretically possible via an extension to the standard, which works with iCal).

Tasks & calendar entries in a bright sunny UI

I can also get read-only copies of all my contact information from Evolution synced through to my phone, but sadly there isn’t support for editing contacts on the Android phone just yet.

I did also consider using LDAP for my address book entries, but CardDAV looks like a better-designed solution – it’s very rare that I don’t see “LDAP” and “headache” mentioned in the same sentence, and this comes from someone who maintains and supports enterprise LDAP environments…

Essentially the main problem with LDAP is that there isn’t an exact standard for address entries, so what works for one client might not work 100% for another – along with the limited selection of decent applications for actually managing LDAP address databases.

Also, some clients treat LDAP as if it’s going to be a million+ record store and expose a different UI compared to that for smaller address books, which harms the user experience (*glares at Evolution*).

aCal & Android ICS address book integration - note the uneditable edit screen on the right, read-only for now :-(

The other main issue with aCal is that it doesn’t sync with the native OS calendar program, but instead provides its own. Digging through the documentation and mailing list, this appears to be due to the native application lacking support for some of the functionality needed for a proper CalDAV implementation, so a sync solution would leave certain features missing – although I’d still like the option.

Of course these are limitations of aCal, not DAViCal or the standards themselves – there are some other CalDAV & CardDAV sync programs available in the Android market under non-open licenses, which you have the option of trying.

The nice thing about using standards is that you can have multiple vendors competing to make the best product/tool for their customer’s needs, not simply using lock-in to maintain/force a customer base. :-)

Overall DAViCal seems really nice and in my testing has been quite reliable – I’m now moving on to more rigorous testing and am in the process of migrating my calendar and contacts information into it; once I start using it daily in the real world, the true testing begins.

Keen to take a look at what options I have around exposing some information publicly, e.g. sharing free/busy schedule information with friends on different servers.