I recently shifted from having two huge server racks down to having a single speedy home server running KVM virtual machines, with the intent of packaging all my servers – experimental, development, staging, etc, into a single reliable system which will reduce power and maintenance costs.
As part of this change, I went from having dedicated DHCP & DNS servers to having everything located on the KVM host.
The design I’ve used has the host OS running with minimal services – the host just runs KVM, OpenVPN, DHCP and a DNS caching nameserver – all other services run as guest VMs, with a virtual network for the guests and host to communicate over.
Guests run as DHCP clients and get their addressing information from the host OS – this makes it easy to assign or adjust addressing if needed.
However this does mean you can’t get away with hammering the host too badly – for example, running an I/O and network intensive backup can cause some interesting problems when you also need the host for services, such as DHCP.
Take a look at the following log messages from a mostly idle VM – these were taken whilst another VM on the server was running a bonnie++ process to test performance:
Mar 6 10:18:06 virtguest dhclient: 5 bad udp checksums in 5 packets
Mar 6 10:18:27 virtguest dhclient: DHCPREQUEST on eth0 to 10.8.12.1 port 67
Mar 6 10:18:45 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar 6 10:19:00 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar 6 10:19:07 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar 6 10:19:15 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar 6 10:19:15 virtguest dhclient: 5 bad udp checksums in 5 packets
That’s some messed up stuff – what you’re seeing is that the guest VM is trying to renew its DHCP address with the host server – but the host is so sluggish from having to run the I/O intensive virtual machine that it is actually corrupting or dropping the UDP packets, preventing the guest VM from renewing its address.
This of course raises the most important question: What happens if the guest can’t renew its IP address?
In this case, the Linux/CentOS 5 guest VM actually completely lost its IP address after a long period of DHCPREQUEST attempts, fell off the network entirely and caused my phone to go nuts with Nagios alerts.
Now of course in any sane production environment, nobody would be running bonnie++ processes on a VM on an active server – however there are some pretty key points still made here:
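One rough way to spot this condition before the lease actually expires is to watch the guest's syslog for a run of DHCPREQUEST lines with no reply. The following is only a sketch: the function name is made up for illustration, the sample lines are the ones quoted above, and it assumes dhclient logs a line containing "DHCPACK" when a renewal succeeds.

```python
import re

# Matches the dhclient request lines quoted in the log excerpt above.
REQUEST_RE = re.compile(r"dhclient: DHCPREQUEST on (\S+) to (\S+) port 67")


def unanswered_requests(lines):
    """Return the longest run of DHCPREQUEST lines with no DHCPACK in between.

    Hypothetical helper: assumes a successful renewal is logged with the
    text "DHCPACK" somewhere in the line.
    """
    longest = current = 0
    for line in lines:
        if "DHCPACK" in line:
            current = 0  # the server answered, so the run is over
        elif REQUEST_RE.search(line):
            current += 1
            longest = max(longest, current)
    return longest


SAMPLE = [
    "Mar 6 10:18:06 virtguest dhclient: 5 bad udp checksums in 5 packets",
    "Mar 6 10:18:27 virtguest dhclient: DHCPREQUEST on eth0 to 10.8.12.1 port 67",
    "Mar 6 10:18:45 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
    "Mar 6 10:19:00 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
    "Mar 6 10:19:07 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
    "Mar 6 10:19:15 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
    "Mar 6 10:19:15 virtguest dhclient: 5 bad udp checksums in 5 packets",
]

if __name__ == "__main__":
    print(unanswered_requests(SAMPLE))  # 5 for the excerpt above
```

Something like this could feed a Nagios check, so the alert fires while the guest is still retrying rather than after it has already fallen off the network.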
- The isolation is a lie: Guests are only *somewhat* isolated from one another – one guest can still mess with another and effectively denial-of-service attack the other VMs by utilising all the available resources.
- Guests can be jerks: Organisations running KVM (or some other systems) with untrusted guest VMs should carefully consider how they are going to monitor and protect the service from users running crazily resource intensive processes (after all, there will be someone who wants to bonnie++ test their new VM simply for the lols).
- cgroups to the rescue? Linux cgroups do have an I/O controller (blkio-cgroup), but whilst this controls read/write flow, it won’t restrict seeks, which can also badly impact spinning rust based servers.
- WTF DHCP? The approach of the guests simply dropping their DHCP address after losing contact with the DHCP server is a pretty bad design limitation – if the DHCP server is unreachable, the guest should keep the original address (of course if the “physical” ethernet connection dropped, that would be a different situation, and it should drop its address to match).
- Also: I wonder what OSes/distributions have the above behavior?
I’m currently running a number of bonnie++ tests on my KVM server and will have a blog post in the near future covering the findings in more detail. I’m also planning to look into cgroups and other resource control and limiting functions, and will report back on how these fare when you have guest VMs running heavy processes.
Overall it made my weekend of geekery that bit more exciting. :-D
> WTF DHCP?
Yes, DHCP has its drawbacks. But this is what the three timers renew/rebind/expire are for; look into the dhclient leases file for the current state. And set the default-lease-time and max-lease-time options in the DHCP server config to suitably large values.
Unfortunately, there’s no way to trigger a lease revocation from the DHCP server side, so you either need to trigger it via some remote shell access or via an “ethernet” disconnect/connect action, in case you want to renumber things.
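For reference, the relationship between the three timers can be sketched as below. This assumes the RFC 2131 defaults (renew at 50% of the lease, rebind at 87.5%) and ignores the small random fuzz clients may add; a server can also send explicit renewal and rebinding times that override these defaults. The function name is purely illustrative.

```python
def lease_timers(lease_seconds):
    """Return default renew/rebind/expire offsets (seconds after lease start).

    Assumes the RFC 2131 defaults: T1 (renew) = 0.5 * lease,
    T2 (rebind) = 0.875 * lease. A server may send explicit values instead.
    """
    return {
        "renew": lease_seconds // 2,        # unicast DHCPREQUEST to the leasing server
        "rebind": lease_seconds * 7 // 8,   # broadcast DHCPREQUEST to any server
        "expire": lease_seconds,            # address must be given up
    }


if __name__ == "__main__":
    timers = lease_timers(86400)  # a one-day lease, chosen as an example
    print(timers)
    # Time the client keeps retrying before the address is lost:
    print(timers["expire"] - timers["renew"])
```

This matches the log in the post: the first DHCPREQUEST goes to 10.8.12.1 (renewing), and the later ones go to 255.255.255.255 (rebinding). A longer lease widens the gap between renew and expire, which is how long the DHCP server can stay unreachable before the guest loses its address.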
Hmm, whilst I could boost up the lease times, it would still eventually have the same issue of not being able to renew its lease if the server has a period of high load for a long time; it would just take longer to occur.
I’d have to dig into the DHCP standards around IP expiry, whether a client must drop its IP or whether it’s legitimate to retain that IP if a connection cannot be established to a DHCP server.
Hi Jethro,
What you are seeing is an interaction between KVM virtio not doing UDP checksums (and not zeroing the field) and ISC dhcp (imo correctly) discarding the packets with random values as a checksum. Switch the instance running your dhcp server from the virtio NIC to the e1000 NIC and the problem goes away. (And yes, this sucks…)
Regards,
Martijn Lievaart
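To illustrate why a leftover value in the checksum field gets the packet discarded, here is a minimal sketch of the standard UDP-over-IPv4 checksum (ones' complement sum over a pseudo-header plus the UDP segment). It is not taken from ISC dhcp or the kernel; the function names, the client address 10.8.12.50 and the payload are made up for the example, while 10.8.12.1 is the server address from the log above.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum, as used by the Internet checksum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length))


def fill_checksum(src_ip: str, dst_ip: str, segment: bytes) -> bytes:
    """Return the UDP segment with a freshly computed checksum field."""
    zeroed = segment[:6] + b"\x00\x00" + segment[8:]
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(zeroed)) + zeroed)
    csum = (~total & 0xFFFF) or 0xFFFF  # 0 means "no checksum", so send 0xFFFF
    return zeroed[:6] + struct.pack("!H", csum) + zeroed[8:]


def checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """True if the receiver would accept the segment's checksum."""
    (stored,) = struct.unpack("!H", segment[6:8])
    if stored == 0:  # sender chose not to use a checksum (allowed on IPv4)
        return True
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    return total == 0xFFFF


if __name__ == "__main__":
    payload = b"dhcp-offer"  # placeholder payload, not a real DHCP message
    # Server port 67 to client port 68, checksum field holding leftover junk.
    stale = struct.pack("!HHHH", 67, 68, 8 + len(payload), 0xBEEF) + payload
    print(checksum_ok("10.8.12.1", "10.8.12.50", stale))
    print(checksum_ok("10.8.12.1", "10.8.12.50",
                      fill_checksum("10.8.12.1", "10.8.12.50", stale)))
```

A segment whose checksum field is neither zero nor the correct value fails the check, which fits the "bad udp checksums" lines dhclient logged; recomputing the field before the packet reaches the client makes it pass.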
Hi Jethro,
I’m replying to this old entry because I just had a similar issue on a Debian 8 VM, which requested a new IP nearly every second.
My first idea was to deactivate the UDP checksum check on the client, but this would have to be done on every client that runs into this problem.
In the end I implemented a simple iptables rule on my Ubuntu isc-dhcp-server, which has to be loaded at boot:
iptables -A POSTROUTING -t mangle -p udp --dport bootpc -j CHECKSUM --checksum-fill
Cheers, Ralf