Per-packet round-robin load balancing for UDP

The requirement was satisfied as follows:

I installed a more recent version of ipvsadm (1.26, along with its kernel modules), one that supports the --ops flag. Since keepalived does not expose this flag in its configuration file, you have to apply it manually. Luckily, you can do that after the "virtual service" is created: in terms of plain ipvsadm, you can first create a virtual service with ipvsadm -A without --ops, and then edit it with ipvsadm -E to enable one-packet scheduling.

Since keepalived creates the virtual service for you, all you have to do is edit it after it is created, which happens when quorum is gained for this virtual server (that is, when a sufficient number of realservers are working). Here's how it looks in the keepalived.conf file:

virtual_server <VIP> <VPORT> {
    lb_algo rr
    lb_kind NAT
    protocol UDP
    ...

    # Enable one-packet scheduling when quorum is gained
    quorum_up "ipvsadm -E -u <VIP>:<VPORT> --ops -s rr"

    ... realserver definitions, etc ...
}

This works, but I've encountered a number of problems (kind of) with this setup:

  1. There is a small time gap (less than a second, closer to a tenth of a second) between quorum going up and the script in quorum_up being executed. Any datagrams that pass through the director during that gap create a connection entry in IPVS, and further datagrams from the same source host / port stay stuck on the same realserver even after the --ops flag is added. You can minimize the chance of this happening by making sure that the virtual service is never deleted once it has been created. To do that, specify the inhibit_on_failure flag in your realserver definitions: when a realserver goes down, it is then not deleted (once all realservers are deleted, the virtual service is deleted too); instead its weight is set to zero, so it stops receiving traffic. As a result, the only time datagrams can slip by is during keepalived startup (assuming at least one realserver is up at that time, so that quorum is gained immediately).
  2. When --ops is active, the director does not rewrite the source host / port of the datagrams that the realservers send to the clients, so clients see the address and port of whichever realserver sent a particular datagram. This might be a problem (it was for my clients). You can fix it by SNAT'ing those datagrams with iptables.
  3. I've noticed significant system CPU usage when the director is under load. It turns out the CPU is hogged by ksoftirqd. This does not happen if you turn off --ops. Presumably, the problem is that the scheduling algorithm runs on every datagram instead of only on the first datagram of a "connection" (if that even applies to UDP). I haven't found a way to fix this, but maybe I haven't tried hard enough. The system has specific load requirements, and under that load the processor usage does not max out and no datagrams are lost, so this problem is not considered a show-stopper. It is still rather alarming, though.
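
Problem 1 may be easier to picture with a toy model. The Python sketch below is not IPVS code; it only mimics the behaviour described above, on the assumption that an existing connection entry is always honoured and that a new entry is stored only while one-packet scheduling is off. The class name ToyDirector, the realserver names and the client addresses are all made up for illustration.

```python
from itertools import cycle


class ToyDirector:
    """Toy model of the behaviour described above; NOT how IPVS is implemented."""

    def __init__(self, realservers):
        self._rr = cycle(realservers)  # plain round-robin over the realservers
        self.connections = {}          # (src_host, src_port) -> realserver
        self.ops = False               # one-packet scheduling is off at first

    def dispatch(self, src):
        # Assumption: an existing connection entry wins, even with ops enabled.
        if src in self.connections:
            return self.connections[src]
        server = next(self._rr)        # the scheduler runs for this datagram
        if not self.ops:
            # Without ops, remember the choice for later datagrams.
            self.connections[src] = server
        return server


director = ToyDirector(["rs1", "rs2"])
early = ("198.51.100.7", 4000)  # hypothetical client sending during the gap
late = ("198.51.100.8", 4000)   # hypothetical client sending after ops is on

director.dispatch(early)         # creates a connection entry pointing at rs1
director.ops = True              # stands in for the quorum_up script running

stuck = [director.dispatch(early) for _ in range(4)]
spread = [director.dispatch(late) for _ in range(4)]
print(stuck)   # ['rs1', 'rs1', 'rs1', 'rs1']
print(spread)  # ['rs2', 'rs1', 'rs2', 'rs1']
```

In this model the early client stays pinned to rs1 for as long as its entry exists, while the late client is rescheduled on every datagram. That per-datagram scheduling is also the suspected cause of the extra CPU usage in problem 3.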

Summary: the setup definitely works (also under load), but given the hoops one has to jump through and the problems I've encountered (especially №3; maybe someone knows the solution?), I would, given the time, have used a different design: a userspace program (written in C, probably) that listens on a UDP socket and distributes the received datagrams between realservers, in conjunction with something that checks the health of the realservers for me, SNAT in iptables to rewrite the source host / port, and keepalived in VRRP mode for HA.
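
For illustration only, here is a minimal sketch of the distributing part of such a userspace program, in Python rather than C to keep it short. The function name forward_round_robin and its max_datagrams parameter are hypothetical. Health checking, the reply path, SNAT and VRRP are all left out, so it shows only the per-datagram round-robin idea and is not a usable balancer.

```python
import itertools
import socket


def forward_round_robin(listen_sock, realservers, max_datagrams=None):
    """Relay each datagram received on listen_sock to the next realserver in turn.

    realservers is a list of (host, port) tuples. max_datagrams is a
    hypothetical knob that makes the loop stop after that many datagrams;
    None means run forever.
    """
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    targets = itertools.cycle(realservers)
    handled = 0
    try:
        while max_datagrams is None or handled < max_datagrams:
            data, _client = listen_sock.recvfrom(65535)
            # Every datagram goes to the next realserver, regardless of sender.
            out.sendto(data, next(targets))
            handled += 1
    finally:
        out.close()
    return handled


# Example use (addresses are placeholders, not taken from the setup above):
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.bind(("0.0.0.0", 5000))
#   forward_round_robin(sock, [("192.0.2.11", 5000), ("192.0.2.12", 5000)])
```

Note that this sketch sends from the director's own socket, so the realservers see the director as the source of every datagram. A real implementation would have to deal with that, and with getting replies back to the clients.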