Things to consider when running public NTP servers
Firstly, good for you; it's a helpful and public-spirited thing to do. That said, and given your clarification that you're planning on creating one or more DMZ VMs which will sync to and make publicly-available the time from your three Meinberg GPS-enabled stratum-1 (internal) servers:
Edit: Virtualisation comes up for discussion on the pool list from time to time; a recent thread was in July 2015, which can be followed starting from this email. Ask Bjørn Hansen, the project lead, posted to the thread and did not speak out against virtualisation. Clearly a number of pool server operators are virtualising right now, so I don't think anyone will shoot you for it, and as one poster makes clear, if your server(s) are unreliable, the pool monitoring system will simply remove them from the pool. KVM seems to be the preferred virtualisation technology; I didn't find anyone specifically using VMware, so I can't comment on how "honest" a virtualisation that is. Perhaps the best summary on the subject said:
My pool servers are virtualized with KVM on my very own KVM hosts. Monitoring says the server is pretty accurate and has provided stable time for the last 2-3 years. But I wouldn't set up a pool server on a leased virtual server from another provider.
This is the daily average number of distinct clients per second I see on my pool server (which is in the UK, European and global zones) over the past year:
This imposes almost no detectable system load (ntpd seems to use between 1% and 2% of a CPU, most of the time). Note that at some point during the year, load briefly peaked at nearly a thousand clients per second (Max: 849.27); I do monitor for excessive load, and the alarms didn't go off, so I can only note that even that level of load, albeit brief, didn't cause problems.
The project-recommended configurations are best practice, and they work for me. I also use iptables to rate-limit clients to two inbound packets in a rolling ten-second window (it's amazing how many rude clients are out there who think they should be free to burst in order to set their own clocks quickly).
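A rule pair along these lines implements that rolling-window limit; this is a sketch using the iptables "recent" match, and the list name NTPLIMIT is just an illustrative label:

```shell
# Drop any source that has already sent 2 NTP packets in the last 10 s;
# otherwise accept the packet and record the source in the NTPLIMIT list.
iptables -A INPUT -p udp --dport 123 \
    -m recent --name NTPLIMIT --rcheck --seconds 10 --hitcount 2 -j DROP
iptables -A INPUT -p udp --dport 123 \
    -m recent --name NTPLIMIT --set -j ACCEPT
```

Using --update instead of --rcheck in the first rule would also refresh the timestamps on every dropped packet, which further penalises clients that keep hammering away.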
Best practice is also not to use the undisciplined local clock, so remove any lines referring to server addresses starting with 127.127.
The best practice guidelines also recommend more than three clocks, so you might want to pick a couple of other public servers, or specific pool servers, in addition to your three stratum-1 servers.
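Put together, the server list in ntp.conf might look something like this; all the hostnames below are placeholders, with the first three standing in for your internal stratum-1 servers:

```
# The three internal GPS-backed stratum-1 servers (placeholder names)
server s1-a.internal.example iburst
server s1-b.internal.example iburst
server s1-c.internal.example iburst

# A couple of extra public sources, so a single bad clock can be outvoted
server ntp-one.example.org iburst
server ntp-two.example.org iburst

# Deliberately no "server 127.127.1.0" (undisciplined local clock) line
```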
I'd also note that if you're planning to put both of these VMs on the same host hardware, you should probably just run one, but double the bandwidth declared to the pool (i.e., accept twice as many queries as you otherwise would).
Firstly, congrats on an NTP question that is non-facepalm material. :-) I've included some graphs at the bottom of this post to give you a feel for things. The VM in question is set to 100 Mbps in the pool control panel, and is in the UK, Europe, and global pools.
I think MadHatter covered this well - virtualisation should be fine. Like you say, if they're feeding from your GPS-connected stratum 1s, they should be reasonably solid. In my experience, VMs tend to be a little more jumpy than bare metal in terms of frequency (see graph below), but that's what you'd expect - they're dealing with a clock emulation layer (hopefully pretty efficient) and potentially noisy neighbours. If you'd rather not see that sort of jumpiness, maybe use older servers or unused desktops as your DMZ stratum 2s instead.
This VM is 1 core, 2 GB RAM, running Ubuntu 16.04 LTS, virtualised in OpenStack (KVM hypervisor). As you can see, the RAM is a little over the top.
The recommended settings - including not having the local driver configured - are the default in Ubuntu 16.04. I'm running very close to the stock configuration, other than the peer list.
I'd probably start the declared bandwidth on the low side and ramp it up after you've monitored things for a bit. If your VMs are all near each other and near your stratum 1s in terms of network latency, I'd probably have all the VMs talking to all the stratum 1s, and probably peer them with each other and turn on orphan mode as well.
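As a sketch, that peering-plus-orphan arrangement might look like this in each VM's ntp.conf; the hostnames are placeholders, and stratum 10 is just a conventional orphan stratum chosen to sit well below any real source:

```
# Peer with the other DMZ VMs so they can agree amongst themselves
peer dmz-ntp-2.example.com
peer dmz-ntp-3.example.com

# If every upstream source is lost, serve orphan-mode time at stratum 10
tos orphan 10
```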
Here are the graphs - they all cover the same period of roughly 3 weeks, except for the network one, which had a couple of spikes due to backups; with the spikes in frame I couldn't even see the normal NTP traffic, so I zoomed in a little to show the usual background level.
[Graphs: CPU, Memory, Network, Frequency, System Offset]
Some things to consider with NTP
There are already some good answers here; I am just adding a few thoughts, for completeness' sake, based on my own experience.
If the bare-metal vs. VM question is a concern, I would suggest enabling NTP logging and graphing clock skew and corrections on both. I don't believe this can be generalised easily, as hardware and configuration vary between implementations, so it might be best to get your own numbers.
I have always suggested picking server or network-device roles with fairly constant CPU load, avoiding tickless kernels and anything with power-saving modes enabled. Especially avoid daemons like cpuspeed, frequency governors, and advanced power saving on NTP servers, even if they are only stratum 2 in your farm. Some stability can be gained by never going deeper than C-state 1, but your power consumption will increase.
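On a typical Linux box that might translate into something like the following; this is a sketch, cpupower may need installing separately, and the intel_idle parameter only applies to Intel CPUs:

```shell
# Pin the CPU frequency governor to "performance" rather than a scaling one
cpupower frequency-set -g performance

# Limit C-states at boot by adding these to the kernel command line
# (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub, then run update-grub):
#   intel_idle.max_cstate=1 processor.max_cstate=1
```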
I also try to ensure that folks pick a handful of stratum-1 servers that are under 40 ms from the edge of their network, then divide them up across the edge NTP servers, making sure that no two servers behind the same SNAT are talking to the same stratum-1 server. Along the same lines as burst, it is unwise to have multiple servers behind the same SNAT using the same upstream servers, as it will appear to the upstream that you have enabled burst even when you have not.
You should always honor kod (kiss-o'-death) packets from upstream servers, and have monitoring tools checking the time offsets and reachability of those upstream servers.
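A quick manual check of both offset and reachability is available from ntpq; the columns described in the comment are standard, though the values you see will obviously differ:

```shell
# List peers: "reach" is an octal shift register of the last 8 polls
# (377 means all eight answered); "offset" and "jitter" are in milliseconds.
ntpq -pn
```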
You may want to consider having your own accurate time sources in a few of your datacenters to peer with, or fall back on, in the unlikely case that GPS Selective Availability is re-enabled by the military. There are cost-effective appliances built specifically for this. Even if you are in a "cage" environment and don't have your own datacenter, some hosting facilities may accommodate this.
See the VMware timekeeping document at http://www.vmware.com/pdf/vmware_timekeeping.pdf
Running an NTP daemon in a VM is probably not a good idea, particularly if you need reliable time.