Time synchronization in a heterogeneous environment

[EDIT] A major rewrite with references as I just jotted down the old answer from memory.

Short answer: no. It is not possible to get near-millisecond accuracy from a run-of-the-mill operating system on an x86/x64 platform today.

DISCLAIMER This is a layman's answer, as I am an ordinary sysadmin with an ordinary sysadmin's view of computers. A professional level of knowledge of timekeeping is likely found among some kernel developers and hardware architects.

Long answer:

One has to start somewhere. I'll do this top down, starting with applications and moving down towards the oscillator(s).

The first problem is not keeping time on one computer, but getting the environment as a whole to agree on whatever timekeeping you have. Which timekeeping? It turns out there are a couple of ways to keep time in a computer of today. The one we see the most of is the system time (as displayed in a corner of the screen). Let's start by pretending it's that simple and complicate things a couple of paragraphs down.

We want the system time to be correct and we want it to be uniform across all of our computers. We need a way to communicate it from a trusted source at a level granular enough to meet our requirements, whatever they may be.

Let's turn our requirement into a tolerance of 1 ms; that is, our time may deviate by at most 1 ms within our environment or we miss a critical goal. Let's get concrete and look at what Microsoft can do for us.
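
As a concrete (and heavily simplified) illustration of what we are trying to keep within 1 ms, here is a minimal Python sketch of a single SNTP offset check. The server name pool.ntp.org is just a placeholder, and a real deployment would use a proper NTP client rather than one-shot samples:

```python
import socket
import struct
import time

NTP_SERVER = "pool.ntp.org"    # placeholder: any reachable NTP server
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)

def ntp_timestamp(data, offset):
    """Decode a 64-bit NTP timestamp starting at byte `offset` into Unix seconds."""
    seconds, fraction = struct.unpack("!II", data[offset:offset + 8])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32

def clock_offset(server=NTP_SERVER, timeout=2.0):
    """Estimate the local clock's offset from `server` in seconds (SNTP, one sample)."""
    request = b"\x1b" + 47 * b"\0"   # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()                      # client transmit time
        sock.sendto(request, (server, 123))
        data, _ = sock.recvfrom(512)
        t4 = time.time()                      # client receive time
    t2 = ntp_timestamp(data, 32)              # server receive time
    t3 = ntp_timestamp(data, 40)              # server transmit time
    return ((t2 - t1) + (t3 - t4)) / 2        # standard NTP offset estimate

if __name__ == "__main__":
    print(f"Estimated offset vs {NTP_SERVER}: {clock_offset() * 1000:+.3f} ms")
```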

Excluding obsolete versions such as NT, Windows natively runs its timekeeping on either simplified NTP (domain-joined computers beginning with XP/2003) or simplified SNTP (non-domain-joined computers beginning with Win2k) - thanks to @Ryan for nitpicking this detail. Microsoft set two goals when making the timekeeping implementation, neither of which includes our desired level of accuracy:

"We do not guarantee and we do not support the accuracy of the W32Time service between nodes on a network. The W32Time service is not a full-featured NTP solution that meets time-sensitive application needs. The W32Time service is primarily designed to do the following:

  • Make the Kerberos version 5 authentication protocol work.
  • Provide loose sync time for client computers.

The W32Time service cannot reliably maintain sync time to the range of one to two seconds. Such tolerances are outside the design specification of the W32Time service."

OK. Assuming we are running our service stack on more than one computer and have a timekeeping tolerance approaching 1 ms for event correlation, that is quite a letdown. If the service stack includes two computers, we actually can't use Windows native timekeeping at all. But while we're at it, let's underscore a key point or two about Windows native timekeeping, and include some thorough documentation:

If you have an AD, observe that the time in a given domain will be synchronized from the PDC Emulator role, whichever DC holds it. Bringing correct time into the domain thus needs to happen via the Domain Controller running the PDC Emulator role. In a multi-domain forest this translates to the PDC Emulator of the forest root domain. From there, time is dispersed primarily to the PDC Emulators of the subdomains and to each domain member in a fan-out fashion (with some caveats). This process is documented here. Even more in-depth information here.
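
As an aside, if you just want to eyeball how far a domain member sits from a given time source, Windows ships w32tm with a stripchart mode. A small, hedged Python wrapper might look like the sketch below; the peer name time.example.internal is a placeholder, and w32tm.exe is assumed to be on the PATH:

```python
import subprocess

# Placeholder peer; point it at your PDC Emulator or upstream NTP source.
PEER = "time.example.internal"

def stripchart(peer=PEER, samples=5):
    """Ask w32tm to print the observed offset against `peer` a few times."""
    cmd = ["w32tm", "/stripchart", f"/computer:{peer}",
           f"/samples:{samples}", "/dataonly"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    print(stripchart())
```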

OK. What can we do?

To begin with, we need one or another more precise way to synchronize time throughout the environment. Assuming we can't run Linux ntpd or ntpd for Windows, you could take a look at a shareware client called Tardis, but there are likely many more out there to try.

We ran Tardis on a Win2k3 server acting as PDC Emulator which had a CMOS clock with a really large skew; for inexplicable historical reasons we had no choice but to synchronize the entire network from it. It has now been replaced, to great joy, with a dedicated Linux ntpd bringing in time from atomic clocks on the outside, but Tardis saved us admirably then and there. I don't know, however, whether it could help you achieve better precision than Windows native timekeeping.

But let's assume, from this point on, that we have figured out how to implement a perfect substitute network time synchronization. Through its inherent craftiness it can hold tolerances below one millisecond. We have put it in place so as to respect how our AD expects time to spread through the network.

Does this mean that we can get accurate diagnostics out of operating systems and microservices at a granularity approaching single milliseconds?

Let's look at how operating systems on the x86/x64 architecture schedule processor time.

They use interrupts, which are multifaceted beasts rich in archaeological substance. However, the operating system is not alone in its desire to interrupt. The hardware wishes to interrupt too, and it has the means to do it! (Hello, keyboard.) And operating systems play along.

This is where it gets complicated, and I will solve this by oversimplifying. Questions? I duck, cover and point you to an absolutely excellent treatise on the subject. (If you're hunting milliseconds on a Windows platform you really should read it.) An updated version for Win8.1/Win2012r2 is reportedly in the works, but no release date has yet surfaced.

OK, interrupts. Whenever something should happen in an OS, an interrupt triggers the action which follows. The action is a bunch of instructions fetched from the kernel, which can be executed in a whole lot of different manners. The bottom line is that despite the interrupt happening at a time which can be determined with more or less accuracy depending on hardware architecture and kernel interrupt handling, the exact time at which the subsequent parts of the execution happen generally cannot. A specific set of instructions may be executed early after the interrupt or late, it may be executed in a predictable sequence or not, and it may fall victim to buggy hardware or poorly written drivers introducing latencies that are hard to even recognize. Most of the time one simply doesn't know. The millisecond-level timestamp that shows up in the subsequent log file is very precise, but is it accurate as to when the event actually happened?
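
To see that precision-versus-accuracy distinction for yourself, here is a small Python sketch (plain Python 3, nothing OS-specific). It prints what the wall-clock source claims about its resolution and the smallest steps it is actually observed to take; neither number tells you how accurate the resulting timestamps are:

```python
import time

# What the wall-clock source claims about itself (implementation, resolution, ...)
print(time.get_clock_info("time"))

# Observe the smallest steps the wall clock actually takes between successive reads.
# A timestamp with many decimals is precise; that alone says nothing about whether
# it is accurate as to when the event really happened.
steps = []
last = time.time()
while len(steps) < 5:
    now = time.time()
    if now != last:
        steps.append(now - last)
        last = now
print("observed wall-clock steps:", [f"{s * 1e6:.1f} us" for s in steps])
```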

Let's stop briefly at the timekeeping interrupt. An interrupt comes with a priority level; the lowest level is where user applications (such as a standard service) get their processor time. The other (higher) levels are reserved for hardware and for kernel work. If an interrupt at a level above the lowest arrives, the system will pretend any lower-priority interrupts also in the queue don't exist (until the higher-priority interrupts have been taken care of). In this way the ordinary applications and services running are last in line for processor time. By contrast, the clock interrupt is given almost the highest priority; the updating of time will just about always get done in a system. This is an almost criminal oversimplification of how it all works, but it serves the purpose of this answer.

Updating time actually consists of two tasks:

  • Updating the system time / AKA the wall clock / AKA what I say when someone asks me what time it is / AKA the thing ntp fiddles a bit back and forth relative to nearby systems.

  • Updating the tick count, used for instance when measuring durations in code execution.

But whether it is wall time or tick count, where does the system get the time from? It depends greatly on the hardware architecture. Somewhere in the hardware one or several oscillators are ticking, and that ticking is brought via one of several possible paths into the interface the kernel uses to update its wall time and tick count with greater or lesser precision and accuracy.
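
To make the wall clock versus tick count distinction concrete, here is a minimal Python sketch: time.time() is the steppable wall clock that ntp fiddles with, while time.monotonic() behaves like the tick count and is the right tool for measuring durations:

```python
import time

# Wall clock: what the system thinks the time of day is. NTP (or an admin)
# may step or slew it underneath you, so the delta between two readings
# can in principle even be negative.
wall_start = time.time()

# Tick-count-style monotonic clock: only ever moves forward, so it is the
# right tool for measuring durations in code, but it knows nothing about
# the time of day.
mono_start = time.monotonic()

time.sleep(0.5)

print(f"wall-clock delta: {time.time() - wall_start:.6f} s")
print(f"monotonic delta:  {time.monotonic() - mono_start:.6f} s")
```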

There are several design models for oscillator placement in a multicore system; the major differentiator seems to be synchronous versus asynchronous placement. These, together with their respective challenges to accurate timekeeping, are described here, for instance.

In short, synchronous timekeeping has one reference clock per multicore package, which gets its signal distributed to all cores. Asynchronous timekeeping has one oscillator per core. It is worth noting that the latest Intel multicore processors (Haswell) use a form of synchronous design built on a serial bus called "QuickPath Interconnect" with "Forwarded Clocking", ref. datasheet. The Forwarded Clocking is described here in terms that a layman (me) can get a quick superficial grasp of.

OK, so with all that nerderism out of the way (which served to show that timekeeping is a complex practical task with a lot of living history about it), let's look even closer at interrupt handling.

Operating systems handle interrupts using one of two distinct strategies: ticking or tickless. Your systems use one or the other, but what do the terms mean?

Ticking kernels fire the timer interrupt at fixed intervals. The OS cannot measure time at a finer resolution than the tick interval. Even then, the actual processing involved in performing one or several actions may well contain a delay greater than the tick interval. Consider for instance distributed systems (such as microservices) where delays inherent in inter-service calls can consume a relatively large amount of time. Yet every set of instructions will be associated with one or several interrupts, measured by the OS at a resolution no finer than the kernel tick time. The tick time has a base value but can, at least in Windows, be decreased on demand by an individual application. This is an action associated not only with benefits but also with costs, and carries quite a bit of fine print with it.
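
To get a feel for what the base tick value and the "on demand" decrease mean in practice, here is a hedged Python sketch for Windows. winmm.timeBeginPeriod is the real (global, power-hungry) knob behind that fine print; on non-Windows systems, and on recent Windows/Python combinations that already use high-resolution timers, the two measurements may look identical:

```python
import ctypes
import statistics
import sys
import time

def median_sleep_ms(requested_ms=1, rounds=50):
    """Median observed duration of a nominal `requested_ms`-millisecond sleep."""
    observed = []
    for _ in range(rounds):
        start = time.perf_counter()
        time.sleep(requested_ms / 1000)
        observed.append((time.perf_counter() - start) * 1000)
    return statistics.median(observed)

if __name__ == "__main__":
    print(f"median sleep(1 ms) at default resolution: {median_sleep_ms():.2f} ms")
    if sys.platform == "win32":
        # winmm.timeBeginPeriod asks Windows for a finer global timer resolution.
        # This is the "on demand" decrease mentioned above: it is system-wide,
        # costs power, and must be undone with a matching timeEndPeriod call.
        winmm = ctypes.WinDLL("winmm")
        winmm.timeBeginPeriod(1)
        try:
            print(f"median sleep(1 ms) at 1 ms resolution:    {median_sleep_ms():.2f} ms")
        finally:
            winmm.timeEndPeriod(1)
```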

So-called tickless kernels (which have a very non-descriptive name) are a relatively new invention. A tickless kernel sets the tick time at variable intervals (as far into the future as possible). The reason is to let the OS dynamically allow processor cores to go into various levels of sleep for as long as possible, with the simple purpose of conserving power. "Various levels" include processing instructions at full speed, processing at decreased rates (i.e. slower processor speed) or not processing at all. Different cores are allowed to operate at different rates, and the tickless kernel tries to let processors be as inactive as possible, even if that means queueing up instructions to fire them off in interrupt batches. In short, different cores in a multiprocessor system are allowed to drift in time relative to each other. This of course plays havoc with good timekeeping, and is so far an unsolved problem with newer power-saving processor architectures and the tickless kernels which allow them to do efficient power saving. Compare this with a ticking kernel (static tick interval) which continually wakes all processor cores up, regardless of whether they have actual work to do, and where timekeeping carries a degree of inaccuracy, but a relatively dependable one compared to tickless kernels.

The standard Windows tick time - that is, the system resolution - is 15.6 ms, up until Windows 8/2012 where the default behaviour is tickless (but revertible to a ticking kernel). The Linux default tick time I believe depends on the kernel compilation, but this niche is well outside my experience (and this one too) so you may wish to double-check if you depend on it. Linux kernels, I believe, can be compiled tickless from 2.6.21 onwards and may be compiled with various flags optimizing the tickless behaviour (of which I only recall a few variants of no_hz).
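
If you do depend on it, one quick (and best-effort) way to check what your particular Linux kernel was built with is to look for the HZ and NO_HZ options in the kernel config. The sketch below assumes either /proc/config.gz or /boot/config-<release> exists, which is distro-dependent:

```python
import gzip
import os
import platform

def kernel_timer_config():
    """Best-effort peek at the running Linux kernel's tick configuration.

    Looks for CONFIG_HZ* and CONFIG_NO_HZ* entries in /proc/config.gz or
    /boot/config-<release>; neither file is guaranteed to exist.
    """
    release = platform.release()
    candidates = ["/proc/config.gz", f"/boot/config-{release}"]
    for path in candidates:
        if not os.path.exists(path):
            continue
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as config:
            return sorted(line.strip() for line in config
                          if line.startswith(("CONFIG_HZ", "CONFIG_NO_HZ")))
    return ["no kernel config file found"]

if __name__ == "__main__":
    print("\n".join(kernel_timer_config()))
```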

So much for bare-metal systems. In virtual systems it gets worse, as VM and hypervisor contention in different ways make accurate timekeeping extremely difficult. Here is an overview for VMware and here is one for RHEL KVM. The same holds true for distributed systems. Cloud systems are even more difficult, as we do not get even close to seeing the actual hypervisors and hardware.

To conclude, getting accurate time out of a system is a multilayered problem. Going bottom up from a high-level point of view, we have to solve:

  • internal time synchronization between the hardware and the kernel,
  • interrupt processing and the delays it adds before the instructions we want to timestamp actually execute,
  • if in a virtual environment, inaccuracies due to the encapsulation of a second OS layer,
  • the synchronization of time between distributed systems.

Therefore, at this point in the history of computing, we will not get millisecond-level accuracy out of an x86/x64 architecture, at least not using any of the run-of-the-mill operating systems.

But how close can we get? I don't know, and it ought to vary greatly between different systems. Getting a grip on the inaccuracy in one's own specific systems is a daunting task. One need only look at how Intel suggests code benchmarking should be done to see that ordinary systems, such as the ones I happen to find myself administering, are very much out of control in this respect.

I don't even contemplate achieving "All power optimization, Intel Hyper-Threading technology, frequency scaling and turbo mode functionalities were turned off" in critical systems, much less tinkering with code wrappers in C and running long-term tests to get subsequent answers. I just try to keep them alive and learn as much as I can about them without disturbing them too much. Thank you, timestamp, I know I can't trust you fully, but I do know you're not too many seconds off. When actual millisecond accuracy does become important, one measurement is not enough; a greater number of measurements is needed to verify the pattern. What else can we do?
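
As a trivial illustration of why a single measurement is not enough, the following Python sketch timestamps back-to-back reads of the wall clock many times and prints the spread; one sample would hide the tail of that distribution entirely:

```python
import statistics
import time

def sample_timestamp_jitter(n=10_000):
    """Timestamp the 'same' instant twice, many times, and look at the spread.

    A single log timestamp hides this distribution; the spread between two
    back-to-back reads gives a crude lower bound on how little the timestamp
    alone can tell you about when an event really happened.
    """
    deltas = []
    for _ in range(n):
        a = time.time()
        b = time.time()
        deltas.append((b - a) * 1e6)  # microseconds
    deltas.sort()
    return {
        "min_us": deltas[0],
        "median_us": statistics.median(deltas),
        "p99_us": deltas[int(n * 0.99)],
        "max_us": deltas[-1],
    }

if __name__ == "__main__":
    for name, value in sample_timestamp_jitter().items():
        print(f"{name:>10}: {value:.2f}")
```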

Lastly, it is interesting to look at how the realtime-OS people think about interrupt latency. There is also a very exciting time-sync alternative in the works, where quite a bit of interesting statistics, methodology and whitepapers are made public. Add future hardware architecture and kernel developments to that, and in a few years this timekeeping accuracy thing may no longer be such a problem. One may hope.