NTP in a virtualised world

Let me start this off by saying that I love NTP. The whole way the protocol has been designed is truly elegant and it is such an important protocol that is often neglected that I thought I'd put together a blog article on how I configure NTP and why.

Before that, it's important to go through a few things about how NTP works, if you're familiar with NTP, feel free to skip to the next bit.
Basics of NTP

The first thing to note is that NTP relays time in UTC format
If you think about it, NTP has to be ignorant of timezones. It's whole job is to keep accurate time and timezones will just upset that as there are so many of them. Better to just keep to something like UTC internally and have the OS deal with the timezone.
One question this always generates is "What happens if I point my UK server at a US time zone source?" - Because NTP doesn't care about timezones, those NTP servers in the US will have the exact same time as those based in the UK and across the rest of the world. It's up to the operating system to sort out the time zone so yes, pointing NTP at servers in another country is fine and it's not going to force all your machines into the time zone of the country where the NTP server is! 
Another thing to realise is that NTP is hierarchical. Each time server in the chain is said to be at a particular stratum level. Stratum 0 is an atomic clock. Stratum 1 and 2 would be NTP servers around the world that you can connect to. Your internal time source (if you use one) would be stratum 3 or more likely, stratum 4. There is a very good explanation of it all here https://ntpserver.wordpress.com/2008/09/10/ntp-server-stratum-levels-explained/
It's also key to note that NTP takes into account the latency involved in contacting an NTP server. This means that even connecting to NTP servers around the world you should find that your time is still within 500ms of reference time and even accurate down to less than 100ms.

It's also always a good idea to provide multiple external NTP servers, in testing, I've found that three are optimum as three allows for one to ruled invalid by NTP cross checking the three servers and it allows for NTP to use some clever math to offset both the latency of all three and to average out the time received from all three to ensure that the time you're getting is as close to reference time as it can be. (see, I said that NTP was elegant!). In testing, it was often possible to get time on the server to within 10ms around 90% of the time and within 100ms 100% of the time.
Now that I've gone through some of the aspects of how NTP works, another key question is "Does this apply in a virtualised world"? The answer is Yes, but with a few caveats to watch out for.

NTP issues to watch out for:

1. Circular time referencing.

The first issue is to watch for is around circular time. This occurs when a source of time is set up as a virtual machine and is pulling it's time from the host which in turn pulls it's time from the VM. At this point the whole hierarchy breaks time as the hypervisor is both a recipient and a giver of time. This is something that needs to be avoided as it'll cause no end of issues as there is no way to correct for clock drift as neither server is authoritative in terms of the NTP hierarchy.

This is why it is vitally important to turn OFF the ability to pull time from the host on ANY server that participates in any sort of NTP hierarchy. This sort of issue is most often seen when the VM is running as the active directory PDC emulator. In that case it's always best that BOTH the hypervisor and the DC's pull time from an external source such as the pool NTP servers.

2. Invalid time in ESXi

The screenshot above is a very important one, it shows that NTP is configured and running but that for some reason the time on the host is WRONG which is why it's highlighted in red. This can often be because it's not possible for the host to connect to the NTP server, this is commonly see when external NTP is used and it's blocked on the firewall.Why does accurate time matter on the host if VMWare tools/integration tools are turned off?

Even if such tools turned off, the VM has to ask the hypervisor for time under two special conditions, the first is when the VM is powered on. A VM doesn't have a CMOS, it has absolutely no idea what the time is when it's first powered on so the only place it can get it from is the host. If the time on the host is wrong then you've already got a problem and the VM isn't even started yet! This is key when you're dealing with Active Directory which, by default, needs time to be within 5 minutes of UTC across the domain (AD also ignores timezones internally), if isn't then you're going to have authentication issues.

The second case is during VMotion/Live migration tasks. For a fraction of a second, that VM does not exist on the old host or the new host. When the host your transferring it to takes on responsibility for the VM, the VM has to ask the host for the time. Again, if it's wrong then you have issues.

Those are the two major misconfigurations/issues I've seen the most in virtualised environments with the circular time setup being the most common. Getting NTP right even in a small network is key to avoiding strange authentication issues and other problems.

Author image
Maidstone
IT Person | Veeam Vanguard | VMware vExpert | Windows admin | Docker fan | Spiceworks moderator | keeper of 3 cats | Avid Tea fan