Geoff Huston, APNIC
The tabloid press is never lost for a good headline, but in July 2012 this one in particular caught my eye: "Global Chaos as Moment in Time Kills the Interwebs."  I am pretty sure that "global chaos" is somewhat "over the top," but a problem did happen on July 1 this year, and yes, it affected the Internet in various ways, as well as affecting many other enterprises that rely on IT systems. And yes, the problem had a lot to do with time and how we measure it. In this article I will examine the cause of this problem in a little more detail.
What Is a Second?
I would like to start with a rather innocent question: What exactly is a second? Obviously it is a unit of time, but what defines a second? Well, there are 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day. That information would imply that a "second" is 1/86,400 of a day, or 1/86,400 of the length of time it takes for the Earth to rotate about its own axis. Yes?
Almost, but this definition is still a little imprecise. What is the frame of reference that defines a unit of rotation of the Earth? As physicists discovered a century ago when attempting to establish a frame of reference for the measurement of the speed of light, these frame-of-reference questions can be quite tricky!
What is the frame of reference to calibrate the Earth's rotation about its own axis? A set of distant stars? The Sun? These days we use the Sun, a choice that seems logical in the first instance. But cosmology is far from perfect: far from being a stable measurement, the length of time it takes for the Earth to rotate once about its axis relative to the Sun varies month by month by up to some 30 seconds from its mean value. This variation in the Earth's rotational period is an outcome of both the Earth's elliptical orbit around the Sun and the Earth's axial tilt. These variations mean that at the time of the March equinox the Solar Day is some 18 seconds shorter than the mean, at the June solstice it is some 13 seconds longer, at the September equinox it is some 21 seconds shorter, and in December it is some 29 seconds longer.
This variation in the rotational period of the Earth is unhelpful if you are looking for a stable way to measure time. To keep this unit of time at a constant value, the definition of a second is instead based on an idealized version of the Earth's rotational period: we have chosen to base the unit of measurement of time on Mean Solar Time, the average time for the Earth to rotate about its own axis relative to the Sun.
This value is relatively constant, because the variations in solar time largely cancel each other out over the course of a full year. So a second is defined as 1/86,400 of mean solar time, or in other words 1/86,400 of the average time it takes for the Earth to rotate on its axis. And how do we measure this mean solar time? Well, in our search for precision and accuracy the measurement of mean solar time is not, in fact, based on measurements of the Sun, but is instead derived from baseline interferometry of numerous distant radio sources. However, the measurement still reflects the average duration of the Earth's rotation about its own axis relative to the Sun.
So now we have a second as a unit of the measurement of time, based on the Earth's rotation about its own axis, and this definition allows us not only to construct a uniform time system to measure intervals of time, but also to all agree on a uniform value of absolute time. From this analysis we can make calendars that are not only "stable," in that the calendar does not drift forward or backward in time from year to year, but also accurate in that we can agree on absolute time down to units of minute fractions of a second. Well, so one would have thought, but the imperfections of cosmology intrude once again.
The Earth has the Moon, and the Earth generates a tidal acceleration of the Moon, and, in turn, the Moon decelerates the Earth's rotational speed. In addition to this long-term factor arising from the gravitational interaction between the Earth and the Moon, the Earth's rotational period is affected by climatic and geological events that occur on and within the Earth. Thus it is possible for the Earth's rotation to both slow down and speed up at times. So the two requirements of a second—namely that it is a constant unit of time and it is defined as 1/86,400 of the mean time taken for the Earth to rotate on its axis—cannot both be maintained. Either one or the other has to go.
In 1955 we went down the route of a standard definition of a second, which was defined by the International Astronomical Union as 1/31,556,925.9747 of the 1900.0 Mean Tropical Year. This definition was also adopted in 1956 by the International Committee for Weights and Measures and in 1960 by the General Conference on Weights and Measures, becoming a part of the International System of Units (SI). This definition addressed the problem of the drift in the value of the mean solar year by specifying a particular year as the baseline for the definition.
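As a quick sanity check on this definition, dividing the 1900.0 Mean Tropical Year of 31,556,925.9747 seconds by 86,400 should recover the familiar length of the tropical year in days (the arithmetic below is purely illustrative):

```python
# The 1956 definition carved the 1900.0 Mean Tropical Year into
# 31,556,925.9747 seconds; dividing by the 86,400 seconds of a day
# recovers the well-known ~365.2422-day tropical year.
seconds_in_tropical_year = 31_556_925.9747
seconds_in_day = 86_400
print(round(seconds_in_tropical_year / seconds_in_day, 7))  # 365.2421988
```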
However, by the mid-1960s this definition was also found to be inadequate for precise time measurements, so in 1967 the SI second was again redefined, this time in experimental terms as a repeatable measurement. The new definition of a second was 9,192,631,770 periods of the radiation emitted by a Caesium-133 atom in the transition between the two hyperfine levels of its ground state.
So we have the concept of a second as a fixed unit of time, but how does this relate to the astronomical measurement of time? For the past several centuries the length of the Mean Solar Day has been increasing by an average of some 1.7 milliseconds per century. Given that the SI second was originally calibrated against the Mean Solar Day of the year 1900, by 1961 the Mean Solar Day was around a millisecond longer than 86,400 SI seconds. Therefore, absolute time standards that change the date after precisely 86,400 SI seconds, such as International Atomic Time (TAI), get increasingly ahead of time standards that are rigorously tied to the Mean Solar Day, such as Greenwich Mean Time (GMT).
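A rough calculation shows why this divergence matters. If the Mean Solar Day runs some 1.7 milliseconds longer than 86,400 SI seconds, the accumulated gap between solar time and atomic time grows by over half a second each year (the figures below are illustrative averages, not measured values):

```python
# Back-of-the-envelope: a Mean Solar Day that is ~1.7 ms longer than
# 86,400 SI seconds accumulates into a sizeable annual divergence
# between solar time and atomic time.
excess_seconds_per_day = 0.0017           # illustrative average excess
days_per_year = 365.25
drift_per_year = excess_seconds_per_day * days_per_year
print(round(drift_per_year, 2))  # 0.62 seconds per year
```

A drift rate of roughly 0.6 seconds per year is consistent with the observed need for a leap second every year or two.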
When the Coordinated Universal Time (UTC) standard was instituted in 1961, based on atomic clocks, it was felt necessary that this time standard maintain agreement with the GMT time of day, which until then had been the reference for broadcast time services. Thus, from 1961 to 1971 the rate of broadcast time from the UTC atomic clock source had to be constantly slowed to remain synchronized with GMT. During that period, therefore, the "seconds" of broadcast services were actually slightly longer than the SI second and closer to the GMT seconds.
In 1972 the Leap Second system was introduced, so that the broadcast UTC seconds could be made exactly equal to the standard SI second, while still maintaining the UTC time of day and changes of UTC date synchronized with those of UT1 (the solar time standard that superseded GMT). Reassuringly, a second is now an SI second in both the UTC and TAI standards, and the precise instant when time transitions from one second to the next is synchronized in both of these reference frameworks. But this fixing of the two time standards to a common unit of exactly 1 second means that for the standard second to also track the time of day it is necessary to periodically add or remove entire standard seconds from the UTC time-of-day clock. Hence the use of so-called leap seconds. By 1972 the UTC clock was already 10 seconds behind TAI, which had been synchronized with UT1 in 1958 but had been counting true SI seconds since then. After 1972, both clocks have been ticking in SI seconds, so the difference between their readouts at any time is 10 seconds plus the total number of leap seconds that have been applied to UTC.
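The arithmetic of that offset can be sketched in a few lines (the leap-second count used here is the cumulative mid-2012 total):

```python
# TAI - UTC = the 10-second gap that existed when leap seconds began
# in 1972, plus one second for every leap second applied since.
initial_offset_1972 = 10
leap_seconds_applied = 25      # total applied by mid-2012
tai_minus_utc = initial_offset_1972 + leap_seconds_applied
print(tai_minus_utc)  # 35 seconds
```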
Since January 1, 1988, the role of coordinating the insertion of these leap-second corrections to the UTC time of day has been the responsibility of the International Earth Rotation and Reference Systems Service (IERS). IERS usually decides to apply a leap second whenever the difference between UTC and UT1 approaches 0.6 second in order to keep the absolute difference between UTC and the mean solar UT1 broadcast time from exceeding 0.9 second.
The UTC standard allows leap seconds to be applied at the end of any UTC month, but since 1972 all of these leap seconds have been inserted either at the end of June 30 or December 31, making the final UTC minute of the month either 1 second longer or 1 second shorter when the leap second is applied. IERS publishes announcements in its Bulletin C every 6 months as to whether a leap second is to occur or not. Such announcements are typically published well in advance of each possible leap-second date—usually in early January for a June 30 scheduled leap second and in early July for a December 31 leap second. Greater levels of advance notice are not possible because of the degree of uncertainty in predicting the precise value of the cumulative effect of fluctuations of the deviation of the Earth's rotational period from the value of the Mean Solar Day. Or, in other words, the Earth is unpredictably wobbly!
Between 1972 and 2012 some 25 leap seconds were added to UTC, implying that, on average, a leap second has been inserted about every 19 months. However, the spacing of these leap seconds is quite irregular: there were no leap seconds in the 7-year interval between January 1, 1999, and December 31, 2005, but there were 9 leap seconds in the 13 years between 1985 and 1997, as shown in Figure 1. Since December 31, 1998, there have been only 3 leap seconds (on December 31, 2005; December 31, 2008; and June 30, 2012), each of which added 1 second to the final UTC minute of the month.
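The 19-month average quoted above is simple to verify from the figures in this article:

```python
# 25 leap seconds across the 40-year span from 1972 to 2012.
leap_seconds = 25
months = (2012 - 1972) * 12
print(round(months / leap_seconds, 1))  # 19.2 months between leap seconds, on average
```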
Leaping Seconds and Computer Systems
The June 30, 2012, leap second did not pass without a hitch, as reported by the tabloid press. The side effects of this particular leap second appeared to include computer system outages and crashes, an outcome that was unexpected and surprising. The leap second managed to crash some servers used in the Amadeus airline management system, throwing the Qantas airline into a flurry of confusion on the Sunday morning of July 1 in Australia. Nor were airlines the only casualties: LinkedIn, Foursquare, Yelp, and Opera were among the numerous online service operators whose servers stumbled in some fashion. The event also affected some Internet Service Providers and data center operators; one Australian service provider reported that a large number of its Ethernet switches seized up over a 2-hour period following the leap second.
It appears that one common element here was the use of the Linux operating system. But Linux is not exactly a new operating system, and the use of the Leap Second Option in the Network Time Protocol (NTP) [7, 8, 9, 10] is not exactly novel either. Why didn't we see the same problems in early 2009, following the leap second that occurred on December 31, 2008?
Ah, there were problems then too, but perhaps they were blotted out in the post-New-Year celebratory hangover! Some folks did notice something wrong with their servers on January 1, 2009. Problems with the leap second were recorded with Red Hat Linux following the December 2008 leap second, where kernel versions prior to Version 2.6.9 could encounter a deadlock condition in the kernel while processing the leap second.
"[...] the leap second code is called from the timer interrupt handler, which holds xtime_lock. The leap second code does a printk to notify about the leap second. The printk code tries to wake up klogd (I assume to prioritize kernel messages), and (under some conditions), the scheduler attempts to get the current time, which tries to get xtime_lock => deadlock." 
The advice to sysadmins in January 2009 was to upgrade their systems to Version 2.6.9 or later, which contained a patch that avoided this kernel-level deadlock. The June 2012 leap second exposed a different problem, one that drove server CPUs to 100-percent utilization:
"The problem is caused by a bug in the kernel code for high resolution timers (hrtimers). Since they are configured using the CONFIG_HIGH_RES_TIMERS option and most systems manufactured in recent years include the High Precision Event Timers (HPET) supported by this code, these timers are active in the kernels in many recent distributions.
"The kernel bug means that the hrtimer code fails to set the system time when the leap second is added. The result is that the hrtimer representation of the time taken from the kernel is a second ahead of the system time. If an application then calls a kernel function with a timeout of less than a second, the kernel assumes that the timeout has elapsed immediately after setting the timer, and so returns to the program code immediately. In the event of a timeout, many programs simply repeat the requested operation and immediately set a new timer. This results in an endless loop, leading to 100% CPU utilisation." 
Having closely monitored their systems through the earlier 2005 leap second, Google engineers were aware of problems in their operating system when processing a leap second. They had noticed that some clustered systems stopped accepting work during the leap second of December 31, 2005, and they wanted to ensure that this situation did not recur in 2008. Their approach was subtly different from that used by the Linux kernel maintainers.
Rather than attempt to hunt for bugs in the time management code streams in the system kernel, they noted that the intentional side effect of NTP was to continually perform slight time adjustments in the systems that are synchronizing their time according to the NTP signal. If the quantum of an entire second in a single time update was a problem for their systems, then what about an approach that allowed the 1-second time adjustment to be smeared across many minutes or even many hours? That way the leap second would be represented as a long sequence of very small time adjustments that, in NTP terms, were nothing exceptional. The result of these changes was that NTP itself would start slowing down the time-of-day clock on these systems some time in advance of the leap second by very slight amounts, so that at the time of the applied leap second, at 23:59:59 UTC, the adjusted NTP time would have already been wound back to 23:59:58. The leap second, which would normally be recorded as 23:59:60, was now a "normal" time of 23:59:59, and whatever bugs remained in the leap-second time code of the system were never exercised.
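The smearing idea can be sketched in a few lines. The sketch below assumes a simple linear smear over a 10-hour window ending at the leap second; the window length is an invented figure, and Google's actual adjustment curve differed, so treat this as an illustration of the principle rather than their implementation:

```python
SMEAR_WINDOW = 10 * 3600       # spread the leap second over 10 hours (assumed)
LEAP = 1_341_100_800           # seconds count at 2012-07-01 00:00:00 UTC

def smeared_time(t):
    """Return the adjusted clock reading for a raw clock reading t."""
    start = LEAP - SMEAR_WINDOW
    if t <= start:
        return t               # smear not yet started
    if t >= LEAP:
        return t - 1.0         # the full leap second has been absorbed
    # inside the window the clock runs fractionally slow
    return t - (t - start) / SMEAR_WINDOW

# halfway through the window the clock has given back half a second:
halfway = LEAP - SMEAR_WINDOW // 2
print(smeared_time(halfway) - halfway)  # -0.5
```

By the time the leap second arrives, the smeared clock already reads one full second behind the raw count, so the awkward time of 23:59:60 never has to be represented.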
The topic of leap seconds remains a contentious one. In 2005 the United States made a proposal to the ITU Radiocommunication Sector (ITU-R) Study Group 7's Working Party 7-A to eliminate leap seconds. It is not entirely clear whether these leap seconds would be replaced by a far less frequent Leap Hour, or whether the attempt to link UTC and the Mean Solar Day would simply be abandoned, so that over time UTC would drift away from the UT1 concept of solar-day time.
This proposal was most recently considered by the ITU-R in January 2012, and there was evidently no clear consensus on this topic. France, Italy, Japan, Mexico, and the United States were reported to be in favor of abandoning leap seconds, whereas Canada, China, Germany, and the United Kingdom were reportedly against these changes to UTC. At present a decision on this topic, or at the least a discussion on this topic, is scheduled for the 2015 World Radio Conference.
Although these computing problems with processing leap seconds are annoying and for some folks extremely frustrating and sometimes expensive, I am not sure this factor alone should affect the decision process about whether to drop leap seconds from the UTC time framework. With our increasing dependence on highly available systems, and the criticality of accurate time-of-day clocks as part of the basic mechanisms of system security and integrity, it would be good to think that we have managed to debug this processing of leap seconds.
It is often the case in systems maintenance that the more a bug is exercised the more likely it is that the bug will be isolated and corrected. However, with leap seconds, this task is a tough one because the occurrence of leap seconds is not easily predicted. The next time we have to leap a second in time, about the best we can do is hope that we are ready for it.
For Further Reading
The story of calendars, time, time of day, and time reference standards is a fascinating one. It includes ancient stellar observatories, the medieval quest to predict the date of Easter, the quest to construct an accurate clock that would allow the calculation of longitude, and the current constellations of time and location reference satellites. These days much of this material can be found on the Internet.