There's been a lot of blogging over the weekend about the 36-hour Skype outage that occurred starting last Thursday. From Skype's official explanation, it wasn't a security-related event -- in other words, Skype wasn't hacked. We have no reason to believe otherwise. However, security and availability are often discussed in the same breath, and lots of people will be speculating about the chain of events that led to this outage.
It's worth understanding a little bit about the Skype network. I remember reading this paper a few years back, in which some Columbia University researchers analyzed Skype network traffic to derive information about the Skype infrastructure. Things may have changed since then, but for now, let's assume it's still accurate.
The primary takeaway from this analysis is that Skype relies heavily on hosts called Super Nodes to help direct traffic on the P2P network. Super Nodes are what allow Skype clients to communicate on the network from behind firewalls and NAT. Any Skype client can become a Super Node, and though the mechanism by which that happens is not well-documented, you're more likely to become a Super Node if you're on a publicly routable IP address and have adequate bandwidth.
Logging into the Skype network is a two-step process. First, the client has to establish a connection to a Super Node. If and only if this is successful, the client contacts Skype's central login server to authenticate to the network. Failure at either of these steps would prevent users from logging on, which is the scenario described on Skype's blog:
The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.
The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.
So why hasn't this happened before? Both Skype and Patch Tuesday have been around for a long time. The answer depends on whether the outage was primarily due to an overloaded login server or a lack of available Super Nodes. It's a reasonable assumption that Skype closely monitors the load on its login servers and would have seen that load inching closer to capacity over the past few Patch Tuesdays. Unless the Skype user base grew dramatically over the past month, they would likely have been prepared for the spike in traffic. Also, I'm unclear on how the lack of Super Nodes and the load on the login servers would combine to create an amplification effect, since they are serial dependencies -- that is, the login server isn't even contacted unless a Super Node is available.
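To make the serial dependency concrete, here is a minimal Python sketch of the two-step login described above. It is a toy model, not Skype's actual protocol: the `ToyNetwork` class, its fields, and the capacity numbers are all hypothetical and exist only to illustrate the ordering of the two steps.

```python
from dataclasses import dataclass


@dataclass
class ToyNetwork:
    """Hypothetical model; names and numbers are illustrative, not Skype's."""
    super_nodes_available: int
    login_capacity: int
    login_requests_seen: int = 0


def try_login(net: ToyNetwork) -> str:
    # Step 1: the client must first reach a Super Node.
    if net.super_nodes_available <= 0:
        return "no_super_node"
    # Step 2: only after step 1 succeeds is the login server contacted.
    net.login_requests_seen += 1
    if net.login_requests_seen > net.login_capacity:
        return "login_server_overloaded"
    return "ok"


def attempt_many(net: ToyNetwork, clients: int) -> dict:
    # Tally the outcome of each client's single login attempt.
    outcomes = {}
    for _ in range(clients):
        result = try_login(net)
        outcomes[result] = outcomes.get(result, 0) + 1
    return outcomes
```

In this model, a network with no Super Nodes never generates any load on the login server at all, which is why it's hard to see how the two failures would amplify each other without some additional mechanism (client retries, for example) that the official explanation doesn't describe.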
Perhaps there has been a significant shift in the number of Super Nodes relative to the number of Skype clients. More people are sitting behind home firewalls or NAT routers these days. However, if this were the case, the network would probably be struggling to operate under normal conditions, not just on Patch Tuesday.
Did Microsoft change the mechanism by which patches are rolled out? It doesn't seem like there should ever be a "massive restart" of computers across the globe. On my systems, I let Windows Update download updates automatically, and I'm usually not prompted to install them until Wednesday. It takes a while for the patches to propagate to systems across the globe, and as they are applied, the systems reboot in a rolling fashion. What would have caused significantly more systems than normal to reboot at the same time? And why didn't it happen until Thursday?
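A rough back-of-the-envelope sketch shows why the rolling nature of the reboots matters. The function below assumes, purely for illustration, that reboots are spread evenly across a time window and that each rebooted client logs back in once; the real distribution of reboot times is unknown, and the function name and parameters are hypothetical.

```python
from collections import Counter


def peak_logins_per_minute(clients: int, window_minutes: int) -> int:
    """Peak re-logins in any single minute, assuming reboots are
    spread evenly over the window (an illustrative assumption)."""
    # Assign client i to the minute in which it reboots and logs back in.
    per_minute = Counter(
        (i * window_minutes) // clients for i in range(clients)
    )
    return max(per_minute.values())


# The same population, rebooting over an hour versus within one minute.
rolling = peak_logins_per_minute(clients=1200, window_minutes=60)
synchronized = peak_logins_per_minute(clients=1200, window_minutes=1)
```

Under these assumptions the peak load scales inversely with the width of the window, so something would have had to compress the usual multi-day rollout into a much shorter period to produce a spike that previous Patch Tuesdays did not.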
It will be interesting to see if Skype releases any more technical details. Without a doubt, it is difficult to build and maintain a reliable network of this size, and I'm sure there are some good lessons learned from this event that anybody working on large-scale systems would appreciate hearing about.
[Update, 8/21/2007: MSRC weighs in on the role of Windows Update]