There's been a lot of blogging over the weekend about the 36-hour Skype outage that occurred starting last Thursday. From Skype's official explanation, it wasn't a security-related event -- in other words, Skype wasn't hacked. We have no reason to believe otherwise. However, security and availability are often discussed in the same breath, and lots of people will be speculating about the chain of events that led to this outage.

It's worth understanding a little bit about the Skype network. I remembered reading this paper a few years back, in which some Columbia University researchers analyzed Skype network traffic to derive information about the Skype infrastructure. Things may have changed slightly since then, but for now, let's assume it's accurate.

The primary takeaway from this analysis is that Skype relies heavily on hosts called Super Nodes to help direct traffic on the P2P network. Super Nodes are what allow Skype clients to communicate on the network from behind firewalls and NAT. Any Skype client can become a Super Node, and though the mechanism by which that happens is not well-documented, you're more likely to become a Super Node if you're on a publicly routable IP address and have adequate bandwidth.

Logging into the Skype network is a two-step process. First, the client has to establish a connection to a Super Node. If and only if this succeeds, the client contacts Skype's central login server to authenticate to the network. Failure at either step would prevent users from logging on, which is the scenario described in Skype's blog:

The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.

The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.

So why hasn't this happened before? Both Skype and Patch Tuesday have been around for a long time. The answer depends on whether the outage was primarily due to an overloaded login server or to a lack of available Super Nodes. It's reasonable to assume that Skype closely monitors the load on its login servers and would have seen that load inching closer to capacity over the past few Patch Tuesdays. Unless the Skype user base grew dramatically over the past month, they would likely have been prepared for the spike in traffic. Also, I'm unclear on how the lack of Super Nodes and the load on the login servers combine to create an amplification effect, since they are serial dependencies -- that is, the login server isn't even contacted unless a Super Node is available.
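To make the serial dependency concrete, here is a minimal Python sketch of the two-step login as the Columbia paper describes it. Everything in it is hypothetical: the function name, the result values, and the request counter are my own illustration, not Skype's actual protocol or API. The point is only that a client that cannot reach a Super Node never generates load on the login server.

```python
from enum import Enum


class LoginResult(Enum):
    OK = "logged in"
    NO_SUPER_NODE = "no Super Node reachable"
    LOGIN_SERVER_FAILED = "login server unavailable"


def attempt_login(super_node_reachable: bool,
                  login_server_up: bool,
                  stats: dict) -> LoginResult:
    """Hypothetical model of the two-step login (not Skype's real code)."""
    # Step 1: the client must first connect to a Super Node.
    if not super_node_reachable:
        return LoginResult.NO_SUPER_NODE

    # Step 2: only now is the central login server contacted,
    # so only clients that passed step 1 add to its load.
    stats["login_server_requests"] += 1
    if not login_server_up:
        return LoginResult.LOGIN_SERVER_FAILED
    return LoginResult.OK


stats = {"login_server_requests": 0}
print(attempt_login(False, True, stats), stats)  # fails at step 1, server untouched
print(attempt_login(True, True, stats), stats)   # reaches the login server
```

In this model, a shortage of Super Nodes reduces the number of requests reaching the login server rather than increasing it, which is why the amplification effect Skype describes is hard to explain from the outside.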

Perhaps there has been a significant shift in the number of Super Nodes relative to the number of Skype clients. More people are sitting behind home firewalls or NAT routers these days, which makes them less likely to qualify as Super Nodes. However, if this were the case, the network would probably be struggling to operate under normal conditions, not just on Patch Tuesday.

Did Microsoft change the mechanism by which patches are rolled out? It doesn't seem like there should ever be a "massive restart" of computers across the globe. On my systems, I let Windows Update download updates automatically, and I'm usually not prompted to install them until Wednesday. It takes a while for the patches to propagate to systems across the globe, and as they are applied, the systems reboot in a rolling fashion. What would have caused significantly more systems than normal to reboot at the same time? And why didn't it happen until Thursday?
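A rough back-of-the-envelope model shows why the timing matters. The Python sketch below compares the peak number of re-login attempts per minute when the same set of clients reboots over a full day versus within a ten-minute window. The client count and the window sizes are made-up numbers chosen for easy arithmetic, not measurements of Skype or Windows Update, and the even spread is an assumption.

```python
from collections import Counter


def reboot_schedule(clients: int, window_minutes: int) -> list[int]:
    """Assign each client a reboot minute, spread evenly over the window.

    Hypothetical model: real reboots would cluster by time zone and
    update settings rather than spreading perfectly evenly.
    """
    return [i % window_minutes for i in range(clients)]


def peak_logins_per_minute(schedule: list[int]) -> int:
    """Each reboot is followed by one login attempt in the same minute."""
    per_minute = Counter(schedule)
    return max(per_minute.values()) if per_minute else 0


CLIENTS = 14_400  # illustrative figure only

rolling = peak_logins_per_minute(reboot_schedule(CLIENTS, 24 * 60))
burst = peak_logins_per_minute(reboot_schedule(CLIENTS, 10))

print(f"rolling over a day: {rolling} logins/minute at peak")
print(f"within ten minutes: {burst} logins/minute at peak")
```

With these numbers the burst case produces a peak 144 times higher than the rolling case, from exactly the same number of reboots. That is the kind of spike a "massive restart" would have to produce, and a normal rolling patch cycle would not.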

It will be interesting to see if Skype releases any more technical details. Without a doubt, it is difficult to build and maintain a reliable network of this size, and I'm sure there are some good lessons learned from this event that anybody working on large-scale systems would appreciate hearing about.

[Update, 8/21/2007: MSRC weighs in on the role of Windows Update]

About Chris Eng

Chris Eng, vice president of research, is responsible for integrating security expertise into Veracode’s technology. In addition to helping define and prioritize the security feature set of the Veracode service, he consults frequently with customers to discuss and advance their application security initiatives. With over 15 years of experience in application security, Chris brings a wealth of practical expertise to Veracode.

Comments (1)

LonerVamp | August 20, 2007 4:55 pm

Even at default times, only x amount of people in that timezone could possibly be rebooting all at once. That certainly cannot account for millions.

And that certainly should not account for an entire infrastructure being affected, even if distributed. If that were true, are P2Ps buckling? Or better yet, should everyone be far more wary about their use of and dependence on such a brittle system that relies entirely on its users?

No, I think all of this is the lesser of two choices, namely to admit Skype has been hacked/DOSed or just had an availability spurt with Patch Tuesday conveniently nearby... Skype's security has long been their biggest issue, both with detractors and with backers.
