Despite months of planning and preparation for the October 1st rollout of online healthcare exchanges as part of the U.S. Affordable Care Act, government contractors apparently forgot one much-needed feature: a Fail Whale.
You know the Fail Whale? That mythical creature that graced many a computer screen back in the early years of Twitter? The Fail Whale was a simple thing – a drawing of a friendly-looking whale borne aloft by a flock of Twitter’s trademark birdies. The image – a creation of Australian artist Yiying Lu – didn’t do much. It just conveyed a message: “too many tweets!” In other words: Twitter’s straining infrastructure couldn’t, at that moment, handle the torrent of 140-character messages from its fast-growing user base.
The Fail Whale wasn’t a one-time kind of thing. In fact, it was a semi-permanent “feature” of Twitter from its earliest days. It appeared with particular frequency in 2008 and 2009 – years that saw Twitter’s user base grow at an almost exponential rate. Availability problems were particularly noticeable around major events such as the FIFA World Cup and Michael Jackson’s death, and in conjunction with architectural or back-end changes.
In other words: Twitter’s talented engineers had a lot of trouble early on getting the service to “scale” – that is, to grow to meet the demands of its users, especially during “spikes” in their activity. And Twitter – let us remember – is pretty straightforward. Its users circulate snippets of text in 140-character increments. At no point was Twitter considered a “failure” because of these availability problems. At most, users (and the press) saw the site as a victim of its own spectacular success.
Should we be surprised, then, that Healthcare.gov, the U.S. federal government’s main health insurance storefront, encountered similar problems on Tuesday when the doors swung open and more than 2 million eager customers poured in?
“No way.” That was the consensus of the application testing and security experts I asked about reports that users of Healthcare.gov, the federal government’s main exchange for 30 states, and of some of the state-run exchanges were experiencing errors and long delays.
“I’d never rule out government incompetence,” Jeremiah Grossman of the security firm WhiteHat Security told me. “But I think if you ask any people who build scalable web systems if they would expect to stand up a system and have two million people use it on the first day without any problems, they’d say ‘no way.’”
Grossman pointed out that now-famous brands like Google, Facebook and Twitter all had their share of availability problems. And Twitter’s struggles to manage 140-character text-based tweets pale in comparison to the complexity of healthcare.gov sessions, where users enter personal information about themselves, compare and contrast the features of various health plans, and then register for the plan they want.
Twitter, Facebook and Google also had much more time to work out the kinks before their user bases grew to millions, or tens of millions, of users – the situation healthcare.gov now finds itself in. That gave their IT teams time to adjust, spotting any of the scores of bottlenecks that are likely to trip up a new web-based application. Grossman rattled off a few of the likely culprits behind the health exchange problems.
“The incoming pipes are one. Do you have enough bandwidth to satisfy 2 million people? Can your web server handle 2 million simultaneous web server connections? Can your backend database handle all the application queries? Now that the DNS is resolving to your server, does that fall down?”
Inevitably, Grossman said, solving one problem just moves the bottleneck downstream, where further problems lurk. Script-based testing as part of a quality assurance process might be able to reproduce the behavior of thousands or even tens of thousands of concurrent users. But “how do you simulate a million or two million users? Unless you hire some botnet to surf your site, I don’t see how you can do it,” Grossman joked.
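For readers curious what script-based load testing looks like in miniature, here is a rough sketch, not anything healthcare.gov's contractors are known to have used. It fires a batch of simulated users at a stand-in function and records failures and response times. The names `fake_endpoint`, `timed_call` and `run_load_test` are hypothetical, and the 1 ms delay is an invented placeholder for a real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_endpoint(user_id):
    """Hypothetical stand-in for one HTTP request to the site under test."""
    time.sleep(0.001)  # assumption: pretend the server answers in about 1 ms
    return True


def timed_call(user_id):
    """Run one simulated user; return (succeeded, seconds elapsed)."""
    start = time.perf_counter()
    try:
        ok = fake_endpoint(user_id)
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_load_test(users, concurrency):
    """Issue `users` simulated requests, at most `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(users)))
    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        "median_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }


report = run_load_test(users=200, concurrency=50)
print(report)
```

A single machine running a script like this tops out at a few thousand concurrent connections, which is Grossman's point: the gap between what a test rig can generate and two million real visitors is several orders of magnitude.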
All this is to be expected – but that doesn’t mean that the problems experienced by healthcare.gov (designed by CGI Federal with a $71 million federal grant) were unavoidable, Grossman notes. The federal government (and states) would have been smart to roll the new online marketplace out slowly, over the course of a few months, taking it one state at a time, or limiting access to users by birth date. Many of the same problems would almost certainly have cropped up with a smaller rollout, but their impact would have been muted.
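The birth-date idea is simple enough to sketch in a few lines. The version below is one possible reading of Grossman's suggestion, not a scheme anyone proposed in detail: it admits one additional birth month per week after launch, so everyone is in after twelve weeks. The schedule, the constant `MONTHS_PER_WEEK` and the function names are all assumptions made for illustration.

```python
from datetime import date

LAUNCH = date(2013, 10, 1)  # the October 1st rollout date
MONTHS_PER_WEEK = 1         # assumption: open one more birth month each week


def admitted_months(today):
    """Number of birth months (counting from January) allowed in on `today`."""
    if today < LAUNCH:
        return 0
    weeks_since_launch = (today - LAUNCH).days // 7
    return min(12, (weeks_since_launch + 1) * MONTHS_PER_WEEK)


def may_enroll(birth_date, today):
    """True if a user born on `birth_date` is inside the current rollout wave."""
    return birth_date.month <= admitted_months(today)


# On launch day only January birthdays get in; March opens two weeks later.
print(may_enroll(date(1980, 1, 15), date(2013, 10, 1)))
print(may_enroll(date(1980, 3, 15), date(2013, 10, 1)))
print(may_enroll(date(1980, 3, 15), date(2013, 10, 15)))
```

Under this schedule the first week's traffic is roughly a twelfth of the full audience, which gives the operators room to find the first bottleneck before the next wave arrives.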
But the time for playing “what if” has passed for healthcare.gov and the various state-level exchanges. With the doors open, anxious citizens queued up to get on a health plan – any health plan – and Republicans pointing to the delays as evidence of the Affordable Care Act’s failings, the only thing to do now is to work through the problems and ask irate users for patience. And that’s a lot to ask.