The March 2014 report is going to be a bit different from those in the past, primarily due to architectural changes that were made to get more precise data in less time. Additionally, a lot of work has been done to automate the generation of these reports so they can be released more often. Our scan was run on March 5th, 2014 using the latest input from the Alexa Top 1 Million.
Before going over what has changed, we should cover what was done in the past. Previously, scans were run using Python + gevent, with Kyoto Cabinet as a data store. The architecture was a hack and not much thought was given to it, as it was little more than a toy project at the time. After a scan completed, only specific headers were extracted and put into a MySQL database for processing. Initially these scans were one-offs, and MySQL was chosen simply because it was already running on the system used. The K/V store of Kyoto Cabinet gave the benefit of automatically reducing duplicate URLs since keys are unique; at least, that was the thought.

After changing the database to PostgreSQL, a number of discrepancies were noticed. First, care was not taken to lowercase the URLs, which resulted in duplicates when redirects upper-cased part or all of the URI. However, since MySQL treats HTTP://VERACODE.COM and https://veracode.com as the same value when using the distinct modifier, our stats were for the most part accurate. Other issues appeared, such as how MySQL treats white space in its default collation. A simple query such as select * from headers where header_name='x-xss-protection'; matches not only 'x-xss-protection' but also 'x-xss-protection ' (with a trailing space). However, a header of X-XSS-Protection : 1; mode=block; (note the space between the header name and the colon) is technically invalid, and in Internet Explorer at least, only the default XSS protections would be in place, not the defined blocking mode.

Another issue is that grequests (a gevent wrapper around the requests library) merges the values of duplicate header names. So if a server responds with duplicate headers such as those below, the result would be stored as x-xss-protection: 1, 1; in our database.
X-XSS-Protection: 1;
X-XSS-Protection: 1;
Seeing 1, 1 is unfortunate, as it was not possible to tell whether the server actually responded with 1, 1 or the headers were merged by the requests library. Since these reports are being quoted and re-used in various forums, it was felt that a rewrite was in order, with proper data integrity and more precise statistics. As such, a number of changes were made to meet this goal. The scanner was completely rewritten in Go; issuing four million requests concurrently is almost the perfect use case for a language like Go. All header data is now written directly to PostgreSQL using a uniqueness constraint on URLs and user-agents. This constraint stopped over 17,000 duplicate URLs from being added by sites that redirect back to a URL that had already been processed. Additionally, all URLs and header names are now lower-cased prior to insertion into the database. Another new check: if a requested URL redirects back to itself over a different protocol, the redirect is not followed; instead, the 301 response has its headers inserted into the database. Previously, redirects were followed all the way to the final destination resource, potentially overwriting values that already existed.

Overall, the new architecture allows us to issue four million requests in under two hours, roughly 740 requests per second, all while ensuring data integrity and giving us all header data to use in our analysis. For the curious, this is done using two m3.large AWS instances (with permission from Amazon): one hosting the 'Golexa' scanner, the other our PostgreSQL 9.3 database. CPU consumption was around 60-70% of all three cores, with less than a gigabyte of memory used, and the scanner averaged around 80 MB/s for the requests. PostgreSQL hovered around 80% CPU utilization with around 200 concurrent connections. While unfortunate, the old format is too imprecise to give accurate results; as such, it has been decided not to show rate of change against previous scans.
However, future scans will be compared using the new format. Finally, it was observed that responses to the different user-agents in some cases produced very interesting differences in header values. All charts will now be displayed using the values specific to each browser's response.
Of the four million requests, we received 2,809,213 responses, with 1,393,497 URLs matching in both the Firefox and Chrome responses. Using Firefox 25's user-agent there were 1,404,180 responses, while Chrome 31's produced 1,405,033. Chrome had a total of 941,568 HTTP and 463,465 HTTPS responses, while Firefox had 940,899 HTTP and 463,281 HTTPS responses. In total there were 23,095,205 headers stored for analysis. The March 2014 report adds two new headers to the analysis: X-Content-Type-Options and Public-Key-Pins. For more information on these headers, please see our previous post on Guidelines for Setting Security Headers.
Invalid Header Names
Thankfully, the number of invalid headers is quite low, with the majority being incorrect CORS headers such as "access-control-allow-origen" [sic] or "access-control-allow-method" (missing the "s" in "methods"). Overall, around 50 header names were specified incorrectly.
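A scanner can flag such names with a simple lookup table of observed misspellings. This is only an illustrative sketch, seeded with the two examples above; it is not how the report's numbers were produced:

```go
package main

import (
	"fmt"
	"strings"
)

// knownTypos maps misspelled header names seen in scans to the header
// the operator presumably intended. These two entries come from the
// report; a real table would be much longer.
var knownTypos = map[string]string{
	"access-control-allow-origen": "access-control-allow-origin",
	"access-control-allow-method": "access-control-allow-methods",
}

// intendedHeader reports whether a header name is a known misspelling
// and, if so, what the correct name is.
func intendedHeader(name string) (string, bool) {
	fixed, ok := knownTypos[strings.ToLower(strings.TrimSpace(name))]
	return fixed, ok
}

func main() {
	if fixed, ok := intendedHeader("Access-Control-Allow-Origen"); ok {
		fmt.Println("did you mean:", fixed)
	}
}
```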
cache-control: no-store, no-cache, must-revalidate
cache-control: post-check=0, pre-check=0
expires: Wed, 05 Mar 2014 00:00:00 GMT
last-modified: Wed, 05 Mar 2014 09:27:02 GMT
x-content-type-options: nosniff
x-xss-protection: 0; mode=block
date: Wed, 05 Mar 2014 09:27:02 GMT
content-type: text/html; charset=windows-1252
connection: keep-alive
p3p: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
pragma: no-cache
The URLs themselves give away an interesting tell: almost all appear to be forums. After visiting a number of them, there were clear signs that they are part of www.forumotion.com, a free forum hosting service. Visiting their own help forum at http://help.forumotion.com/forum exhibits the same issue: when accessed with Firefox, the 1; mode=block value is returned; when accessed with Chrome, 0; mode=block is returned. The 12,572 sites that combine blocking with reporting are almost all YouTube, with only two other sites using report-uri: etsy.com and xing.com.
This is a brand-new header and as such has seen very little adoption. In fact, only three sites responded with the Public-Key-Pins header, and one with the Public-Key-Pins-Report-Only header. Of the three sites, only one was configured correctly, encapsulating the hash values in quotes. This is concerning, as anyone who is even aware of this header should be quite adept at setting it correctly; the fact that this is not the case does not bode well for site operators who may implement it in the future. We look forward to watching the adoption rate of this header and hope that either the specification is relaxed to allow unquoted hashes or it is made painfully clear that omitting the quotes is incorrect.
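The quoting rule is mechanical enough to check automatically. A hedged sketch follows; pinsQuoted is a hypothetical helper, and the pin value used in main is a placeholder, not a real certificate digest:

```go
package main

import (
	"fmt"
	"strings"
)

// pinsQuoted reports whether every pin-sha256 value in a
// Public-Key-Pins header is wrapped in double quotes, as the draft
// specification requires. This is an illustrative check only; it does
// not validate the base64 digest itself.
func pinsQuoted(header string) bool {
	for _, directive := range strings.Split(header, ";") {
		directive = strings.TrimSpace(directive)
		if !strings.HasPrefix(strings.ToLower(directive), "pin-sha256=") {
			continue
		}
		val := directive[len("pin-sha256="):]
		if len(val) < 2 || val[0] != '"' || val[len(val)-1] != '"' {
			return false
		}
	}
	return true
}

func main() {
	// "dGVzdA==" is a placeholder, not a real pin.
	fmt.Println(pinsQuoted(`pin-sha256="dGVzdA=="; max-age=2592000`)) // true
	fmt.Println(pinsQuoted(`pin-sha256=dGVzdA==; max-age=2592000`))   // false
}
```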
Set-Cookie: phpMyAdmin=a9dee5eb7a9d5ae4579ad44fb82c6f37b5278351; path=/; secure; HttpOnly
It turns out that back in May of 2011, according to this changelog, phpMyAdmin added the X-Content-Security-Policy header. Even in the latest version, it still defines X-Content-Security-Policy rather than Content-Security-Policy.
Analyzing the results from this month has probably been the most interesting for me personally. Having more data in a more structured layout has allowed me to gain greater insight and context into who is using these headers and how they are being used. The numbers themselves can only tell half the story; only by analyzing the context and surrounding information can we get the full picture. One thing that should be painfully obvious is the importance web frameworks and hosting companies play in the adoption of these security defenses. If you or your company falls into either of these categories, it is strongly recommended that you keep up to date with the latest specifications, as they can change quite often; their proper or improper implementation can have far more impact on the state of the web than any single web site. Invalid settings continue to be a concern, to the point that I think either the specifications and implementations are too rigid or the documentation is not doing a good job of clearly explaining their constraints. I personally find ABNF rather cryptic and do wonder if a better format could be used. As for the future, I'm hoping the new infrastructure will let us run these scans more often. As always, comments and ideas for additional analysis are always welcome.
Finally, I'd like to give thanks to a number of people. As these reports get more in-depth and more complicated, I always appreciate the people who point me in the right direction or answer my questions! So big thanks to Ian Melven from New Relic, who is my insight into Mozilla's Firefox; Mike West from Google, for pointing me to the right place for Chrome internals; our very own Erik Peterson, for setting up the AWS infrastructure; and Florent Daignière of Matta Consulting, for introducing me to the Public-Key-Pins header. Thanks!