If there's one thing I've learned in my first year working in the software security field, it's that security is an inherently complex and difficult challenge to tackle. Threats are organic and ever-evolving, attack surfaces are massive and varied, and the measurement of metrics and success is both limited and subjective. Every year a slew of companies put out their reports on the current state of software security, each reporting on a unique, finite set of data. Our SOSS reports are no exception. Breaking them down and examining the results further is important in understanding the data and why it tells the story it tells. Read on to join Betsy Nichols of PlexLogic as she examines trends in our Software Quality Scores reported over various SOSS versions.
The following post is a contribution by Elizabeth Nichols (aka Betsy), the Co-Founder and Chief Data Scientist of PlexLogic LLC. Over her career, which includes positions in university faculty, civilian government, private enterprise, and several start-ups, she has applied mathematics and computer technologies to automate analytics for war gaming, spacecraft mission optimization, industrial process control, network and systems management, and most recently, analytics for mobile apps and IT security. She has co-founded three companies, the most recent being PlexLogic.
Application security analytics at Veracode is “living in interesting times.” With each passing month, the data set is growing dramatically in both size and variety. An increasingly diverse set of organizations are submitting their applications for review. New programs such as VAST create usage patterns of Veracode’s services that reflect an evolving security supply chain. On top of this, Veracode’s platform for finding, classifying, and reporting discovered flaws continues to expand to address new challenges such as mobile apps, new vulnerability classes, new scanning technologies and revised policies for defining acceptable application software security. The challenge with all this newness is striking the right balance between keeping the analysis the same to track trends over time and developing new analysis to convey some new findings.
The good news is that Veracode’s unique and rich dataset always has more to offer as we continue to poke and prod it for insights. This blog is a series designed to give interested readers a peek at some of our behind-the-scenes adventures in creating and evolving Veracode’s SOSS analytics.
In SOSS Volume 4 (representing 9,910 application builds), we presented analysis that investigated the interaction between the Veracode Software Quality Score (SQS) and the “build” of an application, where build is a number that reflects how many times the same application has been submitted to Veracode for review. We hypothesized that SQS would increase as the build number increased. Here is what the SOSSv4 dataset had to say (un-beautified, original analysis):
For folks unfamiliar with notched box and whisker plots, the figure shows that the median score attained across all applications submitted for the first time (first build) is 82.31 and across all applications submitted for the second time (second build) is 84.81. The notch high and low numbers indicate whether this result is statistically significant. The figure suggests statistically significant improvement from build 1 to build 2, no significant change from build 2 to 3, greatest improvement from build 3 to 4, then down, then up, then down a bit, then down more, then up again. Not exactly what we expected – but not strange enough to warrant further investigation.
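For readers who want to see how the notch comparison works mechanically, here is a minimal Python sketch. It assumes the common convention of drawing the notch at the median plus or minus 1.57 × IQR / √n, which may not be exactly what the tool behind our figures uses. The function names and the idea of passing in one list of scores per build number are illustrative assumptions, not part of the actual SOSS analysis code.

```python
import math
import statistics


def notch_interval(scores):
    """Return (notch_low, median, notch_high) for one build's scores.

    Assumes the common convention: median +/- 1.57 * IQR / sqrt(n).
    Requires at least two scores.
    """
    n = len(scores)
    median = statistics.median(scores)
    # Quartiles of the sample; the "inclusive" method treats the data
    # as the whole population of interest for that build number.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(n)
    return median - half_width, median, median + half_width


def notches_overlap(scores_a, scores_b):
    """True when the two notch intervals overlap.

    Non-overlapping notches are the visual cue that the difference
    between the two medians is likely to be statistically significant.
    """
    low_a, _, high_a = notch_interval(scores_a)
    low_b, _, high_b = notch_interval(scores_b)
    return low_a <= high_b and low_b <= high_a
```

Comparing consecutive builds then amounts to calling `notches_overlap` on the scores for build 1 and build 2, build 2 and build 3, and so on. Treat this as a rough visual test rather than a substitute for a formal significance test.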
Fast forward a year: we repeated the exact same analysis for the SOSSv5 dataset (representing 22,430 scanned application builds) and obtained the following (again un-beautified):
Now we see no suggestion of statistically significant improvement from build one to build two. This makes no sense. Why would customers resubmit their applications without completing any remediation work? The answer is they wouldn’t, so clearly there are more factors at work here than just flaw discovery and remediation from one build to the next. So our next step is to look for some factors that we can investigate further.
First, we postulated that subsequent builds were not just the previous build with flaws fixed but rather added new features, thereby introducing new code and associated new flaws. However, customers are not required to indicate whether a new application build contains new features or only security remediations, which means the raw data is not there for us to properly test this hypothesis. Does this suggest that we should stop the investigation and focus on finding a way to obtain such information? Not exactly. We’ll press on and look for additional factors that we can test. As far as collecting more raw data, we can chat with product management to see if it is feasible.
Moving on to look for additional factors, we started looking at two readily accessible factors related to homogeneity in our dataset. The first factor is language. A significant percentage of the applications reviewed by Veracode are composed of multiple components which are not all written in the same language. The implication is that different development teams are working on components written in different languages – which began an internal debate on whether or not to treat these components as individual applications for SOSS analysis. There are pros and cons to both approaches, so again, rather than halt our analysis entirely, we decided to move forward with a dataset that excluded these multi-language applications because it would afford us a more homogeneous set of observations.
A second factor has to do with application “age”. We restrict SOSS reports to a specific interval of time. For example, SOSSv5 covers scans that occurred between 1 Jan 2011 and 30 Jun 2012 – 18 months. This has the effect of creating a sample set that includes applications with the earlier builds removed because they occurred before the cutoff date of 1 Jan 2011. So there are applications in our pool that contributed to the later build statistics but not the earlier build statistics. Not good! We then removed all applications whose first scan did not occur within the SOSSv5 time interval.
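The two homogeneity filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Scan` record, its field names, and the `homogeneous_cohort` function are hypothetical and do not reflect Veracode's actual schema or tooling. The sketch also assumes the input contains each application's full scan history, including scans before the reporting window, since that is the only way to know when the first scan really happened.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Scan:
    """Hypothetical record for one scanned build of an application."""
    app_id: str
    scan_date: date
    languages: frozenset  # languages found in the scanned components
    score: float          # Software Quality Score for this build


# SOSSv5 reporting window, as stated in the text.
WINDOW_START = date(2011, 1, 1)
WINDOW_END = date(2012, 6, 30)


def homogeneous_cohort(all_scans):
    """Return {app_id: [scores in build order]} for apps passing both filters."""
    by_app = {}
    for scan in all_scans:
        by_app.setdefault(scan.app_id, []).append(scan)

    cohort = {}
    for app_id, scans in by_app.items():
        scans.sort(key=lambda s: s.scan_date)

        # Filter 1: drop applications that span more than one language.
        languages = set().union(*(s.languages for s in scans))
        if len(languages) > 1:
            continue

        # Filter 2: drop applications whose first scan falls outside
        # the window, so build 1 is always present in the sample.
        if not (WINDOW_START <= scans[0].scan_date <= WINDOW_END):
            continue

        # Position in this list (plus one) is the build number.
        cohort[app_id] = [s.score for s in scans if s.scan_date <= WINDOW_END]
    return cohort
```

With a cohort built this way, every application that contributes a score at build N also contributes one at builds 1 through N-1, which is what makes build-to-build comparison meaningful.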
Here is what the revised dataset tells us (without beautification by our publication designer):
This result suggests statistically significant improvement from build 1 to 2 and from build 2 to 3. It is not exactly the same pattern as Volume 4; however, the revised dataset makes the comparison more meaningful.
The above describes a typical sequence that happens multiple times for each SOSS report. We perform exploratory analysis to discover lines of investigation that can potentially yield statistically significant results. We look for various relationships, interactions, and correlations. In some cases, as with the SQS results, we discover results are now being skewed by factors we had not considered before. We identify factors that we can investigate and ones that we cannot. We use these discoveries to refine and improve both the data that we collect and our logic for analyzing the data.
Our hope is that this effort, which does take a significant amount of time, will ultimately lead to some predictive models and increased understanding of factors that are the most positive (or negative) influencers of application software security. For each SOSS volume we have retained all of the data and analytical logic used so that we can reapply it to new data, as it becomes available. Additionally, we are constantly reviewing our assumptions, models and opportunities for refinement and improvement. Future blogs will describe our continuing efforts to learn all we can from this incredible data source.