After nearly two years working on the SRC:CLR product, we have come to think about open-source security by answering a set of questions.
What open-source do you have in your code base? On the face of it, component discovery is easy, but under the hood it is hard to do right. If you are using build coordinates like Maven or npm, you need to know how the build systems actually work, how they optimize things like local caches, and how they really process and make choices about versioning. You have to know how transitive dependencies really work and build a complete dependency graph, or all you are really doing is grep. We did some analysis last week of an open-source tool that uses CPE values and found, for example, that where we identified 93 components it identified 24. When discovery is then used to map against vulnerabilities, that inaccuracy propagates. When we raised our venture funding, we used a popular commercial license-checking tool and were shocked at what it missed as well (as bad as the open-source tool above). Today we use an algorithm that works from build coordinates and examines both source and binaries (bytecode). Our system knows about versions, so we can compute how far behind in versioning you are (patch latency), and as a neat side effect you can do taint analysis on copyleft licensing. You will be amazed (or not) at how many different versions of the same component are typically in use across an organization.
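To make the gap between grep and real discovery concrete, here is a minimal sketch in Python of walking a transitive dependency graph. The package names, versions, and graph are invented for illustration; real resolution against Maven or npm metadata is far messier than this toy.

```python
from collections import deque

# Toy dependency graph: caller -> direct dependencies.
# All coordinates below are illustrative, not real build metadata.
DEPENDS_ON = {
    "app":                 ["web-framework:2.1", "logging-lib:1.4"],
    "web-framework:2.1":   ["http-client:3.0", "template-engine:1.2"],
    "http-client:3.0":     ["codec-utils:1.9"],
    "template-engine:1.2": ["codec-utils:1.9"],
    "logging-lib:1.4":     [],
    "codec-utils:1.9":     [],
}

def transitive_dependencies(root):
    """Breadth-first walk of the full graph, deduplicating shared nodes."""
    seen, queue = set(), deque(DEPENDS_ON.get(root, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue
        seen.add(dep)
        queue.extend(DEPENDS_ON.get(dep, []))
    return seen

direct = set(DEPENDS_ON["app"])          # what a grep of the build file sees
full = transitive_dependencies("app")    # what the build actually pulls in
print(len(direct), len(full))            # 2 direct vs 5 total
```

A text search of the build file reports two components; the graph walk finds five, including `codec-utils`, which two different dependencies pull in. Vulnerability mapping done against the grep view misses exactly those shared transitive components.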
[RANT: While I am on the topic, it's about time people stopped calling this a BOM or Bill of Materials. We aren't building ships or trains, and waterfall development was so ten years ago... dinosaurs!]
Where did it come from? Despite some best efforts, the vast majority of open-source is effectively anonymous. Java has a jar-signing specification where you can use a valid Class 3 signing key and certificate to prove identity, but sadly the popular distribution site uses PGP, and knowing whether something came from [email protected] or from the Apache Software Foundation is a big deal. Worse, most components are either unsigned or self-signed. Knowing where your components came from is critical because it enables you to make decisions based on trust.
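The weakest useful form of this check can be sketched in a few lines of Python with `hashlib`: verify a downloaded artifact against a digest published out of band. The artifact bytes and digest here are invented for illustration, and note the limitation stated in the comment: a checksum proves integrity, while jar signing and PGP go further by binding that digest to an identity.

```python
import hashlib
import hmac

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Pretend this jar arrived from a mirror. The expected digest would come
# from a trusted, out-of-band source; these bytes are illustrative.
artifact = b"fake jar bytes for illustration"
expected = sha256_of(artifact)  # in reality, published by the project

def verify(data: bytes, expected_digest: str) -> bool:
    # Integrity only, not identity: this proves the bytes were not
    # tampered with in transit, but says nothing about who produced them.
    return hmac.compare_digest(sha256_of(data), expected_digest)

print(verify(artifact, expected))      # True
print(verify(b"tampered", expected))   # False
```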
Which components have vulnerabilities? We have written about tracking sources of data and mapping disclosed vulnerabilities to components, and while that matters, it turns out those disclosed vulnerabilities are the tip of the iceberg. Using modern data science and machine learning, we examine all the components and are able to identify far more components with issues than have previously been reported. We tell you where all the skeletons are buried!
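The mapping step itself reduces to matching a discovered component against advisory version ranges. The advisory data and component names below are invented for illustration; real feeds are messier and need careful name normalization, which is exactly why the discovery accuracy above matters.

```python
# Toy advisory database: component name -> list of vulnerable
# version ranges [lo, hi). All data here is invented for illustration.
ADVISORIES = {
    "codec-utils": [((1, 0), (2, 0))],   # vulnerable in [1.0, 2.0)
}

def parse_version(v: str):
    return tuple(int(part) for part in v.split("."))

def is_vulnerable(name: str, version: str) -> bool:
    v = parse_version(version)
    return any(lo <= v < hi for lo, hi in ADVISORIES.get(name, []))

# Components produced by the discovery step:
components = [("codec-utils", "1.9"), ("logging-lib", "1.4")]
flagged = [(n, v) for n, v in components if is_vulnerable(n, v)]
print(flagged)  # [('codec-utils', '1.9')]
```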
Are you using the vulnerable parts? I am using a vulnerable component, so what? A fair question. Luckily, people on our team have experience building interprocedural static-analysis tools, so we have built a way to tag the vulnerable methods (or the way the exploit is instantiated) and determine whether your code ever calls the component in the vulnerable way. If you followed Heartbleed last year, you will know that the real risk was limited to a few method calls. Vulnerable-method analysis lets you determine whether you need to act now or can do spring-cleaning when it's convenient.
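The core question, stripped of the hard static-analysis work, is reachability over a call graph. The sketch below is a toy illustration of that idea, not the product's engine: the method names and edges are invented, and building an accurate call graph for a real language is the genuinely difficult part.

```python
from collections import deque

# Toy call graph: caller -> callees. All method names are invented.
CALLS = {
    "app.main":        ["app.render", "lib.parse"],
    "app.render":      ["lib.format"],
    "lib.parse":       ["lib.unsafe_eval"],   # the tagged vulnerable method
    "lib.format":      [],
    "lib.unsafe_eval": [],
}

VULNERABLE = {"lib.unsafe_eval"}

def reaches_vulnerable(entry: str) -> bool:
    """Is any tagged vulnerable method reachable from this entry point?"""
    seen, queue = set(), deque([entry])
    while queue:
        method = queue.popleft()
        if method in VULNERABLE:
            return True
        if method in seen:
            continue
        seen.add(method)
        queue.extend(CALLS.get(method, []))
    return False

print(reaches_vulnerable("app.main"))    # True:  main -> parse -> unsafe_eval
print(reaches_vulnerable("app.render"))  # False: render never reaches it
```

If no entry point reaches a tagged method, the vulnerable component is present but the vulnerable path is not exercised, which is what separates "act now" from "spring-cleaning".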
What do (or could) the components actually do inside your code base? Surprise, surprise: not everything is what it says it is. You know that logging component that logs your events? Guess what, it calls home and sends your data with it! Did you know that the image-manipulation library actually has full access to the entire local filesystem, or that the single-sign-on library stores your usernames and passwords in a magic text file that it later sends back to a command-and-control host? It turns out reusable code means reusable backdoors, and the bad guys figured this out a while back. It's a lot less effort to backdoor a component and have millions of developers install it than it is to hack a million web sites! We have seen desktop and mobile malware, and I am here to tell you that open-source component malware will be popular and running freely inside your business before long.
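One crude way to frame this is as a capability check: does the component reference APIs outside its stated purpose? The sketch below is a deliberately naive text scan over an invented Java-like snippet; the patterns and example are illustrative only, and serious analysis would work on bytecode, not source text.

```python
import re

# Crude capability scan: flag references to networking or filesystem
# APIs. Patterns and the snippet below are invented for illustration.
CAPABILITY_PATTERNS = {
    "network": re.compile(
        r"\b(socket|HttpURLConnection|URLConnection)\b", re.IGNORECASE),
    "filesystem": re.compile(
        r"\b(FileInputStream|FileOutputStream|Files\.)", re.IGNORECASE),
}

def capabilities(source: str):
    return sorted(name for name, pat in CAPABILITY_PATTERNS.items()
                  if pat.search(source))

# A "logging" component that quietly phones home:
suspicious_logger = """
    void log(String event) {
        buffer.add(event);
        new java.net.Socket(remoteHost, 443);  // why does a logger need this?
    }
"""
print(capabilities(suspicious_logger))  # ['network']
```

A logger that needs network capability is not necessarily malicious, but it is exactly the kind of mismatch between stated purpose and observed capability worth surfacing for review.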
What secure coding practices do my components follow? One of our very early adopters took security really seriously. Among other things, they banned MD5 and symmetric crypto in CBC mode; some government standards like FIPS 140 require this. Imagine their surprise when it turned out their components didn't get the message. Many components doing things like logging, storing timestamps, or you-name-it do what they do using poor security practices. Being able to look inside components and generate data about the crypto being used, dangerous APIs being called, use of native interfaces like C, and much more is very powerful.
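That policy check can be sketched as a scan for banned crypto constructions. The patterns below target the standard Java `MessageDigest`/`Cipher` call shapes, but the component source is invented, and a production check would inspect bytecode rather than matching text.

```python
import re

# Policy: flag MD5 digests and CBC-mode cipher transformations.
# The example source below is invented for illustration.
BANNED = {
    "MD5": re.compile(r'MessageDigest\.getInstance\(\s*"MD5"\s*\)'),
    "CBC mode": re.compile(r'Cipher\.getInstance\(\s*"[^"]*/CBC/[^"]*"\s*\)'),
}

def violations(source: str):
    return sorted(name for name, pat in BANNED.items() if pat.search(source))

component_source = '''
    MessageDigest md = MessageDigest.getInstance("MD5");
    Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
'''
print(violations(component_source))  # ['CBC mode', 'MD5']
```

Running the same check over a component using `"AES/GCM/NoPadding"` and SHA-256 reports nothing, which is the point: the policy lives in one place and is applied to code your team never wrote.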
If you use GitHub.com, GitHub Enterprise or Atlassian Stash with Java, Ruby on Rails or Node.js, you can sign up for the private beta now at https://srcclr.com. More languages and integration points will follow very soon.