SGL: Mapping the open-source genome for fun and profit

For a long time we have known that the current state of the art in vulnerability research for open-source code does not scale. Today, individual security researchers look at specific bits of code and report potential issues to a central vulnerability database as textual descriptions. If a report is accepted (after some basic validation), it is republished to the world as a CVE.

While discovering issues in an ad-hoc manner and maintaining a public database of vulnerabilities comes from a genuine and good place, the approach is fraught with fundamental problems when dealing with open-source code: the findings lack precision and accuracy, dynamic global dependency relationships cannot be understood, and related vulnerabilities in other pieces of code cannot be surfaced. The result is a woefully inaccurate and incomplete data set on which the world now relies when doing vulnerability assessments.

Let's take CVE-2017-1000034 as an example.

Akka versions <=2.4.16 and 2.5-M1 are vulnerable to a java deserialization attack in its Remoting component resulting in remote code execution in the context of the ActorSystem.

The CVE description simply says that the Akka framework is affected by a deserialization vulnerability. It provides no details about the underlying component that is affected or the attack vectors that can be used to exploit the issue. Because we have dedicated security researchers, we know that it is the akka-actor library that is affected, and we will show later that, using SGL, we can see that the vulnerable library is used by 357 other libraries.

After understanding the problem and spending time thinking about how best to advance the state of the art, we came to theorize that it could be approached in much the same way that scientists have advanced the life sciences with genetic engineering. For instance, the Human Genome Project was an international scientific research project with the goal of determining the sequence of nucleotide base pairs that make up human DNA, and of identifying and mapping all of the genes of the human genome from both a physical and a functional standpoint.

We theorized that we could map the world's open-source code, creating representations of each bit of code and how it relates to other bits of code. We knew it was ambitious, but in doing so we would essentially create a global dependency graph and a global call graph, as well as a set of abstractions that could be used to look for patterns, relationships and dependencies. A respected security researcher told me I was a madman, but I like a challenge and knew we had a world-class team of product R&D engineers who have built static analysis tools and call graph generators. In a nod to the famous security paper "Smashing the Stack for fun and profit", I saw an opportunity to have an impact, and so: "challenge accepted".

After a year of R&D led by Dr. Asankhaya Sharma and his team, we are pleased to announce that our theory has proven to be significant. We now have a graph database describing the world's open-source Java code, with over 63 million nodes and 420 million edges, that we can query to find new vulnerabilities that were previously hidden, invisible or undiscovered. We are now in the process of adding support for other languages, including JavaScript, Ruby, Python, Go, PHP and C/C++, as well as working to add additional data such as binary representations, code similarity signatures and commit logs.

This technology, which includes a domain-specific language for finding vulnerabilities called the Security Graph Language, or SGL, is now being embedded into the SourceClear platform so that we can accurately surface the right data at the right time to our customers, providing a level of analysis that has not been seen before.

We also think the technology can benefit the wider open-source and security communities, and with a desire to give back, we have decided to open-source the language specification and a reference architecture to encourage others to explore and embrace our work. In addition, in early 2018 we will be opening up a researcher program that will allow qualified security researchers to use the technology and our infrastructure to hunt for new security issues.

Designing the Security Graph Language

To aid large-scale analysis of open-source code, we needed to design a language that would let us capture different bug patterns in a concise manner. SGL is designed as a graph query language and builds on top of Apache TinkerPop, a graph computing framework. SGL queries are compiled to Gremlin and thus have support for both OLTP and OLAP processing.

The domain of SGL can be conceptually understood as a property graph, with vertices and edges representing entities and the relationships between them.

SGL operates over a graph whose vertices represent libraries, vulnerabilities, class names, and method names. The edges capture underlying relationships, such as dependencies and method calls, that can be used to describe instances of vulnerabilities. This enables us to express sophisticated questions like "How many libraries are affected indirectly because they themselves use a known vulnerable library?" or "How many libraries have a call chain to a particular method?".
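As a rough illustration only (this is not how SGL or TinkerPop is implemented), a property graph of this shape can be modeled in a few lines of Python. The `PropertyGraph` class, the vertex ids and the property values below are all made up for the sketch; only the edge and vertex labels are borrowed from the queries shown later in this post.

```python
from collections import defaultdict


class PropertyGraph:
    """Toy in-memory property graph: labeled vertices with properties, labeled directed edges."""

    def __init__(self):
        self.vertices = {}              # vertex id -> (label, properties dict)
        self.edges = defaultdict(list)  # (source id, edge label) -> list of target ids

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = (label, props)

    def add_edge(self, src, label, dst):
        self.edges[(src, label)].append(dst)

    def with_label(self, label):
        """All vertex ids carrying the given label, in insertion order."""
        return [vid for vid, (lbl, _) in self.vertices.items() if lbl == label]

    def out(self, vid, label):
        """Targets of the edges with the given label that leave one vertex."""
        return list(self.edges.get((vid, label), []))


# Hypothetical data: two libraries, one vulnerability, one method.
g = PropertyGraph()
g.add_vertex("lib:parser:1.0", "library", language="java", version="1.0")
g.add_vertex("lib:webapp:3.2", "library", language="java", version="3.2")
g.add_vertex("vuln:demo-1", "vulnerability", identity="demo-1")
g.add_vertex("method:Parser.parse", "method",
             class_name="com/example/Parser", method_name="parse")

g.add_edge("lib:parser:1.0", "has_vulnerability", "vuln:demo-1")
g.add_edge("lib:parser:1.0", "has_method", "method:Parser.parse")
g.add_edge("lib:webapp:3.2", "depends_on", "lib:parser:1.0")

# Libraries that have at least one outgoing has_vulnerability edge.
vulnerable = [lib for lib in g.with_label("library")
              if g.out(lib, "has_vulnerability")]

# Libraries with a direct dependency on one of those vulnerable libraries.
direct_dependents = [lib for lib in g.with_label("library")
                     if any(dep in vulnerable for dep in g.out(lib, "depends_on"))]

print(vulnerable)
print(direct_dependents)
```

Keeping edges keyed by `(source, label)` makes each traversal step a single dictionary lookup, which is roughly what a graph query language chains together when it walks from vertex to vertex.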

Examples

To give you a flavor of the language let's consider a few example queries written in SGL.

In order to check all the libraries in the graph we can just do the following:

library(language: 'java')
# returns the list of all libraries in the graph

Similarly, to check all the vulnerabilities in the graph we can run the following query:

vulnerability(_)
# shows all the vulnerabilities in the graph

Now, to list all the vulnerable libraries:

library(_) where(has_vulnerability)
# returns only those libraries that have a known vulnerability

We can also query details about a particular library, say spring-web:

let spring_web = library(language: 'java', coord1: 'org.springframework',
 coord2: 'spring-web', version: '4.1.6.RELEASE') in spring_web
let spring_web_classes = spring_web has_class in spring_web_classes
# returns all the classes that are defined in the library

let spring_web_methods = spring_web has_method in spring_web_methods
spring_web_methods count
# returns all methods in the library

If you are curious as to how many libraries call Runtime.exec():

method(class_name:'java/lang/Runtime', method_name:'exec')
called_by method_in_library

Surprisingly, that returns 11k+ library versions that call exec, and I bet a case of beer that at least one is malware.

Finally, if you want to check the state of open-source security you can query for all libraries that are directly or indirectly affected by any known vulnerability:

vulnerability(_) has_version_range has_library
union(identity, embedded_in*, dependent_on*)

That query alone really shows the power of SGL. Before SGL, we knew of around 1,900 libraries that were affected by disclosed vulnerabilities. After SGL, we know of over 6,500 libraries that are actually affected by those same vulnerabilities.
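To make the idea behind that query concrete, here is a minimal Python sketch of "directly or indirectly affected" as a graph search. It assumes that a starred step such as `dependent_on*` means "follow that edge one or more times" and that the edge runs from a library to the libraries that depend on it. Both are our reading for the purpose of illustration, not a statement of SGL's specification. The library names are hypothetical.

```python
from collections import deque


def reachable(starts, edges):
    """Vertices reachable from `starts` in one or more hops along `edges`.

    `edges` maps a vertex to the vertices one hop away from it.
    """
    found = set()
    queue = deque(starts)
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, ()):
            if nxt not in found:  # also guards against cycles
                found.add(nxt)
                queue.append(nxt)
    return found


# Hypothetical edges, read as: library -> libraries that depend on / embed it.
dependent_on = {
    "core:1.0": ["web:2.0", "cli:1.3"],
    "web:2.0": ["app:5.1"],
}
embedded_in = {
    "core:1.0": ["bundle:0.9"],
}

directly_affected = {"core:1.0"}

# Analogue of union(identity, embedded_in*, dependent_on*):
# the starting set itself, plus each closure taken separately.
affected = (
    directly_affected
    | reachable(directly_affected, embedded_in)
    | reachable(directly_affected, dependent_on)
)

print(len(directly_affected), len(affected))
```

In this toy graph a single vulnerable library grows into five affected ones once indirect use is counted, which is the same effect, at a much smaller scale, as the jump from 1,900 to 6,500 described above.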

We can also see if a particular vulnerable library is used by other libraries. Using the SGL query below we can look for libraries that themselves use the vulnerable version of akka-actor.

vulnerability(identity: "2017-1000034") has_version_range has_library union(dependent_on*, embedded_in*)
# 357 vulnerable libraries found

Sure enough, there are 357 other libraries in the result set. We can of course now use SGL to chain the query to include actual calls to the vulnerable method for precise results.

Similarly, last year a significant vulnerability, SID-1847, was published in the Apache Commons Collections (ACC) library that could lead to remote code execution. While there has been a lot of awareness of the dangers of using this interface, using SGL we can instantly see that there are 184 libraries still using the vulnerable method of the vulnerable version of ACC, including a popular directory server that, if exploited, could lead to significant data exposures.

vulnerability(identity: "1847") has_version_range has_vulnerable_method
called_by* method_in_library where(union(embeds*, depends_on*) library_in_version_range
version_range_in_vulnerability vulnerability(identity: "1847"))
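The query above combines two conditions: a call chain that reaches the vulnerable method, and a dependency on a version of the library that is actually in the vulnerable range. A small Python sketch of that combination follows. All method names, library names and versions are invented, method names are shared across library versions as a simplification, and the starred steps are again assumed to mean "one or more hops".

```python
from collections import deque


def reachable(starts, edges):
    """Vertices reachable from `starts` in one or more hops along `edges`."""
    found = set()
    queue = deque(starts)
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, ()):
            if nxt not in found:
                found.add(nxt)
                queue.append(nxt)
    return found


# Hypothetical call graph stored as callee -> direct callers (the called_by direction).
called_by = {
    "vuln/Lib.unsafe": ["a/Service.run", "b/Job.execute"],
    "a/Service.run": ["a/Main.main"],
}
method_in_library = {
    "a/Service.run": "app-a:1.0",
    "a/Main.main": "app-a:1.0",
    "b/Job.execute": "app-b:1.0",
}
depends_on = {
    "app-a:1.0": ["vuln-lib:1.0"],  # inside the vulnerable version range
    "app-b:1.0": ["vuln-lib:2.0"],  # already on a patched version
}
vulnerable_versions = {"vuln-lib:1.0"}

# Step 1: every method with a call chain to the vulnerable method.
callers = reachable(["vuln/Lib.unsafe"], called_by)

# Step 2: the libraries those methods live in.
candidate_libraries = {method_in_library[m] for m in callers}

# Step 3: keep only libraries that (transitively) depend on a vulnerable version.
confirmed = sorted(
    lib for lib in candidate_libraries
    if reachable([lib], depends_on) & vulnerable_versions
)

print(confirmed)
```

Here `app-b` calls the same method but depends on a patched version, so only `app-a` is reported. Filtering on both conditions is what keeps the result set precise.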

SGL goes beyond vulnerabilities and can help detect patterns of bugs at scale. For example, consider the following Java code snippet, which is used to prevent XXE attacks in an application.

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);

Based on this, we can easily construct an SGL query that looks for potential XXE issues in Java:

let xml_new = method(class_name:'javax/xml/parsers/DocumentBuilderFactory', method_name: 'newInstance') in
let xml_set_feature = method(class_name:'javax/xml/parsers/DocumentBuilderFactory', method_name:'setFeature') in
let results = xml_new called_by not(calls xml_set_feature) in
results method_in_library
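The pattern in that query is "methods that call A but never call B". As a minimal sketch of the same filter over a toy call graph, the Python below uses hypothetical caller names and libraries. Only the two `DocumentBuilderFactory` method names come from the query above.

```python
DBF_NEW = "javax/xml/parsers/DocumentBuilderFactory.newInstance"
DBF_SET_FEATURE = "javax/xml/parsers/DocumentBuilderFactory.setFeature"

# Hypothetical call graph: caller -> set of methods it calls directly.
calls = {
    "com/example/SafeParser.parse": {DBF_NEW, DBF_SET_FEATURE},
    "com/example/UnsafeParser.parse": {DBF_NEW},
    "com/example/Logger.log": set(),
}
method_in_library = {
    "com/example/SafeParser.parse": "safe-lib:1.0",
    "com/example/UnsafeParser.parse": "unsafe-lib:1.0",
    "com/example/Logger.log": "safe-lib:1.0",
}


def callers_without(call_graph, required, guard):
    """Callers that invoke `required` directly but never invoke `guard` directly."""
    return sorted(
        caller
        for caller, callees in call_graph.items()
        if required in callees and guard not in callees
    )


flagged_methods = callers_without(calls, DBF_NEW, DBF_SET_FEATURE)
flagged_libraries = sorted({method_in_library[m] for m in flagged_methods})

print(flagged_methods)
print(flagged_libraries)
```

A match is only a candidate: a method flagged this way might configure the factory in a helper, or never parse untrusted XML at all, which is why the results are best treated as potential issues to review.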

There are several exciting things happening behind the scenes that will help us work with the community to clean the upstream distributions of open-source so that we can all continue to embrace the good stuff.

You can sign up to get notified about SGL at https://www.sourceclear.com/sgl or mail us at [email protected] if you want to talk.

By Mark Curphey

Mark Curphey, Vice President, Strategy
Mark Curphey is the Vice President of Strategy at Veracode. Mark is the founder and CEO of SourceClear, a software composition analysis solution designed for DevSecOps, which was acquired by CA Technologies in 2018. In 2001, he founded the Open Web Application Security Project (OWASP), a non-profit organization known for its Top 10 list of Most Critical Web Application Security Risks.
Mark moved to the U.S. in 2000 to join Internet Security Systems (acquired by IBM), and later held roles including director of information security at Charles Schwab, vice president of professional services at Foundstone (acquired by McAfee), and principal group program manager, developer division, at Microsoft.
Born in the UK, Mark received his B.Eng, Mechanical Engineering from the University of Brighton, and his Masters in Information Security from Royal Holloway, University of London. In his spare time, he enjoys traveling, and cycling.

