The "gem" file format is a self-contained, standard way to package and distribute Ruby programs and libraries. It is used by RubyGems, the default package manager for Ruby. A gem is built from a ".gemspec" file, which declares the gem's version, its dependencies and the files to include. The source code is packaged into the gem along with this metadata. In this article, we will see how to extract the Ruby source files from a packaged ".gem" file.

### Gem File Format

First, we need to understand the contents of a typical Ruby gem. A ".gem" file is just a standard POSIX tar archive. We can confirm this by running the Unix file command as follows:

    Asankhayas-MacBook-Pro:Downloads asankhaya$ file thor-0.19.1.gem
    thor-0.19.1.gem: POSIX tar archive
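
The same kind of check can be approximated in Java. The POSIX tar format stores the ASCII marker `ustar` at byte offset 257 of the first 512-byte header block, so a rough sketch only needs to look for those bytes. The class name `TarMagic` and its methods are hypothetical helpers written for this illustration, and the check is deliberately simpler than what the file command does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of RubyGems or any library.
final class TarMagic {
    // Offset of the "ustar" marker inside the first tar header block.
    private static final int MAGIC_OFFSET = 257;
    private static final byte[] MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given header bytes carry the "ustar" marker.
    static boolean looksLikeTar(byte[] header) {
        if (header.length < MAGIC_OFFSET + MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[MAGIC_OFFSET + i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first 512 bytes of a file and applies the check above.
    static boolean looksLikeTar(String path) {
        byte[] header = new byte[512];
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            int total = 0;
            int n;
            while (total < header.length
                    && (n = in.read(header, total, header.length - total)) != -1) {
                total += n;
            }
            return total == header.length && looksLikeTar(header);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `TarMagic.looksLikeTar("thor-0.19.1.gem")` would be the counterpart of the file command shown above, under the assumption that the gem was written with a `ustar` header.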

Given that it is a standard tar archive, we can rename the file from ".gem" to ".tar" and extract it like any other archive. Each gem contains three gzip files: checksums.yaml.gz, data.tar.gz and metadata.gz. The checksums.yaml file contains the SHA1 and SHA512 hashes of the other two gzip files:

    SHA1:
      metadata.gz: 3a362ea0b9b3cf1f41649c522ddc312925cf1e47
      data.tar.gz: 9267cf56eb7c014270c8077b1ed4b2c95fcaa7ea
    SHA512:
      metadata.gz: c659d6a5020fa953ec51394089c5a49a4fae9afefc2190a088a8e481a6ac586fbe2c268a47204d66b19ca660468f1e3e87f4e5be9b509faf83fc3992b2b7eb42
      data.tar.gz: fac520f0a428f1cf3ba627b47285c04534e11633599d82c7b15596eb76c1e1762d66b32195e8519252c58a061ccb54e37e8bfa4ba98684c6b0efc8f208cd66f4
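
These hashes can be recomputed to check that the extracted files are intact. The following is a minimal sketch using the standard `java.security.MessageDigest` class; the class name `GemChecksum` and its methods are hypothetical, and the expected value has to be copied by hand from checksums.yaml since the sketch does not parse YAML.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration.
final class GemChecksum {
    // Streams the input through the named digest ("SHA-1" or "SHA-512")
    // and returns the result as a lowercase hex string.
    static String hexDigest(InputStream in, String algorithm) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest: " + algorithm, e);
        }
    }

    // Compares the computed digest with a value copied from checksums.yaml.
    static boolean matches(InputStream in, String algorithm, String expectedHex) {
        return hexDigest(in, algorithm).equalsIgnoreCase(expectedHex.trim());
    }
}
```

Note that the digest is taken over the compressed files as stored in the gem (metadata.gz and data.tar.gz), not over their decompressed contents.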

The metadata file contains details about the gem such as its version, author, dependencies and list of files. The data gzip file is itself a tar archive which contains, among other things, the source code of the gem. Decompressing and extracting it shows the following folder structure:

    Asankhayas-MacBook-Pro:data asankhaya$ ls
    bin        spec    Thorfile    lib        thor.gemspec
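
The metadata file mentioned above is a gzip-compressed YAML document, so it can be read with the standard library alone. The sketch below decompresses such a stream into a string; `GemMetadata` is a hypothetical class name, UTF-8 is assumed as the text encoding, and the `gzip` helper exists only to build sample input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration.
final class GemMetadata {
    // Decompresses a gzip stream (such as metadata.gz) and returns its text.
    // Assumes UTF-8 content. Closes the given stream when done.
    static String gunzipToString(InputStream compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(compressed)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses text with gzip; only used to create sample input.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With metadata.gz extracted to disk, passing a `FileInputStream` for that file to `gunzipToString` would return the YAML text, which can then be inspected or handed to a YAML parser.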

In this folder structure, the lib directory contains the Ruby source files, while the bin directory holds the gem's executable scripts (Ruby gems ship scripts here, not compiled binaries). Thus, to extract the source files from a packaged gem, we first extract the gem as a tar archive, then extract the data.tar.gz archive inside it, and finally read the source files from the lib folder. Now that we have a basic understanding of the contents of a gem file, let us see how to process a ".gem" file and extract the Ruby source code using a Java program.

### Extracting from the Gem File using Java

To process a '.tar' file in Java we can use the Apache Commons Compress library. It provides TarArchiveInputStream, which reads a tar archive and lets us loop over its entries. Similarly, for the '.gz' file we can use GZIPInputStream from the java.util.zip package to decompress the archive. The following code snippet shows how to combine them to find the Ruby source files.

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import org.apache.commons.compress.archivers.ArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

    File gemFile = new File("thor-0.19.1.gem");
    // Treat the gem file as a tar archive; try-with-resources closes the streams
    try (InputStream gemStream = new FileInputStream(gemFile);
         TarArchiveInputStream tarGemStream = new TarArchiveInputStream(gemStream)) {
        ArchiveEntry gemEntry;
        while ((gemEntry = tarGemStream.getNextEntry()) != null) {
            if (gemEntry.getName().equals("data.tar.gz")) {
                // data.tar.gz is a gzip-compressed tar archive
                GZIPInputStream gzStream = new GZIPInputStream(tarGemStream);
                TarArchiveInputStream dataTarStream = new TarArchiveInputStream(gzStream);
                ArchiveEntry sourceEntry;
                while ((sourceEntry = dataTarStream.getNextEntry()) != null) {
                    // Look for .rb files in the lib directory
                    if (sourceEntry.getName().startsWith("lib/") && sourceEntry.getName().endsWith(".rb")) {
                        // This is a Ruby source file which can be parsed using the JRuby Parser.
                    }
                }
            }
        }
    }

The only tricky bit is that data.tar.gz has two layers: we first wrap the stream in a GZIPInputStream to decompress the .gz, and then pass that to the constructor of TarArchiveInputStream to read the .tar inside it. Once we have access to the inner archive, we loop over its entries and pick those under the lib folder whose names end with the .rb extension. This ensures that we process all the Ruby source files under lib. The Ruby source code itself may then be parsed using the JRuby Parser.
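
To go from a matching entry to its actual source text, the entry's bytes have to be read from the archive stream. The sketch below separates the two pieces: the name check described above and a reader that drains a stream without closing it. It relies on the assumption that the tar stream reports end-of-stream at the end of the current entry, which is how Commons Compress archive streams expose entry data. `RubyEntries` is a hypothetical class name, UTF-8 is assumed for the source files, and only the standard library is used so that the helper works with any `InputStream`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
final class RubyEntries {
    // True for entry names under lib/ that end with the .rb extension.
    static boolean isRubySource(String entryName) {
        return entryName.startsWith("lib/") && entryName.endsWith(".rb");
    }

    // Reads until the stream reports end-of-data and decodes the bytes as UTF-8.
    // The stream is deliberately left open: closing it inside the entry loop
    // would also close the enclosing archive.
    static String readText(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside the inner loop of the earlier snippet, the condition could then become `RubyEntries.isRubySource(sourceEntry.getName())`, and `RubyEntries.readText(dataTarStream)` would return the contents of the current Ruby file before the loop moves on to the next entry.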

Mark Curphey, Vice President, Strategy

Mark Curphey is the Vice President of Strategy at CA Veracode. Mark is the founder and CEO of SourceClear, a software composition analysis solution designed for DevSecOps, which was acquired by CA Technologies in 2018. In 2001, he founded the Open Web Application Security Project (OWASP), a non-profit organization known for its Top 10 list of Most Critical Web Application Security Risks. Mark moved to the U.S. in 2000 to join Internet Security Systems (acquired by IBM), and later held roles including director of information security at Charles Schwab, vice president of professional services at Foundstone (acquired by McAfee), and principal group program manager, developer division, at Microsoft. Born in the UK, Mark received his B.Eng in Mechanical Engineering from the University of Brighton, and his Masters in Information Security from Royal Holloway, University of London. In his spare time, he enjoys traveling and cycling.
