How to Extract Ruby Source Code from Gem Packages with Java

The "gem" file format is a self-contained, standard way to package and distribute Ruby programs and libraries. It is used by RubyGems, the default package manager for Ruby. A Ruby gem is built from a ".gemspec" file, which declares the gem's name, version, dependencies and the files to include. The source code is packaged into the gem along with this metadata. In this article, we will see how to extract the Ruby source code files from a packaged ".gem" file.

###Gem File Format

First, we need to understand the contents of a typical Ruby gem. The ".gem" file is just a standard POSIX tar archive. We can confirm this by running the Unix `file` command as follows:

    Asankhayas-MacBook-Pro:Downloads asankhaya$ file thor-0.19.1.gem
    thor-0.19.1.gem: POSIX tar archive
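
The same check can be approximated from Java without any external tool. The sketch below relies on the fact that a POSIX (ustar) tar header carries the magic string "ustar" at byte offset 257 of its first 512-byte block. The class and method names (`TarMagicCheck`, `looksLikeTar`) are hypothetical, and `readNBytes` assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class TarMagicCheck {
    // In a POSIX ustar header the magic string sits at byte offset 257.
    private static final int MAGIC_OFFSET = 257;
    private static final byte[] MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given header bytes carry the ustar magic. */
    static boolean looksLikeTar(byte[] header) {
        if (header.length < MAGIC_OFFSET + MAGIC.length) {
            return false; // too short to contain the magic field
        }
        byte[] candidate =
                Arrays.copyOfRange(header, MAGIC_OFFSET, MAGIC_OFFSET + MAGIC.length);
        return Arrays.equals(candidate, MAGIC);
    }

    /** Reads the first 512-byte header block of a file and checks it. */
    static boolean looksLikeTar(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeTar(in.readNBytes(512));
        }
    }
}
```

Calling `TarMagicCheck.looksLikeTar(Path.of("thor-0.19.1.gem"))` should then agree with what `file` reports, although `file` itself applies more heuristics than this single check.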

Given that it is a standard tar archive, we can rename the file from ".gem" to ".tar" and extract it. The archive contains three gzip files: checksums.yaml.gz, data.tar.gz and metadata.gz. The checksums.yaml file contains the SHA1 and SHA512 hashes of the other two gzip files:

    SHA1:
      metadata.gz: 3a362ea0b9b3cf1f41649c522ddc312925cf1e47
      data.tar.gz: 9267cf56eb7c014270c8077b1ed4b2c95fcaa7ea
    SHA512:
      metadata.gz: c659d6a5020fa953ec51394089c5a49a4fae9afefc2190a088a8e481a6ac586fbe2c268a47204d66b19ca660468f1e3e87f4e5be9b509faf83fc3992b2b7eb42
      data.tar.gz: fac520f0a428f1cf3ba627b47285c04534e11633599d82c7b15596eb76c1e1762d66b32195e8519252c58a061ccb54e37e8bfa4ba98684c6b0efc8f208cd66f4
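
These hashes can be recomputed in Java to check that a gem has not been corrupted. The following is a minimal sketch using only the JDK's `MessageDigest`. The class and method names (`GemChecksum`, `hexDigest`) are hypothetical, and the sketch assumes the hash is taken over the raw bytes of the compressed file, as the listing above suggests.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of any library.
final class GemChecksum {
    /** Hashes everything readable from the stream and returns lowercase hex. */
    static String hexDigest(String algorithm, InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest: " + algorithm, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Small in-memory sample so the example runs without a real gem.
        InputStream sample =
                new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(hexDigest("SHA-1", sample));
    }
}
```

For a real gem, the stream passed in would be the one positioned on metadata.gz or data.tar.gz, with "SHA-1" or "SHA-512" as the algorithm name, and the result would be compared against the matching line in checksums.yaml.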

The metadata file contains details about the gem such as its version, author, dependencies and list of files. The data.tar.gz file is itself a gzip-compressed tar archive which contains, among other things, the source code of the gem. Decompressing and extracting it shows the following folder structure:

    Asankhayas-MacBook-Pro:data asankhaya$ ls
    bin        lib        spec        Thorfile        thor.gemspec

In this folder structure, the lib directory contains the Ruby source code files, while the bin directory holds the gem's executable scripts. Thus, to extract the source files from the packaged gem, we first extract the gem as a tar archive, then decompress and extract the data.tar.gz archive inside it, and finally pick up the source files from the lib folder. Now that we have a basic understanding of the contents of the gem file, let us see how we can process the ".gem" file and extract Ruby source code using a Java program.
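
Before turning to the source files, note that the metadata file described earlier is gzip-compressed YAML text, so the JDK alone is enough to read it once its bytes are available. The sketch below is illustrative only: `GemMetadata`, `gunzipToString` and `gzip` are hypothetical names, the sample YAML is made up, and `readAllBytes` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of any library.
final class GemMetadata {
    /**
     * Decompresses a gzip stream (such as metadata.gz) into a UTF-8 string.
     * The stream is deliberately not closed, so a caller that is iterating
     * over an enclosing archive can keep using it.
     */
    static String gunzipToString(InputStream compressed) {
        try {
            GZIPInputStream gz = new GZIPInputStream(compressed);
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Gzips a string so that the example runs without a real gem. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the contents of metadata.gz.
        byte[] sample = gzip("name: thor\nversion: 0.19.1\n");
        System.out.print(gunzipToString(new ByteArrayInputStream(sample)));
    }
}
```

The resulting text could then be handed to a YAML parser of your choice to read the version, dependencies and file list.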

###Extracting from the Gem File using Java

To process a '.tar' file in Java we can make use of the Apache Commons Compress library. The library provides TarArchiveInputStream, which can be used to read a tar archive and loop over its entries. Similarly, for the '.gz' file we can use GZIPInputStream from the java.util.zip package to decompress it. The following code snippet shows how to use them for extracting the Ruby source files.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;
    import org.apache.commons.compress.archivers.ArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

    // The statements below belong inside a method that declares "throws IOException"
    File gemFile = new File("thor-0.19.1.gem");
    InputStream gemStream = new FileInputStream(gemFile);
    // Treat the gem file as a tar archive
    TarArchiveInputStream tarGemStream = new TarArchiveInputStream(gemStream);
    ArchiveEntry gemEntry;
    while ((gemEntry = tarGemStream.getNextEntry()) != null) {
        if (gemEntry.getName().equals("data.tar.gz")) {
            // data.tar.gz is a gzip-compressed tar archive
            GZIPInputStream gzStream = new GZIPInputStream(tarGemStream);
            TarArchiveInputStream dataTarStream = new TarArchiveInputStream(gzStream);
            ArchiveEntry sourceEntry;
            while ((sourceEntry = dataTarStream.getNextEntry()) != null) {
                // Look for .rb files in the lib directory
                if (sourceEntry.getName().startsWith("lib/") && sourceEntry.getName().endsWith(".rb")) {
                    // This is a Ruby source file which can be parsed using the JRuby Parser.
                }
            }
        }
    }
    tarGemStream.close();

The only tricky bit to note here is the handling of data.tar.gz: we first wrap the stream in a GZIPInputStream and then pass that to the constructor of TarArchiveInputStream, so that the gzip layer is decompressed before the tar archive inside it is read. While looping over the entries of that inner archive, we select the files under the lib folder whose names end with the .rb extension. This ensures that we process all the Ruby source files in the gem's lib directory. The Ruby source code itself may then be parsed using the JRuby Parser.
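
To make the final step a little more concrete, the sketch below isolates the entry-name check and shows one way to read the text of a matching entry. It assumes that reading from the tar stream returns only the bytes of the current entry and signals end-of-stream at the entry boundary, which is how TarArchiveInputStream behaves. The class and method names (`RubySourceFilter`, `isRubySource`, `readEntry`) and the sample paths are hypothetical, and `readAllBytes` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class RubySourceFilter {
    /** True for names such as "lib/example/util.rb"; false for specs and docs. */
    static boolean isRubySource(String entryName) {
        return entryName.startsWith("lib/") && entryName.endsWith(".rb");
    }

    /**
     * Reads the remaining bytes of the given stream as UTF-8 text.
     * When the stream is a tar stream positioned on an entry, this yields
     * that entry's contents. The stream is left open for the caller.
     */
    static String readEntry(InputStream entryStream) {
        try {
            return new String(entryStream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a tar entry, so the example runs on its own.
        InputStream fake = new ByteArrayInputStream(
                "puts 'hello'".getBytes(StandardCharsets.UTF_8));
        if (isRubySource("lib/example/util.rb")) {
            System.out.println(readEntry(fake));
        }
    }
}
```

Inside the inner loop of the snippet above, the call would look like `if (RubySourceFilter.isRubySource(sourceEntry.getName())) { String code = RubySourceFilter.readEntry(dataTarStream); }`, after which `code` can be passed on to a parser.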

Dr. Asankhaya Sharma is the Director of Software Engineering at Veracode. Asankhaya is a cyber security expert and technology leader with over a decade of experience in creating security products for industry, academia and the open-source community. He is passionate about building high-performing teams and taking innovative products to market. He is also an Adjunct Professor at the Singapore Institute of Technology.