Hacking RubyGems servers

While I was writing "Serving ruby gems, the paranoid way", I gradually became interested in the rubygems API and especially in its implementation in both gem server and geminabox. This is how I started a new journey inside the internals of these two programs.

Rubygems API basics

What would the simplest server that complies with the Rubygems API look like? Short answer: just look at the gem server source code!

gem server is pretty small: about 1000 lines contained within two Ruby files. Nevertheless, it is a drop-in replacement for the server software that runs the official rubygems.org repository. At the very beginning of the source file, a comment describes the routes understood by the API server:

# gem_server starts an HTTP server on the given port and serves the following:
# * "/" - Browsing of gem spec files for installed gems
# * "/specs.#{Gem.marshal_version}.gz" - specs name/version/platform index
# * "/latest_specs.#{Gem.marshal_version}.gz" - latest specs
#   name/version/platform index
# * "/quick/" - Individual gemspecs
# * "/gems" - Direct access to download the installable gems
# * "/rdoc?q=" - Search for installed rdoc documentation

The gem browser is targeted at the users. It renders HTML pages and is not part of the API. Same thing for the online rdoc documentation. So in the end, the proper API can fit in only 4 routes.

To go further, let’s assume that some defaults are hardcoded in our Rubygems client:

  • specs are serialized using marshal version 4.8
  • specs response is gzipped

And now, the API is reduced to this:

  • /specs.4.8.gz to get all the specs
  • /latest_specs.4.8.gz to get the latest specs
  • /quick/Marshal.4.8 to get the specification of a given gem
  • /gems to fetch a given gem as a .gem package

The last 2 routes are generic since they apply to any gem. Let’s illustrate with the rack gem version 1.5.2:

  • /quick/Marshal.4.8/rack-1.5.2.gemspec.rz to get the specs
  • /gems/rack-1.5.2.gem to fetch the gem package

More on the .rz extension later on.
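To make the URL scheme concrete, here is a minimal sketch that builds the four paths for a given gem. The api_paths helper and the MARSHAL_VERSION constant are made up for this illustration; they are not part of Rubygems.

```ruby
# Hardcoded marshal version, as assumed in the text above.
MARSHAL_VERSION = '4.8'

# Hypothetical helper: returns the API paths relevant to one gem version.
def api_paths(name, version)
  full_name = "#{name}-#{version}"
  {
    specs:        "/specs.#{MARSHAL_VERSION}.gz",
    latest_specs: "/latest_specs.#{MARSHAL_VERSION}.gz",
    gemspec:      "/quick/Marshal.#{MARSHAL_VERSION}/#{full_name}.gemspec.rz",
    gem:          "/gems/#{full_name}.gem"
  }
end

api_paths('rack', '1.5.2').each do |route, path|
  puts "#{route}: #{path}"
end
```

The first two paths do not depend on the gem at all, which is why only the last two were called generic above.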

That was a brief introduction to the API and I’ve skipped some details, including the pre-release specs, the legacy API, and the not-so-new dependency API. But this is just a starting point.

The built-in gem server

The built-in gem server is implemented as a normal Rubygems plugin, with two components: the command-line front end and the server itself.

The server is about 800 lines of code and relies heavily on Gem::Specification to operate. It’s based on the venerable webrick webserver, but we could easily convert its code to sinatra. Here is how it responds to the /specs.* route:

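# Simplified sinatra-style sketch; the gzip step applied for the .gz route is left out.
get '/specs' do
  specs = Gem::Specification.specs

  # Build the name/version/platform index, one tuple per installed gem.
  specs_array = specs.sort.map do |spec|
    platform = spec.original_platform || Gem::Platform::RUBY
    [spec.name, spec.version, platform]
  end

  Marshal.dump(specs_array)
end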

Same thing goes for /latest_specs, except that the code delegates to Gem::Specification.latest_specs.
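To see what travels over the wire for these index routes, here is a small round trip under a few assumptions: the two helper names are invented for this post, the gem tuples are sample data, and Zlib.gzip / Zlib.gunzip require Ruby 2.4 or later. It is a sketch of the format, not the code used by gem server.

```ruby
require 'rubygems'
require 'zlib'

# Hypothetical server-side helper: marshal the tuples, then gzip them.
def encode_specs_index(tuples)
  Zlib.gzip(Marshal.dump(tuples))
end

# Hypothetical client-side helper: gunzip, then unmarshal.
def decode_specs_index(body)
  Marshal.load(Zlib.gunzip(body))
end

# Sample [name, version, platform] tuples, shaped like the index entries.
tuples = [
  ['rack',    Gem::Version.new('1.5.2'), 'ruby'],
  ['sinatra', Gem::Version.new('1.4.3'), 'ruby']
]

body = encode_specs_index(tuples)
decode_specs_index(body).each do |name, version, platform|
  puts "#{name} #{version} (#{platform})"
end
```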

The /quick route resolves a spec from the URL path, and exports it using Gem.deflate:

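get '/quick' do
  # name, version and platform all come from the URL path (parsing omitted here)
  specs = Gem::Specification.find_all_by_name name, version
  platform = extract_platform_from_url
  specs.select! { |s| s.platform == platform }

  # Marshal the first match and deflate it for the .rz response
  Gem.deflate(Marshal.dump(specs.first))
end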

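The marshaled spec is then compressed using Zlib.deflate, from the ubiquitous zlib library. This is what the .rz extension mentioned earlier denotes: zlib-deflated data, as opposed to the gzip format used for the .gz indexes.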

Notice that the server only returns the first matching spec. But before that, it makes sure that the search does not return multiple matches.
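The encoding used by the /quick route can be tried out locally. In this sketch the helper names and the demo gem are invented; only the Marshal and Zlib calls come from the standard library.

```ruby
require 'rubygems'
require 'zlib'

# Hypothetical helper: marshal a spec and deflate it, as done for .gemspec.rz files.
def pack_gemspec(spec)
  Zlib::Deflate.deflate(Marshal.dump(spec))
end

# Hypothetical helper: what a client has to do to get the spec back.
def unpack_gemspec(data)
  Marshal.load(Zlib::Inflate.inflate(data))
end

spec = Gem::Specification.new('demo', '0.0.1') # made-up gem, for illustration only
copy = unpack_gemspec(pack_gemspec(spec))
puts "#{copy.name} #{copy.version}"
```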

The /gems route exports all the .gem files that are known to the server. It is implemented by scanning the cache directory in $GEM_HOME. On my Ubuntu box, the /var/lib/gems/1.9.1/cache directory contains all the gems installed.
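Scanning such a cache directory takes only a few lines. This is a sketch, not the gem server code: cached_gem_files is a name I made up, and a temporary directory stands in for the real cache.

```ruby
require 'rubygems'
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: list the .gem packages found in a cache directory.
def cached_gem_files(cache_dir)
  Dir.glob(File.join(cache_dir, '*.gem')).map { |path| File.basename(path) }.sort
end

# On a real installation the directory would typically be File.join(Gem.dir, 'cache').
# Here a throwaway directory with one fake package stands in for it.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'rack-1.5.2.gem'))
  FileUtils.touch(File.join(dir, 'README.txt'))
  puts cached_gem_files(dir).inspect
end
```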

Geminabox

geminabox is much more complex than the gem server command as it can both serve existing gems and publish new ones. The core of geminabox is the sinatra application found in geminabox.rb.

As stated in the documentation, the geminabox server stores everything it knows about the managed gems in its data directory. After running the geminabox application and adding some gems using the gem inabox command, here is what the data directory looks like:

data
├── _cache
├── gems
│   ├── json-1.6.6.gem
│   ├── json-1.7.6.gem
│   ├── json-1.8.0.gem
│   └── MyGem-0.0.1.gem
├── latest_specs.4.8
├── latest_specs.4.8.gz
├── prerelease_specs.4.8
├── prerelease_specs.4.8.gz
├── quick
│   └── Marshal.4.8
│       ├── json-1.6.6.gemspec.rz
│       ├── json-1.7.6.gemspec.rz
│       ├── json-1.8.0.gemspec.rz
│       └── MyGem-0.0.1.gemspec.rz
├── specs.4.8
├── specs.4.8.gz
└── _temp

This tree structure closely mirrors the Rubygems API. The source code shows that it is generated by Gem::Indexer, a class provided by rubygems itself.

So what’s the added value of geminabox? It is not really Rubygems compliance: for that part, geminabox just calls the indexer and serves the resulting index as static files, which is good for performance.

As you might expect, geminabox adds some custom routes in order to publish new gems and remove existing ones; it can figure out automatically when to update the index and keep its caches in sync.

The remaining code implements the bundler dependency API. At the very top level, it adds this route:

get '/api/v1/dependencies' do
  query_gems = params[:gems].to_s.split(',')
  deps = query_gems.inject([]){|memo, query_gem| memo + gem_dependencies(query_gem) }
  Marshal.dump(deps)
end

I will not go into too much detail, but geminabox features this simple optimization: once the dependencies of a gem have been resolved, they are stored in a cache file.
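The idea behind such a cache can be sketched in a few lines. I have not reproduced geminabox's actual code here: the TinyFileCache class, the MD5-based file names and the sample data are all assumptions made for the sake of the example.

```ruby
require 'digest/md5'
require 'fileutils'
require 'tmpdir'

# Hypothetical file cache; geminabox's real implementation may differ.
class TinyFileCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached value for +key+, or computes it with the block
  # and stores it on disk as a marshaled file.
  def fetch(key)
    path = File.join(@dir, Digest::MD5.hexdigest(key))
    return Marshal.load(File.binread(path)) if File.exist?(path)

    value = yield
    File.binwrite(path, Marshal.dump(value))
    value
  end
end

Dir.mktmpdir do |dir|
  cache = TinyFileCache.new(dir)
  2.times do
    deps = cache.fetch('sinatra') do
      puts 'resolving dependencies (slow path)'
      [{ name: 'sinatra', number: '1.4.3', platform: 'ruby', dependencies: [] }]
    end
    puts deps.first[:name]
  end
end
```

The block runs only on the first call; the second call is answered from the file on disk.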

Let’s install, configure and run a geminabox server:

# install
$ git clone https://github.com/geminabox/geminabox
$ cd geminabox
$ bundle install

# set data directory and run
$ vim config.ru
$ rackup -p 9292

Then we can push some gems on the client side:

$ gem inabox --host http://localhost:9292/ \
  rack-1.5.2.gem rack-protection-1.5.0.gem sinatra-1.4.3.gem

It is easy to query dependencies for the sinatra gem in the console:

require 'net/http'
uri = 'http://localhost:9292/api/v1/dependencies?gems=sinatra'
response = Net::HTTP.get_response(URI.parse(uri))
deps = Marshal.load(response.body)

require 'pp'
pp deps

And here are the results:

[{:name=>"sinatra",
  :number=>"1.4.3",
  :platform=>"ruby",
  :dependencies=>
   [["rack", "~> 1.4"],
    ["tilt", ">= 1.3.4, ~> 1.3"],
    ["rack-protection", "~> 1.4"]]}]
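Each entry in this response can be derived from a Gem::Specification. The sketch below shows one way to do it; dependency_entry is a hypothetical helper and the demo gem is invented, so do not read it as the geminabox implementation.

```ruby
require 'rubygems'

# Hypothetical helper: turn a spec into a hash shaped like the entries above.
def dependency_entry(spec)
  {
    name: spec.name,
    number: spec.version.to_s,
    platform: spec.platform.to_s,
    dependencies: spec.runtime_dependencies.map do |dep|
      [dep.name, dep.requirement.to_s]
    end
  }
end

# Made-up gem with a single runtime dependency.
spec = Gem::Specification.new do |s|
  s.name    = 'demo'
  s.version = '0.1.0'
  s.add_runtime_dependency 'rack', '~> 1.4'
end

p dependency_entry(spec)
```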

Going back to the server, we now find a new file in the data/_cache directory. Its content is exactly what the server just returned to us.

server$ ruby <<EOS
> content = File.read('data/_cache/0efe415c937f6858550a6378f4f3f374')
> deps = Marshal.load(content)
> require 'pp'
> pp deps
> EOS

[{:name=>"sinatra",
  :number=>"1.4.3",
  :platform=>"ruby",
  :dependencies=>
   [["rack", "~> 1.4"],
    ["tilt", ">= 1.3.4, ~> 1.3"],
    ["rack-protection", "~> 1.4"]]}]

We now know enough about geminabox.

Back to the indexer

So, geminabox makes special use of the gem indexer behind the scenes. And indeed, geminabox is a nice tool if you want to build a gem server from scratch.

But the purpose of the gem indexer is precisely to help implement a standalone static gem server: it comes with a gem generate_index command that turns a GEM_HOME into a www directory that can be served directly by nginx, Apache or the like.

To illustrate, let’s create a www directory like this:

www
└── gems
    ├── rack-1.5.2.gem
    ├── rack-protection-1.5.0.gem
    ├── sinatra-1.4.3.gem
    └── tilt-1.4.1.gem

Let’s run the indexer, skipping the legacy cruft:

www$ gem generate_index -d . --no-legacy
Generating Marshal quick index gemspecs for 4 gems
....

From there, we find a full tree structure that matches the API:

www
├── gems
│   ├── rack-1.5.2.gem
│   ├── rack-protection-1.5.0.gem
│   ├── sinatra-1.4.3.gem
│   └── tilt-1.4.1.gem
├── latest_specs.4.8
├── latest_specs.4.8.gz
├── prerelease_specs.4.8
├── prerelease_specs.4.8.gz
├── quick
│   └── Marshal.4.8
│       ├── rack-1.5.2.gemspec.rz
│       ├── rack-protection-1.5.0.gemspec.rz
│       ├── sinatra-1.4.3.gemspec.rz
│       └── tilt-1.4.1.gemspec.rz
├── specs.4.8
└── specs.4.8.gz

We can now play with Zlib.inflate, Zlib::GzipReader and Marshal.load to explore the specs, exactly like the gem clients do.
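Here is a minimal sketch for poking at the generated files from Ruby. The two helper names are mine, not part of rubygems, and the paths mentioned in the note below assume the www tree shown above.

```ruby
require 'rubygems'
require 'zlib'

# Hypothetical helper: load a gzipped, marshaled index file such as specs.4.8.gz.
def read_index(path)
  Zlib::GzipReader.open(path) { |gz| Marshal.load(gz.read) }
end

# Hypothetical helper: load a deflated, marshaled file such as a .gemspec.rz.
def read_gemspec(path)
  Marshal.load(Zlib::Inflate.inflate(File.binread(path)))
end
```

From the www directory, read_index('specs.4.8.gz') should give back the array of name/version/platform tuples, and read_gemspec('quick/Marshal.4.8/rack-1.5.2.gemspec.rz') the full specification of the rack gem.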

Obviously, you need to rerun the indexer each time you modify the gem set, just as geminabox does. If we were feeling playful we could write a guard for it.

With this approach, there is still no way to query dependencies dynamically. Nevertheless, this static tree complies with the Rubygems API.

Going further

If you wonder how easy it is to craft a basic server for your gems, I suggest you have a look at trivial_gem_server. This sinatra-based application does not fully comply with the official API but it works well enough with both the rubygems and bundler clients. The web application has no external dependency except sinatra and rubygems itself.

I hope you have gained a better understanding of the Rubygems API. And maybe this is a good start if you want to contribute to a project like geminabox!