Automated infrastructure testing with Serverspec
By using automated RSpec testing of our real infrastructure, we reduced the uncertainty around our operating environments, and decreased the amount of manual correctness checking that engineers have to do when provisioning new systems.
So, what’s the problem?
We have some “legacy managed” infrastructure, built by hand over several years. As some of our critical software runs on this platform, we recognised the need for a Disaster Recovery environment to mirror that infrastructure in case of a large-scale failure.
This presented several issues:
- That environment was built in several stages, and was inconsistent across machines, both in installed packages and in configuration.
- New capacity in that environment was sometimes added by hand, sometimes by cloning existing machines.
- We needed to verify that the new Disaster Recovery environment was configured the same way as the system it mirrors.
- In the past, this kind of verification had been done manually by an engineer: a costly, time-consuming, and error-prone job.
Introducing Serverspec!
When you are purely dealing with software, the solution to these problems is obvious: automated testing! What if we could regularly run a set of tests, and have them tell us whether an environment is “correct” or not? We can! Serverspec builds on top of RSpec and lets you test the state of your running servers by connecting to those machines and asserting on the state of files, running processes, and so on. In addition, we used nodespec (written by one of our own engineers), which adds some nice features around configuring connections to our servers.
Serverspec provides a nice DSL for testing common aspects of running systems. For example, here is a test we use to check that statsd is set up correctly:
shared_examples_for 'it is the statsd endpoint' do
  describe file('/shared/redbubble/statsdConfig.js') do
    it { should be_file }
    its(:content) { should match %r|port: 8126| }
    its(:content) { should match %r|flushInterval: 60000| }
  end

  describe port(8126) do
    it { should be_listening.with('udp') }
  end
end
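If you're curious how the `nodespec:` metadata you'll see in later examples gets turned into an actual connection: below is a rough sketch of how that glue could be wired up with plain Serverspec. This is illustrative only; it is not nodespec's actual implementation, just an assumption about one way such a hook could look.

# spec_helper.rb - a hypothetical sketch, NOT the real nodespec gem.
# It reads the `nodespec:` hash from each example's metadata and points
# Serverspec's SSH backend at that host before the example runs.
require 'serverspec'
require 'net/ssh'

set :backend, :ssh

RSpec.configure do |config|
  config.before(:each) do |example|
    node = example.metadata[:nodespec]
    next unless node

    # Direct Serverspec's resource types (file, port, service, ...) at this host
    set :host, node['host']
    set :ssh_options, user: node['user']
  end
end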
How we implemented Serverspec:
Our environments
For this piece of work, we had roughly four different environments to deal with: staging and production, each existing both in the managed environment and as a counterpart DR version in the cloud. Each differed from the others in the number of servers, the layout of the services installed on them, configuration values, and, in a few places, software versions.
How did we organise our specs to cope with that?
It sounds like a lot of complexity is involved in dealing with all the environments described above, and you would be right. To deal with it, we structured our test code into several “layers”:
- Spec file per deployment environment (e.g. production-managed), composed of:
  - Specs per server, composed of:
    - Behavioural examples, composed of:
      - Individual RSpec tests
      - (and more composed behaviours)
Here is part of one of the top-level environment specs:
require 'spec_helper'

environment = EnvironmentFactory.ey_managed_production

environment.nodespecs_for_all_hosts.each do |nodespec_config|
  describe "Node #{nodespec_config['host']}", nodespec: nodespec_config do
    it_behaves_like 'a base server', environment
    it_behaves_like 'a server with the firewall', environment
  end
end

environment.app_servers.each do |index|
  describe "App Server #{index}", nodespec: environment.nodespec_for("prod-app#{index}") do
    it_behaves_like 'an app server', environment.merge(unicorn_worker_count: index < 15 ? 10 : 5)
    it_behaves_like 'a syslog-ng publisher', environment
  end
end

describe 'prod-rabbitmq', nodespec: environment.nodespec_for('prod-rabbitmq') do
  it_behaves_like 'rabbitmq', environment
  it_behaves_like 'it has the shared NFS mount', environment
  it_behaves_like 'a syslog-ng publisher', environment
end
Here we have three blocks that describe individual servers in the environment. The first two describe groups of servers (all hosts, and the application servers respectively), without having to list each one individually. The last block describes our RabbitMQ server, referring to it by its hostname.
Each of these blocks refers to an RSpec shared example, which describes a behaviour we want a server to have. These can be very high-level (e.g. behave like “an app server”), or more specific (“has the NFS shared mount”).
Let’s have a look at an example:
shared_examples 'an app server' do |environment|
  it_behaves_like 'a unicorn server', environment
  it_behaves_like 'it has the shared NFS mount', environment
  it_behaves_like 'it has ruby installed', '1.9.3p484'
  it_behaves_like 'it has the application deployed', environment
  it_behaves_like 'an nginx server', environment
  it_behaves_like 'uploading logs to S3', environment.merge(server_role: 'appserver')
  it_behaves_like 'has geoip capabilities'
end
Here, we can see that these behaviours are further composed of other behaviours, which themselves look like:
shared_examples_for 'it has the shared NFS mount' do |environment|
  describe file('/shared') do
    it { should be_mounted.with(rw: true, type: 'nfs') }
  end

  describe file('/shared/nfs') do
    it { should be_readable.by_user(environment[:user]) }
    it { should be_writable.by_user(environment[:user]) }
  end

  describe file('/etc/fstab') do
    its(:content) { should match /#{environment.nfs_hostname}:\/data\s*\/shared\s*nfs/ }
  end
end
Managing configuration differences
Okay, so that’s cool. But you still haven’t told me how that helps with all the different environments!?
You may have noticed the `environment = EnvironmentFactory.ey_managed_production` line at the top of the first example. This is a convenience method we created to provide an object holding all the specifics of the environment under test: essentially a hash of values, plus convenience methods, both of which are used in the specs.
Let’s break it down a little:
class EnvironmentFactory
  #...
  COMMON = {
    memcached_memusage: 1024,
    nfs_exported_dir: '/data'
  }

  STAGING = {
    environment: 'staging',
    rails_env: 'staging',
    database: 'staging',
  }

  STAGING_MANAGED = {
    unicorn_worker_count: 2,
    memcached_memusage: 64,
    app_server_external_ip: '99.99.99.99',
    nfs_exported_dir: '/shared',
    syslog_ng_endpoint: 'tm99-s00999',
    memcached_hostname: 'tm99-s00998',
    max_connections: 300
  }
  #...

  def self.ey_managed_staging
    EyManagedEnvironment.new(COMMON.merge(EY_MANAGED).merge(STAGING).merge(STAGING_MANAGED), :staging)
  end
  #...
end
The EnvironmentFactory.ey_managed_staging method pulls in the series of configuration hashes (which lets us set common parameters, and then override them for specific environments), and returns an Environment object:
class Environment
  # env_name (e.g. :staging) lets subclasses provide env-specific behaviour
  def initialize(params, env_name)
    @params = params
    @env_name = env_name
  end

  def staging?
    @env_name == :staging
  end

  #... hash access ...
  def [](key)
    @params[key]
  end

  def []=(key, value)
    @params[key] = value
  end

  # Returns a copy with some values overridden, as used by the specs
  # (e.g. environment.merge(server_role: 'appserver'))
  def merge(overrides)
    self.class.new(@params.merge(overrides), @env_name)
  end

  #... convenience methods ...
  def app_servers
    (1..@params[:app_servers])
  end

  #... and the nodespec configuration for connecting to machines ...
  def nodespecs_for_all_app_servers
    as_nodespec(hosts_matching('app'))
  end

  def as_nodespec(hosts)
    hosts.map do |hostname|
      {
        'adapter' => 'ssh',
        'os' => 'Gentoo',
        'user' => @params[:user],
        'rails_env' => @params[:rails_env],
        'host' => hostname,
      }
    end
  end
end

class EyManagedEnvironment < Environment
  # ... more env-specific convenience methods ...
  def app_master_host
    staging? ? 'staging-app1' : 'prod-app1'
  end
end
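To make the merge behaviour concrete, here is roughly how values resolve for the managed staging environment built above. Later hashes win in Ruby's Hash#merge; the `:app_servers` value below is hypothetical, since it doesn't appear in the snippets:

environment = EnvironmentFactory.ey_managed_staging

environment[:memcached_memusage] # => 64, STAGING_MANAGED overrides COMMON's 1024
environment[:nfs_exported_dir]   # => '/shared', overriding COMMON's '/data'
environment[:rails_env]          # => 'staging', from the STAGING hash

# Convenience methods read the same params
# (assuming :app_servers had been set to 2 for staging):
environment[:app_servers] = 2
environment.app_servers.to_a     # => [1, 2]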
This way, by passing the environment around to the specs, we can refer to things like the amount of memory memcached is expected to use in a common way, but with all the environment-specific information kept in one easy-to-update place!
Running specs & the board
The specs for an environment are simply run by issuing:
$ rspec spec/staging_spec.rb -f d

Node staging-app1
  behaves like a base server
    behaves like a network service
      Package "ntp"
        should be installed
      Service "ntpd"
        should be running
    behaves like a server monitored by NewRelic
      Package "newrelic-sysmond"
        should be installed
      Service "newrelic-sysmond"
        should be enabled
        should be running
  ... etc ...
We then set this up on a Jenkins server, with a visible dashboard, so we can easily see when something has gone wrong!
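Our Jenkins jobs essentially just run the command above on a schedule. If you wanted to wrap the suites up more tidily, one option (a sketch, not our actual setup) is a Rake task per environment spec that a CI job can invoke:

# Rakefile - a hypothetical convenience: one Rake task per environment
# spec file, so a CI job can run e.g. `rake spec:staging`.
require 'rspec/core/rake_task'

%w[staging production].each do |env|
  RSpec::Core::RakeTask.new("spec:#{env}") do |t|
    t.pattern    = "spec/#{env}_spec.rb"
    t.rspec_opts = '-f d' # the same documentation format shown above
  end
end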
The many benefits of infrastructure testing
These suites of tests took a lot of time and effort for our team to produce retroactively for all our systems. So, was it worth it? Absolutely, yes. We saw a number of immediate wins:
- When adding server capacity in our manual, managed environment, we can rapidly verify that the servers are provisioned correctly, cutting many dozens of engineer-hours from the time taken.
- We now have a good, version-controlled record of our infrastructure configuration, including the layout of servers, which services run where, and where environments differ.
- It gave us a way of verifying that our `chef` recipes were doing the right thing in our cloud-based environments.
- It gave us a high degree of confidence in our Disaster Recovery environment, and let us switch live traffic over to it with minimal issues.