Basic and Simple Disaster Recovery with sshkit

Photo credit: Unsplash: dominik_qn

It is easy to create a hugely complex disaster recovery plan with all kinds of redundancy and failover. It is also easy to spend a ton of money doing so. There are cases, however, where a daily snapshot and the ability to boot up that most recent snapshot are plenty of protection.

Capistrano is a great tool for deployment, and still one of my preferred methods when not on Heroku or using Docker. The core of Capistrano is performing commands over SSH, and the gem for that is sshkit.

The approach is simple. The standby server has the same codebase as production, maintained to the most current version by deploying via Capistrano whenever a production deploy is made. The server processes are left disabled on the standby server, so that it is not accessible while the production system is still live.
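
With Capistrano 3, for instance, the standby box can simply be another deploy stage that gets pushed to whenever production is deployed. A minimal sketch; the stage name, host, user, and roles are placeholders for your own setup:

# config/deploy/standby.rb -- a sketch; host, user, and roles are placeholders
server 'standby_server', user: 'deploy', roles: %w[app db web]
set :rails_env, 'standby'
# however you manage puma and sidekiq in your deploy, leave them stopped for
# this stage so the standby stays dormant until it is actually needed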

You will need an SSH key on the production server for the user you run the script as, and the public key will need to be in the authorized_keys file for the user you connect as on the standby server. You may also want to put the standby server’s IP address and an alias in the production server’s hosts file.
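
If you would rather not rely on the default key lookup and the hosts file, SSHKit also lets you spell the connection details out on a host object. A rough sketch; the key path and host string are placeholders:

standby = SSHKit::Host.new('user@standby_server')
standby.ssh_options = {
  keys: %w[/home/deploy/.ssh/standby_key],  # placeholder path to the private key
  auth_methods: %w[publickey]
}
# the host object is then handed to the coordinator in place of the plain string
SSHKit::Coordinator.new(standby).each in: :sequence do
  # ...
end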

A cron job on the production server triggers the process of keeping the standby server up-to-date nightly. Something like:

20 0 * * * /bin/bash -l -c 'cd /apps/app/current && /usr/bin/env bundle exec ruby lib/refresh_server.rb >> /apps/app/current/log/cron.log 2>&1'

The refresh_server.rb script is where our sshkit magic will be. Let’s break it down at a high level with pseudocode.

# plain old ruby script, not dependent on rails or much of anything
require 'rubygems'
require 'bundler/setup'
require 'date'
require 'rollbar'
require 'sshkit'
require 'yaml'

begin
  # prepare Rollbar for error notifications, load other configuration
  configure_script
  # check the remote server for any preconditions (more on this later) 
  check_remote_server
  # Create a database snapshot, rsync to the remote server
  run_local_commands
  # Apply updates to remote server, update search indexes, warm caches, etc
  run_remote_commands
  # Notify Rollbar or anything else of success
  Rollbar.info('Refresh successful')
rescue StandardError => error
  # Notify Rollbar
  Rollbar.error(error)
end

Let’s get into the details of each of those steps.

Configure Script

Here we do some basic setup. In my case, I just pull the Rollbar token from our secrets.yml file.

secrets = YAML.load_file('config/secrets.yml')
Rollbar.configure do |config|
  config.access_token = secrets['production']['rollbar_server_token']
end
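
This is also a handy place to tune SSHKit itself. As a sketch, not something the script strictly needs: turning up the output verbosity makes every remote command show up in the same cron.log that the cron entry redirects to.

# optional: log every command SSHKit runs; stdout already ends up in
# log/cron.log thanks to the cron redirection
SSHKit.config.output_verbosity = Logger::DEBUG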

Check Remote Server

This step is my failsafe check. I don’t want to refresh the standby server if it is in use, because that means we have cut over to it. Imagine a case where production goes down and we cut over, then production comes back up and tries to refresh the standby server again. Since we have started using the standby, we don’t want production to overwrite the day’s work. Switching back to production, and recreating there any work done on the standby, is a manual step, but hopefully the timeline is favorable. Remember, this is a failsafe: it needs to work, but it does not need to be elegant.

The heart of the remote server check is something like this:

SSHKit::Coordinator.new('user@standby_server').each in: :sequence do
  within '/apps' do
    if test(:test, '-f app/shared/tmp/pids/puma.pid') ||
       test(:test, '-f app/shared/tmp/pids/sidekiq.pid')
      raise 'Puma or sidekiq is running on the standby server'
    end
  end
end
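
In the full script this presumably just becomes the body of the check_remote_server method called from the pseudocode above, with the other steps wrapped the same way; the raise is what trips the rescue and gets reported to Rollbar.

def check_remote_server
  SSHKit::Coordinator.new('user@standby_server').each in: :sequence do
    within '/apps' do
      # a leftover pid file means the standby has been (or still is) in use
      if test(:test, '-f app/shared/tmp/pids/puma.pid') ||
         test(:test, '-f app/shared/tmp/pids/sidekiq.pid')
        raise 'Puma or sidekiq is running on the standby server'
      end
    end
  end
end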

Run Local Commands

The local commands prepare the database backup and rsync everything over to the standby server. Running commands locally with the sshkit coordinator is easy: pass the :local symbol instead of a host string. Make sure any data folders are also synced over (think file uploads or report outputs), and make sure the user this script runs as has permission to create the database dump file.

SSHKit::Coordinator.new(:local).each in: :sequence do
  within '/apps' do
    execute :mkdir, '-p', 'db_dumps'
    execute 'pg_dump', '-Fc', '--no-acl', '--no-owner',
              '-f db_dumps/production_backup_to_standby.dump',
              'app_production'
    execute :rsync, '-e ssh',
              'db_dumps/production_backup_to_standby.dump',
              'user@standby_server:production_backup_for_restoration.dump'
    execute :rsync, '-a', '-e ssh', '--delete', '/apps/app/shared/uploads',
              'user@standby_server:/apps/app/shared'
    execute :rsync, '-a', '-e ssh', '--delete', '/apps/app/shared/report_downloads',
              'user@standby_server:/apps/app/shared'
  end
end
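
An optional extra, not part of the original script: between the pg_dump and the first rsync you could add a quick sanity check that the dump actually has data in it, so a bad dump never gets shipped to the standby. Something like:

    # slots in right after the pg_dump call, inside the same within '/apps' block
    unless test(:test, '-s', 'db_dumps/production_backup_to_standby.dump')
      raise 'Database dump is missing or empty, aborting refresh'
    end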

Run Remote Commands

The remote commands replace the existing standby database with the freshly created production dump. They also move some files around, because the Rails environment name differs between the two servers. We keep the database dumps around as well, so we have snapshots to restore from should we need them. Finally, if there are services that need to be updated for the new data, Elasticsearch for example, those tasks can run after the data is in place.

SSHKit::Coordinator.new('user@standby_server').each in: :sequence do
  within '/apps' do
    execute 'dropdb', 'standby_db'
    execute 'createdb', 'standby_db'
    execute :mkdir, '-p', 'db_backups'
    timestamp = Date.today.strftime('%F')
    execute :cp, 'production_backup_for_restoration.dump',
              "db_backups/production_backup_#{timestamp}.dump"
    execute 'pg_restore', '--verbose', '--no-acl', '--no-owner', '-j 8',
              '-d standby_db', 'production_backup_for_restoration.dump'
    # Since report_downloads are organized by Rails environment, rename the
    # sub folder to the target server's environment name
    execute :mv, '/apps/app/shared/report_downloads/production',
              '/apps/app/shared/report_downloads/standby'
  end
  # With the updated database in place, rebuild the elasticsearch index
  within '/apps/app/current' do
    with rails_env: 'standby' do
      execute './bin/rake', 'chewy:reset'
    end
  end
end
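
Since db_backups grows by one dump per day, you may eventually want to prune it. A small sketch, not part of the original script, keeping roughly a month of snapshots (the 30-day window is arbitrary):

SSHKit::Coordinator.new('user@standby_server').each in: :sequence do
  within '/apps' do
    # delete dated snapshots older than roughly a month; the pattern is quoted
    # so the remote shell passes it through to find instead of expanding it
    execute :find, 'db_backups', '-name', "'production_backup_*.dump'",
              '-mtime', '+30', '-delete'
  end
end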

Conclusion

If you are already using Ruby, this approach is a great, low-effort way to script up this kind of daily application backup. Capistrano’s sshkit makes the script quick to write, and the result is easy to understand and maintain.