Our deployment process is fairly uncomplicated, though it relies on a number of custom tools we wrote ourselves. A few philosophies guide our work on it:
- Shipping code to production should be as painless as possible.
- Our production code should be as safe as possible, but not at the expense of development speed.
- Machines are cheap; virtual machines doubly so.
To demonstrate how our deploy works, let's walk through a normal web push:
Our web code lives in a central Git repository hosted on GitHub, with production code on a deploy branch. When code is ready to be deployed, a developer pushes it from their local copy of the repository and marks it ready to be checked and deployed to production.
Our cloud management tool and package deployment system is an in-house tool called Boxman. Boxman controls our internal package management system, similar to how Yum or Synaptic works on a Linux system but with much more frequent package updates – effectively one every time code is pushed. When a code revision passes tests, Boxman builds a web package and sets it as the newest version. This version is distributed to the web machines in a P2P fashion, and the machines are then restarted in a rolling fashion. A simple Tornado server displays stats on the progress of the deploy. When the rolling restarts complete, some consistency checks are run and the deploy is done.
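Boxman's internals aren't public, but the rolling-restart step can be sketched roughly like this (`restart_host` and `health_check` are hypothetical stand-ins for whatever Boxman actually calls):

```python
import time

def rolling_restart(hosts, restart_host, health_check,
                    batch_size=2, pause=0.0):
    """Restart hosts a few at a time so most of the fleet stays up."""
    failed = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            restart_host(host)           # e.g. bounce the web service
        for host in batch:
            if not health_check(host):   # flag hosts that didn't come back
                failed.append(host)
        time.sleep(pause)                # let the batch settle before moving on
    return failed
```

A deploy driver would call this with the real fleet and run the final consistency checks only if `failed` comes back empty.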
Boxman is also used to restart machines that are down or incorrectly versioned during or after deploys. Since it is sometimes cheaper to simply bring up a new copy of a machine rather than attempt to resuscitate a permadead machine, Boxman is also designed to start up and bring up-to-date fresh web boxes on command.
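The "machines are cheap" philosophy suggests remediation logic along these lines (purely illustrative; `restart` and `provision_fresh` are hypothetical callbacks, not Boxman's actual API):

```python
def remediate(host, failure_counts, restart, provision_fresh, max_attempts=2):
    """Try restarting a flaky host a couple of times, then give up and
    bring up a fresh, up-to-date box instead of resuscitating it."""
    if failure_counts.get(host, 0) < max_attempts:
        return restart(host)
    return provision_fresh()
```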
An improved version of Boxman, called Boxer, is under development. Boxer is modeled after existing deploy tools like Capistrano and Fabric, but is geared specifically towards managing EC2 instances. Common tasks I use Boxer for include starting and stopping Boxer-controlled hosts, imaging and scaling running instances, attaching/detaching volumes, attaching meta-information to instances or volumes, and assigning control groups based on imaged instances.
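Boxer itself isn't public; the sketch below only shows the general shape such EC2 tasks might take. The client is injected so the example runs offline; in practice a boto-style EC2 client would fill that role, and these function names are my own.

```python
def start_instances(ec2, instance_ids):
    # Start stopped, Boxer-controlled hosts.
    return ec2.start_instances(InstanceIds=instance_ids)

def tag_instances(ec2, instance_ids, tags):
    # Attach meta-information (key/value tags) to instances or volumes.
    return ec2.create_tags(
        Resources=instance_ids,
        Tags=[{"Key": k, "Value": v} for k, v in tags.items()])

class FakeEC2:
    """Stand-in for a real EC2 client so the sketch runs offline."""
    def __init__(self):
        self.calls = []
    def start_instances(self, InstanceIds):
        self.calls.append(("start", list(InstanceIds)))
        return {"StartingInstances": list(InstanceIds)}
    def create_tags(self, Resources, Tags):
        self.calls.append(("tag", list(Resources), list(Tags)))
        return {}
```

Keeping the client injectable also makes the tasks trivially testable, which matters for tooling that can terminate production machines.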
Monitoring availability and performance
- Amazon Web Services for basic metrics
- Custom tools that individually log to memory, disk and databases, which we then graph for visualization and/or use for alerting (email & pages)
- pdb (the Python debugger) with objgraph
- Google Perf Tools
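objgraph's `show_most_common_types()` is handy for hunting memory leaks in a live Python process; here is a stdlib-only approximation of the same idea:

```python
import gc
from collections import Counter

def most_common_types(limit=5):
    """Count live objects by type name, roughly what objgraph's
    show_most_common_types() prints."""
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(limit)
```

Calling this before and after a suspect operation and diffing the counts points at which object type is being leaked.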
Charlie Cheever (one of Quora’s founders) also answered a question related to the service’s infrastructure:
We use Thrift to communicate between different backend systems.
Our main webserver is paste (the default for Pylons) with nginx and HAProxy in front of it. For our Comet server, we use Tornado.
We mostly use Amazon EC2 and S3 for hosting.
Our data store is mostly MySQL + memcached right now, plus two services written in C++.
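The "MySQL + memcached" pairing implies the standard read-through cache pattern. Here is a minimal sketch with stand-in cache/db objects; in production the cache would be a memcached client and the db lookup a SQL query:

```python
def cached_get(key, cache, db):
    """Check the cache first; on a miss, read from the database and
    populate the cache so the next read is cheap."""
    value = cache.get(key)
    if value is None:
        value = db.get(key)      # stand-in for a SELECT by primary key
        if value is not None:
            cache[key] = value
    return value
```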
We use git for version control.
Although there isn’t much public information on Quora’s approach to pushing code to production, we can extract relevant details from this. The process is fairly conventional and largely built on in-house solutions and custom tools. Although we don’t know much about its scalability and high availability, I think it’s safe to say that Quora scales fast enough and that the deployment seems reliable.
Also, I like the P2P approach, which we’ve also seen in Facebook’s method of pushing code. It’s simple, it’s fast … and, if you take some precautions, pretty secure.
There’s no actual WOW factor here, but kudos to Quora for building their own custom delivery and deployment system!