Shipping Code at Netflix

“Move Fast, Fail Fast”

Netflix, the popular movie streaming site, deploys a hundred times per day, without the use of Chef or Puppet, without a quality assurance department, and without release engineers. To do this, Netflix built an advanced in-house PaaS (Platform as a Service) that allows each team to deploy its own part of the infrastructure whenever it wants, as many times as it needs. During QCon New York 2013, Jeremy Edberg gave a talk about the infrastructure Netflix built on top of Amazon’s AWS to support this rapid pace of iteration.

Netflix uses a service-oriented architecture to implement its API, which handles most of the site’s requests (2 billion requests per day). Behind the scenes, the API is split into many services, each managed by a single team, allowing teams to work relatively autonomously and decide for themselves when and how often they want to deploy new software.

How does Netflix manage to ship code so well?

Netflix is heavily invested in DevOps. Developers build, deploy and operate their own server clusters and are accountable when things go wrong. When a failure occurs, a session is held to investigate the root cause and to discuss ways to prevent similar issues in the future – an approach similar to the five whys.

Deployment at Netflix is completely automated. When a service needs to be deployed, the developer first pushes the code to a source code repository. The push is picked up by Jenkins, which performs a build and produces an application package. Next, a fresh VM image (AMI) is baked: starting from a base image containing a Linux distribution and the software all Netflix servers run (including a JVM and Tomcat), possibly further customized by the team, the application package is installed on top. The resulting AMI is then registered with the system.
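Netflix’s actual image baking is handled by internal tooling, but the general “bake” idea can be sketched with the AWS SDK. Everything in the sketch below – the base AMI ID, the user-data install script, and the bake_ami helper – is an assumption for illustration, not Netflix’s real pipeline.

```python
# A minimal sketch of the "bake" step, assuming boto3 and a base AMI that
# already contains the common stack (Linux, JVM, Tomcat). The base AMI ID,
# package URL, and install script are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

BASE_AMI_ID = "ami-00000000"             # hypothetical base image
APP_PACKAGE_URL = "s3://bucket/app.deb"  # hypothetical application package


def bake_ami(app_name: str, build_number: int) -> str:
    """Launch the base image, install the app package, snapshot a new AMI."""
    # Install the application package on first boot via user data.
    # (A real bakery would wait for the install to finish before imaging.)
    user_data = (
        "#!/bin/bash\n"
        f"aws s3 cp {APP_PACKAGE_URL} /tmp/app.deb && dpkg -i /tmp/app.deb\n"
    )

    instance = ec2.run_instances(
        ImageId=BASE_AMI_ID, InstanceType="m1.large",
        MinCount=1, MaxCount=1, UserData=user_data,
    )["Instances"][0]
    instance_id = instance["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Snapshot the customized instance into a new, immutable AMI.
    ami_id = ec2.create_image(
        InstanceId=instance_id,
        Name=f"{app_name}-build-{build_number}",
    )["ImageId"]
    ec2.get_waiter("image_available").wait(ImageIds=[ami_id])

    # The builder instance is no longer needed once the AMI exists.
    ec2.terminate_instances(InstanceIds=[instance_id])
    return ami_id
```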

To deploy the VM images to its infrastructure, Netflix built Asgard. Via the Asgard web interface, VM images can be instantiated to create new EC2 clusters. Every cluster consists of at least 3 EC2 instances for redundancy, spread over multiple availability zones. When deploying a new version, the cluster running the previous version is kept running while the new version is instantiated. When the new version is booted and has registered itself with the Netflix services registry called Eureka, the load balancer flips a switch directing all traffic to the new cluster. The new cluster is monitored carefully and kept running overnight. If everything runs OK, the old cluster is destroyed. If something goes wrong, the load balancer is switched back to the old cluster.
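Asgard itself is a web application, but the red/black cutover it automates can be outlined in a few lines. The sketch below is a simplification with hypothetical helpers (launch_cluster, registered_in_eureka, switch_traffic, destroy_cluster) standing in for the real Asgard, Eureka, and load-balancer calls; it is not Asgard’s actual code.

```python
# Simplified outline of the red/black deployment flow described above.
# The helper functions are hypothetical stand-ins, not real Asgard/Eureka APIs.
import time


def launch_cluster(ami_id: str, size: int = 3) -> str:
    """Stand-in: create a cluster of >= 3 instances across availability zones."""
    return f"cluster-{ami_id}"


def registered_in_eureka(cluster: str) -> bool:
    """Stand-in: check that every instance has registered with Eureka."""
    return True


def switch_traffic(load_balancer: str, cluster: str) -> None:
    """Stand-in: point the load balancer at the given cluster."""
    print(f"{load_balancer} -> {cluster}")


def destroy_cluster(cluster: str) -> None:
    """Stand-in: terminate the cluster's instances."""
    print(f"destroyed {cluster}")


def red_black_deploy(ami_id: str, load_balancer: str, old_cluster: str,
                     healthy_overnight) -> None:
    """healthy_overnight is a callable that reports on overnight monitoring."""
    new_cluster = launch_cluster(ami_id)
    while not registered_in_eureka(new_cluster):
        time.sleep(30)                     # wait for Eureka registration

    switch_traffic(load_balancer, new_cluster)

    # The old cluster stays up overnight as an instant rollback path.
    if healthy_overnight():
        destroy_cluster(old_cluster)
    else:
        switch_traffic(load_balancer, old_cluster)   # roll back
        destroy_cluster(new_cluster)
```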

Asgard – Web-based Cloud Management and Deployment

Asgard is named for the home of the Norse god of thunder and lightning, because Asgard is where Netflix developers go to control the clouds. I’m happy to announce that Asgard has now been open sourced on GitHub and is available for download and use by anyone. All you’ll need is an Amazon Web Services account. Like other open source Netflix projects, Asgard is released under the Apache License, Version 2.0. Please feel free to fork the project and make improvements to it.
Some of the information in this blog post is also published in the following presentations. Note that Asgard was originally named the Netflix Application Console, or NAC.

 

Below is a diagram of some of the Amazon objects required to run a single front-end application such as Netflix’s autocomplete service:
[Diagram: Amazon cloud objects backing a single front-end application]

Failure happens continuously in the Netflix infrastructure. Software needs to be able to deal with failing hardware, failing network connectivity and many other types of failure. Even if failure doesn’t occur naturally, it is induced forcefully using The Simian Army. The Simian Army consists of a number of (software) “monkeys” that randomly introduce failure. For instance, the Chaos Monkey randomly brings servers down and the Latency Monkey randomly introduces latency in the network. Ensuring that failure happens constantly makes it impossible for the team to ignore the problem and creates a culture that has failure resilience as a top priority.
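The real Chaos Monkey is part of the open-sourced Simian Army and adds scheduling, opt-outs, and auditing; the sketch below captures only the core idea – pick a random instance from a group and terminate it – using boto3, with the group name and region as placeholders.

```python
# Bare-bones illustration of the Chaos Monkey idea: terminate one random
# instance from an auto-scaling group so resilience is exercised constantly.
# The group name and region are placeholders; this is not the real Simian Army.
import random
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")


def unleash_chaos(group_name: str) -> None:
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"]
    instances = [i["InstanceId"] for g in groups for i in g["Instances"]]
    if not instances:
        return
    victim = random.choice(instances)
    # The auto-scaling group will replace the instance; the point is to
    # prove that clients and peer services tolerate the loss.
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Chaos Monkey terminated {victim} from {group_name}")
```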

Moving Toward Continuous Delivery

Before moving on, it’s useful to draw a quick distinction between continuous deployment and continuous delivery. Per the Continuous Delivery book, if you’re practicing continuous deployment then you’re necessarily also practicing continuous delivery, but the reverse doesn’t hold true.
Continuous deployment extends continuous delivery: every build that passes the automated test gates is deployed to production. Continuous delivery requires the same automated deployment infrastructure, but the decision to deploy is made based on business need rather than simply pushing every commit to prod. Continuous deployment may be pursued as an optimization of continuous delivery, but the current focus is to enable the latter, so that any release candidate can be deployed to prod quickly, safely, and in an automated way.

To meet demand for new features and to make a growing infrastructure easier to manage, Netflix is overhauling their dev, build, test, and deploy pipeline with an eye toward continuous delivery. Being able to deploy features as they’re developed gets them in front of Netflix subscribers as quickly as possible rather than having them “sit on the shelf.” And deploying smaller sets of features more frequently reduces the number of changes per deployment, which is an inherent benefit of continuous delivery and helps mitigate risk by making it easier to identify and triage problems if things go south during a deployment.

The foundational concepts underlying the delivery system are simple:  automation and insight.  By applying these ideas to the deployment pipeline, an effective balance between velocity and stability can be easily achieved.

Automation – Any process requiring people to execute manual steps repetitively will get you into trouble on a long enough timeline.  Any manual step that can be done by a human can be automated by a computer; automation provides consistency and repeatability.  It’s easy for manual steps to creep into a process over time and so constant evaluation is required to make sure sufficient automation is in place.

Insight – You can’t support, understand, and improve what you can’t see.  Insight applies both to the tools we use to develop and deploy the API as well as the monitoring systems used to track the health of the running applications.  For example, being able to trace code as it flows from SCM systems through various environments (test, stage, prod, etc.) and quality gates (unit tests, regression tests, canary, etc.) on its way to production helps us distribute deployment and ops responsibilities across the team in a scalable way.  Tools that surface feedback about the state of our pipeline and running apps give us the confidence to move fast and help us quickly identify and fix issues when things (inevitably) break.

Development & Deployment Flow

The following diagram illustrates the logical flow of code from feature inception to global deployment to production clusters across all of the AWS regions.  Each phase in the flow provides feedback about the “goodness” of the code, with each successive step providing more insight into and confidence about feature correctness and system stability.

[Diagram: logical flow from feature inception to global production deployment]
Taking a closer look at our continuous integration and deploy flow, we have the diagram below, which pretty closely outlines the flow we follow today.  Most of the pipeline is automated, and tooling gives us insight into code as it moves from one state to another.

[Diagram: basic build/test/deploy flow]

 

Branches

Currently, three long-lived branches are maintained, each serving a different purpose and deploying to a different environment. The pipeline is fully automated with the exception of the weekly pushes from the release branch, which require an engineer to kick off the global prod deployment.

Test branch – used to develop features that may take several dev/deploy/test cycles and require integration testing and coordination of work across several teams for an extended period of time (e.g., more than a week).  The test branch gets auto deployed to a test environment, which varies in stability over time as new features undergo development and early stage integration testing.  When a developer has a feature that’s a candidate for prod they manually merge it to the release branch.

Release branch – serves as the basis for weekly releases.  Commits to the release branch get auto-deployed to an integration environment in our test infrastructure and a staging environment in the prod infrastructure.  The release branch is generally in a deployable state but sometimes goes through a short cycle of instability for a few days at a time while features and libraries go through integration testing.  Prod deployments from the release branch are kicked off by someone on our delivery team and are fully automated after the initial action to start the deployment.

Prod branch – when a global deployment of the release branch (see above) finishes it’s merged into the prod branch, which serves as the basis for patch/daily pushes.  If a developer has a feature that’s ready for prod and they don’t need it to go through the weekly flow then they can commit it directly to the prod branch, which is kept in a deployable state.  Commits to the prod branch are auto-merged back to release and are auto-deployed to a canary cluster taking a small portion of live traffic.  If the result of the canary analysis phase is a “go” then the code is auto deployed globally.
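The branch-to-environment rules above can be summarized as data. The dictionary below is just a restatement of those rules in Python for illustration; the names and structure are not Netflix’s actual pipeline configuration.

```python
# The three long-lived branches and where commits to each one land,
# restated from the descriptions above (illustrative only).
BRANCH_RULES = {
    "test": {
        "auto_deploy_to": ["test environment"],
        "promotion": "manual merge to release when a feature is prod-ready",
    },
    "release": {
        "auto_deploy_to": ["integration (test infra)", "staging (prod infra)"],
        "promotion": "weekly global prod push, kicked off by an engineer",
    },
    "prod": {
        "auto_deploy_to": ["canary cluster (small slice of live traffic)"],
        "promotion": "auto global deploy on a 'go' from canary analysis; "
                     "commits auto-merged back to release",
    },
}


def describe(branch: str) -> str:
    rule = BRANCH_RULES[branch]
    targets = ", ".join(rule["auto_deploy_to"])
    return f"{branch}: auto-deploys to {targets}; then {rule['promotion']}"


if __name__ == "__main__":
    for branch in BRANCH_RULES:
        print(describe(branch))
```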

Confidence in the Canary

The basic idea of a canary is that you run new code on a small subset of your production infrastructure, for example, 1% of prod traffic, and you see how the new code (the canary) compares to the old code (the baseline).

Canary analysis used to be a manual process for us where someone on the team would look at graphs and logs on our baseline and canary servers to see how closely the metrics (HTTP status codes, response times, exception counts, load avg, etc.) matched.

Needless to say, this approach doesn’t scale when you’re deploying several times a week to clusters in multiple AWS regions. So an automated process was developed that compares 1,000+ metrics between the baseline and the canary code and generates a confidence score indicating how likely the canary is to perform well in production. The canary analysis process also includes an automated squeeze test for each canary Amazon Machine Image (AMI) that determines the throughput “sweet spot” for that AMI in requests per second. The throughput number, along with server start time (instance launch to taking traffic), is used to configure auto scaling policies.
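The real analysis compares 1,000+ metrics with far more sophisticated statistics; the sketch below only shows the shape of the idea – compare each metric between baseline and canary and roll the results up into a single 0–100 score. The example metrics, the 15% tolerance, and the scoring rule are assumptions, not Netflix’s.

```python
# Toy version of automated canary scoring: for each metric, decide whether
# the canary tracks the baseline, then express the pass rate as a 0-100 score.
def metric_ok(baseline: float, canary: float, tolerance: float = 0.15) -> bool:
    """True if the canary value is within tolerance of the baseline value."""
    if baseline == 0:
        return canary == 0
    return abs(canary - baseline) / abs(baseline) <= tolerance


def confidence_score(baseline: dict, canary: dict) -> float:
    """Fraction of shared metrics that match, scaled to 0-100."""
    names = baseline.keys() & canary.keys()
    passed = sum(metric_ok(baseline[m], canary[m]) for m in names)
    return 100.0 * passed / len(names) if names else 0.0


baseline = {"http_5xx": 12, "latency_p99_ms": 310, "load_avg": 2.1, "rps": 950}
canary   = {"http_5xx": 14, "latency_p99_ms": 480, "load_avg": 2.2, "rps": 940}

score = confidence_score(baseline, canary)
print(f"canary score: {score:.1f} -> {'go' if score >= 95 else 'no go'}")
```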

The canary analyzer generates a report for each AMI that includes the score and displays the total metric space in a scannable grid.  For commits to the prod branch (described above), canaries that get a high-enough confidence score after 8 hours are automatically deployed globally across all AWS regions.

The screenshots below show excerpts from a canary report.  If the score is too low (< 95 generally means a “no go”, as is the case with the canary below), the report helps guide troubleshooting efforts by providing a starting point for deeper investigation. This is where the metrics grid, shown below, helps out. The grid puts more important metrics in the upper left and less important metrics in the lower right.  Green means the metric correlated between baseline and canary.  Blue means the canary has a lower value for a metric (“cold”) and red means the canary has a higher value than the baseline for a metric (“hot”).
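The grid’s color coding can be reproduced in a few lines. The classification below mirrors the description above (green = correlated, blue = canary “cold”, red = canary “hot”); the 10% band is an arbitrary choice for illustration, not the threshold Netflix uses.

```python
# Classify a single metric the way the canary grid colors it:
# "green" if baseline and canary roughly agree, "blue" if the canary runs
# lower ("cold"), "red" if it runs higher ("hot"). The 10% band is arbitrary.
def grid_color(baseline: float, canary: float, band: float = 0.10) -> str:
    if baseline == 0 and canary == 0:
        return "green"
    reference = abs(baseline) if baseline != 0 else 1.0
    delta = (canary - baseline) / reference
    if abs(delta) <= band:
        return "green"   # metric correlates between baseline and canary
    return "blue" if delta < 0 else "red"


print(grid_color(310, 480))   # red: canary latency is "hot"
print(grid_color(950, 940))   # green: throughput correlates
print(grid_color(2.1, 1.2))   # blue: canary load average is "cold"
```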

[Screenshot: canary report showing confidence score and metrics grid]

Keep the Team Informed

With all this deployment activity going on, it’s important to keep the team informed about what’s happening in production. The goal is to make it easy for anyone on the team to know the state of the pipeline and what’s running in prod, without spamming people with messages that end up filtered out of sight. To complement their dashboard app, an XMPP bot is run that sends a message to the team’s chatroom when new code is pushed to a production cluster. The bot sends a message when a deployment starts and when it finishes. The topic of the chatroom contains info about the most recent push event, and the bot maintains a history of pushes that can be accessed by talking to the bot.
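The bot is an internal tool, so the sketch below only shows the shape of the notifications – start and finish messages plus a small push history – with a hypothetical send_to_chatroom function standing in for the actual XMPP plumbing.

```python
# Sketch of deployment notifications: announce push start/finish to a team
# chatroom and keep a history the bot could replay on request.
# send_to_chatroom is a hypothetical stand-in, not Netflix's actual bot.
import datetime

push_history = []


def send_to_chatroom(room: str, message: str) -> None:
    """Stand-in for the XMPP bot's send call."""
    print(f"[{room}] {message}")


def notify(room: str, app: str, ami_id: str, event: str) -> None:
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M UTC")
    message = f"{event}: {app} ({ami_id}) at {stamp}"
    push_history.append(message)   # history available by talking to the bot
    send_to_chatroom(room, message)


notify("api-team", "api-server", "ami-1234abcd", "deployment started")
notify("api-team", "api-server", "ami-1234abcd", "deployment finished")
```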

 

Move Fast, Fail Fast (and Small)

The goal is to provide the ability for any engineer on the team to easily get new functionality running in production while keeping the larger team informed about what’s happening, and without adversely affecting system stability. By developing comprehensive deployment automation and exposing feedback about the pipeline and the code flowing through it, we’ve been able to deploy more easily and address problems earlier in the deploy cycle. Failing on a few canary machines is far superior to having a systemic failure across an entire fleet of servers.

Interesting stats

  • 50 million subscribers
  • Hours spent by Netflix members watching streaming video over high-speed internet: 2 billion
  • Number of distribution centers: 58
  • Netflix’s front-end services run on 500 to 1,000 Linux-based Tomcat Java servers and NGINX web servers, backed by hundreds of Amazon Simple Storage Service (S3) and NoSQL Cassandra database servers, with Memcached providing high-performance, distributed in-memory caching. All of this, and more besides, is distributed across three Amazon Web Services availability zones.

It’s safe to say that Netflix may be the biggest cloud app of all. They’re also reinventing the idea of high-availability architecture, and everything seems to be working in their favour.

Many parts of the Netflix infrastructure are already open source and available on GitHub. It is Netflix’s goal to eventually release all of its infrastructure for other companies to benefit from.