Shipping code methods – Facebook

Choosing the right deployment/shipping method when starting a new, big project is always going to be a tedious process. There are so many ways to get your code pushed to production. You start researching, look over the new tools, draw diagrams and flow charts, and take into consideration scalability, high availability, and the many other factors that should be present in every architectural design. Along the way you may also discover ten other reasons that could cripple the chosen deployment method. I’m not going to go into these details now; there will be a separate topic for this.

Research is the most important part of, well, anything, simply because you don’t want to reinvent the wheel and you’d like to benefit from other people’s success. There is a proverb in my country which, translated into English, would sound something like this: You don’t learn the profession, you steal it. The applicability of this proverb in IT is quite interesting: it is the same idea as not reinventing the wheel. Of course you need to learn the basics, and of course you need to have that thing that makes you excited when you see a cool code snippet or a stupendous tool.

As soon as you are good enough to recognise the quality in other people’s work, you may want to start from there. It’s the basics of our own DNA programming to learn from our ancestors, and it’s how we got far enough to be able to do these wibbly-wobbly things. The alternative of going down this road by yourself is not bad; it’s actually better. But you need time, you may need better skills, you need to find the limits and you need to break them. If you can do that, that’s awesome, but for the moment, let’s take a look at how others are handling the same kind of problems.

Facebook Deployment and Push Process

The talk “Pushing Millions of Lines of Code Five Days a Week” gives us a very good example of what can be done when you have to continuously push code to production.

Facebook releases code once a week, but they push new code daily. While few companies operate on this scale, there is a lot we can learn from their “push culture” and the tools they use to promote a bug-free deploy process.

Culture

There was a large focus on push culture. While tools are great and can help, they don’t mean a thing if there isn’t a company culture to support a bug-free and painless deploy process. Creating this culture early on is important, and that includes ensuring that developers are on the hook for the changes they make. The further the developer is from the end product and release date, the less likely they are to be held accountable for their code. Pushers at Facebook ensure that a developer is responsible for their code that day, that week, and even after they move to another group within Facebook.

This is pretty standard and common sense in my opinion. I mean, every person is responsible for the code they’ve written, the code they’ve tested, and the code they’re pushing to production.

Also, there is no big fat layer of QA, managers and adult supervision.

Tools

For the geekier viewer the tools part is always fun – we love gadgets! Facebook has built a suite of monitoring tools they use internally to ease the push process.  For the most part there are free, open-source and DIY alternatives to a lot of these tools.  Facebook is at a much larger scale than many of us, so in my Take Away section I am going to focus more on what you can do to implement these tools and the culture they promote.

  • IRC Bots are used because all internal communication is done through IRC, so it’s important to provide automated answers to some of the most commonly asked questions and avoid overwhelming the push engineers. For instance, if a developer wants to know, “will my revision be in this push?” they need only ask an IRC bot.
  • Automated Tests are critical because engineers are responsible for writing unit and Selenium tests to support their code. These tests are automated and they are used to determine if a revision is ready for release. Watir and Selenium are used for UI tests.
  • Shadow Branches are pre-live versions of Facebook. Internally, facebook.com points to latest.facebook.com – a version of Facebook that is ready for release this week. These shadow branches ensure the “you are always testing” mentality and help ferret out bugs before the push begins and during the first phases of the deploy process.
  • Error Tracking is controlled via a data-centric internal dashboard that includes information on where the error happened and which developer is to blame for that line of code.
  • GateKeeper is a fancy tool that developers use to roll out features to subsets of users. Subsets can include percentages of the population, males or females only, people with or without a certain affiliation or group membership, and more.
  • Perflab is a tool for tracking performance changes on code branches and revisions and the impact your change has on the site.
  • HipHop is a PHP compiler that Facebook has built internally. It generates highly optimized C++ from the PHP code and compiles it, in about 10 minutes, into a single binary of about 1 GB, which is Facebook in its entirety, rather than a collection of PHP files. It decreases server load by about 50%. Open-source!
  • BitTorrent has been modified to distribute the HipHop binary to all clusters and within clusters of machines. It is used to roll out changes quickly and efficiently. Facebook pushes its 1 GB binary of compiled PHP to its tens of thousands of servers this way and can roll Facebook.com in about 15 minutes on all machines. Incredible!
  • Push Karma is a tool that is used to privately track how reliable an individual developer is. It was emphasized that this is not a public shaming tool; it is simply used to determine the chance that an individual developer’s changes might mess things up. If someone is more prone to bugs, then a last-minute change from them is less likely to be accepted into the release. It is a great way of putting accountability on the engineers to make sure their changes make it live OK and don’t cause the build engineers any pain.
  • SVN/Git are the SCMs of choice. The central trunk repository is SVN, developers use Git for daily work and feature development.

Take Away

The main take away is that if you want to focus on a bug-free push culture it’s important to get developers on the hook for their changes and bring them closer to the final product.  The more QA and administration hurdles a piece of code has to go over, the less likely this is to happen.  If a developer can see and use his or her changes they are more likely to feel responsible for them and they are more likely to catch bugs. The most important tools that promote this mentality are the shadow branch, error tracking and push karma. The other three tools that I believe are of great importance to a web-application company are TDD/BDD, GateKeeper and Performance Monitoring.

The Shadow-Branch

The easiest way to implement a shadow branch is to have a staging server. If you don’t already do this, you should. Staging servers are a great way to ensure that the code you are releasing works in an environment that mimics production as closely as possible. This usually also means using a live or replicated version of a live dataset, an external URL (even if it’s only internally available), and replicating things like content-delivery networks and user access patterns.

Error Tracking

An error tracking tool is also critical, and it’s important to dedicate someone on your team to track these errors.  If you can’t automate notifications so that individual developers are notified of a bug, it’s important to designate one person the task of monitoring errors on your site.  If you don’t know there are errors, you aren’t going to fix bugs.
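
As a rough illustration of the routing idea, here is a minimal Python sketch. It assumes you already have some mapping from a file and line to the developer who last touched it (for example, built from your SCM's blame output); the function name `group_errors_by_owner`, the `owners` mapping and the `"on-call"` fallback are all hypothetical names of mine, not anything from Facebook's dashboard.

```python
from collections import defaultdict


def group_errors_by_owner(errors, owners, fallback="on-call"):
    """Group error reports by the developer responsible for the failing line.

    errors:   iterable of (file, line, message) tuples
    owners:   dict mapping (file, line) -> developer name (assumed to exist)
    fallback: the one designated person who gets errors nobody owns
    """
    grouped = defaultdict(list)
    for file, line, message in errors:
        owner = owners.get((file, line), fallback)
        grouped[owner].append(f"{file}:{line}: {message}")
    return dict(grouped)


if __name__ == "__main__":
    owners = {("feed.py", 42): "alice"}
    errors = [
        ("feed.py", 42, "KeyError: 'user'"),
        ("ads.py", 7, "TimeoutError"),
    ]
    for owner, reports in group_errors_by_owner(errors, owners).items():
        print(owner, reports)
```

Errors with a known owner go straight to that developer, and everything else lands with the designated person, so nothing goes unseen.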

Push Karma

Push Karma is not as critical and I wouldn’t even recommend building a tool to automate this process, but it’s important that someone is aware of who is introducing bugs and who is responsible for them.  Some adult person needs to be responsible for determining if a developer has a higher chance of creating bugs, and if they have gone through the appropriate code-review processes and test driven development processes.  This ensures that last minute changes and large releases are smoother. It’s important to not publicly or privately shame people. I think creating a stressful environment around bugs is not healthy. You want developers that are creative, happy and not stressed about introducing bugs. But it is important to know who needs a little extra code review and a little bit more time to release a feature.

Test/Behavior Driven Development

I think the test automation speaks for itself.  If you are developing a large web application you cannot manually test everything and you cannot determine all of the side-effects of your code.  Test/Behavior Driven Development means you are ensuring your code will work now and later down the road when someone else makes a change. It’s just common sense.

GateKeeper

I believe it’s good to build features with an off switch: if something goes wrong, you want to be able to turn things off. Especially with the prevalence of companies hosting their applications in the cloud, it’s important to think about how you can keep your site running when code or an external service doesn’t work. Every feature should have an off switch that doesn’t require bringing down your website and rebooting.
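
To make the off switch and the percentage rollout concrete, here is a small Python sketch of my own. It is not GateKeeper; the `FeatureGate` class and its methods are hypothetical, and the rules live in memory, whereas a real setup would keep them in a shared store so they can change without a redeploy.

```python
import hashlib


class FeatureGate:
    """Toy in-memory feature gate (hypothetical, for illustration only)."""

    def __init__(self):
        # feature name -> {"enabled": bool, "percent": int in 0..100}
        self._rules = {}

    def define(self, feature, percent=0, enabled=True):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rules[feature] = {"enabled": enabled, "percent": percent}

    def kill(self, feature):
        """Off switch: disable the feature for everyone."""
        self._rules[feature]["enabled"] = False

    def is_enabled(self, feature, user_id):
        rule = self._rules.get(feature)
        if rule is None or not rule["enabled"]:
            return False  # unknown or switched-off features stay off
        # Hash feature + user so each user lands in a stable bucket 0..99.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < rule["percent"]


if __name__ == "__main__":
    gate = FeatureGate()
    gate.define("new_feed", percent=10)
    exposed = sum(gate.is_enabled("new_feed", uid) for uid in range(1000))
    print(f"{exposed} of 1000 users see new_feed")
    gate.kill("new_feed")
    print(gate.is_enabled("new_feed", 1))
```

Hashing the user id instead of picking randomly keeps each user's experience stable between requests, and `kill` turns the feature off without touching the deployed code.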

Performance Monitoring

Studies show that small (under one second) degradations in page performance can result in users walking away. If a user is on your website for entertainment, in other words for something other than checking their bank account, you need to ensure that your site is performant. Make sure you either test performance or monitor it with tools like NewRelic.
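
If you have no monitoring tool in place yet, even a crude timing check is better than nothing. The Python sketch below is my own illustration and no substitute for a real monitoring product; the `timed` decorator, its `budget_seconds` parameter and the `report` callback are hypothetical names.

```python
import time
from functools import wraps


def timed(budget_seconds, report=print):
    """Report any call to the decorated function that exceeds its time budget."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised.
                elapsed = time.perf_counter() - start
                if elapsed > budget_seconds:
                    report(
                        f"{func.__name__} took {elapsed:.3f}s "
                        f"(budget {budget_seconds:.3f}s)"
                    )

        return wrapper

    return decorator


@timed(budget_seconds=0.01)
def render_page():
    time.sleep(0.05)  # stand-in for slow page rendering
    return "<html></html>"


if __name__ == "__main__":
    render_page()
```

In practice you would pass a `report` function that feeds your logging or alerting instead of printing.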

Resourcing for projects

Resourcing for projects is purely voluntary. Engineers decide which ones sound interesting to work on. A PM lobbies a group of engineers and tries to get them excited about their ideas. An engineer then talks to their manager and says, “I’d like to work on these 5 things this week.” The engineering manager mostly leaves engineers’ preferences alone, but may sometimes ask that certain tasks get done first.

Code ownership

Engineers handle entire features themselves: front-end JavaScript, back-end database code, and everything in between. If they want help from a designer (there is a limited staff of dedicated designers available), they need to get a designer interested enough in their project to take it on. The same goes for help from an architect. But in general, the expectation is that engineers will handle everything they need themselves.

As a developer you will shepherd your changes from the time you check them into trunk to the time you release them to your mom.

Code review

All changes are reviewed by at least one person, and the system is easy for anyone else to look at and review your code even if you don’t invite them to. It would take intentionally malicious behaviour to get un-reviewed code in.

 

Interesting facts

  • Product manager to engineer ratio is roughly 1-to-7 or 1-to-10
  • All engineers go through a 4 to 6 week “Boot Camp” training where they learn the Facebook system by fixing bugs and listening to lectures given by more senior/tenured engineers. An estimated 10% of each boot camp’s trainee class don’t make it and are counselled out of the organisation.
  • After boot camp, all engineers get access to live DB (comes with standard lecture about “with great power comes great responsibility” and a clear list of “fire-able offences”, e.g., sharing private user data)
  • Any engineer can modify any part of FB’s code base and check-in at-will
  • Arguments about whether or not a feature idea is worth doing generally get resolved by just spending a week implementing it and then testing it on a sample of users, e.g., 1% of Nevada users.
  • Engineers generally want to work on infrastructure, scalability and “hard problems”; that’s where all the prestige is. It can be hard to get engineers excited about working on front-end projects and user interfaces. This is the opposite of what you find in some consumer businesses, where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.” At Facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimisations, etc. are the juicy projects that engineers want.
  • Every employee at an office or connected via VPN is using a version of the site that includes all the changes that are next in line to go out. This version is updated frequently and is usually 1-12 hours ahead of what the world sees. All employees are strongly encouraged to report any bugs they see, and these are acted upon very quickly.
  • Most engineers are capable of writing bug-free code; it’s just that they don’t have an incentive to do so at most companies. When there’s a QA department, it’s easy to just throw code over to them to find the errors. Facebook has QAs assigned, but there is no dedicated QA department. That is only part of the picture, though: there is also automated testing, including push-block tests scheduled in the Continuous Integration process.
  • By default all code commits get packaged into weekly releases (Tuesdays)
  • With extra effort, changes can go out same day
  • Tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
  • Engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
  • Ops team runs code releases by gradually rolling code. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like internal tools, video upload hosts, etc.
  • Facebook has around 60,000 servers
  • During the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention.  Not responding to ops team results in public shaming.
  • People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for them).
  • Getting blamed will NOT get you fired. They are extremely forgiving in this respect, and most of the senior engineers have pushed at least one horrible thing.
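
The gradual, concentric release described in the list above (p1, p2, p3) can be sketched in a few lines of Python. This is only my own minimal illustration of the idea, not Facebook's tooling: `phased_rollout` and the `deploy` and `healthy` callbacks are hypothetical, and what "healthy" means (error rates, performance, and so on) is left up to you.

```python
def phased_rollout(phases, deploy, healthy):
    """Deploy to each phase in order; stop at the first unhealthy phase.

    Returns the list of phases that were deployed and passed the health check.
    """
    completed = []
    for phase in phases:
        deploy(phase)
        if not healthy(phase):
            break  # do not widen the release past a failing phase
        completed.append(phase)
    return completed


if __name__ == "__main__":
    phases = ["p1 internal", "p2 small external", "p3 full external"]
    done = phased_rollout(
        phases,
        deploy=lambda phase: print(f"deploying to {phase}"),
        healthy=lambda phase: True,  # replace with real checks
    )
    print("completed:", done)
```

The point of ordering the phases from smallest to largest audience is that a bad change is caught while only employees, or a small slice of users, can see it.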