Software engineering probably invents more buzzwords in a year than any other industry. Some of them are legitimate, while others are just fluff. But one school of thought that has crossed the chasm of the hype cycle and established itself as a solid practice and discipline is Site Reliability Engineering, or SRE.
In the early days of software development, there was a development team whose main mandate was to ship as many features as fast as possible. They would toss the code over the wall of their silo into the adjacent silo of the operations team, whose mandate was to keep the code running in production. The operations team never had visibility into, or expertise in, the code being shipped. Then came DevOps, which aimed to break down the silos between the development and operations teams and to automate the manual operations work.
SRE can be seen as a particular case of DevOps in which the focus of a Site Reliability Engineer is to make sure that a service is reliably up and running and that its downtime stays below an agreed threshold. SRE came into existence at none other than Google in 2003, when Ben Treynor was given command of a team of seven software engineers to run the operations of a production environment. In his own words, Ben describes SRE simply as “what happens when a software engineer is tasked with what used to be called operations.”
The task for Ben’s team was to make sure that the various large distributed systems at Google ran smoothly and efficiently while teams continued to introduce new features. To solve this problem, they had to come up with new paradigms that took a software engineering view of operations issues. Basically, they are like a Formula 1 pit crew that has to change the tyres while the car is still doing 100 mph.
An SRE team applies scalable software solutions and is responsible for latency, performance, availability, capacity planning, change management, monitoring, efficiency, and emergency response when the system fails.
So what makes a good Site Reliability Engineer? The right answer is a mix of software engineering chops and system administration experience. A roughly 50/50 split between those two skill sets is a good mix. An SRE should be adept at algorithms, data structures, and programming languages in order to write software that makes operations run smoothly.
SREs use a concept called error budgets, which gives the development team a permissible limit of downtime and lets them decide how they want to spend it. For example, if a service has an uptime target of 99%, then it has an error budget of 1%. If the service is up most of the time and stays within that budget, the SRE team gives the development team leeway to launch whatever new features they want. But if the service is jittery and marred by unexpected downtime that uses up the budget, the SRE team tells the development team to fix the bugs or errors first before shipping any new features.
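To make the arithmetic concrete, here is a minimal Python sketch of an error budget check. The function names, the 30-day window, and the simple rule "ship only while downtime is below the budget" are assumptions made for illustration; real teams usually define their own windows and policies.

```python
# Hypothetical helpers for illustration; not taken from any SRE tool.

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day window


def error_budget_minutes(uptime_target_percent: float, window_minutes: int) -> float:
    """Return the downtime allowed in the window, in minutes.

    The error budget is whatever the uptime target leaves over:
    a 99% target leaves 1% of the window as permissible downtime.
    """
    if not 0.0 <= uptime_target_percent <= 100.0:
        raise ValueError("uptime target must be between 0 and 100")
    return window_minutes * (100.0 - uptime_target_percent) / 100.0


def can_ship_features(
    uptime_target_percent: float, window_minutes: int, downtime_minutes: float
) -> bool:
    """Assumed policy: new features may ship while budget remains."""
    budget = error_budget_minutes(uptime_target_percent, window_minutes)
    return downtime_minutes < budget


if __name__ == "__main__":
    budget = error_budget_minutes(99.0, MINUTES_PER_30_DAYS)
    print(f"Budget for a 99% target over 30 days: {budget} minutes")
    print("100 min of downtime, can ship?", can_ship_features(99.0, MINUTES_PER_30_DAYS, 100.0))
    print("500 min of downtime, can ship?", can_ship_features(99.0, MINUTES_PER_30_DAYS, 500.0))
```

Under these assumptions, a 99% target over 30 days allows 432 minutes (about 7.2 hours) of downtime. A service that has been down for 100 minutes still has budget to spend on new launches, while one that has been down for 500 minutes has overspent and should focus on fixes first.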
Site Reliability Engineering is a much sought-after function within a company, and many companies now have Site Reliability Engineers. They are the true behind-the-scenes heroes who put out the fires of system failure and keep things up and running.