IT roles have evolved significantly over the past decade. With the mass adoption of DevOps practices and the explosion of public cloud services, some previously unheard-of job roles have been springing up. These include cloud architects, DevOps engineers and several other variations. One particularly interesting role, which is derived from a working methodology and even a philosophy, is the Site Reliability Engineer (or SRE for short). As with many of the newer titles, this role varies from organization to organization and can range from something close to an operations engineer to something that could be considered a systems developer. And that’s quite a wide range.
The original Site Reliability Engineering term comes from Google’s internal engineering teams and their efforts to ensure the highest availability and reliability of the vast and complex systems that the web giant is running. But to actually understand the philosophy of the SRE approach, let’s first establish the traditional operational model for IT systems support. IT departments can differ vastly from one organization and industry to another, and the SRE role is most often associated with organizations that create applications and run infrastructure themselves. Still, a lot of the principles are applicable in a wider scope, especially since the ways of managing infrastructure have been evolving significantly over the years.
The typical support organization
Imagine your typical enterprise organization. It might be developing IT products itself, or might just be using IT systems to fulfill its main business objectives. It follows ITIL and is highly siloed. On the infrastructure side it might have an L1 support team (we’re talking about infrastructure support here, not the end-user service desk) that watches monitoring dashboards, reacts to the alerts it sees in the monitoring system, patches servers and occasionally receives requests or tickets from users of the supported systems when something doesn’t work.
The team’s main goal is to keep the lights green, fix minor issues and, when necessary, escalate to L2 support teams, which are typically more specialized. The L2 teams work on these more complex problems using deeper skills in a specific area, and might occasionally escalate to an even more advanced tier of support personnel.
In the end, the whole support organization works on keeping the systems running and keeping their ticket queue clean. Larger changes are treated as projects and are usually performed by other teams, having distinct lines of responsibility.
This highly structured approach certainly has some appeal. But it also leaves room for improvement, especially when it comes to continuous infrastructure improvement and employee growth, and even more so in these times driven by public cloud, containers and infrastructure as code (IaC).
SRE principles and ways of working
To dive deeper into SRE, let’s look at its seven main principles, originally derived from the SRE working model at Google: embracing risk, service level objectives, toil elimination, monitoring, release engineering, automation and simplicity.
With the traditional approach of “keeping the lights on”, embracing risk is definitely not at the forefront of the engineer’s agenda. And to be clear, we are not talking about spontaneous unapproved changes in production because someone “felt like it was the right time and thing to do something”. We’re talking about a “good enough” service level, which allows you to provide the service to your internal or external customers. At some point, further reliability work brings diminishing returns, and the remaining risk is one that can be tolerated in the current circumstances.
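To make the “good enough” idea more tangible: SRE practice commonly expresses tolerated risk as an error budget, meaning the amount of unavailability that a given availability target still permits. The sketch below is only an illustration of that arithmetic. The 30-day window, the example targets and the function name `allowed_downtime_minutes` are assumptions made for this example.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime that an availability target permits in the window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - target)


# Hypothetical targets; each extra "nine" shrinks the budget tenfold.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability allows {minutes:.1f} minutes of downtime per 30 days")
```

Each additional nine cuts the tolerated downtime by a factor of ten, which is where the diminishing returns come from: the cost of reaching the next level rises while the benefit the customer notices shrinks.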
As we saw in the example support model, the teams are often motivated to resolve tickets and improve uptime. However, this is often not the right set of metrics. Service level objectives (SLOs) are targets tied to customer satisfaction and based on service level indicators (SLIs). Yes, uptime numbers are important. But it is just as important to focus on the right ones and to have mitigations in place. A single server being down for an hour might be utterly unimportant if only a single user was directed to it before the load balancer detected the failure and redirected that user to a different backend. However, if everything seems green but users have to wait 5 seconds each time they click a button, that’s going to hurt the user experience significantly more.
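The “everything is green but the users are waiting” scenario can be sketched with two indicators computed over the same requests. This is a minimal illustration, not a real monitoring setup: the `Request` type, the sample data, the 1-second threshold and the 95% objective are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool          # did the request succeed?
    latency_s: float  # how long the user waited, in seconds


def availability_sli(requests):
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_s):
    """Fraction of requests that succeeded within the latency threshold."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.ok and r.latency_s <= threshold_s for r in requests) / len(requests)


# Hypothetical sample: nothing fails, but most clicks take 5 seconds.
sample = [Request(True, 5.0)] * 8 + [Request(True, 0.3)] * 2

LATENCY_SLO = 0.95  # assumed objective: 95% of requests within 1 second

print("availability SLI:", availability_sli(sample))
print("latency SLI:", latency_sli(sample, threshold_s=1.0))
print("latency SLO met:", latency_sli(sample, threshold_s=1.0) >= LATENCY_SLO)
```

An uptime-only view reports this sample as perfect, while the latency indicator shows that most users had a poor experience. That is the argument for choosing indicators based on what the customer actually feels.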
The next, closely related principle is monitoring. This is nothing new: knowing which things are important and what actually needs monitoring can significantly reduce noise, and it helps you both react to incidents effectively (ideally before the customer even notices) and plan for improvements. Monitoring, of course, has to be aligned with the SLOs, as the metrics that matter for large distributed systems (the ones SREs are typically associated with) are often different from those of a small 3-tier line-of-business (LOB) application with 50 users.
Eliminating toil is one of the most important principles, yet it is counterintuitive from the point of view of the traditional operational model discussed above. An SRE should not be buried under support tasks all of the time. The work should be organized so that at least half of the time can be spent on actual engineering that improves the systems.
That instantly makes us think about the differences compared to the traditional support organization. First of all, first-level systems support personnel usually spend all of their time supporting these systems. Having the same people both operate the environment and improve it allows for quicker reaction and a shorter communication chain. Secondly, the role of an SRE instantly becomes more attractive to potential employees, who know that genuinely interesting engineering tasks are on the table at least half of the time.
That is a good thing, but a couple of less enthusiastic thoughts might also come to mind. First of all, as the engineers must be capable not only of handling incidents but of actually developing these complex systems, they have to be really competent (read: expensive). The other thought is how you find engineering talent with such extensive infrastructure skills in so many areas. The answer to both of these concerns is our next principle: automation.
The time has come when sysadmins simply must have some basic coding skills. Managing modern infrastructure manually is very inefficient, and skills like Bash or PowerShell scripting, configuration management and IaC tools, interacting with APIs, and Git should definitely be in the infrastructure engineer’s toolbelt. For the business it means that fewer, more skilled people are able to do as much or even more work than a larger number of less skilled people. In addition, they can more easily come up with potential improvements, like automatic alert remediation or more effective scaling methods.
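As a rough idea of what automatic alert remediation can look like, the sketch below maps known alert types to automated actions and escalates everything else to a person. It is deliberately simplified. The alert names, the fields of the alert dictionary and the remediation functions are hypothetical and not tied to any real monitoring product; real actions would call your own tooling instead of returning a description.

```python
from typing import Callable, Dict


# Hypothetical remediation actions; here they only describe what they would do.
def restart_service(alert: dict) -> str:
    return f"restart {alert['service']} on {alert['host']}"


def clean_temp_files(alert: dict) -> str:
    return f"clean temporary files on {alert['host']}"


# Assumed alert names mapped to the automated fix that is known to be safe.
RUNBOOK: Dict[str, Callable[[dict], str]] = {
    "service_down": restart_service,
    "disk_almost_full": clean_temp_files,
}


def handle_alert(alert: dict) -> str:
    """Return the action taken, or escalate when no automated fix is known."""
    action = RUNBOOK.get(alert["name"])
    if action is None:
        return f"escalate {alert['name']} to a human"
    return action(alert)


print(handle_alert({"name": "service_down", "service": "nginx", "host": "web-01"}))
print(handle_alert({"name": "certificate_expiring", "host": "web-01"}))
```

The design choice worth noting is the explicit fallback: only alerts with a known, safe fix are handled automatically, and anything unfamiliar still reaches an engineer.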
As for the broad spectrum of competencies that such engineers must possess, that, of course, is true. However, we live in times when we can choose. We can set up our datacenter ourselves, with physical networking gear, underlay and overlay networking, physical servers, virtualization, containerization, SANs, vSANs, identity management and all the other interesting things. Or we can use a cloud provider, where these services are abstracted to a level at which a single person can actually know them well enough to deploy and maintain efficient infrastructure without even having to know what NVMe-oF means.
But automation alone is a little too broad. After all, writing Bash scripts to automate things was the daily activity of a Unix engineer 30 years ago. To bring this into this century, we need release engineering. Or, to put it briefly: bring infrastructure development closer to software development as a discipline.
That includes using infrastructure as code (declarative definitions) where possible to deploy and configure infrastructure, storing all code and documentation in Git repositories, writing unit tests and building pipelines (sounds a lot like GitOps, right?). Even more importantly, changes and fixes to the infrastructure should also be made by changing the code and creating a new release, not by going to the AWS console or logging in to your Linux machine to make a configuration change.
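The declarative idea can be illustrated with a toy example: the desired state lives in the repository, and a tool compares it with what is actually running to work out which changes are needed. The sketch below shows that general pattern only and is not the API of any specific IaC tool. The `plan` function, the resource names and their attributes are made up for illustration.

```python
def plan(desired: dict, actual: dict) -> list:
    """List the changes needed to move the actual state to the desired state."""
    changes = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            changes.append(("create", name))
        elif actual[name] != spec:
            changes.append(("update", name))
    # Anything running that is not described in code should be removed.
    for name in sorted(actual.keys() - desired.keys()):
        changes.append(("delete", name))
    return changes


# Hypothetical desired state, as it would be stored in a Git repository.
desired = {
    "web": {"size": "small", "count": 3},
    "db": {"size": "large", "count": 1},
}

# Hypothetical actual state, including a resource someone created by hand.
actual = {
    "web": {"size": "small", "count": 2},
    "cache": {"size": "small", "count": 1},
}

for action, name in plan(desired, actual):
    print(action, name)
```

In this model a manual change made in a console shows up as drift, like the hand-made `cache` resource here. That drift is why fixes should go through the code and a new release rather than around them.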
The bottom line
From these principles we can see that there are a few things that the SRE way of working promotes. The first is tighter collaboration between teams. Meaningful SLIs and SLOs allow different teams to work more closely together towards a shared goal, rather than having tunnel vision. And that’s significantly easier to achieve in the current cloud era, where most companies don’t even have dedicated SAN or virtualization engineers with very deep, but highly specialized knowledge.
Another thing is a mutual understanding between management, engineering and development that the IT system is evolving and so should its infrastructure. While overengineering things is wasteful, it is important to ensure that the people building and running the infrastructure can improve it as the requirements change. Having the right way of working is just as important, but it’s only one of several attributes.
One of the problematic aspects of embracing SRE is the change it requires from the teams. Growing skills and changing a familiar and convenient way of working requires a plan, structure and motivation. And even in ideal scenarios that is not simple to achieve.
Again, as with our previous article about DevOps, there are just too many non-technical traits that enable SRE practices, and no, focusing solely on automation in your operational process is not going to make the team an SRE powerhouse. While at first it might show some improvements, after some time you’ll probably start seeing your unicorn for what it is – a horse with a carrot glued to its forehead.