WHAT IS SITE RELIABILITY ENGINEERING (SRE)?
Site reliability engineering (SRE) is what results when a software engineer is asked to design an operations function: the work an operations team has historically done is carried out instead by engineers who are predisposed to replace human labour with automation (Treynor and Murphy, n.d.). More broadly, SRE is a set of principles and practices that applies software engineering to infrastructure and operations problems. Its main goal is to create scalable and highly reliable software systems. SRE is closely related to DevOps, a set of practices that combines software development and IT operations.
WHY USE SITE RELIABILITY ENGINEERING (SRE)?
Site reliability engineers are responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the services they support. They bridge development and operations by applying a software engineering mindset to system administration topics. They split their time between operations and on-call duties on the one hand and developing systems and software that increase site reliability and performance on the other. The principles that guide this work include:
- Automating or eliminating anything repetitive, provided it is cost-effective to do so.
- Avoiding the pursuit of more reliability than is strictly necessary. Defining what is necessary is a practice in itself.
- Designing systems with a bias toward reducing risks to availability, latency, and efficiency.
- Observability, meaning the ability to ask arbitrary questions about your system without having to know ahead of time what you would want to ask.
- A focus on automation, system design, and improvements to system resilience.
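The principle of not pursuing more reliability than necessary can be made concrete with a little arithmetic. The sketch below converts an availability target into the downtime it permits over a period. The function name `allowed_downtime_minutes` and the 30-day window are illustrative assumptions, not taken from the cited sources.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an availability target permits over a period.

    availability_target is a fraction between 0 and 1 (for example 0.999).
    The name and the 30-day default are hypothetical, chosen for illustration.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    # Compare how much downtime each additional "nine" leaves available.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} availability allows {minutes:.1f} minutes per 30 days")
```

Each additional nine cuts the permitted downtime tenfold, from roughly 432 minutes at 99% to roughly 4.3 minutes at 99.99% over 30 days, which is why choosing the target deliberately matters.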
HOW DOES SITE RELIABILITY ENGINEERING (SRE) WORK?
SREs work to maintain high reliability and availability for software applications and respond to incidents as they occur. To aid these efforts, an SRE streamlines and automates as many operations as possible to remove opportunities for human error (Doerrfeld, 2022). A hallmark SRE goal is to reduce “toil,” or “tedious actions that have no enduring value” (Doerrfeld, 2022). Site reliability engineers spend much of their time eliminating toil by coding automation and configuring internal tools to better interact with software infrastructure.
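As a minimal sketch of what automating toil can look like, the example below replaces a hypothetical manual chore, periodically clearing out old files from a directory, with a small function. The task, the function name `remove_stale_files`, and its parameters are assumptions chosen for illustration and do not come from the cited sources.

```python
import time
from pathlib import Path


def remove_stale_files(directory: Path, max_age_days: float, dry_run: bool = True) -> list[Path]:
    """Find files in `directory` older than `max_age_days` and optionally delete them.

    Hypothetical helper: with dry_run=True (the default) nothing is deleted,
    so the result can be reviewed before the automation is trusted.
    """
    cutoff = time.time() - max_age_days * 86400  # seconds per day
    stale = [
        path
        for path in sorted(directory.iterdir())
        if path.is_file() and path.stat().st_mtime < cutoff
    ]
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Defaulting to a dry run is a deliberate choice here: it lets an operator confirm what the automation would do before it acts, which limits the chance of the script itself causing an incident.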
Doerrfeld, B. (2022). Day in the Life of a Site Reliability Engineer (SRE). DevOps. devops.com/day-in-the-life-of-a-site-reliability-engineer-sre/
Treynor, B. and Murphy, N. (n.d.). In Conversation. Google Site Reliability Engineering. http://www.sre.google/in-conversation/