SRE Netflix at SRECon

190 Countries and 5 CORE SREs by Jonah Horowitz

How does Netflix scale SRE? How do we manage over 70 million customers around the world without a 24/7 operations center? With tens of thousands of Linux instances in a distributed system architecture, and thousands of daily production changes, it’s an environment that’s both challenging and exciting. Netflix had to change how our teams run applications in production and adopt a true DevOps culture. We also learned how to give teams the tools they need to be successful. In this talk you’ll hear from one of Netflix’s CORE SREs about the challenges we’ve learned from and tools we use to keep everything running. Throughout the talk we’ll discuss how Netflix views the role of the SRE and how it differs from the traditional Systems Administrator role. It also explains why freedom and responsibility are key, trust is required, and chaos is your friend.

Performance Checklists for SREs by Brendan Gregg

There’s limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there’s little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.

Principles of Chaos Engineering by Casey Rosenthal

Distributed systems create threats to resilience that are not addressed by classical approaches to development and testing. We’ve passed the point where individual humans can reasonably navigate these systems at scale. As we embrace a world that emphasizes automation and engineering over architecting, we left gaps open in our understanding of complex systems. Chaos Engineering is a new discipline within Software Engineering, building confidence in the behavior of distributed systems at scale. SREs and dedicated practitioners adopt Chaos Engineering as a practical tool for improving resiliency. An explicit, empirical approach provides a formal framework for adopting, implementing, and measuring the success of a Chaos Engineering program. Additional best practices define an ideal implementation, establishing the gold standard for this nascent discipline. Chaos Engineering isn’t the process of creating chaos, but rather surfacing chaos that is inherent in the behavior of these systems at scale. By focusing on high level business metric, we side step understanding *how* a particular model works in order to identify *whether* it work under realistic, turbulent conditions in production. This fills a gap that arms SREs with a better, holistic understanding of the system’s behavior.