{"id":525,"date":"2021-12-28T15:33:28","date_gmt":"2021-12-28T15:33:28","guid":{"rendered":"https:\/\/fde.cat\/?p=525"},"modified":"2021-12-28T15:33:28","modified_gmt":"2021-12-28T15:33:28","slug":"sre-netflix-at-srecon","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/12\/28\/sre-netflix-at-srecon\/","title":{"rendered":"SRE Netflix at SRECon"},"content":{"rendered":"\n<p><a rel=\"noreferrer noopener\" href=\"https:\/\/www.youtube.com\/watch?v=UdCEfUG6dBI\" data-type=\"URL\" data-id=\"https:\/\/www.youtube.com\/watch?v=UdCEfUG6dBI\" target=\"_blank\">190 Countries and 5 CORE SREs<\/a> by Jonah Horowitz<\/p>\n\n\n\n<p>How does Netflix scale SRE? How do we manage over 70 million customers around the world without a 24\/7 operations center? With tens of thousands of Linux instances in a distributed system architecture, and thousands of daily production changes, it&#8217;s an environment that&#8217;s both challenging and exciting. Netflix had to change how our teams run applications in production and adopt a true DevOps culture. We also learned how to give teams the tools they need to be successful. In this talk you&#8217;ll hear from one of Netflix&#8217;s CORE SREs about the challenges we&#8217;ve learned from and tools we use to keep everything running. Throughout the talk we&#8217;ll discuss how Netflix views the role of the SRE and how it differs from the traditional Systems Administrator role. It also explains why freedom and responsibility are key, trust is required, and chaos is your friend.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.youtube.com\/watch?v=zxCWXNigDpA\" data-type=\"URL\" data-id=\"https:\/\/www.youtube.com\/watch?v=zxCWXNigDpA\" target=\"_blank\" rel=\"noreferrer noopener\">Performance Checklists for SREs<\/a> by Brendan Gregg<\/p>\n\n\n\n<p>There&#8217;s limited time for performance analysis in the emergency room. 
When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there&#8217;s little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.youtube.com\/watch?v=yi0Hxy9VRCc\" data-type=\"URL\" data-id=\"https:\/\/www.youtube.com\/watch?v=yi0Hxy9VRCc\" target=\"_blank\" rel=\"noreferrer noopener\">Principles of Chaos Engineering<\/a> by Casey Rosenthal<\/p>\n\n\n\n<p>Distributed systems create threats to resilience that are not addressed by classical approaches to development and testing. We\u2019ve passed the point where individual humans can reasonably navigate these systems at scale. As we embrace a world that emphasizes automation and engineering over architecting, we have left gaps open in our understanding of complex systems. Chaos Engineering is a new discipline within Software Engineering, building confidence in the behavior of distributed systems at scale. SREs and dedicated practitioners adopt Chaos Engineering as a practical tool for improving resiliency. An explicit, empirical approach provides a formal framework for adopting, implementing, and measuring the success of a Chaos Engineering program. Additional best practices define an ideal implementation, establishing the gold standard for this nascent discipline. 
Chaos Engineering isn\u2019t the process of creating chaos, but rather of surfacing the chaos that is inherent in the behavior of these systems at scale. By focusing on high-level business metrics, we sidestep understanding *how* a particular model works in order to identify *whether* it works under realistic, turbulent conditions in production. This fills a gap and arms SREs with a better, holistic understanding of the system\u2019s behavior.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>190 Countries and 5 CORE SREs by Jonah Horowitz How does Netflix scale SRE? How do we manage over 70 million customers around the world without a 24\/7 operations center? With tens of thousands of Linux instances in a distributed system architecture, and thousands of daily production changes, it&#8217;s an environment that&#8217;s both challenging and&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/12\/28\/sre-netflix-at-srecon\/\">Continue reading <span class=\"screen-reader-text\">SRE Netflix at SRECon<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"video","meta":{"spay_email":"","footnotes":""},"categories":[1,8],"tags":[16,18,17],"class_list":["post-525","post","type-post","status-publish","format-video","hentry","category-external","category-sre","tag-netflix","tag-sre-teams","tag-srecon","post_format-post-format-video","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":574,"url":"https:\/\/fde.cat\/index.php\/2022\/05\/09\/sre-weekly-issue-321\/","url_meta":{"origin":525,"position":0},"title":"SRE Weekly Issue #321","date":"May 9, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. 
Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set): https:\/\/rootly.com\/demo\/\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":635,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/25\/sre-weekly-issue-340\/","url_meta":{"origin":525,"position":1},"title":"SRE Weekly Issue #340","date":"September 25, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly\u00a0\ud83d\ude92. Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":477,"url":"https:\/\/fde.cat\/index.php\/2021\/09\/27\/sre-weekly-issue-289\/","url_meta":{"origin":525,"position":2},"title":"SRE Weekly Issue #289","date":"September 27, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Semgrep and StackHawk are showing you what\u2019s new with automated security testing on September 30. Grab your spot: https:\/\/sthwk.com\/whats-new-webinar Articles How SREs are unique in their approach to work Here are some things that make SREs a unique breed in software\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":717,"url":"https:\/\/fde.cat\/index.php\/2023\/05\/22\/sre-weekly-issue-373\/","url_meta":{"origin":525,"position":3},"title":"SRE Weekly Issue #373","date":"May 22, 2023","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Rootly is hiring for a Sr. 
Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, Squarespace, accelerate their incident management journey. Looking for previous on-call engineers with a passion for making the world a more reliable place. Learn\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":543,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/21\/sre-weekly-issue-310\/","url_meta":{"origin":525,"position":4},"title":"SRE Weekly Issue #310","date":"February 21, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":855,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/15\/sre-weekly-issue-420\/","url_meta":{"origin":525,"position":5},"title":"SRE Weekly Issue #420","date":"April 15, 2024","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, FireHydrant: FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates. https:\/\/firehydrant.com\/blog\/ai-for-incident-management-is-here\/ 1.0 Launch Retrospective The game Last Epoch launched in February, and they had a rocky start. 
This huge retrospective\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/525","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=525"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/525\/revisions"}],"predecessor-version":[{"id":526,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/525\/revisions\/526"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=525"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=525"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=525"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}