{"id":624,"date":"2022-08-29T16:32:07","date_gmt":"2022-08-29T16:32:07","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/08\/29\/improving-metas-slo-workflows-with-data-annotations\/"},"modified":"2022-08-29T16:32:07","modified_gmt":"2022-08-29T16:32:07","slug":"improving-metas-slo-workflows-with-data-annotations","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/08\/29\/improving-metas-slo-workflows-with-data-annotations\/","title":{"rendered":"Improving Meta\u2019s SLO workflows with data annotations"},"content":{"rendered":"<p><span>When we focus on minimizing errors and downtime here at Meta, we pay close attention to service-level indicators (SLIs) and service-level objectives (SLOs). Consider Instagram, for example. There, SLIs represent metrics from different product surfaces, like the volume of error response codes to certain endpoints, or the number of successful media uploads. Based on those indicators, teams establish SLOs, such as \u201cachieving a certain percentage of successful media uploads over a seven-day period.\u201d If an SLO is violated, an alert is triggered and the respective on-call team is notified to address the issue.<\/span><\/p>\n<p><span>In a<\/span><a href=\"https:\/\/engineering.fb.com\/2021\/12\/13\/production-engineering\/slick\/\" target=\"_blank\" rel=\"noopener\"> <span>previous article<\/span><\/a><span>, we covered<\/span><a href=\"https:\/\/engineering.fb.com\/2021\/12\/13\/production-engineering\/slick\/\" target=\"_blank\" rel=\"noopener\"> <span>SLICK<\/span><\/a><span>, our SLO management platform that\u2019s currently used by many services at Meta. Introducing SLICK allowed us to eliminate the discrepancies in how different teams tracked SLIs\/SLOs. 
Through SLICK, we now have a single source of truth for SLO information, along with various integrations with existing Meta tooling.<\/span><\/p>\n<p><span>Now, by leveraging historical data on SLO violations using SLICK, we\u2019ve made it even easier for engineers to prioritize and address the most pressing reliability issues.<\/span><\/p>\n<p>SLICK (Example, for illustration purposes only)<\/p>\n<h2><span>The difficulty of identifying failure patterns<\/span><\/h2>\n<p><span>While introducing SLICK was a big success, after a time it became evident that just tracking SLOs would not suffice. After discussions with other Meta engineers, the SLICK team discovered that service owners face difficulties in identifying the issues they need to address.<\/span><\/p>\n<p><span>We wanted to make it easier for service owners to follow up on SLO violations and identify failure patterns and areas for improvement. That\u2019s why we wanted SLICK to provide actual guidance on how to improve service reliability. The key to creating these recommendations lies in analyzing past events that led to SLO violations. To draw better conclusions from these events, they should contain meaningful, structured information. Otherwise, people tend to remember the most recent or most interesting outages rather than the most common issues. Without a reliable source of information, teams might prioritize fixing the wrong problems.<\/span><\/p>\n<h2><span>Collaborative data annotations<\/span><\/h2>\n<p><span>Data tools at Meta, including SLICK, support the collaborative data annotations framework. This allows engineers to annotate datasets by attaching metadata (title, content, start time, end time, string key-value pairs, etc.) to them, and to visualize it across all other such tools.<\/span><\/p>\n<p>Data annotation. (Example for illustration purposes only.)<\/p>\n<p><span>Naturally, some teams started to use this capability to annotate events that led to their SLO violations. 
However, there was no established way of looking at annotation data. Furthermore, service owners entered freeform data that wasn\u2019t easily analyzed or categorized. Some people tried to use conventions, like putting a cause for the violation in square brackets in the title and building their own insights on top of this data. But these solutions were very team-specific and could not be applied globally.<\/span><\/p>\n<h2><span>Annotations in Instagram<\/span><\/h2>\n<p><span>Instagram stood out as one of the teams that felt the need for proper annotation workflows for their SLO violations.<\/span><\/p>\n<p><span>Like many teams at Meta, Instagram has a weekly handoff meeting for syncing up on noteworthy events and communicating context to the incoming on-calls. During this meeting, the team addresses major events that affected reliability.<\/span><\/p>\n<p>A mockup of an on-call summary from the Instagram team.<\/p>\n<p><span>On-call weeks are often busy, and by the time the weekly sync meeting occurs, it\u2019s not unusual for engineers to have forgotten what actually happened during specific events. That\u2019s why the team started requiring on-calls to annotate any event that caused an SLO violation shortly after it occurred, encoding this requirement into their tooling and workflows. Then, as part of the weekly on-call handoff checklist, the team ensured that all violations were appropriately tagged.<\/span><\/p>\n<p><span>Over time, people started looking back at this data to identify common themes among past violations. In doing so, the team struggled with the lack of explicit structure. So they resorted to various string processing approaches to try to identify common words or phrases. 
Eventually, this led them to add a few additional fields in the annotation step to enable richer data analysis.<\/span><\/p>\n<p>A mockup of a root-cause analysis from the annotations by the Instagram team.<\/p>\n<p><span>Using these richer annotations, they could generate more useful digests of historical SLO violations to better understand why they\u2019re happening and to focus on key areas. For example, in the past, the Instagram team identified that they were experiencing occasional short-lived blips when talking to downstream databases. Since the blips lasted only a few seconds to a few minutes, they\u2019d often disappeared by the time an on-call received an alert and started investigating.<\/span><\/p>\n<p><span>Investigation rarely led to meaningful root-cause analysis, and on-call fatigue ensued. The team found themselves spending less effort trying to investigate and simply annotated the blips as issues with downstream services. Later on, they were able to use these annotations to identify that these short blips were, in fact, the biggest contributor to Instagram\u2019s overall reliability issues. The team then prioritized a larger project to better understand them. In doing so, the team could identify cases where the underlying infra experienced locality-specific issues. They also identified cases where product teams incorrectly used these downstream services.<\/span><\/p>\n<p><span>After practicing this annotation workflow for several years, the team identified a few elements that were key to its success:<\/span><\/p>\n<ul>\n<li><span>The on-call already has a lot on their plate and doesn\u2019t need more process, so an easy way to create annotations should exist. The number of annotations directly relates to the value you can get, but if creating annotations is too difficult, people just won\u2019t create them.<\/span><\/li>\n<li><span>You must balance the level of depth in annotations with the amount of effort required of an on-call. Ask for too much information and on-calls will quickly burn out.<\/span><\/li>\n<li><span>Team culture must reinforce the value of annotations. Furthermore, you have to actually use your annotations to build value! If you ask people to create annotations but don\u2019t prioritize projects based on them, people won\u2019t see the value in the whole process. As a result, they\u2019ll put less and less effort into creating annotations.<\/span><\/li>\n<\/ul>\n<h2><span>Introducing schema for annotations<\/span><\/h2>\n<p><span>Naturally, as the Instagram team adopted SLICK, they sought to extend the lessons they\u2019d learned collecting annotations to the rest of Meta. The Instagram and SLICK teams worked together to settle on a flexible data structure that allowed other teams to customize their workflow to meet their specific needs. This collaboration also provided common elements to make the annotation process a unified experience.<\/span><\/p>\n<p><span>To achieve this, the team introduced an additional field in the SLI configuration: <\/span><span>annotation_config<\/span><span>. It allows engineers to specify a list of matchers (key-value pairs associated with the annotation) that need to be filled in when an annotation is created. Each matcher can have extra matchers that will need to be filled in, depending on the value of the parent matcher. This structure allows for defining complex hierarchical relations.<\/span><\/p>\n<p>Annotation config data structure.<br \/>\nExample of annotation_config.<\/p>\n<h2><span>Ways to create schematized annotations<\/span><\/h2>\n<p><span>Once the schema was ready, we needed a way to enter data.<\/span><\/p>\n<h3><span>Manual annotations via SLICK CLI<\/span><\/h3>\n<p><span>SLICK\u2019s tooling includes a CLI, so it was natural to add this capability there. This was the very first way to create annotation metadata according to the schema. 
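<\/span><\/p>\n<p><span>As a purely illustrative sketch (the field and value names below are hypothetical, not SLICK\u2019s actual configuration format), an annotation_config with nested matchers might look like this:<\/span><\/p>\n<pre><code># Hypothetical sketch of an annotation_config with hierarchical matchers.\nannotation_config:\n  matchers:\n    - key: root_cause\n      values: [code_change, downstream_dependency, capacity_issue, other]\n      # Child matchers are requested only when the parent matcher\n      # takes a particular value.\n      children:\n        downstream_dependency:\n          - key: dependency_name   # free-form service name\n        code_change:\n          - key: change_id         # identifier of the offending change\n<\/code><\/pre>\n<p><span>Here, selecting \u201cdownstream_dependency\u201d as the root cause would prompt the on-call for one extra field, while other values would prompt for different ones or none at all.<\/span><\/p>\n<p><span>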
The CLI provides a nice interactive experience for people who prefer the terminal to a web interface.<\/span><\/p>\n<p>Annotation creation via interactive mode of SLICK CLI.<\/p>\n<h3><span>Manual annotations via SLICK UI<\/span><\/h3>\n<p><span>Many users prefer to create annotations in the UI because it provides a good visual representation of what they are dealing with. The default annotations UI in SLICK didn\u2019t allow for adding additional metadata to the created annotations, so we had to extend the existing dialog implementation. We also had to implement schema support and dynamically display some of the fields depending on the user\u2019s selection.<\/span><\/p>\n<p>Annotation creation dialog in web UI.<\/p>\n<h3><span>Manual annotations via Workplace bot<\/span><\/h3>\n<p><span>Many SLICK users rely on a Workplace bot that posts notifications about events that led to SLO violations in their Workplace groups. It was already possible to annotate these events right from Workplace. For many teams, this became the preferred way of interacting with alerts that led to SLO violations. We\u2019ve extended this feature with the ability to add extra metadata according to the schema.<\/span><\/p>\n<p>Annotation creation via Workplace bot.<\/p>\n<h3><span>Automated annotations via Dr. Patternson\u2013SLICK integration<\/span><\/h3>\n<p><span>Meta has an automated debugging-runbook tool called<\/span><a href=\"https:\/\/youtu.be\/nQg1jJNpAi4?t=2060\" target=\"_blank\" rel=\"noopener\"> <span>Dr. Patternson<\/span><\/a><span>. It allows service owners to automatically run investigation scripts in response to an alert. 
SLICK integrates with this system: If the analysis is conclusive and can determine the root cause of an event that led to an SLO violation, we annotate the alert with that root cause and any additional data the analysis script provided.<\/span><\/p>\n<p><span>Of course, not all problems can be successfully analyzed automatically, but there are classes of issues where Dr. Patternson performs very well. Using automated analysis greatly simplifies the annotation process and significantly reduces the on-call load.<\/span><\/p>\n<p>Annotation created via Dr. Patternson. (Example for illustration purposes only.)<\/p>\n<h2><span>Annotations insights UI<\/span><\/h2>\n<p><span>With various workflows in place for people to fill in their information, we could build an insights UI to display the aggregated data.<\/span><\/p>\n<p><span>We\u2019ve built a new section in the SLICK UI that displays annotations grouped by root cause. By looking at this chart for a specific time range, service owners can easily see the distribution of root causes for the annotations. This makes it clear when a particular issue needs to be addressed. We also show the distribution of the additional metadata. This way, SLICK users can, for example, learn that a particular code change caused multiple alerts.<\/span><\/p>\n<p><span>We also show the list of all annotations created in the specified time period. This allows engineers to easily see the details of individual annotations and edit or delete them.<\/span><\/p>\n<p>Annotations Insights in SLICK. (Example, for illustration purposes only.)<\/p>\n<h2><span>Takeaways and next steps<\/span><\/h2>\n<p><span>These new features are currently being tested by several teams. Feedback we\u2019ve received indicates that the annotation workflow is already a big improvement for people working with SLOs. 
We plan to capitalize on this by onboarding all SLICK users and building even more features, including:<\/span><\/p>\n<ul>\n<li><span>Switching from just displaying the results to a more recommendation-style approach, like: \u201cDependency on service \u2018example\/service\u2019 was the root cause for 30 percent of the alerts that led to SLO violations for the SLI \u2018Availability\u2019. Fixing this dependency would allow you to raise your SLI results from 99.96% to 99.98%.\u201d<\/span><\/li>\n<li><span>Adding the ability to exclude particular annotated time periods from SLO calculation (e.g., a planned downtime).<\/span><\/li>\n<li><span>Analyzing annotations\u2019 root causes across all SLIs (currently we support analysis at the individual SLI level).<\/span><\/li>\n<\/ul>\n<p><span>The work that we\u2019ve done so far will form the basis for a formal SLO review process that the team will introduce in the future. We hope to shift teams\u2019 focus from just reacting to SLO violations on an ad hoc basis to a more planned approach. We believe that annotating events that led to SLO violations and regularly reviewing those violations may become standard practice at Meta.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2022\/08\/29\/developer-tools\/improving-metas-slo-workflows-with-data-annotations\/\">Improving Meta\u2019s SLO workflows with data annotations<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>When we focus on minimizing errors and downtime here at Meta, we pay close attention to service-level indicators (SLIs) and service-level objectives (SLOs). Consider Instagram, for example. There, SLIs represent metrics from different product surfaces, like the volume of error response codes to certain endpoints, or the number of successful media uploads. 
Based&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/08\/29\/improving-metas-slo-workflows-with-data-annotations\/\">Continue reading <span class=\"screen-reader-text\">Improving Meta\u2019s SLO workflows with data annotations<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-624","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":515,"url":"https:\/\/fde.cat\/index.php\/2021\/12\/13\/slick-adopting-slos-for-improved-reliability\/","url_meta":{"origin":624,"position":0},"title":"SLICK: Adopting SLOs for improved reliability","date":"December 13, 2021","format":false,"excerpt":"To support the people and communities who use our apps and products, we need to stay in constant contact with them. We want to provide the experiences we offer reliably. We also need to establish trust with the larger community we support. This can be especially challenging in a large-scale,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":591,"url":"https:\/\/fde.cat\/index.php\/2022\/04\/05\/transforming-service-reliability-through-an-slos-driven-culture-platform-2\/","url_meta":{"origin":624,"position":1},"title":"Transforming Service Reliability Through an SLOs-Driven Culture &amp; Platform","date":"April 5, 2022","format":false,"excerpt":"At Salesforce, Trust is our number-one value, and it has its own special meaning to each part of the company. 
In our Technology, Marketing, & Products (TMP) organization, a big part of Trust is providing highly reliable Salesforce experiences to our customers, which can be challenging because of the scale\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":561,"url":"https:\/\/fde.cat\/index.php\/2022\/04\/05\/transforming-service-reliability-through-an-slos-driven-culture-platform\/","url_meta":{"origin":624,"position":2},"title":"Transforming Service Reliability Through an SLOs-Driven Culture &amp; Platform","date":"April 5, 2022","format":false,"excerpt":"At Salesforce, Trust is our number-one value, and it has its own special meaning to each part of the company. In our Technology, Marketing, & Products (TMP) organization, a big part of Trust is providing highly reliable Salesforce experiences to our customers, which can be challenging because of the scale\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":563,"url":"https:\/\/fde.cat\/index.php\/2022\/04\/12\/onboarding-slos-for-salesforce-services\/","url_meta":{"origin":624,"position":3},"title":"Onboarding SLOs for Salesforce services","date":"April 12, 2022","format":false,"excerpt":"At Salesforce, we operate thousands of services of various sizes: monolith and micro-services, both customer-facing and internal, across multiple substrates, i.e. first party and public cloud infrastructure. 
In our earlier blog \u201cREADS: Service Health Metrics,\u201d we talked about the Service Level Objective (SLO) framework called READS that we developed at\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":592,"url":"https:\/\/fde.cat\/index.php\/2022\/04\/12\/onboarding-slos-for-salesforce-services-2\/","url_meta":{"origin":624,"position":4},"title":"Onboarding SLOs for Salesforce services","date":"April 12, 2022","format":false,"excerpt":"At Salesforce, we operate thousands of services of various sizes: monolith and micro-services, both customer-facing and internal, across multiple substrates, i.e. first party and public cloud infrastructure. In our earlier blog \u201cREADS: Service Health Metrics,\u201d we talked about the Service Level Objective (SLO) framework called READS that we developed Salesforce\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":519,"url":"https:\/\/fde.cat\/index.php\/2021\/12\/20\/sre-weekly-issue-301\/","url_meta":{"origin":624,"position":5},"title":"SRE Weekly Issue #301","date":"December 20, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. 
Book a demo: https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles BadgerDAO Exploit Technical Post Mortem This\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/624","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=624"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/624\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=624"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=624"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=624"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}