{"id":591,"date":"2022-04-05T18:02:00","date_gmt":"2022-04-05T18:02:00","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/04\/05\/transforming-service-reliability-through-an-slos-driven-culture-platform-2\/"},"modified":"2022-04-05T18:02:00","modified_gmt":"2022-04-05T18:02:00","slug":"transforming-service-reliability-through-an-slos-driven-culture-platform-2","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/04\/05\/transforming-service-reliability-through-an-slos-driven-culture-platform-2\/","title":{"rendered":"Transforming Service Reliability Through an SLOs-Driven Culture &amp; Platform"},"content":{"rendered":"<p>At Salesforce, Trust is our number-one value, and it has its own special meaning to each part of the company. In our Technology, Marketing, &amp; Products (TMP) organization, a big part of Trust is providing highly reliable Salesforce experiences to our customers, which can be challenging because of the scale of the Salesforce infrastructure, its range of tech stacks, and the many products that those tech stacks support. Because of that challenge \u2014 and because TMP must gauge reliability at both that high level (across products) and from a zoomed-in view (for individual services supporting those products) \u2014 agreeing on what \u201chighly reliable\u201d means and how to measure it is absolutely critical. So just as Salesforce employees refer to standardized branding guidelines to speak the same product language, we also need standardized service ownership guidelines to ensure that we\u2019re speaking the same reliability language. 
This blog post is about the Salesforce journey to framing reliability in terms of\u00a0<a href=\"https:\/\/sre.google\/sre-book\/service-level-objectives\/\" target=\"_blank\" rel=\"noopener\">service-level indicators (SLIs) and objectives (SLOs)<\/a>, which are often used in the enterprise software business to represent the true customer experience in a clear, quantitative, and actionable way.<\/p>\n<h2 class=\"has-large-font-size\">Understanding the \u201cBefore\u201d Picture of Service Ownership<\/h2>\n<p>In the past, teams stored SLOs in custom dashboards, documents, and a mix of other resources, all of which had to be manually updated by data analysts. Those analysts spent hours upon hours updating health metrics across the Salesforce product lines, and finding health metrics for a given team was also difficult. That search meant reaching out to colleagues and scanning through multiple repos and documents to get historical data. And when you found those numbers for one team, product, or service, you couldn\u2019t always compare them with the numbers from\u00a0<em>other\u00a0<\/em>teams, products, or services because how the numbers were calculated often varied. With the data scattered, calculated in varying ways, and lacking a central view for comparison, analyzing an SLO over longer periods of time \u2014 especially across properties \u2014 wasn\u2019t always possible. As a result, we didn\u2019t have as clear and direct a read on customers\u2019 Salesforce experience as we wanted.<\/p>\n<h2>Starting the Journey to Improved Service Ownership<\/h2>\n<p>The piecemeal reporting on reliability wasn\u2019t working for us, and we knew that we needed a standardized view into the actual health of products, features, and services \u2014 and into how that actual health compared to the health that internal and external customers expect. 
With that view, we could more easily identify customer-impacting incidents, diagnose their root causes, and analyze dependencies between systems.<\/p>\n<p><strong>Standardized\u00a0<em>measurements<\/em>\u00a0of product and service health<\/strong>: We standardized on READS as the minimum set of SLIs required for services, with any SLIs that didn\u2019t apply to a service being exempted for it. The previously published\u00a0<a target=\"_blank\" href=\"https:\/\/engineering.salesforce.com\/reads-service-health-metrics-1bfa99033adc\" rel=\"noopener\"><em>READS: Service Health Metrics\u00a0<\/em>blog post<\/a>\u00a0outlines the SLO framework established at Salesforce.<\/p>\n<p><strong>Standardized\u00a0<em>tooling\u00a0<\/em>to support those measurements<\/strong>: We built a dedicated SLO platform for hosting SLI definitions; SLO definitions; and service definitions, which include service ownership information, health thresholds, alert configurations, and more. Because this rich metadata for services was all published in the same data store, finding and learning about services became easy. Service owners could even integrate their SLOs with Salesforce-internal operational workflows to ensure that those SLOs become an integral part of their day-to-day work.<\/p>\n<p><strong>Standardized\u00a0<em>visualization\u00a0<\/em>tied into that tooling<\/strong>: After a service owner instruments their service to emit health metrics, they get a standardized, out-of-the-box view into those metrics. 
That view includes the standard READS SLIs that apply to their service and any custom SLIs that they created to represent their service\u2019s unique capabilities, reflect their customers\u2019 reliability expectations, and capture a fuller picture of their service\u2019s health.<\/p>\n<h2>Providing Standardized Service Ownership Capabilities at the Org Level<\/h2>\n<p>Combined, the standardized measurements, tooling, and visualizations allow teams to:<\/p>\n<p>Have confidence in SLOs being calculated in a standardized way.<\/p>\n<p>Gain insights from visualized SLI\/SLO metrics.<\/p>\n<p>View those metrics with daily, weekly, and monthly granularity during operational reviews.<\/p>\n<p>Use SLO targets to judge whether a service is meeting user expectations.<\/p>\n<p>Set up alerts on SLI\/SLO metrics.<\/p>\n<p>Correlate SLO breaches with active incidents.<\/p>\n<p>Identify service dependencies, which can be especially useful during incident analysis.<\/p>\n<p>Continuously improve the Salesforce experience based on a data-driven approach.<\/p>\n<h2>Onboarding a Given Service to Service Ownership<\/h2>\n<p>For a service owner to take advantage of the previous capabilities for their\u00a0<em>own<\/em>\u00a0service, they must onboard that service to service ownership, which involves adding it to a service directory on the previously described SLO platform. They create a service definition file that codifies the name of their service and a configuration file that includes the queries for calculating the service\u2019s health data, the SLOs defining expectations for that health data, and other service-specific information. 
For the queries, they use a templated framework that minimizes configuration redundancies and idiosyncrasies.<\/p>\n<p>After the service owner runs the configuration pipeline and commits their files in Git, they get:<\/p>\n<p><strong>A Grafana dashboard specific to their service for real-time monitoring<\/strong>, which is automatically generated, gets populated with data collected and aggregated by a data pipeline, and can be used for real-time debugging and validation.<\/p>\n<p><strong>Their service added to the service analytics dashboard<\/strong>, which is regularly reviewed during operational reviews and drives conversations about service health and availability \u2014 without making any additional demands of the service owner\u2019s time.<\/p>\n<p><strong>Long-term storage and retention of their service\u2019s health data<\/strong>, giving them visibility into historical health trends.<\/p>\n<p><strong>The ability to generate automated alerts<\/strong>\u00a0<strong>for their service\u00a0<\/strong>using configs that draw on rich metadata.<\/p>\n<p><strong>The service\u2019s topology information included in health and availability aggregates<\/strong>, which can help to pinpoint the instance(s) of the service that breach the service\u2019s SLOs.<\/p>\n<h2>Getting to Know the SLO Platform\u2019s Architecture<\/h2>\n<p>Now that we\u2019ve covered the benefits that the SLO platform provides to service owners, let\u2019s explore how it\u2019s structured.<\/p>\n<p>Architecture diagram showing how the SLO platform is built<\/p>\n<p><strong>Service registry<\/strong>\u00a0\u2014 The store for service ownership information, service statuses, and service-specific configurations.<\/p>\n<p><strong>Service configuration store<\/strong>\u00a0\u2014 The store for SLI and SLO information, triggers, and warning and critical thresholds required for monitoring and alerting.<\/p>\n<p><strong>Topology information<\/strong>\u00a0\u2014 The service\u2019s substrate(s) and deployed instances.<\/p>\n<p><strong>Change and release 
information<\/strong>\u00a0\u2014 In the future, information that allows us to correlate change and release artifacts with SLO breaches to quickly diagnose their causes.<\/p>\n<p><strong>Ownership information<\/strong>\u00a0\u2014 Service ownership information that can be integrated seamlessly with alert-related metadata and correlated with actionable notifications.<\/p>\n<p><strong>Metrics time-series monitoring platform<\/strong>\u00a0\u2014 The store for the operational metrics that a service emits after its owner instruments it. This platform then feeds those metrics (some aggregated, some not) into the metrics aggregation pipeline.<\/p>\n<p><strong>Metrics aggregation pipeline<\/strong>\u00a0\u2014 Running across various substrates to provide aggregated service health data for operational analysis, this pipeline finds queries for SLIs from the monitoring configuration store, executes those queries to get raw time-series data in 1-minute intervals, and then rolls up that data on a daily and weekly basis.<\/p>\n<h2>Fostering Service Ownership Success on the SLO Platform<\/h2>\n<p>After the SLO platform was built, we saw massive adoption of it, with ~1,200 services onboarded to it in just its first year. Through onboarding, cataloging, instrumenting, and analyzing services \u2014 and setting up carefully crafted SLIs and SLOs for them in sandbox and development environments \u2014 the platform has enabled service owners to extract deep, actionable insights into the health of their services and how to improve or maintain it.<\/p>\n<p>And by driving operational reviews from the unified service health dashboard, which the following screenshot was taken from, we\u2019ve witnessed several availability success stories. Service owners have been able to catch dips in SLIs, highlight dependencies that weren\u2019t meeting their defined SLOs, review availability trends, and better understand and communicate customers\u2019 experience with their services. 
At times, analysis of that data from the SLO platform has even prompted the rearchitecting of certain critical services and driven conversations about strategic investments and tactical improvements.<\/p>\n<p>Screenshot of a portion of the unified service health dashboard<\/p>\n<p>In the future, we want to provide a more comprehensive view of the layers of dependencies required for any service to maintain its health, which would allow us to pinpoint exactly where a failure occurs and to minimize recovery times. Then, we could set threshold expectations across the entire stack, not just service by service, and better enable critical services to meet their SLOs. We plan to make the SLO platform an API-first platform, which would allow us to set up integrations with change and release management data, and to scale up the data architecture to meet the platform\u2019s massive computational needs. We\u2019re excited to see the platform continue to grow and look forward to sharing more success stories in the future.<\/p>\n<p>If you would like to learn more about how you can build observability into your services from the ground up, check out this previously published post about READS:\u00a0<a href=\"https:\/\/engineering.salesforce.com\/reads-service-health-metrics-1bfa99033adc\" target=\"_blank\" rel=\"noopener\">https:\/\/engineering.salesforce.com\/reads-service-health-metrics-1bfa99033adc<\/a><\/p>\n<h3>Acknowledgments<\/h3>\n<p><em>Thanks to\u00a0<\/em><a href=\"https:\/\/medium.com\/@adimitropoulos\"><em>Alex Dimitropoulos<\/em><\/a>\u00a0<em>for additional contributions to this post and to the entire development team, who created the SLO Platform and continue to evolve it!<\/em><\/p>\n<p>The post <a href=\"https:\/\/engineering.salesforce.com\/transforming-service-reliability-through-an-slos-driven-culture-platform\/\">Transforming Service Reliability Through an SLOs-Driven Culture &amp; Platform<\/a> appeared first on <a 
href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering Blog<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>At Salesforce, Trust is our number-one value, and it has its own special meaning to each part of the company. In our Technology, Marketing, &amp; Products (TMP) organization, a big part of Trust is providing highly reliable Salesforce experiences to our customers, which can be challenging because of the scale of the Salesforce infrastructure, its&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/04\/05\/transforming-service-reliability-through-an-slos-driven-culture-platform-2\/\">Continue reading <span class=\"screen-reader-text\">Transforming Service Reliability Through an SLOs-Driven Culture &amp; Platform<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-591","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":561,"url":"https:\/\/fde.cat\/index.php\/2022\/04\/05\/transforming-service-reliability-through-an-slos-driven-culture-platform\/","url_meta":{"origin":591,"position":0},"title":"Transforming Service Reliability Through an SLOs-Driven Culture &amp; Platform","date":"April 5, 2022","format":false,"excerpt":"At Salesforce, Trust is our number-one value, and it has its own special meaning to each part of the company. 
In our Technology, Marketing, & Products (TMP) organization, a big part of Trust is providing highly reliable Salesforce experiences to our customers, which can be challenging because of the scale\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":590,"url":"https:\/\/fde.cat\/index.php\/2022\/06\/02\/meet-the-team-of-problem-solvers-pushing-boundaries-to-see-how-massively-we-can-scale\/","url_meta":{"origin":591,"position":1},"title":"Meet the team of problem solvers pushing boundaries to see how massively we can scale.","date":"June 2, 2022","format":false,"excerpt":"Welcome to the new hub for all things Salesforce Engineering! This site is where you can get a behind-the-scenes look at how we build business-critical software at scale, take a peek at how we contribute to the open source community, meet some of our technical employees, and learn more about\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":696,"url":"https:\/\/fde.cat\/index.php\/2023\/04\/04\/video-meet-5-salesforce-engineers-who-are-innovating-the-future\/","url_meta":{"origin":591,"position":2},"title":"[video] Meet 5 Salesforce Engineers Who Are Innovating the Future","date":"April 4, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the life experiences and career paths that have shaped Salesforce engineering leaders. 
This special edition connects five of the best and brightest minds within Salesforce Engineering and across the world including India, Argentina, and the U.S., chronicling their quest to innovate the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":544,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/22\/the-unified-infrastructure-platform-behind-salesforce-hyperforce\/","url_meta":{"origin":591,"position":3},"title":"The Unified Infrastructure Platform Behind Salesforce Hyperforce","date":"February 22, 2022","format":false,"excerpt":"If you\u2019re paying attention to Salesforce technology at all, you\u2019ve no doubt heard about Hyperforce, our new approach to deploying Salesforce on public cloud providers. As with any big announcement, it can be a little hard to cut through the hyperbolic language and understand what\u2019s going\u00a0on. In this blog series,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":585,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/22\/the-unified-infrastructure-platform-behind-salesforce-hyperforce-2\/","url_meta":{"origin":591,"position":4},"title":"The Unified Infrastructure Platform Behind Salesforce Hyperforce","date":"February 22, 2022","format":false,"excerpt":"If you\u2019re paying attention to Salesforce technology at all, you\u2019ve no doubt heard about\u00a0Hyperforce, our new approach to deploying Salesforce on public cloud providers. As with any big announcement, it can be a little hard to cut through the\u00a0hyperbolic language and understand what\u2019s going on. 
In this blog series, we\u2019ll\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":229,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/ml-lake-building-salesforces-data-platform-for-machine-learning\/","url_meta":{"origin":591,"position":5},"title":"ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning","date":"February 2, 2021","format":false,"excerpt":"Salesforce uses machine learning to improve every aspect of its product suite. With the help of Salesforce Einstein, companies are improving productivity and accelerating key decision-making. Data is a critical component of all machine learning applications and Salesforce is no exception. In this post I will share some unique challenges\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/591","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=591"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/591\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=591"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=591"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=591"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}