{"id":528,"date":"2022-01-06T16:22:40","date_gmt":"2022-01-06T16:22:40","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/01\/06\/managing-availability-in-service-based-deployments-with-continuous-testing\/"},"modified":"2022-01-06T16:22:40","modified_gmt":"2022-01-06T16:22:40","slug":"managing-availability-in-service-based-deployments-with-continuous-testing","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/01\/06\/managing-availability-in-service-based-deployments-with-continuous-testing\/","title":{"rendered":"Managing Availability in Service Based Deployments with Continuous Testing"},"content":{"rendered":"<h3>The Problem<\/h3>\n<p>At Salesforce, trust is our number one value. What this equates to is that our customers need to trust us; trust us to safeguard their data, trust that we will keep our services up and running, and trust that we will be there for them when they need\u00a0us.<\/p>\n<p>In the world of Software as a Service (SaaS), trust and availability have become synonymous. Availability represents that percentage of time and\/or requests successfully handled.<\/p>\n<p>Availability can be calculated as the number of successful requests divided by the number of total requests.<\/p>\n<p>As a result, a few things become prevalent.<\/p>\n<p>In order have high availability, you must have low mean time to recovery\u00a0(MTTR).In order to have low MTTR, you must have low mean time to detection (MTTD).How can we distinguish between server errors and client errors? Is our availability penalized for client\u00a0errors?How do we calculate the availability of our service in regards to dependent services? 
Should our availability metrics show when a dependent service is\u00a0down?<\/p>\n<h3>The Solution<\/h3>\n<p>In order to comprehensively tackle this issue, we implemented a three-pronged strategy for the Salesforce Commerce APIs: <strong>monitoring<\/strong>, <strong>continuous testing<\/strong>, and <strong>alerting<\/strong>. Additionally, availability is broken down into multiple categories to allow for pinpointing and tackling problematic areas.<\/p>\n<h3>Monitoring<\/h3>\n<p>At the core of this problem is the ability to observe what is happening in the system. For a given API call, we need to\u00a0know:<\/p>\n<ul>\n<li>The overall\u00a0latency<\/li>\n<li>The response\u00a0code<\/li>\n<li>The latency and response codes for calls to any dependent services<\/li>\n<\/ul>\n<p>For many service-based systems under load, the raw metrics exceed the capability of any real-time metrics database, and the data has to be aggregated and downsampled. In our experience, aggregating and downsampling to 1-minute intervals and publishing the <strong>p95<\/strong> and <strong>p99<\/strong> is sufficient for catching most issues and keeping MTTD to a few\u00a0minutes.<\/p>\n<p>Additionally, the data needs to be tagged so that it can be queried and aggregated at a granular level. Since our services are multi-tenant, we often associate the following tags with the\u00a0metric.<\/p>\n<p>* Service name represents the type of service that will process the\u00a0request. This can be used to differentiate systems from their dependent systems.<\/p>\n<p>With fine-grained metrics in place, we now have the data on how the system and its interactions with external entities are behaving.<\/p>\n<p>If one of our dependencies is throwing errors, we have the data to find and prove it. If one of our systems has high latency, we have the data to find and prove it. 
If one of our tenants is having an issue, we have the data to find and prove\u00a0it.<\/p>\n<p>For example, let\u2019s take a hypothetical service that we are building around searching for ecommerce products. For scalability and control, we have separated it into two isolated web applications: ingestion and search. This service uses a managed Elasticsearch service as its data store, which we access through its REST API. Below are example queries used to measure the overall and internal latency and\u00a0counts.<\/p>\n<h3>Continuous Tests<\/h3>\n<p>While we have the data to show what our customers are experiencing, we do not have the ability to infer what they are doing or even ensure they are doing it the way we\u00a0want.<\/p>\n<p>A customer may be repeatedly:<\/p>\n<ul>\n<li>Calling us with an invalid\u00a0token<\/li>\n<li>Asking for an invalid\u00a0resource<\/li>\n<li>Specifying invalid\u00a0data<\/li>\n<\/ul>\n<p>This can make metrics difficult to interpret and issues that arise around these types of errors difficult to track. While we can protect ourselves using techniques such as <a href=\"https:\/\/engineering.salesforce.com\/coordinated-rate-limiting-in-microservices-bb8861126beb\">rate limiting<\/a> and circuit breaking, this doesn\u2019t help us identify all errors that may be impacting customers (e.g., an expired SSL certificate\u00a0error).<\/p>\n<p>One mechanism to aid in mitigating this is continuous testing. Continuous testing refers to running a set of tests, on a routine interval, against a test tenant, that exercise common flows through the\u00a0system. These flows might include:<\/p>\n<ul>\n<li>The ability to write to a data\u00a0store<\/li>\n<li>The ability to get \/ write to an external\u00a0service<\/li>\n<li>In our ecommerce world, the ability to find a product, add it to a cart, and check\u00a0out<\/li>\n<\/ul>\n<p>This is more than just a health check. It ensures the system is exercised and functioning as expected.<\/p>\n<p>This data can be used as a source of truth. 
One of our problems, described above, is how to distinguish whether client errors are a problem and whether they should be included in our availability. In the case of continuous testing, we control the client. If we get back a client error (4xx) that we aren\u2019t expecting, then the system is not functioning as expected, and we need to investigate and fix the\u00a0problem.<\/p>\n<p><em>The above diagram depicts how one may set up multi-region deployments, exercise them with a continuous test harness, track the metrics, and compute availability and\u00a0alerts.<\/em><\/p>\n<h3>Alerting<\/h3>\n<p>With both fine-grained monitoring and a source of truth in place, the next step is to enable alerting based on this data. This alerting is the core of what we use to drive our MTTD\u00a0down.<\/p>\n<p>The following alerts are used to back this\u00a0process:<\/p>\n<ul>\n<li>Excessive server errors (5xx) across all\u00a0tenants<\/li>\n<li>Excessive p95 latency across all\u00a0tenants<\/li>\n<li>Validation that continuous tests are running (metrics exist for the test tenant at least once per continuous test interval)<\/li>\n<li>Excessive server (5xx) and client errors (4xx) for our continuous test\u00a0tenant<\/li>\n<\/ul>\n<p><em>By incorporating our source of truth in our alerting, we are able to quickly detect errors from multiple paths within the system. 
Our MTTD becomes at most our continuous test interval.<\/em><\/p>\n<h3>Categories of Availability<\/h3>\n<p>With our monitoring and source of truth in place, we can now compute our availability numbers in a few different ways to get a better understanding of how our service is performing.<\/p>\n<p>By breaking down availability into these four categories\u200a\u2014\u200ageneral availability, service availability, synthetic availability, and synthetic service availability\u200a\u2014\u200awe are able to pinpoint any weaknesses in the system and identify where to invest to make it more\u00a0robust.<\/p>\n<h3>Qualification and Production Deployment Validation<\/h3>\n<p>A secondary use case of our continuous testing is to aid in our pre-production qualification and post-production validation.<\/p>\n<p>We employ a policy where all changes must soak in our pre-production environment for at least 24 hours before being deployed. During that time period, we automatically run a daily load test to establish and validate performance trends, and run our continuous tests on a regular interval.<\/p>\n<p>When it comes time to deploy, if our performance is within the acceptable bounds of our baseline, and our continuous tests have been stable since the application was deployed, we consider the application qualified and ready to be deployed to production.<\/p>\n<p><em>Please note this is only a subset of the pre-production qualification process and does not spell out our practices for unit testing, integration testing, and static analysis, which should always be performed prior to deploying to a staging\/pre-production environment.<\/em><\/p>\n<p>Once the application is deployed to production, we immediately run our continuous test suite against it. 
If it fails, or if one of the subsequent runs fails, we immediately know that we have caused a problem, and a rollback is warranted.<\/p>\n<p>Additionally, we roll out to each of our production environments in a sequential manner, using the one with the least usage as the canary as a final risk-reduction technique.<\/p>\n<p>Approaches like this enable a quick MTTD, and a quick rollback results in a quick MTTR. The faster we recover, the higher our availability.<\/p>\n<h3>Conclusion<\/h3>\n<p>As service owners look toward increasing their availability and decreasing their MTTD and MTTR, fine-grained metrics, alerting, and continuous testing can play a major role. Having a source of truth and continuously exercising the service provide a robust way to ensure it is functioning properly and a method to quickly debug issues that may\u00a0arise.<\/p>\n<p>This approach also provides an easily extensible model to aid in managing multi-app, multi-region, multi-partition deployments.<\/p>\n<p><em>The above graph is an example dashboard showing data at a high level across a series of applications. 
It includes the total calls, active tenants, overall availability including dependent systems, the availability excluding dependent systems, and the synthetic availability across just our continuous tests.<\/em><\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/managing-availability-in-service-based-deployments-with-continuous-testing-61be968da4a\">Managing Availability in Service Based Deployments with Continuous Testing<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>","protected":false},"excerpt":{"rendered":"<p>The Problem At Salesforce, trust is our number one value. What this equates to is that our customers need to trust us; trust us to safeguard their data, trust that we will keep our services up and running, and trust that we will be there for them when they need\u00a0us. 
In the world of Software&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/01\/06\/managing-availability-in-service-based-deployments-with-continuous-testing\/\">Continue reading <span class=\"screen-reader-text\">Managing Availability in Service Based Deployments with Continuous Testing<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-528","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":539,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/02\/reads-service-health-metrics\/","url_meta":{"origin":528,"position":0},"title":"READS: Service Health Metrics","date":"February 2, 2022","format":false,"excerpt":"As you scale your company\u2019s software footprint, with Service-Oriented Architecture (SOA) or microservices architecture featuring 100s or 1000s of services or more, how do you keep track of the performance of every service in every region? How will you know whether you are tracking every service? Of course every service\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":589,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/02\/reads-service-health-metrics-2\/","url_meta":{"origin":528,"position":1},"title":"READS: Service Health Metrics","date":"February 2, 2022","format":false,"excerpt":"As you scale your company\u2019s software footprint, with Service-Oriented Architecture (SOA) or microservices architecture featuring 100s or 1000s of services or more, how do you keep track of the performance of every service in every region? How will you know whether you are tracking every service? 
Of course every service\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":619,"url":"https:\/\/fde.cat\/index.php\/2022\/08\/10\/architectural-principles-for-high-availability-on-hyperforce\/","url_meta":{"origin":528,"position":2},"title":"Architectural Principles for High Availability on Hyperforce","date":"August 10, 2022","format":false,"excerpt":"Infrastructure and software failures will happen. We idolize four 9s (99.99%) availability. We know we need to optimize and improve Recovery-Time-Objective (RTO, the time it takes to restore service after a service disruption) and Recovery-Point-Objective (RPO, the acceptable data loss measured in time). But how can we actually deliver high\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":640,"url":"https:\/\/fde.cat\/index.php\/2022\/10\/17\/sre-weekly-issue-343\/","url_meta":{"origin":528,"position":3},"title":"SRE Weekly Issue #343","date":"October 17, 2022","format":false,"excerpt":"View on sreweekly.com Bit of a short one this week as I recover from my third bout of COVID. Fortunately, this is another relatively mild one (thank you, vaccine!). Good luck everyone, and get your boosters. A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly\u00a0\ud83d\ude92. Rootly\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":538,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/01\/behind-the-scenes-of-hyperforce-salesforces-infrastructure-for-the-public-cloud\/","url_meta":{"origin":528,"position":4},"title":"Behind the Scenes of Hyperforce: Salesforce\u2019s Infrastructure for the Public Cloud","date":"February 1, 2022","format":false,"excerpt":"Salesforce has been running cloud infrastructure for over two decades, bringing companies and their customers together. 
When Salesforce first started out in 1999, the world was very different; back then, the only practical way to provide our brand of Software-As-A-Service was to run everything yourself\u200a\u2014\u200anot just the software, but the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":544,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/22\/the-unified-infrastructure-platform-behind-salesforce-hyperforce\/","url_meta":{"origin":528,"position":5},"title":"The Unified Infrastructure Platform Behind Salesforce Hyperforce","date":"February 22, 2022","format":false,"excerpt":"If you\u2019re paying attention to Salesforce technology at all, you\u2019ve no doubt heard about Hyperforce, our new approach to deploying Salesforce on public cloud providers. As with any big announcement, it can be a little hard to cut through the hyperbolic language and understand what\u2019s going\u00a0on. In this blog series,\u2026","rel":"","context":"In 
&quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/528","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=528"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/528\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=528"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=528"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=528"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}