{"id":619,"date":"2022-08-10T20:17:13","date_gmt":"2022-08-10T20:17:13","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/08\/10\/architectural-principles-for-high-availability-on-hyperforce\/"},"modified":"2022-08-10T20:17:13","modified_gmt":"2022-08-10T20:17:13","slug":"architectural-principles-for-high-availability-on-hyperforce","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/08\/10\/architectural-principles-for-high-availability-on-hyperforce\/","title":{"rendered":"Architectural Principles for High Availability on Hyperforce"},"content":{"rendered":"<p>Infrastructure and software failures will happen. We idolize four 9s (99.99%) availability. We know we need to optimize and improve Recovery-Time-Objective (RTO, the time it takes to restore service after a service disruption) and Recovery-Point-Objective (RPO, the acceptable data loss measured in time). But how can we actually deliver high availability for our customers?<\/p>\n<p>One of the missions of Salesforce\u2019s engineering team is to prevent and minimize service disruptions. To achieve high availability for our customers, we design cloud-native architectural solutions that enable resilience for infrastructure failures and faster resolutions for unplanned incidents. We adopt safe deployment for non-disruptive software updates and releases.<\/p>\n<p>This post will share our architectural principles for high availability that we\u2019ve learned over the years and are applying to the Salesforce <a href=\"https:\/\/www.salesforce.com\/products\/platform\/hyperforce\/\">Hyperforce<\/a><em> <\/em>platform.<\/p>\n<h3>High Availability Architectural Principles<\/h3>\n<p><a href=\"https:\/\/salesforce.quip.com\/R7WPAkjSbCB7#temp:C:GKY6e9d92f1af9947a5b516436b1\">1. Build your services on infrastructure across multiple fault domains<\/a><br \/><a href=\"https:\/\/salesforce.quip.com\/R7WPAkjSbCB7#temp:C:GKYbb3d28a8daa642448380d6cba\">2. 
Adopt a safe deployment strategy<\/a><br \/><a href=\"https:\/\/salesforce.quip.com\/R7WPAkjSbCB7#temp:C:GKYf107ae45b99747639c004fbfc\">3. Understand your \u201cBlast Radius\u201d and minimize it<\/a><br \/><a href=\"https:\/\/salesforce.quip.com\/R7WPAkjSbCB7#temp:C:GKY0963c77fcb924efb8bdcd48e8\">4. Take advantage of the elastic capacity<\/a><br \/><a href=\"https:\/\/salesforce.quip.com\/R7WPAkjSbCB7#temp:C:GKY6c12738a51bf480fa33b55eb5\">5. Design to withstand dependency failures<\/a><br \/><a href=\"https:\/\/salesforce.quip.com\/R7WPAkjSbCB7#temp:C:GKY5a0fac118965490698f51a7d2\">6. Measure, learn, and improve continuously<\/a><\/p>\n<h4>1. Build your services on infrastructure across multiple fault domains<\/h4>\n<p>Salesforce Hyperforce is built natively on public cloud infrastructure. The public cloud providers operate multiple data centers in a single region. Data centers are grouped into \u201cavailability zones\u201d (AZs) in distinct physical locations with independent power and networks to prevent correlated infrastructure failures. The availability zones are the <strong>fault domains<\/strong> within a region. They are connected through redundant, low-latency, and high-bandwidth networks.<\/p>\n<p>Salesforce Hyperforce services are deployed across multiple availability zones (usually three) within a region. Because the availability zones are isolated fault domains, Hyperforce services can continue to operate with minimal disruption if an availability zone fails. Most Hyperforce services actively serve traffic from multiple availability zones. Some transactional systems that have a single primary database may automatically and transparently fail over the primary database to a standby database in a healthy availability zone with a brief failover time. 
The disruption is minimal \u2013 if an availability zone fails, some in-flight requests or transactions will not succeed and will be retried.<\/p>\n<p>In addition, we adopt the static stability deployment strategy for Hyperforce services with the objective of withstanding an entire availability zone failure plus one more individual component failure in another availability zone (AZ+1). We do not need to compete for public cloud resources to scale up at runtime in the event of failures.<\/p>\n<h4><strong>2. Adopt a safe deployment strategy<\/strong><\/h4>\n<p>Over the years, we have learned that changes made to production systems \u2013 such as releasing a new version of software or automation, or changing a configuration \u2013 are a primary contributor to service disruption due to software bugs or faulty configurations.<\/p>\n<p>While the best outcome is to prevent such issues from happening (with rigorous testing, for example), we know that in reality some issues may not surface during testing due to the workload, dependent services, or use cases. We need a safe deployment strategy to minimize the risk of new software bugs and to rapidly resolve disruptions caused by faulty software or configurations.<\/p>\n<p>At Salesforce, we adopt several safe deployment patterns, including rolling upgrades, canary and staggering, and Blue-Green deployment. All of them require no downtime for customers. We also adopt the immutable deployment principle, deploying services at the instance image level to avoid any configuration drift.<\/p>\n<p>Rolling upgrades are commonly used in a clustered service. We first designate one or a few service instances in the cluster to be in maintenance mode. The load balancer directs traffic to the other active service instances. We deploy or upgrade the software or configurations for the instances in maintenance by recreating them with a new image. 
We then enable them for traffic and repeat the process for the other service instances in batches.<\/p>\n<p>For canary, we first roll out changes to a small set of service instances to validate the changes. After allowing enough time for validation, we then roll out to the remaining set, usually using a staggered process through different logical groups of service instances at different times to minimize any unexpected service disruptions caused by the changes.<\/p>\n<p>For Blue-Green deployment, we deploy a new version (Green) of the service instances alongside the current version (Blue) with the same capacity. After testing and validation, the load balancer directs the traffic to the new (Green) service instances, and we decommission the service instances running the previous version.<\/p>\n<p>In some rare scenarios, we have seen a change pass through rolling upgrades or several stagger stages without issues, only for problems to surface some time later. Another requirement for safe deployment, therefore, is to support a rapid deployment mode for when we need to quickly roll back a change by deploying the last known good image, or roll forward a fix by deploying a new image.<\/p>\n<p>We will dive into more details on how we use infrastructure redundancy, isolation, and safe deployment for high availability in the following principle, the \u201cBlast Radius.\u201d<\/p>\n<h4>3. Understand your \u201cBlast Radius\u201d and minimize it<\/h4>\n<p>Try to answer these questions \u2013 how many customers are impacted if a software or infrastructure failure happens to my service? How can I reduce the number of customers affected if a failure happens?<\/p>\n<p>The fundamental concept is to \u201cpartition\u201d (or \u201cshard\u201d) your customers onto isolated and independent infrastructures and different staggered deployments.<\/p>\n<p>For the Salesforce core services, in addition to adopting the availability zones, we developed a logical construct, a \u201ccell,\u201d to manage and reduce the blast radius for our customers. 
A cell is a collection of services serving a group of our customers based on the <a href=\"https:\/\/engineering.salesforce.com\/l33t-m10y-f04f38127b82\">Salesforce Multi-tenant Architecture<\/a>. It is deployed across multiple availability zones for redundancy within the cell and is isolated from other cells. If a cell is unhealthy, it will only impact the customers on that cell, not customers on other cells.<\/p>\n<p>To reduce the blast radius when deploying software releases or changes to our core services, we use a combination of the deployment patterns discussed in the second principle. For example, we may use a rolling upgrade or a Blue-Green deployment to release a new version to the database cluster instances within a cell. We may canary on a small set of cells, and then stagger the deployment across the other cells.<\/p>\n<h4><strong>4. Take advantage of the elastic capacity<\/strong><\/h4>\n<p>Capacity is another major contributing factor to service disruptions. We talked about adopting static stability as our deployment and capacity principle at the macro level in the first principle. We also mentioned the Salesforce multi-tenant architecture and how customers share compute and other infrastructure capacity within a cell.<\/p>\n<p>Now, let\u2019s go further into how we manage capacity at the cell level and how we tackle challenges such as \u201cnoisy neighbors.\u201d<\/p>\n<p>For each cell, we deploy intelligent protection that can detect abnormal or excessive usage, such as denial-of-service attacks. We can block or throttle at the level of one or more customers, requests, or resources. In addition, for organic production workload increases, we take advantage of the public cloud elasticity to automatically scale the application compute within a cell as needed. At the transactional database tier, we can automatically scale the storage throughput and space. 
We can scale read requests by deploying read-only standby instances.<\/p>\n<p>Another example of using the public cloud elastic capacity is Blue-Green deployment, where we deploy full capacity for the new version at runtime.<\/p>\n<h4>5. Design to withstand dependency failures<\/h4>\n<p>Ask yourself this question \u2013 what happens to my service if one of its dependent services fails?<\/p>\n<p>One design goal for our services is to handle dependency failures gracefully. First of all, your service should always return an error or response code that clearly indicates when a failure is caused by a specific dependent service. Be aware of the depth of the error call stacks, as it may hinder the ability to pinpoint dependent service failures.<\/p>\n<p>The service should have programmatic logic, such as retry and timeout, for handling errors and exceptions when a dependent service is unavailable or is behaving unexpectedly \u2013 for example, when a dependent service returns intermittent failures, becomes unavailable, or hangs on a service call.<\/p>\n<p>The service may be able to provide reduced functionality if a dependent service fails. For example, when an unplanned primary database incident happens in a cell, the Salesforce core application can continue to serve read requests from the cache or the read replica databases until the failover completes and promotes one of the standby databases to become the new primary database.<\/p>\n<p>Another key learning is to avoid circular dependencies between services. These can develop unintentionally over time and can be buried behind many layers of service dependencies. Create a clear dependency graph of the architecture and avoid any circular dependency.<\/p>\n<h4>6. Measure, learn, and improve continuously<\/h4>\n<p>What do we want to measure and improve?<\/p>\n<p>For Salesforce Hyperforce services, we define metrics for measuring request rate, error rate, response time, availability, and saturation. 
For every service disruption, we measure and report the time-to-detect (TTD) and time-to-resolve (TTR). We conduct a retrospective review for every incident. The goal of the review is to learn the root cause of the incident, understand whether we have any technical or procedural gaps, and take action to address the gaps so we avoid repeating them in the future.<\/p>\n<h3>Conclusion<\/h3>\n<p>By sharing these principles and the lessons we have learned, we hope to help you understand how Salesforce provides high availability for our customers on the Hyperforce platform.<\/p>\n<p>Take a look at these principles and think about how they apply to your service and architecture, and join us on the journey to ever-improving service availability and trust for our customers!<\/p>\n<p>Follow along with the entire series:<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/behind-the-scenes-of-hyperforce-salesforces-infrastructure-for-the-public-cloud-429309542d8e?source=friends_link&amp;sk=0a22b253fffe0c7265ed602bd7e4e7fb\" target=\"_blank\" rel=\"noopener\">Behind the Scenes of Hyperforce: Salesforce\u2019s Infrastructure for the Public Cloud<\/a><br \/><a href=\"https:\/\/engineering.salesforce.com\/the-unified-infrastructure-platform-behind-salesforce-hyperforce-ad8f4c2cf789?source=friends_link&amp;sk=54cb6328080b7991a028d3ff086b7ec1\" target=\"_blank\" rel=\"noopener\">The Unified Infrastructure Platform Behind Salesforce Hyperforce<\/a><\/p>\n<p>The post <a href=\"https:\/\/engineering.salesforce.com\/architectural-principles-for-high-availability-on-hyperforce\/\">Architectural Principles for High Availability on Hyperforce<\/a> appeared first on <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering Blog<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/architectural-principles-for-high-availability-on-hyperforce\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\" rel=\"noopener\">Read 
More<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>Infrastructure and software failures will happen. We idolize four 9s (99.99%) availability. We know we need to optimize and improve Recovery-Time-Objective (RTO, the time it takes to restore service after a service disruption) and Recovery-Point-Objective (RPO, the acceptable data loss measured in time). But how can we actually deliver high availability for our customers? One&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/08\/10\/architectural-principles-for-high-availability-on-hyperforce\/\">Continue reading <span class=\"screen-reader-text\">Architectural Principles for High Availability on Hyperforce<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-619","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":538,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/01\/behind-the-scenes-of-hyperforce-salesforces-infrastructure-for-the-public-cloud\/","url_meta":{"origin":619,"position":0},"title":"Behind the Scenes of Hyperforce: Salesforce\u2019s Infrastructure for the Public Cloud","date":"February 1, 2022","format":false,"excerpt":"Salesforce has been running cloud infrastructure for over two decades, bringing companies and their customers together. 
When Salesforce first started out in 1999, the world was very different; back then, the only practical way to provide our brand of Software-As-A-Service was to run everything yourself\u200a\u2014\u200anot just the software, but the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":544,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/22\/the-unified-infrastructure-platform-behind-salesforce-hyperforce\/","url_meta":{"origin":619,"position":1},"title":"The Unified Infrastructure Platform Behind Salesforce Hyperforce","date":"February 22, 2022","format":false,"excerpt":"If you\u2019re paying attention to Salesforce technology at all, you\u2019ve no doubt heard about Hyperforce, our new approach to deploying Salesforce on public cloud providers. As with any big announcement, it can be a little hard to cut through the hyperbolic language and understand what\u2019s going\u00a0on. In this blog series,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":585,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/22\/the-unified-infrastructure-platform-behind-salesforce-hyperforce-2\/","url_meta":{"origin":619,"position":2},"title":"The Unified Infrastructure Platform Behind Salesforce Hyperforce","date":"February 22, 2022","format":false,"excerpt":"If you\u2019re paying attention to Salesforce technology at all, you\u2019ve no doubt heard about\u00a0Hyperforce, our new approach to deploying Salesforce on public cloud providers. As with any big announcement, it can be a little hard to cut through the\u00a0hyperbolic language and understand what\u2019s going on. 
In this blog series, we\u2019ll\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":539,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/02\/reads-service-health-metrics\/","url_meta":{"origin":619,"position":3},"title":"READS: Service Health Metrics","date":"February 2, 2022","format":false,"excerpt":"As you scale your company\u2019s software footprint, with Service-Oriented Architecture (SOA) or microservices architecture featuring 100s or 1000s of services or more, how do you keep track of the performance of every service in every region? How will you know whether you are tracking every service? Of course every service\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":589,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/02\/reads-service-health-metrics-2\/","url_meta":{"origin":619,"position":4},"title":"READS: Service Health Metrics","date":"February 2, 2022","format":false,"excerpt":"As you scale your company\u2019s software footprint, with Service-Oriented Architecture (SOA) or microservices architecture featuring 100s or 1000s of services or more, how do you keep track of the performance of every service in every region? How will you know whether you are tracking every service? Of course every service\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":528,"url":"https:\/\/fde.cat\/index.php\/2022\/01\/06\/managing-availability-in-service-based-deployments-with-continuous-testing\/","url_meta":{"origin":619,"position":5},"title":"Managing Availability in Service Based Deployments with Continuous Testing","date":"January 6, 2022","format":false,"excerpt":"The Problem At Salesforce, trust is our number one value. 
What this equates to is that our customers need to trust us; trust us to safeguard their data, trust that we will keep our services up and running, and trust that we will be there for them when they need\u00a0us.\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/619","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=619"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/619\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=619"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=619"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=619"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}