{"id":573,"date":"2022-05-05T19:00:52","date_gmt":"2022-05-05T19:00:52","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/05\/05\/belljar-a-new-framework-for-testing-system-recoverability-at-scale\/"},"modified":"2022-05-05T19:00:52","modified_gmt":"2022-05-05T19:00:52","slug":"belljar-a-new-framework-for-testing-system-recoverability-at-scale","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/05\/05\/belljar-a-new-framework-for-testing-system-recoverability-at-scale\/","title":{"rendered":"BellJar: A new framework for testing system recoverability at scale"},"content":{"rendered":"<p><span>Building infrastructure that can easily recover from outages, particularly outages involving adjacent infrastructure, too often becomes a murky exploration of nuanced fate-sharing between systems. Untangling dependencies and uncovering side effects of unavailability has historically been time-consuming work.<\/span><\/p>\n<p><span>A lack of great tooling built for this, and the rarity of infrastructure outages, makes reasoning about them difficult. How far will the outage extend? And which available mitigations can get things back online? Bootstrapping blockers and circular dependencies present real concerns we cannot put to bed with theory or design documents alone. The lower in the stack we go, the broader the impact of such outages could be. Purpose-built tooling is important for safely experimenting with relationships between systems low in the stack.<\/span><\/p>\n<p><span>BellJar is a new framework we\u2019ve developed at Meta to strip away the mystery and subtlety that has plagued efforts to build recoverable infrastructure. It gives us a simple way to examine how layers in our stack recover from the worst outages imaginable. BellJar has become a powerful, flexible tool we can use to prove that our infrastructure code works the way we expect it to. 
None of us wants to exercise the recovery strategies, worst-case constrained behaviors, or emergency tooling that BellJar helps us curate. At the same time, careful preparation like this has helped us make our systems more resilient.<\/span><\/p>\n<h2><span>Pragmatic coupling between systems: curation over isolation<\/span><\/h2>\n<p><span>With an infrastructure at Meta\u2019s scale, where software may span dozens of data centers and millions of machines, tasks that would be trivial on a small number of servers (such as changing a configuration value) simply cannot be performed manually. The layers of infrastructure we build abstract away many of the difficult parts of running software at scale so that application developers can focus on their system\u2019s unique offerings. And reusing this common infrastructure solves these problems once, universally, for all our service owners.\u00a0<\/span><\/p>\n<p><span>However, the reach of these foundational systems also means that outages could potentially impact a wide swath of other software. Herein lies the trade-off between dependency and easy disaster recovery that infra engineers have to consider. The thinking goes that if you build code that doesn\u2019t need any external systems, then recovery from all sorts of failures should be quick.<\/span><\/p>\n<p><span>But with time and growth, the practicalities of large-scale production environments bear down on this simplistic model. All this infrastructure amounts to regular software. <\/span><span>Like the services it powers, it needs to run somewhere, receive upgrades, scale up or down, expose configuration knobs, store durable state, and communicate with other services. Rebuilding each of these capabilities to ensure perfect isolation for each infrastructure system gets impractical. In many cases, leaning on shared infrastructure to solve these problems rather than reinventing the wheel makes sense. 
This is especially true at scale, where edge cases can be difficult to get right. We don\u2019t want every infrastructure team to build their own entire support stack from scratch.<\/span><\/p>\n<p><span>Yet without a solid understanding of how our systems connect, we can find ourselves in one of two traps: building duplicative support systems in pursuit of overkill decoupling, or allowing our systems to become intertwined in ways that jeopardize their resiliency. Without good visibility and rigor, systems can experience a mix of both as engineers address the low-criticality dependencies they know about. This can create undetected, and therefore unaddressed, circular dependencies.<\/span><\/p>\n<p>Even the lowest-dependency systems, like our Apache ZooKeeper deployments, lean on a wide array of supporting code.<\/p>\n<p><span>Without structured, rigorous controls in place, a mishmash of uncurated relationships between systems tends to grow unchecked. This then creates coupling between systems and complexity that can prove difficult to reverse.\u00a0<\/span><\/p>\n<h2><span>Recoverability as the design constraint<\/span><\/h2>\n<p><span>So how do we avoid senseless duplication while limiting the coupling between systems that can present real operational problems? By focusing on a clear constraint \u2014 recoverability, the requirement that whenever outages strike our infrastructure systems, we have strong confidence that our infrastructure can return to a healthy state in a short amount of time.<\/span><\/p>\n<p><span>But to analyze cross-system relationships through the lens of recoverability, we need tooling that exposes, tests, and catalogs the relationships buried in our millions of lines of code. BellJar is a framework for exercising infrastructure code under worst-case conditions to uncover the relationships that matter during recovery.<\/span><\/p>\n<p><span>Outages can manifest in many forms. 
For our purposes, we like to focus on a broad, prototypical outage that presents a categorical failure of common infrastructure systems. Using recoverability as the objective allows us to bring tractability to the problem of curated coupling in several ways:<\/span><\/p>\n<p>Practicality:<span> Recoverability provides an alternative to the impractical, absolutist position that requires complete isolation between all infra systems. If some coupling between two systems does not interfere with recovery capabilities of our infrastructure, such coupling may be acceptable.<\/span><br \/>\nCharacterization:<span> Recoverability lets us focus on a specific type of intersystem relationship \u2014 the relationships required to get a system up to minimally good health. Doing so shrinks the universe of cross-system connections that we need to reason about.<\/span><br \/>\nScope:<span> Rather than attempt to optimize the dependency graph for our entire infrastructure at once, we can zoom in on individual systems. By considering only their adjacent supporting software, these become the building blocks of recovery.<\/span><br \/>\nAdditive:<span> A common approach to decoupling systems asks, \u201cWhat happens to system <\/span><span>Foo<\/span><span> if we take away access to system <\/span><span>Bar<\/span><span>?\u201d However, when your production environment contains thousands of possible dependencies, even enumerating all such test permutations can become impossible. By contrast, a recovery analysis asks the inverse question: \u201cWhat is everything <\/span><span>Foo<\/span><span> needs to reach a minimally healthy state?\u201d<\/span><br \/>\nTestability:<span> Recovery scenarios offer real, verifiable conditions we can assert programmatically. 
While outages manifest in many shapes and sizes, in each case, recovery is a Boolean result \u2014 either recovery is successful or it isn\u2019t.<\/span><\/p>\n<h2><span>Examining system recoveries with a total outage in a vacuum<\/span><\/h2>\n<p><span>BellJar presents infra teams with an experimentation environment that is indistinguishable from production, with one big difference. It exists vacuum-sealed away from all production systems by default. Such an environment renders the remote systems and local processes that satisfy even basic functionality (configuration, service discovery, request routing, database access, key\/value storage systems) unavailable. In this way, the environment presents hardware in a state we\u2019d find during a total outage of Meta infrastructure. Everything but basic operating system functionality is broken.<\/span><\/p>\n<p><span>To produce this environment, we lean heavily on our virtualization infrastructure and fault injection systems. Each instance of a BellJar environment is ephemeral, composed of a set of virtual machines (VMs) that constantly rotate through a lease-wipe-update cycle. Before we hand the machines to a BellJar user, we inject an enormously disruptive blanket of faults that blackhole traffic over all network interfaces, including <\/span><span>loopback<\/span><span>. Our fault injection engine affords us allowlist-style disruptions we can stack against VMs. They disrupt our systems at the IP layer, at the application\u2019s network layer, or on a per-process or container basis.<\/span><\/p>\n<p>BellJar allows us to build tightly tailored test environments composed only of select production systems.<\/p>\n<p><span>BellJar\u2019s API gives operators the ability to customize which production capabilities they want to leak back into this vacuum through an additive allowlisting scheme. Need DNS? We\u2019ll allow it. Need access to a specific storage cluster? Just tell BellJar the name of the cluster. 
Want access to the read-only pathway for a certain binary package backend? You got it.<\/span><\/p>\n<p><span>Through granular, checkbox-style allowlisting, the operator can customize the limited set of remote endpoints, local daemons, and libraries they want to earmark as \u201chealthy.\u201d The resulting environment presents a severe outage scenario of the operator\u2019s choosing. But within the environment, the operator can describe and test their own system\u2019s minimally viable recovery requirements and startup procedure, from empty hardware to healthy recovery.<\/span><\/p>\n<h2><span>Making each recovery from a unique recipe<\/span><\/h2>\n<p><span>This approach reveals several unique ingredients in each BellJar test. Often, teams will define a set of different recovery scenarios for a single piece of infrastructure. Each describes its own combination of the following:<\/span><\/p>\n<p>Service under test:<span> This might be a container scheduler service, a fleetwide config distribution system, a low-dependency coordination back end, a certificate authority, or any number of other infrastructure components we wish to inspect. Most, but not all, of these services run in containers.<\/span><br \/>\nHardware: <span>Different systems run on different hardware setups. We compose the test environment accordingly.<\/span><br \/>\nRecovery strategy:<span> Each system has a recovery runbook that its operators expect to follow to bring up the system from scratch on fresh hardware.<\/span><br \/>\nValidation criteria:<span> The criteria that indicate whether a recovery strategy has produced a healthy system.<\/span><br \/>\nTooling:<span> The steps of those recovery runbooks likely involve system-specific tooling that operators will reach for during outage mitigation.<\/span><br \/>\nRecovery conditions:<span> Most systems require something within our infrastructure to be healthy before they can recover. 
An allowlist captures this set of assumptions, enumerating each production capability the recovery environment must provide before we can expect it to succeed.\u00a0<\/span><\/p>\n<p><span>All these ingredients can be assembled to define a single BellJar test, which the service owner writes with a few lines of Python.<\/span><\/p>\n<h2><span>Code conquers lore\u00a0<\/span><\/h2>\n<p><span>Since we can define each of the above inputs programmatically, we can code a fully repeatable test that distills all the assertions a service\u2019s operators want to prove about its ability to recover from failure.<\/span><\/p>\n<p><span>Prior to BellJar, we\u2019ve had to scour wiki documentation, interview a team\u2019s seasoned engineers, and spelunk through commit histories \u2014 all to understand whether <\/span><span>System A<\/span><span> or <\/span><span>System B<\/span><span> needed to be recovered first if both were brought offline. Exposing recovery-blocking cyclic dependencies often relied on mining through a team\u2019s tribal knowledge, outdated design documents, or code in multiple programming languages.<\/span><\/p>\n<p><span>Now BellJar tests the operators\u2019 assumptions, with a Boolean result. <\/span><span>True<\/span><span> means the service can be recovered with the runbook and tooling you defined, under any outage in which at least the test\u2019s recovery conditions are met. 
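To make the shape of such a test concrete, here is a self-contained Python sketch of how the ingredients combine into a Boolean verdict. The class names, capabilities, and runbook steps are illustrative assumptions for exposition only, not BellJar's actual API.

```python
from dataclasses import dataclass

# Illustrative sketch only: these names and this API are assumptions,
# not Meta's real BellJar interfaces.

@dataclass
class Environment:
    """A vacuum-sealed test environment: only allowlisted capabilities work."""
    allowlist: set

    def require(self, capability: str) -> None:
        # Anything outside the allowlist behaves as if it were down.
        if capability not in self.allowlist:
            raise ConnectionError(f"{capability} is unavailable in this outage")

@dataclass
class RecoveryTest:
    allowlist: set   # recovery conditions: capabilities assumed healthy
    runbook: list    # ordered recovery steps (stand-ins for shell commands)
    validate: object # Boolean validation criteria

    def run(self) -> bool:
        env = Environment(self.allowlist)
        try:
            for step in self.runbook:
                step(env)
            return self.validate(env)
        except ConnectionError:
            return False  # recovery is Boolean: it succeeds or it doesn't

# Hypothetical service whose runbook needs DNS and a read-only package store.
def fetch_binary(env): env.require("package_store.read_only")
def start_service(env): env.require("dns")

test = RecoveryTest(
    allowlist={"dns", "package_store.read_only"},
    runbook=[fetch_binary, start_service],
    validate=lambda env: True,  # stand-in for a real health check
)
print(test.run())   # True: the runbook succeeds under these recovery conditions

# Pare down the allowlist and the same test exposes the hidden dependency.
print(RecoveryTest({"dns"}, [fetch_binary, start_service], lambda e: True).run())
```

Paring the allowlist down until the test fails is exactly the iterative exercise that surfaces surprise dependencies.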
And we can deploy a service\u2019s BellJar test in its<\/span><a href=\"https:\/\/engineering.fb.com\/2021\/10\/20\/developer-tools\/autonomous-testing\/\"> <span>CI\/CD pipeline<\/span><\/a><span> to assert recoverability with every new release candidate before it reaches production.<\/span><\/p>\n<p>Users describe the capabilities afforded to their test environment alongside the recovery runbook steps they wish to exercise.<\/p>\n<p><span>For the individual service owner, the ability to generate a codified, repeatable recovery test opens doors to two exciting prospects:<\/span><\/p>\n<p>Move faster more safely<span>: If you\u2019ve ever been reluctant to change an arcane setting, retry policy, or failover mechanism, it may have been because you\u2019re unsure how it impacts your disaster recovery posture. Unsure whether you can safely use a library\u2019s new feature without jeopardizing your service\u2019s \u201clow dependency\u201d status? BellJar makes changes like these a no-brainer. Just make the change and post the diff. The automated BellJar test will confirm whether your recovery assertions still hold.<\/span><br \/>\nActively reduce coupling:<span> Many service owners discover surprising dependencies they didn\u2019t know existed when they onboard to BellJar, so we\u2019ve learned to onboard new systems gradually. First, the service owner defines their recovery strategy in an environment with all production capabilities available. Then we gradually pare down the recovery conditions allowlist to the bare minimum set of dependencies required for successful results. Through an iterative approach like this, teams uncover cross-system dependency liabilities. They then work to iteratively eliminate those that pose circular bootstrapping risks.<\/span><\/p>\n<h2><span>Multisystem knock-on effects<\/span><\/h2>\n<p><span>Wide coverage reveals cross-system details that were previously hard to understand. 
As a result, the benefits of the BellJar system multiply when we have many infrastructure service owners using it to explore their recovery requirements.<\/span><\/p>\n<h3><span>The recovery graph<\/span><\/h3>\n<p><span>BellJar allows teams to focus on the first-hop requirements their services demand for a successful recovery. In doing so, we intentionally do not expect each service owner to understand the full depth of Meta\u2019s entire dependency graph. However, limiting service owners to first-hop dependencies means BellJar alone does not provide them with the full picture needed to find transitive dependency cycles in our stack.<\/span><\/p>\n<p><span>Now that we\u2019re onboarding a broad collection of services, we can begin to assemble a large, multi-system recovery graph from the individual puzzle pieces provided by each service owner\u2019s efforts. We provide the same prompt to each service: \u201cWhat does it take to bring your service online on an empty set of hardware?\u201d By doing so, BellJar helps us generate the composable building blocks we need to tie together recovery requirements of services separated by organizational or time zone barriers. Security, networking, database, storage, and other foundational services express their portions of our recovery graph as code that lends itself to programmatic analysis. We\u2019re finding that this recovery-specific graph does not match the runtime relationships we uncover through distributed tracing alone.\u00a0<\/span><\/p>\n<h3><span>Tests as documentation<\/span><\/h3>\n<p><span>Service owners describe their recovery strategy (the runbook used to perform the service\u2019s recovery) as the set of steps a human would need to invoke during disaster recovery. This means a test includes the shell commands an operator will invoke to bootstrap things across otherwise empty machines. 
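The idea of one source of truth that is both executable and renderable as documentation can be sketched roughly as follows; the runbook format and helper names are illustrative assumptions, not BellJar's real representation.

```python
import subprocess

# Hedged sketch: a recovery runbook captured as data, so the same source
# can be executed in a test environment and rendered as operator docs.
RUNBOOK = [
    ("Fetch the emergency binary", "echo fetch-binary"),
    ("Start the service",          "echo start-service"),
]

def execute(runbook) -> bool:
    """Run each shell step in order; recovery fails on the first non-zero exit."""
    for _description, cmd in runbook:
        if subprocess.run(cmd, shell=True, capture_output=True).returncode != 0:
            return False
    return True

def render_docs(runbook) -> str:
    """Regenerate operator-facing documentation from the steps the test ran."""
    return "\n".join(f"{i}. {desc}: `{cmd}`"
                     for i, (desc, cmd) in enumerate(runbook, 1))

print(execute(RUNBOOK))
print(render_docs(RUNBOOK))
```

Because the docs are derived from the same data the test executes, a passing test implies the rendered runbook is current.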
Typically, these commands reach for powerful toolkits or automation controls.<\/span><\/p>\n<p><span>Codifying a human-oriented recovery runbook inside the BellJar test framework may seem strange at first. But human intervention is typically the first step in addressing the disasters we\u2019re planning for. By capturing this runbook for each system, whether it\u2019s a single command or an exceptionally complex series of steps, we can <\/span><span>automatically<\/span><span> generate the human-friendly documentation operators typically have had to manually maintain on wikis or offline reference manuals.<\/span><\/p>\n<p><span>Every time a service\u2019s BellJar test passes, the corresponding recovery documentation gets regenerated in human-friendly HTML. This eliminates the staleness and toil we tend to associate with traditional runbooks.<\/span><\/p>\n<h3><span>Federated recovery and healthy ownership<\/span><\/h3>\n<p><span>Consider a foundation composed of scores of infrastructure systems. It is not scalable to assume we can task a single dedicated team, even a large one, with broadly solving all our worst-case recovery needs. At Meta, we do have a number of teams working on cross-cutting reliability efforts, protecting us against RPC overload, motivating<\/span><a href=\"https:\/\/engineering.fb.com\/2020\/09\/08\/data-center-engineering\/fault-tolerance-through-optimal-workload-placement\/\"> <span>safe deployment<\/span><\/a><span> in light of unexpected capacity loss, and equipping teams with powerful fault injection tools. 
We also have domain experts focused on multisystem recoveries of portions of our stack, such as intra-data center networking or container management.<\/span><\/p>\n<p><span>However, we\u2019d best avoid a model in which service owners feel they can offload the complexities of disaster recovery to an external team as \u201csomeone else\u2019s problem.\u201d The projects mentioned above achieve success not because they absolve service owners of this responsibility, but because they help service owners better solve specific aspects of disaster readiness.<\/span><\/p>\n<p><span>BellJar takes this model a step further by federating service recovery validation out to service owners\u2019 development pipelines. Rather than outsource to some separate team, the service owner designs their recovery strategy. They then test this strategy. And they receive the failing test signal when their strategy gets invalidated.<\/span><\/p>\n<p><span>By moving recovery into the service owner\u2019s release cycle, BellJar motivates developers to consider how their system can recover simply and reliably.<\/span><\/p>\n<h3><span>Common tooling<\/span><\/h3>\n<p><span>Until recently, service owners who have been especially interested in improving their disaster readiness have found themselves building a lot of bespoke recovery tooling. Without a standard approach that\u2019s shared by many teams, the typical service owner discovers and patches gaps in recovery tooling on an ad-hoc basis. Need a way to spin up a container in a pinch? A way to distribute failsafe routing information? Tools to retrieve binaries from safe storage? We\u2019ve found teams reinventing similar tools in silos because of the relative difficulty in discovering common recovery needs across team boundaries.<\/span><\/p>\n<p><span>By exposing recovery runbooks in a universal format that can be collected in a single place, BellJar is helping us easily identify the common barriers that teams need standard solutions for. 
The team building BellJar has consulted with dozens of service owners. As a result, we\u2019ve helped produce standard solutions like:<\/span><\/p>\n<p><span>A common toolkit for constructing low-level containers in the absence of a healthy container orchestrator.<\/span><br \/>\n<span>Fleetwide support for emergency binary distribution that bypasses most of our common packaging infrastructure.<\/span><br \/>\n<span>Secure access control over emergency tooling that cooperates with service owners\u2019 existing ACLs.<\/span><br \/>\n<span>Centralized collection and monitoring of the inputs (metadata and container specifications) needed for recovery tooling.<\/span><\/p>\n<p><span>By building well-tested, shared solutions for these common problems, we can ensure that those tools get first-class treatment under a dedicated team\u2019s ownership.<\/span><\/p>\n<h2><span>Common patterns across systems<\/span><\/h2>\n<p><span>Thanks to BellJar\u2019s uniquely broad view of our infrastructure, we\u2019ve also been able to distill some pretty interesting patterns that teams have had to grapple with when designing for recovery.<\/span><\/p>\n<h3><span>The Public Key Infrastructure foundation<\/span><\/h3>\n<p><span>Access controls and TLS define every action that both human operators and automated systems can and cannot take. Security becomes important during disaster recovery, just as in a steady state. This means that systems like certificate authorities are important to any recovery strategy. In an environment where every container requires an X.509 certificate from a trusted issuer, removing dependency cycles for containerized security services becomes particularly hard. Security systems demand early investment to unlock robust recovery strategies elsewhere in the stack.<\/span><\/p>\n<h3><span>DNS seeps into everything<\/span><\/h3>\n<p><span>Even when you use a custom service discovery system, DNS permeates tooling everywhere. 
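One way tooling survives when resolution is unavailable is to accept raw IP addresses directly. A minimal sketch of that pattern, with a hypothetical helper name:

```python
import ipaddress
import socket

# Sketch of an "accept a raw IP address in a pinch" retrofit; the helper
# name and behavior are illustrative assumptions.
def resolve_endpoint(host: str) -> str:
    """Prefer an IP literal so tooling keeps working when DNS is down."""
    try:
        # During an outage an operator can hand us a raw (IPv6) address.
        return str(ipaddress.ip_address(host))
    except ValueError:
        pass
    try:
        return socket.getaddrinfo(host, None)[0][4][0]  # normal DNS path
    except socket.gaierror as err:
        raise RuntimeError(f"DNS unavailable and {host!r} is not an IP literal") from err

print(resolve_endpoint("2001:db8::1"))  # works with no DNS at all
```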
Due to its ubiquity, engineers often forget DNS is a network-accessed system. Because it is so robust, they treat it more like a local resource akin to the filesystem. Learning to view it as the fallible, remote system it is requires rigor and constant recalibration. Strict testing frameworks really help with this, especially in an environment of <\/span><a href=\"https:\/\/engineering.fb.com\/2017\/01\/17\/production-engineering\/legacy-support-on-ipv6-only-infra\/\"><span>IPv6<\/span><\/a><span>, where 128-bit addresses aren\u2019t easy to remember or write on a sticky note.<\/span><\/p>\n<p><span>For select systems that power discovery, we\u2019ve moved to relying on routing protocols to stabilize well-known IP addresses. Across our vast collection of tooling, we\u2019re retrofitting much of it with the ability to accept IP addresses in a pinch.<\/span><\/p>\n<h3><span>Caching obscures RPCs<\/span><span>\u00a0<\/span><\/h3>\n<p><span>Years of work have made our systems very difficult to fully break, hardening them with caching and fallbacks that balance trade-offs in freshness vs. availability. While this is great for reliability, this gentle failover behavior makes it challenging to predict how worst-case failure modes will eventually manifest. The first time that service owners see what production would be like without any configuration or service discovery caching, for instance, often occurs in a BellJar environment. For this reason, the team that builds BellJar spends a lot of time ensuring that we\u2019re actually breaking things inside our test environments in the worst way possible.<\/span><\/p>\n<h3><span>Even configuration needs configuration<\/span><\/h3>\n<p><span>At our scale, everything is configuration-as-code, including the feature knobs and emergency killswitches we deploy for every layer in the stack. We deploy RPC protocol library features and control security ACLs via configuration. Hardware package installations get defined as configuration. 
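In practice, "even configuration needs configuration" often implies automatic fallback to prepackaged values, so a client can still start when its distribution system is broken. A rough sketch, where the path and default values are hypothetical:

```python
import json

# Hedged sketch: a config reader with automatic fallback to prepackaged
# values. The path and defaults below are illustrative assumptions.
PREPACKAGED = {"max_connections": 100, "emergency_mode": False}

def load_config(distributed_path: str) -> dict:
    """Prefer the distributed config; fall back to values shipped with the binary."""
    try:
        with open(distributed_path) as f:
            return json.load(f)
    except (OSError, ValueError):
        # Distribution never delivered, or delivered garbage: use the
        # baked-in defaults rather than refusing to start.
        return dict(PREPACKAGED)

print(load_config("/nonexistent/config.json")["max_connections"])  # 100
```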
Even our configuration distribution systems behave according to configuration settings. The same applies to emergency CLI tooling and fleetwide daemons. When all this configuration is authored and distributed via a common system, it\u2019s worth your time to design robust mechanisms for emergency delivery and automatic fallback to prepackaged values, along with a diverse set of contingency plans for whenever the configuration that powers your configuration system is broken.<\/span><\/p>\n<h3><span>Libraries demand allowlist-style validation<\/span><\/h3>\n<p><span>At the pace we move, even low-level services lean on libraries that constantly change in ways no single engineering team can scrutinize line by line. Those libraries can be a vector for unexpected dependency graph creep. With enormous codebases and thousands of services, relying on traditional blocklist-style fault injection to understand how your system interacts with production can be a game of whack-a-mole. In other words, you don\u2019t know what to test if you can\u2019t even enumerate all the services in production.<\/span><\/p>\n<p><span>We\u2019ve found unwanted dependencies inserted into our lowest-level systems thanks to simple drop-in Python2 -&gt; Python3 upgrades, which have made recovery tooling dependent on esoteric services. Authorization checks have introduced hard dependencies between emergency utilities and our web front end. Ubiquitous logging and tracing libraries have found themselves unexpectedly promoted to SIGABRT-ing blockers to our container tooling. Thanks to allowlist-style tests like those provided by BellJar, we can automatically detect new dependencies on systems we didn\u2019t even know existed.<\/span><\/p>\n<h2><span>Asking new questions<\/span><\/h2>\n<p><span>We\u2019re still being surprised by what our services can learn through BellJar\u2019s tightly constrained testing environment. 
And we\u2019re expanding the set of use cases for BellJar.<\/span><\/p>\n<p><span>Now, beyond its initial function for verifying the recovery strategies for our services, we also use BellJar to zoom in on other important pieces of code that don\u2019t look like containers or network services at all.<\/span><\/p>\n<p><span>Common libraries like our internal<\/span><a href=\"https:\/\/engineering.fb.com\/2013\/11\/21\/core-data\/under-the-hood-building-and-open-sourcing-rocksdb\/\"> <span>RocksDB<\/span><\/a><span> build and<\/span><a href=\"https:\/\/engineering.fb.com\/2019\/08\/02\/core-data\/systems-scale\/\"> <span>distributed tracing<\/span><\/a><span> framework now use BellJar to assert that they can gracefully initialize and satisfy requests in environments where none of Meta\u2019s upstream systems are online. Tests like these protect against dependency creep that could find these libraries blocking recoveries for scores of infra systems.<\/span><\/p>\n<p><span>We\u2019ve also expanded BellJar to vet the sidecar binaries we run on millions of machines to support basic data distribution, host management, debugging, and traffic shaping. With extremely restricted allowlists, we continually evaluate new daemon builds. Each version must demonstrate it can auto-recover from failure, even when the upstream control planes it typically listens to go dark.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2022\/05\/05\/developer-tools\/belljar\/\">BellJar: A new framework for testing system recoverability at scale<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Building infrastructure that can easily recover from outages, particularly outages involving adjacent infrastructure, too often becomes a murky exploration of nuanced fate-sharing between systems. 
Untangling dependencies and uncovering side effects of unavailability has historically been time-consuming work. A lack of great tooling built for this, and the rarity of infrastructure outages, makes reasoning about them&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/05\/05\/belljar-a-new-framework-for-testing-system-recoverability-at-scale\/\">Continue reading <span class=\"screen-reader-text\">BellJar: A new framework for testing system recoverability at scale<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-573","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/573","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=573"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/573\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=573"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=573"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=573"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}