The Unified Infrastructure Platform Behind Salesforce Hyperforce

If you’re paying attention to Salesforce technology at all, you’ve no doubt heard about Hyperforce, our new approach to deploying Salesforce on public cloud providers. As with any big announcement, it can be a little hard to cut through the hyperbolic language and understand what’s going on.

In this blog series, we’ll demystify what Hyperforce actually is from a technical perspective: how it works, and why it’s so revolutionary for our customers and our engineering teams.

We’ll start with a look at Hyperforce’s architecture.

Back In My Day …

Salesforce has been around for over two decades. Back in 1999, when the company was founded, if you wanted to run a public internet software service (Software as a Service, or SaaS), the first thing you had to do was to get some servers and hook them up to the internet. In fact, in the very early days of Salesforce, this literally involved running ethernet cables above Marc Benioff’s bedroom door.

Servers like it’s 1999 (because it was!)

Flash forward to about 2017, and Salesforce doesn’t just have a few hundred, or even a few thousand servers — we have hundreds of thousands of servers, in data centers all around the world, sharded into discrete instances of the product. But even at this scale, the basic premise is still the same: we procure and control the servers on which our software runs, because … well, because we’ve always done it that way.

Of course, around this same time, a massive wave was sweeping the industry. Instead of every company running its own infrastructure, a handful of exceptionally high-scale providers emerged who could do the same thing entirely as a service. And not only that, but because of their scale, they could offer much higher levels of elasticity than anyone could achieve on their own, through having large buffer pools of available capacity and automated mechanisms to provision and set up those resources.

This model — infrastructure as a service, or IaaS — is so effective that it’s essentially wholly supplanted the practice of running one’s own infrastructure. These days, to run software on the internet is, more or less, to run it in public cloud.

So what does this mean for Salesforce?

In the early to mid 2010s, when we first broached this question with a few of our bigger customers, the answer was pretty clear: “no way!” After all, Salesforce’s priority of trust (security and availability in particular) was a big reason to prefer our private infrastructure over the less well-known and newer public cloud alternatives. It seemed like an IaaS vendor certainly wouldn’t take as good care of our customers’ precious data as we would, right?

The Tides Turn

Around 2017, we started to notice a change. Not only had digital transformation accelerated (for a lot of reasons), but most companies had gotten a lot more comfortable with the concept of public cloud infrastructure. In fact, most of them were now using it directly, for their systems! Hosting workloads and data in public cloud was no longer scary; it was just a fact of business, and people came to realize that the big public cloud providers offered high levels of compliance, security, and quality. It no longer struck people as an undue risk.

With this in mind, we started asking our customers once again: what would you think if Salesforce ran in public cloud? And this time, the answer was much different: “Not only would I be OK with it, but I’d actually prefer you to run in public cloud, because then my Salesforce data will be colocated with all of my other data!”

At the same time, another trend was contributing to our investigation of public cloud. Across the world, a growing number of data residency policies were going into effect. These policies state that for a subset of companies (for example, financial institutions in Australia or Canada), their data must be stored within the boundaries of their home country. Following this trend to its natural conclusion, we realized that it would mean running our own independent data centers in dozens or hundreds of countries, a scenario that’s economically infeasible for us and for our customers.

So with these questions in mind, we started afresh and asked: what would it mean for Salesforce to run in public cloud?

Free-For-All, Or Unified Approach?

Salesforce has a big engineering organization. It’s composed of thousands of independent teams that each look after a subset of our products.

So, one approach to moving to public cloud might have been for us to double down on that team-level independence and say, “OK, everybody whip out your credit cards and sign up for an AWS account to run your service!” This would have been a fast way to start, and the fact that public cloud infrastructure works as a hands-off service really promotes that kind of rapid approach.

As you can probably guess, however, there are a number of problems Salesforce would run into with that mindset. Our teams are all independent, but our products are deeply integrated. From our customers’ perspective, there’s just one Salesforce, and it needs to be a unified product that they can trust. It needs to run with high availability, high security, low latency, and predictable behavior. If every one of Salesforce’s thousands of independent software services was running with its own setup, its own unique accounts, and its own security practices, there would be no practical way to ensure high standards. And, more importantly, we wouldn’t be able to give our customers any assurances about the quality or compliance of this infrastructure. In that case, the original concerns we heard from customers about moving to public cloud would actually be true!

So with Hyperforce, we took a very different approach: unified from the beginning as a singular platform, with trust as its #1 value. In practice, this means we have a single method of, for example, creating and maintaining provider accounts, spinning up new resources, accounting for costs, and delivering software securely.

Lift-And-Shift, Or Shake-And-Bake?

Above, I painted the shift from first-party infrastructure to public cloud infrastructure as being somewhat transparent: it’s just a change in who’s leasing the servers, right?

Well, it turns out this couldn’t be further from the truth. In particular, the architectural reality of infrastructure as a service is that hardware-level reliability is not only not guaranteed, it’s almost anti-guaranteed. You simply can’t run millions of servers and treat each one as a “pet” that requires special care and feeding. Servers come and go, often without warning, and the only safe approach is to build distributed systems that expect this instability and handle it without blinking.

What this means, then, is that if you’ve built a system that’s predicated on the hardware being reliable, you’ll have a hard time moving into public cloud. And, for better or worse, that’s the position that Salesforce had evolved into. As a very data-centric application, our relational databases are paramount, and they need to be rock-solid systems-of-record. Historically, we achieved that by making them big, highly customized, and optimized to the hilt. But that approach was clearly not going to fly in public cloud.

Moreover, in the two decades between when Salesforce started and now, there have been a number of systemic, industry-wide evolutions in infrastructure strategy. These new principles and mental models really do provide better ways of working, but they can be hard to “evolve” into from older practices.

As one example, the standard practice in 1999 was that if you needed to change something in your infrastructure — say, changing a configuration on a server’s network interface to achieve higher throughput — you’d connect to each server in your infrastructure and make that change. If you wanted to get fancy, you’d use something like Puppet to automate this process and make it more repeatable and faster, but the basic approach was the same. The problem with this, of course, is that without a great deal of discipline (and monitoring), what you end up with is a lot of drift. 99% of the servers might have your change, but maybe a couple were in a reboot cycle when you tried to apply it, and you didn’t go back to fix it.

The more modern approach (which we’ll go deeper on in a future post) is to say that, instead, all infrastructure is immutable, and when you want to make a change, you first put up a new version (of the server, or VM, or container) and then take down the old one. In this way, you can know exactly what’s happening in production, because what’s happening in production is derived 100% from source-controlled artifacts (such as Terraform configurations).
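To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is a toy simulation, not how Hyperforce is implemented: the fleet size, the server names, the `mtu` setting, and the helper functions are all hypothetical and chosen only to illustrate how in-place patching leaves drift behind while replacement from a single image does not.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServerImage:
    """Immutable description of a server, as built from source control (hypothetical)."""
    version: int
    mtu: int  # illustrative network setting


def patch_in_place(fleet: dict[str, int], new_mtu: int, unreachable: set[str]) -> dict[str, int]:
    """Older style: visit each server and change the setting; unreachable servers are skipped."""
    return {name: (mtu if name in unreachable else new_mtu) for name, mtu in fleet.items()}


def replace_fleet(names: list[str], image: ServerImage) -> dict[str, ServerImage]:
    """Immutable style: every server is launched fresh from the same image."""
    return {name: image for name in names}


names = [f"server-{i}" for i in range(100)]
old_fleet = {name: 1500 for name in names}

# Two servers happen to be rebooting while the change is rolled out.
patched = patch_in_place(old_fleet, new_mtu=9000, unreachable={"server-7", "server-42"})
drifted = sorted(name for name, mtu in patched.items() if mtu != 9000)

# With replacement, the whole fleet is derived from one source-controlled image.
replaced = replace_fleet(names, ServerImage(version=2, mtu=9000))
distinct_images = set(replaced.values())

print("drifted after in-place patch:", drifted)
print("distinct images after replacement:", len(distinct_images))
```

In the patched fleet, the two servers that were unreachable silently keep the old setting. In the replaced fleet there is only one possible configuration, because nothing is ever modified after launch.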

This is just one example, of course. There are a whole host of other related principles that you can derive from the basic forces at play (unreliable hardware, elastic provisioning, API-driven interaction). Many of these were spelled out early on by the Heroku team in the 12-factor app manifesto, and many more have come to light since as the industry has evolved.

So in our transition to Hyperforce, we’ve also made the commitment to not just “lift and shift,” but instead to go through a “step function” in how we manage our infrastructure resources, jumping from 1999 practices to 2021 practices in one fell swoop. We’ll go deeper into all of these principles in future posts, but as a brief snapshot:

Infrastructure-as-code. We use artifacts under source control to dictate 100% of the setup and management of infrastructure resources.

3-Availability-Zone design for High Availability. Instead of pairs of remote sites replicating data in preparation for large catastrophes, we rely on close-but-independent trios of availability zones.

Measure service health, not server health. We elevate the perspective of our monitoring, away from concrete signals of the health of individual components, and towards aggregate measures from a client perspective on whether any service is healthy.

Zero Trust. We assume breach, and treat every service’s communication with every other service as something that needs strong identity, encryption in transit, and close monitoring.
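As a rough illustration of the "measure service health, not server health" principle, the Python sketch below compares a per-server view with an aggregate, client-side view. Everything in it is an assumption made for the example: the node names, the 99.9% success threshold, and the `service_is_healthy` helper are hypothetical, not Salesforce's actual monitoring.

```python
from __future__ import annotations


def service_is_healthy(request_outcomes: list[bool], min_success_rate: float = 0.999) -> bool:
    """Judge health from the client's perspective: the share of requests that succeeded.

    With no requests observed we return False, on the assumption that
    the absence of a signal is not evidence of health.
    """
    if not request_outcomes:
        return False
    success_rate = sum(request_outcomes) / len(request_outcomes)
    return success_rate >= min_success_rate


# Case 1: one node is down, but traffic is routed around it and clients never notice.
servers_case_1 = {"node-a": True, "node-b": False, "node-c": True}
requests_case_1 = [True] * 1000
server_view_1 = all(servers_case_1.values())
service_view_1 = service_is_healthy(requests_case_1)

# Case 2: every node reports "up", yet 5% of client requests are failing.
servers_case_2 = {"node-a": True, "node-b": True, "node-c": True}
requests_case_2 = [True] * 950 + [False] * 50
server_view_2 = all(servers_case_2.values())
service_view_2 = service_is_healthy(requests_case_2)

print("case 1: all servers up?", server_view_1, "| service healthy?", service_view_1)
print("case 2: all servers up?", server_view_2, "| service healthy?", service_view_2)
```

The two views disagree in both cases. The per-server view would raise an alarm in case 1, where clients are unaffected, and stay quiet in case 2, where they are not.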

These are just a few of the principles that we think about in Hyperforce. Over the course of this blog series, we’ll get into much more detail about each of these, and many more.

The Upside

Of course, a transition as massive as Hyperforce is never done for purely architectural principles. The fact is, running in public cloud makes deep sense for Salesforce, and even more so for our demanding customers.

The most obvious benefit is elasticity. If you suddenly need more resources — say, because a marketing campaign went viral — then public cloud gives you the simple ability to request more resources, and then release them as soon as you’re done with them.

This comes up in surprising ways. Prior to public cloud, think about the way capacity planning happened. For many businesses, there are natural peaks to activity in the world of Sales: the quarter-end, when all the salespeople are trying to meet their numbers, and all their prospects are trying to make purchasing decisions. This is a time when the activity in our services is (and has always been) heightened. But, when you run your own infrastructure, this peak is exactly what you have to plan for. And, you have to keep all of that hardware running all of the time (unless you go through the effort to decommission it and return it, only to get it again next quarter, which is clearly not any more efficient). But because of the elasticity of public cloud, and the fact that these vendors serve compute resources at a scale for the whole world, our local peak cycles are easily absorbed into their existing buffer. In other words, they spread this uncertainty across their entire client base in a way that no individual tenant can (even a big one like Salesforce).
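A little arithmetic shows why this matters. The Python sketch below uses made-up numbers (a 13-week quarter, 100 servers of steady demand, and a quarter-end week that needs 160) and measures capacity in "server-weeks". It deliberately ignores any difference in unit price between owned and rented servers, so treat it as an illustration of the shape of the problem rather than a cost model.

```python
# Hypothetical demand: 12 ordinary weeks, then a quarter-end peak.
weekly_demand = [100] * 12 + [160]

# Owning hardware: you provision for the peak and keep it running every week.
owned_server_weeks = max(weekly_demand) * len(weekly_demand)

# Elastic capacity: you hold only what each week actually needs.
elastic_server_weeks = sum(weekly_demand)

idle_server_weeks = owned_server_weeks - elastic_server_weeks
utilization = elastic_server_weeks / owned_server_weeks

print("owned:", owned_server_weeks)      # 2080
print("elastic:", elastic_server_weeks)  # 1360
print("idle:", idle_server_weeks)        # 720
print(f"utilization of owned fleet: {utilization:.0%}")
```

Under these assumptions, roughly a third of the owned fleet's capacity sits idle over the quarter, purely because it was sized for a single week.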

Elasticity also has an important effect on innovation. If you run your own data centers, and you’ve got a new idea for a service, your options are limited: you can try to run it on the same servers that are already handling traffic for your existing products (risky), or you can possibly scrounge up some leftover servers that weren’t being used for anything. But barring either of those, your choice is basically to go through the procurement process of getting and installing new servers. And if you’ve ever worked in that space, you know that it’s a process measured in months (at best), certainly not hours. But conversely, when that’s handled as a service, it’s a question that simply ceases to be important. Got an idea for something new? Just get a few nodes and try it. Want to scale it up as a beta? Go for it, and as long as the cost of the resources is merited, you’re done.

(Note, this isn’t to say that capacity management disappears in the world of public cloud; quite the opposite. It becomes potentially even more important, but it works in a really different way. This is something we’ll cover in a future post!)

As we stated above, there are a couple of very clear direct benefits for our customers. One of them is proximity. Suppose a customer is already running workloads or storing data in public cloud, and wants to combine that with Salesforce data, use our processing capabilities like Salesforce Functions to crunch that data, or bridge between all the pieces of their application network with MuleSoft. In that case, it’s a distinct advantage for the Salesforce side of that equation to be running in the same public cloud region, because latencies and costs are much lower.

Another is data residency policy, which (as I mentioned above) increasingly requires companies to store their data in a particular geographic location. Indeed, with Salesforce as a partner and intermediary to public cloud vendors, companies have even more leverage over these aspects, as was the case in our recently announced expansion with AWS into Switzerland. Public cloud enables new topologies of compute throughout the world that no single company can match on its own.

Perhaps most importantly, though, the real benefit of moving to public cloud is focus. Salesforce is a CRM company; our job is to help companies connect with their customers. Did you see anything in there about being amazing at running data centers? I didn’t either. We came into being at a time when running servers was synonymous with running software on the internet, but these days, it simply isn’t any more. There are so many compelling reasons to move to Hyperforce, both for us and our customers. We’re excited to do it in the way that only Salesforce would — with trust, availability and security at the forefront from day one.

Conclusion

What we’ve talked about in this post is just the tip of the iceberg. There’s so much happening inside of Hyperforce that we’re dying to tell you about because of how it changes the game with respect to what Salesforce can build for our customers.

Be sure to keep an eye on this space for future posts where we’ll go deeper into the key architectural principles that make Hyperforce work behind the scenes. We’ll discuss how we organize resources in Hyperforce, what we mean by immutable infrastructure, how that 3-Availability-Zone design we mentioned above enables high availability, why the heck zero trust actually means the opposite of what it sounds like, and more. And, we’ll eventually share some of the implementation details that build on this architectural foundation to actually deliver Hyperforce, like our service mesh, CI/CD pipelines, developer experience, and service ownership.

Thanks to Ian Varley for additional contributions to this post!

Follow along with the full Hyperforce series:

Behind the Scenes of Hyperforce: Salesforce’s Infrastructure for the Public Cloud

The post The Unified Infrastructure Platform Behind Salesforce Hyperforce appeared first on salesforce-engineering.go-vip.net.