How Salesforce Built a Cloud-Native Task Execution Service

If you’re paying attention to Salesforce technology, you’ve no doubt heard about Hyperforce, our new approach to deploying Salesforce on public cloud providers. (For background, start with a look at Hyperforce’s architecture.) There are many compelling reasons to move to Hyperforce, both for us and our customers. We’re excited to do it in the way that only Salesforce would, with trust, availability, and security at the forefront from day one. Building a unified infrastructure platform for Hyperforce meant taking a fresh look at our automation tools so that our operations could scale.

Salesforce has been around for over two decades. In 1999, when the company was founded, if you wanted to run a public internet software service (Software as a Service, or SaaS), you first had to get some servers and hook them up to the internet. So we built a few tools to perform our releases and database maintenance operations using SSH. Fast forward to 2015, when Salesforce took a very early bet on Kubernetes (K8s) to help manage an extensive suite of microservices. We’re proudly using it today across product lines and business units. With our transformation to Hyperforce, building and using cloud-native tools, security, and processes made the most sense.

To leverage the scale and agility of the world’s leading public cloud platforms, our Technology and Products team has worked together over the past few years to build a cloud-native task execution system to execute remote operational tasks at scale. Because we believe you may need to walk down this path, too, we’d like to share some challenges we faced and the solutions we identified.

Transitioning away from SSH

By default, many companies take a “lift and shift” approach to running in the public cloud: they make the minimum set of changes needed for their software to run on public cloud infrastructure. As Salesforce has grown over the past two decades, the volume of secure shell (SSH) keys and their use has grown dramatically. At the same time, SSH-based attacks have become a popular choice for attackers targeting business networks. Over the past few years, attackers have used Interplanetary Storm and crypto-miner campaigns like Golang and Lemon_Duck to create backdoors. These attacks exploit weaknesses in SSH access and abuse SSH keys in several ways to gain and extend network access. So Hyperforce was our chance to completely re-envision those practices in a cloud-native way, with uncompromising security and availability as part of our approach.

Build-Your-Own vs. Open-Source

Our prior experience was with static infrastructure, using Puppet to roll out automation scripts across our fleet of servers. However, as we started our research and development on operating a remote task execution service, we were clear about our fundamental design principles:

Secure-by-default – Security was baked into the Hyperforce architecture from the start through its universal authentication architecture: principles, pathways, and processes that create security by default. Our task execution service had to meet the high bar set by our security team.

Simple and Easy to Use – We want the architecture to be simple so that the cost to maintain and operate is not high. We are solving a single problem: having an automated workflow execution system that can help automate commonly run operational tasks. The service must be easy to use to have an excellent Developer Experience (DX).

Immutability – With Hyperforce, we adopted a modern approach to infrastructure, i.e., all infrastructure is immutable. This approach required us to rethink how we colocated our operations automation scripts with our applications.

Multi-substrate – We wanted a flexible solution to support operating Salesforce on top of any substrate, aka cloud provider infrastructure, i.e., Amazon, Google, Microsoft, etc.

As we were still transitioning our services to microservices design patterns and running them as containers on Kubernetes, we required a hybrid solution that supported task execution on servers, virtual machines (VMs), and Kubernetes-based deployments. Unfortunately, this ruled out container-native workflow engines.

We also evaluated several other open-source workflow orchestration engines. Ultimately, we decided that the best way to stay close to our design principles was to develop this task execution service in-house.

Decoupling automation scripts from application source code

In keeping with our immutability design goal in Hyperforce, we needed to decouple automation scripts from our application deployments to reduce the operational cost of performing new releases via our CI/CD platform.

To promote a standard model for our task execution, we devised a Task Recipe Execution Framework. A recipe file is a declarative interface for an operator to define the main business logic for task execution. We quickly iterated through the framework and adopted Object-Oriented principles. These principles helped us to provide boilerplate code for new task declaration through a Base Recipe class. The task execution workers pass a recipe context containing input parameters and environment metadata that the recipe can leverage.
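To make the idea of a Base Recipe class concrete, here is a minimal Python sketch. It is an illustration under our own assumptions, not the actual framework: the names RecipeContext, BaseRecipe, RestartServiceRecipe, required_params, and the shape of the returned status dictionary are all hypothetical. It shows how a base class can hold the shared boilerplate (input validation and error reporting) so that a recipe author writes only the task-specific logic.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class RecipeContext:
    """Hypothetical context a worker hands to a recipe."""
    params: Dict[str, Any] = field(default_factory=dict)       # operator-supplied inputs
    environment: Dict[str, str] = field(default_factory=dict)  # e.g. region, substrate


class BaseRecipe(ABC):
    """Hypothetical base class holding the boilerplate shared by all recipes."""

    name = "base"
    required_params: tuple = ()

    def run(self, ctx: RecipeContext) -> Dict[str, Any]:
        # Shared boilerplate: validate inputs, then call the recipe's own logic.
        missing = [p for p in self.required_params if p not in ctx.params]
        if missing:
            return {"recipe": self.name, "status": "rejected", "missing": missing}
        try:
            output = self.execute(ctx)
        except Exception as exc:  # report the failure instead of crashing the worker
            return {"recipe": self.name, "status": "failed", "error": str(exc)}
        return {"recipe": self.name, "status": "succeeded", "output": output}

    @abstractmethod
    def execute(self, ctx: RecipeContext) -> Any:
        """Recipe authors implement only the task-specific logic here."""


class RestartServiceRecipe(BaseRecipe):
    """Example recipe; it only describes the action rather than performing it."""

    name = "restart-service"
    required_params = ("service",)

    def execute(self, ctx: RecipeContext) -> str:
        region = ctx.environment.get("region", "unknown")
        return f"would restart {ctx.params['service']} in {region}"
```

A worker in this sketch would build a RecipeContext from the task request and call run(), which always returns a status dictionary it can forward as progress.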

We created a mono repository in our source control system and centralized the delivery of these recipe files via our CI/CD pipeline to regional storage buckets (such as Amazon S3, Google Cloud Storage, or Azure Storage Services).

Architecture

The task execution control plane consists of three key components: an API server, a coordinator, and a status reporter. Workers are deployed as RPM packages on servers and virtual machines (baked into the image). For Kubernetes workloads, we built a mutating webhook (using our open-source tool) so that the worker runs as a sidecar alongside the application. Each component is described in detail below.

During the design phase, a critical choice was whether to use a service mesh or message queues for communication between the control plane and the workers. Given the sporadic nature of task requests, a message queue pattern made the most sense. Decoupling the control plane and workers helped eliminate many complexities, such as endpoint discovery, health checks, routing, and load balancing. Workers could execute tasks at their own pace using a work stealing pattern. Queues were created based on the infrastructure topology and fault partitions, and the coordinator routed each task by publishing a message on the proper queue based on input parameters.
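The queue-based routing described above can be sketched in a few lines of Python. This is a toy, in-process illustration using the standard library's queue and threading modules, not the production system, which would use a real message broker. The queue key (region, fault partition), the task fields, and the names Coordinator and run_workers are assumptions made for the example. We also read "work stealing" here as idle workers pulling the next message from a shared queue, which is an interpretation rather than something the description spells out.

```python
import queue
import threading
from typing import Dict, List, Tuple

QueueKey = Tuple[str, str]  # hypothetical key: (region, fault partition)


class Coordinator:
    """Routes each task to the queue that matches its input parameters."""

    def __init__(self, keys: List[QueueKey]) -> None:
        self.queues: Dict[QueueKey, queue.Queue] = {k: queue.Queue() for k in keys}

    def route(self, task: dict) -> QueueKey:
        key = (task["region"], task["partition"])
        if key not in self.queues:
            raise KeyError(f"no queue for {key}")
        self.queues[key].put(task)
        return key


def run_workers(q: queue.Queue, worker_names: List[str]) -> List[Tuple[str, int]]:
    """Start one thread per worker; each pulls tasks until the queue is empty."""
    done: List[Tuple[str, int]] = []
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            try:
                task = q.get_nowait()  # an idle worker takes the next task
            except queue.Empty:
                return
            with lock:
                done.append((name, task["id"]))

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


coordinator = Coordinator([("region-a", "p1"), ("region-a", "p2")])
for i in range(6):
    coordinator.route({"id": i, "region": "region-a", "partition": "p1"})
completed = run_workers(coordinator.queues[("region-a", "p1")], ["w1", "w2"])
```

Note that the coordinator never needs to know which workers exist or whether they are healthy; it only needs to pick the right queue. That is the decoupling benefit the paragraph above describes.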

API Server – The API server is an always-on RESTful interface that receives requests from operators and other trusted services. After completing the AuthN/AuthZ check, the API server delegates the request processing to the coordinator.

Coordinator – The coordinator is a stateless daemon deployed in each Hyperforce region and striped across multiple availability zones. It subscribes to messages from the API server and routes each message to the right workers based on the request criteria.

Status Reporter – Workers communicate their heartbeat and task execution progress to status reporters. The status reporter centralizes updates to our backend storage and removes the need for each worker to hold a persistent connection to our storage system.

Workers – Workers are stateless daemons running either as a sidecar for Kubernetes applications or as a system daemon on servers and virtual machines. At runtime, a worker pulls the latest copy of the recipe file, performs file integrity checks, and then executes the task.
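As an illustration of the worker's "pull, verify, then execute" step, the following Python sketch checks a downloaded recipe against an expected digest before allowing it to run. The article does not say how the integrity check is implemented, so the use of SHA-256 and the idea that a trusted digest is delivered alongside the task request are assumptions for this example. The function names are hypothetical as well.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the recipe contents as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def verify_recipe(recipe_bytes: bytes, expected_sha256: str) -> bool:
    """Compare the computed digest with the expected one (assumed to be trusted)."""
    actual = sha256_hex(recipe_bytes).encode("utf-8")
    expected = expected_sha256.strip().lower().encode("utf-8")
    # compare_digest runs in constant time for inputs of equal length
    return hmac.compare_digest(actual, expected)


def load_recipe(recipe_bytes: bytes, expected_sha256: str) -> str:
    """Return the recipe source only if it passes the integrity check."""
    if not verify_recipe(recipe_bytes, expected_sha256):
        raise ValueError("recipe failed integrity check; refusing to execute")
    return recipe_bytes.decode("utf-8")
```

Refusing to execute on a mismatch keeps a tampered or partially downloaded recipe from ever running, at the cost of a failed task that the worker can then report as such.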

Conclusion

This article only scratches the surface. Our next challenge is to make the task execution system feature-rich with near real-time monitoring. We plan to do this by adding support for cron and schedule-based task executions and by integrating with substrate-specific queuing technologies, resulting in a truly multi-cloud compatible service. We are setting the security bar high to uphold our company commitment to Trust. Finally, based on our experiments and testing, we’ve documented some best practices for utilizing queues and asynchronous processing that we’ll publish soon, so stay tuned!

The post How Salesforce Built a Cloud-Native Task Execution Service appeared first on Salesforce Engineering Blog.
