In our “Engineering Energizers” Q&A series, we delve into the journeys of distinguished engineering leaders. Today, we feature Archana Kumari, Director of Software Engineering at Salesforce.
Archana leads our India-based Data Cloud Compute Layer team, which played a pivotal role in a recent transition from Amazon EC2 to Kubernetes for Trino workloads. This shift not only enhanced performance and scalability but also reduced operational overhead, improved cost-efficiency, and accelerated the time to market for Data Cloud.
Discover the strategies Archana’s team used to tackle complex challenges, seamlessly integrate Hyperforce, and develop proprietary innovations that facilitated the transition of workloads to Kubernetes.
Meet Archana (left) and her Data Cloud Compute Layer team.
What is your team’s mission?
Our team’s mission is to build and sustain a world-class compute fabric that empowers Data Cloud teams and other Salesforce big data teams to unleash their creativity, drive innovation, and deliver exceptional solutions to our customers.
We built a platform from the ground up, the Kubernetes Resource Controller (KRC), that fully utilizes Kubernetes’ capabilities and integrates natively with the Salesforce ecosystem.
Our primary goal was to transition the Trino architecture to Kubernetes to enhance flexibility and efficiency for dedicated compute workloads. This shift enables quicker and easier deployment of new Trino instances, allowing for the use of the latest or any required version of Trino without restrictions. It also incorporates auto-scaling with Horizontal Pod Autoscaler and Cluster Autoscaler, which simplifies the design of dedicated compute features and reduces scale-out time to mere seconds.
By adopting Kubernetes, we have created a robust, scalable, and agile infrastructure that supports innovation, optimal performance, and dynamic adaptation to varying traffic loads.
What significant technical challenges did your team face during the migration?
One of the major hurdles we encountered was dealing with the complexities of Hyperforce, Salesforce’s proprietary layer on top of AWS. The architecture of Hyperforce is intricate, demanding a thorough understanding from our team to facilitate a seamless transition. The integration process was particularly challenging because Hyperforce modifies the standard AWS environment, rendering typical solutions and configurations inadequate.
To tackle this issue, we engaged in extensive collaboration with multiple teams within Salesforce, including Data Cloud’s Service Delivery Team, the Query Team, and Hyperforce teams, among others. We organized dedicated sessions that brought together all the necessary stakeholders to engage in effective problem-solving. Our strategy was both technical and process-oriented. We immersed ourselves in the Hyperforce documentation and interacted with its various components for the first time. This iterative process of identifying and addressing gaps played a crucial role in achieving a successful migration.
How did your team ensure data security and integrity during the migration from EC2 to Kubernetes?
During the transition from EC2 to Kubernetes for our Trino setup, ensuring the security and integrity of data was paramount. To achieve this, we implemented the Istio service mesh across Trino clusters to secure pod-to-pod communication. Additionally, we enforced role-based access controls so that only authorized personnel could reach sensitive data and resources.
To further strengthen data security, we utilized industry-standard encryption protocols. This ensured that data was encrypted both during transit and while at rest. Moreover, to maintain a robust security posture, we regularly conducted security audits and vulnerability assessments. These proactive measures were instrumental in identifying and addressing potential security threats, thus ensuring the protection of data throughout the migration process.
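The role-based access control idea mentioned above can be sketched in a few lines of Python. This is a generic, default-deny illustration and not the team's actual policy engine: the class, the role name, the subject names, and the resource names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# A permission is a (resource, verb) pair, e.g. ("trino-catalog", "read").
Permission = Tuple[str, str]


@dataclass
class AccessPolicy:
    """Toy RBAC model: subjects get roles, roles get permissions."""

    role_permissions: Dict[str, Set[Permission]] = field(default_factory=dict)
    subject_roles: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, role: str, resource: str, verb: str) -> None:
        """Attach a permission to a role."""
        self.role_permissions.setdefault(role, set()).add((resource, verb))

    def bind(self, subject: str, role: str) -> None:
        """Assign a role to a subject (a user or a service account)."""
        self.subject_roles.setdefault(subject, set()).add(role)

    def is_allowed(self, subject: str, resource: str, verb: str) -> bool:
        """Default deny: allow only if some bound role grants the permission."""
        wanted = (resource, verb)
        return any(
            wanted in self.role_permissions.get(role, set())
            for role in self.subject_roles.get(subject, set())
        )


# Hypothetical example: only the "query-operator" role may read the catalog.
policy = AccessPolicy()
policy.grant("query-operator", "trino-catalog", "read")
policy.bind("alice", "query-operator")
```

Because permissions are only ever added through roles, anything not explicitly granted is refused, which is what limits access to authorized personnel.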
What were the initial challenges you encountered when moving workloads from EC2 to Kubernetes?
Transitioning workloads from EC2 to Kubernetes posed several challenges, primarily the need to master the new Kubernetes environment with limited prior experience. To address this, we initiated an extensive upskilling program that included training sessions and workshops to enhance our team’s proficiency. Adapting to the complexities of the Hyperforce infrastructure was also crucial, requiring significant adjustments and thorough testing to ensure workload compatibility.
Debugging emerged as a particular challenge due to our initial unfamiliarity with Hyperforce. We responded by developing detailed documentation to define team responsibilities and streamline the debugging process. Regular weekly sessions for debugging and learning helped us address issues proactively, with persistent problems escalated to higher management for quick resolution. This structured approach was key to successfully managing the migration complexities and achieving a smooth transition to Kubernetes.
Diagram representing Trino on Kubernetes.
Which specific in-house innovations were developed to tackle the difficulties of the migration?
Our team developed several key in-house innovations:
Kubernetes Resource Controller (KRC) Framework: This framework was created to automate the dynamic deployment of resources based on demand, significantly improving the ability to manage workloads and optimize resource utilization.
CI/CD Pipelines: Continuous integration and continuous deployment pipelines were implemented to automate the deployment process, reducing manual intervention and minimizing errors, which streamlined workflows and enhanced operational efficiency.
Enhanced Observability: Metrics were improved to achieve better observability across systems.
Patching Logic: To meet security compliance, an in-house patching design was developed that moves applications seamlessly onto EC2 instances running the latest OS.
SSD Disk Integration: SSD disks attached to nodes were made accessible to pods, enhancing storage capabilities.
Resolution of PTT and PTE Issues: Issues related to pod terminations and connection pool refreshes were addressed by tuning connection threads and timeouts at the mesh level, and configuring pods to shut down gracefully while maintaining availability for ongoing operations.
Auto-scaling Mechanisms: Auto-scaling was introduced to allow the infrastructure to automatically adjust to traffic loads, ensuring optimal performance and cost efficiency.
These innovations cut compute costs by 54% and accelerated the migration timeline: the migration was completed in six months for canary environments and nine months for all environments.
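The graceful-shutdown behavior described in the list above (stop taking new work, let ongoing operations finish, then exit) can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the team's implementation: the `DrainableWorker` class and its method names are hypothetical, and in a real pod the drain would be triggered by the termination signal and bounded by the pod's `terminationGracePeriodSeconds`.

```python
import threading


class DrainableWorker:
    """Toy worker that tracks in-flight queries and can drain before exit."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # The condition shares the lock, so state changes and waits are atomic.
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._accepting = True

    def try_start_query(self) -> bool:
        """Admit a query unless the worker has started draining."""
        with self._lock:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def finish_query(self) -> None:
        """Mark one query as done and wake the drainer if nothing is left."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def drain(self, grace_seconds: float) -> bool:
        """Stop admitting work, then wait up to grace_seconds for running queries.

        Returns True if all in-flight queries finished within the grace period.
        """
        with self._idle:
            self._accepting = False
            return self._idle.wait_for(
                lambda: self._in_flight == 0, timeout=grace_seconds
            )
```

A caller would invoke `drain()` when the pod is asked to terminate and exit only after it returns, so running queries keep their connections while new ones are routed elsewhere.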
CTS dashboards showing week-by-week cost reduction after migration.
CTS dashboards showing month-by-month cost reduction after migration.
What strategies did your team use to optimize resource utilization in the new Kubernetes environment, and how did these strategies improve overall system performance?
The team implemented several strategies:
Horizontal Pod Autoscaling (HPA): This feature automatically adjusted the number of pod replicas based on CPU utilization, allowing the team to efficiently manage varying loads without over-provisioning resources.
Instance Type Adjustment: By carefully selecting instance types, the team ensured an optimal balance between CPU and memory usage according to workload requirements, maximizing performance and minimizing costs.
Resource Requests and Limits: Setting up resource requests and limits for each pod enabled Kubernetes to make informed decisions about scheduling and resource allocation, improving operational efficiency.
Kubernetes Cluster Autoscaler: This tool dynamically adjusted the size of the clusters based on demand, enabling the infrastructure to scale seamlessly with workload changes.
Together, these strategies improved the system’s performance and cost efficiency by matching resource use to demand and letting the infrastructure adapt as workloads changed.
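Two of the mechanisms above lend themselves to a short worked example. The Python sketch below is a simplified rendering of the replica formula given in the Kubernetes documentation for Horizontal Pod Autoscaling, plus a basic check of how resource requests feed scheduling decisions. It is not the team's configuration: the 10% tolerance is the Kubernetes default, while the utilization targets, replica bounds, and node sizes used in the checks are illustrative assumptions.

```python
import math
from typing import Iterable, Tuple

# (cpu_millicores, memory_mib) requested by a pod or allocatable on a node.
Resources = Tuple[int, int]


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """desired = ceil(current_replicas * current_metric / target_metric).

    No change is made while the ratio stays within the tolerance band, and the
    result is clamped to the configured replica bounds.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def fits_on_node(
    pod: Resources, allocatable: Resources, scheduled: Iterable[Resources]
) -> bool:
    """A pod fits if requests already on the node plus its own stay within
    the node's allocatable CPU and memory. Limits play no part here."""
    used_cpu = sum(cpu for cpu, _ in scheduled)
    used_mem = sum(mem for _, mem in scheduled)
    return (
        used_cpu + pod[0] <= allocatable[0]
        and used_mem + pod[1] <= allocatable[1]
    )
```

For example, with 4 replicas running at 90% CPU against a 60% target, the formula asks for 6 replicas. When the new pods no longer fit on existing nodes by the check above, they stay pending, and that is the signal the Cluster Autoscaler acts on to add nodes.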
How did your team manage resources effectively to ensure the migration project stayed on track?
The team employed several effective methods:
Careful Prioritization and Strategic Task Allocation: By continuously assessing and prioritizing critical tasks in collaboration with stakeholders, the team focused efforts on high-impact areas, enhancing overall efficiency.
Collaborative Team Approach: A unified mission was fostered across the Compute, Query, Service Delivery, and Hyperforce teams. This collaboration ensured alignment with common goals and facilitated seamless teamwork.
Enhanced Global Communication: Regular communication channels, including virtual meetings and collaboration tools, were utilized to improve collaboration across different geographical locations. This global teamwork was crucial for maintaining momentum and addressing challenges promptly.
Support from Data Cloud Teams: Substantial support from various Data Cloud teams was instrumental in the project’s success. Their expertise and assistance were invaluable in overcoming obstacles and ensuring access to necessary resources and knowledge.
Regular Check-ins and Feedback Sessions: These sessions were crucial for monitoring progress and addressing issues quickly, ensuring that potential setbacks were managed swiftly.
Learn More
Read this blog to learn how Data Cloud’s team is scaling massive data volumes and slashing performance bottlenecks.
The post Data Cloud’s Lightning-Fast Migration: From Amazon EC2 to Kubernetes in 6 Months appeared first on Salesforce Engineering Blog.