{"id":900,"date":"2024-07-22T22:19:01","date_gmt":"2024-07-22T22:19:01","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/07\/22\/data-clouds-lightning-fast-migration-from-amazon-ec2-to-kubernetes-in-6-months\/"},"modified":"2024-07-22T22:19:01","modified_gmt":"2024-07-22T22:19:01","slug":"data-clouds-lightning-fast-migration-from-amazon-ec2-to-kubernetes-in-6-months","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/07\/22\/data-clouds-lightning-fast-migration-from-amazon-ec2-to-kubernetes-in-6-months\/","title":{"rendered":"Data Cloud\u2019s Lightning-Fast Migration: From Amazon EC2 to Kubernetes in 6 Months"},"content":{"rendered":"<p>In our \u201cEngineering Energizers\u201d Q&amp;A series, we delve into the journeys of distinguished engineering leaders. Today, we feature Archana Kumari, Director of Software Engineering at Salesforce.<\/p>\n<p>Archana leads our India-based <a href=\"https:\/\/www.salesforce.com\/data\/\">Data Cloud <\/a>Compute Layer team, which played a pivotal role in a recent transition from <a href=\"https:\/\/aws.amazon.com\/pm\/ec2\/?gclid=Cj0KCQjw1qO0BhDwARIsANfnkv83huFe-VL8_g__B4O6ZdYJUGkhlHAYXuj0Z5QZUz7c2eHWE7-MSTUaAhT6EALw_wcB&amp;trk=36c6da98-7b20-48fa-8225-4784bced9843&amp;sc_channel=ps&amp;ef_id=Cj0KCQjw1qO0BhDwARIsANfnkv83huFe-VL8_g__B4O6ZdYJUGkhlHAYXuj0Z5QZUz7c2eHWE7-MSTUaAhT6EALw_wcB:G:s&amp;s_kwcid=AL!4422!3!467723097970!e!!g!!amazon%20ec2!11198711716!118263955828\">Amazon EC2<\/a> to <a href=\"https:\/\/kubernetes.io\/\">Kubernetes<\/a> for <a href=\"https:\/\/trino.io\/\">Trino<\/a> workloads. 
This shift not only enhanced performance and scalability but also reduced operational overhead, improved cost-efficiency, and accelerated the time to market for Data Cloud.<\/p>\n<p>Discover the strategies Archana\u2019s team used to tackle complex challenges, seamlessly integrate <a href=\"https:\/\/www.salesforce.com\/platform\/public-cloud-infrastructure\/\">Hyperforce<\/a>, and develop proprietary innovations that facilitated the transition of workloads to Kubernetes.<\/p>\n<p><em>Meet Archana (left) and her Data Cloud Compute Layer team.<\/em><\/p>\n<p><strong><\/strong><br \/><strong>What is your team\u2019s mission?<\/strong><\/p>\n<p>Our team\u2019s mission is to build and sustain a world-class compute fabric that empowers Data Cloud teams and other Salesforce big data teams to unleash their creativity, drive innovation, and deliver exceptional solutions to our customers.<\/p>\n<p>We developed a platform \u2014 KRC (Kubernetes Resource Controller) \u2014 from the ground up that fully utilizes Kubernetes\u2019 capabilities and integrates natively with the Salesforce ecosystem.<\/p>\n<p>Our primary goal was to transition the Trino architecture to Kubernetes to enhance flexibility and efficiency for dedicated compute workloads. This shift enables quicker and easier deployment of new Trino instances, allowing for the use of the latest or any required version of Trino without restrictions. 
It also incorporates auto-scaling with Horizontal Pod Autoscaler and Cluster Autoscaler, which simplifies the design of dedicated compute features and reduces scale-out time to mere seconds.<\/p>\n<p>By adopting Kubernetes, we have created a robust, scalable, and agile infrastructure that supports innovation, optimal performance, and dynamic adaptation to varying traffic loads.<\/p>\n<p><strong>What are significant technical problem statements\/challenges your team faced during the migration process?<\/strong><\/p>\n<p>One of the major hurdles we encountered was dealing with the complexities of <a href=\"https:\/\/engineering.salesforce.com\/hyperforces-framework-for-enhancing-developer-workflow-inside-the-7-pillars-of-agile-development\/\">Hyperforce<\/a>, Salesforce\u2019s proprietary layer on top of AWS. The architecture of Hyperforce is intricate, demanding a thorough understanding from our team to facilitate a seamless transition. The integration process was particularly challenging because Hyperforce modifies the standard AWS environment, rendering typical solutions and configurations inadequate.<\/p>\n<p>To tackle this issue, we engaged in extensive collaboration with multiple teams within Salesforce, including Data Cloud\u2019s Service Delivery Team, the Query Team, and <a href=\"https:\/\/engineering.salesforce.com\/hyperforce-behind-the-scenes-ushering-in-a-new-age-of-ai-driven-cloud-scalability\/\">Hyperforce<\/a> teams, among others. We organized dedicated sessions that brought together all the necessary stakeholders to engage in effective problem-solving. Our strategy was both technical and process-oriented. We immersed ourselves in the Hyperforce documentation and interacted with its various components for the first time. 
This iterative process of identifying and addressing gaps played a crucial role in achieving a successful migration.<\/p>\n<p><strong>How did your team ensure data security and integrity during the migration from EC2 to Kubernetes?<\/strong><\/p>\n<p>During the transition from EC2 to Kubernetes for our Trino setup, ensuring the security and integrity of data was paramount. To achieve this, we implemented the Istio service mesh across Trino clusters to secure pod-to-pod communication. Additionally, we enforced role-based access controls to limit access to sensitive data and resources strictly to authorized personnel only.<\/p>\n<p>To further strengthen data security, we utilized industry-standard encryption protocols. This ensured that data was encrypted both during transit and while at rest. Moreover, to maintain a robust security posture, we regularly conducted security audits and vulnerability assessments. These proactive measures were instrumental in identifying and addressing potential security threats, thus ensuring the protection of data throughout the migration process.<\/p>\n<p><strong>What were the initial challenges you encountered when moving workloads from EC2 to Kubernetes?<\/strong><\/p>\n<p>Transitioning workloads from EC2 to Kubernetes posed several challenges, primarily the need to master the new Kubernetes environment with limited prior experience. To address this, we initiated an extensive upskilling program that included training sessions and workshops to enhance our team\u2019s proficiency. Adapting to the complexities of the Hyperforce infrastructure was also crucial, requiring significant adjustments and thorough testing to ensure workload compatibility.<\/p>\n<p>Debugging emerged as a particular challenge due to our initial unfamiliarity with <a href=\"https:\/\/engineering.salesforce.com\/unlocking-hyperforce-migration-innovative-solutions-for-a-smooth-transition-to-the-cloud\/\">Hyperforce<\/a>. 
We responded by developing detailed documentation to define team responsibilities and streamline the debugging process. Regular weekly sessions for debugging and learning helped us address issues proactively, with persistent problems escalated to higher management for quick resolution. This structured approach was key to successfully managing the migration complexities and achieving a smooth transition to Kubernetes.<\/p>\n<p><em>Diagram representing Trino on Kubernetes.<\/em><\/p>\n<p><strong>Which specific in-house innovations were developed to tackle the difficulties of the migration?<\/strong><\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-core-group-is-layout-1 wp-block-group-is-layout-constrained\">\n<p>Our team developed numerous key in-house innovations:<\/p>\n<p><strong>Kubernetes Resource Controller (KRC) Framework<\/strong>: This framework was created to automate the dynamic deployment of resources based on demand, significantly improving the ability to manage workloads and optimize resource utilization.<\/p>\n<p><strong>CI\/CD Pipelines<\/strong>: Continuous integration and continuous deployment pipelines were implemented to automate the deployment process, reducing manual intervention and minimizing errors, which streamlined workflows and enhanced operational efficiency.<\/p>\n<p><strong>Enhanced Observability<\/strong>: Metrics were improved to achieve better observability across systems.<\/p>\n<p><strong>Patching Logic:<\/strong> To meet security compliance, an in-house patching design was developed that facilitates seamless transitions of applications to the latest OS-based EC2 instances.<\/p>\n<p><strong>SSD Disk Integration<\/strong>: SSD disks attached to nodes were made accessible to pods, enhancing storage capabilities.<\/p>\n<p><strong>Resolution of PTT and PTE Issues<\/strong>: Issues related to pod terminations and connection pool refreshes were addressed by tuning connection threads and timeouts at the mesh level, and 
configuring pods to shut down gracefully while maintaining availability for ongoing operations.<\/p>\n<p><strong>Auto-scaling Mechanisms<\/strong>: Auto-scaling was introduced to allow the infrastructure to automatically adjust to traffic loads, ensuring optimal performance and cost efficiency.<\/p>\n<\/div>\n<p><strong>These innovations significantly cut compute costs by 54% and accelerated the migration timeline, completing the migration in six months for canary environments and nine months for all environments.<\/strong><\/p>\n<p><em>CTS dashboards showcasing weekly cost reduction after migration.<\/em><\/p>\n<p><em>CTS dashboards showcasing monthly cost reduction after migration.<\/em><\/p>\n<p><strong>What strategies did your team use to optimize resource utilization in the new Kubernetes environment, and how did these strategies improve overall system performance?<\/strong><\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-core-group-is-layout-2 wp-block-group-is-layout-constrained\">\n<p>The team implemented several strategies:<\/p>\n<p><strong>Horizontal Pod Autoscaling (HPA):<\/strong> This feature automatically adjusted the number of pod replicas based on CPU utilization, allowing the team to efficiently manage varying loads without over-provisioning resources.<\/p>\n<p><strong>Instance Type Adjustment<\/strong>: By carefully selecting instance types, the team ensured an optimal balance between CPU and memory usage according to workload requirements, maximizing performance and minimizing costs.<\/p>\n<p><strong>Resource Requests and Limits<\/strong>: Setting up resource requests and limits for each pod enabled Kubernetes to make informed decisions about scheduling and resource allocation, improving operational efficiency.<\/p>\n<p><strong>Kubernetes Cluster Autoscaler<\/strong>: This tool dynamically adjusted the size of the clusters based on demand, enabling the infrastructure to scale seamlessly with workload 
changes.<\/p>\n<\/div>\n<p>These strategies collectively enhanced the system\u2019s performance and cost efficiency by ensuring optimal resource use and adaptable infrastructure to meet changing demands.<\/p>\n<p><strong>How did your team manage resources effectively to ensure the migration project stayed on track?<\/strong><\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-core-group-is-layout-3 wp-block-group-is-layout-constrained\">\n<p>The team employed several effective methods:<\/p>\n<p><strong>Careful Prioritization and Strategic Task Allocation:<\/strong> By continuously assessing and prioritizing critical tasks in collaboration with stakeholders, the team focused efforts on high-impact areas, enhancing overall efficiency.<\/p>\n<p><strong>Collaborative Team Approach<\/strong>: A unified mission was fostered across the Compute, Query, Service Delivery, and Hyperforce teams. This collaboration ensured alignment with common goals and facilitated seamless teamwork.<\/p>\n<p><strong>Enhanced Global Communication<\/strong>: Regular communication channels, including virtual meetings and collaboration tools, were utilized to improve collaboration across different geographical locations. This global teamwork was crucial for maintaining momentum and addressing challenges promptly.<\/p>\n<p><strong>Support from Data Cloud Teams<\/strong>: Substantial support from various Data Cloud teams was instrumental in the project\u2019s success. 
Their expertise and assistance were invaluable in overcoming obstacles and ensuring access to necessary resources and knowledge.<\/p>\n<p><strong>Regular Check-ins and Feedback Sessions<\/strong>: These sessions were crucial for monitoring progress and addressing issues quickly, ensuring that potential setbacks were managed swiftly.<\/p>\n<\/div>\n<div class=\"wp-block-group is-layout-constrained wp-container-core-group-is-layout-4 wp-block-group-is-layout-constrained\">\n<p><strong>Learn More<\/strong><\/p>\n<p>Read <a href=\"https:\/\/engineering.salesforce.com\/unlocking-data-clouds-secret-for-scaling-massive-data-volumes-and-slashing-processing-bottlenecks\/\">this blog<\/a> to learn how Data Cloud\u2019s team is scaling massive data volumes and slashing performance bottlenecks.<\/p>\n<p>Stay connected \u2014 join our <a href=\"https:\/\/flows.beamery.com\/salesforce\/eng-social-2023\">Talent Community<\/a>!<\/p>\n<p>Check out our <a href=\"https:\/\/www.salesforce.com\/company\/careers\/teams\/tech-and-product\/?d=cta-tms-tp-2\">Technology and Product<\/a> teams to learn how you can get involved.<\/p>\n<\/div>\n<p>The post <a href=\"https:\/\/engineering.salesforce.com\/data-clouds-lightning-fast-migration-from-amazon-ec2-to-kubernetes-in-6-months\/\">Data Cloud\u2019s Lightning-Fast Migration: From Amazon EC2 to Kubernetes in 6 Months<\/a> appeared first on <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering Blog<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/data-clouds-lightning-fast-migration-from-amazon-ec2-to-kubernetes-in-6-months\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\" rel=\"noopener\">Read More<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>In our \u201cEngineering Energizers\u201d Q&amp;A series, we delve into the journeys of distinguished engineering leaders. Today, we feature Archana Kumari, Director of Software Engineering at Salesforce. 
Archana leads our India-based Data Cloud Compute Layer team, which played a pivotal role in a recent transition from Amazon EC2 to Kubernetes for Trino workloads. This shift not&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/07\/22\/data-clouds-lightning-fast-migration-from-amazon-ec2-to-kubernetes-in-6-months\/\">Continue reading <span class=\"screen-reader-text\">Data Cloud\u2019s Lightning-Fast Migration: From Amazon EC2 to Kubernetes in 6 Months<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-900","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":268,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/zero-downtime-node-patching-in-a-kubernetes-cluster\/","url_meta":{"origin":900,"position":0},"title":"Zero Downtime Node Patching in a Kubernetes Cluster","date":"August 31, 2021","format":false,"excerpt":"Authors: Vaishnavi Galgali, Arpeet Kale, Robert\u00a0XueIntroductionThe Salesforce Einstein Vision and Language services are deployed in an AWS Elastic Kubernetes Service (EKS) cluster. One of the primary security and compliance requirements is operating system patching. 
The cluster nodes that the services are deployed on need to have regular operating system updates.\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":287,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/optimizing-eks-networking-for-scale\/","url_meta":{"origin":900,"position":1},"title":"Optimizing EKS networking for scale","date":"August 31, 2021","format":false,"excerpt":"Authors: Savithru Lokanath, Arpeet Kale, VaishnavigalgaliElastic Kubernetes Service (EKS) is a service under the Amazon Web Services (AWS) umbrella that provides managed Kubernetes service. It significantly reduces the time to deploy, manage, and scale the infrastructure required to run production-scale Kubernetes clusters. AWS has simplified EKS networking significantly with its\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":692,"url":"https:\/\/fde.cat\/index.php\/2023\/03\/22\/how-is-indias-brilliant-big-data-processing-team-engineering-salesforce-data-cloud\/","url_meta":{"origin":900,"position":2},"title":"How is India\u2019s Brilliant Big Data Processing Team Engineering Salesforce Data Cloud?","date":"March 22, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the life experiences and career paths that have shaped Salesforce engineering leaders. Meet Archana Kumari, one of Salesforce\u2019s first India-based woman engineering leaders. 
In her role, Archana leads Salesforce India\u2019s Data Cloud big data processing compute layer team \u2014 charged with providing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":662,"url":"https:\/\/fde.cat\/index.php\/2022\/12\/14\/how-salesforce-uses-immutable-infrastructure-in-hyperforce\/","url_meta":{"origin":900,"position":3},"title":"How Salesforce uses Immutable Infrastructure in Hyperforce","date":"December 14, 2022","format":false,"excerpt":"Credits go to: Armin Bahramshahry, Software Engineering Principal Architect @ Salesforce\u00a0&\u00a0Shan Appajodu, VP, Software Engineering for Developer Productivity Experiences @ Salesforce. To leverage the scale and agility of the world\u2019s leading public cloud platforms, our Technology and Products team at Salesforce has worked together over the past few years to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":284,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/hadoop-hbase-on-kubernetes-and-public-cloud-part-i\/","url_meta":{"origin":900,"position":4},"title":"Hadoop\/HBase on Kubernetes and Public Cloud (Part I)","date":"August 31, 2021","format":false,"excerpt":"Authors: Dhiraj Hegde, Ashutosh Parekh, and Prashant\u00a0MurthyAt Salesforce, we run a large number of HBase and HDFS clusters in our own data centers. More recently, we have started deploying our clusters on Public Cloud infrastructure to take advantage of the on-demand scalability available there. 
As part of this foray onto\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":454,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/looking-at-the-kubernetes-control-plane-for-multi-tenancy\/","url_meta":{"origin":900,"position":5},"title":"Looking at the Kubernetes Control Plane for Multi-Tenancy","date":"August 31, 2021","format":false,"excerpt":"The Salesforce Platform-as-a-Service Security Assurance team is constantly assessing modern compute platforms for security level and features. We use the insights from these research efforts to provide fast and comprehensive support to engineering teams who explore platform options that adequately support their security requirements. Unsurprisingly, Kubernetes is one of the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/900","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=900"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/900\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=900"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=900"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=900"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}