{"id":694,"date":"2023-03-23T14:25:28","date_gmt":"2023-03-23T14:25:28","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2023\/03\/23\/big-data-processing-driving-data-migration-for-salesforce-data-cloud\/"},"modified":"2023-03-23T14:25:28","modified_gmt":"2023-03-23T14:25:28","slug":"big-data-processing-driving-data-migration-for-salesforce-data-cloud","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2023\/03\/23\/big-data-processing-driving-data-migration-for-salesforce-data-cloud\/","title":{"rendered":"Big Data Processing: Driving Data Migration  for Salesforce Data Cloud"},"content":{"rendered":"<p>The tsunami of data \u2014 set to <a href=\"https:\/\/www.statista.com\/statistics\/871513\/worldwide-data-created\/\">exceed <\/a><a href=\"https:\/\/www.statista.com\/statistics\/871513\/worldwide-data-created\/\" target=\"_blank\" rel=\"noopener\">180 zettabyte<\/a><a href=\"https:\/\/www.statista.com\/statistics\/871513\/worldwide-data-created\/\">s by 2025<\/a> \u2014 places significant pressure on companies. Simply having access to customer information is not enough \u2014 companies must also analyze and refine the data to find actionable pieces that power new business.<\/p>\n<p>As businesses collect these volumes of customer data, they rely on <a href=\"https:\/\/www.salesforce.com\/in\/blog\/2022\/10\/what-is-salesforce-genie.html\" target=\"_blank\" rel=\"noopener\">Salesforce Data Cloud<\/a> \u2014 a single source of truth for harmonizing, storing, unifying, and nurturing this information as it dynamically evolves over time. Consequently, businesses understand their customers\u2019 needs better than ever \u2014 powering enhanced customer experiences.<\/p>\n<p>Looking under the hood of Data Cloud reveals a couple key layers. The platform\u2019s bottom-most layer, the storage layer, consists of a data lake that ingests petabytes of customer data. 
Above that lies a compute plane layer, which processes, massages, and interprets the big data \u2014 enabling the information to be segmented and activated.<\/p>\n<p>Essentially the engine behind Data Cloud, this tier-zero fabric layer\u2019s journey began with a complex data migration orchestration to Data Cloud\u2019s data lake from Datorama \u2014 Salesforce\u2019s cloud-based marketing intelligence platform \u2014 leading to the creation of Data Cloud\u2019s big data processing compute layer team in India.<\/p>\n<p>This was a massive task that involved multiple teams. Who managed the migration orchestration effort? Say hello to Data Cloud\u2019s compute layer team. The migration initially challenged them because they had no experience in big data processing. How did they successfully orchestrate the migration in mere months and then immediately launch their tier-zero layer to process petabytes of customer data?<\/p>\n<p><em>Salesforce Data Cloud\u2019s India-based compute layer team.<\/em><\/p>\n<p>Read on to learn how the team overcame the odds\u2026<\/p>\n<p><strong>How did the team migrate terabytes of data from Salesforce Datorama to Data Cloud\u2019s data lake?<\/strong><\/p>\n<p>The Data Cloud data lake project kicked off in 2020, when the Data Cloud storage team built Data Cloud\u2019s data lake on top of <a href=\"https:\/\/iceberg.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Iceberg<\/a>. Shortly thereafter, the 10-person, India-based compute layer team faced its biggest test: Develop a workflow orchestration to migrate existing customers from Datorama to the data lake. However, the compute layer team was new. How new? Its members had just joined Salesforce, had no big data background, and had no time for onboarding. 
So, shortly after the team joined Salesforce, engineering management directed it to solve this problem, which involved 20+ teams altogether and a steep learning curve.<\/p>\n<p>The compute layer team\u2019s mission proved daunting: Develop the migration orchestration in three months, test it in production, and execute the final migration just three months later.<\/p>\n<p>To migrate the data from Datorama to Data Cloud\u2019s data lake, the team used <a href=\"https:\/\/airflow.apache.org\/\" target=\"_blank\" rel=\"noopener\">Airflow<\/a>, an open-source tool for creating, scheduling, and tracking batch-oriented big data processing workflows. However, given this gargantuan migration task, the existing Airflow service could not deliver an acceptable SLA, behaved inconsistently, and failed to scale.<\/p>\n<p>Diving deeper, Airflow\u2019s \u201cScheduler\u201d component was used to schedule when tasks were run. However, tasks began to slip as the scheduler occasionally experienced delays of up to several minutes, which, in turn, extended the migration to several hours.<\/p>\n<p>Pondering its dilemma, the team examined two options. They could optimize Airflow; however, the performance numbers shared by the Airflow community did not match the team\u2019s initial observations of the platform. Alternatively, the team could leverage a workflow orchestration service from AWS or Azure; however, onboarding a new offering to Salesforce\u2019s <a href=\"https:\/\/www.salesforce.com\/products\/platform\/hyperforce\/?_gl=1*sficq8*_ga*MTY1MjE3MzQ3MS4xNjY3MjQzMDYz*_ga_HPRCE01J19*MTY3OTA2NDc1Ny45NS4xLjE2NzkwNjQ3NjIuMC4wLjA.\" target=\"_blank\" rel=\"noopener\">Hyperforce<\/a> \u2014 a next-gen infrastructure platform that uses public cloud to rapidly deliver Salesforce software to global customers \u2014 might have required months. 
Ultimately, the migration deadline was firm, so the team had no time to explore an alternative migration tool and instead focused on optimizing Airflow to accomplish its goal.<\/p>\n<p>How did the team perform the optimization? First, they created a panel of sub-teams tasked with identifying performance gaps within Airflow. Second, they tweaked its configuration \u2014 effectively solving the scalability and consistency issues while bringing the SLA within acceptable limits.<\/p>\n<p><em>Airflow Scheduler CPU availability and memory utilization before performance tuning.<\/em><\/p>\n<p><em>Airflow Scheduler CPU availability and memory utilization after performance tuning.<\/em><\/p>\n<p><em>Airflow Scheduler latency improvements before (red) and after (white) performance tuning.<\/em><\/p>\n<p>Third, teams began writing directed acyclic graphs (DAGs) in Airflow in parallel. Following an iterative approach enabled them to discover, copy, validate, roll out, and mark the migration status. This allowed the team to scale its efforts, executing 50+ tasks within a single DAG to migrate a tenant to Data Cloud\u2019s data lake.<\/p>\n<p><em>The step-by-step approach for migrating Datorama tenants to Data Cloud\u2019s data lake.<\/em><\/p>\n<p>Finally, 20+ globally dispersed scrum teams focused on deciphering the data, interpreting the schema, and moving the data from Datorama to Data Cloud\u2019s data lake. Through this collaboration, the compute layer team ensured an optimized workflow to deliver a smooth and seamless migration.<\/p>\n<p><strong>How does the compute layer team perform big data processing for Data Cloud customers?<\/strong><\/p>\n<p>By successfully migrating customer data from Datorama to Data Cloud\u2019s data lake, the compute layer team effectively set the table for launching its tier-zero layer \u2014 paving the path to process customer queries.<\/p>\n<p>Where does the processing begin? 
Data Cloud\u2019s \u201cingestion layer\u201d continuously collects petabytes of customer data at tremendous speeds and deposits it into Data Cloud\u2019s data lake.<\/p>\n<p>How does this vast amount of data get processed? This is where <a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener\">Spark<\/a>, a big data processing tool that massages and harmonizes the data to create something meaningful, steps in. To achieve this, it removes extraneous characteristics of the data and enriches it \u2014 adding more properties as needed.<\/p>\n<p>After the data is unified, customers can run interactive queries, where the compute layer team performs data segmentation with <a href=\"https:\/\/trino.io\/\" target=\"_blank\" rel=\"noopener\">Trino<\/a>, a big data tool designed to efficiently query large volumes of data by leveraging distributed execution. Trino harnesses the power of split logic, which splinters queries into multiple chunks and collates the data \u2014 providing results within a fraction of a second.<\/p>\n<p>What does this look like? When a multi-brand company plans to launch a new product, it can define its target market by determining which of its existing customers within a certain age group have purchased items within a similar price range. This is how segmentation comes into play. Customers will submit a query request to <a href=\"https:\/\/www.salesforce.com\/eu\/products\/genie\/overview\/\">Data Cloud<\/a>. 
That triggers the compute team to use Trino and provision runtime clusters to ensure scaling, resiliency, and availability \u2014 leading to lightning-fast data activation and customer results.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Learn more<\/strong><\/h3>\n<p>Stay connected \u2013 join our <a href=\"https:\/\/careers.mail.salesforce.com\/w2?cid=7017y00000CRDS7AAP\" target=\"_blank\" rel=\"noopener\">Talent Community<\/a>!<\/p>\n<p><a href=\"https:\/\/www.salesforce.com\/company\/careers\/teams\/tech-and-product\/?d=cta-tms-tp-2\" target=\"_blank\" rel=\"noopener\">Check out our Technology and Product teams<\/a> to learn how you can get involved.<\/p>\n<p>For a closer look at Salesforce\u2019s big data processing team, check out this <a href=\"https:\/\/engineering.salesforce.com\/how-is-indias-brilliant-big-data-processing-team-engineering-salesforce-data-cloud\/\" target=\"_blank\" rel=\"noopener\">blog<\/a>.<\/p>\n<p>The post <a href=\"https:\/\/engineering.salesforce.com\/big-data-processing-driving-data-migration-for-salesforce-data-cloud\/\">Big Data Processing: Driving Data Migration  for Salesforce Data Cloud<\/a> appeared first on <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering Blog<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/big-data-processing-driving-data-migration-for-salesforce-data-cloud\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\" rel=\"noopener\">Read More<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>The tsunami of data \u2014 set to exceed 180 zettabytes by 2025 \u2014 places significant pressure on companies. Simply having access to customer information is not enough \u2014 companies must also analyze and refine the data to find actionable pieces that power new business. 
As businesses collect these volumes of customer data, they rely on&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2023\/03\/23\/big-data-processing-driving-data-migration-for-salesforce-data-cloud\/\">Continue reading <span class=\"screen-reader-text\">Big Data Processing: Driving Data Migration  for Salesforce Data Cloud<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-694","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":705,"url":"https:\/\/fde.cat\/index.php\/2023\/04\/18\/ai-based-identity-resolution-the-key-for-linking-diverse-customer-data\/","url_meta":{"origin":694,"position":0},"title":"AI-based Identity Resolution: The Key for Linking Diverse Customer Data","date":"April 18, 2023","format":false,"excerpt":"Companies want a comprehensive view of their customers, enabling them to solve business and marketing challenges, such as personalization, segmentation, and targeting \u2014 but they face an uphill battle as they are drowning in data. For example, many companies cannot match the identity of a customer who visits their website\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":692,"url":"https:\/\/fde.cat\/index.php\/2023\/03\/22\/how-is-indias-brilliant-big-data-processing-team-engineering-salesforce-data-cloud\/","url_meta":{"origin":694,"position":1},"title":"How is India\u2019s Brilliant Big Data Processing Team Engineering Salesforce Data Cloud?","date":"March 22, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the life experiences and career paths that have shaped Salesforce engineering leaders. 
Meet Archana Kumari, one of Salesforce\u2019s first India-based woman engineering leaders. In her role, Archana leads Salesforce India\u2019s Data Cloud big data processing compute layer team \u2014 charged with providing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":828,"url":"https:\/\/fde.cat\/index.php\/2024\/02\/20\/unlocking-hyperforce-migration-innovative-solutions-for-a-smooth-transition-to-the-cloud\/","url_meta":{"origin":694,"position":2},"title":"Unlocking Hyperforce Migration: Innovative Solutions for a Smooth Transition to the Cloud","date":"February 20, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we delve into the experiences and expertise of Salesforce Engineering leaders. Today, we\u2019re meeting Mahamadou Sylla, a Senior Member of the Technical Staff at Salesforce Engineering. Mahamadou is a key member of our Hyperforce\u2019s Bill of Materials (BOM) team, which assists internal teams in\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":724,"url":"https:\/\/fde.cat\/index.php\/2023\/06\/12\/ai-data-crm-chief-engineering-officers-innovative-insights-for-a-competitive-edge\/","url_meta":{"origin":694,"position":3},"title":"AI + Data + CRM: Chief Engineering Officer\u2019s Innovative Insights for a Competitive Edge","date":"June 12, 2023","format":false,"excerpt":"Written by Srini Tallapragada and Dylan DeSimone In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. In this special edition, we meet Srini Tallapragada, President and Chief Engineering Officer for Salesforce. 
In his role, Srini leads the global engineering team to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":896,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/16\/the-unstructured-data-dilemma-how-data-cloud-handles-250-trillion-transactions-weekly\/","url_meta":{"origin":694,"position":4},"title":"The Unstructured Data Dilemma: How Data Cloud Handles 250 Trillion Transactions Weekly","date":"July 16, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we delve into the journeys of engineering leaders who have made notable strides in their areas of expertise. This edition features Adithya Vishwanath, Vice President of Software Engineering at Salesforce. He leads the Data Cloud team, a pivotal platform that integrates diverse data sources,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":283,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/ai-research-to-production-with-einstein-reply-recommendations\/","url_meta":{"origin":694,"position":5},"title":"AI Research to Production with Einstein Reply Recommendations","date":"August 31, 2021","format":false,"excerpt":"We all know that AI is here and it\u2019s quickly changing our lives. However, the impacts of AI are unevenly distributed and it favors those with \u201cmore data,\u201d leaving those with \u201cfew data\u201d behind. 
This runs counter to our Salesforce core values of Customer Success and Equality, so we set\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/694","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=694"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/694\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=694"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=694"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=694"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}