{"id":229,"date":"2021-02-02T20:02:02","date_gmt":"2021-02-02T20:02:02","guid":{"rendered":"https:\/\/fde.cat\/?p=229"},"modified":"2021-02-02T20:02:03","modified_gmt":"2021-02-02T20:02:03","slug":"ml-lake-building-salesforces-data-platform-for-machine-learning","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/ml-lake-building-salesforces-data-platform-for-machine-learning\/","title":{"rendered":"ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning"},"content":{"rendered":"<p>Salesforce uses machine learning to improve every aspect of its product suite. With the help of <a href=\"https:\/\/www.salesforce.com\/products\/einstein\/overview\/\">Salesforce Einstein<\/a>, companies are improving productivity and accelerating key decision-making.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1000\/1*Vbrgxe6dRSZlcv0mSOewDQ.png?w=750&#038;ssl=1\" alt=\"\" data-recalc-dims=\"1\"><\/figure>\n<p>Data is a critical component of all machine learning applications and Salesforce is no exception. In this post I will share some unique challenges Salesforce has in the realm of data management and how ML Lake addresses these challenges to enable internal teams to build predictive capabilities into all Salesforce products, making every feature in Salesforce smarter and easier to use.<\/p>\n<p>ML Lake enables Salesforce application developers and data scientists to easily build machine learning capabilities on customer and non-customer data. 
It is a <strong>shared service<\/strong> that provides the <strong>right data<\/strong>, optimizes the <strong>right access patterns<\/strong>, and relieves the machine learning application developer of having to manage <strong>data pipelines, storage, security<\/strong>, and <strong>compliance<\/strong>.<\/p>\n<h3>Multitenancy, Integration, Metadata<\/h3>\n<p>Before diving into ML Lake\u2019s internals and architecture, it is important to introduce the functional and non-functional requirements that inspired us to build it. Salesforce is a cloud enterprise company that offers vertical solutions in areas such as <a href=\"https:\/\/www.salesforce.com\/products\/sales-cloud\/overview\/\">Sales<\/a>, <a href=\"https:\/\/www.salesforce.com\/products\/service-cloud\/overview\/\">Service<\/a> and <a href=\"https:\/\/www.salesforce.com\/products\/marketing-cloud\/overview\/\">Marketing<\/a>, as well as general-purpose <a href=\"https:\/\/www.salesforce.com\/products\/salesforce-platform\/\">low-code\/no-code platform<\/a> capabilities. There are thousands of different Salesforce features that our customers leverage and <a href=\"https:\/\/help.salesforce.com\/articleView?id=customize_overview.htm&#038;type=5\">customize<\/a> to the needs of their unique business challenges.<\/p>\n<p>Multitenancy has been described as the \u201c<a href=\"https:\/\/engineering.salesforce.com\/the-magic-of-multitenancy-2daf71d99735\">keystone of the Salesforce architecture<\/a>.\u201d Data at Salesforce is owned by our customers and is segregated to ensure that data from one customer does not mix with data from another. As in most Salesforce systems, machine learning applications require granular access controls to ensure tenant-level data isolation. 
Additionally, Salesforce has grown both organically and through acquisitions, resulting in an architecture composed of varied stacks and databases.<\/p>\n<p>A key reason for building ML Lake is to make it easy for applications to get the data they need, while centralizing the security controls required for <a href=\"https:\/\/trust.salesforce.com\/en\/\">maintaining trust<\/a>. Teams building applications typically underestimate the level of effort required for ETL and integration. The total cost of copying and storing data includes many hidden costs in the areas of compliance, synchronization and security.<\/p>\n<p>Salesforce customers have embraced the extensibility of the platform and have tailored their Salesforce systems to suit their unique business requirements. Every customer\u2019s object model is different, and it is <a href=\"https:\/\/www.wired.com\/story\/inside-salesforces-quest-to-bring-artificial-intelligence-to-everyone\/\">critical for Einstein<\/a> that machine learning applications understand these differences. In order to leverage this metadata and use it in conjunction with highly structured and customized data to build high-quality ML models tailored to each customer, Salesforce has developed and open-sourced <a href=\"https:\/\/engineering.salesforce.com\/open-sourcing-transmogrifai-4e5d0e098da2\">TransmogrifAI<\/a>, an AutoML library specifically tailored to the needs of the enterprise.<\/p>\n<p>ML Lake must scale not only in total data size, but also in the number and variety of datasets it houses. A common industry practice is to carefully maintain and curate a number of key datasets for ML or analytics use cases. At Salesforce\u2019s scale of data and customization, this is impossible. Everything has to be tracked and automated, which ML Lake enables by maintaining extensive metadata. 
This metadata is vital for model training and serving, as well as compliance operations.<\/p>\n<h3>Architecture<\/h3>\n<img decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*1p0jx9s4FqZaa_StFqWk7A.png?w=750&#038;ssl=1\" alt=\"\" data-recalc-dims=\"1\">\n<p>ML Lake is deployed in multiple AWS regions as a shared service for use by internal Salesforce teams and applications running in a variety of stacks, both on public cloud providers and in Salesforce\u2019s own data centers. It exposes a set of OpenAPI-based interfaces running in a <a href=\"https:\/\/spring.io\/projects\/spring-boot\">Spring Boot<\/a>-based Java microservice. It uses Postgres to store application state and metadata. Data for machine learning is stored in S3, in buckets managed and secured by ML&nbsp;Lake.<\/p>\n<h3>Data Lake<\/h3>\n<p>AWS S3 was chosen as the backing store for ML Lake\u2019s data lake for its resiliency, cost-effectiveness, and ease of integration with data processing engines. It houses datasets containing customer data from different parts of Salesforce, non-customer data, such as public datasets containing word embeddings, as well as data generated and uploaded by internal machine learning applications.<\/p>\n<p>A typical flow for a machine learning application interacting with ML Lake is to request metadata for a particular dataset, which contains a pointer to an S3 path housing the data, request a granularly-scoped data access token, and interact with the actual data using the S3 API or S3\u2019s integration into common data tooling like <a href=\"https:\/\/spark.apache.org\/\">Apache Spark<\/a> or <a href=\"https:\/\/pandas.pydata.org\/\">Pandas<\/a>.<\/p>\n<p>The majority of Salesforce data is highly structured, with schemas that are often customized by customers. It is important for ML Lake to support structured datasets well, allow partitioning and filtering of large datasets, and support consistent schema changes and data updates. 
The current de facto standard for table formats is the <a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/Design#Design-MetastoreArchitecture\">Hive Metastore<\/a>, but it addresses only a subset of ML Lake\u2019s needs. The ML Lake team evaluated a number of exciting open source projects in this space and eventually chose <a href=\"https:\/\/iceberg.apache.org\/\">Apache Iceberg<\/a>. Iceberg is the table format for all of ML Lake\u2019s structured datasets.<\/p>\n<h3>Pipelines<\/h3>\n<p>ML Lake uses the term \u201cpipelines\u201d for jobs that move data into and out of it. The pipeline capabilities provided by ML Lake are controlled via a set of APIs exposed to internal applications. The APIs control bi-directional data movement: raw feature data into ML Lake, and predictions and related data back to customer-facing systems in Salesforce, while hiding the complexity behind a simple facade.<\/p>\n<p>The pipelines service centralizes the management of data movement jobs and handles common concerns like retries, error handling and reporting. Pipeline jobs are implemented in <a href=\"https:\/\/www.scala-lang.org\/\">Scala<\/a> and <a href=\"https:\/\/spark.apache.org\/\">Spark<\/a> running on EMR clusters and utilize custom connectors to various parts of Salesforce, coupled with an intra-Salesforce integration auth mechanism that complies with the strict rules of granular and explicit authorization mandated by the Product Security organization.<\/p>\n<p>ML Lake automatically provides GDPR compliance for data stored inside it. Compliance-related pipeline jobs continuously ingest GDPR signals, such as record deletions and <a href=\"https:\/\/help.salesforce.com\/articleView?id=einstein_sales_data_policy_excluded_from_predictions.htm&#038;type=5\">do-not-profile flags<\/a>, and periodically remove that data from ML Lake. 
Jobs that ingest these signals, as well as jobs that process data deletions, are also implemented in Scala\/Spark.<\/p>\n<h3>Data Catalog<\/h3>\n<p>A data lake is a common industry approach to storing large volumes of disparate data and allowing varied types of access to it without over-optimizing for specific access patterns up&nbsp;front.<\/p>\n<blockquote><p><em>Read about our <a href=\"https:\/\/engineering.salesforce.com\/engagement-activity-delta-lake-2e9b074a94af\">engagement activities data lake<\/a> built with Delta&nbsp;Lake.<\/em><\/p>\n<\/blockquote>\n<p>While data lakes allow flexibility and freedom, an ungoverned data lake often becomes a \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_lake#Criticism\">data graveyard<\/a>.\u201d For ML Lake, it is important to track and catalog the data stored inside for the following reasons:<\/p>\n<ul>\n<li><strong>Compliance<\/strong>\u2014\u200aML Lake stores sensitive customer data. Each dataset is annotated with the customer it belongs to, the date it was ingested, its lineage, the metadata needed for automatic GDPR processing, its TTL, and many more attributes that are key to keeping data organized and compliant.<\/li>\n<li><strong>Model quality and explainability<\/strong>\u2014\u200aA flexible metadata model at both the dataset and field level enables datasets to be annotated with lineage information not present in the data or data schema itself. Examples of this metadata include information about which Salesforce object a dataset represents and whether a string field originated as a Text or Email field in that object. Knowing that a field is an email address and not a simple text field carries additional information that is useful for improving model quality. 
This information is also used to provide record insights to customers.<\/li>\n<\/ul>\n<p>ML Lake has built a custom data catalog service to store and surface this granular metadata to both ML applications using ML Lake and to internal pipelines. Catalog capabilities are integrated with both pipelines and data lake operations into a single set of APIs for machine learning applications to leverage.<\/p>\n<h3>Next Steps<\/h3>\n<p>What are we working on now? We are hardly done on the journey of providing the best-of-breed data platform for machine learning. Here are some highlights of what we are working on&nbsp;now:<\/p>\n<ul>\n<li><strong>Feature Store<\/strong>\u2014\u200aExciting innovation is happening in the industry in the area of <a href=\"https:\/\/www.featurestore.org\/\">feature stores<\/a>. Feature stores enable the sharing and discoverability of machine learning features across different applications. Online feature stores are designed to support low-latency access to features for time-sensitive inference operations. While this is an emerging area in the industry, there are already both <a href=\"https:\/\/www.hopsworks.ai\/\">commercial<\/a> and <a href=\"https:\/\/feast.dev\/\">open-source<\/a> offerings at various stages of maturity.<\/li>\n<li><strong>Transformation Service<\/strong>\u2014\u200aThe goal of ML Lake is to simplify the data needs of machine learning applications at Salesforce. While for many applications having the full power of a data engine like Spark is required, for others, a more streamlined, declarative approach would suffice. 
ML Lake is adding declarative transformation capabilities to its suite of APIs to simplify ML applications further, accelerating their time to&nbsp;market.<\/li>\n<li><strong>Streaming<\/strong>\u2014\u200aLeveraging Salesforce\u2019s advancements in <a href=\"https:\/\/engineering.salesforce.com\/how-apache-kafka-inspired-our-platform-events-architecture-2f351fe4cf63\">asynchronous integration capabilities<\/a>, ML Lake is using streaming for data movement to and from ML Lake. While lowering latencies was the original driver for this shift, some internal studies have shown that switching to streaming reduces overall compute costs by &gt;70%. ML Lake is working on switching its pipelines to use streaming where possible.<\/li>\n<\/ul>\n<h3>Success Stories<\/h3>\n<p>ML Lake has been serving production traffic for over a year. Some applications that rely on ML Lake are highlighted below:<\/p>\n<ul>\n<li><a href=\"https:\/\/help.salesforce.com\/articleView?id=einstein_article_recommendations_introduction.htm&#038;type=5\">Einstein Article Recommendations<\/a>\u2014\u200aautomatically recommends knowledge articles to customers, saving service agents\u2019 time and increasing customer satisfaction using data from the customer\u2019s past&nbsp;cases.<\/li>\n<li><a href=\"https:\/\/releasenotes.docs.salesforce.com\/en-us\/summer20\/release-notes\/rn_einstein_service_reply_recommendations.htm\">Einstein Reply Recommendations<\/a>\u2014\u200aintegrates with Salesforce\u2019s chatbot product to automate agent responses streamlining user experience and saving agents\u2019&nbsp;time.<\/li>\n<li><a href=\"https:\/\/help.salesforce.com\/articleView?id=release-notes.rn_einstein_service_case_wrapup.htm&#038;type=5&#038;release=226\">Einstein Case Wrap-Up<\/a>\u2014\u200ahelps support agents wrap up cases faster with on-demand recommendations that are based on chat data and closed case field&nbsp;values.<\/li>\n<li><a 
href=\"https:\/\/help.salesforce.com\/articleView?id=custom_ai_prediction_builder.htm&#038;type=5\">Einstein Prediction Builder<\/a>\u2014\u200aallows admins to build predictive models on their data without having to write a line of code. Predictions can be defined on any Salesforce object.<\/li>\n<li>And many&nbsp;more!<\/li>\n<\/ul>\n<p>By leveraging the core tenets of Salesforce architecture of <a href=\"https:\/\/engineering.salesforce.com\/metadata-software-the-way-you-want-it-2367b179558d\">metadata<\/a>, <a href=\"https:\/\/engineering.salesforce.com\/how-apache-kafka-inspired-our-platform-events-architecture-2f351fe4cf63\">integration<\/a> and <a href=\"https:\/\/engineering.salesforce.com\/the-magic-of-multitenancy-2daf71d99735\">multitenancy<\/a>, and combining them with modern data tooling, we have built a data platform to serve the needs of machine learning at Salesforce, while addressing Salesforce\u2019s unique scale and trust concerns. We hope you found the post interesting, and we will be talking more about unique challenges and solutions for machine learning at Salesforce in the future. 
Stay&nbsp;tuned.<\/p>\n<h3>Acknowledgements<\/h3>\n<ul>\n<li>ML Lake team is: Alex Araujo, Eric Shahkarami, Felipe Olivera, Hanifi Gunes, Indranil Dey, Joshua Sauter, Mamallan Uthaman, Ralf Schundelmeir, Sonia Wu, Thomas Gerber, Tom Chan, Xuehai&nbsp;Bian.<\/li>\n<li>Many thanks to Leah McGuire and Jan Fernando for early feedback.<\/li>\n<\/ul>\n<hr>\n<p><a href=\"https:\/\/engineering.salesforce.com\/ml-lake-building-salesforces-data-platform-for-machine-learning-228c30e21f16\">ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Salesforce uses machine learning to improve every aspect of its product suite. With the help of Salesforce Einstein, companies are improving productivity and accelerating key decision-making. Data is a critical component of all machine learning applications and Salesforce is no exception. 
In this post I will share some unique challenges Salesforce has in the realm&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/02\/02\/ml-lake-building-salesforces-data-platform-for-machine-learning\/\">Continue reading <span class=\"screen-reader-text\">ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-229","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":692,"url":"https:\/\/fde.cat\/index.php\/2023\/03\/22\/how-is-indias-brilliant-big-data-processing-team-engineering-salesforce-data-cloud\/","url_meta":{"origin":229,"position":0},"title":"How is India\u2019s Brilliant Big Data Processing Team Engineering Salesforce Data Cloud?","date":"March 22, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the life experiences and career paths that have shaped Salesforce engineering leaders. Meet Archana Kumari, one of Salesforce\u2019s first India-based woman engineering leaders. 
In her role, Archana leads Salesforce India\u2019s Data Cloud big data processing compute layer team \u2014 charged with providing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":317,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/a-deep-dive-on-text-classification-at-salesforce\/","url_meta":{"origin":229,"position":1},"title":"A Deep Dive on Text Classification at Salesforce","date":"August 31, 2021","format":false,"excerpt":"published on Towards Data\u00a0SciencePutting from a Sand Trap (Image by\u00a0Author)We\u2019re excited to announce that Noah Burbank, a Principal Data Scientist in Sales Cloud, has recently published a deep dive into text classification at Salesforce on Towards Data Science. The article, How to choose the right model for text classification in\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":700,"url":"https:\/\/fde.cat\/index.php\/2023\/04\/11\/3-ways-salesforce-takes-ai-research-to-the-next-level\/","url_meta":{"origin":229,"position":2},"title":"3 Ways Salesforce Takes AI Research to the Next Level","date":"April 11, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the life experiences and career paths that have shaped Salesforce engineering leaders. Meet Shelby Heinecke, a research manager for the Salesforce AI team. 
Shelby leads her diverse team on a variety of projects, ranging from identity resolution to recommendation systems to conversational\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":770,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/10\/revealing-the-newest-data-science-tool-speeding-ai-development-and-securing-customer-data\/","url_meta":{"origin":229,"position":3},"title":"Revealing the Newest Data Science Tool: Speeding AI Development and Securing Customer Data","date":"October 10, 2023","format":false,"excerpt":"by Chi Wang and Scott Nyberg In today\u2019s data-powered world, leveraging customer data to improve AI capabilities remains key for providing highly personalized consumer experiences. In fact, 43% of customers believe AI has improved their lives, with 54% willing to provide their anonymized data to improve AI-related products. However, more\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":848,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/01\/unveiling-the-cutting-edge-features-of-ml-console-for-ai-model-lifecycle-management\/","url_meta":{"origin":229,"position":4},"title":"Unveiling the Cutting-Edge Features of ML Console for AI Model Lifecycle Management","date":"April 1, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we explore the journeys of engineering leaders who have made remarkable contributions in their fields. Today, we meet Venkat Krishnamani, a Lead Member of the Technical Staff for Salesforce Engineering and the lead engineer for Salesforce Einstein\u2019s Machine Learning (ML) Console. 
This vital tool\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":694,"url":"https:\/\/fde.cat\/index.php\/2023\/03\/23\/big-data-processing-driving-data-migration-for-salesforce-data-cloud\/","url_meta":{"origin":229,"position":5},"title":"Big Data Processing: Driving Data Migration  for Salesforce Data Cloud","date":"March 23, 2023","format":false,"excerpt":"The tsunami of data \u2014 set to exceed 180 zettabytes by 2025 \u2014 places significant pressure on companies. Simply having access to customer information is not enough \u2014 companies must also analyze and refine the data to find actionable pieces that power new business. As businesses collect these volumes of\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/229","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=229"}],"version-history":[{"count":2,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/229\/revisions"}],"predecessor-version":[{"id":237,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/229\/revisions\/237"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=229"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=229"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=229"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}