{"id":583,"date":"2022-03-15T08:10:00","date_gmt":"2022-03-15T08:10:00","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/03\/15\/eflow-racing-towards-millions-of-ml-flows-2\/"},"modified":"2022-03-15T08:10:00","modified_gmt":"2022-03-15T08:10:00","slug":"eflow-racing-towards-millions-of-ml-flows-2","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/03\/15\/eflow-racing-towards-millions-of-ml-flows-2\/","title":{"rendered":"EFlow\u200a\u2014\u200aRacing towards millions of ML flows"},"content":{"rendered":"<p><a href=\"https:\/\/www.salesforce.com\/products\/einstein\/overview\/\" target=\"_blank\" rel=\"noopener\">Salesforce Einstein<\/a>\u00a0operates many machine learning applications that cater to a variety of use cases inside Salesforce \u2014 vision and language, but also classic machine learning (ML) approaches using tabular data to classify customer cases, score leads, and more. We also offer many of these ML solutions and applications to our customers. Each one of our ML applications needs to work reliably for potentially any of our business customers, so we\u2019re faced with the challenging problem of scaling to tens or in some cases hundreds of thousands of independent machine learning model lifecycles, with at least one model per customer. Furthermore, we have strong isolation requirements for our customers\u2019 data, meaning the lifecycle of both the data and the models depending on that data needs to be independent to meet compliance and legal requirements. We call this problem the\u00a0<a href=\"https:\/\/engineering.salesforce.com\/l33t-m10y-f04f38127b82?source=friends_link&amp;sk=785c13a7675907d6f0fa41a0a0d48f7c\" target=\"_blank\" rel=\"noopener\"><em>multi-tenancy<\/em>\u00a0scaling problem<\/a>. 
It defines the fundamental scale challenge we face across our stack, first and foremost in our ML operations and infrastructure, for both offline and online environments and services, but also in lower-level model architecture design and choice of machine learning approaches.<\/p>\n<h4>Multi-tenancy and EFlow<\/h4>\n<p>The multi-tenancy problem and the need for independent models, at least one per customer, translates operationally into managing hundreds of thousands of ML flows reliably with minimal disruption and overhead. Over several years we operated and evaluated several of the obvious options and tools prevalent in workflow management, including Airflow, Argo, and Azkaban. While operating them, we learned the hard way that these historically popular solutions, while very rich in ecosystem and integrations and easy to adopt, also have critical gaps in reliability, high availability, and scale, along with core implicit assumptions that made them non-ideal for us, given our multi-tenancy problem and scale numbers. In addition, we have a philosophy of operating a central managed service for all ML flows of all our applications, large or small, rather than having ML application teams provision and maintain self-serve clusters for ML flow management. As a result, we built and have been operating EFlow, our internal flow management system, for several years. In this post, we will share some of the fundamental principles that lend it the scaling properties required to support our numbers.<\/p>\n<h4>EFlow Core Principles and Primitives<\/h4>\n<p>Our system enables composing and executing complex workflows\/<em>DAG<\/em>s with several types of\u00a0<em>steps.\u00a0<\/em>One common type of step allows\u00a0<em>invoking<\/em>\u00a0<em>remote<\/em>\u00a0large-scale computation systems for generation of datasets, models, or evaluations. 
Another typical step type performs\u00a0<em>registration<\/em>\u00a0of computation results in dedicated result storage\u00a0<em>metadata<\/em>\u00a0systems (e.g.\u00a0<em>Model Store<\/em>\u00a0for models,\u00a0<em>Data<\/em>\u00a0<em>Lake<\/em>\u00a0metadata for datasets, and so on), and another common step type enables human-driven processes of approvals or evaluations, be that from application teams or from end business customers. In general, our flows represent the\u00a0<strong>lifecycle<\/strong>\u00a0of key ML entities (e.g. the lifecycle of models or datasets associated with an application\/use case), and the underlying assumption is that these lifecycle stages can take a long time. EFlow is designed not to consume resources during or in between these lifecycle stages. The two core principles underpinning EFlow are:<\/p>\n<p><strong>Separation of Flow State vs Compute \u2014 Steps in our flows don\u2019t perform computation\u00a0<\/strong>\u2014 they always\u00a0<em>invoke\/delegate<\/em>\u00a0to remote computation systems (Spark, Kubernetes TensorFlow clusters, etc.) and listen to events from those systems for results (such as computation completed\/failed). Those events contain references to computation results.<\/p>\n<p><strong>Steps handle metadata, not data\u00a0<\/strong>\u2014 they always work with\u00a0<strong>metadata\/references of inputs or results<\/strong>, typically representing\u00a0<em>Datasets<\/em>,\u00a0<em>Transformations,\u00a0<\/em><a href=\"https:\/\/engineering.salesforce.com\/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality-4ec2f5504421\" target=\"_blank\" rel=\"noopener\"><em>Evaluations<\/em><\/a>, or\u00a0<em>MetricSets<\/em>. 
In essence, they always handle pointers to large datasets, models, and so on.<\/p>\n<p class=\"has-text-align-center\">Step execution lifecycle in EFlow (invoking a remote system)<\/p>\n<p>The two fundamental principles above ensure horizontal scalability where:<\/p>\n<p>Very large numbers of\u00a0<strong>concurrent flow executions\u00a0<\/strong>(in the hundreds of thousands) can be handled very well by our system, a core requirement for our unique\u00a0<em>multi-tenant<\/em>\u00a0use cases at Salesforce.<\/p>\n<p>Flow executions with\u00a0<strong>very long durations\u00a0<\/strong>can also be handled without issues (they can be active for\u00a0<em>months<\/em>\u00a0or, if necessary, for\u00a0<em>years<\/em>). Technically, we don\u2019t have any underlying limitation on execution duration. Our flows lend themselves to very long ML\u00a0<em>lifecycle transitions<\/em>\u00a0(things like complex live experiment progressions can take months).<\/p>\n<p>While our system is comparable in functionality to other popular solutions (Airflow, Argo, Azkaban, etc.), we differ greatly in\u00a0<em>scaling philosophy, availability,<\/em>\u00a0and\u00a0<em>implementation<\/em>.<\/p>\n<h4>Steps: What vs. How<\/h4>\n<p>Steps represent our standard units of work for flows. They involve invoking external systems for either immediate or asynchronous processes. Yet, in the ML lifecycle, we only perform a few well-defined types of operations regardless of the underlying implementation or which exact external systems (e.g. computation engine) are being used. For example, to transform and output datasets, we feed in source dataset(s) and a transformation definition, with the result\/output representing the transformed datasets. Similarly, for generating models, we input source datasets and a transformation definition, with model(s) as output; for evaluations we would have the same, but with model or data evaluation results as the output, and so on. 
The input\/output types represent the\u00a0<strong>what<\/strong>, the functional contract of the step. In our step design, we differentiate between this contract, representing what the step is doing, and the actual implementation, representing\u00a0<strong>how<\/strong>\u00a0the step is doing it. One benefit of designing steps this way is that it abstracts, for example, the concrete computation engine (Spark vs. TensorFlow clusters vs. \u2026) and even whether computation is being performed on a batch system or a (near-)real-time system. This allows, among other things, for highly configurable flows (they are defined as JSON) leveraging a few well-defined steps with clear semantics that allow for strongly typed validation. Splitting the what from the how, along with a consistent separation between compute and flow state, also opens up flows that represent ML lifecycles spanning the offline\/online divide and allows tracking of models that may be changed or enriched by very fast online learning\/processing.<\/p>\n<h4>EFlow Architecture<\/h4>\n<p>EFlow follows an\u00a0<em>API-first<\/em>\u00a0philosophy very consistently. All our functionality, from\u00a0<em>authoring<\/em>\u00a0flows to instantiating, managing, and tracking flow executions, is surfaced via APIs that have guarantees around backward compatibility and are associated with strong SLAs. We have three major components, each composed of multiple internal microservices. 
Each component is set up with high availability and resiliency in mind:<\/p>\n<p><em>EFlow Engine<\/em>: manages flow state and lifecycle; what is the flow state at any point in time, and how does it transition from one state to the next?<\/p>\n<p><em>EFlow Scheduler<\/em>: manages distributed scheduling\/cron capabilities, essentially determining which flow execution should start next; a fancy distributed cron job.<\/p>\n<p><em>EFlow History<\/em>: tracks executions for surfacing to end users and enables observability with a UI.<\/p>\n<h4>EFlow Engine<\/h4>\n<p>EFlow Engine is responsible for managing flow state transitions. At a high level, it is composed of an API layer, a worker layer, a database, and an event bus. The flow engine is an event-driven system; no part of the system waits idle, blocking resources, while, for example, computation is happening. The workers are the components performing\u00a0<em>step executions<\/em>, which in our case are only\u00a0<em>lightweight<\/em>\u00a0calls to external services. They also advance flow state via the events. The full flow state is kept in a database record. On each event, the worker retrieves and initializes the associated flow state; once the event is applied, the state of the flow is advanced and stored back. We scale by increasing the\u00a0<em>number of shards<\/em>\u00a0in the event bus and the associated workers processing events from the bus. The event processing for flow transitions ensures that the only real limitation for our system is the flow state\u00a0<em>size<\/em>, which primarily affects the speed of event processing.<\/p>\n<p>Flow state size is almost entirely dependent on the number of steps in a flow and the size of step input and output objects. We have various ways to cope with this limitation. On one hand, we enforce that input and output objects represent metadata, not data, as described in the core principles above. 
In addition, when a flow needs a large number of branches (in the hundreds or even thousands), with each branch being similar to a flow of its own (e.g. segmentation or large hyper-parameter tuning flows), we express these branches as subflows independent from the parent flow. Their state is stored in separate records and their processing is independent, ensuring the overall size of the parent flow state stays very manageable.<\/p>\n<p class=\"has-text-align-center\">EFlow engine conceptual view (left), event processing lifecycle (right)<\/p>\n<h4>EFlow Scheduler<\/h4>\n<p>Our second major component is the scheduler. It is the component that provides the\u00a0<em>heartbeat<\/em>\u00a0of the entire ML platform. Its primary role is to manage flow executions running at a particular cadence, essentially determining when a flow execution should be\u00a0<em>triggered<\/em>, as well as to handle step-level retries, backoffs, and timeouts. We think of our scheduler as a simple yet highly scalable and distributed cron. The key approach enabling the scalability properties that we require to support our numbers is\u00a0<em>partitioning<\/em>\u00a0the database tables that manage our schedules, with each partition being processed independently by what we call\u00a0<em>loaders<\/em>. For each scheduling record, we store when it should run next (<em>next execution time<\/em>), as well as its partition name. Each independent loader continuously processes the associated partition with records filtered and sorted by nextExecutionTime. 
The partitioning strategy allows us to scale to a very large number of schedules simply by adding partitions and loaders.<\/p>\n<p class=\"has-text-align-center\">EFlow scheduler architecture (left), next execution time and partitioning (right)<\/p>\n<h4>EFlow History<\/h4>\n<p>The last piece of the puzzle is our flow history component, which tracks historical execution and flow events and powers a UI to visualize and query state for end users. We consistently keep the history component\u00a0<em>separate<\/em>\u00a0from the flow engine, since the history component is used directly by our users and enables ad-hoc querying, including potentially exhaustive searches over months\u2019 worth of data. It is connected via a bus to receive events from the flow engine, and it consumes the bus events to reflect and track all executions. End users see our history component when they interact with our UI.<\/p>\n<h4>Towards hundreds of millions<\/h4>\n<p>After a few years operating EFlow, we believe that separating\u00a0<em>lifecycle<\/em>\u00a0from\u00a0<em>computation<\/em>, in addition to separating the\u00a0<em>what<\/em>\u00a0from the\u00a0<em>how<\/em>\u00a0(function vs implementation), are critical patterns that can enable scaling and managing millions or hundreds of millions of models. That is because flows are represented simply as database records and bus events; hence, there is no limit to scaling concurrently running flows, especially if those flows represent models that are enriched by cheap and fast-moving user-specific online or edge computation processes. 
We believe that managing this extreme number of flows, each representing a data\/model lifecycle, similar to what we handle at Salesforce, is going to become an increasingly common pattern, including at large consumer companies, and not just an outlier mostly encountered in the enterprise space. The drive towards larger numbers of models is becoming more apparent as intelligent solutions need to be more\u00a0<em>personalized<\/em>\u00a0and provide an almost exact mirror of the users they are serving.<\/p>\n<p><em>From Models to Agents<br \/><\/em>A larger number of personalized models has led us to think of ML solutions in terms of agents,\u00a0<em>agent<\/em>\u00a0lifecycle, and\u00a0<em>agent populations<\/em>, and less in terms of low-level\u00a0<em>models<\/em>. This shift in abstraction from\u00a0<em>models<\/em>\u00a0to\u00a0<em>agents<\/em>\u00a0is also required because holistic intelligent solutions are very often composed of more than one model, each with a different lifecycle, working together to achieve certain behavioral patterns. Managing and following agents from their creation to their post-deployment stages is beneficial, especially as we move towards a future where users are represented by dedicated and isolated\/independent intelligent learning processes happening online or at the edge. For example, in modern model architectures, we commonly find patterns where many models\/agents have a\u00a0<em>common trunk<\/em>\u00a0trained on large-scale global data representing some general model\/understanding of the larger world or environment (e.g. the general English language in NLP models), and a user-specific\u00a0<em>head<\/em>\u00a0trained on user-specific data representing task\/user-specific understanding and derived representations (a common trunk and user-specific higher layers or heads are typical in transfer learning regimes and deep learning architectures). 
Another pattern is for an agent\u2019s models to always be trained from scratch with just a single user\u2019s data. Yet, in both of these cases we still end up needing to manage an agent or model per user, meaning potentially millions of models for some domains.<\/p>\n<p>Our architecture lends itself very well to this shift in philosophy and to scaling to millions or hundreds of millions of independent agent lifecycles, defining and tracking not just how agents are created\/trained, but also guiding their entire life progression until retirement. That means spanning the divide between the offline and online environments and allowing for extremely personalized model lifecycles. In addition, thinking about intelligent solutions in terms of agents and agent\u00a0<em>populations<\/em>\u00a0gives us a cross-generational view of lifecycles, enabling multi-generation optimization of entire\u00a0<em>agent\/model populations.\u00a0<\/em>This multi-generational view caters in principle to evolutionary or Bayesian optimization processes that are very useful for large-scale AutoML, spanning very long time ranges \u2014 on the order of weeks, months, or even years \u2014 for populations of millions of models. 
Ultimately, the key to this shift towards more personalized models is the ability of ML platforms to handle independent model and agent lifecycles at massive scale, something that lies at the foundation of the EFlow design.<\/p>\n<p><em>Acknowledgments<br \/><\/em>Special thanks to the following people for their contributions and leadership over the years in shaping and influencing this unique ML component: Radhika Pasari, Hormoz Taravern, Sonika Arora, Chandrashekhar Vijayarenu, Swaminathan Sundaramurthy, Rama Raman, Paul Kassianik, Alexander Nikitin.<\/p>\n<p>The post <a href=\"https:\/\/engineering.salesforce.com\/eflow-racing-towards-millions-of-ml-flows-a0c61589e70a\/\">EFlow\u200a\u2014\u200aRacing towards millions of ML flows<\/a> appeared first on <a href=\"https:\/\/engineering.salesforce.com\/\">salesforce-engineering.go-vip.net<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Salesforce Einstein\u00a0operates many machine learning applications that cater to a variety of use cases inside Salesforce \u2014 vision and language, but also classic machine learning (ML) approaches using tabular data to classify customer cases, score leads, and more. We also offer many of these ML solutions and applications to our customers. 
Each one of our&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/03\/15\/eflow-racing-towards-millions-of-ml-flows-2\/\">Continue reading <span class=\"screen-reader-text\">EFlow\u200a\u2014\u200aRacing towards millions of ML flows<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-583","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":229,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/ml-lake-building-salesforces-data-platform-for-machine-learning\/","url_meta":{"origin":583,"position":0},"title":"ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning","date":"February 2, 2021","format":false,"excerpt":"Salesforce uses machine learning to improve every aspect of its product suite. With the help of Salesforce Einstein, companies are improving productivity and accelerating key decision-making. Data is a critical component of all machine learning applications and Salesforce is no exception. 
In this post I will share some unique challenges\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":553,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/15\/eflow-racing-towards-millions-of-ml-flows\/","url_meta":{"origin":583,"position":1},"title":"EFlow\u200a\u2014\u200aRacing towards millions of ML flows","date":"March 15, 2022","format":false,"excerpt":"EFlow\u200a\u2014\u200aRacing towards millions of ML\u00a0flows Salesforce Einstein operates many machine learning applications that cater to a variety of use cases inside Salesforce\u200a\u2014\u200avision and language, but also classic machine learning (ML) approaches using tabular data to classify customer cases, score leads, and more. We also offer many of these ML solutions\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":802,"url":"https:\/\/fde.cat\/index.php\/2023\/12\/12\/developing-the-new-xgen-salesforces-foundational-large-language-models\/","url_meta":{"origin":583,"position":2},"title":"Developing the New XGen: Salesforce\u2019s Foundational Large Language Models","date":"December 12, 2023","format":false,"excerpt":"By Shafiq Rayhan Joty and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional journeys that have shaped Salesforce Engineering leaders. Meet Shafiq Rayhan Joty, a Director at Salesforce AI Research. 
Shafiq co-leads the development of XGen, a series of groundbreaking large language models (LLMs) of different\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":848,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/01\/unveiling-the-cutting-edge-features-of-ml-console-for-ai-model-lifecycle-management\/","url_meta":{"origin":583,"position":3},"title":"Unveiling the Cutting-Edge Features of ML Console for AI Model Lifecycle Management","date":"April 1, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we explore the journeys of engineering leaders who have made remarkable contributions in their fields. Today, we meet Venkat Krishnamani, a Lead Member of the Technical Staff for Salesforce Engineering and the lead engineer for Salesforce Einstein\u2019s Machine Learning (ML) Console. This vital tool\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":550,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/10\/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality\/","url_meta":{"origin":583,"position":4},"title":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML\/AI Quality","date":"March 10, 2022","format":false,"excerpt":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML\/AI\u00a0Quality An important transition is underway in machine learning (ML) with companies gravitating from a research-driven approach towards a more engineering-led process for applying intelligence to their business problems. 
We see this in the growing field of ML operations, as well as in the shift\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":584,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/10\/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality-2\/","url_meta":{"origin":583,"position":5},"title":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML \/ AI Quality","date":"March 10, 2022","format":false,"excerpt":"An important transition is underway in machine learning (ML) with companies gravitating from a research-driven approach towards a more engineering-led process for applying intelligence to their business problems. We see this in the growing field of ML operations, as well as in the shift in skillsets that teams need to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/583","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=583"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/583\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=583"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=583"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=583"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}