{"id":225,"date":"2021-02-02T20:01:16","date_gmt":"2021-02-02T20:01:16","guid":{"rendered":"https:\/\/fde.cat\/?p=225"},"modified":"2021-02-02T20:01:17","modified_gmt":"2021-02-02T20:01:17","slug":"flow-scheduling-for-the-einstein-ml-platform","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/flow-scheduling-for-the-einstein-ml-platform\/","title":{"rendered":"Flow Scheduling for the Einstein ML Platform"},"content":{"rendered":"<p>At Salesforce, we have thousands of customers using a variety of products. Some of our products are enhanced with machine learning (ML) capabilities. With just a few clicks, customers can get insights about their data. Behind the scenes, it\u2019s the Einstein Platform that builds hundreds of thousands of models, unique for each customer and product, and serves those predictions back to the customer.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1000\/1*JNrG0SEIrpJRWyQDZ2wJ8Q.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>Each customer comes with their unique data structure and distribution. The sheer scale of the number of customers and their unique data configurations sets new requirements for our processes. For example, it\u2019s simply not practical to have a data scientist look at every model that gets trained and verify the model quality. The established process in the ML industry is to start by manually \u201cconstructing\u201d a model, tweaking the parameters by hand, and then training and deploying the model. In contrast, we needed a process to automate the \u201cconstruction\u201d phase in addition to model training, validation, and deployment. We wanted to automate the end-to-end process of creating a model. 
To solve this particular problem, we use our own (Auto)ML library on top of the open source <a href=\"https:\/\/github.com\/salesforce\/TransmogrifAI\">TransmogrifAI<\/a> library.<\/p>\n<p>While many ML solutions deal with processing a handful of very large datasets (a big data problem), our main use case is a small data problem. Namely, we need to process many thousands of small- to medium-sized datasets and generate models for each one of them. <\/p>\n<p>The various steps involved in producing a model need to be codified in a flow. The flow automates the entire process: pulling data from our sources, performing data cleaning and preparation, running data transformations, and finally training and validating the model. There are already well-established orchestration systems for authoring ML and big data flows. However, they\u2019re designed with the assumption that there are only a few (up to a few hundred) flow instances that need to run at any point in time. That\u2019s not the case for Salesforce. We have hundreds of thousands of flow instances\u200a\u2014\u200aseveral for each of our customers\u200a\u2014\u200athat must run reliably and in isolation. <\/p>\n<p>So we built, from the ground up, a workflow orchestration system designed to deal with our scale requirements. A key requirement of a scalable orchestration solution is to have a scalable scheduler\u200a\u2014\u200anamely, the ability to trigger flow executions reliably at a particular configured point in time. In fact, the extensive tests we performed on other existing orchestration solutions revealed key problems in their scheduling approaches that wouldn\u2019t work for our use cases. For example, we ran into difficulties running more than 1,000 flows concurrently. <\/p>\n<p>The most straightforward solution is cron, a reliable method for scheduling processes in Unix-like systems. However, this solution isn\u2019t scalable, and if the node fails, all of our schedules are gone. 
Similar issues occur when using single-process schedulers. <\/p>\n<p>What about a distributed cron? This solution is fault-tolerant and scalable, and node failures won\u2019t cause the schedules to be lost. This is the scheduling solution that other enterprise companies use. However, this approach puts a significant load on our system. Consider this graph of request counts for 100 flows that need to be executed every\u00a0minute:<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/602\/1*FssOtvAJM-BWBJxtYxDErg.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>This means that, at scale, our server will be hit with a lot of requests on the minute, but will be idle the rest of the time. We want to smooth out the execution of flows to something like\u00a0this:<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/602\/1*VF6XkvAg6tGAZoozRwrbsQ.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>This way we get a smaller but more continuous load on our servers for the same performance!<\/p>\n<p>So back to the drawing board we\u00a0go.<\/p>\n<h3>A Timely\u00a0Solution<\/h3>\n<p>This graphic shows our scheduler solution.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/782\/1*w59ZKi9-AUWF5kCA0-FjjA.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>Many of these components are common to most services. We have API nodes that respond to HTTP requests. We have a database cluster that handles persistence and keeps track of schedules. The key components of this scheduler are the two lambdas connected with a\u00a0queue.<\/p>\n<h3>What\u2019s in a Schedule?<\/h3>\n<p>Before we can begin building a schedule architecture, we need to first define what a \u201cschedule\u201d is in more detail. 
In our system, a schedule is <strong>an action (HTTP call) that must be performed on a period within a margin of\u00a0error.<\/strong><\/p>\n<p>Let\u2019s break it down one point at a\u00a0time:<\/p>\n<ul>\n<li><strong>Action<\/strong>\u200a\u2014\u200aA schedule is an instruction to perform some <strong>action<\/strong>. For performance reasons, in our use case it\u2019s mostly HTTP\u00a0calls.<\/li>\n<li><strong>Must be performed on a period<\/strong>\u200a\u2014\u200aThe action usually doesn\u2019t happen just once. We normally want to do something once an hour, once a day, once a month, or at some other interval. We call the time between the actions a\u00a0<strong>period<\/strong>.<\/li>\n<li><strong>Within a margin of error<\/strong>\u200a\u2014\u200aThe margin of error on an action specifies <strong>how late an action can be executed<\/strong>. Say our service was down for an hour. We would have a backlog of schedules. Some scheduled actions (for example, ones that run once an hour) are no longer relevant. Other scheduled actions (for example, ones that run once a month) are critical.<\/li>\n<\/ul>\n<p>These three parameters combined define a schedule. <\/p>\n<p>We also define a <strong>trigger<\/strong> as an instance of a schedule. One of the innovations that we made in the Scheduler Service is that instead of dealing with schedules themselves, our applications operate on triggers. This allows us to keep better track of the actions we execute and to recover from errors more easily.<\/p>\n<h3>Queue Up!<\/h3>\n<p>Now to answer the big question: how do we smooth out the volume of outbound calls? The mechanism that allows us to do this is the queue. By queuing up triggers, we get to consume them at any pace we want without losing or blocking on any triggers. We just pick them up as they appear, and process them with our Lambdas. If any trigger fails with a recoverable error, we can just put it back into the queue with a timeout. 
<\/p>\n<p>The queue is the distinguishing factor between our scheduler and the schedulers of other popular workflow management platforms. The queue gives us incredible scale, enabling our scheduler to run over 900,000 triggers per hour (we didn\u2019t count beyond\u00a0that).<\/p>\n<h3>The Lambdas<\/h3>\n<p>We decided to use <a href=\"https:\/\/aws.amazon.com\/lambda\/\">AWS Lambdas<\/a> for our architecture for a couple of reasons. First, they are easily scalable. You can spin up another 1,000 lambdas at the click of a button, which lets us easily scale up if we ever need to handle more schedules. Second, the lambdas have built-in connectors to Amazon SQS queues, which cuts down our development time and puts the responsibility of queue consumption on AWS.<\/p>\n<p>The Loader Lambda is responsible for loading events into the queue. It runs once a minute; on each run, it queries the database for triggers that are due for execution and sends them into the queue. The Worker Lambdas simply consume the triggers from the queue and process them according to the data that comes with the\u00a0event.<\/p>\n<h3>Possible Hiccups<\/h3>\n<p>We just described the happy path for the execution of a scheduler. But we can\u2019t rely on our apps to always take the happy path. What can go wrong? Here are a few common scenarios.<\/p>\n<h4>Invalid Actions<\/h4>\n<p>Our actions are mostly formatted HTTP calls. But what happens if the user puts bad parameters into the schedule? What if the service that we called is down and returns a 500 code?<\/p>\n<p>Whenever an action is executed, the scheduler listens for the response code and responds accordingly. If the response from the downstream service is in the 400s, that suggests a user error, and the schedule is \u201ceaten up.\u201d If the error is in the 500s, that suggests that the downstream service is simply down. 
In that case, we drive the schedule back into the queue to be picked up by another lambda. You can implement retries with exponential backoff by redriving the items with a\u00a0timeout.<\/p>\n<h4>The Bottleneck<\/h4>\n<p>An attentive reader might ask, \u201cWhat happens if the Loader Lambda doesn\u2019t finish on time?\u201d This is the main bottleneck in the architecture, and it can be reached under enough load. Sharding is the recommended solution for these types of problems. Add a column that contains a random number representing the shard. You can even put the records for each shard in a separate table. For each shard, create a Loader Lambda and instruct it to pull records only from that shard and load them into the queue. This way, you can parallelize your database loading for almost infinite scalability.<\/p>\n<h3>Perf Results<\/h3>\n<p>We tested our initial scheduler service for performance. We were able to run about 180,000 schedules per hour, but we couldn\u2019t burst beyond 4,000 schedules per minute. By batching the SQS messages and parallelizing the Worker Lambdas, we brought total throughput to over 900,000 triggers per hour and burst capacity to 20,000 triggers per\u00a0minute.<\/p>\n<h3>Scheduled Delivery<\/h3>\n<p>By expanding the definition of traditional schedules and putting the brunt of the scale work on serverless AWS, we created a scalable scheduler for our ML orchestration needs. 
By enabling scheduled executions of workflows, we serve thousands of customers with our AI models and predictions each\u00a0day!<\/p>\n<hr>\n<p><a href=\"https:\/\/engineering.salesforce.com\/flow-scheduling-for-the-einstein-ml-platform-b11ec4f74f97\">Flow Scheduling for the Einstein ML Platform<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At Salesforce, we have thousands of customers using a variety of products. Some of our products are enhanced with machine learning (ML) capabilities. With just a few clicks, customers can get insights about their data. 
Behind the scenes, it\u2019s the Einstein Platform that builds hundreds of thousands of models, unique for each customer and product,&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/02\/02\/flow-scheduling-for-the-einstein-ml-platform\/\">Continue reading <span class=\"screen-reader-text\">Flow Scheduling for the Einstein ML Platform<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-225","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":229,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/ml-lake-building-salesforces-data-platform-for-machine-learning\/","url_meta":{"origin":225,"position":0},"title":"ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning","date":"February 2, 2021","format":false,"excerpt":"Salesforce uses machine learning to improve every aspect of its product suite. With the help of Salesforce Einstein, companies are improving productivity and accelerating key decision-making. Data is a critical component of all machine learning applications and Salesforce is no exception. In this post I will share some unique challenges\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":834,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/06\/how-the-new-einstein-1-platform-manages-massive-data-and-ai-workloads-at-scale\/","url_meta":{"origin":225,"position":1},"title":"How the New Einstein 1 Platform Manages Massive Data and AI Workloads at Scale","date":"March 6, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we feature Leo Tran, Chief Architect of Platform Engineering at Salesforce. 
With over 15 years of engineering leadership experience, Leo is instrumental in developing the Einstein 1 Platform. This platform integrates generative AI, data management, CRM capabilities, and trusted systems to provide businesses with\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":550,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/10\/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality\/","url_meta":{"origin":225,"position":2},"title":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML\/AI Quality","date":"March 10, 2022","format":false,"excerpt":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML\/AI\u00a0Quality An important transition is underway in machine learning (ML) with companies gravitating from a research-driven approach towards a more engineering-led process for applying intelligence to their business problems. We see this in the growing field of ML operations, as well as in the shift\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":584,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/10\/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality-2\/","url_meta":{"origin":225,"position":3},"title":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML \/ AI Quality","date":"March 10, 2022","format":false,"excerpt":"An important transition is underway in machine learning (ML) with companies gravitating from a research-driven approach towards a more engineering-led process for applying intelligence to their business problems. 
We see this in the growing field of ML operations, as well as in the shift in skillsets that teams need to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":288,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/building-a-successful-enterprise-ai-platform\/","url_meta":{"origin":225,"position":4},"title":"Building a Successful Enterprise AI Platform","date":"August 31, 2021","format":false,"excerpt":"IntroductionIn 2016, I started as a fresh grad software engineer at a small startup called MetaMind, which was acquired by Salesforce. Since then, it has been quite a journey to achieve a lot with a small team. I\u2019m part of Einstein Vision and Language Platform team. Our platform provides customers\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":859,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/23\/einstein-copilot-for-tableau-building-the-next-generation-of-ai-driven-analytics\/","url_meta":{"origin":225,"position":5},"title":"Einstein Copilot for Tableau: Building the Next Generation of AI-Driven Analytics","date":"April 23, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we explore the extraordinary journeys of engineering leaders who have achieved success in their specific domains. 
Today, we meet John He, Vice President of Software Engineering, who leads the development of Einstein Copilot for Tableau \u2014 an innovative tool that redefines how users interact\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=225"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/225\/revisions"}],"predecessor-version":[{"id":238,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/225\/revisions\/238"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}