{"id":262,"date":"2021-08-31T14:40:46","date_gmt":"2021-08-31T14:40:46","guid":{"rendered":"https:\/\/fde.cat\/?p=262"},"modified":"2021-08-31T14:40:46","modified_gmt":"2021-08-31T14:40:46","slug":"the-origin-of-mlmon","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/the-origin-of-mlmon\/","title":{"rendered":"The Origin of MLMon"},"content":{"rendered":"<h3>by Means of Natural Selection, or the Preservation of Favoured Microservices in the Struggle for\u00a0Life<\/h3>\n<p>So this is going to be an experimental format for a blog post. I\u2019m going to describe a problem and solution then the problems that came up after, then solutions to them, and new problems, etc. I am not claiming these to be the \u201cbest\u201d solutions, or even applicable to anyone else. This post is more about how one particular product evolved over time. One has to keep in mind that this was a natural evolution. It is constrained not only by the problem, but the available resources to solve and the value of that solution. I\u2019ve also included a relative pain meter for each problem and implementing solution. To explain this post\u2019s shorthand, MLMon == \u201cMachine Learning Monitor,\u201d which is the system we use to run machine learning algorithms that power our various automation services. It includes training, batch scoring systems, and other compute heavy operations.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1000\/1*LViwMW_VeUi__lC5LFMT8g.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<blockquote><p><em>\u201cIf it could be demonstrated that any complex organ existed, which could not possibly have been formed by numerous, successive, slight modifications, my theory would absolutely break down. But I can find no such case.\u201d \u2015 Charles\u00a0Darwin<\/em><\/p><\/blockquote>\n<p>What do you mean constrained by the value of the solution? 
Shouldn\u2019t you always use the most robust solution available? Sometimes, but often, no. I like the example from <a href=\"https:\/\/www.schneier.com\/\">Bruce Schneier<\/a>: The most secure way you can encrypt a document is a truly random one-time pad\u2026 but you\u2019d be rather silly to use that to secure your 13yo daughter\u2019s diary. So when you decide on how to solve a problem, you should of course think about how well it\u2019ll solve it and how robust it\u2019ll be in the future, but you also need to question whether it\u2019s necessary and whether there will even be a future. Much of the work we do on my team is experimental, and a lot will simply be abandoned once we prove it\u2019s not useful or effective. So each time you see one of these solutions and think \u201cWell that\u2019s suboptimal, you should have XYZ,\u201d stop and ask, \u201cWould XYZ have been a good use of resources at the\u00a0time?\u201d<\/p>\n<blockquote><p><em>\u201cThe difference between a programmer and an engineer is the consideration of development costs, both in the present and\u00a0future!\u201d<\/em><\/p><\/blockquote>\n<p>Anyways, enough caveats, on to the evolution! I\u2019ve included a scale, using \ud83d\udd25, of how much pain the problem causes versus how much pain implementing the solution would be.<\/p>\n<p><strong>Problem<\/strong>: We need to run a single machine learning (ML) algorithm on data for all of our clients. The data is big and the ML is complicated, making it difficult to run efficiently. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25<br \/><strong>Solution<\/strong>: We created EC2 instance clusters for each client (around half a dozen at this point) and a script that starts the clusters, runs the ML, uploads the results, and stops the instances. Each client had a different number of instances in their cluster based on the client data size. 
Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: This is expensive! At the time, AWS charged per hour rounded up. So even if the ML took only 5 minutes, we\u2019d pay for an entire hour per machine! Also, the size of available machines had grown, so we no longer needed clusters to run a single client; each could be run on a single powerful instance. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25<br \/><strong>Solution<\/strong>: Create a system that is able to run multiple clients sequentially on a single box, cleaning up after each. This system feeds off a queue of which clients need to run and autoscales the number of boxes based on the number of jobs in the queue. It writes data to a central DB about each instance and job so we can learn how to binpack better in the future and save money. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: Using machine learning to improve things is awesome; let\u2019s create more models\/algorithms! Wait\u2026 the current system was created to run only one type. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25<br \/><strong>Solution<\/strong>: Abstract the class that runs ML and use a subclass for each new model. The class needs to know how to prepare the environment, run the job, and clean up when all is done (either successfully or in error). Pain: \ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: Creating a new subclass each time the data science team comes up with something new is a bit of a pain. It\u2019s not a good use of resources to ask the data scientists to create the subclasses since they\u2019d need to learn the project\u2019s build\/test\/coding standards. Pain: \ud83d\udd25\ud83d\udd25<br \/><strong>Solution<\/strong>: Remove the subclasses and create a runtime-configured class that reads from YAML files. 
Pain: \ud83d\udd25\ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: As more and more algorithms are added, each needing slightly different configuration, and as the system itself needs more configuration for each job, the YAML files are becoming complicated and prone to human error. Pain: \ud83d\udd25\ud83d\udd25<br \/><strong>Solution<\/strong>: Create a UI to configure jobs! This also gives us a chance to create a reporting system for the outcomes of each job. So a double win! Pain: \ud83d\udd25\ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: The ML jobs are getting more divergent. They require different versions of libraries to run, and no one has time to update the old jobs that are running just fine. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25<br \/><strong>Solution<\/strong>: Instead of running the jobs directly on the box, we can encapsulate the environment into Docker containers. We provide a skeleton of a few different types to the data science team and let them customize to their hearts\u2019 content. This allows different dependencies, and makes cleanup after each job far easier! Pain: \ud83d\udd25\ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: While all the ML jobs can run on the beefy machines originally set up, their resource needs vary. Some require a GPU, some only a good CPU, some high memory but very little CPU, and some require very little at all to run. Keeping everything on the large boxes is wasteful. Pain: \ud83d\udd25<br \/><strong>Solution<\/strong>: Create multiple queues for the different types of boxes available and separate Auto Scaling Groups for each queue. Pain: \ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: As we grow in the number of clients and ML jobs, the queues are getting backed up a bit. Some jobs depend on others, and some need to run more often than others. Unfortunately, they all run in more or less FIFO order. 
Pain: \ud83d\udd25<br \/><strong>Considered Solution<\/strong>: Create a priority queue and have jobs assigned an importance score. This was not chosen since we were using the basic AWS queueing service, and writing our own was not worthwhile. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25<br \/><strong>Solution<\/strong>: Move from a queue\/Auto Scaling Group per resource to one per job type. Pain: \ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: Creating new queues\/Auto Scaling Groups each time a new ML algorithm comes up is a bit of a hassle. Pain: \ud83d\udd25<br \/><strong>Partial Solution<\/strong>: Use a Terraform module to reduce the work to writing a few lines of code and running a command. Pain: \ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: We went from a half dozen clients with a single ML job to thousands of clients with dozens of ML jobs. The single DB is getting hammered, and that\u2019s causing job failures. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25<br \/><strong>Temporary Solution<\/strong>: Stop writing stats about each instance; they aren\u2019t that useful since AWS no longer charges for whole hours. Add read replicas; this is a fairly common way to reduce DB load. This reduces the load but still doesn\u2019t improve write scaling. Pain:\u00a0\ud83d\udd25<\/p>\n<p><strong>Problem<\/strong>: More DB issues. We now have hundreds of ML boxes running at a time during peak. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25<br \/><strong>Considered Solution<\/strong>: Move from the standard SQL DB to Cassandra. This was deemed more work than it was worth since it\u2019d require rewriting the entire DB layer of the job runner and the reports. Pain: \ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25\ud83d\udd25<br \/><strong>Solution<\/strong>: None of the writes are needed immediately in the DB. They are primarily just there for reports. 
So let\u2019s create a queue of DB writes; we can then reuse the system\u2019s existing \u201cread from queue and perform job\u201d machinery to perform the DB writes. This requires relatively little new code and reuses the throttling system already in place. Pain: \ud83d\udd25\ud83d\udd25<\/p>\n<p><strong>Possible Future Problem<\/strong>: Creating the new queues\/Auto Scaling Groups still requires engineering time, which isn\u2019t ideal. Pain: \ud83d\udd25<br \/><strong>Possible Future Solution<\/strong>: Move this creation into the UI. Of course, having a UI that creates infrastructure is generally a dangerous thing. So perhaps we\u2019ll revisit making a new priority queue service. Pain: \ud83d\udd25\ud83d\udd25<\/p>\n<p><strong>Possible Future Problem<\/strong>: This bespoke binpacking isn\u2019t really the greatest, and as the number of clients and jobs grows, the small inefficiencies will start to add up to real costs. Pain: \ud83d\udd25\ud83d\udd25<br \/><strong>Possible Future Solution<\/strong>: Move this system into Kubernetes, which we already have in place for other services and which already knows how to efficiently pack things together. Pain:\u00a0\ud83d\udd25\ud83d\udd25\ud83d\udd25<\/p>\n<blockquote><p><em>\u201cJust as smart financial debt can help you reach major life goals faster, not all technical debt is bad, and managing it well can yield tremendous benefits for your company.\u201d\u200a\u2014\u200a<\/em><a href=\"https:\/\/hackernoon.com\/u\/FirstMark\"><em>hackernoon.com<\/em><\/a><\/p><\/blockquote>\n<p>As mentioned above, this article is in an experimental format. One thing that became clear <em>while<\/em> I was writing is the value of the pain scale. When we faced a challenge and brainstormed solutions, we\u2019d analyze how much trouble implementation would be, but we never boiled it down to a simple scale like in this article. Looking back, it is obvious, to paraphrase Dr. 
Henry Cloud, that changes happen when the pain of the system remaining the same exceeds or equals the pain of changing. Each decision made is an attempt to minimize pain\/cost or, more positively, to maximize the return on our expenditure. We make decisions with trade-offs and the knowledge that nothing is final; the system shall continue to\u00a0evolve.<\/p>\n<hr>\n<p><a href=\"https:\/\/engineering.salesforce.com\/the-origin-of-mlmon-951cda190dbe\">The Origin of MLMon<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>by Means of Natural Selection, or the Preservation of Favoured Microservices in the Struggle for\u00a0Life So this is going to be an experimental format for a blog post. I\u2019m going to describe a problem and its solution, then the problems that came up after, then solutions to those, then new problems, and so on. 
I am not claiming&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/the-origin-of-mlmon\/\">Continue reading <span class=\"screen-reader-text\">The Origin of MLMon<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-262","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":646,"url":"https:\/\/fde.cat\/index.php\/2022\/10\/31\/improving-instagram-notification-management-with-machine-learning-and-causal-inference\/","url_meta":{"origin":262,"position":0},"title":"Improving Instagram notification management with machine learning and causal inference","date":"October 31, 2022","format":false,"excerpt":"We\u2019re sharing how Meta is applying statistics and machine learning (ML) to improve notification personalization and management on Instagram \u2013 particularly on daily digest push notifications. By using causal inference and ML to identify highly active users who are likely to see more content organically, we have been able to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":700,"url":"https:\/\/fde.cat\/index.php\/2023\/04\/11\/3-ways-salesforce-takes-ai-research-to-the-next-level\/","url_meta":{"origin":262,"position":1},"title":"3 Ways Salesforce Takes AI Research to the Next Level","date":"April 11, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the life experiences and career paths that have shaped Salesforce engineering leaders. Meet Shelby Heinecke, a research manager for the Salesforce AI team. 
Shelby leads her diverse team on a variety of projects, ranging from identity resolution to recommendation systems to conversational\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":856,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/18\/in-their-own-words-15-salesforce-engineering-innovators-discuss-the-art-of-problem-solving\/","url_meta":{"origin":262,"position":2},"title":"In Their Own Words: 15 Salesforce Engineering Innovators Discuss the Art of Problem Solving","date":"April 18, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d series, we explore the problem-solving skills of engineering leaders. In this special edition, we caught up with some of the brightest minds from Salesforce Engineering across India, Argentina, and the U.S, and met a few new innovators who the Engineering Blog will feature soon. Join us\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":893,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/10\/metas-approach-to-machine-learning-prediction-robustness\/","url_meta":{"origin":262,"position":3},"title":"Meta\u2019s approach to machine learning prediction robustness","date":"July 10, 2024","format":false,"excerpt":"Meta\u2019s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta\u2019s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. 
To minimize disruptions and ensure\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":692,"url":"https:\/\/fde.cat\/index.php\/2023\/03\/22\/how-is-indias-brilliant-big-data-processing-team-engineering-salesforce-data-cloud\/","url_meta":{"origin":262,"position":4},"title":"How is India\u2019s Brilliant Big Data Processing Team Engineering Salesforce Data Cloud?","date":"March 22, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the life experiences and career paths that have shaped Salesforce engineering leaders. Meet Archana Kumari, one of Salesforce\u2019s first India-based woman engineering leaders. In her role, Archana leads Salesforce India\u2019s Data Cloud big data processing compute layer team \u2014 charged with providing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":225,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/flow-scheduling-for-the-einstein-ml-platform\/","url_meta":{"origin":262,"position":5},"title":"Flow Scheduling for the Einstein ML Platform","date":"February 2, 2021","format":false,"excerpt":"At Salesforce, we have thousands of customers using a variety of products. Some of our products are enhanced with machine learning (ML) capabilities. With just a few clicks, customers can get insights about their data. 
Behind the scenes, it\u2019s the Einstein Platform that builds hundreds of thousands of models, unique\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/262","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=262"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/262\/revisions"}],"predecessor-version":[{"id":443,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/262\/revisions\/443"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=262"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=262"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=262"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}