{"id":231,"date":"2021-02-02T19:58:10","date_gmt":"2021-02-02T19:58:10","guid":{"rendered":"https:\/\/fde.cat\/?p=231"},"modified":"2021-02-02T19:58:10","modified_gmt":"2021-02-02T19:58:10","slug":"journey-to-a-million-models","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/journey-to-a-million-models\/","title":{"rendered":"Journey to a million models"},"content":{"rendered":"<h3>Journey to a Million&nbsp;Models<\/h3>\n<p>The AIOps team in Salesforce started developing an anomaly detection system using the large amount of telemetry data collected from thousands of servers. The goal of this project was to enable proactive incident detection and bring down the mean time to detect (MTTD) and mean time to remediate (MTTR)<\/p>\n<p>Simple problem,&nbsp;right?<\/p>\n<p>Our solution was to create a forecasting model with the time-series observability data to predict the potential spikes or dips causing incidents in various system and application metrics.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*CnF2LsaP2afwG0eMowXHlg.png?w=750&#038;ssl=1\" alt=\"\" data-recalc-dims=\"1\"><\/figure>\n<p>Let\u2019s take a deep dive and validate our hypothesis.<\/p>\n<p>We decided to start with a univariate approach on a single metric which represents the average page load time for our customers.<\/p>\n<p>How many models do we need? Answering this question isn\u2019t&nbsp;simple.<\/p>\n<p>1 time-series metric <strong>=<\/strong> 1&nbsp;model<\/p>\n<p>Too easy? Unfortunately the real world is a difficult one.<\/p>\n<p>The inbound traffic pattern and seasonality is completely different for various regions (Asia Pacific\/North America\/Europe etc). We have to factor in this variable to generate good predictions.<\/p>\n<p>10 Regions <strong>x<\/strong> 1 time-series metric <strong>=<\/strong> 10&nbsp;models<\/p>\n<p>Are we sure that each datacenter serving a region has the same type of hardware specification?<\/p>\n<p>Not at all. Based on the procurement period, we keep updating the hardware specification. So, it\u2019s good to have host level&nbsp;models.<\/p>\n<p>10 Regions <strong>x<\/strong> ~10k hosts <strong>x<\/strong> 1 time-series metric <strong>=<\/strong> 100k&nbsp;models<\/p>\n<p>Are we done yet? Here comes the interesting part.<\/p>\n<p>Let\u2019s say we collect the metric in a minute interval and use that for training the model. Using the lowest granularity available is very important to find minor service degradations, but the downside of this approach is noisy alerts. Alerts caused by minor glitches are normally not actionable and do not cause any customer impact. One way to generate quality alerts from time-series data is to create an ensemble of models trained on data aggregated at different time intervals.<\/p>\n<img decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*tICDjrV_S1Ns0t27a9ZQFA.png?w=750&#038;ssl=1\" alt=\"\" data-recalc-dims=\"1\">\n<p>We generated models for the metric with data aggregated over 1 minute, 15 minutes, and 1 hour. The higher the interval, the higher the weightage we give to the models. 
<p>Let's count the number of models again:</p>
<p>10 regions <strong>x</strong> ~10k hosts <strong>x</strong> 1 time-series metric <strong>x</strong> ensemble of 3 models <strong>=</strong> 300k models</p>
<p>Starting from 1 model, we are now about to produce 300k models. But it's not 1 million yet, right? This is rough math for a univariate approach. A proactive incident detection system relies on many system and application metrics; once we add multiple metrics, the final number is the number of metrics times 300k.</p>
<h3>Serving models (fast)</h3>
<p>Now that we have highly accurate time-series models, the next challenge is model serving. The models are of no use if we can't serve them fast enough for real-time data. We had a couple of options here: use one of the open source platforms, or build one ourselves. As a first step, we evaluated <a href="https://mlflow.org/">MLflow</a> and <a href="https://www.kubeflow.org/">Kubeflow</a>. (There are many other open source products, but we decided to evaluate the most popular ones first.) Both platforms offer solid model serving options with a lot of features. But when we started this project in early 2019, both were at a very early stage and changing rapidly, so we parked those options and set out to build a simple way to serve models ourselves. Simple and fast is what we needed, not a lot of features.</p>
<p>The first step was to come up with a prediction abstraction for models generated by various libraries.</p>
<pre>
import abc

class Model(abc.ABC):

    @abc.abstractmethod
    def predict(self, model_id, data_points):
        """Predicts anomaly scores for time-series data."""
</pre>
<p>We implemented <em>predict</em> for all the distinct kinds of models that we use. Models are generated with various open source libraries as well as custom algorithms developed internally.</p>
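<p>The post doesn't show a concrete implementation, but as a hypothetical illustration, a subclass wrapping a simple rolling z-score detector might look like this (the class name, parameters, and scoring scheme are all assumptions for the sketch):</p>
<pre>
import pandas as pd

class ZScoreModel(Model):
    """Hypothetical implementation: scores each point by how far it
    deviates from a rolling mean, in units of rolling std deviation."""

    def __init__(self, window=60, threshold=3.0):
        self.window = window
        self.threshold = threshold

    def predict(self, model_id, data_points):
        values = pd.Series(data_points, dtype="float64")
        mean = values.rolling(self.window, min_periods=1).mean()
        std = values.rolling(self.window, min_periods=2).std().bfill()
        # Guard against zero std, then scale so that `threshold`
        # standard deviations maps to an anomaly score of 1.0.
        z = (values - mean).abs() / std.replace(0.0, 1.0)
        return (z / self.threshold).clip(upper=1.0)
</pre>
<p>Each model type gets its own subclass, so the serving layer only ever calls <em>predict</em> and never needs to know which library produced the model.</p>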
<p>The last step is to invoke the predict method from a simple web app. We used <a href="https://flask.palletsprojects.com/en/1.1.x/">Flask</a>. You can see the sample code below.</p>
<pre>
from flask import Flask, request, jsonify
import pandas as pd

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    """Predicts anomaly scores for time-series data."""
    request_df = pd.DataFrame(request.json["datapoints"])
    model_id = request.json.get("model_id", default_model)
    # Look up the concrete Model implementation for this model_id
    # (the factory/registry is elided in this sample).
    model = Model(model_id)
    prediction_df = model.predict(model_id, request_df)
    return jsonify({'anomaly_scores': prediction_df.to_dict('list')})
</pre>
<p>This simple code snippet can serve models, but we need to optimise the response latency to enable smooth predictions on real-time data. Let's see how we do that.</p>
<figure><img src="https://i0.wp.com/cdn-images-1.medium.com/max/517/1*bIbHAeoVR5p1CIH2_apdSQ.png?w=750&amp;ssl=1" alt=""></figure>
<p>Compressing the models and adding a caching layer was important. Fetching models from storage was slow, as some of the models are large. Compressing them reduced storage size and network transfer by more than 6x. With this simple system, we were able to smoothly serve ~1 million+ models generated from 10–15 time-series metrics.</p>
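<p>The post doesn't include the storage code; a minimal sketch of compressed model storage with an in-process cache might look like the following. The gzip-plus-pickle choice, function names, and cache size are assumptions for illustration.</p>
<pre>
import gzip
import pickle
from functools import lru_cache

def save_model(model, path):
    """Serialize a model and gzip-compress it before writing to storage."""
    with gzip.open(path, "wb") as f:
        pickle.dump(model, f)

@lru_cache(maxsize=10_000)
def load_model(path):
    """Decompress and deserialize a model. lru_cache keeps hot models
    in memory, so repeated predictions skip storage and the network."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
</pre>
<p>Because the cache is keyed by path, a model that changes on disk needs a new path or an explicit <em>load_model.cache_clear()</em>; with a million models, only the hot subset ever lives in memory.</p>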
<p><strong><em>This blog post is based on our experience and learnings from building an anomaly detection system. There are multiple ways of achieving this: you could instead encode each variable as a feature and train a single model, but the per-model approach produced much better alert quality for us.</em></strong></p>
<hr>
<p><a href="https://engineering.salesforce.com/journey-to-a-million-models-8233bc5d9eb7">Journey to a million models</a> was originally published in <a href="https://engineering.salesforce.com/">Salesforce Engineering</a> on Medium.</p>