{"id":228,"date":"2021-02-02T20:02:39","date_gmt":"2021-02-02T20:02:39","guid":{"rendered":"https:\/\/fde.cat\/?p=228"},"modified":"2021-02-02T20:02:39","modified_gmt":"2021-02-02T20:02:39","slug":"realtime-predictions-in-a-multitenant-environment","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/realtime-predictions-in-a-multitenant-environment\/","title":{"rendered":"Realtime Predictions in a Multitenant Environment"},"content":{"rendered":"<h3>Real-time Predictions in a Multitenant Environment<\/h3>\n<h3>Introduction<\/h3>\n<p>The Einstein Vision and Language Platform Team at Salesforce enables <a href=\"https:\/\/engineering.salesforce.com\/deep-learning-dataset-management-system-at-scale-571532d0d200\">data management<\/a>, <a href=\"https:\/\/engineering.salesforce.com\/training-experimentation-a-next-generation-generic-ml-training-and-data-science-platform-for-dcad8c4621b\">training<\/a>, and prediction for deep learning-based <a href=\"https:\/\/einstein.ai\/products\">Vision and Language<\/a> use cases. Consumers of the platform can use our <a href=\"https:\/\/metamind.readme.io\/\">API gateway<\/a> to upload datasets, train those datasets, and ultimately use the models generated out of training to run predictions against a given input payload (for example, input text during a live chat). The platform is multitenant and supports all the above operations for different customers simultaneously.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*MmBpFJCqtYNUS-Z3-_ICLw.gif?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><figcaption>Prediction Service in Action\u200a\u2014\u200alive chat text input is sent to Prediction Service to understand customer intention.<\/figcaption><\/figure>\n<p>Prediction Service is one of the Einstein Vision and Language Platform services that runs behind the scenes. As the name suggests, its job is to serve predictions. 
The nature of the predictions is real-time. For example, a customer may train a custom model to learn different intents from a set of chat messages. Then, during a live chat session, the utterances are sent to the Prediction Service, which uses the customer-trained model to identify and return the intent of those utterances. Based on the intent returned by the model, the chatbot makes the appropriate decision (for example, checking the status of an order). Failing to return a prediction result within the Service-Level Agreement (SLA) window may result in a poor end-user experience.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*_Vr3xFjJqU905XqtFSmtkg.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><figcaption><em>Figure 1.0: Einstein Vision and Language Platform Prediction Functional Overview<\/em><\/figcaption><\/figure>\n<p>In addition to being real-time, predictions are multitenant: multiple customers train one or more models and use them to get predictions. Together, the real-time and multitenant aspects of predictions make them an interesting challenge to solve. The real-time aspect implies that the time taken to serve a prediction request must be low; for us it depends on the use case, with latency SLAs typically ranging from 150 ms to 500 ms. The multitenancy aspect means managing models for thousands of tenants in a way that minimizes model transfers from disk to memory and from the file store (AWS S3) to disk.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*PfGvLVPTpffCkLJg6fcCqg.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><figcaption><em>Figure 1.1: Multiple Tenants Run Predictions Simultaneously<\/em><\/figcaption><\/figure>\n<h3>Typical Prediction Request\/Response<\/h3>\n<p>Before we dive deeper, let\u2019s take a look at a typical prediction request. 
This is what our service looks like to the consumer.<\/p>\n<h4>Request<\/h4>\n<p>The following example shows the request parameters for a language model (<a href=\"https:\/\/metamind.readme.io\/reference#prediction-intent\">intent classification<\/a>) prediction request. The modelId is the unique ID that\u2019s returned after you train a dataset and create a model. The document parameter contains the input payload for prediction.<\/p>\n<pre>{<br>  \"modelId\": \"2ELVAO5BNVGZLBHMYTV7CGD5AY\",<br>  \"document\": \"my password stopped working\"<br>}<\/pre>\n<h4>Response<\/h4>\n<p>After you send the request via an API call, you get a response that looks like the following JSON. The response consists of the predicted label (alternatively called a category) for the input payload. Along with each label, the model returns a probability, which indicates the confidence of the prediction.<\/p>\n<p>In the response below, the model is 99.04% confident that the input, \u201cmy password stopped working,\u201d belongs to the label Password Help. 
Similarly, the rest of the probability scores indicate that the input doesn\u2019t belong to the other\u00a0labels.<\/p>\n<pre>{<br>  \"probabilities\": [<br>    {<br>      \"label\": \"Password Help\",<br>      \"probability\": 0.99040705<br>    },<br>    {<br>      \"label\": \"Order Change\",<br>      \"probability\": 0.003532466<br>    },<br>    {<br>      \"label\": \"Shipping Info\",<br>      \"probability\": 0.003473858<br>    },<br>    {<br>      \"label\": \"Billing\",<br>      \"probability\": 0.0024010758<br>    },<br>    {<br>      \"label\": \"Sales Opportunity\",<br>      \"probability\": 0.00018560764<br>    }<br>  ],<br>  \"object\": \"predictresponse\"<br>}<\/pre>\n<p>Sending a prediction request is just a simple cURL\u00a0call.<\/p>\n<pre>curl -X POST \\<br>-H \"Authorization: Bearer &lt;TOKEN&gt;\" \\<br>-H \"Cache-Control: no-cache\" \\<br>-H \"Content-Type: application\/json\" \\<br>-d '{\"modelId\":\"2ELVAO5BNVGZLBHMYTV7CGD5AY\",\"document\":\"my password stopped working\"}' \\<br><a href=\"https:\/\/api.einstein.ai\/v2\/language\/intent\">https:\/\/api.einstein.ai\/v2\/language\/intent<\/a><\/pre>\n<p>You can quickly <a href=\"https:\/\/metamind.readme.io\/docs\/what-you-need-to-call-api\">sign up for free<\/a> and build your own model using your own data or one of the sample datasets in our <a href=\"https:\/\/metamind.readme.io\/reference\">API reference<\/a>. Or try one of our <a href=\"https:\/\/metamind.readme.io\/docs\/use-pre-built-models\">pre-built<\/a> models. 
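<\/p>
<p>On the consuming side, acting on this response is just a matter of selecting the highest-probability entry. The following is a minimal sketch in Python; the <code>top_intent<\/code> helper and the confidence threshold are illustrative, not part of the API:<\/p>

```python
# Example response from the intent prediction endpoint, already parsed.
response = {
    'probabilities': [
        {'label': 'Password Help',     'probability': 0.99040705},
        {'label': 'Order Change',      'probability': 0.003532466},
        {'label': 'Shipping Info',     'probability': 0.003473858},
        {'label': 'Billing',           'probability': 0.0024010758},
        {'label': 'Sales Opportunity', 'probability': 0.00018560764},
    ],
    'object': 'predictresponse',
}

def top_intent(resp, threshold=0.5):
    # Pick the label with the highest probability; treat low-confidence
    # predictions as "no intent" so the caller can fall back gracefully.
    best = max(resp['probabilities'], key=lambda p: p['probability'])
    return best['label'] if best['probability'] >= threshold else None

print(top_intent(response))  # Password Help
```

<p>A chatbot would then map the returned label to an action, for example routing a Password Help intent to a password-reset flow.<\/p>
<p>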
Feel free to drop a note in the comments and let us know about your experience using our\u00a0API.<\/p>\n<h3>Lifecycle of a Prediction Request<\/h3>\n<p>To understand how predictions are handled in a multitenant environment, let\u2019s look at the lifecycle of a single prediction request.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*YxIBL1WGPyLcxAOV3J1pcg.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><figcaption><em>Figure 2.0: Lifecycle of a Prediction Request<\/em><\/figcaption><\/figure>\n<p>After you train a dataset, the model artifact is stored on an AWS S3 file store. The model metadata (for example, model ID and S3 location) is stored in the database via the <a href=\"https:\/\/engineering.salesforce.com\/deep-learning-dataset-management-system-at-scale-571532d0d200\">Data Management Service (DM)<\/a>. To get the location of the model, Prediction Service queries the DM service using the customer\u2019s tenant ID and model ID. DM returns a pre-signed S3 URL with a temporary access\u00a0token.<\/p>\n<p>Prediction Service then downloads the model tar from AWS S3 and stores it on the Elastic File System (EFS) disk (if the model does not already exist on the disk). The model tar is then extracted and the model is loaded into memory. Finally, the pre-processed input is run through the model (a neural network) and the output is returned.<\/p>\n<p>An example of pre-processing is tokenization and embedding lookup. Typically, the prediction input (e.g., live chat) for language models (e.g., intent classification or Named Entity Recognition) is first tokenized, that is, split into words. The words are then used to look up their vector representations. This lookup table of words and their vector representations is called an embedding; <a href=\"https:\/\/nlp.stanford.edu\/projects\/glove\/\">GloVe<\/a> is one type of embedding. 
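<\/p>
<p>In code, this two-step pre-processing (tokenize, then look up vectors) can be sketched as follows. This is a toy illustration with a made-up three-dimensional table and a hypothetical <code>preprocess<\/code> helper; real embedding tables such as GloVe map each word to a vector of 50 to 300 dimensions:<\/p>

```python
# Toy embedding table: word -> vector. Real tables (e.g., GloVe) hold
# hundreds of thousands of words with 50-300 dimensions each.
EMBEDDINGS = {
    'my':       [0.11, -0.23, 0.05],
    'password': [0.52,  0.31, -0.44],
    'stopped':  [-0.17, 0.08, 0.29],
    'working':  [0.36, -0.12, 0.41],
}
UNK = [0.0, 0.0, 0.0]  # fallback vector for out-of-vocabulary words

def preprocess(text):
    # Step 1: tokenization -- split the input into words.
    tokens = text.lower().split()
    # Step 2: embedding lookup -- one vector per token.
    return [EMBEDDINGS.get(token, UNK) for token in tokens]

vectors = preprocess('my password stopped working')
print(len(vectors), len(vectors[0]))  # 4 tokens, 3 dimensions each
```

<p>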
This is what the vector representation of the word \u201csalesforce\u201d looks like (only the first five dimensions are shown):<\/p>\n<p>salesforce 0.23245 -0.31159 0.28237 0.10496 0.3087\u00a0&#8230;.<\/p>\n<p>The vector representation is then passed to the model for prediction. The model returns a set of labels and their probability scores, as shown in the example response\u00a0above.<\/p>\n<p><strong>Note:<\/strong> Embeddings are intentionally kept separate from the model artifact to reduce the memory and disk footprint, which factors into overall performance.<\/p>\n<h3>Prediction Service\u00a0Overview<\/h3>\n<p>The Einstein Vision and Language Platform runs on a Kubernetes cluster. All services are deployed using Kubernetes Deployments that launch multiple replicas of each service as Pods, which are exposed via cluster-scoped internal Kubernetes Service endpoints.<\/p>\n<p>The diagram below shows language and vision predictors running on the Kubernetes cluster. A predictor is a group of pods responsible for carrying out prediction (inference), fronted by a Kubernetes service. The API Gateway acts as a proxy to the Prediction Service and routes requests across different predictors.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*vzQljnAfT7FqVS9rXZIAGA.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><figcaption><em>Figure 3.0: Prediction Service\u00a0Overview<\/em><\/figcaption><\/figure>\n<h4>Predictor Pod<\/h4>\n<p>Let\u2019s dive deeper into the Predictor. The Predictor is a collection of pods fronted by a Kubernetes service that acts as a load balancer. 
Each pod in the Predictor consists of two containers: a model management <strong>sidecar<\/strong> container and an <strong>inference<\/strong> container.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*r3R62gA1FmlF74UcHkA-FQ.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><figcaption><em>Figure 3.1: Predictor Pod<\/em><\/figcaption><\/figure>\n<p><strong>sidecar container<\/strong><\/p>\n<ul>\n<li>The sidecar container is a RESTful server responsible for model management and business logic flows. It accepts requests from the API Gateway and downloads the model (if not already present) and related metadata from the DM service. This container communicates with the inference server over gRPC, providing model metadata (for example, location on disk) and the prediction input\u00a0payload.<\/li>\n<\/ul>\n<p><strong>inference container<\/strong><\/p>\n<ul>\n<li>The inference container is a gRPC server that accepts requests from the sidecar container. It performs pre-processing, prediction, and post-processing.<\/li>\n<\/ul>\n<p><strong>gRPC contract<\/strong><\/p>\n<ul>\n<li>The Einstein Vision and Language Platform team published a standard gRPC contract that provides a way for all Data Science and Applied Research teams at Salesforce to onboard to the platform. The gRPC contract abstracts all of the model download and business logic flows so that data scientists and researchers can focus purely on pushing the boundaries of model quality. 
An example of the gRPC protobuf definition used by one of our Intent Classification predictors looks like\u00a0this:<\/li>\n<\/ul>\n<pre>syntax = \"proto3\";<br><br>package com.salesforce.predictor.proto.intent.v1;<br><br>option java_multiple_files = true;<br>option py_generic_services = true;<br><br>\/*<br>* Intent Classification Predictor Service definition:<br>* accepts a model_id &amp; document and returns labels with<br>* probabilities for each.<br>*\/<br>service IntentPredictorService {<br>    rpc predict(IntentPredictionRequest) returns (IntentPredictionResponse) {}<br>}<br><br>message IntentPredictionRequest {<br>    string model_id = 1;  \/\/ id of the model<br>    bytes document = 2;   \/\/ document to classify<br>}<br><br>message IntentPredictionResponse {<br>    bool success = 1;<br>    string error = 2;<br>    repeated IntentPredictions probabilities = 3;<br>}<br><br>message IntentPredictions {<br>    string label = 1;      \/\/ predicted label<br>    float probability = 2; \/\/ probability score<br>}<\/pre>\n<h3>Conclusion<\/h3>\n<p>This blog post presents an overview of the Einstein Vision &amp; Language Platform and discusses the architecture and components of Prediction Service. The prediction request lifecycle walks through the operations carried out in real time to serve a prediction request, highlighting the challenges of meeting low-latency SLAs while managing hundreds of models from multiple tenants. We also briefly discuss gRPC protocol buffers as the means for smooth on-boarding and collaboration between Data Science and Platform teams. 
In a future blog post, we plan to discuss in detail our architecture for in-memory and on-disk model management and the challenges associated with\u00a0it.<\/p>\n<p>Please <a href=\"mailto:einstein.ai.platform@salesforce.com\">reach out<\/a> to us with any questions, or if there is something you\u2019d be interested in discussing that we haven\u2019t\u00a0covered.<\/p>\n<p><em>Follow us on Twitter: <\/em><a href=\"https:\/\/twitter.com\/SalesforceEng\"><em>@SalesforceEng<\/em><\/a><\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1000\/1*HKPAFnTx6TboNUXplvLF2A.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<h3>Acknowledgements<\/h3>\n<p>I would like to thank <a href=\"https:\/\/www.linkedin.com\/in\/shuliund\/\">Shu Liu<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/leo-z-25b79464\/\">Linwei Zhu<\/a>, and <a href=\"https:\/\/www.linkedin.com\/in\/gkrishnan34\/\">Gopi Krishnan Nambiar<\/a> for contributing to various aspects of Prediction Service design and implementation; <a href=\"https:\/\/www.linkedin.com\/in\/warnerjustin\/\">Justin Warner<\/a> for refactoring and improving the integration tests for a repeatable and isolated test environment; <a href=\"https:\/\/www.linkedin.com\/in\/shashankharinath\/\">Shashank Harinath<\/a> for championing the use of gRPC-based contracts for effective Data Science and Platform collaboration and on-boarding; and <a href=\"https:\/\/www.linkedin.com\/in\/vaishnavigalgali\/\">Vaishnavi Galgali<\/a> for helping with the migration to AWS EKS. 
Last but not least, thanks to <a href=\"https:\/\/www.linkedin.com\/in\/ivo\/\">Ivaylo Mihov<\/a> for the continuous support &amp; inspiration.<\/p>\n<hr>\n<p><a href=\"https:\/\/engineering.salesforce.com\/realtime-predictions-in-a-multitenant-environment-3b9018fdf63c\">Realtime Predictions in a Multitenant Environment<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Real-time Predictions in a Multitenant Environment Introduction The Einstein Vision and Language Platform Team at Salesforce enables data management, training, and prediction for deep learning-based Vision and Language use cases. 
Consumers of the platform can use our API gateway to upload datasets, train those datasets, and ultimately use the models generated out of training to&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/02\/02\/realtime-predictions-in-a-multitenant-environment\/\">Continue reading <span class=\"screen-reader-text\">Realtime Predictions in a Multitenant Environment<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-228","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":288,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/building-a-successful-enterprise-ai-platform\/","url_meta":{"origin":228,"position":0},"title":"Building a Successful Enterprise AI Platform","date":"August 31, 2021","format":false,"excerpt":"IntroductionIn 2016, I started as a fresh grad software engineer at a small startup called MetaMind, which was acquired by Salesforce. Since then, it has been quite a journey to achieve a lot with a small team. I\u2019m part of Einstein Vision and Language Platform team. Our platform provides customers\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":337,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/coordinated-rate-limiting-in-microservices\/","url_meta":{"origin":228,"position":1},"title":"Coordinated Rate Limiting in Microservices","date":"August 31, 2021","format":false,"excerpt":"The ProblemAny multitenant service with public REST APIs needs to be able to protect itself from excessive usage by one or more tenants. 
Additionally, as the number of instances that support these services is dynamic and varies based on load, the need arrises to perform coordinated rate limiting on a\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":229,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/ml-lake-building-salesforces-data-platform-for-machine-learning\/","url_meta":{"origin":228,"position":2},"title":"ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning","date":"February 2, 2021","format":false,"excerpt":"Salesforce uses machine learning to improve every aspect of its product suite. With the help of Salesforce Einstein, companies are improving productivity and accelerating key decision-making. Data is a critical component of all machine learning applications and Salesforce is no exception. In this post I will share some unique challenges\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":893,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/10\/metas-approach-to-machine-learning-prediction-robustness\/","url_meta":{"origin":228,"position":3},"title":"Meta\u2019s approach to machine learning prediction robustness","date":"July 10, 2024","format":false,"excerpt":"Meta\u2019s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta\u2019s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. 
To minimize disruptions and ensure\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":806,"url":"https:\/\/fde.cat\/index.php\/2023\/12\/19\/ai-debugging-at-meta-with-hawkeye\/","url_meta":{"origin":228,"position":4},"title":"AI debugging at Meta with HawkEye","date":"December 19, 2023","format":false,"excerpt":"HawkEye is the powerful toolkit used internally at Meta for monitoring, observability, and debuggability of the end-to-end machine learning (ML) workflow that powers ML-based products. HawkEye supports recommendation and ranking models across several products at Meta. Over the past two years, it has facilitated order of magnitude improvements in the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":550,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/10\/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality\/","url_meta":{"origin":228,"position":5},"title":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML\/AI Quality","date":"March 10, 2022","format":false,"excerpt":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML\/AI\u00a0Quality An important transition is underway in machine learning (ML) with companies gravitating from a research-driven approach towards a more engineering-led process for applying intelligence to their business problems. 
We see this in the growing field of ML operations, as well as in the shift\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/228","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=228"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/228\/revisions"}],"predecessor-version":[{"id":242,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/228\/revisions\/242"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=228"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=228"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=228"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}