{"id":550,"date":"2022-03-10T15:22:53","date_gmt":"2022-03-10T15:22:53","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/03\/10\/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality\/"},"modified":"2022-03-10T15:22:53","modified_gmt":"2022-03-10T15:22:53","slug":"einstein-evaluation-store-beyond-metrics-for-ml-ai-quality","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/03\/10\/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality\/","title":{"rendered":"Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML\/AI Quality"},"content":{"rendered":"<h3>Einstein Evaluation Store\u200a\u2014\u200aBeyond Metrics for ML\/AI\u00a0Quality<\/h3>\n<p>An important transition is underway in machine learning (ML) with companies gravitating from a research-driven approach towards a more engineering-led process for applying intelligence to their business problems. We see this in the growing field of ML operations, as well as in the shift in skillsets that teams need to successfully drive ML and artificial intelligence (AI) end-to-end (from data scientists, to data and ML engineers). The need for maturity and robustness has only accelerated recently, as intelligent solutions are applied more and more into critical business areas and core products.<\/p>\n<p>Salesforce is a B2B company that caters to the enterprise space. We have unique use cases in machine learning products for our business customers, who, in many ways, are relatively unsophisticated in regards to ML fluency. We also prioritize automating our ML products and making them as robust as possible. We have to, because we typically deploy at least a model per customer; since we have tens of thousands of customers, we quickly run into a fundamental challenge. For their lifecycle, training, validation, monitoring and so forth, we need to deal with hundreds of thousands of managed pipelines and models. Those intimidating numbers are a forcing function in how we see ML automation and how we enable it at scale for all of our tenants\/customers. In particular, we also focus on automation for building models (autoML), given that it is impossible to have teams that handcraft models for that many customers (which is still the typical process in most of our industry). Having no human in the loop while building and training models requires an extra layer of automation in every other aspect of the lifecycle; in particular, how we approach the <em>quality<\/em> of those models, which ultimately determines what gets deployed, what gets left behind, and for how\u00a0long.<\/p>\n<p>We have built and operated sophisticated systems for supporting data (Data Lakes, Feature Store), and model management (Model Store) for some time. Recently, we introduced another critical component covering an often overlooked aspect of the ML lifecycle: <strong>testing. <\/strong>Enter our <em>Evaluation Store<\/em>. Our philosophy for ML is that the ML lifecycle is just a specific manifestation of the software engineering lifecycle, with the twist of data thrown in. We think of the process of building and deploying models in similar terms to building and deploying software, in particular on the aspect of quality assurance. 
Most ML platform solutions deal with *quality* via a combination of *metric* collection/calculation systems (typically hooked into classic metrics/observability components like OpenTSDB, Datadog, Splunk, Grafana, etc.) and a set of batch metric datasets/stores for offline or ad hoc querying via query engines (Presto, Athena, etc.). While we also care deeply about metrics and treat them as core entities, the sheer number of models we work with has forced us to think about quality in a much more structured way. Namely, we think in terms of tests/evaluations that have a clear *pass/fail* outcome, very similar to how we view software engineering quality in terms of test coverage, test types, and **behavioral scenarios**. In our case, the primary approach to quality is to define and express it as a **series/suite of tests** (not metrics), involving *invariants* in data, code, parameters, and threshold functions, along with clearly expressed expectations.

*Figure: the Evaluation Store in the canonical ML lifecycle*

#### Metric- vs. Test-Centric ML — The Paradigm Shift

The shift of perspective from metrics to tests has major benefits in multiple areas of the ML lifecycle. Capturing tests in a structured way for each model/pipeline allows us to consistently track what test *data* was used, including test *segments* within datasets that have particular requirements, and to monitor performance along multiple dimensions for these entities. It also allows us to consistently track the more intangible aspects of tests that would otherwise be lost if we thought purely in terms of simple metrics. This includes aspects like the tester's *expectations*, sometimes represented as static or historical thresholds, as well as the test code itself (what test/benchmark are we performing exactly, and which other models are running the exact same test?).

The contrast between metrics and tests goes deeper. A metric on its own, such as an OCR model reporting 95% accuracy, does not tell ML engineers much. It is the context, whether implicit or explicit, associated with that metric that makes it meaningful. In the OCR example, we need to know, among other things, what dataset the metric came from, say MNIST, the standard dataset of handwritten digits. Anyone who has trained simple models on MNIST can tell you that 95% is not necessarily a good result, because they can compare it with previous results on the same dataset. The point is that it is always *the comparison* with some *expectation*, rather than the raw metric, that tells the tester whether something is of good quality and meets the desired criteria. Metrics should, in that sense, be considered raw data that require *interpretation*. In clear terms:

*Metric-centric ML platforms are human driven/piloted.*

The key reason the majority of ML platforms take a metric-centered approach to quality is their deeply embedded assumption of catering to a small, human-manageable number of models. That assumption originates mostly from a research-centric worldview. On the other hand:

*Test/evaluation-centric ML platforms allow full automation and can be driven on autopilot.*

The reason underpinning this stark contrast between a *metric-centered* and a *test-centered* approach is simple: metrics always require (human) interpretation to be actionable, while tests are immediately actionable by automated processes (e.g., validation, deployment gates, experiment progressions, monitoring). Test-centered ML platforms are closer to the world of engineering products driving business cases, especially at industrial scale; in that sense, they reflect a much more engineering-centric perspective. At Salesforce this represents a fundamental paradigm shift, and it lies at the core of our ML platform philosophy, visible throughout our ML stack and quality management processes.
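To make the distinction concrete, here is a minimal sketch (the numbers and names are invented for illustration): a raw metric leaves the verdict to a human, while a test encodes the expectation next to the metric and yields an outcome that automation can act on.

```python
# Hypothetical numbers for illustration; in practice they would come from
# scoring a model against a held-out dataset such as MNIST.
accuracy = 0.95   # the raw metric: a number with no verdict attached
baseline = 0.98   # the context: e.g., prior results on the same dataset

# Metric-centric: a human must look at 0.95, recall what "good" means on
# this dataset, and decide. Nothing downstream can act on the number alone.
print(f"accuracy = {accuracy:.2f}")

# Test-centric: the expectation is encoded next to the metric, so the
# outcome is a pass/fail that a validation or deployment gate can consume.
passed = accuracy >= baseline
print("PASS" if passed else "FAIL")  # -> FAIL
```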
We also have many cases where we don't have the luxury of lots of test data and can only test against specific, highly curated *prototypes*: single representative points. Here the outcome is not an aggregate metric but a label. Tests on prototypes again challenge the reliance on metrics for determining quality. As a result, our store can express *datasets*, *segments*, and *prototypes* as integral parts of the evaluation context. These various levels of *data coverage*, along with curated scenarios and sequences of evaluation tasks, **provide drill-down** insights and more holistic, interpretable assessments of quality that are easily lost with aggregate metrics.

Through the Evaluation Store, our ML engineers express their tests/evaluations (we use the two words as synonyms) via a Python library whose interfaces resemble those of vanilla testing libraries in software development. In fact, our library is inspired by, and borrows concepts heavily from, Jest (a modern JavaScript testing library). The library captures the inputs and results of the tests, along with other metadata, and takes care of reporting them to the Evaluation Store via APIs. The key concepts we organize our testing around are:

- **Evaluation**: maps to an entire test suite/scenario covering a full *behavioral* suite; the direct analogue in software testing is the **test suite**. Among other metadata, we capture here, in a structured way, the *datasets* used and the code used for the test, plus what is being tested (i.e., models, pipelines, or data).
- **Task(s)**: a particular check; here we also capture which *data segments* or *prototypes* we are performing the task on. Segments represent interesting subsets of a dataset (for example, all women aged 18–35). A task corresponds to a single test in software testing terms.
- **Expectation(s)**: typically a boolean expression contained within a task. Parameters and variables can be referenced here, to be materialized during execution/runtime and compared against what the tester expects. We have a few types of expectation definitions, mostly guided by simple DSLs (e.g., SQL flavors and similar).

*Figure: a Python code snippet expressing an evaluation; note the similarities with unit testing*
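The original snippet is not reproduced here, but as a rough stand-in, the following is a minimal sketch of what such a Jest-inspired evaluation could look like. The `Evaluation` class and its `task` method are our invention for illustration, not the actual API of the library.

```python
# Illustrative only: a tiny, self-contained stand-in for a Jest-style
# evaluation library. The Evaluation class and task method are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Evaluation:
    """Maps to a test suite: captures what is tested and on which dataset."""
    name: str
    dataset: str                      # the dataset is part of the test context
    results: dict = field(default_factory=dict)

    def task(self, name, passed, segment=None):
        """A single check, optionally scoped to a data segment or prototype."""
        key = f"{name}[{segment}]" if segment else name
        self.results[key] = "PASS" if passed else "FAIL"

# Metrics would normally come from scoring the model; hardcoded here.
acc_overall, acc_segment = 0.92, 0.83

suite = Evaluation(name="churn_model_v7_validation", dataset="churn_holdout_q1")

# Expectations are boolean expressions: the threshold travels with the test.
suite.task("accuracy_beats_baseline", acc_overall >= 0.90)
suite.task("accuracy_on_segment", acc_segment >= 0.80, segment="women_18_35")

print(suite.results)
# {'accuracy_beats_baseline': 'PASS', 'accuracy_on_segment[women_18_35]': 'PASS'}
```

Because every result is a structured pass/fail tied to a named dataset, segment, and expectation, a downstream process can consume the suite's outcome directly, with no human reading of raw metrics required.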
#### Validation, Experimentation and Monitoring via ML Tests

With the Evaluation Store as a dedicated system for representing structured tests, processes in ML that we typically think of as isolated, like offline *validation*, online *experimentation*, and (model/drift) *monitoring*, can all be expressed via tests and their relation to the deployment process. These processes are all about managing *quality*; they are underpinned by testing techniques and concepts. More importantly, they should be seen as a spectrum of quality checking and management processes that models progressively go through during their lifecycle journey, with tests as the foundational unifying primitive. Online experimentation cases (A/B tests, champion/challenger, etc.) are nothing but a deployment of multiple model versions that incorporates tests for progression. Monitoring for data drift can be expressed as a continuously running test (say, for KL divergence) after a model has been deployed, which *retrains* or *alarms* if it fails its expectation, thereby guiding the deployment lifecycle. Our perspective is that *deployments in ML represent complex processes* (we think of them as pipelines/flows), not singular atomic events; they ultimately have a lifecycle. A simple staggered deployment, for instance, involves deploying to 5% of traffic, then testing and proceeding to 10%, then 50%, and so on. Such a deployment is not immediate: it has a lifecycle involving repeated, time-delayed consultations with automated tests. Similarly, drift monitoring, a classic model monitoring use case, can be expressed as continuously or periodically scheduled tests that affect the deployment lifecycle when test expectations are not met.

*Figure: a conceptual A/B deployment lifecycle*
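As a sketch of the drift-monitoring case (the binning, threshold, and helper name are our own choices, not production code), a scheduled test could compare the live distribution of a feature against its training distribution and fail its expectation when the KL divergence grows too large:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def drift_test(train_values, live_values, threshold=0.1, bins=20):
    """A schedulable drift test: PASS while live data resembles training data."""
    # Histogram both samples over the same bins, derived from the training range.
    edges = np.histogram_bin_edges(train_values, bins=bins)
    p, _ = np.histogram(train_values, bins=edges)
    q, _ = np.histogram(live_values, bins=edges)
    # Add-one smoothing: KL is undefined where a bin count is zero.
    p = (p + 1) / (p.sum() + bins)
    q = (q + 1) / (q.sum() + bins)
    kl = entropy(q, p)  # divergence of live traffic from the training reference
    return kl <= threshold, kl

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)  # feature distribution at training time
live = rng.normal(1.0, 1.0, 10_000)   # live traffic has drifted upwards
passed, kl = drift_test(train, live)
if not passed:  # the failed expectation drives the deployment lifecycle
    print(f"FAIL: KL divergence {kl:.2f} over threshold; retrain or alarm")
```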
#### Human Evaluations/Evaluators

The Evaluation Store also allows us to express human-in-the-loop decisions from end customers or even auditors, people who may not be involved at all in building the model but who understand very well what good and bad predictions look like for their business cases. Human evaluations typically involve *interactive evaluation sessions* with models, where just a few *samples* are tested, one at a time, each representing a particular *behavioral case* from the customer. Cases like this are classic examples where the evaluation is based not on metrics but on raw predictions, whose outcome determines the human decision (*pass/fail*). Evaluators can give a thumbs up or thumbs down to the model versions we enable for these tests. The separation of stakeholders, the difference between **who builds** and **who tests**, is critical in our view and represents the future of testing in ML. Customers, whether in an enterprise setting or not, may not be aware of the latest transformer models, but they know very well what a good versus a bad prediction looks like for them. As far as *tests* go, customers can be very sophisticated despite their lack of ML fluency; in testing, they are right at home and can speak their own business language. Hence, we see a strong case for customers providing tests, along with test data, themselves, rather than relying solely on whoever builds their models.

*Figure: humans in the loop and the Evaluation Store*

#### Evaluations of Models, Data, Pipelines and Deployments

An obvious benefit is that we can leverage the same concepts to establish not just model but also **data** quality standards and checks, where expressing quality via tests has the same benefits as it does for models or pipelines (executable artifacts, in software terms). Data quality is a critical ingredient for good models and for a data-centric approach to ML. Data tests can be performed in a structured way before training, or even during the sourcing process in our data systems, to prevent problems in downstream tasks like training or the generation of derived datasets. Examples of data tests relevant for downstream tasks include checking that datasets have a sufficient number of samples for training, checking for missing values in important features beyond historical norms, and more sophisticated tests around significant shifts in distributions. Similar to models and data, the Evaluation Store supports evaluations of entire pipelines, as well as of what we call multi-tenant deployments, which represent deployments of hundreds or even thousands of different models. These capabilities give us a representation of quality that goes beyond just individual models.

*Figure: data validation*
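As a sketch of such pre-training data tests (column names and thresholds are invented; a real check would compare against stored historical norms), row counts and missing-value rates can be expressed in the same pass/fail shape:

```python
import numpy as np
import pandas as pd

def data_tests(df, min_rows=1_000, max_null_frac=0.05, key_features=("age", "spend")):
    """Pre-training data checks; each yields a pass/fail, not just a number."""
    results = {"row_count": len(df) >= min_rows}  # enough samples to train at all
    for col in key_features:
        # Missing values in important features must stay within historical norms.
        results[f"nulls[{col}]"] = bool(df[col].isna().mean() <= max_null_frac)
    return results

# Toy dataset: plenty of rows, but 'spend' has too many missing values.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 5_000),
    "spend": np.where(rng.random(5_000) < 0.12, np.nan, rng.random(5_000) * 100),
})
print(data_tests(df))
# {'row_count': True, 'nulls[age]': True, 'nulls[spend]': False}
```

A failing check here would block the downstream training or dataset-generation task, exactly as a failing model test would block a deployment stage.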
The richness of potential checks for ML applications and the large number of resulting tests, including functional tests as well as statistical ones over different data layers (segments, prototypes, full datasets), call for standardizing test quality along a few dimensions. Specifically, we express tests along the following axes:

- **performance** tests (e.g., accuracy- or RMSE-based)
- **robustness** tests (perturb inputs and check for changes in outputs, …)
- **privacy** tests (do we leak private or PII information?)
- **security** tests (*adversarial attacks*, poisonous observations, …)
- ***fairness/ethical*** tests (e.g., under- or overrepresented segments)
- ***cost*** tests (memory, CPU, monetary benchmarks)

These major dimensions are an initial set, and the list may grow; however, we consistently structure quality along these six factors. We do this in particular to shift away from a unidimensional focus in evaluations (i.e., accuracy/performance) towards a more holistic view of model quality, one that allows ML engineers to assess the *real world* behavior of their models and pushes them towards creative test scenarios.

*Figure: structuring quality along six axes*

Venturing into the future, we think that testing ML/AI will only grow in complexity and sophistication. Yet, at the same time, tests on models/AIs also need to become more *transparent* and *accessible* to customers in order to increase their trust in our intelligent solutions. As models are deployed in increasingly critical and high-stakes settings (health care, finance, autonomous driving/AVs, etc.), we cannot afford an unstructured approach to quality (would relying on just the train/test data split be good enough for AVs? How are they being tested?). While our industry currently seems intent on chasing (what may be marginal) improvements via bigger models, we also think that the process of structured testing may reveal important insights that lead to significant improvements. More structured testing and test tooling is the area we see as the most promising for establishing machine learning firmly as an *engineering discipline* and elevating it from its research roots.

Both the data requirements and the procedure for testing ML/AI agents could come to resemble testing processes for human agents, where the tested and the tester represent different stakeholders and the process consists of going through a list of questions/tasks to reach a decision. This sounds a lot like a typical human interview. Because of the difference in stakeholders, we also should not assume that tests will always have access to a global view of the data behind what is being tested. Instead, we will need to use or generate very few data points, and increasingly rely on trusted credentials.

Ultimately, we think there are major benefits to expressing ML quality in terms of tests and test suites, as opposed to just metrics. The difference between the two determines how ML platforms are shaped: reliance on *metrics* gears them towards **scientific labs**, churning out a few highly curated, high-human-touch ML models, while reliance on *tests* structures them as **industrial factories**, churning out very large numbers of models (think millions!). We can, at the very least, adopt and borrow from the experience of software engineering and the testing processes and concepts it has refined over several decades. The Evaluation Store, and its foundation in testing concepts, frames the ML quality strategy for our platform.