{"id":887,"date":"2024-06-25T13:43:30","date_gmt":"2024-06-25T13:43:30","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/06\/25\/the-future-of-ai-testing-salesforces-next-gen-framework-for-ai-model-performance\/"},"modified":"2024-06-25T13:43:30","modified_gmt":"2024-06-25T13:43:30","slug":"the-future-of-ai-testing-salesforces-next-gen-framework-for-ai-model-performance","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/06\/25\/the-future-of-ai-testing-salesforces-next-gen-framework-for-ai-model-performance\/","title":{"rendered":"The Future of AI Testing: Salesforce\u2019s Next Gen Framework for AI Model Performance"},"content":{"rendered":"<p>In our \u201cEngineering Energizers\u201d Q&amp;A series, we explore the innovative minds shaping the future of Salesforce engineering. Today, we meet Erwin Karbasi, who leads the development of the Salesforce Central Evaluation Framework (SF Eval), a revolutionary internal tool used by Salesforce engineers to assess the performance of generative AI models.<\/p>\n<p>Explore how SF Eval addresses AI testing challenges, enhances application reliability, and incorporates user feedback to continuously improve AI model outputs.<\/p>\n<h5 class=\"wp-block-heading\"><strong>What is your AI team\u2019s mission?<\/strong><\/h5>\n<p>Our team ensures that Salesforce AI components, such as <a href=\"https:\/\/www.salesforce.com\/einsteincopilot\"><\/a><a href=\"https:\/\/www.salesforce.com\/einsteincopilot\">Einstein Copilot<\/a>, deliver outputs that are n<strong>ot only high-quality, reliable, and relevant but also ethically aligned<\/strong>. This commitment builds trust with our users, as we rigorously test these components through SF Eval. (Think of it as a chef taste-testing dishes before serving!) 
This robust framework helps us identify and address potential issues such as bias and inaccuracy, ensuring our AI tools exceed user expectations for quality and dependability.<\/p>\n<p><strong>SF Eval is a layered platform that integrates traditional machine learning metrics with AI-assisted metrics to assess the performance of AI models comprehensively<\/strong>. This includes evaluating various components like prompts, large language models (LLMs), and Einstein Copilot. By ensuring these components meet high standards of accuracy and relevance, we empower businesses with dependable AI tools.<\/p>\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-layout-constrained wp-container-core-group-is-layout-1 wp-block-group-is-layout-constrained\">\n<h5 class=\"wp-block-heading\"><strong>What are the challenges in evaluating the accuracy and relevance of generative AI and LLMs?<\/strong><\/h5>\n<p><strong>Ambiguity of Prompts<\/strong>: This can lead to irrelevant responses from the AI. To mitigate this, the team refines prompts for clarity and consistency, ensuring they are precise and less likely to generate off-target outputs.<\/p>\n<p><strong>Factual Accuracy of LLM Outputs<\/strong>: Salesforce tackles this challenge by incorporating real-time fact-checking mechanisms and human oversight. This dual approach allows for the verification of critical outputs, confirming that the information provided by the AI is both accurate and trustworthy.<\/p>\n<p><strong>Accuracy of Retrieved Information<\/strong>: The team addresses this by using reliable sources for data retrieval and implementing reranking algorithms. 
These algorithms validate the relevance and accuracy of the retrieved data, which is crucial for maintaining the integrity of AI-generated responses.<\/p>\n<\/div>\n<p>These strategies collectively enhance user trust and dependability in AI applications.<\/p>\n<\/div>\n<\/div>\n<h5 class=\"wp-block-heading\"><strong>What were the initial challenges in developing SF Eval?<\/strong><\/h5>\n<p>Initially, defining the appropriate metrics was a major hurdle. The team had to decide whether to adopt existing industry metrics or develop new ones tailored to their specific needs, such as CRM data relevance. This was crucial for ensuring the quality and relevance of the AI outputs.<\/p>\n<p>Integration posed another challenge, requiring seamless coordination between various components of Salesforce\u2019s extensive platform. This integration was essential for creating a cohesive framework that could support both internal applications and external user needs effectively.<\/p>\n<p>Lastly, addressing the needs of both internal and external customers was complex. The team aimed to create a unified platform that could cater to diverse user requirements, integrating seamlessly into their development pipelines. This required continuous feedback and adjustments to ensure the framework met all user expectations and enhanced their overall experience with Salesforce AI tools.<\/p>\n<p><em>The SF Eval ecosystem, accessible through SFDX \/ Einstein 1 Studio, allows users to perform gen AI evaluations and score them using metrics from the metrics hub.<\/em><\/p>\n<h5 class=\"wp-block-heading\"><strong>What is a key additional feature of SF Eval that addresses a major challenge in testing AI applications?<\/strong><\/h5>\n<p>One specific feature is its comprehensive benchmarking, prompt evaluation, and prompt improvement capabilities. This feature set is crucial for systematically assessing and enhancing AI performance. 
By using a variety of standardized tests and metrics, SF Eval can benchmark AI models against industry standards and best practices. This process helps identify strengths and weaknesses in the AI\u2019s output, providing a clear pathway for targeted improvements. Additionally, prompt evaluation and improvement ensure that the prompts used to test the AI are continuously refined for clarity, relevance, and effectiveness, resulting in higher quality AI interactions.<\/p>\n<p>Another critical feature focuses on the retrieval aspect, known as RAG (Retrieval-Augmented Generation). This feature implements context-aware evaluation models that enhance the AI\u2019s ability to generate contextually relevant responses across various domains and scenarios. By ensuring that the retrieved data is pertinent to the prompts, this feature addresses significant challenges related to the accuracy and relevance of AI-generated content.<\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-core-group-is-layout-4 wp-block-group-is-layout-constrained\">\n<h5 class=\"wp-block-heading\"><strong>How can SF Eval be utilized across different stages of its application, and what are the specific purposes for each?<\/strong><\/h5>\n<p>SF Eval is utilized in three key phases:<\/p>\n<p><strong>Development: <\/strong>During the development phase, SF Eval is employed to rigorously test and validate the initial prompts and strategic plans. This involves identifying and rectifying any potential errors or inefficiencies early in the process, ensuring that the foundational elements are robust and effective before moving to the next stages.<\/p>\n<p><strong>Benchmarking: <\/strong>In the benchmarking phase, SF Eval conducts a detailed comparative analysis of various LLMs based on key criteria such as accuracy, trustworthiness, performance metrics, and cost-effectiveness. 
This phase is crucial in helping decision-makers select the most appropriate LLM that aligns with the organization\u2019s specific CRM requirements and strategic goals.<\/p>\n<p><strong>Production: <\/strong>Once in production, SF Eval continuously monitors the deployed system to ensure it adheres to the quality standards established during the development phase. It detects any performance drifts or deviations, enabling timely adjustments to prompts or strategic plans. This continuous evaluation ensures the system remains efficient, reliable, and aligned with the desired outcomes in a real-world operational environment.<\/p>\n<\/div>\n<h5 class=\"wp-block-heading\"><strong>How does SF Eval enhance AI application reliability and performance from development through post-deployment?<\/strong><\/h5>\n<p>SF Eval is structured in layers, starting with ad hoc testing at the development stage, where developers can receive immediate feedback on AI outputs. This is followed by batch testing, which simulates real-world scenarios to assess outputs more comprehensively. The top layer involves runtime monitoring and observability, ensuring continuous assessment even after deployment. This multi-tiered approach allows for thorough testing and refinement of AI applications, ensuring they perform optimally in real-world settings.<\/p>\n<h5 class=\"wp-block-heading\"><strong>How does customer feedback influence the development of AI applications at Salesforce?<\/strong><\/h5>\n<p>A key SF Eval feature is its dynamic adaptability to customer feedback. This feature is crucial for refining AI outputs based on real-time user interactions. By incorporating feedback directly into the evaluation process, SF Eval can adjust prompts to enhance their relevance, clarity, and effectiveness. 
This feedback loop mechanism ensures that the AI applications remain aligned with user needs and expectations, significantly improving the responsiveness and adaptability of prompt-based interactions.<\/p>\n<p>Customer feedback significantly shapes the development of AI applications at Salesforce, ensuring that the tools not only meet but also adapt to user needs. For instance, feedback has led to the enhancement of sentiment analysis models to detect subtle emotions like frustration or confusion, thereby improving the effectiveness of customer support interactions.<\/p>\n<p>Salesforce\u2019s feedback loop mechanism within SF Eval facilitates this process of continuous improvement. This mechanism allows users to provide real-time feedback on AI outputs, which Salesforce integrates into ongoing AI development. This integration helps in making dynamic adjustments to AI models and algorithms based on user interactions and inputs, ensuring that the AI applications are practical, user-centric, and aligned with the evolving expectations of users.<\/p>\n<p>Customer feedback is not just influential but central to the iterative development process at Salesforce, fostering enhancements that refine user experience and application reliability.<\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-core-group-is-layout-5 wp-block-group-is-layout-constrained\">\n<h5 class=\"wp-block-heading\">Learn More<\/h5>\n<p>Hungry for more AI stories? 
Learn how Amazon SageMaker enhances Salesforce Einstein\u2019s LLM latency and throughput in <a href=\"https:\/\/engineering.salesforce.com\/revolutionizing-ai-how-sagemaker-enhances-salesforce-einsteins-large-language-model-latency-and-throughput\/\">this blog<\/a>.<\/p>\n<p>Stay connected \u2014 join our <a href=\"https:\/\/flows.beamery.com\/salesforce\/eng-social-2023\">Talent Community<\/a>!<\/p>\n<p>Check out our <a href=\"https:\/\/www.salesforce.com\/company\/careers\/teams\/tech-and-product\/?d=cta-tms-tp-2\">Technology and Product<\/a> teams to learn how you can get involved.<\/p>\n<\/div>\n<p>The post <a href=\"https:\/\/engineering.salesforce.com\/the-future-of-ai-testing-salesforces-next-gen-framework-for-ai-model-performance\/\">The Future of AI Testing: Salesforce\u2019s Next Gen Framework for AI Model Performance<\/a> appeared first on <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering Blog<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In our \u201cEngineering Energizers\u201d Q&amp;A series, we explore the innovative minds shaping the future of Salesforce engineering. Today, we meet Erwin Karbasi, who leads the development of the Salesforce Central Evaluation Framework (SF Eval), a revolutionary internal tool used by Salesforce engineers to assess the performance of generative AI models. 
Explore how SF Eval addresses&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/06\/25\/the-future-of-ai-testing-salesforces-next-gen-framework-for-ai-model-performance\/\">Continue reading <span class=\"screen-reader-text\">The Future of AI Testing: Salesforce\u2019s Next Gen Framework for AI Model Performance<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-887","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":884,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/21\/how-einstein-copilot-sharpens-large-language-model-outputs-and-redefines-ai-data-testing\/","url_meta":{"origin":887,"position":0},"title":"How Einstein Copilot Sharpens Large Language Model Outputs and Redefines AI Data Testing","date":"June 21, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we explore the paths of engineering leaders who have attained significant accomplishments in their respective fields. 
Today, we spotlight Armita Peymandoust, Senior Vice President of Software Engineering at Salesforce, who spearheads the development of Einstein Copilot, a conversational AI assistant for CRM that integrates\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":751,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/22\/how-is-einstein-gpt-shaping-the-future-of-salesforce-development-and-unleashing-developer-productivity\/","url_meta":{"origin":887,"position":1},"title":"How is Einstein GPT Shaping the Future of Salesforce Development and Unleashing Developer Productivity?","date":"August 22, 2023","format":false,"excerpt":"By Yingbo Zhou and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. Meet Yingbo Zhou, a Senior Director of Research for Salesforce AI Research, where he leads the team to develop the model for Einstein GPT for Developers\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":791,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/22\/how-is-einstein-shaping-the-future-of-salesforce-development-and-unleashing-developer-productivity\/","url_meta":{"origin":887,"position":2},"title":"How is Einstein Shaping the Future of Salesforce Development and Unleashing Developer Productivity?","date":"August 22, 2023","format":false,"excerpt":"By Yingbo Zhou and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. 
Meet Yingbo Zhou, a Senior Director of Research for Salesforce AI Research, where he leads the team to develop the model for Einstein for Developers, a\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":848,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/01\/unveiling-the-cutting-edge-features-of-ml-console-for-ai-model-lifecycle-management\/","url_meta":{"origin":887,"position":3},"title":"Unveiling the Cutting-Edge Features of ML Console for AI Model Lifecycle Management","date":"April 1, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we explore the journeys of engineering leaders who have made remarkable contributions in their fields. Today, we meet Venkat Krishnamani, a Lead Member of the Technical Staff for Salesforce Engineering and the lead engineer for Salesforce Einstein\u2019s Machine Learning (ML) Console. This vital tool\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":774,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/17\/how-the-einstein-team-operationalizes-ai-models-at-lightning-speed-and-massive-scale\/","url_meta":{"origin":887,"position":4},"title":"How the Einstein Team Operationalizes AI Models at Lightning Speed and Massive Scale","date":"October 17, 2023","format":false,"excerpt":"By Yuliya Feldman and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. Meet Yuliya Feldman, a Software Engineering Architect at Salesforce. 
Yuliya works on Salesforce Einstein\u2019s Machine Learning Services team, responsible for operationalizing AI models, which serves as\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":789,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/17\/how-the-einstein-team-operationalizes-ai-models-at-lightning-speed-and-massive-scale-2\/","url_meta":{"origin":887,"position":5},"title":"How the Einstein Team Operationalizes AI Models at Lightning Speed and Massive Scale","date":"October 17, 2023","format":false,"excerpt":"By Yuliya Feldman and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. Meet Yuliya Feldman, a Software Engineering Architect at Salesforce. Yuliya works on Salesforce Einstein\u2019s Machine Learning Services team, responsible for operationalizing AI models, which serve as\u2026","rel":"","context":"In 
&quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/887","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=887"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/887\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=887"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=887"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=887"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}