{"id":736,"date":"2023-07-18T10:46:00","date_gmt":"2023-07-18T10:46:00","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2023\/07\/18\/how-can-apache-spark-windowing-supercharge-your-performance-and-simplify-coding\/"},"modified":"2023-07-18T10:46:00","modified_gmt":"2023-07-18T10:46:00","slug":"how-can-apache-spark-windowing-supercharge-your-performance-and-simplify-coding","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2023\/07\/18\/how-can-apache-spark-windowing-supercharge-your-performance-and-simplify-coding\/","title":{"rendered":"How Can Apache Spark Windowing Supercharge Your Performance and Simplify Coding?"},"content":{"rendered":"<p><em>By Matan Rabi and Scott Nyberg.<\/em><\/p>\n<p><a href=\"https:\/\/www.salesforce.com\/products\/einstein\/overview\/\" target=\"_blank\" rel=\"noopener\">Salesforce Einstein<\/a>\u2019s data scales generate terabytes of data each day to train, test, and observe AI models. By only utilizing one machine, this time-consuming process could take days or weeks to complete.<\/p>\n<p>In response, Einstein\u2019s <a href=\"https:\/\/engineering.salesforce.com\/how-is-salesforce-einstein-optimizing-ai-classification-model-accuracy\/\" target=\"_blank\" rel=\"noopener\">Machine Learning Observability Platform (MLOP) team<\/a> uses <a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Spark<\/a>, an open-source framework module for rapid large-scale data processing. Spark enables MLOP to perform distributed computations across multiple computation units (processor cores) given a set of resources, ranging from a Kubernetes cluster to VM clusters and more.<\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-1\">\n<p>Some of the computations and transformations MLOP team needs to do, include splitting a dataset into groups and:<\/p>\n<p>Adding a value <strong>for each row in each group<\/strong>. The value depends on other rows in the same group<\/p>\n<p>Selecting the <strong>last\/first\/Nth row in each group<\/strong>. In this case, the group order has a meaning<\/p>\n<\/div>\n<p>Apache Spark\u2019s SQL Module has a windowing feature that allows MLOP (and you!) to achieve those goals with less code and better performance.<\/p>\n<p>Read on to learn more about the added benefits of windowing and how to use them with practical use case examples.<\/p>\n<h4 class=\"wp-block-heading\">Adding a value for each row in each group within a table<\/h4>\n<p>Consider this fictional transactions dataset:<\/p>\n<p>Imagine you are adding a value to each transaction, which represents the percentage of the transaction amount out of the total transactions done in the same vendor. So in this case, our group would be transactions that has the same vendor.<\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-4\">\n<div class=\"wp-block-group is-layout-constrained wp-container-2\">\n<p>To address this use case without windowing, you would need to:<\/p>\n<p>Group by vendor, sum the amount, and insert it into a separate dataframe.<\/p>\n<p>Join the vendor column in the original dataframe with a newly aggregated dataframe<\/p>\n<p>Add a new column for the transaction amount (the original dataframe) \/ total vendor transaction amount (aggregated dataframe)<\/p>\n<\/div>\n<div class=\"wp-block-group is-layout-constrained wp-container-3\">\n<p>Let\u2019s see how it looks in code:<\/p>\n<\/div>\n<\/div>\n<p>Now, let\u2019s examine what would have happen if you use windowing.<\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-7\">\n<div class=\"wp-block-group is-layout-constrained wp-container-5\">\n<p>There are two stages to windowing:<\/p>\n<p><strong>Creating the window. <\/strong>This basically represents how you want each group to appear within the table.<br \/>You can use the <strong>partition by<\/strong> here clause if you want to divide your groups by a key, or <strong>order by<\/strong> if you want to create and order your group. Note that if you use order by, the entire table will be one group.<\/p>\n<p><strong>Applying a window function over the window.<\/strong> You want to apply this for a row in each group. This includes aggregate functions (i.e. min, max, count, sum, avg), ranking functions (i.e. rank, ntile), and analytic functions (i.e. first_value, last_value, nth_value, lag, lead). Ranking functions and analytic functions are usually effective when you order your group(s).<\/p>\n<\/div>\n<div class=\"wp-block-group is-layout-constrained wp-container-6\">\n<p>Got all that? Now, the steps to achieve your above use case with windowing would be:<\/p>\n<p>Create a window partitioned by vendor.<\/p>\n<p>Apply a sum on amount over the window.<\/p>\n<p>Divide the amount with sum on amount.<\/p>\n<\/div>\n<\/div>\n<p>Let\u2019s see how it looks in code:<\/p>\n<p>window = Window.partitionBy(&#8216;vendor&#8217;)<br \/>df = df.withColumn(&#8216;total_vendor_amount&#8217;, sum(&#8216;amount&#8217;).over(window))<br \/>df = df.withColumn(&#8216;percentage_out_of_total&#8217;, df.amount \/ df.total_vendor_amount)<\/p>\n<p>In both approaches, you\u2019ve reached the same outcome:<\/p>\n<p>By using windowing, you saved a join, did not create another dataset, and, as a special bonus, made your code neater. Of course this represents a tiny dataset, but you could imagine that for large datasets there could be a big difference in performance and efficiency.<\/p>\n<h4 class=\"wp-block-heading\">Select the first\/last\/Nth row in each group within a table<\/h4>\n<p>This example spotlights the need to obtain the highest transaction amount for each vendor. In that case, we need to order each group (transactions that has the same vendor) by amount and select the first transaction.<\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-8\">\n<p>If you wanted to achieve it without windowing, your steps would be:<\/p>\n<p>Group by vendor and select the max amount.<\/p>\n<p>Join the two tables, where vendor and transaction amount equals the max amount.<\/p>\n<\/div>\n<p>Your code would look like this:<\/p>\n<p>max_vendor_amount_df = <br \/>    original_df.groupBy(&#8216;vendor&#8217;).agg(max(&#8216;amount&#8217;).alias(&#8216;max_amount&#8217;))<\/p>\n<p>join_condition = [original_df.vendor == max_vendor_amount_df.vendor, <br \/>                  original_df.amount == max_vendor_amount_df.max_amount]<\/p>\n<p>joined_df = original_df.join(max_vendor_amount_df, join_condition, &#8216;inner&#8217;)<\/p>\n<p>Similar to the first use case, in this approach you have created an additional dataframe. Additionally, you used a join to get the wanted result.<\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-10\">\n<div class=\"wp-block-group is-layout-constrained wp-container-9\">\n<p>Conversely, let\u2019s examine the steps needed to achieve this while using windowing:<\/p>\n<p>Create a window partitioned on vendor, ordered by amount descending.<\/p>\n<p>Apply <strong>rank<\/strong> over the ordered window.<\/p>\n<p>Filter where <strong>rank<\/strong> equals to 1.<\/p>\n<\/div>\n<p>Let\u2019s see how it looks in code:<\/p>\n<\/div>\n<p>window = Window.partitionBy(&#8216;vendor&#8217;).orderBy(col(&#8216;amount&#8217;).desc())<br \/>df = df.withColumn(&#8216;rank&#8217;, rank().over(window))<br \/>df = df.where(df.rank == 1)<\/p>\n<p>Both would result in a similar dataframe while achieving the needed result:<\/p>\n<p>But, using windowing allowed you to avoid an unnecessary join, an additional dataframe creation, and a group by. In looking at the steps (aka \u201cverbal algorithm\u201d) for each approach, you will see that windowing allows you to also write a more intuitive code.<\/p>\n<h4 class=\"wp-block-heading\">MLOP\u2019s use case for windowing<\/h4>\n<p>Salesforce\u2019s MLOP team is responsible for machine learning models observability. This involves comparing the predictions made for customers vs the ground truth. In this manner, the team uses mathematical formulas to see where models excel and where they must improve.<\/p>\n<p>MLOP also segments the data to pinpoint any issues. For example, a model might be performing well on Caucasian men, aged 18-25 from Indiana but underperforms on Asian women, aged 26-35 from California.<\/p>\n<p>When does MLOP use windowing? It\u2019s usually when the team uses the ground truth, which could rely on user input that might change over time, such as predicting the priority of a case. In fact, for each case ID, there could be multiple \u201cground truths\u201d over time. Ultimately, for each case ID, MLOP takes the most updated ground truth.<\/p>\n<p>This traces back to the above second use case: In the ground truth table, we partitioned on case ID, ordered by event timestamp, applied rank window function and filtered for rows where rank == 1.<\/p>\n<h3 class=\"wp-block-heading\">Learn more<\/h3>\n<p>Hungry for more AI stories?<a href=\"https:\/\/engineering.salesforce.com\/automation-engineering-secrets-revealed-slashing-customer-processing-time-from-hours-to-seconds\/\"> <\/a><a href=\"https:\/\/engineering.salesforce.com\/how-is-salesforce-einstein-optimizing-ai-classification-model-accuracy\/\" target=\"_blank\" rel=\"noopener\">Check out this blog<\/a> for a behind the scenes look at the MLOP team.<\/p>\n<p>To learn more about Apache Spark and how to optimize your spark applications by understanding the concept of partitions, read this<a href=\"https:\/\/engineering.salesforce.com\/how-to-optimize-your-apache-spark-application-with-partitions-257f2c1bb414\/\"> blog<\/a>.<\/p>\n<p>Stay connected \u2014 join our<a href=\"https:\/\/careers.mail.salesforce.com\/w2?cid=7017y00000CRDS7AAP\"> Talent Community<\/a>!<\/p>\n<p><a href=\"https:\/\/www.salesforce.com\/company\/careers\/teams\/tech-and-product\/?d=cta-tms-tp-2\">Check out our Technology and Product teams<\/a> to learn how you can get involved.<\/p>\n<p>The post <a href=\"https:\/\/engineering.salesforce.com\/how-can-apache-spark-windowing-supercharge-your-performance-and-simplify-coding\/\">How Can Apache Spark Windowing Supercharge Your Performance and Simplify Coding?<\/a> appeared first on <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering Blog<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/how-can-apache-spark-windowing-supercharge-your-performance-and-simplify-coding\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\" rel=\"noopener\">Read More<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>By Matan Rabi and Scott Nyberg. Salesforce Einstein\u2019s data scales generate terabytes of data each day to train, test, and observe AI models. By only utilizing one machine, this time-consuming process could take days or weeks to complete. In response, Einstein\u2019s Machine Learning Observability Platform (MLOP) team uses Apache Spark, an open-source framework module for&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2023\/07\/18\/how-can-apache-spark-windowing-supercharge-your-performance-and-simplify-coding\/\">Continue reading <span class=\"screen-reader-text\">How Can Apache Spark Windowing Supercharge Your Performance and Simplify Coding?<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-736","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":572,"url":"https:\/\/fde.cat\/index.php\/2022\/05\/05\/how-to-optimize-your-apache-spark-application-with-partitions\/","url_meta":{"origin":736,"position":0},"title":"How to Optimize Your Apache Spark Application with Partitions","date":"May 5, 2022","format":false,"excerpt":"In Salesforce Einstein, we use Apache Spark to perform parallel computations on large sets of data, in a distributed manner. In this article, we will take a deep dive into how you can optimize your Spark application with partitions. Introduction Today, we often need to process terabytes of data per\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":580,"url":"https:\/\/fde.cat\/index.php\/2022\/05\/05\/how-to-optimize-your-apache-spark-application-with-partitions-2\/","url_meta":{"origin":736,"position":1},"title":"How to Optimize Your Apache Spark Application with Partitions","date":"May 5, 2022","format":false,"excerpt":"In Salesforce Einstein, we use\u00a0Apache Spark\u00a0to perform parallel computations on large sets of data, in a distributed manner. In this article, we will take a deep dive into how you can optimize your Spark application with partitions. Introduction Today, we often need to process terabytes of data per day to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":733,"url":"https:\/\/fde.cat\/index.php\/2023\/07\/11\/how-is-salesforce-einstein-optimizing-ai-classification-model-accuracy\/","url_meta":{"origin":736,"position":2},"title":"How is Salesforce Einstein Optimizing AI Classification Model Accuracy?","date":"July 11, 2023","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional journeys that have shaped Salesforce Engineering leaders. Meet Matan Rabi, Senior Software Engineer on Salesforce Einstein\u2019s Machine Learning Observability Platform (MLOP) team. Matan and his team strive to optimize the accuracy of Einstein\u2019s AI classification models, empowering customers across industries\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":860,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/24\/inside-data-clouds-secret-formula-for-processing-one-quadrillion-records-monthly\/","url_meta":{"origin":736,"position":3},"title":"Inside Data Cloud\u2019s Secret Formula for Processing One Quadrillion Records Monthly","date":"April 24, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we explore the inspiring journeys of engineering leaders who have significantly advanced their fields. Today, we meet Soumya KV, who spearheads the development of the Data Cloud\u2019s internal apps layer at Salesforce. Her India-based team specializes in advanced data segmentation and activation, enabling tailored\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":869,"url":"https:\/\/fde.cat\/index.php\/2024\/05\/22\/composable-data-management-at-meta\/","url_meta":{"origin":736,"position":4},"title":"Composable data management at Meta","date":"May 22, 2024","format":false,"excerpt":"In recent years, Meta\u2019s data management systems have evolved into a composable architecture that creates interoperability, promotes reusability, and improves engineering efficiency.\u00a0 We\u2019re sharing how we\u2019ve achieved this, in part, by leveraging Velox, Meta\u2019s open source execution engine, as well as work ahead as we continue to rethink our data\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":626,"url":"https:\/\/fde.cat\/index.php\/2022\/08\/31\/introducing-velox-an-open-source-unified-execution-engine\/","url_meta":{"origin":736,"position":5},"title":"Introducing Velox: An open source unified execution engine","date":"August 31, 2022","format":false,"excerpt":"Meta is introducing Velox, an open source unified execution engine aimed at accelerating data management systems and streamlining their development. Velox is under active development. Experimental results from our paper published at the International Conference on Very Large Data Bases (VLDB) 2022 show how Velox improves efficiency and consistency in\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=736"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/736\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}