{"id":580,"date":"2022-05-05T14:51:18","date_gmt":"2022-05-05T14:51:18","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/05\/05\/how-to-optimize-your-apache-spark-application-with-partitions-2\/"},"modified":"2022-05-05T14:51:18","modified_gmt":"2022-05-05T14:51:18","slug":"how-to-optimize-your-apache-spark-application-with-partitions-2","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/05\/05\/how-to-optimize-your-apache-spark-application-with-partitions-2\/","title":{"rendered":"How to Optimize Your Apache Spark Application with Partitions"},"content":{"rendered":"<p>In Salesforce Einstein, we use\u00a0<a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Spark<\/a>\u00a0to perform parallel computations on large sets of data, in a distributed manner. In this article, we will take a deep dive into how you can optimize your Spark application with partitions.<\/p>\n<h2><strong>Introduction<\/strong><\/h2>\n<p>Today, we often need to process terabytes of data per day to reach conclusions. To do this in an acceptable timeframe, we need to perform certain computations on that data in parallel.<\/p>\n<p>Parallelizing on one machine will always have limits, no matter how big the machine is. Our true factor of parallelism is determined by the number of cores on our machine (128 is the maximum today on a single machine). So instead, we want to be able to distribute the load to multiple machines. That way, we can use a fleet of commodity hardware and reach an \u201cinfinite\u201d parallelism factor (we can always add new machines).<\/p>\n<p>Spark is a distributed processing system that helps us distribute the load on multiple machines, without the overhead of syncing them and managing errors for each. But the thing is, it\u2019s not Spark\u2019s job to decide how best to distribute the load. 
It ships with default configurations to help us get started, but those are rarely enough; relying on the defaults can leave a **70%** performance gap on the table, as we will see in our example later on. Our job is to tell Spark exactly how we want the load distributed over our dataset. To do that, we must learn and understand the concept of partitions.

## What is a partition?

Let's start with an example. Say we have a table with three columns: class ID, student ID, and grade, and we want to calculate the average grade for each class ID. Calculating one class's average is independent of the other classes, but the grades *within* a class all depend on each other to produce that class's average. So we can parallelize our program up to the number of distinct class IDs, but no further. In Spark, this unit of parallelism is called a "partition," and in our example we have partitioned our data by a certain column: class ID. Computations on datasets in Spark are translated into tasks, where each task runs on exactly one core. Each task (you guessed it) maps to a single partition, so the number of tasks equals the number of partitions.

## Not all computations are born equal

Spark performs staged computation: it bundles certain computations into the same stage and executes them with the same level of parallelism.

To understand how Spark decides what to bundle into a stage, we need to understand the two kinds of transformations that we can perform on our data:

**Narrow transformations:** These are transformations in which the data in each partition does not require access to data in other partitions in order to be fully executed.
For example, functions like `map`, `filter`, and `union` are narrow transformations.

**Wide transformations:** These are transformations in which the data in each partition is *not* independent, requiring data from other partitions in order to be fully executed. For example, functions like `reduceByKey`, `groupByKey`, and `join` are wide transformations.

Wide transformations require an operation called a "shuffle," which is essentially transferring data between the different partitions. Shuffling is considered a rather expensive operation, and we should avoid it where we can. A shuffle also results in a new set of partitions.

Notice that when we have some number of partitions and perform a wide transformation (e.g., `groupBy`), Spark will first aggregate within each initial partition, and only then shuffle the data, partition it by key, and aggregate again within each shuffled partition. This partial aggregation reduces the number of rows that have to be shuffled.

In Spark, each stage is built from transformations that can be executed back to back without a shuffle (narrow transformations). To recap: stages are chunks of processing that can be done in parallel, without shuffling data around again.

## Controlling the number of partitions in each stage

As mentioned before, Spark can be rather naive when it comes to partitioning our data correctly. That's because it's not really Spark's job; we should make sure our data is well-partitioned in each executing stage.

Adaptive query execution (AQE, introduced in Spark 3) is enabled by default in the latest Spark version (3.2.1), and you should keep it that way. Among other things (including choosing join strategies at runtime and optimizing skewed joins), it reduces the number of partitions used in shuffle operations at runtime, based on the dataset size (hence, adaptive).
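For illustration, here is how those AQE flags would be set explicitly when building a session (a minimal sketch; both are already on by default in recent Spark 3.x versions):

```python
from pyspark.sql import SparkSession

# AQE and shuffle-partition coalescing, set explicitly
# (these are the defaults in Spark 3.2+):
spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)
```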
But these are still static, pre-run configuration values that we have to set appropriately.

Let's look at some useful configuration values regarding Spark partitions:

- `spark.sql.files.maxPartitionBytes`: The maximum number of bytes to pack into a single partition when reading files. Default is 128 MB.
- `spark.sql.files.minPartitionNum`: The suggested (not guaranteed) minimum number of partitions when reading files. Default is `spark.default.parallelism`, which equals the total number of cores in our cluster or two, whichever is bigger.

`spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` must both be true (the default) for the next config values to apply:

- `spark.sql.adaptive.advisoryPartitionSizeInBytes`: Target size of shuffle partitions during adaptive optimization. Default is 64 MB.
- `spark.sql.adaptive.coalescePartitions.initialPartitionNum`: As stated above, adaptive query execution optimizes by **reducing** (or in Spark terms, coalescing) the number of partitions used in shuffle operations, which means the initial number must be set high enough. This value is the initial number of partitions to use when shuffling, before coalescing begins. Default is `spark.sql.shuffle.partitions`, which equals 200.
- `spark.sql.adaptive.coalescePartitions.parallelismFirst`: When set to true (the default), Spark ignores `spark.sql.adaptive.advisoryPartitionSizeInBytes` and only respects `spark.sql.adaptive.coalescePartitions.minPartitionSize`, which defaults to 1 MB. This is meant to maximize parallelism.

If we set `spark.sql.adaptive.enabled` to false, the target number of partitions while shuffling will simply equal `spark.sql.shuffle.partitions`.

In addition to these static configuration values, we often need to dynamically repartition our dataset.
One example is after we filter our dataset: we might end up with uneven partitions, causing data skew and unbalanced processing. Another example is when we want our data written out to different folders, partitioned by a certain key; partitioning the dataset by that key in memory beforehand avoids searching through multiple partitions while writing.

With that in mind, **let's see how we can dynamically repartition our dataset using Spark's different partitioning strategies**:

**Round-robin partitioning:** Distributes the data from the source number of partitions to the target number of partitions in a round-robin fashion, keeping the resulting partitions evenly sized. Since repartitioning is a shuffle operation, if we don't pass a target number, Spark uses the configuration values mentioned above to set the final number of partitions. Example: `df.repartition(10)`.

**Hash partitioning:** Splits our data so that elements with the same hash (of a key, several keys, or a function) end up in the same partition. We can also pass a desired number of partitions, in which case the final partition is `hash % numPartitions`. Note that if `numPartitions` is bigger than the number of distinct hash groups, some partitions will be empty. Example: `df.repartition(10, 'class_id')`.

**Range partitioning:** Very similar to hash partitioning, but based on ranges of values. For performance reasons, this method uses sampling to estimate the ranges. Hence, the output may be inconsistent, since the sampling can return different values.
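That sampling step can be sketched in plain Python: draw a sample of the column, sort it, and cut it into boundaries (a toy sketch of the idea, not Spark's actual implementation):

```python
import random

def range_bounds(values, num_partitions, sample_size, seed):
    """Estimate partition boundaries from a random sample (toy version)."""
    random.seed(seed)
    sample = sorted(random.sample(values, sample_size))
    step = len(sample) // num_partitions
    # num_partitions ranges need num_partitions - 1 boundary values
    return [sample[i * step] for i in range(1, num_partitions)]

grades = list(range(1000))

# Two different samples will usually yield different boundaries --
# the "inconsistent output" mentioned above:
print(range_bounds(grades, 4, 40, seed=1))
print(range_bounds(grades, 4, 40, seed=2))
```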
The sample size can be controlled with the config value `spark.sql.execution.rangeExchange.sampleSizePerPartition`. Example: `df.repartitionByRange(10, 'grade')`.

When decreasing the number of partitions, we can also use the `coalesce` method, which does not shuffle data: it simply merges existing partitions (as opposed to `repartition`, which shuffles the data and creates new partitions). Hence, it is cheaper than `repartition`.

However, because it doesn't shuffle, `coalesce` may split our data unevenly between the resulting partitions. In general, we should use `coalesce` when our parent partitions are already evenly distributed, or when our target number of partitions is only marginally smaller than the source number. In other cases, we would probably want to use `repartition`, to make sure our data is distributed evenly.

## What is the optimal number of partitions?

Of course, there is no single answer to this question. How you should partition your data depends on:

**Available resources in your cluster.** Spark's official recommendation is to have ~3x as many partitions as available cores in the cluster, to maximize parallelism while keeping the overhead of spinning up more tasks in check. But it is not quite so straightforward: if our tasks are very simple (taking less than a second, say), we might want to consider fewer partitions to avoid the overhead, even if that means fewer than 3x. We should also take into account the memory of each executor and make sure we are not exceeding it.
If we do exceed it, we might want to create more partitions, even if that means more than 3x our available cores, so that each partition is smaller; alternatively, we can increase the memory of our executors.

**How your data looks (size, cardinality) and the computations being made.** You should make sure your data is distributed as evenly as possible across your partitions. Sometimes that is easy (using round-robin partitioning), but other times it can be more complicated. Group-bys, windowing, and writing data in a way that is easy to query later often require partitioning by key; when doing this, we need to make sure no group's values are dramatically larger than the others'. If they are, we should check whether we can achieve the same result by other means, by normalizing our data or changing our computations (e.g., splitting that specific key in two, applying certain computations, and then merging the results).

## Spark Partitions in Action

Let's put some of what we've learned about Spark partitions into an example.

Say we have a dataset weighing ~24 GB that contains fraud deals that occurred at certain businesses in 2021; each row holds, among other fields, a `business` identifier and a fraud `amount`.

For each business, we want to calculate the total amount of its fraud deals, and how this total compares to the average total amount across businesses.

In the end, we want to write the result as CSV, partitioned into folders by total business fraud amount / average fraud amount (rounded to three decimal places).

Let's see a naive way to do it using PySpark and Spark's DataFrame API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.option('header', 'true').csv('./example_data/dataset_1.csv')
df = df.withColumn('amount', F.col('amount').cast('int'))
df = df.groupBy('business').agg(F.sum('amount').alias('total_amount'))
df_avg = df.select(F.avg('total_amount').alias('avg_amount'))
df = df.crossJoin(df_avg)
df = df.withColumn('compared_to_avg',
                   F.round(F.col('total_amount') / F.col('avg_amount'), 3))
df.write.mode('overwrite').partitionBy('compared_to_avg').csv('./output_data/')
```

Great! We've done it. The main stages in our application are:

1. Cast `amount` to int and group by `business` within each partition to partially sum each business's total amount (partially, because the same business can also appear in other partitions).
2. Shuffle the data so that it is partitioned by `business`, and finish summing the total amount for each business.
3. Create another DataFrame that contains the average total amount (over all businesses) and cross-join the two DataFrames to add the average amount to each row.
4. Calculate `compared_to_avg` as total business fraud amount / average fraud amount, and write the result as CSV, partitioned by `compared_to_avg`.

Now let's look at some statistics in the Spark UI.

**Stage #1:** Spark used 192 partitions, each containing ~128 MB of data (the default of `spark.sql.files.maxPartitionBytes`). The entire stage took **32s**.

**Stage #2:** The `groupBy` shuffle resulted in 11 partitions, each containing ~1 MB of data (the default of `spark.sql.adaptive.coalescePartitions.minPartitionSize`). The entire stage took **0.4s**.

**Stage #3:** Spark used 1 partition containing 708 B to fully calculate the average total amount and join it onto each row. The entire stage
took **4ms**.

**Stage #4:** The join shuffle resulted in 11 partitions again, each containing ~1 MB of data (the default of `spark.sql.adaptive.coalescePartitions.minPartitionSize`). The entire stage took **15s**.

So... how can we optimize this? First, we need to know our data size, its cardinality, and our resources. We already know that our dataset weighs ~24 GB. Regarding its cardinality, we can see that after grouping by `business` the data shrank dramatically (to around 11 MB), meaning we have low cardinality across the different businesses. Regarding our resources: I'm running Spark in local mode, and my machine has 16 cores and 32 GB of RAM.

As mentioned in the previous section, to maximize parallelism we want ~3x partitions per core. So, for stage #1, the optimal number of partitions is ~48 (16 x 3), which means ~500 MB per partition (our total RAM can handle 16 executors each processing 500 MB). To reduce the number of partitions resulting from shuffle operations, we can rely on the default advisory shuffle-partition size and set `spark.sql.adaptive.coalescePartitions.parallelismFirst` to false. (The Spark documentation also recommends setting this value to false when you know what you're doing.) One last thing: before writing, our dataset is partitioned by `business` (as a result of our `groupBy`), but we write our data partitioned by the `compared_to_avg` column.
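One way to see the cost of that mismatch is to count output files: each write task emits one file per distinct partition-column value it holds, so data scattered across many tasks multiplies the file count. A quick pure-Python sketch (the 11 tasks and 24 distinct values here are hypothetical numbers):

```python
# Hypothetical: 11 shuffle partitions (write tasks), 24 distinct
# compared_to_avg values. Each task writes one file per distinct
# partition-column value it holds.

def output_files(tasks):
    """tasks: list of sets, each the partition-column values one task holds."""
    return sum(len(values) for values in tasks)

scattered = [set(range(24)) for _ in range(11)]  # every task holds every value
grouped = [{v} for v in range(24)]               # repartitioned by the column

print(output_files(scattered))  # 264 small files across the folders
print(output_files(grouped))    # 24 files, one per folder
```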
It might be a good idea to repartition it on `compared_to_avg` before writing.

Let's see how the optimized version looks:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark_conf = SparkConf()
spark_conf.set('spark.sql.adaptive.coalescePartitions.initialPartitionNum', 24)
spark_conf.set('spark.sql.adaptive.coalescePartitions.parallelismFirst', 'false')
spark_conf.set('spark.sql.files.minPartitionNum', 1)
spark_conf.set('spark.sql.files.maxPartitionBytes', '500mb')
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()

df = spark.read.option('header', 'true').csv('./example_data/dataset_1.csv')
df = df.withColumn('amount', F.col('amount').cast('int'))
df = df.groupBy('business').agg(F.sum('amount').alias('total_amount'))
df_avg = df.select(F.avg('total_amount').alias('avg_amount'))
df = df.crossJoin(df_avg)
df = df.withColumn('compared_to_avg',
                   F.round(F.col('total_amount') / F.col('avg_amount'), 3))
df = df.repartition(24, 'compared_to_avg')
df.write.mode('overwrite').partitionBy('compared_to_avg').csv('./output_data/')
```

**Stage #1:** As we told it to via the `spark.sql.files.maxPartitionBytes` config value, Spark used 54 partitions, each containing ~500 MB of data (it's not exactly 48 partitions because, as the name suggests, max partition bytes only caps the bytes per partition). The entire stage took **24s**.

**Stage #2:** We have set `spark.sql.adaptive.coalescePartitions.parallelismFirst` to false, and now Spark AQE uses the default value for the shuffle-coalescing
advisory partition size (128 MB), resulting in 1 task containing 2 MB. Notice that the data size in this stage is much smaller than in the previous run: because we had fewer partitions in stage #1, the partial `groupBy` within each partition left much less data to shuffle and fully aggregate. The entire stage took **0.2s**.

**Stage #3:** The previous stage already calculated the overall average amount for all businesses, because all the grouped rows were in the same partition, so all this stage had left to do was join the output. The entire stage took **2ms**.

**Stage #4:** We gained an extra stage: partitioning the data using the hash partitioner. Spark used 1 partition containing 2.6 MB of data. The entire stage took **0.1s**.

**Stage #5:** We wanted our data in 24 partitions by the `compared_to_avg` column, and that's exactly what we got. I chose 24 after trial and error, balancing a higher parallelism factor against the parallelism overhead (each task ended up taking ~2s). Each of the 24 partitions held ~5.1 KB of data. The entire stage took **4s**.

Application without partition tuning: **47.4s**. Application with partition tuning: **28.1s**.

Notice the difference! We've managed to achieve the same goal, but much faster.
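The improvement figure quoted below follows directly from those two wall-clock timings:

```python
# Wall-clock timings measured above in the Spark UI
before, after = 47.4, 28.1  # seconds

improvement = (before - after) / before
print(f"{improvement:.1%}")  # → 40.7%
```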
By understanding the concept of partitions, how Spark manages them, and the different factors at play, we improved our performance by roughly **40%**!

*Dataset was taken from [kaggle.com](https://kaggle.com/).*

## Final note

We have learned how partitions allow us to parallelize computations on our dataset, and how we can control the way Spark partitions our data, taking into consideration our dataset, the computations being made, and the cluster size. We've also learned how to leverage all of that to make our application more performant.

While we demonstrated optimization using correct partitioning, there are other levers for tuning and optimizing a Spark application: SQL hints (e.g., for specific join strategies), dynamic repartitioning to avoid skewed joins, caching, and dynamic resource allocation can all be used to optimize our application, and I encourage you to dive deeper.
There's no better place to start than the official documentation: [Performance Tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html).

*The post [How to Optimize Your Apache Spark Application with Partitions](https://engineering.salesforce.com/how-to-optimize-your-apache-spark-application-with-partitions/) appeared first on [Salesforce Engineering](https://engineering.salesforce.com/).*