{"id":842,"date":"2024-03-20T21:50:40","date_gmt":"2024-03-20T21:50:40","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/03\/20\/optimizing-rtc-bandwidth-estimation-with-machine-learning\/"},"modified":"2024-03-20T21:50:40","modified_gmt":"2024-03-20T21:50:40","slug":"optimizing-rtc-bandwidth-estimation-with-machine-learning","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/03\/20\/optimizing-rtc-bandwidth-estimation-with-machine-learning\/","title":{"rendered":"Optimizing RTC bandwidth estimation with machine learning"},"content":{"rendered":"<p><span>Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta\u2019s family of apps.<\/span><br \/>\n<span>We\u2019ve adopted a machine learning (ML)-based approach that allows us<\/span><span> to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport.<\/span><br \/>\n<span>We\u2019re sharing our experiment results from this approach, some of the challenges we encountered during execution, and learnings for new adopters.<\/span><\/p>\n<p><span>Our existing bandwidth estimation (BWE) module at Meta is<\/span><a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/2910017.2910605\"> <span>based on WebRTC\u2019s Google Congestion Controller (GCC)<\/span><\/a><span>. We have made several improvements through parameter tuning, but this has resulted in a more complex system, as shown in Figure 1.<\/span><\/p>\n<p>Figure 1: BWE module\u2019s system diagram for congestion control in RTC.<\/p>\n<p><span>One challenge with the tuned congestion control (CC)\/BWE algorithm was that it had multiple parameters and actions that were dependent on network conditions. 
For example, there was a trade-off between quality and reliability; improving quality for high-bandwidth users often led to reliability regressions for low-bandwidth users, and vice versa, making it challenging to optimize the user experience for different network conditions.<\/span><\/p>\n<p><span>Additionally, we noticed some inefficiencies in improving and maintaining the complex BWE module:<\/span><\/p>\n<p><span>Due to the absence of realistic network conditions during our experimentation process, fine-tuning the parameters for user clients necessitated several attempts.<\/span><br \/>\n<span>Even after the rollout, it wasn\u2019t clear if the optimized parameters were still applicable for the targeted network types.<\/span><br \/>\n<span>This resulted in complex code logic and branches for engineers to maintain.<\/span><\/p>\n<p><span>To solve these inefficiencies, we developed a machine learning (ML)-based, network-targeting approach that offers a cleaner alternative to hand-tuned rules. This approach also allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport.<\/span><\/p>\n<h2><span>Network characterization<\/span><\/h2>\n<p><span>An ML model-based approach leverages time series data to improve bandwidth estimation by using offline parameter tuning for characterized network types.<\/span><\/p>\n<p><span>For an RTC call to be completed, the endpoints must be connected to each other through network devices. The optimal configs that have been tuned offline are stored on the server and can be updated in real time. During the call connection setup, these optimal configs are delivered to the client. During the call, media is transferred directly between the endpoints or through a relay server. 
Depending on the network signals collected during the call, an ML-based approach characterizes the network into different types and applies the optimal configs for the detected type.<\/span><\/p>\n<p><span>Figure 2 illustrates an example of an RTC call that\u2019s optimized using the ML-based approach.<\/span><\/p>\n<p>Figure 2: An example RTC call configuration with optimized parameters delivered from the server and based on the current network type.<\/p>\n<h2><span>Model learning and offline parameter tuning<\/span><\/h2>\n<p><span>At a high level, network characterization consists of two main components, as shown in Figure 3. The first component is offline ML model learning to categorize the network type (random packet loss versus bursty loss). The second component uses offline simulations to tune parameters optimally for the categorized network type.<\/span><\/p>\n<p>Figure 3: Offline ML-model learning and parameter tuning.<\/p>\n<p><span>For model learning, we leverage the time series data (network signals and non-personally identifiable information, see Figure 6, below) from production calls and simulations. Compared to the aggregate metrics logged after the call, time series data captures the time-varying nature and dynamics of the network. 
We use<\/span><a href=\"https:\/\/engineering.fb.com\/2016\/05\/09\/core-infra\/introducing-fblearner-flow-facebook-s-ai-backbone\/\"><span> FBLearner<\/span><\/a><span>, our internal AI stack, for the training pipeline and deliver the PyTorch model files on demand to the clients at the start of the call.<\/span><\/p>\n<p><span>For offline tuning, we use simulations to run network profiles for the detected types and choose the optimal parameters for the modules based on improvements in technical metrics (such as quality and freeze).<\/span><\/p>\n<h2><span>Model architecture<\/span><\/h2>\n<p><span>From our experience, we\u2019ve found that it\u2019s necessary to combine time series features with non-time series features (i.e., metrics derived from the time window) for highly accurate modeling.<\/span><\/p>\n<p><span>To handle both time series and non-time series data, we\u2019ve designed a model architecture that can process input from both sources.<\/span><\/p>\n<p><span>The time series data passes through a<\/span><a href=\"https:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/\"> <span>long short-term memory (LSTM) layer<\/span><\/a><span> that converts the time series input into a one-dimensional vector representation, such as 16\u00d71. The non-time series data, or dense data, passes through a dense layer (i.e., a fully connected layer). The two vectors are then concatenated, to fully represent the network condition in the past, and passed through another fully connected layer. The final output from the neural network model is the predicted output of the target\/task, as shown in Figure 4.<\/span><\/p>\n<p>Figure 4: Combined-model architecture with LSTM and dense layers.<\/p>\n<h2><span>Use case: Random packet loss classification<\/span><\/h2>\n<p><span>Let\u2019s consider the use case of categorizing packet loss as either random or congestion-induced. 
Random loss is caused by the network components, and congestion loss by the limits on queue length (which are delay dependent). Here is the ML task definition:<\/span><span><br \/>\n<\/span><span><br \/>\n<\/span><span>Given the network conditions in the past N seconds (N = 10), and that the network is currently incurring packet loss, the goal is to characterize the packet loss at the current timestamp as RANDOM or not.<\/span><\/p>\n<p><span>Figure 5 illustrates how we leverage the architecture to achieve that goal:<\/span><\/p>\n<p>Figure 5: Model architecture for a random packet loss classification task.<\/p>\n<h3><span>Time series features<\/span><\/h3>\n<p><span>We leverage the following time series features gathered from logs:<\/span><\/p>\n<p>Figure 6: Time series features used for model training.<\/p>\n<h3><span>BWE optimization<\/span><\/h3>\n<p><span>When the ML model detects random packet loss, we perform local optimization on the BWE module by:<\/span><\/p>\n<p><span>Increasing the tolerance to random packet loss in the loss-based BWE (holding the bitrate).<\/span><br \/>\n<span>Increasing the ramp-up speed, depending on the link capacity at high bandwidths.<\/span><br \/>\n<span>Increasing the network resiliency by sending additional forward error correction (FEC) packets to recover from packet loss.<\/span><\/p>\n<h2><span>Network prediction<\/span><\/h2>\n<p><span>The network characterization problem discussed in the previous sections focuses on classifying network types based on past information using time series data. Such simple classification tasks can also be handled with hand-tuned rules, albeit with some limitations. 
The real power of leveraging ML for networking, however, comes from using it to predict future network conditions.<\/span><\/p>\n<p><span>We have applied ML to congestion prediction in order to optimize the experience of low-bandwidth users.<\/span><\/p>\n<h2><span>Congestion prediction<\/span><\/h2>\n<p><span>From our analysis of production data, we found that low-bandwidth users often incur congestion due to the behavior of the GCC module. By predicting this congestion, we can improve reliability for these users. To that end, we addressed the following problem statement using round-trip time (RTT) and packet loss:<\/span><span><br \/>\n<\/span><span><br \/>\n<\/span><span>Given the historical time-series data from production\/simulation (\u201cN\u201d seconds), the goal is to predict packet loss due to congestion or the congestion itself in the next \u201cN\u201d seconds; that is, a spike in RTT followed by a packet loss or a further growth in RTT.<\/span><\/p>\n<p><span>Figure 7 shows an example from a simulation where the bandwidth alternates between 500 Kbps and 100 Kbps every 30 seconds. As we lower the bandwidth, the network incurs congestion and the ML model\u2019s predictions fire (the green spikes) even before the delay spikes and packet loss occur. This early prediction of congestion enables faster reactions and thus improves the user experience by preventing video freezes and connection drops.<\/span><\/p>\n<p>Figure 7: Simulated network scenario with alternating bandwidth for congestion prediction.<\/p>\n<h2><span>Generating training samples<\/span><\/h2>\n<p><span>The main challenge in modeling is generating training samples for a variety of congestion situations. With simulations, it\u2019s harder to capture the different types of congestion that real user clients would encounter in production networks. 
As a result, we used actual production logs for labeling congestion samples, following the RTT-spikes criteria in the past and future windows according to the following assumptions:<\/span><\/p>\n<p><span>Absent past RTT spikes, packet losses in the past and future are independent.<\/span><br \/>\n<span>Absent past RTT spikes, we cannot predict future RTT spikes or fractional losses (i.e., flosses).<\/span><\/p>\n<p><span>We split the time window into past (4 seconds) and future (4 seconds) for labeling.<\/span><span><br \/>\n<\/span><\/p>\n<p>Figure 8: Labeling criteria for congestion prediction<\/p>\n<h2><span>Model performance<\/span><\/h2>\n<p><span>Unlike network characterization, where ground truth is unavailable, we can obtain ground truth by examining the future time window after it has passed and then comparing it with the prediction made four seconds earlier. With this logging information gathered from real production clients, we compared the performance in offline training to online data from user clients:<\/span><\/p>\n<p>Figure 9: Offline versus online model performance comparison.<\/p>\n<h2><span>Experiment results<\/span><\/h2>\n<p><span>Here are some highlights from our deployment of various ML models to improve bandwidth estimation:<\/span><\/p>\n<h3><span>Reliability wins for congestion prediction<\/span><\/h3>\n<p><span><\/span> <span>connection_drop_rate -0.326371 +\/- 0.216084<br \/>\n<\/span><span> last_minute_quality_regression_v1 -0.421602 +\/- 0.206063<br \/>\n<\/span><span> last_minute_quality_regression_v2 -0.371398 +\/- 0.196064<br \/>\n<\/span><span> bad_experience_percentage -0.230152 +\/- 0.148308<br \/>\n<\/span><span> transport_not_ready_pct -0.437294 +\/- 0.400812<\/span><\/p>\n<p><span><\/span><span> peer_video_freeze_percentage -0.749419 +\/- 0.180661<br \/>\n<\/span><span> peer_video_freeze_percentage_above_500ms -0.438967 +\/- 0.212394<\/span><\/p>\n<h3><span>Quality and user engagement wins for random packet loss 
characterization in high bandwidth<\/span><\/h3>\n<p><span><\/span><span> peer_video_freeze_percentage -0.379246 +\/- 0.124718<br \/>\n<\/span><span> peer_video_freeze_percentage_above_500ms -0.541780 +\/- 0.141212<br \/>\n<\/span><span> peer_neteq_plc_cng_perc -0.242295 +\/- 0.137200<\/span><\/p>\n<p><span> total_talk_time 0.154204 +\/- 0.148788<\/span><\/p>\n<h3><span>Reliability and quality wins for cellular low bandwidth classification<\/span><\/h3>\n<p><span> connection_drop_rate -0.195908 +\/- 0.127956<br \/>\n<\/span><span> last_minute_quality_regression_v1 -0.198618 +\/- 0.124958<br \/>\n<\/span><span> last_minute_quality_regression_v2 -0.188115 +\/- 0.138033<\/span><\/p>\n<p><span> peer_neteq_plc_cng_perc -0.359957 +\/- 0.191557<br \/>\n<\/span><span> peer_video_freeze_percentage -0.653212 +\/- 0.142822<\/span><\/p>\n<h3><span>Reliability and quality wins for cellular high bandwidth classification<\/span><\/h3>\n<p><span> avg_sender_video_encode_fps 0.152003 +\/- 0.046807<br \/>\n<\/span><span> avg_sender_video_qp -0.228167 +\/- 0.041793<br \/>\n<\/span><span> avg_video_quality_score 0.296694 +\/- 0.043079<br \/>\n<\/span><span> avg_video_sent_bitrate 0.430266 +\/- 0.092045<\/span><\/p>\n<h2><span>Future plans for applying ML to RTC<\/span><\/h2>\n<p><span>From our project execution and experimentation on production clients, we noticed that an ML-based approach is more efficient in targeting, end-to-end monitoring, and updating than traditional hand-tuned rules for networking. However, the efficiency of ML solutions largely depends on data quality and labeling (using simulations or production logs). 
By applying ML-based solutions to network prediction problems \u2013 congestion in particular \u2013 we fully leveraged the power of ML.<\/span><\/p>\n<p><span>In the future, we will consolidate all the network characterization models into a single model using a multi-task approach, to fix the inefficiency due to redundancy in model download, inference, and so on. We will build a shared representation model for the time series to solve different tasks (e.g., bandwidth classification and packet loss classification) in network characterization. We will focus on building realistic production network scenarios for model training and validation. This will enable us to use ML to identify optimal network actions given the network conditions. We will continue refining our learning-based methods to enhance network performance by considering existing network signals.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2024\/03\/20\/networking-traffic\/optimizing-rtc-bandwidth-estimation-machine-learning\/\">Optimizing RTC bandwidth estimation with machine learning<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>\n<p>Engineering at Meta<\/p>","protected":false},"excerpt":{"rendered":"<p>Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta\u2019s family of apps. We\u2019ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport. 
We\u2019re sharing our experiment results from this approach, some of&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/03\/20\/optimizing-rtc-bandwidth-estimation-with-machine-learning\/\">Continue reading <span class=\"screen-reader-text\">Optimizing RTC bandwidth estimation with machine learning<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-842","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":841,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/20\/better-video-for-mobile-rtc-with-av1-and-hd\/","url_meta":{"origin":842,"position":0},"title":"Better video for mobile RTC with AV1 and HD","date":"March 20, 2024","format":false,"excerpt":"At Meta, we support real-time communication (RTC) for billions of people through our apps, including Messenger, Instagram, and WhatsApp. We\u2019ve seen significant benefits by adopting the AV1 codec for RTC. Here\u2019s how we are improving the RTC video quality for our apps with tools like the AV1 codec, the challenges\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":880,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/13\/mlow-metas-low-bitrate-audio-codec\/","url_meta":{"origin":842,"position":1},"title":"MLow: Meta\u2019s low bitrate audio codec","date":"June 13, 2024","format":false,"excerpt":"At Meta, we support real-time communication (RTC) for billions of people through our apps, including WhatsApp, Instagram, and Messenger.\u00a0 We are working to make RTC accessible by providing a high-quality experience for everyone \u2013 even those who might not have the fastest connections or the latest phones. 
As more and\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":630,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/07\/network-entitlement-a-contract-based-network-sharing-solution\/","url_meta":{"origin":842,"position":2},"title":"Network Entitlement: A contract-based network sharing solution","date":"September 7, 2022","format":false,"excerpt":"Meta\u2019s overall network usage and traffic volume has increased as we\u2019ve continued to add new services. Due to the scarcity of fiber resources, we\u2019re developing an explicit resource reservation framework to effectively plan, manage, and operate the shared consumption of network bandwidth, which will help us keep up with demand\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":604,"url":"https:\/\/fde.cat\/index.php\/2022\/07\/06\/watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network\/","url_meta":{"origin":842,"position":3},"title":"Watch Meta\u2019s engineers discuss QUIC and TCP innovations for our network","date":"July 6, 2022","format":false,"excerpt":"With more than 75 percent of our internet traffic set to use QUIC and HTTP\/3 together, QUIC is slowly moving to become the de facto protocol used for internet communication at Meta. 
For Meta\u2019s data center network, TCP remains the primary network transport protocol that supports thousands of services on\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":894,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/10\/taming-the-tail-utilization-of-ads-inference-at-meta-scale\/","url_meta":{"origin":842,"position":4},"title":"Taming the tail utilization of ads inference at Meta scale","date":"July 10, 2024","format":false,"excerpt":"Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization. The tail utilization optimizations at Meta have had a profound impact on model serving capacity footprint and reliability.\u00a0 Failure rates, which are mostly timeout errors, were reduced by two-thirds; the compute footprint\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":618,"url":"https:\/\/fde.cat\/index.php\/2022\/08\/10\/scaling-data-ingestion-for-machine-learning-training-at-meta\/","url_meta":{"origin":842,"position":5},"title":"Scaling data ingestion for machine learning training at Meta","date":"August 10, 2022","format":false,"excerpt":"Many of Meta\u2019s products, such as search and language translations, utilize AI models to continuously improve user experiences. As the performance of hardware we use to support training infrastructure increases, we need to scale our data ingestion infrastructure accordingly to handle workloads more efficiently. 
GPUs, which are used for training\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/842","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=842"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/842\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=842"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=842"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=842"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}