{"id":337,"date":"2021-08-31T14:39:28","date_gmt":"2021-08-31T14:39:28","guid":{"rendered":"https:\/\/fde.cat\/?p=337"},"modified":"2021-08-31T14:39:28","modified_gmt":"2021-08-31T14:39:28","slug":"coordinated-rate-limiting-in-microservices","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/coordinated-rate-limiting-in-microservices\/","title":{"rendered":"Coordinated Rate Limiting in Microservices"},"content":{"rendered":"<h3>The Problem<\/h3>\n<p>Any multitenant service with public REST APIs needs to be able to protect itself from excessive usage by one or more tenants. Additionally, as the number of instances that support these services is dynamic and varies based on load, the need arrises to perform coordinated rate limiting on a per tenant, per endpoint\u00a0basis.<\/p>\n<p>Take the following scenario where a multitenant service has two endpoints:<\/p>\n<p>GET \/v1\/organizations\/{orgId}\/product\/{id}PUT \/v1\/organizations\/{orgId}\/product\/{id}<\/p>\n<p>Since a read operation is cheaper than a write operation, you may want limit them differently, say 100 calls per second vs. 10 calls per second. There is also a need to publish a service level agreement (SLA) about how often the service can be called to let integrating developers know what to expect when building out their applications. Finally, you don\u2019t want these numbers to change based on the number of instances running of the\u00a0service.<\/p>\n<p>For multitenant services, there is a need to make sure no organization can consume too much of the resources. For example, let\u2019s say both Org A and Org B are calling the service. While we promise horizontal scale to solve some of this problem, it isn\u2019t instantaneous. 
Excessive calls from Org A cannot be allowed to block calls from Org B from executing, even while the service is scaling up.<\/p>\n<h3>The Solution<\/h3>\n<p>To solve this problem in a highly performant manner, we use a simple implementation of a Servlet Filter. This class makes use of defined per endpoint rate limits, a two layer cache, and a distributed counter.<\/p>\n<p><em>Note: this solution does not perform strict rate limit checking. Due to running multiple independent instances, each org may go slightly over its limit during the last synchronization interval. We deemed this to be acceptable, as the goal is to be highly performant and protect the service, not to be completely strict.<\/em><\/p>\n<h3>The Rate Limit Definition<\/h3>\n<p>The first step in this process is defining the rate limits: how often can a given endpoint be called per organization? To do this, the limits are defined in YAML, which can easily be read by any application.<\/p>\n<p>slas:<br \/>  - id: get-product<br \/>    enabled: true<br \/>    match:<br \/>      methods: [ 'GET' ]<br \/>      pathPattern: \/product\/*<br \/>    tiers:<br \/>      - period: 10<br \/>        threshold: 1000<br \/>  - id: put-product<br \/>    enabled: true<br \/>    match:<br \/>      methods: [ 'PUT' ]<br \/>      pathPattern: \/product\/*<br \/>    tiers:<br \/>      - period: 10<br \/>        threshold: 100<\/p>\n<p>The above definition sets limits for both getting and creating\/updating products.<\/p>\n<p>For get product, a limit of 1000 calls enforced on a 10 second window is defined.<br \/>For put product, a limit of 100 calls enforced on a 10 second window is defined.<\/p>\n<p>You might ask, \u201cWhy don\u2019t we just define them per second?\u201d The answer is that we want to handle bursts, but still protect ourselves. 
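To make the tier evaluation concrete, here is a minimal, self-contained sketch of checking a request against SLA tiers like the YAML above. The names (`Tier`, `allow`) and the pre-computed per-window counts are hypothetical illustrations, not the actual implementation:

```java
import java.util.List;

// Hypothetical sketch: a request is allowed only if every tier's
// threshold has not yet been reached in that tier's current window.
class TierCheck {
    // Mirrors one entry under "tiers" in the YAML: period (seconds) and threshold.
    record Tier(int periodSeconds, long threshold) {}

    // countInWindow[i] is the number of calls already made in tier i's current window.
    static boolean allow(List<Tier> tiers, long[] countInWindow) {
        for (int i = 0; i < tiers.size(); i++) {
            if (countInWindow[i] >= tiers.get(i).threshold()) {
                return false; // one failing tier rejects the call
            }
        }
        return true; // all tiers passed
    }

    public static void main(String[] args) {
        // e.g. at most 10 calls per 1 second, and no more than 50 per 10 seconds
        List<Tier> tiers = List.of(new Tier(1, 10), new Tier(10, 50));
        System.out.println(allow(tiers, new long[] {3, 49})); // true: both tiers under limit
        System.out.println(allow(tiers, new long[] {3, 50})); // false: 10 second tier exhausted
    }
}
```

A short burst can pass the wide tier while the narrow tier throttles it, which is the trade-off the larger window is meant to balance.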
By increasing the window size, the bursty nature of the traffic can be supported.<\/p>\n<p>It is also possible to define multiple tiers. In that case, all tiers must pass evaluation. For example, 10 calls in 1 second but no more than 50 in 10 seconds. Again, this allows supporting the bursty nature of API calls while still managing the overall throughput allowed.<\/p>\n<h3>Distributed Counter<\/h3>\n<p>In order to solve the coordination problem between multiple instances of the service, a distributed counter is used (<em>See \u201cA Deeper Look\u201d below for details of the key<\/em>). The counter is then synchronized out to a distributed cache so that an accurate count across all instances of the service can be maintained.<\/p>\n<p>This information is then matched up with the rate limit definitions that determine how often an API can be invoked over what period of time. This period of time is used as a window that maintains the count until the limit\u00a0resets.<\/p>\n<h3>The Response<\/h3>\n<p>Now that per tenant, per endpoint limits are enforced, the caller of the API needs to be informed of what they are allowed to do. 
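One way to derive that feedback from the windowed counter state is sketched below. The method names and values are hypothetical illustrations (the remaining allowance and the seconds until the window resets follow directly from the threshold, the current count, and the window end time):

```java
// Hypothetical sketch: deriving caller-facing rate limit feedback
// from the window state (threshold, current count, window end time).
class RateLimitInfo {
    static long remaining(long threshold, long currentCount) {
        // Never negative, even if the count slightly overshoots the limit
        return Math.max(0, threshold - currentCount);
    }

    static long secondsUntilReset(long windowEndMillis, long nowMillis) {
        // Round up so the caller never retries before the window rolls over
        return Math.max(0, (windowEndMillis - nowMillis + 999) / 1000);
    }

    public static void main(String[] args) {
        long windowEnd = 1627318790000L; // illustrative window end time
        long now = 1627318780077L;       // illustrative current system time
        System.out.println(remaining(1000, 250));              // 750 calls left
        System.out.println(secondsUntilReset(windowEnd, now)); // 10 seconds
    }
}
```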
To do this, the following headers are returned with each API call.<\/p>\n<p>x-ratelimit-limit: number of calls allowed in the time window<br \/>x-ratelimit-remaining: number of remaining calls that can be made in the time window<br \/>x-ratelimit-reset: number of seconds remaining in the time window<\/p>\n<p>If the caller has exceeded the limit, a 429 \u201cToo Many Requests\u201d error is returned, and the caller needs to back off for the remaining time in the window.<\/p>\n<p><em>Note: The format of x-ratelimit-XXX, while not an <\/em><a href=\"https:\/\/www.ietf.org\/standards\/rfcs\/\"><em>RFC<\/em><\/a><em>, is commonly used by SaaS services such as GitHub and Twitter.<\/em><\/p>\n<h3>A Deeper Look<\/h3>\n<p><strong>The Counter\u2019s Key<\/strong><\/p>\n<p>To ensure uniqueness, the counter needs to include the tenant (or org) ID, the endpoint (minus any variables within the path), the HTTP method being invoked, and the window start and end time in its name.<\/p>\n<p>So then, how do we compute the start and end time?<\/p>\n<p>As shown above, each rate limit definition has a \u201cperiod\u201d or window over which it is enforced. In the example, this is 10 seconds.<\/p>\n<p>To compute the window start time and end time, the following algorithm is used. 
This always truncates the current system time down to the start of the current window, so every call within the window maps to the same key.<\/p>\n<p>long windowTime = 10000; \/\/ 10 seconds in milliseconds<br \/>long startTime = System.currentTimeMillis();<br \/>long windowStart = (startTime \/ windowTime) * windowTime;<br \/>long windowEnd = windowStart + windowTime;<br \/>String counterKey = orgId + &quot;_&quot; + endpoint + &quot;_&quot; + httpMethod +<br \/>  &quot;_&quot; + windowStart + &quot;_&quot; + windowEnd;<\/p>\n<p>Once a new window begins, the key automatically changes to the new start time.<\/p>\n<p>Let\u2019s take the example of a system time of 1627318780077 (which is 2021-07-26 12:59:40.077).<\/p>\n<p>Window Start = 1627318780077 \/ 10000 = 162731878.0077 (truncated to 162731878 due to integer math) * 10000 = 1627318780000<br \/>Window End = 1627318780000 + 10000 = 1627318790000<\/p>\n<p>100 ms later<\/p>\n<p>Window Start = 1627318780177 \/ 10000 = 162731878.0177 (truncated to 162731878) * 10000 = 1627318780000<br \/>Window End = 1627318780000 + 10000 = 1627318790000<\/p>\n<p><em>Note: This algorithm favors simplicity in order to avoid having all instances synchronize the start time. What this means is that the first window may end up being short (rolling over early), but all subsequent windows will match the desired size.<\/em><\/p>\n<p><strong>A Two Layer Caching System with Automated Clean Up<\/strong><\/p>\n<p>As with any SaaS offering, minimizing the latency induced within an API call is crucial to success. For this implementation, a two layer caching system is used: one local to the service instance, built on the <a href=\"https:\/\/github.com\/ben-manes\/caffeine\">Caffeine<\/a> library, and a second externalized via Redis. (These were chosen as they are lightweight, highly efficient, and support setting a time to live on an entry. 
Additionally, Redis supports atomic increment operations, which allow the distributed count to be maintained easily).<\/p>\n<p>This approach allows for choosing the interval at which to synchronize with Redis instead of forcing each API call to reach out to the external cache. It is implemented as follows.<\/p>\n<p>\/\/ The rate limit window in milliseconds<br \/>long windowTime = 10000;<br \/><br \/>\/\/ The local cache<br \/>Cache&lt;String, Counter&gt; callsPerCounterCache = Caffeine.newBuilder().maximumSize(1000)<br \/>        .expireAfterWrite(windowTime + 2000, TimeUnit.MILLISECONDS).build();<br \/><br \/>\/\/ The counter class<br \/>class Counter {<br \/>  long callsSinceLastSync = 0;<br \/>  long totalCalls = 0;<br \/>  long lastSyncTimeMs = 0;<br \/>  boolean firstTime = true;<br \/>}<br \/><br \/>\/\/ Get the counter and increment it<br \/>Counter counter = callsPerCounterCache.get(counterKey, k -&gt; new Counter());<br \/>counter.callsSinceLastSync++;<br \/>counter.totalCalls++;<br \/><br \/>\/\/ If we have reached a sync interval, sync with Redis<br \/>\/\/ This is always true on the first pass<br \/>if (System.currentTimeMillis() - counter.lastSyncTimeMs &gt; syncThresholdMs) {<br \/>  syncWithRedis(counterKey, windowTime, counter);<br \/>}<br \/>return counter.totalCalls;<\/p>\n<p>\/\/\/\/\/\/\/ syncWithRedis method \/\/\/\/\/\/\/<br \/>long currentTime = System.currentTimeMillis();<br \/>RedisCommands&lt;String, String&gt; sync = redisConnection.sync();<br \/>counter.totalCalls = sync.incrby(counterName, counter.callsSinceLastSync);<\/p>\n<p>\/\/ If the counter expire time has not been set, set it<br \/>if (counter.firstTime) {<br \/>  sync.expire(counterName, TimeUnit.SECONDS.convert(windowTime + 2000, TimeUnit.MILLISECONDS));<br \/>  counter.firstTime = false;<br \/>}<br \/><br \/>\/\/ Reset the local count and last sync time<br \/>counter.lastSyncTimeMs = currentTime;<br \/>counter.callsSinceLastSync = 0;<\/p>\n<h3>Performance Testing<\/h3>\n<p>To validate the performance of the rate limiter, the following test was run in the environment described below.<\/p>\n<p><strong>Test<\/strong><\/p>\n<p>1000 calls per second to a single API endpoint<br \/>25 tenants<br \/>3 instances of the service<br \/>Synchronize with Redis once per second, per tenant\/endpoint<\/p>\n<p><strong>Environment<\/strong><\/p>\n<p>Kubernetes environment deployed in AWS US-East-1<br \/>3 instances, each running on its own m5.xlarge instance (4 vCPU, 16 GB RAM)<br \/>Test clients running in AWS US-East-1<\/p>\n<p><strong>Redis Atomic Increment Call Rate per Second<\/strong><\/p>\n<p><strong>Redis Atomic Increment p95 Latency in Milliseconds<\/strong><\/p>\n<p><strong>Redis Rate Limiter p95 Latency in Milliseconds<\/strong><\/p>\n<p>The above charts show that the expected 75 calls (25 tenants across three instances) are made to Redis per second, that the Redis atomic increment latency is a p95 of 15.7 ms, and that, by only syncing once per second, the rate limiter induces a <strong>p95 of 1 millisecond<\/strong> of latency under load.<\/p>\n<h3>Conclusion<\/h3>\n<p>As companies move more towards API-first, multitenant, horizontally scalable, distributed offerings, it is paramount that they build in service protection to maintain stability under load. 
The above article demonstrates a simple implementation with a two layer cache approach that can be employed to protect the system without hindering performance.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/coordinated-rate-limiting-in-microservices-bb8861126beb\">Coordinated Rate Limiting in Microservices<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Problem Any multitenant service with public REST APIs needs to be able to protect itself from excessive usage by one or more tenants. Additionally, as the number of instances that support these services is dynamic and varies based on load, the need arises to perform coordinated rate limiting on a per tenant, per endpoint\u00a0basis.&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/coordinated-rate-limiting-in-microservices\/\">Continue reading <span class=\"screen-reader-text\">Coordinated Rate Limiting in Microservices<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-337","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":228,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/realtime-predictions-in-a-multitenant-environment\/","url_meta":{"origin":337,"position":0},"title":"Realtime Predictions in a Multitenant Environment","date":"February 2, 2021","format":false,"excerpt":"Real-time 
Predictions in a Multitenant EnvironmentIntroductionThe Einstein Vision and Language Platform Team at Salesforce enables data management, training, and prediction for deep learning-based Vision and Language use cases. Consumers of the platform can use our API gateway to upload datasets, train those datasets, and ultimately use the models generated out\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":559,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/30\/how-meta-enables-de-identified-authentication-at-scale\/","url_meta":{"origin":337,"position":1},"title":"How Meta enables de-identified authentication at scale","date":"March 30, 2022","format":false,"excerpt":"Data minimization \u2014 collecting the minimum amount of data required to support our services \u2014 is one of our core principles at Meta as we continue developing new privacy-enhancing technologies (PETs). We are constantly seeking ways to improve privacy and protect user data on our family of products. Previously, we\u2019ve\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":274,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/foqs-scaling-a-distributed-priority-queue\/","url_meta":{"origin":337,"position":2},"title":"FOQS: Scaling a distributed priority queue","date":"August 31, 2021","format":false,"excerpt":"We will be hosting a talk about our work on Scaling a Distributed Priority Queue during our virtual Systems @Scale event at 11 am PT on Wednesday, February 24, followed by a live Q&A session. Please submit any questions to systemsatscale@fb.com before the event. 
The entire Facebook ecosystem is powered\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":483,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/lessons-learned-using-spring-data-redis\/","url_meta":{"origin":337,"position":3},"title":"Lessons Learned using Spring Data Redis","date":"October 5, 2021","format":false,"excerpt":"Context Our Commerce Cloud team that is in charge of the Omnichannel Inventory service uses Redis as a remote cache to store data that lends itself for caching. The remote cache allows our multiple processes to get a synchronized and single view of the cached data. (See our previous blog\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":628,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/06\/viewing-the-world-as-a-computer-global-capacity-management\/","url_meta":{"origin":337,"position":4},"title":"Viewing the world as a computer: Global capacity management","date":"September 6, 2022","format":false,"excerpt":"Meta currently operates 14 data centers around the world. This rapidly expanding global data center footprint poses new challenges for service owners and for our infrastructure management systems. Systems like Twine, which we use to scale cluster management, and RAS, which handles perpetual region-wide resource allocation, have provided the abstractions\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":661,"url":"https:\/\/fde.cat\/index.php\/2022\/12\/12\/open-sourcing-anonymous-credential-service\/","url_meta":{"origin":337,"position":5},"title":"Open-sourcing Anonymous Credential Service","date":"December 12, 2022","format":false,"excerpt":"Meta has open-sourced Anonymous Credential Service (ACS), a highly available multitenant service that allows clients to authenticate in a de-identified manner. 
ACS enhances privacy and security while also being compute-conscious. By open-sourcing and fostering a community for ACS, we believe we can accelerate the pace of innovation in de-identified authentication.\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/337","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=337"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/337\/revisions"}],"predecessor-version":[{"id":373,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/337\/revisions\/373"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=337"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=337"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=337"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}