{"id":287,"date":"2021-08-31T14:40:23","date_gmt":"2021-08-31T14:40:23","guid":{"rendered":"https:\/\/fde.cat\/?p=287"},"modified":"2021-08-31T14:40:23","modified_gmt":"2021-08-31T14:40:23","slug":"optimizing-eks-networking-for-scale","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/optimizing-eks-networking-for-scale\/","title":{"rendered":"Optimizing EKS networking for scale"},"content":{"rendered":"<p>Authors: <a href=\"https:\/\/medium.com\/u\/3f919429e897\">Savithru Lokanath<\/a>, <a href=\"https:\/\/medium.com\/u\/910da6877fbf\">Arpeet Kale<\/a>, <a href=\"https:\/\/medium.com\/u\/d358f0149606\">Vaishnavigalgali<\/a><\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1000\/1*mvJ1xszHAPWV1eszvmUpjA.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>Elastic Kubernetes Service (EKS) is a service under the <a href=\"https:\/\/aws.amazon.com\/\">Amazon Web Services (AWS)<\/a><em> <\/em>umbrella that provides managed <a href=\"https:\/\/kubernetes.io\/\">Kubernetes<\/a> service. It significantly reduces the time to deploy, manage, and scale the infrastructure required to run production-scale Kubernetes clusters.<\/p>\n<p>AWS has simplified EKS networking significantly with its container network interface (CNI) plugin. With no network overlays, a Kubernetes pod (container) gets an IP address from the same Virtual Private Cloud (VPC)<em> <\/em>allocated subnet as would an Elastic Compute Cloud (EC2) instance. 
This means that any workload, be it a container, Lambda function, or EC2 instance, can now talk to another without the need for Network Address Translation (NAT).<\/p>\n<p>Each EC2 instance can have multiple elastic network interfaces (ENIs) attached to it, and each interface can be assigned multiple IP addresses from the VPC\u2019s private address\u00a0space.<\/p>\n<ul>\n<li>The first interface, called the primary interface, is assigned an IP address called the primary IP\u00a0address.<\/li>\n<li>All other interfaces are called secondary interfaces and hold secondary IP addresses.<\/li>\n<\/ul>\n<p>The CNI plugin, which runs as a pod on each Kubernetes node (EC2 instance), is responsible for the networking on each node. By default, it assigns each new pod a private IP address from the VPC private subnet that the Kubernetes node is in, attaching the IP address to one of the secondary ENIs of the\u00a0node.<\/p>\n<p>The plugin consists of two primary components:<\/p>\n<ul>\n<li>The <strong>IPAM<\/strong> daemon (ipamD), responsible for attaching ENIs to instances, assigning secondary IP addresses to these ENIs, etc.<\/li>\n<li>The <strong>CNI plugin,<\/strong> responsible for setting up the virtual ethernet (veth) interfaces, attaching them to the pods, and setting up the bridge between the veth and the host\u00a0network<\/li>\n<\/ul>\n<h3>Problem<\/h3>\n<p>By default, AWS reserves a large pool of IP addresses for an EKS (Kubernetes) node that is always available to be used by the node. This pool of IPs, also known as the <strong>\u201cWarm-Pool,\u201d<\/strong> is attached to the secondary interfaces of the node; its size is determined by the EC2 instance type, and it cannot be shared with any other AWS service or node.<\/p>\n<p>For example, an instance of type <em>m3.2xlarge<\/em> can have up to four ENIs, and each ENI can be assigned up to 30 IP addresses. 
With the default CNI configuration, when a worker node first joins the cluster, ipamD ensures that, along with an active (primary) ENI holding the primary and secondary IP addresses, there\u2019s also a spare ENI attached to the node in standby mode at all times. The standby ENI and its IP addresses will be utilized when there are no more secondary IP addresses available on the primary ENI. Larger instance types have a higher limit on the number of ENIs that can be attached and hence can hold more secondary IP addresses (for more information, see the <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-eni.html#AvailableIpPerENI\">IP Addresses Per Network Interface Per Instance Type<\/a> documentation).<\/p>\n<p>So the total number of IPs reserved for an instance of type <em>m3.2xlarge<\/em> at any point in time will be 60 IP addresses:<\/p>\n<ul>\n<li><strong>Primary ENI (active):<\/strong> 1 primary IP + 29 secondary IPs<\/li>\n<li><strong>Secondary ENI (standby):<\/strong> 1 primary IP + 29 secondary IPs<\/li>\n<\/ul>\n<p>This is a particularly large number of IPs for a single Kubernetes node, which might run only a few resource-intensive applications and daemon set pods. We try to size our EC2 instances so that we run a single replica of the application on a given node at any time (although, of course, some exceptions apply). Adding all the daemon sets (monitoring, logging, networking, etc.), the average number of pods on a node in our environment hovers around 5\u20137, hence requiring no more than 10 IPs per node.<\/p>\n<p>With this default behavior, roughly 50 IPs are unused, which results in inefficiencies and, if not properly planned or triaged, can quickly lead to VPC subnet exhaustion.<\/p>\n<h3>Solution<\/h3>\n<p>AWS provides a couple of CNI config variables that you can set to avoid such scenarios. 
The config variable WARM_IP_TARGET specifies the number of free IPs that ipamD should attempt to keep available for pod assignment on the node at all times. WARM_ENI_TARGET, on the other hand, specifies the number of ENIs ipamD must keep ready to be attached to the node at all\u00a0times.<\/p>\n<p><strong><em>NOTE: <\/em><\/strong><em>AWS defaults the value of WARM_ENI_TARGET to 1 in the CNI config.<\/em><\/p>\n<p>For example, if the WARM_IP_TARGET variable is set to a value of 10, then ipamD attempts to keep 10 free IP addresses available at all times. If the ENIs on the node are unable to provide these free addresses, ipamD attempts to allocate more interfaces until WARM_IP_TARGET free IP addresses are available in the reserved\u00a0pool.<\/p>\n<pre><strong>$ diff -up aws-k8s-cni-without-warm-ip-pool.yaml aws-k8s-cni-with-warm-ip-pool.yaml<\/strong><br>+ - \"name\": \"WARM_IP_TARGET\"<br>+   \"value\": \"10\"<\/pre>\n<p>To understand this better, let\u2019s take a look at the scenario where we launch an <em>m3.2xlarge<\/em> (supports a maximum of 4 ENIs and 30 IPs per ENI) EKS node with WARM_IP_TARGET set to 10. Now ipamD will ensure that there are 10 free IPs on the primary ENI at any point in time, and, since there\u2019s at least one pod using a secondary IP on the primary ENI, ipamD will honor the WARM_ENI_TARGET=1 config and a second ENI will be attached, holding 10 more IPs.<\/p>\n<p>With this setting, we have 20 IP addresses reserved at any point in time, a savings of 40 IPs from the previous\u00a0setting:<\/p>\n<ul>\n<li><strong>Primary ENI (active):<\/strong> 1 primary IP + 9 secondary IPs<\/li>\n<li><strong>Secondary ENI (standby):<\/strong> 1 primary IP + 9 secondary IPs<\/li>\n<\/ul>\n<p>If the number of pods scheduled to run on the node increases above 10, then ipamD will start allocating IP addresses to the primary ENI in blocks of 10. 
This behavior continues until it has allocated 30 IPs, the maximum number of IPs that an ENI can hold. If a new pod gets scheduled on this node, then ipamD will start allocating IPs from the secondary ENI, and a third ENI is added to the node due to the WARM_ENI_TARGET config variable. Once the 30 IPs on the secondary ENI are depleted, IPs will be allocated from the third ENI.<\/p>\n<p>Optimizing further, let\u2019s set the WARM_ENI_TARGET environment variable to 0 in the CNI\u00a0config:<\/p>\n<pre><strong>$ diff -up aws-k8s-cni-with-warm-ip-pool.yaml aws-k8s-cni-with-warm-ip-pool-without-eni-pool.yaml<\/strong><br>  - \"name\": \"WARM_ENI_TARGET\"<br>-   \"value\": \"1\"<br>+   \"value\": \"0\"<\/pre>\n<p>On applying the above settings, the secondary ENI won\u2019t be kept in standby mode anymore. It will only be attached when all of the secondary IPs on the primary ENI have been consumed, thus saving 10 more\u00a0IPs:<\/p>\n<ul>\n<li><strong>Primary ENI (active):<\/strong> 1 primary IP + 9 secondary IPs<\/li>\n<\/ul>\n<p>With these two optimizations, we have now saved 50 IP addresses per node from being underutilized. The ENIs and corresponding IPs will be released back to the VPC subnet when the pods no longer exist on that\u00a0node.<\/p>\n<h3>Performance<\/h3>\n<p>Let\u2019s measure the time taken for a pod launch and see how it is affected when there\u2019s a need to attach a new ENI.<\/p>\n<p><strong><em>NOTE:<\/em><\/strong><em> The screenshots and CLI outputs below are from a test cluster that was created just for the purpose of this\u00a0blog.<\/em><\/p>\n<h4>DEFAULT SETTINGS<\/h4>\n<ul>\n<li>In this setting, WARM_ENI_TARGET=1 and WARM_IP_TARGET is not\u00a0set.<\/li>\n<li>The EKS node is of type <strong><em>m3.2xlarge<\/em><\/strong> and has at least 1 pod scheduled on it. 
Hence, during the node attach process, the instance has 2 ENIs (active and standby) attached and 60 IP addresses allocated (2 primary IPs + 2*29 secondary IPs).<\/li>\n<\/ul>\n<pre><strong>$ aws ec2 describe-instances --instance-ids i-030f0defdcee570de | jq '.Reservations[0].Instances[0].NetworkInterfaces[0].PrivateIpAddresses' | grep PrivateIpAddress | wc -l<\/strong><br>30<br><br><strong>$ aws ec2 describe-instances --instance-ids i-030f0defdcee570de | jq '.Reservations[0].Instances[0].NetworkInterfaces[1].PrivateIpAddresses' | grep PrivateIpAddress | wc -l<\/strong><br>30<\/pre>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*LblWQ-UXF9cnphiIuD-oWg.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<ul>\n<li>On creating a deployment with 15 pod replicas on the node, we observed that it took about 19 seconds to mark them as <strong><em>\u201cRunning.\u201d<\/em><\/strong><\/li>\n<\/ul>\n<pre>2020-08-20T01:19:13Z minus 2020-08-20T01:18:54Z = 19 seconds<\/pre>\n<h4>OPTIMIZED SETTINGS<\/h4>\n<ul>\n<li>Here, we set WARM_ENI_TARGET=0 and WARM_IP_TARGET=10.<\/li>\n<li>Running the same experiment, the instance now has 13 IPs allocated to it when it joins the EKS cluster (1 primary IP + 12 secondary IPs = 13 IPs). 
Note that the 3 extra secondary IPs are used by pods already running on the\u00a0node.<\/li>\n<\/ul>\n<pre><strong>$ aws ec2 describe-instances --instance-ids i-0c1dd19fa7b697e8a | jq '.Reservations[0].Instances[0].NetworkInterfaces[0].PrivateIpAddresses' | grep PrivateIpAddress | wc -l<\/strong><br>13<br><br><strong>$ aws ec2 describe-instances --instance-ids i-0c1dd19fa7b697e8a | jq '.Reservations[0].Instances[0].NetworkInterfaces[1].PrivateIpAddresses' | grep PrivateIpAddress | wc -l<\/strong><br>0<\/pre>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*wzg2zO7tbKbGaOv2zBiKWQ.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<ul>\n<li>With the WARM_IP_TARGET config variable set, the time taken to deploy 15 pods\u00a0is:<\/li>\n<\/ul>\n<pre>2020-08-20T02:01:55Z minus 2020-08-20T02:01:36Z = 19 seconds<\/pre>\n<ul>\n<li>The 10 IPs in the warm pool are consumed by the first 10 pods, but there\u2019s still a need for 5 more. So ipamD now allocates extra IPs (3 existing + 15 new + 10 warm = 28 IPs) so that we have 10 free IP addresses at all times. 
Also noticeable is that there isn\u2019t any delay in getting those extra IPs reserved for the\u00a0node.<\/li>\n<\/ul>\n<pre><strong>$ aws ec2 describe-instances --instance-ids i-0c1dd19fa7b697e8a | jq '.Reservations[0].Instances[0].NetworkInterfaces[0].PrivateIpAddresses' | grep PrivateIpAddress | wc -l<\/strong><br>28<br><br><strong>$ aws ec2 describe-instances --instance-ids i-0c1dd19fa7b697e8a | jq '.Reservations[0].Instances[0].NetworkInterfaces[1].PrivateIpAddresses' | grep PrivateIpAddress | wc -l<\/strong><br>0<\/pre>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*xuZP9iHQOhYnOvn3Rgf1Zg.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<ul>\n<li>On further scaling the deployment to 20 replicas, the five new pods will get IP addresses from the warm pool. In order to satisfy the warm pool condition (the primary ENI cannot serve 23 secondary IPs and reserve 10 free IPs at the same time), a secondary ENI will now be attached to the instance.<\/li>\n<\/ul>\n<pre><strong>$ aws ec2 describe-instances --instance-ids i-0c1dd19fa7b697e8a | jq '.Reservations[0].Instances[0].NetworkInterfaces[0].PrivateIpAddresses' | grep PrivateIpAddress | wc -l<\/strong><br>28<br><br><strong>$ aws ec2 describe-instances --instance-ids i-0c1dd19fa7b697e8a | jq '.Reservations[0].Instances[0].NetworkInterfaces[1].PrivateIpAddresses' | grep PrivateIpAddress | wc -l<\/strong><br>6<\/pre>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*Gd7l28qXU_8fDPzo_J4L5g.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<ul>\n<li>With the WARM_IP_TARGET config variable set and the deployment scaled to 20 replicas, the time taken to deploy 20 pods\u00a0is:<\/li>\n<\/ul>\n<pre>2020-08-20T02:11:29Z minus 2020-08-20T02:11:09Z = 20 seconds<\/pre>\n<ul>\n<li>So we lost about a second when there\u2019s a need to attach a new ENI. 
But on the other hand, we saved 50 IPs per node, which could be used to serve other EKS or EC2 workloads.<\/li>\n<\/ul>\n<h3>Conclusion<\/h3>\n<p>Networking in AWS is generally simple but can get complicated if things are not configured correctly. VPC subnet exhaustion is a fairly common problem for many who run Kubernetes in the cloud (overlay networking mode excluded). It\u2019s even more of a challenge when combined with the <a href=\"https:\/\/github.com\/aws\/containers-roadmap\/issues\/170\">inability to add or update subnets registered with the EKS control plane<\/a>.<\/p>\n<p>This optimization helps minimize the impact of IP exhaustion when running production-scale Kubernetes clusters on\u00a0AWS.<\/p>\n<hr>\n<p><a href=\"https:\/\/engineering.salesforce.com\/optimizing-eks-networking-for-scale-1325706c8f6d\">Optimizing EKS networking for scale<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Authors: Savithru Lokanath, Arpeet Kale, Vaishnavi Galgali Elastic Kubernetes Service (EKS) is a service under the Amazon Web Services (AWS) umbrella that provides a managed Kubernetes service. It significantly reduces the time to deploy, manage, and scale the infrastructure required to run production-scale Kubernetes clusters. 
AWS has simplified EKS networking significantly with its container network interface (CNI)&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/optimizing-eks-networking-for-scale\/\">Continue reading <span class=\"screen-reader-text\">Optimizing EKS networking for scale<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-287","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":268,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/zero-downtime-node-patching-in-a-kubernetes-cluster\/","url_meta":{"origin":287,"position":0},"title":"Zero Downtime Node Patching in a Kubernetes Cluster","date":"August 31, 2021","format":false,"excerpt":"Authors: Vaishnavi Galgali, Arpeet Kale, Robert\u00a0XueIntroductionThe Salesforce Einstein Vision and Language services are deployed in an AWS Elastic Kubernetes Service (EKS) cluster. One of the primary security and compliance requirements is operating system patching. The cluster nodes that the services are deployed on need to have regular operating system updates.\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":900,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/22\/data-clouds-lightning-fast-migration-from-amazon-ec2-to-kubernetes-in-6-months\/","url_meta":{"origin":287,"position":1},"title":"Data Cloud\u2019s Lightning-Fast Migration: From Amazon EC2 to Kubernetes in 6 Months","date":"July 22, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we delve into the journeys of distinguished engineering leaders. Today, we feature Archana Kumari, Director of Software Engineering at Salesforce. 
Archana leads our India-based Data Cloud Compute Layer team, which played a pivotal role in a recent transition from Amazon EC2 to Kubernetes for\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":279,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/notary-a-certificate-lifecycle-management-controller-for-kubernetes\/","url_meta":{"origin":287,"position":2},"title":"Notary: A Certificate Lifecycle Management Controller for Kubernetes","date":"August 31, 2021","format":false,"excerpt":"Authors: Vaishnavi Galgali, Savithru Lokanath, Arpeet\u00a0KaleIntroductionAll services in the Einstein Vision and Language Platform use TLS\/SSL certificates to encrypt communication between microservices. The certificates are generated in AWS Certificate Manager (ACM) and stored in the AWS Secrets Manager in the form of keystores and truststores (private and public keys). Certificate\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":454,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/looking-at-the-kubernetes-control-plane-for-multi-tenancy\/","url_meta":{"origin":287,"position":3},"title":"Looking at the Kubernetes Control Plane for Multi-Tenancy","date":"August 31, 2021","format":false,"excerpt":"The Salesforce Platform-as-a-Service Security Assurance team is constantly assessing modern compute platforms for security level and features. We use the insights from these research efforts to provide fast and comprehensive support to engineering teams who explore platform options that adequately support their security requirements. 
Unsurprisingly, Kubernetes is one of the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":284,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/hadoop-hbase-on-kubernetes-and-public-cloud-part-i\/","url_meta":{"origin":287,"position":4},"title":"Hadoop\/HBase on Kubernetes and Public Cloud (Part I)","date":"August 31, 2021","format":false,"excerpt":"Authors: Dhiraj Hegde, Ashutosh Parekh, and Prashant\u00a0MurthyAt Salesforce, we run a large number of HBase and HDFS clusters in our own data centers. More recently, we have started deploying our clusters on Public Cloud infrastructure to take advantage of the on-demand scalability available there. As part of this foray onto\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":285,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/hadoop-hbase-on-kubernetes-and-public-cloud-part-ii\/","url_meta":{"origin":287,"position":5},"title":"Hadoop\/HBase on Kubernetes and Public Cloud (Part II)","date":"August 31, 2021","format":false,"excerpt":"The first part of this two part blog provided an introduction to concepts in Kubernetes and Public Cloud that are relevant to stateful application management. We also covered how Kubernetes and Hadoop features were leveraged to provide a highly available service. 
In this second part of the blog we will\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/287","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=287"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/287\/revisions"}],"predecessor-version":[{"id":423,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/287\/revisions\/423"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=287"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=287"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=287"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}