{"id":330,"date":"2021-08-31T14:39:51","date_gmt":"2021-08-31T14:39:51","guid":{"rendered":"https:\/\/fde.cat\/?p=330"},"modified":"2021-08-31T14:39:51","modified_gmt":"2021-08-31T14:39:51","slug":"enforcing-encryption-at-scale","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/enforcing-encryption-at-scale\/","title":{"rendered":"Enforcing encryption at scale"},"content":{"rendered":"<p><span>Our infrastructure supports thousands of services that handle billions of requests per second. We\u2019ve previously discussed how we built our <\/span><a href=\"https:\/\/engineering.fb.com\/security\/service-encryption\/\"><span>service encryption infrastructure<\/span><\/a><span> to keep these globally distributed services operating securely and performantly. This post discusses the system we designed to enforce encryption policies within our network and shares some of the lessons we learned in the process. The goal of this enforcement is to catch any regression quickly and shut it off, keeping our internal traffic secure at the application level via TLS.<\/span><\/p>\n<h2><span>Organizational challenges<\/span><\/h2>\n<p><span>Implementing a transit encryption enforcement policy at Facebook scale requires careful planning and communication, in addition to the technical challenges we\u2019ll discuss in a bit. We want the site to stay up and remain reliable so the people using our services will be unaffected by and unaware of any changes to the infrastructure.<\/span><\/p>\n<p><span>Communicating the intent, specific timelines, and rollout strategy went a long way toward minimizing any potential disruptions for the thousands of teams that run services at Facebook. 
We use <\/span><a href=\"https:\/\/www.facebook.com\/workplace\"><span>Workplace<\/span><\/a><span> within Facebook, which enables us to easily distribute that information across a variety of groups with a single share button and consolidate feedback and concerns in a single place for all employees to see. We made sure to include the following:<\/span><\/p>\n<p><span>A description of the impact of our enforcement mechanism and how it might appear at the application layer<\/span><br \/>\n<span>A dashboard for engineers to see whether their traffic would be affected<\/span><br \/>\n<span>The rollout and monitoring plan<\/span><br \/>\n<span>Dedicated points of contact and a Workplace group where users could ask questions about impact and troubleshoot any issues<\/span><\/p>\n<p><span>The post required multiple discussions within the team to come up with a rollout plan, dashboard requirements, and realistic timelines to meet the goals of the project. This level of communication proved to be useful as the team gathered important feedback early in the process.\u00a0<\/span><\/p>\n<h2><span>Building our SSLWall<\/span><\/h2>\n<p><span>Hardware choke points are a natural approach to providing transparent enforcement. There are options, such as layer 7 firewalls, that let us do deep packet inspection, but executing fine-grained rollouts and the complexities of Facebook\u2019s network would make implementing such a solution a nightmare. Additionally, working at a network firewall level would introduce a much larger blast radius of impacted traffic, and a single configuration issue could end up killing off traffic that we weren\u2019t meant to touch.<\/span><\/p>\n<p><span>Our team decided to develop and deploy what is internally known as SSLWall, a system that cuts off non-SSL connections across various boundaries. 
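A system that cuts off non-SSL connections needs a cheap way to tell TLS traffic from plaintext without decrypting anything. One common trick is to look at the first bytes a client sends: every TLS connection opens with a handshake record whose header is fixed. The sketch below is only an illustration of that idea, not SSLWall's actual classifier, and the function name is ours:

```python
def looks_like_tls_handshake(first_bytes: bytes) -> bool:
    """Heuristic TLS check on the first bytes a client sends.

    A TLS connection opens with a handshake record:
      byte 0: content type 0x16 (handshake)
      byte 1: protocol version major, 0x03 for SSL 3.0 through TLS 1.3
      byte 2: protocol version minor, 0x00-0x04
    """
    return (len(first_bytes) >= 3
            and first_bytes[0] == 0x16
            and first_bytes[1] == 0x03
            and first_bytes[2] <= 0x04)

# A ClientHello record header passes; a plaintext HTTP request line does not.
print(looks_like_tls_handshake(bytes([0x16, 0x03, 0x01, 0x00, 0xc4])))  # True
print(looks_like_tls_handshake(b"GET / HTTP/1.1\r\n"))                  # False
```

A heuristic like this is also why plaintext preludes such as HTTP CONNECT and STARTTLS need explicit handling: their first bytes look like plaintext even though the connection is eventually encrypted.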
Let\u2019s dive a bit into the design decisions behind this solution.<\/span><\/p>\n<h3><span>Requirements\u00a0<\/span><\/h3>\n<p><span>We needed to be thorough when considering the requirements of a system that would potentially block traffic at such a large scale. The team came up with the following requirements for SSLWall, all of which had an impact on our design decisions:<\/span><\/p>\n<p><span>Visibility into what traffic is being blocked. Service owners needed a way to assess impacts, and our team needed to be proactive and reach out whenever we felt there was a problem brewing.<\/span><br \/>\n<span>A passive monitoring mode in which we could turn a knob to flip to active enforcement. This helps us determine impacts early on and prepare teams.<\/span><br \/>\n<span>A mechanism to allow certain use cases to bypass enforcement, such as BGP, SSH, and approved network diagnostic tools.<\/span><br \/>\n<span>Support for cases like HTTP CONNECT and STARTTLS. These are instances that do a little bit of work over plaintext before doing a TLS handshake. We have many use cases for these in our infrastructure, such as HTTP tunneling, MySQL security, and SMTP, so these must not break, especially since they eventually encrypt the data with TLS.<\/span><br \/>\n<span>Extensible configurability. We might have different requirements depending on the environment in which SSLWall operates. Additionally, having important knobs that can be tuned with little disruption means we can roll features forward or back at our own pace.<\/span><br \/>\n<span>Transparent to the application. Applications should not need to rebuild their code or incur any additional library dependencies for SSLWall to operate. The team needed the ability to iterate quickly and change configuration options independently. 
In addition, being transparent to the application means SSLWall needs to be performant and use minimal resources without having an impact on latencies.<\/span><\/p>\n<p><span>These requirements all led us down the path of managing a host-level daemon with a user-space and a kernel-level component. We needed a low-compute way to inspect all connections transparently and act on them.<\/span><\/p>\n<h3><span>eBPF<\/span><\/h3>\n<p><span>Since we wanted to inspect every connection without needing any changes at the application level, we needed to do some work in the kernel context. We <\/span><a href=\"https:\/\/ebpf.io\/\"><span>use eBPF<\/span><\/a><span> extensively, and it provides all of the capabilities needed for SSLWall to achieve its goals. We leveraged a number of technologies that eBPF provides:<\/span><\/p>\n<p><a href=\"http:\/\/man7.org\/linux\/man-pages\/man8\/tc-bpf.8.html\"><span>tc-bpf<\/span><\/a><span>: We leveraged Linux\u2019s <\/span><a href=\"http:\/\/tldp.org\/HOWTO\/Traffic-Control-HOWTO\/intro.html\"><span>traffic control<\/span><\/a><span> (TC) facility and implemented a filter using eBPF. At this layer, we are able to do some computation on a per-packet basis for packets flowing in and out of the box. TC allows us to operate on a broader range of kernels within Facebook\u2019s fleet. It wasn\u2019t the perfect solution, but it worked for our needs at the time.<\/span><br \/>\n<span>kprobes: eBPF allows us to attach programs to kprobes, so we can run some code within the kernel context whenever certain functions are called. We were interested in the <\/span><span>tcp_connect<\/span><span> and <\/span><span>tcp_v6_destroy_sock<\/span><span> functions. These functions are called when a TCP connection is established and torn down, respectively. 
Older kernels were a factor in our use of kprobes as well.<\/span><br \/>\n<span>maps: eBPF provides access to a number of map types, including arrays, bounded LRU maps, and perf events.<\/span><\/p>\n<p>Diagrams showing how kprobes, the TC filter, and our maps interact with one another when determining whether a connection needs to be blocked.<\/p>\n<h3><span>The management daemon<\/span><\/h3>\n<p><span>We built a daemon that manages the eBPF programs we install and emits logs to <\/span><a href=\"https:\/\/engineering.fb.com\/data-infrastructure\/scribe\/\"><span>Scribe<\/span><\/a><span> from our perf events. The daemon also provides the ability to update our TC filter, handles configuration changes (leveraging <\/span><a href=\"https:\/\/research.fb.com\/wp-content\/uploads\/2016\/11\/holistic-configuration-management-at-facebook.pdf\"><span>Facebook\u2019s Configerator<\/span><\/a><span>), and monitors health.<\/span><\/p>\n<p><span>Our eBPF programs are also bundled with this daemon. This makes releases easier to manage, as we only have one software unit to monitor instead of tracking separate daemon and eBPF releases. Additionally, we can modify the schema of our BPF tables, which both user space and kernel space consult, without compatibility concerns between releases.<\/span><\/p>\n<h3><span>Technical challenges<\/span><\/h3>\n<p><span>As one would expect, we encountered a number of interesting technical challenges while rolling out SSLWall at Facebook\u2019s scale. A few highlights include:<\/span><\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/TCP_Fast_Open\"><span>TCP Fast Open (TFO)<\/span><\/a><span>: We hit an interesting challenge around kprobe and TC filter execution order that was exposed by our use of TFO within the infra. 
In particular, we needed to move some of our flow tracking code to a kprobe prehandler.<\/span><br \/>\n<span>BPF program size limits: All BPF programs are subject to size and complexity limits, which may vary based on the kernel version.<\/span><br \/>\n<span>Performance: We spent many engineering cycles optimizing our BPF programs, particularly the TC filter, so that SSLWall\u2019s CPU impact on some of our critical high-QPS services with high fanout remained trivial. Identifying early exit conditions and using BPF arrays over LRUs where possible proved effective.<\/span><\/p>\n<h2><span>Transparent TLS and the long tail<\/span><\/h2>\n<p><span>With enforcement in place, we needed a way to address noncompliant services without significant engineering time. This included things like torrent clients, open source message queues, and some Java applications. While most applications use common internal libraries into which we could bake this logic, the ones that do not required a different solution.<\/span><\/p>\n<p><span>Essentially, the team was left with the following requirements for what we refer to as Transparent TLS (or TTLS for short):<\/span><\/p>\n<p><span>Transparently encrypt connections without the need for application changes.<\/span><br \/>\n<span>Avoid double encryption for existing TLS connections.<\/span><br \/>\n<span>Performance can be suboptimal for this long tail.<\/span><\/p>\n<p><span>It was clear that a proxy would help here, but we needed to ensure that the application code didn\u2019t need to change and that configuration would be minimal.<\/span><\/p>\n<p><span>We settled on the following architecture:<\/span><\/p>\n<p>Diagram of the Transparent TLS proxy architecture.<\/p>\n<p><span>The challenge with this approach is transparently redirecting application connections to the local proxy. Once again, we use BPF to solve this problem. 
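The redirect itself is simple policy logic: a hook on the connect path rewrites the destination so the application transparently talks to the local proxy instead, while connections the application already encrypts go direct. A minimal Python model of that decision follows; the proxy endpoint, port numbers, and bypass rule are illustrative assumptions, not Facebook's actual configuration:

```python
from typing import NamedTuple

class ConnectDecision(NamedTuple):
    addr: str
    port: int
    redirected: bool

# Hypothetical values: the local proxy's listening endpoint, and ports
# whose traffic the application is known to encrypt itself.
PROXY_ADDR, PROXY_PORT = "::1", 4433
TLS_PORTS = {443}

def route_connect(dest_addr: str, dest_port: int) -> ConnectDecision:
    """Toy model of a connect-time redirect policy: connections the
    application already encrypts go direct; everything else has its
    destination rewritten to point at the local proxy."""
    if dest_port in TLS_PORTS:
        return ConnectDecision(dest_addr, dest_port, False)
    return ConnectDecision(PROXY_ADDR, PROXY_PORT, True)

print(route_connect("2001:db8::1", 443))   # direct: already TLS
print(route_connect("2001:db8::1", 9090))  # rewritten to the local proxy
```

In the real system this decision runs in kernel context on the connect path, so the application never observes the rewrite.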
Thanks to the cgroup\/connect6 hook, we can intercept all <\/span><a href=\"https:\/\/man7.org\/linux\/man-pages\/man2\/connect.2.html\"><span>connect(2)<\/span><\/a><span> calls made by the application and redirect them to the proxy as needed.<\/span><\/p>\n<p>Diagram showing application and proxy logic for transparent connect.<\/p>\n<p><span>Besides leaving the application unchanged, this approach lets the BPF program make policy decisions about routing through the proxy. For instance, we optimized this flow to bypass the proxy for all TLS connections created by the application to avoid double encryption.<\/span><\/p>\n<p><span>This work on enforcement has brought us to a state where we can confidently say that our traffic is encrypted at our scale. However, our work is not yet complete. For instance, many new BPF facilities have become available that we intend to leverage as we remove old kernel support. We can also improve our transparent proxy solutions and leverage custom protocols to multiplex connections and improve performance.<\/span><\/p>\n<p><span>We\u2019d like to thank Takshak Chahande, Andrey Ignatov, Petr Lapukhov, Puneet Mehra, Kyle Nekritz, Deepak Ravikumar, Paul Saab, and Michael Shao for their work on this project.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/07\/12\/security\/enforcing-encryption\/\">Enforcing encryption at scale<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Our infrastructure supports thousands of services that handle billions of requests per second. We\u2019ve previously discussed how we built our service encryption infrastructure to keep these globally distributed services operating securely and performantly. 
This post discusses the system we designed to enforce encryption policies within our network and shares some of the lessons we learned&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/enforcing-encryption-at-scale\/\">Continue reading <span class=\"screen-reader-text\">Enforcing encryption at scale<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-330","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":462,"url":"https:\/\/fde.cat\/index.php\/2021\/09\/20\/how-whatsapp-is-enabling-end-to-end-encrypted-backups\/","url_meta":{"origin":330,"position":0},"title":"How WhatsApp is enabling end-to-end encrypted backups","date":"September 20, 2021","format":false,"excerpt":"For years, in order to safeguard the privacy of people\u2019s messages, WhatsApp has provided end-to-end encryption by default \u200b\u200bso messages can be seen only by the sender and recipient, and no one in between. Now, we\u2019re planning to give people the option to protect their WhatsApp backups using end-to-end encryption\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":224,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/building-a-secured-data-intelligence-platform\/","url_meta":{"origin":330,"position":1},"title":"Building a Secured Data Intelligence Platform","date":"February 2, 2021","format":false,"excerpt":"The Salesforce Unified Intelligence Platform (UIP) team is building a shared, central, internal data intelligence platform. Designed to drive business insights, UIP helps improve user experience, product quality, and operations. 
At Salesforce, Trust is our number one company value and building in robust security is a key component of our\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":630,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/07\/network-entitlement-a-contract-based-network-sharing-solution\/","url_meta":{"origin":330,"position":2},"title":"Network Entitlement: A contract-based network sharing solution","date":"September 7, 2022","format":false,"excerpt":"Meta\u2019s overall network usage and traffic volume has increased as we\u2019ve continued to add new services. Due to the scarcity of fiber resources, we\u2019re developing an explicit resource reservation framework to effectively plan, manage, and operate the shared consumption of network bandwidth, which will help us keep up with demand\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":322,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/consolidating-facebook-storage-infrastructure-with-tectonic-file-system\/","url_meta":{"origin":330,"position":3},"title":"Consolidating Facebook storage infrastructure with Tectonic file system","date":"August 31, 2021","format":false,"excerpt":"What the research is:\u00a0 Tectonic, our data center scale distributed file system, enables better resource utilization, promotes simpler services, and requires less operational complexity than our previous approach. Our previous storage infrastructure consisted of a set of use-case specific storage systems. 
Clusters, or instances of these storage systems, used to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":800,"url":"https:\/\/fde.cat\/index.php\/2023\/12\/07\/building-end-to-end-security-for-messenger\/","url_meta":{"origin":330,"position":4},"title":"Building end-to-end security for Messenger","date":"December 7, 2023","format":false,"excerpt":"We are beginning to upgrade people\u2019s personal conversations on Messenger to use end-to-end encryption (E2EE) by default Meta is publishing two technical white papers on end-to-end encryption: Our Messenger end-to-end encryption whitepaper describes the core cryptographic protocol for transmitting messages between clients. The Labyrinth encrypted storage protocol whitepaper explains our\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":787,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/","url_meta":{"origin":330,"position":5},"title":"Watch: Meta\u2019s engineers on building network infrastructure for AI","date":"November 15, 2023","format":false,"excerpt":"Meta is building for the future of AI at every level \u2013 from hardware like MTIA v1, Meta\u2019s first-generation AI inference accelerator to publicly released models like Llama 2, Meta\u2019s next-generation large language model, as well as new generative AI (GenAI) tools like Code Llama. 
Delivering next-generation AI products and\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/330","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=330"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/330\/revisions"}],"predecessor-version":[{"id":380,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/330\/revisions\/380"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=330"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=330"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=330"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}