What the research is: Jump-Start is a new approach for improving the performance of virtual machines at scale. Virtual machines are a popular way to implement the programming languages used to build applications, including large-scale websites like Facebook and Instagram. However, virtual machines incur well-known performance overhead in terms of the amount… Continue reading Boosting the performance of virtual machines with Jump-Start
Month: August 2021
SRE Weekly Issue #259
A message from our sponsor, StackHawk: Mark your calendars! The first conference for OWASP ZAP users is taking place March 9. Get your free ticket to connect with other ZAP users and learn about the project’s roadmap: http://sthwk.com/zapcon-sreweekly Articles: Increment: Reliability. This quarter’s Increment issue is about Reliability, and I haven’t had… Continue reading SRE Weekly Issue #259
Mitigating the effects of silent data corruption at scale
What the research is: Silent data corruption, or data errors that go undetected by the larger system, is a widespread problem for large-scale infrastructure systems. This type of corruption can propagate across the stack and manifest as application-level problems. It can also result in data loss and require months to debug and resolve. This work… Continue reading Mitigating the effects of silent data corruption at scale
FOQS: Scaling a distributed priority queue
We will be hosting a talk about our work on Scaling a Distributed Priority Queue during our virtual Systems @Scale event at 11 am PT on Wednesday, February 24, followed by a live Q&A session. Please submit any questions to systemsatscale@fb.com before the event. The entire Facebook ecosystem is powered by thousands of distributed systems… Continue reading FOQS: Scaling a distributed priority queue
SRE Weekly Issue #272
A message from our sponsor, StackHawk: See how automated security testing can change how your teams find and fix security vulnerabilities. http://sthwk.com/security-automation Articles: [Salesforce] Multi-Instance Service Disruption on May 11-12, 2021. Salesforce has posted a ton of information about their major outage two weeks ago. It involved a change to their DNS… Continue reading SRE Weekly Issue #272
SRE Weekly Issue #271
A message from our sponsor, StackHawk: Join StackHawk on Tuesday, May 25 for a hands-on authenticated security testing workshop. Follow along as we walk through three common authentication scenarios step-by-step. Register: http://sthwk.com/auth-workshop Articles: Naming names in incident writeups. Should you keep things anonymous (“an engineer”), or should you say exactly who did… Continue reading SRE Weekly Issue #271
Peering automation at Facebook
Traffic on the internet travels across many different kinds of links. A fast and reliable way to exchange traffic between different networks and service providers is through peering. Initially, we managed peering via a time-intensive manual process. Reliable peering is essential for Facebook and for everyone’s internet use. But there is no industry standard for… Continue reading Peering automation at Facebook
Designing Accessible Builder Apps
Global Accessibility Awareness Day (GAAD) highlights the importance of digital access and inclusion for over 1 billion people with disabilities around the world. We enthusiastically celebrate GAAD at Salesforce because it directly speaks to our role in creating a more inclusive and just world. The World Health Organization defines disability as “…a mismatched interaction between… Continue reading Designing Accessible Builder Apps
SRE Weekly Issue #270
A message from our sponsor, StackHawk: APIs are not only the backbone of modern application architecture, but they are also a key part of security. Discover what API security testing is, how it works, and get started using API security tools: http://sthwk.com/API-security Articles: Thundering herds, noisy neighbours, and retry storms. This is… Continue reading SRE Weekly Issue #270
Running Border Gateway Protocol in large-scale data centers
What the research is: A first-of-its-kind study that details the scalable design, software implementation, and operations of Facebook’s data center routing design, based on Border Gateway Protocol (BGP). BGP was originally designed to interconnect autonomous internet service providers (ISPs) on the global internet. Highly scalable and widely acknowledged as an attractive choice for routing, BGP… Continue reading Running Border Gateway Protocol in large-scale data centers