The Problem: In 2015, Salesforce Commerce Cloud (which was then called Demandware) was running a typical open source Grafana/Graphite/Carbon stack to store and visualize time series metrics of the Java application clusters powering our e-commerce business. Our JVM clusters at the time produced around 500k time series metrics per minute. Even though our organization needed us… Continue reading CarbonJ: A high performance, high-scale, drop-in replacement for carbon-cache and carbon-relay
Month: October 2021
Kangaroo: A new flash cache optimized for tiny objects
What the research is: Kangaroo is a new flash cache that enables more efficient caching of tiny objects (objects that are ~100 bytes or less) and overcomes the challenges presented by existing flash cache designs. Since Kangaroo is implemented within CacheLib, Facebook’s open source caching engine, developers can use Kangaroo through CacheLib’s API to build… Continue reading Kangaroo: A new flash cache optimized for tiny objects
SRE Weekly Issue #293
A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo: https://rootly.io/?utm_source=sreweekly Articles The Downside of Hospitals Becoming “Highly Reliable” It’s one thing to… Continue reading SRE Weekly Issue #293
Connector framework: A generic approach to crawl activities in real-time
Authors: Jayanth Parayil Kumarji, Evan Jiang, Kevin Terusaki, Zhidong Ke, Heng Zhang, Jeff Lowe, Yifeng Liu, Priyadarshini Mitra. Introduction: Sales Cloud empowers our customers to make quick and well-informed decisions, with all of the tools they need to manage their selling process. The features we work on in the Activity Platform team, spanning Sales Cloud… Continue reading Connector framework: A generic approach to crawl activities in real-time
Autonomous testing of services at scale
Enabling developers to prototype, test, and iterate on new features quickly is important to Facebook’s success. To do this effectively, it’s key to have a stable infrastructure that doesn’t introduce unnecessary friction. This gets significantly more challenging when the infrastructure in question must also scale to support more than 3 billion people around the world,… Continue reading Autonomous testing of services at scale
Facebook engineers receive 2021 IEEE Computer Society Cybersecurity Award for static analysis tools
Until recently, static analysis tools weren’t seen by our industry as a reliable element of securing code at scale. After nearly a decade of investing in refining these systems, I’m so proud to celebrate our engineering teams today for being awarded the IEEE Computer Society’s Cybersecurity Award for Practice for development and deployment of static… Continue reading Facebook engineers receive 2021 IEEE Computer Society Cybersecurity Award for static analysis tools
RTMP Go Away: Lossless reconnections for live streaming
What it is: Real Time Messaging Protocol (RTMP) is a popular media streaming protocol that uses Transmission Control Protocol (TCP) persistent connections. When a connection between a live-streaming client and the platform is interrupted, data from the live event is lost until the client can reconnect to a new server. RTMP Go Away is a… Continue reading RTMP Go Away: Lossless reconnections for live streaming
Github Actions Security Best Practices
Introduction: In the world of Continuous Integration and Continuous Deployment, GitHub Actions provide a nifty edge to quickly build end-to-end automation right into the repository. This makes integration of Actions into an organization’s GitHub repositories pretty straightforward and convenient. GitHub Actions bring velocity to the Software Development Lifecycle. However, if it is swiftly adopted without… Continue reading Github Actions Security Best Practices
SRE Weekly Issue #292
Articles Four lessons every company should learn from the back-to-back Facebook outages… Continue reading SRE Weekly Issue #292
How to ETL at Petabyte-Scale with Trino
Trino (formerly known as PrestoSQL) is widely appreciated as a fast distributed SQL query engine, but there is precious little information online about using it for batch extract, transform, and load (ETL) ingestion (outside of the original Facebook paper), particularly at petabyte+ scale. After deciding to use Trino as a key piece of Salesforce’s Big… Continue reading How to ETL at Petabyte-Scale with Trino