Architectural Principles for High Availability on Hyperforce

Infrastructure and software failures will happen. We idolize four 9s (99.99%) availability. We know we need to optimize and improve Recovery-Time-Objective (RTO, the time it takes to restore service after a service disruption) and Recovery-Point-Objective (RPO, the acceptable data loss measured in time). But how can we actually deliver high availability for our customers? One…

Categorized as Technology

Scaling data ingestion for machine learning training at Meta

Many of Meta’s products, such as search and language translations, utilize AI models to continuously improve user experiences. As the performance of hardware we use to support training infrastructure increases, we need to scale our data ingestion infrastructure accordingly to handle workloads more efficiently. GPUs, which are used for training infrastructure, tend to double in…

Categorized as Technology

SRE Weekly Issue #333

View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set): https://rootly.com/demo/ Articles Is SRE Just Ops…

Categorized as SRE

SRE Weekly Issue #332

Articles: How Razorpay’s Notification Service…

Categorized as SRE

Five security principles for billions of messages across Meta’s apps

At Meta, our messaging apps help billions of people around the world stay connected to those who matter most to them. This scale brings potential threats from criminals and hackers, so we have a responsibility to keep people and their data safe. We’re sharing a set of principles to ensure that security is central to…

Categorized as Technology

Programming languages endorsed for server-side use at Meta

– Supporting a programming language at Meta is a very careful and deliberate decision.
– We’re sharing our internal programming language guidance that helps our engineers and developers choose the best language for their projects.
– Rust is the latest addition to Meta’s list of supported server-side languages.
At Meta, we use many different programming…

Categorized as Technology

It’s time to leave the leap second in the past

The leap second concept was first introduced in 1972 by the International Earth Rotation and Reference Systems Service (IERS) in an attempt to periodically update Coordinated Universal Time (UTC) due to imprecise observed solar time (UT1) and the long-term slowdown in the Earth’s rotation. This periodic adjustment mainly benefits scientists and astronomers as it allows…

Categorized as Technology

SRE Weekly Issue #331

Articles: DisasterCast – A podcast…

Categorized as SRE