Viewing the world as a computer: Global capacity management

Meta currently operates 14 data centers around the world. This rapidly expanding global data center footprint poses new challenges for service owners and for our infrastructure management systems. Systems like Twine, which we use to scale cluster management, and RAS, which handles perpetual region-wide resource allocation, have provided the abstractions and automation necessary for service…

Categorized as Technology

SRE Weekly Issue #337

Thanks for all the vacation well-wishes! It was really great and relaxing. Take vacations, it’s important for reliability! While I was out, I shipped the past two issues with content prepared in advance, and without the Outages section. This gave me a chance to really think hard about the value of the…

Categorized as SRE

Introducing Velox: An open source unified execution engine

Meta is introducing Velox, an open source unified execution engine aimed at accelerating data management systems and streamlining their development. Velox is under active development. Experimental results from our paper published at the International Conference on Very Large Data Bases (VLDB) 2022 show how Velox improves efficiency and consistency in data management systems. Velox helps…

Categorized as Technology

Hyperpacks: Using Buildpacks to Build Hyperforce

At Salesforce, we regularly use our products and services to scale our own business. One example is Buildpacks, which we created nearly a decade ago and which is now a part of Hyperforce. Hyperpacks are an innovative new way of using Cloud Native Buildpacks (CNB) to manage our public cloud infrastructure. Buildpacks were created to help…

Categorized as Technology

Improving Meta’s SLO workflows with data annotations

When we focus on minimizing errors and downtime here at Meta, we place a lot of attention on service-level indicators (SLIs) and service-level objectives (SLOs). Consider Instagram, for example. There, SLIs represent metrics from different product surfaces, like the volume of error response codes to certain endpoints, or the number of successful media uploads. Based…

Categorized as Technology

SRE Weekly Issue #336

A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set): https://rootly.com/demo/ Articles: What it’s like to…

Categorized as SRE

SRE Weekly Issue #335

A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set): https://rootly.com/demo/ Articles: How an incident transformed…

Categorized as SRE

SRE Weekly Issue #334

I’ll be on vacation starting next Sunday (yay!). That means the next two issues will be prepared in advance, so there won’t be an Outages section. A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging…

Categorized as SRE

Architectural Principles for High Availability on Hyperforce

Infrastructure and software failures will happen. We idolize four 9s (99.99%) availability. We know we need to optimize and improve Recovery-Time-Objective (RTO, the time it takes to restore service after a service disruption) and Recovery-Point-Objective (RPO, the acceptable data loss measured in time). But how can we actually deliver high availability for our customers? One…

Categorized as Technology