SRE Weekly Issue #254

Articles

Coinbase Incident Post Mortem: January 6–7, 2021

This one’s juicy. At one point, the front-end was blocked up, so the back-end saw less traffic and scaled down. Then when the traffic came flooding back, the back-end was ill-prepared. We can all learn from this.

Coinbase

Soar: Simulation for Observability, reliAbility, and secuRity

Cloudflare has what amounts to a sophisticated staging environment for testing new code.

Yan Zhai — Cloudflare

Failing to make progress under excess request load

Sometimes rolling back doesn’t actually get you back to a good state, especially when there’s pent-up demand.

Rachel By the Bay

Google Cloud Issue Summary — Google Meet — 2021-01-08

Here’s Google’s follow-up on a Google Meet outage earlier this month.

Google

The Next Gen Database Servers Powering Let’s Encrypt

Those are some seriously big database servers.

Josh Aas and James Renken — Let’s Encrypt

Incident Management in 2021: from Basics to Best Practices

A great general overview of all aspects of incident response, including definitions and best practices.

Better Uptime

Using GPT-3 for plain language incident root cause from logs

Check out what happens when you unleash a generalized language model AI on some log messages related to an incident.

Larry Lancaster — Zebrium

Taming Operational Load with VMware CRE

The CRE team at VMware undertook a project to find and reduce toil. Note that “with VMware CRE” does not mean “with some product named VMware CRE™”.

Gustavo Franco — VMware

Slack RCA for outage on January 4, 2021

This is Slack’s RCA for their outage earlier this month. This is a great example of a complex incident with many contributing factors — certainly no single “root cause” here.

Slack

Outages

SRE WEEKLY

A message from our sponsor, StackHawk:

Articles

Outages

Related