SRE Weekly Issue #254

A message from our sponsor, StackHawk:

Need to run a standalone Kotlin app as a fat jar in a Gradle project? Check out how we handled that!
http://sthwk.com/kotlin-with-gradle

Articles

This one’s juicy. At one point, the front-end was blocked up, so the back-end saw less traffic and scaled down. Then when the traffic came flooding back, the back-end was ill-prepared. We can all learn from this.

Coinbase

Cloudflare has what amounts to a sophisticated staging environment for testing new code.

Yan Zhai — Cloudflare

Sometimes rolling back doesn’t actually get you back to a good state, especially when there’s pent-up demand.

Rachel By the Bay

Here’s Google’s follow-up on a Google Meet outage earlier this month.

Google

Those are some seriously big database servers.

Josh Aas and James Renken — Let’s Encrypt

A great general overview of all aspects of incident response, including definitions and best practices.

Better Uptime

Check out what happens when you unleash a generalized language model AI on some log messages related to an incident.

Larry Lancaster — Zebrium

The CRE team at VMware undertook a project to find and reduce toil. Note that “with VMware CRE” does not mean “with some product named VMware CRE™”.

Gustavo Franco — VMware

Slack’s RCA for their outage earlier this month is a great example of a complex incident with many contributing factors; there is certainly no single “root cause” here.

Slack

Outages
