SRE Weekly Issue #419

How Figma’s Databases Team Lived to Tell the Scale

Our nine month journey to horizontally shard Figma’s Postgres stack, and the key to unlocking (nearly) infinite scalability.

Retrofitting sharding is a huge undertaking.

Sammy Steele — Figma

Moving fast breaks things: the importance of a staging environment

Ride along as this company evolves from constantly shipping directly to production to a robust staging and internal canary deployment system.

Greg Foster — Graphite

Post mortem: we took 124 seconds from you, here’s 378 back

A lighthearted but still detail-filled take on a post-incident analysis for a short production outage.

Greg Foster — Graphite

Building Application Reliability on Top of Infrastructure Unreliability

This one has an interesting discussion of the nature of reliability and the impact of multiple services on overall reliability, including possible mathematical models to use.

Fitz — Temporal
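
As a rough illustration of the kind of mathematical model the blurb above mentions, here is a minimal sketch of the two textbook composition rules: dependencies in series, and redundant replicas in parallel. It is not taken from the linked article. The function names and sample figures are hypothetical, and both rules assume failures are independent, which real systems often violate.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request path that needs every dependency up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability when any one of several redundant replicas is enough.

    Assumes independent failures: the group is down only if all replicas are down.
    """
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Hypothetical figures: three "three nines" dependencies in series.
    print(f"serial:   {serial_availability([0.999, 0.999, 0.999]):.6f}")
    # Hypothetical figures: two 99% replicas, either of which can serve.
    print(f"parallel: {parallel_availability([0.99, 0.99]):.6f}")
```

Under these assumptions, three 99.9% dependencies in series yield roughly 99.7% overall, which is why each added hard dependency lowers the ceiling on end-to-end reliability.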

#30 Clearing Delusions in Observability (with David Caudill)

This episode of the SREPath Podcast covers a variety of themes around observability and SLOs. There’s a great text-based summary if that’s your preference.

Ash Patel — SREPath

Linux Crisis Tools

This piece argues that you should install system debugging tools on your production systems now, because it will be really hard to install them live when you need them.

Brendan Gregg

How much are their 9’s worth?

Following on from a previous article about the squiggliness of availability numbers, this article evaluates SLAs from 4 major companies to try to divine what they actually mean.

Ross Brodbeck

SLO formulas implementation in PromQL step by step

I want to present real-life examples of both availability and latency SLOs, as they are more nuanced than they may initially appear. Also, I find it worthwhile sharing a detailed guide as it showcases uncommon uses of PromQL and demonstrates the language’s versatility.

Michał Kaźmierczak
