Our nine month journey to horizontally shard Figma’s Postgres stack, and the key to unlocking (nearly) infinite scalability.
Retrofitting sharding is a huge undertaking.
Sammy Steele — Figma
Ride along as this company evolves from constantly shipping directly to production to a robust staging and internal canary deployment system.
Greg Foster — Graphite
A lighthearted but still detail-filled take on a post-incident analysis for a short production outage.
Greg Foster — Graphite
This one has an interesting discussion of the nature of reliability and the impact of multiple services on overall reliability, including possible mathematical models to use.
Fitz — Temporal
This episode of the SREPath Podcast covers a variety of themes around observability and SLOs. There’s a great text-based summary if that’s your preference.
Ash Patel — SREPath
This piece argues that you should install system debugging tools on your production systems now, because it’s going to be really hard to do it live when you need them.
Brendan Gregg
Following on from a previous article about the squiggliness of availability numbers, this article evaluates SLAs from four major companies to try to divine what they actually mean.
Ross Brodbeck
I want to present real-life examples of both availability and latency SLOs, as they are more nuanced than they may initially appear. Also, I find it worthwhile sharing a detailed guide as it showcases uncommon uses of PromQL and demonstrates the language’s versatility.
Michał Kaźmierczak
SRE WEEKLY