Selena Deckelmann's blog

A Failure and Distributed Systems YouTube Hole

posted on Sep 30, 2015
tags: distributed_systems
categories: work

I was thinking a lot about a conversation I had with a coworker about failure, and about an email thread on naming a team, and I ended up diving deep into a series of blog posts and videos. I mostly wanted to remember what I read and watched today, but thought maybe you, dear reader, would be interested too.

This started out just being videos, but there were quite a few blog posts too.

Here goes:

This kicked things off for me. I remember when this came out and was really happy to have it articulated. My first experience with root cause analysis was not a pretty one. Systems thinking is a skill, and not one that is taught enough.

This brought to mind the testing research I read when I was giving talks about failure, and it also covered some things I didn't know about QA's impact on organizations: basically, the unintended consequences of siloing QA off from dev, and a strong recommendation not to have a separate QA org (and of course it's an argument for agile). In practice, I find that build and test automation requires specialized skills. Maybe orgs can think of these specialists as coaches for the rest of the people in the org? The problem I see there is that we've all got a need to create, not just coach and teach. How do you find balance?

This is an old argument but relevant to naming operations orgs. I like being reminded of what "the other side" thinks, and being able to read about it before I have to enter into these kinds of discussions.

I was also reading this while trying to think up alternatives to "operations" as a team name.

I loved the chart here. There is a "lean organizations" book out there now that has a section on this. I'm not sure I want to read the book, but this is a handy reminder of what you sacrifice when you go to command-and-control models for risk management.

I have not seen these used much in practice, but I do like the idea. I am trying to send out post-mortems about tree-closing incidents involving the team I am managing. I have two more to do this week...

This was heart-warming, and it reminded me of an old tool I wrote, which I tweeted briefly about. I love the "trust your operations team" advice, and the idea that we should be designing things so that we can trust more and restrict less.

Great PDF summarizing some of the points made in the talk. And a reminder: "Post-accident attribution to a 'root cause' is fundamentally wrong."

Great, simple model for visualizing the processes at work between human overwork, budget constraints, and performance. The key learning is about the space between your margin of error and accidents -- we are constantly pushing against that boundary.

"Build silence into systems" -- yes! Also a bae/Peter Bailis reference. Who btw is looking for grad students!

The big thing here -- relating back to the start of this binge -- is about designing your data. That is how you survive microservices.

A keynote from Astrid Atkinson. She talks about designing things so that engineers can manage their own systems. So great. "Keep in mind what is costing time." Yup. "If you talk to a server, timeout." "If you retry, exponentially backoff with jitter." "Systems should perform reasonably under degraded circumstances." "Somewhere between 1 and 5 identical services you're going to have to take a stand." "Strongly endorse boring infrastructure choices."
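
The timeout-and-retry advice is concrete enough to sketch in code. Here's a minimal Python illustration of both rules together -- a timeout on every call, and retries with exponential backoff plus "full jitter." The function name and parameters are mine, not from the talk:

```python
import random
import time

import requests  # third-party HTTP client; any client that supports timeouts works


def call_with_backoff(url, max_retries=5, base_delay=0.1, max_delay=10.0, timeout=2.0):
    """GET a URL with a timeout, retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            # "If you talk to a server, timeout."
            return requests.get(url, timeout=timeout)
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            # "If you retry, exponentially backoff with jitter."
            # Full jitter: sleep a random amount up to an exponentially
            # growing cap, so failed clients don't all retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

The jitter is the part people skip: without it, every client that failed at the same moment retries at the same moment, hammering the recovering server in synchronized waves.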

"It's not just enough to build it, you have to migrate existing workloads."

"Move the biggest customer first."

Super detailed, awesome post-mortem on the DynamoDB outage. Camille mentioned it in her Strange Loop talk.


Have some feedback? Corrections? Ideas for other posts? Contact me @selenamarie.