Why Warning Systems Fail
Part 1: What is Observability?
Knowing More
In the fall of 2023, I read Observability Engineering by some of the folks at Honeycomb as part of the on-again, off-again book club at my local DevOps Days chapter. Reading the book (free, courtesy of Honeycomb) and discussing it with peers made me much more selective about what I scraped from our systems and pushed me to think more deeply about what I was looking for; my notes from those sessions planted the seed for what would ultimately become this article.
How do we know what we know and what we don’t know? Instead of guessing what might go wrong, an observability practice lets engineers ask arbitrary questions about the state of the system, without necessarily knowing what they will find.
- Known Knowns: things we are aware of and understand clearly. Example: predefined dashboards and static alerts triggered by known thresholds (CPU usage, memory saturation).
- Known Unknowns: things we are aware of but lack full information about. Example: failure modes on our list of possible outcomes, though we don’t know when they will occur or which components might fail.
- Unknown Knowns: things we know exist and could understand, but overlook or ignore. Example: log data or metrics that exist but go underutilized due to poor instrumentation or cognitive overload.
- Unknown Unknowns: things we don’t know exist, have not foreseen, and are blissfully unaware of. Example: novel failure modes that have not yet occurred or been imagined.
As the complexity of a system grows, failure modes become combinatorial, requiring flexible, exploratory analysis rather than rigid monitoring. Predefined dashboards and alerts based on static thresholds often fail, and tend to be instrumented after an incident rather than before. We often don’t know what information would have been relevant until an incident has occurred for the first time; once it has, we can account for it going forward, with the expectation that it might happen again.
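To make that concrete, here is a minimal sketch in Python. The field names (duration_ms, build_sha, route) and the events.ndjson file of one JSON event per request are assumptions for illustration, not any particular system’s schema; it contrasts a static threshold check with the kind of after-the-fact question that wide, context-rich events make possible:

```python
import json
from collections import Counter

# Known known: a static threshold only answers the question we thought to ask in advance.
CPU_ALERT_THRESHOLD = 0.90

def static_alert(cpu_utilization: float) -> bool:
    """Fire when CPU crosses a predefined line."""
    return cpu_utilization > CPU_ALERT_THRESHOLD

print(static_alert(0.95))  # True, but only for the one question we predefined

# Wide, context-rich events let us ask questions we didn't plan for.
# Field names here are illustrative, not a real schema.
with open("events.ndjson") as f:
    events = [json.loads(line) for line in f]

# An exploratory question, asked only after something looked wrong:
# which build and route are the slow requests concentrated in?
slow = [e for e in events if e["duration_ms"] > 2000]
print(Counter((e["build_sha"], e["route"]) for e in slow).most_common(5))
```

The point isn’t the specific query; it’s that the raw events retain enough context to answer questions that nobody wrote down before the incident.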
The glaring takeaway for me, then, is that warning systems are only effective when their signals prompt meaningful action, and yet many systems fail precisely because they generate too much noise or too little actionable intelligence.
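As one illustration of what “actionable” can mean in practice, here is a small sketch of a routing rule that refuses to page a human unless the alert carries an owner and a runbook. The Alert shape and field names are assumptions, not a real alerting schema:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative alert shape; these fields are assumptions, not a real schema.
@dataclass
class Alert:
    name: str
    severity: str          # "page", "ticket", or "info"
    runbook_url: Optional[str] = None
    owner: Optional[str] = None

def should_page(alert: Alert) -> bool:
    """Only wake a human when the signal comes with a plausible next action."""
    has_context = alert.runbook_url is not None and alert.owner is not None
    return alert.severity == "page" and has_context

print(should_page(Alert("disk_full", "page",
                        "https://wiki.example/runbooks/disk", "storage-team")))  # True
print(should_page(Alert("mystery_spike", "page")))  # False: noisy, context-free
```

A rule like this doesn’t reduce noise on its own, but it makes the cost of context-free alerts visible: they don’t page anyone until someone gives them an owner and a next step.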
Duck and Cover
Diving into systems thinking could be a whole other post, but I think one cautionary tale is that of wargaming.
Those among us who remember Bert the Turtle may be familiar with the practice of air raid drills in American public schools. The intent of such programs was not to safeguard the public’s ability to respond to a catastrophic event, but rather to inoculate against the ensuing chaos should the unthinkable happen. There’s certainly an upside to that sort of social engineering somewhere, but given the choice, I’d rather plan to be on my feet than under a desk. That’s just good disaster recovery planning.
Runbooks are an essential part of the learned experience, and that goes beyond disaster recovery planning. However, not all of the mechanics are out in the open - if your only signal is that a flock of geese is passing over Nunavut, then you need more data on which to base your decisions.
Are the results of your planning visible, verifiable, viable? The consequences of extrapolating forward from limited information can be weird and uncomfortable. If you can use games to practice, plan, or pre-empt a war, then you should be able to use them to practice, plan, or pre-empt any other problem.
It’s a fortunate thing then that services can be designed with extensive instrumentation, which can provide data structured around events with high levels of context. To take advantage of this, we can prioritize flexibility and exploration - exposing internal state in a way that supports debugging without needing to predict failure modes in advance. Hence, chaos engineering - making plans, then stress-testing those plans on your infrastructure and organization.
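A game day or chaos experiment doesn’t need a dedicated platform to start. The sketch below captures the shape of it under stated assumptions (the service, dependency, and timing values are hypothetical): state a hypothesis, inject a failure you control, and then check whether your signals and fallbacks behaved as planned:

```python
import time

# A minimal game-day sketch, not a real chaos engineering framework: state a
# hypothesis, inject a failure we control, and check whether the signals we
# rely on actually fired. The service and dependency names are hypothetical.

HYPOTHESIS = ("If the payments dependency times out, an alert fires within "
              "60 seconds and requests fall back to a retry queue.")

def call_payments(inject_latency_s: float = 0.0) -> str:
    """Stand-in for a dependency call; the injected latency is the 'chaos'."""
    time.sleep(inject_latency_s)
    if inject_latency_s > 1.0:
        raise TimeoutError("payments dependency timed out")
    return "ok"

def run_experiment() -> None:
    print(f"Hypothesis: {HYPOTHESIS}")
    try:
        # Deliberately break the steady state.
        call_payments(inject_latency_s=1.5)
    except TimeoutError:
        # In a real drill you would now verify the alert fired and the
        # fallback engaged; here we only record that the failure was observed.
        print("Injected failure observed. Did the alert fire? Did the fallback engage?")

if __name__ == "__main__":
    run_experiment()
```

The verification step is the whole point: if the injected failure produces no signal and no documented response, that gap is the finding.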
Part 2: How do teams accomplish it?
Finding Your Stake
I think another aspect that plays an outsized role in effective observability is the human operations that support it.
After all, you can try hard at a lot of things - but how you expend your effort matters a lot, especially when working with a team. I (somewhat sweatily) gave a talk last fall where I asked the audience to challenge the belief that they (themselves, as individuals) could solve the problem currently affecting them and their work.
I think it’s important to put effort into rigorously defining what your stake is, and if that changes, that’s great - but does being in a problem-solving mindset all the time help you? Probably not. When you treat your relationships like an engineering problem, you’re going to miss out on connections you might have appreciated otherwise.
If you can sit comfortably with where your personal responsibility lies, you can right-size your cognitive load in otherwise ambiguous situations.
Networked Teams
“Being on an amplified team is like knowing the answer to the question before you ask it … everything you see and feel becomes drenched in data and connection and meaning.” (Hon, 2020, p. 99, A New History of the Future in 100 Objects: A Fiction, MIT Press)
One of my favorite books to come out of 2020, Adrian Hon’s ‘…History of the Future in 100 Objects’, is a game designer’s take on what the coming decades might hold. Looking back on that time, so much has been written on the rise and fall of widespread remote work and the emergence of capable generative AI models that dovetailed with it. I think Hon was particularly prophetic in his near-term predictions for a non-idealized future of work - increasingly small and coordinated teams chasing efficiency gains with the help of sophisticated agents.
The quote above implies the existence of a networked team, which the amplified team evolved out of. If the late ’90s to the present day have been the era of networked teams, then perhaps the transition to amplified teams is already underway.
To paraphrase from the passage, these are some of the qualities that make for a good team member:
- Excellent empathy and communication skills
- Advanced skills in at least one area (programming, negotiation, writing, languages)
- A high degree of adaptability in conflicting or ambiguous environments
Playing to the strengths of the individuals that make up the team is intuitive, but surprisingly difficult.
In the future, it seems, promoting a culture that accepts vulnerability and shared problem-solving often leads to stronger organizational resilience and better long-term outcomes. This is not shocking, but I appreciate that the focus had less to do with leadership than with the qualities that make up a group that can support one another in a decentralized manner. I think that small, equal teams like this will probably be the most likely to reap the benefits of automation in the near term, and the last to suffer from it in the long term.