On two snowy days in February, a horde of self-professed incident nerds descended upon Denver for the inaugural Learning From Incidents (LFI) conference.
It was a magical gathering of industry folks, resilience engineers, technical leaders, and academics sharing tales of helming incidents, trading strategies for seeding a culture of continuous learning, and discussing the latest resilience engineering research findings.
What is LFI?
This community began in 2019 and is “challenging conventional views and reshaping how the software industry thinks about incidents, software reliability, and the critical role people play in keeping their systems running,” according to the community’s website.
I was first introduced to the LFI way of life back in my first SWE role when my company hosted the SNAFU Catchers. The researchers were studying anomaly response, particularly how ChatOps controlled the costs of coordination during incidents and facilitated recalibration, aka “I didn’t know it worked that way!”
Their presentations shifted my paradigm of “the system” from a purely technical perspective of code, configuration and infrastructure to include people and the human side of the organization, what we call sociotechnical today.
Their research led me on a journey following LFI luminaries Dr. Lorin Hochstein, Nora Jones, J. Paul Reed, and many more. Tweets like Amy Nguyen’s made me laugh, think, and want to learn more about this whole LFI philosophy.
A few years and many incidents later, I was an SRE at a company that experienced an incident dubbed “Kafkapocalypse.” If there was ever an incident to learn from, it was that one! Instead of following the same post-incident review process, I donned my analyst hat and interviewed the responders, reviewed the available data, and synthesized it into a report that highlighted the uncovered themes and knowledge gaps. My first incident analysis! The context gleaned from the incident analysis was a goldmine of insights, and I was hooked on this new approach to incidents.
A recap of LFI tracks
This year’s conference included 5 main tracks that attendees could join to learn about incident theory, research, and real-life scenarios.
CasesConf: The Incident Story Track
You won’t find a recording of my talk (or any Incident Stories) due to the Chatham House Rule: anyone who attends the meeting is free to use information from the discussion but cannot reveal who made any particular comment. This meant speakers could include specifics about architecture or company dynamics and not worry about litigation.
My presentation, “There’s No Place Like Production,” walked folks through my “big one” — when I was a contributing factor to a major incident for an issue that could only happen in production. I retold how this incident turned into a gift of learning to my former company.
The Intentional Hallway
This track was for opportunistic idea generation, lively discussions, and ad hoc collaborations. One of my favorites was the discussion facilitated by Jeff Martens around Managing Incidents When Your App is Built on Other Apps. The big question was: “How do third-party cloud dependencies impact your ability to run reliable services?”
The ancient SRE saying goes: “You can only be as reliable as the most unreliable component in your system.”
This question spawned a juicy conversation that benefitted from having so many differing perspectives in the room from managers. The one point of agreement among the group was that vendor status pages are not the most reliable sources of information. We touched on the sometimes adversarial nature between vendor and customer and what reasonable expectations are.
While you can’t go back in time to be in the room for our discussion, you can ask folks in your organization this question to spark your own, especially if you’re embroiled in a build vs. buy debate.
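If you want something concrete to bring to that conversation, the SRE saying quoted above can be turned into a little arithmetic. What follows is a minimal sketch, not anything presented at the conference. It assumes every dependency is a hard dependency (the service is down whenever any one of them is down) and that failures are independent, which real vendor outages often are not. The service names and availability figures are hypothetical, made up purely for illustration.

```python
from math import prod

def serial_availability(availabilities):
    """Best-case availability of a service that needs every listed
    component up at the same time, assuming independent failures."""
    return prod(availabilities)

# Hypothetical numbers for illustration only, not real vendor SLAs.
own_service = 0.9995
vendors = {"auth": 0.999, "payments": 0.9995, "cdn": 0.9999}

components = [own_service, *vendors.values()]
composite = serial_availability(components)
weakest = min(components)

# Translate the composite figure into expected downtime per 30-day month.
MINUTES_PER_30_DAYS = 30 * 24 * 60
expected_downtime_minutes = (1 - composite) * MINUTES_PER_30_DAYS

print(f"weakest component:  {weakest:.4%}")
print(f"composite estimate: {composite:.4%}")
print(f"expected downtime:  {expected_downtime_minutes:.0f} min per 30 days")
```

Under these assumptions the composite figure can never be higher than the weakest component, and each additional hard dependency pulls it lower. That is one way to frame a build vs. buy discussion in numbers.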
Learning From Incidents Techniques
This track featured practical takeaways for individuals and organizations to learn from incidents and increase resiliency.
A current challenge for the LFI community is how to measure or quantify the impacts of investing in learning from incidents in a way that is sensible to business leaders. For this track, Niall Murphy’s video skit IC Meets VP in a Tale of Incident Management set the stage for a conversation between Mr. VP and Mr. IC, which will be all too familiar for those that manage incidents and operate complex systems.
“We invested a lot in Incident Management last quarter, and incidents went up. Is that expected?” asked Mr. VP.
“It’s not that we’re finding more incidents as such, it’s really more that we’re trying to handle them better and the investment really helped the team to do that!” said Mr. IC.
“Really? Here it says that not only did we have more incidents but they also lasted longer on average,” posited Mr. VP.
The video closes with Mr. VP outlining the constraints he is under and what signals are needed from the Incident Management function to continue getting investment. He throws down a challenge that all of us investing in resilience work will have to contend with: “Give us something that can be managed roughly the same way everything else is or lead a revolution in how things are managed.”
Let the revolution begin!
Academic and Applied Research from Other Industries
Part of what made this conference unique was the blend of academia and speakers from industries beyond pure technology.
It was exciting to discover that there are actual fields of study such as safety sciences, human factors, organizational theory, and cognitive engineering with brilliant folks that study how we as humans can better manage and maintain complex digital systems that seep into almost every facet of modern life.
Ohio State University Professor Emeritus Dr. David Woods even gave attendees homework! It was to reflect on personal aha moments throughout the conference. He asked us to identify new patterns that became visible and asked, “What facilitates learning? What hampers it?”
Speakers from other industries served as a reminder of how much our society relies upon complex and distributed systems and highlighted how critical learning from incidents is to society. Highlights include Dr. Ivan Pupulidy’s tales of investigating accidents with the U.S. Forest Service and Nikki Vande Garde’s experiences wrangling electronic health records as the Director of Patient Safety at Oracle Health.
Asking this? Watch That!
If you’re asking yourself any questions about incident reviews, patterns, incident process, and more, be sure to check out the accompanying talks!
1. Why did it make sense for the industry to do it [incident reviews] this way? Nora Jones, Founder/CEO, Jeli
Nora Jones kicked off the conference with a closer look at incident management in 2023. She shares her path to LFI and starting this community and recaps how we as an industry got here. The second part of the talk, “A tale of two incident reviews,” compares the learnings and process from the original incident response and a follow-up incident analysis. The follow-up moved the start of the incident from ~20 minutes to 3 weeks prior to the event and uncovered 8 contributing factors compared to the original 2. It was a powerful demonstration of the value of spending the time to learn from incidents.
2. Why do bad incidents happen to good engineers? Dr. Lorin Hochstein, Senior Software Engineer, Netflix
Dr. Hochstein’s provocatively titled presentation, “Your Understanding of Reality is Wrong,” did not disappoint! He schooled us on the linear model of causality and ontology (the nature of reality). He presented a new model for incidents and how he relayed this model to others.
His model encompasses multiple contributing factors, interaction patterns between services, historical context, local rationality of responders, uncertainty, goal conflicts, production pressure, workarounds, and expertise.
3. What are patterns? Have you ever overshot on the recovery of an incident and created a new problem? Dr. David Woods, Professor Emeritus, Ohio State University
In “Finding Patterns in What Makes Incident Response Hard Work,” Dr. Woods walks us through the very beginnings of the cognitive and resilience engineering academic fields. Starting with his partnership with the esteemed late Dr. Richard Cook, Dr. Woods tells the tale of building the first anomaly response models as they studied past and present operating room incidents. This talk gets meta, diving into identifying patterns within patterns, but fear not: Dr. Woods is an engaging speaker and makes this content very accessible!
4. Are incidents just paperwork? What’s the value of the process? Clint Byrum, Staff Engineer, Spotify
After spending hours piecing together an incident timeline, synthesizing signals from god knows how many disparate systems, it’s no wonder folks treat incident reports as toilsome work and go through the motions. In Incident Archaeology: Extracting Value from Paperwork and Narratives, Clint Byrum guides us through the evolution of incident analyses at Spotify. Developers were anxious about going on-call, leading to the first hypothesis: “After-hours incidents will have high MTTR and complexity.” When looking at the data, Clint and his team found that it was highly unlikely that incidents would happen after hours!
5. What can we do about the multi-party dilemma? Sarah Butt, Lead Member Technical Staff, SRE, and Alex Elman, Head of Resilience Engineering, Indeed
Anyone working to support, build, or maintain a cloud native system or one with critical third-party dependencies needs to watch this talk, Embracing the Multi-Party Dilemma: Learning from Incidents Across Company Boundaries.
Alex gives an overview of how Indeed approaches incident analysis, which includes a group learning review. Sarah and Alex share the power and benefits of collaborating across companies to learn from the same incident.
6. What role should ICs play in incident analysis? Cooper Benson, Platform Engineer, Quizlet
In Designing Incident Analysis Programs Isn’t Just for Managers, Cooper shared how he has been diligently and patiently cultivating an incident analysis practice at Quizlet. Cooper investigated a major incident that caused an infinite restart loop and realized that “There is no root cause!” He embarked on studying Quizlet’s incident response process and asked: What feels like a drag? Is there a learning the current process isn’t capturing?
He noted that ICs bring unique perspectives and skills, such as their ingenuity and ability to automate toil, and that taking a grassroots approach drives greater ownership and engagement.
Watch the talk for a practical guide to start shifting your organization away from the 5 Whys today.
7. What’s the ROI on incident analysis? [….well] What’s the ROI on a brand redesign? John Allspaw, Principal/Founder, Adaptive Capacity Labs
The conference ended on a joyous note with John Allspaw’s keynote, An Exemplar: What Progress in LFI actually Looks Like. “The goal of incident analysis is to build the richest possible understanding of an event for the broadest possible audience,” he said.
He shares Adaptive Capacity Labs’ engagement with Indeed, who seriously invested in an incident analysis program and learned from their incidents. If you’re wondering what the point of learning from incidents is or how bright the future could be if you invested today, give it a watch.
But don’t let FOMO take over; you can hold your own LFI conference by gathering some folks and watching the lineup on YouTube.
Want to know how Chronosphere can help your team learn from incidents? Contact us for a demo today.