If you’re asking yourself any questions about incident reviews, patterns, incident process, and more, be sure to check out the accompanying talks!
1. Why did it make sense for the industry to do it [incident reviews] this way? Nora Jones, Founder/CEO, Jeli
Nora Jones kicked off the conference with a closer look at incident management in 2023. She shares her path to Learning from Incidents (LFI) and to starting this community, and recaps how we as an industry got here. The second part of the talk, “A tale of two incident reviews,” compares the learnings and process from the original incident response with those from a follow-up incident analysis. The follow-up moved the start of the incident from ~20 minutes to 3 weeks prior to the event and uncovered 8 contributing factors compared to the original 2. It was a powerful demonstration of the value of spending the time to learn from incidents.
2. Why do bad incidents happen to good engineers? Dr. Lorin Hochstein, Senior Software Engineer, Netflix
Dr. Hochstein’s provocatively titled presentation, “Your Understanding of Reality is Wrong,” did not disappoint! He schooled us on the linear model of causality and ontology (the nature of reality). He presented a new model for incidents and how he relayed this model to others.
His model encompasses multiple contributing factors, interaction patterns between services, historical context, local rationality of responders, uncertainty, goal conflicts, production pressure, workarounds, and expertise.
3. What are patterns? Have you ever overshot on the recovery of an incident and created a new problem? Dr. David Woods, Professor Emeritus, Ohio State University
In “Finding Patterns in What Makes Incident Response Hard Work,” Dr. Woods walks us through the very beginnings of the academic fields of cognitive engineering and resilience engineering. Starting with his partnership with the late, esteemed Dr. Richard Cook, Dr. Woods tells the tale of building the first anomaly response models as they studied past and present operating room incidents. This talk gets meta, diving into identifying patterns within patterns, but fear not: Dr. Woods is an engaging speaker and makes this content very accessible!
4. Are incidents just paperwork? What’s the value of the process? Clint Byrum, Staff Engineer, Spotify
After spending hours piecing together an incident timeline and synthesizing signals from god knows how many disparate systems, it’s no wonder folks treat incident reports as toilsome work and go through the motions. In “Incident Archaeology: Extracting Value from Paperwork and Narratives,” Clint Byrum guides us through the evolution of incident analyses at Spotify. Developers were anxious about going on call, leading to the first hypothesis: “After-hours incidents will have high MTTR and complexity.” When they looked at the data, Clint and his team found that incidents were highly unlikely to happen after hours!
5. What can we do about the multi-party dilemma? Sarah Butt, Lead Member Technical Staff, SRE, and Alex Elman, Head of Resilience Engineering, Indeed
Anyone working to support, build, or maintain a cloud-native system, or one with critical third-party dependencies, needs to watch this talk: “Embracing the Multi-Party Dilemma: Learning from Incidents Across Company Boundaries.”
Alex gives an overview of how Indeed approaches incident analysis, which includes a group learning review. Sarah and Alex share the power and benefits of collaborating across companies to learn from the same incident.
6. What role should ICs play in incident analysis? Cooper Benson, Platform Engineer, Quizlet
In “Designing Incident Analysis Programs Isn’t Just for Managers,” Cooper shared how he has been diligently and patiently cultivating an incident analysis practice at Quizlet. Cooper investigated a major incident that caused an infinite restart loop and realized that “There is no root cause!” He embarked on studying Quizlet’s incident response process and asked: What feels like a drag? Is there a learning the current process isn’t capturing?
He noted that ICs bring unique perspectives and skills, such as their ingenuity and ability to automate toil, and that taking a grassroots approach drives greater ownership and engagement.
Watch for a practical guide to start shifting your organization away from the 5 Whys today.
7. What’s the ROI on incident analysis? […well] What’s the ROI on a brand redesign? John Allspaw, Principal/Founder, Adaptive Capacity Labs
The conference ended on a joyous note with John Allspaw’s keynote, “An Exemplar: What Progress in LFI Actually Looks Like.” “The goal of incident analysis is to build the richest possible understanding of an event for the broadest possible audience,” he said.
He shares Adaptive Capacity Labs’ engagement with Indeed, who seriously invested in an incident analysis program and learned from their incidents. If you’re wondering what the point of learning from incidents is or how bright the future could be if you invested today, give it a watch.
But don’t let FOMO take over; you can hold your own LFI conference by gathering some folks and watching the lineup on YouTube.