Chronosphere was proud to sponsor this year’s SRECon Americas and invest in USENIX’s diverse, equitable, and inclusive community. The conference was a delightful gathering of engineers and technologists who deeply care about reliability, systems engineering, and working with complex distributed systems at scale. With a culture of critical thought, deep technical insights, continuous improvement and innovation, and commitment to vendor-neutral presentations, SRECon is a place for rich discussion and cross-pollination of ideas.
From March 21-23, the Santa Clara Convention Center hosted several hundred attendees. Key themes included approaches to adapting SRE practices for organizations of different sizes, sprinkles of learning from incidents (LFI) and continuous learning philosophies, and deep dives into systems at scale — from streaming telemetry to service routing and latency measurement optimization.
Conference organizer Sarah Butt kicked off the event in high gear and invited attendees to take things at their own pace, as she noted, “We [SRECon organizers] have an agenda. You do not.”
Butt’s introduction was followed by the incredible opening keynote, “Endgame of SRE,” presented by Amy Tobey. I won’t spoil the fun for you — this is definitely one to watch, especially if you’re curious about what people mean when they use the term “sociotechnical system.”
Adapting SRE practices
The parade of awesomeness continued with engineers from the U.S. government as they presented “SRE’s Critical to COVID-19 Government Response” and detailed how they played a pivotal role in the pandemic response. Featured initiatives included ordering free at-home tests via the USPS website, architecting a behind-the-scenes data pipeline to lessen the avalanche of printed paper results, and powering vaccinefinder.gov. I was truly in awe learning how it all came together: which SRE practices helped, and how the teams pulled it off within a government system caught between multiple agencies of oversight!
Service level objectives (SLOs) are an SRE mainstay, and if you’ve moved past the “What are SLOs?” 101-level knowledge, then you’ll enjoy “Not All Minutes Are Equal: The Secret behind SLO Adoption Failure.” Michael Goins and Troy Koss shared the nitty-gritty details of rolling out SLOs at Capital One from both the management and developer perspectives. Importantly, the duo dug into the U-turns and bumps in the road, so their presentation captured the whole story instead of just the highlight reel of success.
Continuous learning
Continuous deployments: ✅ Continuous learning: ❓. Adapting to the relentless stream of changes in the economy, pandemic response, and digital transformation requires that an organization and its employees regularly revisit and expand their mental models of “the system” and evolve their skills and paradigms. Here is a sampling of talks to update your understanding of distributed tracing, explain why MTTR is for amateurs, and show how to onboard engineers to on-call via production debugging as a group.
Hands down, my favorite session of SRECon was “Founder/CTO Perspectives: The Future of Distributed Tracing.” To state my bias upfront: I am pro-tracing and pro-OpenTelemetry and believe tracing has a bright future. That said, the panel acknowledged some tough truths about the current state of tracing — namely the disconnect between front-end and back-end telemetry. Traces as they stand today are excellent for back-end engineers and SRE types but leave everything to be desired for our friends working on the front end. All in all, it’s a rousing and entertaining watch to get the latest scoop on where tracing sits in the wide world of observability.
Incident nerds and engineering leadership will find much to delve into after watching Courtney Nash’s “Far From the Shallows: The Value of Deeper Incident Analysis.” She uses data to dispel some of the incident tenets we in the industry hold dear, such as the ideas that mean time to resolution (MTTR) matters or that severity levels are meaningful categories. Be forewarned: if you’ve spent a lot of time on the pager or managing engineering groups, you may feel nervous giving up the safety blanket of MTT* — which is exactly why real-life data from other organizations’ incidents is such a great resource to study. You’ll be excited to swap stories of near misses and proactively plan to take the cape off your go-to incident heroes.
On-call, love it or hate it, is essential for running systems at scale. In “Cognitive Apprenticeship in Practice: Alert Triage Hour of Power,” I shared a successful onboarding and collaborative learning practice dubbed the “Alert Triage Hour of Power” that is still going strong three years and counting! This session revisited the time I was faced with a daunting task: get up to speed on a distributed real-time monitoring system, from the edge to the application layer to infrastructure and internal tools, well enough to join the single on-call rotation. Knowing best practices, or even how to answer questions with trace data, wasn’t enough, since my day-to-day only exposed me to the infrastructure and internal tooling. Alert Triage Hour of Power was born from my anxiety, so if you’re looking to refresh your on-call onboarding or plant the seeds of continuous learning, look no further.
Systems at scale
Scaling systems, the operational fires where SREs were first forged, is an evergreen topic. The talks I was able to catch on this theme focused on the limitations of available telemetry and the journey to unearthing insights about system behavior by enhancing or rethinking instrumentation.
From the title alone, I knew I had to get a seat for “Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn’t DNS.” After all, who among us hasn’t blamed DNS, or been responsible for proving that it wasn’t DNS? The tale begins with an apparent DNS issue that manifests during rollouts and puzzles engineers, wanders into reading kernel code to understand dropped packets, and ends on a triumphant note with a fix (via code deletion!) and a deeper understanding of the life of a packet and gRPC defaults. This talk is not only a riveting technical deep dive but also an expertly laid out investigation story — take notes!
“Measuring Real-Life Latency of the Internet: A Netflix Story” was a talk I bookmarked from the program by name alone, and also because I was curious what a day in the life of a senior CDN reliability engineer involves! It was fascinating to hear about the balance of data fidelity and cost considerations that Netflix-scale requires, and how Thiara and her team were able, over time, to instrument finer-grained metrics by leveraging the t-digest data structure. This talk brought a deeper level of appreciation for all the work that goes on behind the scenes at Netflix so that I can seamlessly stream Selling Sunset.
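If you’re curious what a t-digest buys you, here is a minimal sketch (using the open-source Python tdigest package, not Netflix’s actual tooling) of estimating tail latency percentiles from a stream of samples without storing every data point:

```python
# Minimal sketch: streaming latency percentiles with a t-digest.
# Assumes the open-source "tdigest" package (pip install tdigest);
# the simulated latencies below are illustrative, not real data.
import random

from tdigest import TDigest

digest = TDigest()

# Feed in simulated request latencies (ms) one at a time; the digest keeps
# a small set of centroids instead of every raw sample, so memory stays
# bounded no matter how many requests are observed.
for _ in range(100_000):
    digest.update(random.lognormvariate(3, 0.5))

# Query fine-grained tail percentiles from the compact sketch.
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {digest.percentile(p):.1f} ms")
```

Because the sketch is small and digests can be merged, this kind of structure is a natural fit when you want fleet-wide percentile estimates without shipping or storing every raw measurement.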
Closing thoughts
There was an array of evening activities: Birds of a Feather ad-hoc gatherings where folks link up around a shared interest, plenty of sponsor events, and my personal favorite, Lightning Talk Slide Karaoke, where you have to improvise a presentation based on mystery slides!
The closing keynote, “Hell Is Other Platforms,” drew delightfully irreverent parallels between the present-day SRE/Ops world and Sartre’s existentialist play No Exit. Truly another one to bookmark: break out the popcorn and settle in to watch.
Coming back home was bittersweet since I had such a lovely time. I frequently joke that I am the youngest retired SRE in the field and am not shy about sharing my story of burnout. Walking away from SRECon and reflecting on the deeper conversations that remark sparked, I realized “once an SRE, always an SRE.” It is more than a job title; really, it’s a way of understanding systems holistically and an approach to managing risk in concert with others. Overall, SRECon was heartwarming and eye-opening, and it reminded me that while being an SRE can feel lonely at times at any given company, we are part of a large community focused on continuous learning.
Want to chat with Chronosphere experts in person? Be sure to catch us at KubeCon EU in Amsterdam from April 18-21!