An eye on observability: December and a 2021 recap

on December 21st 2021

To service mesh or not to service mesh

A service mesh is one of those divisive topics that engineers and DevOps professionals either consider essential glue for complex application architectures or dismiss as completely unnecessary in most use cases. Whatever your opinion, there’s a chance it’s based on content that the creator of Linkerd, William Morgan, says ranges from “low-calorie fluff” to “basically bullshit”. William recently published a blog post (on his own company website, so add yet another grain of salt) that attempts to give a deeper explanation of the technology and when or why it might be useful.

Learn an instrument

This newsletter covers instrumenting applications for observability a lot, mainly because it’s an important step that is often overlooked. After all, an application that doesn’t send any telemetry data remains something of a black box. This month’s newsletter continues our observability coverage, highlighting: a detailed post on how to instrument a Rust application, details on new features in the OpenTelemetry operator that add automatic instrumentation for Java, NodeJS, and Python, and finally a more general overview of OpenTelemetry from JetBrains and Honeycomb.
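To give a flavour of the operator’s auto-instrumentation feature: you create an `Instrumentation` resource and then opt workloads in with a pod annotation. The sketch below is illustrative only, and assumes the OpenTelemetry operator is already installed in the cluster; the resource names, image, and collector endpoint are placeholders.

```yaml
# Tells the operator how injected agents should export telemetry.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: demo-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
---
# A single annotation opts this pod into auto-instrumentation for Java;
# equivalent inject-nodejs and inject-python annotations exist too.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
spec:
  containers:
    - name: app
      image: example/app:latest
```

The appeal is that the application image itself stays untouched: the operator injects the language agent at deploy time, so teams get traces without changing code.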

Observability and SRE journeys

Five years sounds like a long time in observability, and African payment provider Paystack shared their experiences of choosing tools and changing internal mindsets over that period. The biggest standout point for me was this one: “Empower teams across the company by creating tools that help them resolve customer issues”. This is fundamental: there is little point in building complex observability pipelines if they don’t help staff or customers solve problems.

And from a company at the larger end of the spectrum came a blog post from the Google SRE team on how they keep Google’s 2 million lines of code running as smoothly as possible. One interesting aspect was how the team members operate as a sort of “support for hire”, with development teams having to budget for their time. There could be a temptation to cut costs, but I guess the potential cost of downtime or failure dissuades product vice presidents from doing so.

If you’re looking for SRE advice for companies a little smaller than Google, then this thread on Reddit is the beginning of an interesting discussion on how to create a “minimally viable” SRE stack.

Metrics forwarding with Prometheus agent mode

Like many other tools, Chronosphere and M3 use Prometheus’s remote write feature to receive metrics that Prometheus scrapes. But in this deployment pattern, Prometheus still runs a lot of other functionality that consumes computing resources without being used. Version 2.32.0 introduces an “agent mode” that disables querying, alerting, and local storage, replacing them with a customized TSDB write-ahead log to streamline forwarding data to remote Prometheus-compatible stores.
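In practice, agent mode is enabled with a feature flag, and the configuration shrinks to just scraping and forwarding. A minimal sketch (the target and remote endpoint are placeholders, not real services):

```shell
# prometheus.yml for agent mode only needs scrape_configs and remote_write;
# rule, alerting, and query settings no longer apply.
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
remote_write:
  - url: "https://metrics.example.com/api/v1/remote/write"
EOF

# --enable-feature=agent starts Prometheus 2.32+ in agent mode: querying,
# alerting, and local TSDB storage are disabled, and scraped samples are
# buffered in a write-ahead log until they are forwarded.
prometheus --enable-feature=agent --config.file=prometheus.yml
```

Because the write-ahead log buffers samples when the remote endpoint is unreachable, agent mode suits edge or sidecar deployments where a full Prometheus server would be wasteful.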

A look back on 2021

Amongst all the chaos and uncertainty that 2021 brought us, tech was a constant, whether we liked it or not. In our own cloud-native world especially, the pace of change and adoption was astronomical. SlashData recently reported that 2021 saw a 67% increase in Kubernetes use, to roughly 6.8 million cloud-native developers. With an estimated 27 million developers in total, that means more than a quarter of them now use Kubernetes. Some of us made it back to conferences (briefly), and you can read our thoughts on KubeCon NA and Hamburg container days.

If you need a reminder of topics covered in these newsletters since August then here are the past issues. Recurring themes were funding rounds, activity around creating standards, and reports from the field.

One of the biggest recurring themes in the second half of 2021 was cloud outages that brought our reliance on a handful of providers to the forefront of public attention. Small outages happen more often than you might think, but AWS suffered from two significant incidents in December, Google Cloud had a major one in November, and who can forget the peace and quiet when all Meta’s services went down in October?

These outages reminded us that the products and tools we create and depend on rely on disparate webs of dependencies we are sometimes not even aware of. Observability tools help us figure out the source of a problem, but if that source is an external provider, we don’t always have the control to do anything about it. This can in turn affect the relationship you have with your own customers and the service level agreements (SLAs) you have with them. Events this year showed us time and again that cloud-native means more than just hosting your code in different places: you need to ensure you have fail-safes at every level of your stack.

Interested in what we are building?