Cloud computing created hopes of a “set it and forget it” approach to IT. Then hopes collided with reality.
While running a business in the cloud can make cloud native enterprises more nimble and efficient, it also poses new risks for their IT departments.
These challenges—opaque workflows, disjointed microservices, and the risk of cost overruns—mean that cloud native environments require extra diligence. Enterprises must monitor many moving parts so they know what to fix when something breaks their operations or their budget. They must listen to lots of noise and extract just the right signals. They must uncover issues before they impact customers, which requires fast response times.
This creates the need for cloud observability. This blog, the first in a series, defines cloud observability and explains why enterprises need it. The second blog will explore use cases for cloud observability, and the third will explore architectural approaches to rolling it out.
Bird's-eye view
Cloud observability is an emerging discipline that enterprise IT departments use to study the performance, availability, and utilization of cloud applications and infrastructure, as well as the experience of application users. This discipline seeks to optimize cloud-based applications and containers, as well as the storage, compute, and network resources that support them. It monitors and parses the logs, metrics, and traces that describe how various elements operate and interoperate. Cloud observability, enabled by products such as the Chronosphere platform, helps enterprise site reliability engineers (SREs) and CloudOps engineers ensure that cloud agility does not come at the expense of IT stability or customer satisfaction.
Cloud observability is a subset of operations observability, previously known as application performance monitoring (APM), which arose in the 1990s to manage applications and infrastructure on premises. You can read my article, The Five Shades of Observability: Business, Operations, Pipelines, Models, and Data Quality, to learn more about the overall observability market.
Zooming in
To understand why cloud observability matters, consider three challenges of cloud native environments. First, SaaS and cloud providers have opaque workflows whose automation hides complex processes, making it hard for enterprises to troubleshoot issues that arise. Second, these providers use microservices that can become disjointed and miscommunicate with one another, creating the risk of enterprise outages. Third, these providers offer elastic infrastructure that scales on demand, which can cause enterprises to overrun their budgets.
Challenges of cloud environments
Let’s explore these elements, their challenges, and how cloud observability can help.
Opaque workflows. Software as a service (SaaS) providers such as Salesforce, Workday, and Slack, along with the cloud providers AWS, Azure, and Google, use workflows that automate sequences of tasks related to user requests, provisioning, administration, and so on. This makes it easier for them to deliver applications and infrastructure to enterprises. Organizations are moving to cloud-based workflows for common repetitive tasks, rather than hosting them on traditional application stacks, to cut financial and administrative costs.
However, automated workflows hide rather than eliminate complexity. Those myriad tasks still need to synchronize with one another under the covers, which makes it harder for enterprises to understand issues that arise and hurt their business. For example, if step four fails in a batch process, an SRE has to look at the logs and determine why it failed.
Cloud observability addresses this challenge. It helps expose and then reduce the complexity by monitoring and analyzing data. For example, metrics tell SREs when something happened, traces tell them where it happened, and root cause analysis tells them what went wrong, which together give them a clear path to remediation. I’ll write more about these three phases of observability in my next blog.
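To make the "when" and "where" steps concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular observability product works: the per-minute error rates, the `Span` record, the 5% threshold, and the service names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One hypothetical trace span: a unit of work done by a service."""
    service: str
    start_s: int        # seconds since the start of the observation window
    duration_ms: float
    error: bool


def first_breach(error_rate_by_minute, threshold):
    """WHEN: index of the first minute whose error rate exceeds the threshold."""
    for minute, rate in enumerate(error_rate_by_minute):
        if rate > threshold:
            return minute
    return None


def failing_services(spans, minute):
    """WHERE: services with at least one errored span that started in that minute."""
    lo, hi = minute * 60, (minute + 1) * 60
    return sorted({s.service for s in spans if s.error and lo <= s.start_s < hi})


# Hypothetical metric: error rate per minute for one application.
error_rates = [0.01, 0.02, 0.01, 0.12, 0.15]

# Hypothetical trace spans collected over the same window.
spans = [
    Span("checkout", 30, 40.0, False),
    Span("checkout", 185, 900.0, True),
    Span("payments", 190, 850.0, True),
    Span("catalog", 200, 35.0, False),
]

when = first_breach(error_rates, threshold=0.05)
where = failing_services(spans, when)
print(when, where)  # 3 ['checkout', 'payments']
```

The metric narrows the problem to a time window, and the traces narrow it to a set of services. Working out what went wrong would still mean reading the logs for those services in that window.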
Disjointed microservices. Many SaaS and cloud providers execute their tasks with microservices that run on lightweight, containerized, and reusable modules. Microservices make it easier for these providers to deliver agile software releases. Rather than ripping and replacing an entire workflow, they can update one microservice at a time.
However, all these microservices need to communicate with one another to synchronize tasks, which creates risks of outages. If a service request fails, the root cause could lurk within hundreds or thousands of messages that microservices send to one another. The SRE or CloudOps engineer needs to find the errant messages—or wait for the SaaS provider or cloud provider to do so.
As with workflows, cloud observability addresses this challenge by exposing all the tasks and aggregating all the data related to a given issue. The CloudOps engineer can inspect this data, including the messages that various elements—perhaps compute clusters, load balancers, or containers—send to one another as they execute their microservices. Then they can assess the root cause by identifying incorrect or failed messages.
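As a rough sketch of what "aggregating all the data related to a given issue" can look like, the Python below groups inter-service messages by request and returns the earliest failed one. The message fields (`request_id`, `ts`, `src`, `dst`, `status`) and the rule that a status of 500 or higher means failure are assumptions made for illustration, not a real provider's log format.

```python
from collections import defaultdict


def group_by_request(messages):
    """Collect messages per request ID and order each group by timestamp."""
    grouped = defaultdict(list)
    for msg in messages:
        grouped[msg["request_id"]].append(msg)
    for msgs in grouped.values():
        msgs.sort(key=lambda m: m["ts"])
    return grouped


def first_failure(messages, request_id):
    """Return the earliest failed message for a request, or None if all succeeded."""
    for msg in group_by_request(messages).get(request_id, []):
        if msg["status"] >= 500:  # assumed failure convention
            return msg
    return None


# Hypothetical messages, arriving out of order from several services.
messages = [
    {"request_id": "r1", "ts": 3, "src": "orders", "dst": "payments", "status": 502},
    {"request_id": "r1", "ts": 1, "src": "gateway", "dst": "orders", "status": 200},
    {"request_id": "r2", "ts": 2, "src": "gateway", "dst": "catalog", "status": 200},
    {"request_id": "r1", "ts": 2, "src": "orders", "dst": "inventory", "status": 200},
    {"request_id": "r1", "ts": 4, "src": "gateway", "dst": "client", "status": 500},
]

culprit = first_failure(messages, "r1")
print(culprit["src"], "->", culprit["dst"], culprit["status"])  # orders -> payments 502
```

The error the customer sees (the gateway's 500 at `ts` 4) is a symptom. Ordering the messages for that one request shows that the first failure was the call from orders to payments.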
Elastic infrastructure. Cloud providers virtualize their infrastructure and containerize their applications to improve efficiency. For example, AWS offers Amazon Elastic Compute Cloud (EC2) instances as virtual servers that use parts of physical compute clusters. It also offers Amazon Elastic Kubernetes Service (EKS) to manage and orchestrate application containers. These capabilities help cloud providers scale resources elastically.
However, this elasticity creates the risk of cost overruns. The CloudOps engineer needs to maintain careful oversight of cloud resource consumption and therefore costs. Cloud observability addresses this problem by monitoring resource utilization rates to help track costs and support chargeback. For example, the CloudOps engineer can measure compute cycles to predict operating expenditure.
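The arithmetic behind chargeback and spend prediction can be sketched in a few lines of Python. Everything specific here is hypothetical: the team names, the usage records, the flat rate of 0.05 per vCPU-hour (not a real cloud price), and the simple linear projection over a 30-day month.

```python
def chargeback(usage_records, rate_per_vcpu_hour):
    """Sum the cost of measured compute usage per team."""
    costs = {}
    for rec in usage_records:
        cost = rec["vcpu_hours"] * rate_per_vcpu_hour
        costs[rec["team"]] = costs.get(rec["team"], 0.0) + cost
    return costs


def project_monthly(cost_so_far, days_elapsed, days_in_month=30):
    """Linear projection: assume the rest of the month matches the daily average so far."""
    return cost_so_far / days_elapsed * days_in_month


# Hypothetical usage measured over the first 10 days of the month.
usage = [
    {"team": "search", "vcpu_hours": 1000},
    {"team": "checkout", "vcpu_hours": 500},
    {"team": "search", "vcpu_hours": 200},
]

costs = chargeback(usage, rate_per_vcpu_hour=0.05)  # hypothetical rate
projected = {team: project_monthly(c, days_elapsed=10) for team, c in costs.items()}

for team in sorted(costs):
    print(f"{team}: {costs[team]:.2f} so far, {projected[team]:.2f} projected")
```

A linear projection is the simplest possible forecast. Elastic workloads rarely grow in a straight line, so a real forecast would likely account for seasonality and scaling events.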
The volume and cardinality of infrastructure data also create the risk of cost overruns. For example, metrics about resource utilization and response times can accumulate fast, which drives up compute costs when CloudOps engineers need to monitor performance. Cloud observability tools can help by throttling or filtering metrics so that they process only the metrics that matter most for understanding the performance of cloud environments.
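Cardinality here means the number of distinct time series, that is, unique combinations of a metric name and its labels. The Python sketch below shows, under assumed data, how dropping one high-cardinality label and summing the values shrinks the number of series to process. The metric name, the labels, and the choice to drop `pod` are hypothetical, and this is not the filtering logic of any specific tool.

```python
def aggregate(samples, labels_to_drop):
    """Drop the given labels and sum the values of series that become identical."""
    totals = {}
    for s in samples:
        kept = tuple(sorted(
            (k, v) for k, v in s["labels"].items() if k not in labels_to_drop
        ))
        key = (s["name"], kept)
        totals[key] = totals.get(key, 0) + s["value"]
    return totals


# Hypothetical request counts, labeled by service and by short-lived pod.
samples = [
    {"name": "http_requests", "labels": {"service": "checkout", "pod": "a1"}, "value": 5},
    {"name": "http_requests", "labels": {"service": "checkout", "pod": "b2"}, "value": 7},
    {"name": "http_requests", "labels": {"service": "search", "pod": "c3"}, "value": 2},
    {"name": "http_requests", "labels": {"service": "search", "pod": "d4"}, "value": 4},
]

before = len(aggregate(samples, set()))    # one series per (service, pod) pair
reduced = aggregate(samples, {"pod"})      # one series per service
after = len(reduced)
print(before, "->", after)  # 4 -> 2
```

The trade-off is visibility: per-service totals survive, but the ability to single out one misbehaving pod is lost, so which labels to drop depends on the questions engineers need to answer.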
Moving forward in a digital world
These challenges of opaque workflows, disjointed microservices, and potential cost overruns underscore that the mantra for cloud computing has shifted from “set it and forget it” to “be vigilant.” Implemented well, cloud observability can help enterprises operate in this new world.
Stay tuned for my next blog, which explores use cases for cloud observability.