Chronosphere has been using Temporal, an open source workflow orchestration engine, since late 2020 for three primary use cases: 

  1. Deployment workflows in a proprietary deployment system.
  2. Automated cluster operations such as rolling restarts and migrating capacity.
  3. Scenario testing for release validation.

I recently spoke at Temporal’s meetup about the first use case – deployments – and the rest of this blog focuses on how we built our own deployment system with Temporal, along with some of the challenges and lessons we learned along the way.

With the help of Temporal’s glossary, here are a few concepts and terms to help set context for the blog: 

Building our deployment system with Temporal 

Chronosphere uses a hybrid of multi- and single-tenant architectures: we deploy a separate, fully functioning Chronosphere stack within each of our tenants, or customer environments. Because of this, engineers typically need to deploy changes for a combination of services to a set of tenants. So when designing our deployment system, we wanted to make sure engineers could control different aspects of the rollout, such as the order in which tenants and services are deployed, whether they are deployed in parallel or sequentially, how failures should be handled, and so on.

This led to the following structure for deployment workflows: 

In this structure, there are three main workflows:

Each of these workflows interacts with external dependencies such as Kubernetes, m3db-operator, object storage, and Slack.

While building out this new, automated deployment system with Temporal, we encountered a few patterns and challenges along the way. See how we handled them below:

Challenge #1: Managing conflicting operations 

To help set context, let’s further define workflow IDs: a workflow ID is a user-provided identifier for a workflow. For example, our Orchestration workflow uses the ID pattern “deploy:<cluster>/<deployment_id>”. Temporal lets you guarantee that no two workflows with the same ID run at once, or even that a workflow ID is never reused.
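For illustration, here is a minimal sketch in Go of starting a deploy with a deterministic workflow ID using the Temporal SDK. The workflow name (OrchestrateDeploy), task queue, and the specific reuse policy are assumptions for the example rather than our exact configuration:

```go
package main

import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

func startDeploy(ctx context.Context, c client.Client, cluster, deploymentID string) error {
	opts := client.StartWorkflowOptions{
		// Deterministic ID: Temporal guarantees only one running workflow per ID.
		ID:        fmt.Sprintf("deploy:%s/%s", cluster, deploymentID),
		TaskQueue: "deploys",
		// Optionally forbid ever reusing the ID, even after the workflow completes.
		WorkflowIDReusePolicy: enumspb.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE,
	}

	// If a workflow with this ID is already running, ExecuteWorkflow returns an error.
	run, err := c.ExecuteWorkflow(ctx, opts, "OrchestrateDeploy", cluster, deploymentID)
	if err != nil {
		return fmt.Errorf("deploy already in progress or failed to start: %w", err)
	}
	log.Printf("started deploy workflow %s (run %s)", run.GetID(), run.GetRunID())
	return nil
}
```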

From the early days of our deployment system, we wanted to make sure we didn’t have multiple deployments changing the same set of services in the same customer environment. Initially, we allowed only one running deployment per customer environment by deriving the DeployTenant workflow ID from the environment, e.g. “deploy:<cluster>/<tenant>”.

This worked fine at first, but became difficult to manage as the complexity of the system and our engineering organization grew. With more customers, more services to manage, and more engineers wanting to deploy, we ran into challenges with the following use cases:

To resolve these challenges, we introduced a lock workflow, and we now acquire a lock for the affected services at the start of the DeployTenant workflow. This gives us much finer control over whether to proceed with changes or deployments to a tenant or set of tenants. We based the lock workflow on Temporal’s mutex sample, then extended it to better fit our use cases.
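To give a feel for the pattern, here is a simplified sketch loosely based on Temporal’s mutex sample. The workflow, signal, and activity names (LockWorkflow, lock-requested, lock-acquired, RequestLock) and the lock ID format are illustrative, not our actual implementation:

```go
package deploy

import (
	"context"
	"fmt"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

// LockRequest is passed to the lock workflow; the ID pattern is illustrative.
type LockRequest struct {
	TenantLockID      string // e.g. "lock:<cluster>/<tenant>"
	RequesterWorkflow string // workflow ID to signal once the lock is granted
}

type Activities struct {
	TemporalClient client.Client
}

// RequestLock signal-with-starts the per-tenant lock workflow, so a single
// lock workflow instance serializes all requesters for that tenant.
func (a *Activities) RequestLock(ctx context.Context, req LockRequest) error {
	_, err := a.TemporalClient.SignalWithStartWorkflow(
		ctx,
		req.TenantLockID,      // one lock workflow per tenant
		"lock-requested",      // signal the lock workflow listens on
		req.RequesterWorkflow, // payload: who to notify when the lock is granted
		client.StartWorkflowOptions{TaskQueue: "locks"},
		"LockWorkflow",
	)
	return err
}

// acquireLock runs inside DeployTenant: request the lock, then block until the
// lock workflow signals that we own it.
func acquireLock(ctx workflow.Context, tenantLockID string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	req := LockRequest{
		TenantLockID:      tenantLockID,
		RequesterWorkflow: workflow.GetInfo(ctx).WorkflowExecution.ID,
	}
	if err := workflow.ExecuteActivity(ctx, "RequestLock", req).Get(ctx, nil); err != nil {
		return fmt.Errorf("requesting lock: %w", err)
	}

	// The lock workflow signals this channel once we hold the lock.
	workflow.GetSignalChannel(ctx, "lock-acquired").Receive(ctx, nil)
	return nil
}
```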

Challenge #2: Implementing additional safety checks

At Chronosphere, we provide mission-critical observability to our customers, and we wanted to ensure our deployment system met our high bar for safety and correctness. To make deploys more robust to failures, we wanted deployments to be aware of changes in the system that happen outside of a workflow. For example, if an alert fires for a customer while we are deploying to that customer, we want the system to make mitigating the impact seamless.

To do so, we introduced activities that perform specific checks (such as NoCriticalAlerts and IsPodAvailable), along with helper functions that run these checks in parallel with the rest of the workflow. If a check fails, all other checks are canceled, the workflow is terminated, and the initiating user is notified. These safety checks have also proven useful for release validation and cluster operations.
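As a rough illustration of the pattern (not our exact helpers), here is a sketch using the Temporal Go SDK’s selector and cancellation primitives. The check activity names come from above; RolloutService, the timeouts, and the simplified “cancel and return the error” handling (rather than terminating the workflow) are assumptions:

```go
package deploy

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func deployWithChecks(ctx workflow.Context, tenant, service string) error {
	ctx, cancel := workflow.WithCancel(ctx)
	defer cancel()

	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
		HeartbeatTimeout:    time.Minute,
	})

	// Safety checks run as activities alongside the main rollout activity.
	checks := []workflow.Future{
		workflow.ExecuteActivity(ctx, "NoCriticalAlerts", tenant),
		workflow.ExecuteActivity(ctx, "IsPodAvailable", tenant, service),
	}
	work := workflow.ExecuteActivity(ctx, "RolloutService", tenant, service)

	sel := workflow.NewSelector(ctx)
	var failed error
	for _, f := range checks {
		f := f
		sel.AddFuture(f, func(fut workflow.Future) {
			if err := fut.Get(ctx, nil); err != nil {
				failed = err // a check failed: remember the error
			}
		})
	}
	var done bool
	sel.AddFuture(work, func(fut workflow.Future) {
		failed = fut.Get(ctx, nil)
		done = true
	})

	// Wait until either the rollout finishes or something fails.
	for !done && failed == nil {
		sel.Select(ctx)
	}
	cancel() // stop the rollout and any checks that are still running
	return failed
}
```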

Challenge #3: Interacting with running workflows using Slack 

While rolling out new software to production, we occasionally need to involve an engineer to make a decision. This is needed for a few reasons, for example:

Overall, this approach has made our deployment process more interactive and efficient. You can learn more about how we built this Slack integration in the meetup recording.

Challenge #4: Passing large, sensitive payloads

The final challenge I covered is passing large, sensitive payloads to workflows and activities. This is primarily a concern for our deployment manifests, which can contain sensitive data and/or hit unexpected payload limits because of their size. As a solution, we encrypt these manifests and save them to object storage; instead of passing the payload into the workflow or activity, we pass the path to the object as the activity input.
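Here is a minimal sketch of that pass-by-reference pattern. The ManifestRef shape, object store interface, and decrypt hook are illustrative assumptions rather than our actual code:

```go
package deploy

import (
	"context"
	"fmt"
)

// ManifestRef is what actually travels through Temporal: small and non-sensitive.
type ManifestRef struct {
	Bucket string
	Path   string // e.g. "manifests/<cluster>/<deployment_id>.enc"
}

type ObjectStore interface {
	Get(ctx context.Context, bucket, path string) ([]byte, error)
}

type ManifestActivities struct {
	Store   ObjectStore
	Decrypt func(ciphertext []byte) ([]byte, error)
}

// ApplyManifest downloads and decrypts the manifest inside the activity, so the
// large, sensitive payload never appears in workflow inputs or history.
func (a *ManifestActivities) ApplyManifest(ctx context.Context, ref ManifestRef) error {
	ciphertext, err := a.Store.Get(ctx, ref.Bucket, ref.Path)
	if err != nil {
		return fmt.Errorf("fetching manifest %s/%s: %w", ref.Bucket, ref.Path, err)
	}
	manifest, err := a.Decrypt(ciphertext)
	if err != nil {
		return fmt.Errorf("decrypting manifest: %w", err)
	}
	return apply(ctx, manifest)
}

func apply(ctx context.Context, manifest []byte) error {
	// Applying the manifest (e.g. to Kubernetes) is elided here.
	return nil
}
```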

Lessons learned 

We learned many lessons throughout the process of building our own deployment system with Temporal. In particular:   

What’s next for Chronosphere and Temporal? 

Temporal has enabled Chronosphere to more safely and reliably automate complex, long-running tasks. It has also enabled us to easily reuse common activities across different types of workflows. For example, the helper that pauses a workflow if an alert fires can be used by our deployment system and capacity tooling with just a few lines of code. This empowers engineers across different teams to ship features quickly while meeting our rigorous standards for safety and customer trust.

Going forward, we plan to expand our usage of Temporal in developer-facing workflows, and compose workflows in more interesting ways. For example, our tooling for provisioning a dev instance of the Chronosphere stack requires creating base resources from a developer’s machine, and then calling the deployment system to deploy the services needed. We plan on moving the provisioning process to a workflow, which will trigger the deployment workflow. We also plan to extend our deployment system to coordinate deploys across multiple production regions, as opposed to requiring a developer to start different deploys for each region.

If you’re passionate about developer experience, or interested in helping us more safely deliver Chronosphere to all sorts of environments around the world, we’re hiring!

Chronosphere is the only observability platform that puts you back in control by taming rampant data growth and cloud-native complexity, delivering increased business confidence. To learn more, visit https://chronosphere.io/ or request a demo.

Much like any company that makes software its business, Chronosphere builds a large and complicated product: in our case, a large-scale distributed architecture that depends on complex open source projects, including M3 (jointly maintained by Chronosphere and Uber). Such a product requires an equally large, and ideally less complicated, test suite to ensure that bugs do not make their way into customer production environments.

Our test suite includes the common tests you might expect, plus others focused on scale, resilience, performance, and other tricky parts of distributed systems. This post walks through each suite, ordered by how easy it is for engineers to fix the issues it detects.

The suite life of test suites

Unit tests

Tried and true, unit tests separately verify the functionality of each function and object, and form the backbone of testing for nearly every software team. Unit tests are great at asserting pre-conditions and post-conditions for functions to let developers confidently compose them, knowing that conditions will hold for inputs and outputs. Chronosphere has a decent level of test coverage, and CI processes enforce maintaining that coverage using Codecov. But like every software company, there’s always room for more coverage.

Fuzz tests

For particularly complex functions, specifically those whose edge cases are impractical to generate by hand, we use property-based testing to generate valid random data and verify the functions behave as expected. For example, this has proven useful in exposing edge cases in our time series compression encoding/decoding logic. We use Leanovate’s gopter, a Go implementation of QuickCheck, as a powerful testbed for fuzz tests.
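To show the shape of such a test, here is a hedged sketch using gopter. The delta codec below is a stand-in for the real time series compression logic, which is considerably more involved:

```go
// codec_test.go
package codec

import (
	"testing"

	"github.com/leanovate/gopter"
	"github.com/leanovate/gopter/gen"
	"github.com/leanovate/gopter/prop"
)

func TestDeltaRoundTrip(t *testing.T) {
	properties := gopter.NewProperties(nil)

	// Property: decoding an encoded series always recovers the original values.
	properties.Property("decode(encode(xs)) == xs", prop.ForAll(
		func(xs []int64) bool {
			decoded := deltaDecode(deltaEncode(xs))
			if len(decoded) != len(xs) {
				return false
			}
			for i := range xs {
				if decoded[i] != xs[i] {
					return false
				}
			}
			return true
		},
		gen.SliceOf(gen.Int64()),
	))

	properties.TestingRun(t)
}

// deltaEncode/deltaDecode are simple stand-ins for the real codecs.
func deltaEncode(xs []int64) []int64 {
	deltas := make([]int64, len(xs))
	var prev int64
	for i, x := range xs {
		deltas[i] = x - prev
		prev = x
	}
	return deltas
}

func deltaDecode(deltas []int64) []int64 {
	xs := make([]int64, len(deltas))
	var prev int64
	for i, d := range deltas {
		prev += d
		xs[i] = prev
	}
	return xs
}
```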

Integration tests

We use integration tests to exercise complex interactions between M3 components, including M3DB, M3 Coordinator, and M3 Aggregator.

The integration tests programmatically create instances of M3 components to test, seeding them with data and then running validations against them to ensure expected results. A few of these integration tests use fuzzed test data to help sniff out edge cases.

Dockerized integration tests

Dockerized integration tests are a different flavor of integration test that sits a little higher in the stack.

Our regular integration tests often seed test data directly into M3. These tests instead model a small part of the full data lifecycle against built images of M3 components running in Docker. An example is this Prometheus write data test (warning: it contains a lot of lengthy bash script; here be dragons). The test writes data via the remote write endpoint on the M3 Coordinator, then verifies the read path by querying the data back through the coordinator’s query endpoints. These tests have the downside of being (partially) written and run in bash, which makes them harder to write, run, and debug, but catching issues in these more realistic situations is worth the pain.
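To make the flow concrete without the bash, here is a rough Go sketch of the same write-then-read loop. The localhost:7201 coordinator address, the metric name, and the lack of retries and assertions are assumptions for illustration only:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// 1. Build a snappy-compressed Prometheus remote write request.
	writeReq := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels:  []prompb.Label{{Name: "__name__", Value: "docker_it_metric"}},
			Samples: []prompb.Sample{{Value: 42, Timestamp: time.Now().UnixMilli()}},
		}},
	}
	raw, err := writeReq.Marshal()
	if err != nil {
		panic(err)
	}

	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:7201/api/v1/prom/remote/write",
		bytes.NewReader(snappy.Encode(nil, raw)))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Content-Encoding", "snappy")
	writeResp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	writeResp.Body.Close()

	// 2. Read the datapoint back through the Prometheus-compatible query endpoint.
	readResp, err := http.Get("http://localhost:7201/api/v1/query?query=" +
		url.QueryEscape("docker_it_metric"))
	if err != nil {
		panic(err)
	}
	defer readResp.Body.Close()
	body, _ := io.ReadAll(readResp.Body)
	fmt.Println(string(body)) // a real test would assert the written value appears here
}
```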

An aside on environments

The test suites above run on development stacks of varying levels of complexity: 

The straightforward requirements for these suites allow them to be easily run as part of our CI pipeline, but we also run other suites against environments more closely approximating production; the tests in the next sections run against two full-sized, production-ready stacks. 

These environments are great at representing how Chronosphere’s system performs under real-world conditions, as they are real-world conditions.

Scenario tests

These tests sit yet further up the stack, running in Temporal (a tool for orchestrating complex applications; we will be publishing a deep dive into our usage of Temporal soon, so watch this space) against the mirrored environment.

These tests run broader “scenarios” than the integration tests do, and they run them in an environment receiving a realistic volume of reads and writes. They help measure how well changes will perform in an environment running at scale, including custom functionality and extensions built on top of core M3. For example, one of our scenario tests adds a node to an M3DB cluster and then removes it, ensuring bootstrapping works as expected without read or write degradation. The scenario test suite runs continuously against the head of release candidate branches and the master branch to ensure confidence in the current release and to catch regressions as soon as they merge.
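As a hedged sketch, that add/remove-node scenario might be expressed as a Temporal workflow along these lines. The activity names (AddM3DBNode, CheckReadWriteHealth, RemoveM3DBNode) and their signatures are illustrative, not our actual scenario-test code:

```go
package scenarios

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func AddRemoveNodeScenario(ctx workflow.Context, cluster string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
		HeartbeatTimeout:    time.Minute,
	})

	var nodeID string
	// Add a node and wait for it to finish bootstrapping.
	if err := workflow.ExecuteActivity(ctx, "AddM3DBNode", cluster).Get(ctx, &nodeID); err != nil {
		return err
	}

	// While the topology changes, confirm reads and writes are not degraded.
	if err := workflow.ExecuteActivity(ctx, "CheckReadWriteHealth", cluster).Get(ctx, nil); err != nil {
		return err
	}

	// Remove the node again and re-verify.
	if err := workflow.ExecuteActivity(ctx, "RemoveM3DBNode", cluster, nodeID).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, "CheckReadWriteHealth", cluster).Get(ctx, nil)
}
```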

Dogfooding in meta

Before releasing to production instances, we deploy release candidates to the meta environment for several days, during which all engineers using the environment can report any issues they find with the build. If an issue is significant, we cut a hotfix branch, run some basic smoke testing to manually verify that the hotfix actually fixes the underlying issue, and then promote it to a release candidate so our Buildkite CI pipeline can pick it up. CI automatically runs each of the suites above and, on success, promotes the release candidate to meta.

Gaps in our test coverage

These suites make up our safe release stack, which provides confidence that a release candidate meets quality and reliability expectations for release to customers. However, these suites have shortcomings: they use synthetic data, act on a confined part of the stack, or both. In practice, this has meant missing some tricky edge cases or unexpected interactions that resulted in real-world impact and service degradation for customers.

M3 is a complex system consisting of multiple distributed components, which makes it difficult to test holistically with a high degree of confidence, especially once you introduce scale, timing, failovers, or a million other distributed systems issues. For a deeper dive into how these components fit together, and what it actually takes to persist a write within M3, the architecture docs are a good place to start.

A simplified diagram to illustrate the complexity of M3 and the interplay between different components to write a single datapoint.

Dogfooding release candidates that have passed all relevant test suites before they reach the meta environment helps engineers catch issues, but it’s not a perfect process. Internal users can ignore flaky features, or miss edge cases in queries that aren’t used often internally or that live on dashboards engineers rarely look at. On top of this, some issues only show up under high write or query load, which is difficult to emulate consistently and catch before confidently releasing to production.

Patching the holes

In summary, the test suite is extensive, but not comprehensive enough to catch every issue that affects customers’ experience and business. We needed something to fill those gaps by emulating and testing environments more like the ones our customers run. Keep your eyes open for the next post on the topic, coming in a couple of weeks, where we dig into what we ended up building and how it gives us the confidence we need.

Read part two of my blog series to continue the discussion: Comparing queries to validate Chronosphere releases.

At Chronosphere we have a globally distributed, remote-first engineering organization, which lets us hire amazing and diverse talent from everywhere. While this diversity has lots of benefits, it comes with some downsides that don’t typically need to be addressed in small, co-located startups. In this blog, I discuss these challenges and share the solutions Chronosphere engineers have come up with as a team.

Challenges with code reviews

There are two main challenges we run into when doing code reviews:

Guiding principles of code reviews

We started off with some guiding principles to help anchor everyone together.

Commit messages – best practices

Commit messages should describe the change in reasonable detail; the more context you provide, the easier it is for others to understand the change. Chris Beams writes about how to write a Git commit message here. Automated and trivial changes can keep their messages short and simple.
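For example, a commit message following these guidelines might look like this (a made-up change, shown only to illustrate the shape of the subject and body):

```
Add retry with backoff to tenant config fetcher

Transient network errors caused the config fetcher to fail and page
on-call unnecessarily. Retry up to three times with exponential backoff
before surfacing the error, and log each attempt for debugging.
```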

Before adding others to a pull request, do a self-review of the diff and make sure the CI pipeline is passing. Think of it as proofreading.

The body of the pull request should give an overview of the change, along with links to any related docs or tickets. Screenshots are strongly encouraged for UI changes, as they give reviewers a better understanding of the change.

Remember that new employees don’t have all the background or tribal knowledge you might have. Also, “future you” might not remember what was going on if you have to look back at the review later, so a little context helps everyone.

Pull requests – best practices

If a ticket is related to the pull request, make sure to reference it. Pull requests should aim for small, incremental chunks of work; larger code reviews take more time and are generally harder to do well.

You can read more details about optimal pull request size here. 

How to comment during code review

Comments should be clear about their severity and the desired changes. Use the suggested-change feature where possible; otherwise, provide example code, or a link to existing code, that demonstrates the change. Most of our review communication happens asynchronously, so it’s important that comments are clear and addressable without waiting a full day to make progress. After a couple of back-and-forths, comments alone tend to become unproductive, so hop on Zoom to figure things out if you find yourself in this situation.

The severity of comments ranges from nits to blockers. We use the following prefixes:

Resolving comments

Comments for changes should follow one of the following paths:

Blocking – proceed with caution

Generally, we favor getting something out over perfecting it. Blocking should be rare and reserved for dangerous changes that would, or could, cause an outage or other incident. Reviewers with blocking issues should reach out to the author on Slack or Zoom to discuss the issue as soon as possible; this avoids long async communication delays and keeps the code from going stale.

In keeping with our “assume good intentions” goal, we trust each other to merge good code and to take any necessary follow-up actions in a timely fashion.

Approving code reviews

Once you are done commenting on a review, approve it if there are no blocking issues. This lets the author make the changes and merge without another round of back and forth.