An introduction to instrumentation

A lot of what you read about observability mentions the benefits and potential of analyzing data, but little about how you collect it. This process is called “instrumentation”, and it broadly involves collecting events from infrastructure and code in the form of metrics, logs, and traces. There are of course dozens of methods, frameworks, and tools to help you collect the events that are important to you, and this post begins a series looking at some of them. This post focuses on introductory concepts, setting up the dependencies needed, and generating some basic metrics. Later posts will take these concepts further.

An introduction to metrics data

Vendors and open source projects have historically created their own ways to represent the event data they collect. While this remains true, there are growing efforts to create portable standards that everyone can build their own features on top of while retaining interoperability. The key project is OpenTelemetry from the Cloud Native Computing Foundation (CNCF). This blog series uses the OpenTelemetry specification and SDKs, but collects and exports a variety of the formats it handles.

The application example

The example for this post is an ExpressJS application that serves API endpoints and exports Prometheus-compatible metrics. The tutorial starts by adding basic instrumentation and sending metrics to a Prometheus backend, then adds more detailed metrics, and finishes by adding the Chronosphere collector. You can find the full and final code on GitHub.

Install and setup ExpressJS

ExpressJS provides a lot of boilerplate for creating a JavaScript application that serves HTTP endpoints, so it is a great starting point. Add it to a new project by following the install steps.

Create an app.js file and create the basic skeleton for the application:

const express = require("express");
const PORT = process.env.PORT || "3000";
const app = express();
app.get("/", (req, res) => {
  res.send("Hello World");
});
app.listen(parseInt(PORT, 10), () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});

Running this now with node app.js starts a server on port 3000. If you visit localhost:3000 you see the message “Hello World” in the web browser.

Add basic metrics

This step uses the tutorial from the OpenTelemetry site as a basis with some changes, and builds upon it in later steps.

Install the dependencies the project needs: the Prometheus exporter and the base metrics SDK.

npm install --save @opentelemetry/sdk-metrics-base
npm install --save @opentelemetry/exporter-prometheus

Create a new monitoring.js file to handle the metrics functions and add the dependencies:

const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { MeterProvider }  = require('@opentelemetry/sdk-metrics-base');

Create an instance of a MeterProvider that uses the Prometheus exporter. Prometheus itself typically runs on port 9090, and because the Prometheus server runs on the same machine in this example, the exporter uses port 9091 to avoid a port conflict.

const meter = new MeterProvider({
  exporter: new PrometheusExporter({port: 9091}),
}).getMeter('prometheus');

Create the metric to manually track, which in this case is a counter of the number of visits to a page.

const requestCount = meter.createCounter("requests", {
  description: "Count all incoming requests",
  monotonic: true,
  labelKeys: ["metricOrigin"],
});

Create a Map of counter instruments keyed by route (which in this case is only one) and create an exportable countAllRequests function that increments the count each time a route is requested.
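
The full implementation is in the post's GitHub repo; a minimal sketch of the rest of monitoring.js, based on the OpenTelemetry getting-started tutorial this step follows, looks something like this:

const boundInstruments = new Map();

module.exports.countAllRequests = () => {
  return (req, res, next) => {
    if (!boundInstruments.has(req.path)) {
      // Bind the counter to a label set for this route the first time it appears
      const labels = { route: req.path };
      boundInstruments.set(req.path, requestCount.bind(labels));
    }
    boundInstruments.get(req.path).add(1);
    next();
  };
};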

In the app.js file, require the countAllRequests function and, with Express's .use middleware function, call it on every request.

const { countAllRequests } = require("./monitoring");
…
app.use(countAllRequests());

At this point you can start Express and check that the application is emitting metrics. Run the command below and refresh localhost:3000 a couple of times.

node app.js

Open localhost:9091/metrics and you should see a list of the metrics emitted so far.

Install and configure Prometheus

Install Prometheus and create a configuration file named prom-conf.yml with the following content:

global:
  scrape_interval: 15s
# Scraping Prometheus itself
scrape_configs:
- job_name: 'prometheus'
  scrape_interval: 5s
  static_configs:
  - targets: ['localhost:9090']
  # Not needed when running with Kubernetes
- job_name: 'express'
  scrape_interval: 5s
  static_configs:
  - targets: ['localhost:9091']

Start Prometheus:

prometheus --config.file=prom-conf.yml

Start Express and refresh localhost:3000 a couple of times.

node app.js

Open the Prometheus UI at localhost:9090, enter requests_total into the search bar (the requests counter is exposed to Prometheus as requests_total), and you should see results.

Add Kubernetes to the mix

So far, so good, but Prometheus is more useful when also monitoring the underlying infrastructure running an application, so the next step is to run Express and Prometheus on Kubernetes.

Create a Docker image

The Express application needs a custom image. Create a Dockerfile and add the following (the image runs npm start, so make sure package.json defines a start script such as node app.js):

FROM node
 
WORKDIR /opt/ot-express
# install deps
COPY package.json /opt/ot-express
RUN npm install
# Setup workdir
COPY . /opt/ot-express
# run
EXPOSE 3000
CMD ["npm", "start"]

Build the image with:

docker build -t ot-express .

Download the Kubernetes definition file from the GitHub repo for this post.

Much of the configuration exists to give Prometheus permission to scrape Kubernetes endpoints; the configuration more specific to this example is the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ot-express
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ot-express  
  template:
    metadata:
      labels:
        app: ot-express  
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9091"
    spec:
      containers: 
      - name: ot-express 
        image: ot-express
        imagePullPolicy: Never
        ports:
        - name: express-app
          containerPort: 3000
        - name: express-metrics
          containerPort: 9091
---
apiVersion: v1
kind: Service
metadata:
  name: ot-express
  labels:
    app: ot-express
spec:
  ports:
  - name: express-app
    port: 3000
    targetPort: express-app
  - name: express-metrics
    port: 9091
    targetPort: express-metrics
  selector:
    app: ot-express
  type: NodePort

This deployment uses annotations to tell Prometheus to scrape metrics from the application's pods, and exposes the Express application port and the metrics exporter port.

Update the Prometheus configuration to include scraping metrics from Kubernetes-discovered endpoints. This means you can remove the previous Express job.

global:
  scrape_interval: 15s
scrape_configs:
- job_name: 'prometheus'
  scrape_interval: 5s
  static_configs:
  - targets: ['localhost:9090']
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

Create a ConfigMap of the Prometheus configuration:

kubectl create configmap prometheus-config --from-file=prom-conf.yml

Send the Kubernetes declaration to the server with:

kubectl apply -f k8s-local.yml

Find the exposed URL and port for the Express service, then open and refresh the page a few times. Find the exposed URL and port for the Prometheus UI, enter requests_total into the search bar, and you should see results.

Increasing application complexity

The demo application works and sends metrics when run on the host machine, Docker, or Kubernetes. But it’s not complex, and doesn’t send that many useful metrics. While still not production-level complex, this example application from the ExpressJS website adds multiple routes, API-key authentication, and error handling.

Keeping the existing instrumentation in place, update app.js to the following:

const express = require("express");
const { countAllRequests } = require("./monitoring");
const PORT = process.env.PORT || "3000";
const app = express();
app.use(countAllRequests());
function error(status, msg) {
  var err = new Error(msg);
  err.status = status;
  return err;
}
app.use('/api', function(req, res, next){
  var key = req.query['api-key'];
  if (!key) return next(error(400, 'api key required'));
  if (apiKeys.indexOf(key) === -1) return next(error(401, 'invalid api key'))
  req.key = key;
  next();
});
var apiKeys = ['foo', 'bar', 'baz'];
var repos = [
  { name: 'express', url: 'https://github.com/expressjs/express' },
  { name: 'stylus', url: 'https://github.com/learnboost/stylus' },
  { name: 'cluster', url: 'https://github.com/learnboost/cluster' }
];
var users = [
  { name: 'tobi' }
  , { name: 'loki' }
  , { name: 'jane' }
];
var userRepos = {
  tobi: [repos[0], repos[1]]
  , loki: [repos[1]]
  , jane: [repos[2]]
};
app.get('/api/users', function(req, res, next){
  res.send(users);
});
app.get('/api/repos', function(req, res, next){
  res.send(repos);
});
app.get('/api/user/:name/repos', function(req, res, next){
  var name = req.params.name;
  var user = userRepos[name];
  if (user) res.send(user);
  else next();
});
app.use(function(err, req, res, next){
  res.status(err.status || 500);
  res.send({ error: err.message });
});
app.use(function(req, res){
  res.status(404);
  res.send({ error: "Sorry, can't find that" })
});
app.listen(parseInt(PORT, 10), () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});

There are a lot of different routes to try (read the comments in the original code), but here are a couple to open more than once:
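
http://localhost:3000/api/users?api-key=foo
http://localhost:3000/api/repos?api-key=foo
http://localhost:3000/api/user/tobi/repos?api-key=foo

These examples use the foo key from the apiKeys array in the code above; omitting the key, or passing an invalid one, exercises the error-handling middleware instead.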

Rebuild the Docker image and start the application as above, and everything works the same, but with more metrics scraped by Prometheus.

If you’re interested in scraping more Express-related metrics, you can try the express-prom-bundle package. If you do, you need to change the port in the Prometheus configuration and the Docker and Kubernetes declarations to the Express port, i.e. “3000”. You also no longer need the monitoring.js file or the countAllRequests method. Read the package's documentation for more ways to customize the metrics it generates.
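
As a rough sketch of what that swap looks like (check the package's documentation for the exact options), the custom middleware is replaced with something like this in app.js:

const promBundle = require("express-prom-bundle");

// Collects default HTTP metrics (request counts and durations) for every route
const metricsMiddleware = promBundle({ includeMethod: true, includePath: true });
app.use(metricsMiddleware);
// The bundle serves its own /metrics endpoint on the Express port (3000),
// which is why the Prometheus scrape target changes to that port.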

Adding Chronosphere as a backend

Chronosphere is a drop-in, scalable backend for Prometheus; book a live demo to see more.

If you’re already a customer, then you can download the collector configuration file that determines how Chronosphere collects your metrics data, and add the domain of your instance and your API key as base64-encoded values to the Kubernetes Secret declaration:

apiVersion: v1
data:
  address: {SUB_DOMAIN}
  api-token: {API_TOKEN}
kind: Secret
metadata:
  labels:
    app: chronocollector
  name: chronosphere-secret
  namespace: default
type: Opaque

Follow the same steps for starting the application and Prometheus:

kubectl create configmap prometheus-config --from-file=prom-conf.yml
kubectl apply -f k8s-local.yml

Apply the Chronosphere collector definition:

kubectl apply -f chronocollector.yaml

Again, refresh the application page a few times, and take a look in a dashboard or the metrics profiler in Chronosphere and you should see the Express metric.

Next steps

This post showed you how to set up a JavaScript application to collect OpenTelemetry data using the Prometheus exporter and send basic metrics data. Future posts will dig into the metrics and how to apply them to an application in more detail.

Chronosphere has been using Temporal, an open source workflow orchestration engine, since late 2020 for three primary use cases: 

  1. Deployment workflows in a proprietary deployment system.
  2. Automated cluster operations such as rolling restarts and migrating capacity.
  3. Scenario testing for release validation.

I recently spoke at Temporal’s meetup on the first use case – deployments – and the remainder of this blog will focus on how we built our own deployment system using Temporal along with some challenges and lessons learned along the way. 

With the help of Temporal’s glossary, here are a few concepts and terms to help set context for the blog: 

Building our deployment system with Temporal 

Chronosphere uses a hybrid between multi- and single-tenant architectures, in which we deploy a separate fully functioning Chronosphere stack within each of our tenants or customer environments. Because of this, engineers typically need to deploy changes to production for a combination of services to a set of tenants. So when thinking about our deployment system, we wanted to ensure engineers had options to control different aspects of the rollout, such as the order in which tenants and services are deployed, whether they are deployed in parallel or sequentially, how failures should be handled, etc.

This led to the following structure for deployment workflows: 

In this structure, there are three main workflows:

Each of these workflows interacts with external dependencies such as Kubernetes, m3db-operator, object storage, and Slack. 

When building out this new, automated deployment system with Temporal, there were a few patterns and challenges we encountered along the way. See how we handled them below:

Challenge #1: Managing conflicting operations 

To help set context, let’s further define workflow IDs. Workflow IDs are user-provided identifiers for a workflow. For example, our Orchestration workflow has the ID pattern “deploy:<cluster>/<deployment_id>”. Temporal allows you to ensure that no two workflows with the same ID are running at once, or that no two workflow IDs are ever reused.

From the early days of our deployment system, we wanted to make sure we didn’t have multiple deployments changing the same set of services in the same customer environment. Initially, we decided to allow only one running deployment per customer environment by generating the ID for the DeployTenant workflow from the tenant, such as “deploy:<cluster>/<tenant>”.
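
Chronosphere's deployment system isn't open source, but to illustrate the concept, here is a rough sketch using Temporal's TypeScript/JavaScript SDK; the task queue name and argument shape are invented for the example, and only the workflow ID pattern comes from the post:

const { Connection, Client } = require("@temporalio/client");

async function startTenantDeploy(cluster, tenant) {
  const connection = await Connection.connect(); // defaults to localhost:7233
  const client = new Client({ connection });
  // Deriving the workflow ID from the cluster and tenant means Temporal rejects
  // a second deploy to the same tenant while one is already running.
  return client.workflow.start("DeployTenant", {
    taskQueue: "deployments",
    workflowId: `deploy:${cluster}/${tenant}`,
    args: [{ cluster, tenant }],
  });
}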

This worked fine at first, but became difficult to manage as the complexity of the system and our engineering organization started to grow. With more customers, services to manage, and more engineers wanting to do deploys, we ran into challenges with the following use cases: 

In order to resolve these challenges, we introduced a lock workflow to acquire a lock for services at the start of the DeployTenant workflow. By doing this, we are able to better control whether or not to proceed with changes or deployments to a tenant or set of tenants. We based this lock workflow on the mutex sample from Temporal, and then added additional functionality to increase usability for our use cases.  

Challenge #2: Implementing additional safety checks

At Chronosphere, we provide mission-critical observability to our customers, and we wanted to ensure our deployment system met our high bar for safety and correctness. To make deploys more robust to failures, we wanted the deployment to be aware of changes in the system that happened outside of a workflow. For example, if an alert triggers for a customer during a deploy to that customer's environment, we want the system to pause the deploy so the impact can be mitigated quickly.

In order to do so, we introduced different activities to perform specific checks (such as NoCriticalAlerts, IsPodAvailable). In addition, we added helper functions to execute these checks in parallel with other workflow changes. If a check fails, then all other checks are canceled, the workflow is terminated, and the initiating user is notified. These additional safety checks have also been useful for release validation and cluster operations.
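
Chronosphere's implementation lives inside its Temporal workflows, but the general shape of the pattern can be sketched in plain JavaScript with hypothetical check and deploy functions: race the deployment step against a promise that only settles when a check fails.

// Generic sketch of the pattern, not Chronosphere's actual workflow code:
// run safety checks alongside a deployment step and abort as soon as any check fails.
async function deployWithSafetyChecks(deployStep, checks) {
  const controller = new AbortController();

  // A promise that only ever rejects: the first failing check settles it.
  const anyCheckFailed = new Promise((_, reject) => {
    for (const check of checks) {
      check(controller.signal).then((ok) => {
        if (!ok) reject(new Error(`safety check ${check.name} failed`));
      }, reject);
    }
  });

  try {
    // Resolves when the deploy step finishes; rejects if it fails or a check fails first.
    return await Promise.race([deployStep(controller.signal), anyCheckFailed]);
  } finally {
    controller.abort(); // cancel any checks (or the deploy, on failure) still in flight
  }
}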

Challenge #3: Interacting with running workflows using Slack 

While rolling out new software to production, we occasionally need to involve an engineer to make a decision. This is needed for a few reasons, for example:

Overall, this approach has made our deployment process more interactive and efficient. Learn more about how we built this integration with Slack in the meetup recording.

Challenge #4: Passing large, sensitive payloads

The final challenge I covered is around passing large, sensitive payloads to workflows and activities. This is primarily a concern for our deployment manifests, as they can contain sensitive data and/or hit unexpected size limits. As a solution, we decided to encrypt these manifests and save them to object storage. Then, instead of passing the payload into the workflow or activity, we pass the path to the object as an activity input.

Lessons learned 

We learned many lessons throughout the process of building our own deployment system with Temporal. In particular:   

What’s next for Chronosphere and Temporal? 

Temporal has enabled Chronosphere to more safely and reliably automate complex, long-running tasks. It has also enabled us to easily reuse common activities across different types of workflows. For example, the helper that pauses a workflow if an alert fires can be used by our deployment system and capacity tooling with just a few lines of code. This empowers engineers across different teams to ship features quickly while meeting our rigorous standards for safety and customer trust.

Going forward, we plan to expand our usage of Temporal in developer-facing workflows, and compose workflows in more interesting ways. For example, our tooling for provisioning a dev instance of the Chronosphere stack requires creating base resources from a developer’s machine, and then calling the deployment system to deploy the services needed. We plan on moving the provisioning process to a workflow, which will trigger the deployment workflow. We also plan to extend our deployment system to coordinate deploys across multiple production regions, as opposed to requiring a developer to start different deploys for each region.

If you’re passionate about developer experience, or interested in helping us more safely deliver Chronosphere to all sorts of environments around the world, we’re hiring!

Chronosphere is the only observability platform that puts you back in control by taming rampant data growth and cloud-native complexity, delivering increased business confidence. To learn more, visit https://chronosphere.io/ or request a demo

One of the most important capabilities of an observability platform is alerting. How quickly can you know when something is wrong, so you can rapidly triage and remediate that problem? Chronosphere recently released a new approach to defining alerts called “Monitors,” which gives users more flexibility with alerts and makes them easier to create and manage.

An alert is only useful if it’s seen quickly and by the right on-call team, and that’s where PagerDuty comes in. Many of our customers use Chronosphere and PagerDuty together to:

How the integration works

Chronosphere uses the concept of “Monitors” to watch time series data and generate an alert when a time series violates a specified condition. When an alert triggers, it sends a notification to the endpoints you specify (called “notifiers”), including PagerDuty. The notification contains the data from the triggering time series and any metadata added to the monitor.

You can trigger notifications to different PagerDuty services, or group alerts into existing incidents.

Once the time series returns to a value that no longer violates the specified condition, you can optionally send a “resolved” notification to PagerDuty.

Requirements

Support

If you need help with this integration, please contact customer-support@chronospherdev.wpengine.com

Integration walkthrough

In PagerDuty

There are two ways to integrate with PagerDuty: via global event routing or directly through an integration on a PagerDuty service. Integrate with global event routing if you want to build different routing rules based on the events coming from Chronosphere. Integrate with a PagerDuty service if you don’t need to route alerts from Chronosphere to different responders based on the event payload.

Integrating with Global Event Routing

  1. From the Automation menu, select Event Rules and click your Default Global Ruleset.
  2. On the Event Rules screen, copy your Integration Key.
  3. Continue to the “In Chronosphere” section below.

Integrating with a PagerDuty service

  1. From the Services menu, select Service Directory.
  2. If you are adding the integration to an existing service, click the name of the service you want to add the integration to. Then select the Integrations tab and click Add a new integration.
    1. If you are creating a new service for the integration, read the Configuring Services and Integrations documentation, and follow the steps outlined in the “Create a New Service” section.
  3. Select “Prometheus” from the Integrations list and add it to the service. This redirects you to the Integrations tab.
  4. Copy the Integration Key of the service.

In Chronosphere

Chronosphere calls the endpoints it can send notifications to “notifiers”. You can set up a PagerDuty notifier in three ways: through the Chronosphere UI, with the Chronosphere CLI tool, or with the Terraform provider.

You can create the notifier with the Chronosphere UI from the Settings > Notifiers menu. Click the + Create Notifier button, and select the PagerDuty option. Use the value of the Integration Key for the Service Key field and  “https://events.pagerduty.com/v2/enqueue” for the URL field.

You can define the notifier(s) in a YAML file and use Chronosphere’s CLI tool to create them. Use the value of the Integration Key for the service_key field and  “https://events.pagerduty.com/v2/enqueue” for the url field. If you want to resolve the alert in PagerDuty when it resolves in Chronosphere, set send_resolved to “true” in the base_config.

api_version: v2
kind: Notifier
spec:
  notifier:
    name: team-pagerduty
    slug: team-pagerduty
    pagerduty:
    - base_config:
        # Whether or not to notify about resolved alerts.
        send_resolved: true
      service_key: <service_key>
      url: https://events.pagerduty.com/v2/enqueue
      severity: critical

Create the notifier using the CLI tool with the chronoctl apply -f file.yaml command.

You can define the notifier(s) in a .tf file and use Chronosphere Terraform provider to create them. Use the value of the Integration Key for the service_key field and  “https://events.pagerduty.com/v2/enqueue” for the url field. If you want to resolve the alert in PagerDuty when it resolves in Chronosphere, set send_resolved to true.

resource "chronosphere_pagerduty_alert_notifier" "default" {
  name = "Pagerduty Notifier"

  # Notifier-specific required configuration
  # Detailed definitions can be found at: https://developer.pagerduty.com/docs/events-api-v2/trigger-events/
  severity      = "info"
  url           = "https://events.pagerduty.com/v2/enqueue"

  send_resolved = true
}

Create the notifier using the provider with the terraform apply command.

When you create a notifier with the Chronosphere CLI tool or Terraform, you can send annotations to PagerDuty containing extra information, such as a direct link to the originating source.

For the CLI tool YAML file add something like the following:

api_version: v2
kind: Receiver
spec:
 receiver:
   email: null
   name: team-pagerduty
   opsGenie: null
   pagerduty:
     - base_config:
         send_resolved: true
       images:
         - src: "https://img.stackshare.io/service/12253/default_c4c3e2b994306c3a6028218ea49291a775fc3275.png"
           alt: "Chronosphere"
           href: "https://chronosphere.io"
       links:
         - href: "https://chronosphere.io"
           text: "Chronosphere"
         - href: "https://chronosphere.io"
           text: "Chronosphere"
       service_key: R02EU5OAWHP67AENFIGGBOO580OO2CXS
       severity: critical
       url: https://events.pagerduty.com/v2/enqueue
   slack: null
   webhook: null

And for the Terraform provider:

 image {
    # src is required for every image
    src  = "https://img.stackshare.io/service/12253/default_c4c3e2b994306c3a6028218ea49291a775fc3275.png"
    alt  = "Chronosphere"
    href = ""
  }
  link {
    # href is required for every link
    href = "https://chronosphere.io/"
    text = "some text"
  }

Chronosphere uses “Buckets” as containers for storing monitors, dashboards, and metrics. Every monitor, dashboard, and metric must belong to exactly one bucket. When you create (or edit) a bucket, you define the notifier policies, and there you can select the PagerDuty notifier(s) you created.

How to uninstall

To remove the Chronosphere notifier, you first need to change the notification policy of any Buckets that use it. 

To delete with the UI, open the Settings > Notifiers menu, click the notifier, and then click the delete icon.

To delete with the Terraform provider, remove the notifier definition from the .tf file and run the terraform apply command.

To delete with the CLI tool, use the chronoctl delete notifier -n notifier-name command.

Check out Chronosphere at PagerDuty Summit 2022 to learn more

Much like any company that makes software its business, Chronosphere builds a large and complicated product: in our case, a large-scale distributed architecture built on complex open source projects, including M3 (jointly maintained by Chronosphere and Uber). This product requires an equally large, and ideally less complicated, test suite to ensure that bugs do not make their way into customer production environments.

Our test suite includes common tests you may expect, and others focused on scale, resilience, performance, and other tricky parts of distributed systems. This post explains each suite ordered by how easy it is for engineers to fix any detected issues. 

The suite life of test suites

Unit tests

Tried and true, unit tests separately verify the functionality of each function and object, and form the backbone of testing for nearly every software team. Unit tests are great at asserting pre-conditions and post-conditions for functions to let developers confidently compose them, knowing that conditions will hold for inputs and outputs. Chronosphere has a decent level of test coverage, and CI processes enforce maintaining that coverage using Codecov. But like every software company, there’s always room for more coverage.

Fuzz tests

We use property-based testing to generate valid random data and verify that particularly complex functions behave as expected, specifically those where edge cases are impossible to generate manually. For example, testing time series compression encoding/decoding logic has proven useful in exposing edge cases. We use Leanovate’s gopter implementation of QuickCheck as a powerful testbed for fuzz tests.
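
Those tests are written in Go with gopter, but the idea translates to any QuickCheck-style library. Here is a rough illustration in JavaScript using the fast-check library, with a hypothetical timeseries-codec module standing in for the real encoding logic:

const fc = require("fast-check");
const { encode, decode } = require("./timeseries-codec"); // hypothetical module

// Property: decoding an encoded series of datapoints returns the original series,
// for any randomly generated (but valid) input.
fc.assert(
  fc.property(
    fc.array(fc.record({ timestamp: fc.nat(), value: fc.double({ noNaN: true }) })),
    (datapoints) => {
      const roundTripped = decode(encode(datapoints));
      return JSON.stringify(roundTripped) === JSON.stringify(datapoints);
    }
  )
);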

Integration tests

Integration tests verify complex interactions between M3 components, including M3DB, M3 Coordinator, and M3 Aggregator.

The integration tests programmatically create instances of M3 components to test, seeding them with data and then running validations against them to ensure expected results. A few of these integration tests use fuzzed test data to help sniff out edge cases.

Dockerized integration tests

A different flavor of integration test that sits a little higher in the stack.

Our regular integration tests often seed test data directly into M3. These tests instead try to model a small part of the full data lifecycle against built images of M3 components running in Docker. An example of this type of test is this prometheus write data test (warning: this test has a lot of lengthy bash script, here be dragons). The test writes data using the remote write endpoint in M3 coordinator, then verifies the read by querying for it through the coordinator query endpoints. The tests have the downside of being (partially) written and run in bash, making them more difficult to write, run, and debug, but catching issues in these more real-world situations is worth the pain.

An aside on environments

The test suites above run on development stacks of varying levels of complexity: 

The straightforward requirements for these suites allow them to be easily run as part of our CI pipeline, but we also run other suites against environments more closely approximating production; the tests in the next sections run against two full-sized, production-ready stacks. 

These environments are great at representing how Chronosphere’s system performs under real-world conditions, as they are real-world conditions.

Scenario tests

These tests sit yet further up in the stack, running in Temporal (a tool for orchestrating complex applications; we will be publishing a deep-dive into our usage of Temporal soon, so watch this space) against the mirrored environment.

These tests run broader “scenarios” than the integration tests do, and they run them in an environment handling a realistic volume of reads and writes. The tests help measure how well changes will perform in an environment running at scale, including custom functionality and extensions built on top of core M3 functionality. For example, one of our scenario tests adds, and then removes, a node in an M3DB cluster to ensure bootstrapping works as expected without read or write degradation. The scenario test suite runs continuously against the head of release candidate branches and the master branch to ensure confidence in the current release, and to catch regressions as soon as they merge.

Dogfooding in meta

Before releasing to production instances, we deploy release candidates to the meta environment for several days. During this period, all engineers using the environment can report any issues they find with the build. If they are significant, we release a hotfix branch and promote it to a release candidate after some basic smoke testing to manually verify that the hotfix has actually fixed the underlying issue, allowing our Buildkite CI pipeline to pick it up. The CI process automatically runs each of the above suites, promoting the release candidate to meta after successful completion.

Gaps in our test coverage

These suites make up our safe release stack, which provides confidence that a release candidate meets quality and reliability expectations for release to customers. However, these suites have shortcomings: they use synthetic data, act on a confined part of the stack, or both. In practice, this has led to missing some tricky edge cases or unexpected interactions that resulted in real-world impact and service degradation to customers.

M3 is a complex system consisting of multiple distributed components, which makes it difficult to test with a high degree of confidence. This complexity makes it difficult to test holistically, especially when introducing scale, timings, failovers, or a million other distributed systems issues. For further information and a deeper dive into how these components can fit together, and what happens to actually persist a write within an M3, the architecture docs are a good place to start.

A simplified diagram to illustrate the complexity of M3 and the interplay between different components to write a single datapoint.

Dogfooding release candidates that have passed all relevant test suites before reaching the meta environment can help engineers catch issues, but it’s not a perfect process. Internal users can ignore flakiness in features, or not notice edge cases in queries that aren’t often used internally or on dashboards that engineers rarely look at. On top of this, some issues only show up when running at high write load or high query load, which is difficult to emulate consistently and catch before confidently releasing to production.

Patching the holes

In summary, the test suite is comprehensive, but not comprehensive enough to catch issues that affect customers’ experience and business. We needed something to fill those shortcomings by emulating and testing environments more like those used by our customers. Keep your eyes open for the next post on the topic, coming in a couple of weeks, where we dig deep into what we ended up creating and how it works to give us the confidence that we need.

Read part two of my blog series to continue the discussion: Comparing queries to validate Chronosphere releases.

At Chronosphere we have a globally-distributed, remote-first engineering organization which lets us hire amazing and diverse talent from everywhere. While this diversity has lots of benefits, it comes with some downsides that don’t typically need to be addressed in small co-located startups. In this blog, I discuss these challenges and share solutions Chronosphere engineers have come up with as a team.

Challenges with code reviews

There are two main challenges we run into when doing code reviews:

Guiding Principles of code reviews

We started off with some guiding principles to help anchor everyone together.

Commit Messages – Best Practices

Commit messages should describe the change in reasonable detail. The more context, the easier for others to understand the change. Chris Beams writes about how to write a Git commit message here. Automated and trivial changes can be short and simple.

Prior to adding others to a pull request, it is recommended to do a self-review of the diff and ensure that the CI pipeline is passing. Think of it as proofreading.

The body of the pull request should provide an overview of the change along with a link to any related docs or tickets. Screenshots for UI changes are strongly encouraged as it gives reviewers a better understanding of the change.

Remember that new employees don’t have all the background or tribal knowledge that you might have. Also, “future you” might not remember what was going on, if you have to look back at the review, so having some context helps everyone.

Pull Requests – Best Practices

If a ticket is related to the pull request, make sure to include a reference to it. Pull requests should aim to have small incremental chunks of work. Larger code reviews require more time to do and are harder in general.

You can read more details about optimal pull request size here. 

How to Comment during code review

Comments should be clear about the severity and desired changes. If possible, use the suggested-changes feature. Otherwise, provide example code, or a link to existing code, that demonstrates the change. Most of our review communication is done async, so it is important to make sure things are clear and addressable without having to wait a full day to make progress. After a couple of back-and-forths, comments alone tend to become unproductive, so try hopping on Zoom to figure things out if you find yourself in this situation.

The severity of comments ranges from nits to blockers. We use the following prefixes:

Resolving Comments

Comments for changes should follow one of the following paths:

Blocking – proceed with caution

Generally we should favor getting something out over perfecting it. Blocking should be rare, and used to denote dangerous changes that would, or could, cause an outage or other incident. Reviewers with blocking issues should reach out to the author on Slack or Zoom to discuss the issue as soon as possible. This helps avoid long async communication delays and letting the code get stale.

Using our “assume good intentions” goal, we’ll trust each other to merge good code and take any necessary follow-up actions in a timely fashion.

Approving code reviews

Once you are done commenting on the review, if there were no blocking issues it should be approved. This allows changes to be made and merged without another round of back and forth.