How Chronosphere uses LaunchDarkly for safer testing and releases

on August 17th 2021

As a quickly growing start-up, we are constantly releasing new features and improvements for our users (including ourselves!). Each new feature release comes with meticulous testing and QA to ensure each tenant remains highly secure and performant throughout the process.

Because our customers have varying requests and requirements for features, the testing and release process sometimes needs to occur tenant by tenant. In other words, certain features are turned on for some customers and not for others. To manage this, we use a concept called feature flags (which we will define below).

At large scale, feature flags can be difficult to manage, so Chronosphere turned to LaunchDarkly, a managed feature flag platform, to help control these releases across customers more efficiently. Before diving into how Chronosphere uses LaunchDarkly to help manage and test features, let’s take a step back and define what feature flags are, how they work, and why they are so powerful for testing and release purposes.

What is a feature flag? 

(Section written in collaboration with Dawn Parzych, Manager of Developer Marketing at LaunchDarkly) 

As defined by LaunchDarkly, “a feature flag is a software development process used to enable or disable functionality remotely without deploying code. New features can be deployed without making them visible to users. Feature flags help decouple deployment from release letting you manage the full lifecycle of a feature.”

Feature flags control how and where your code runs, and are most commonly used for release management, experimentation, and testing. Releasing a feature is a business decision; deploying software is an engineering decision. Deploying software requires testing and verification, which must happen before the software is released to all users. You often want to deploy software multiple times a day to test and verify, but releasing software every day is expensive. With feature flags, you can make your deployments smaller and more granular: correcting a small change is less expensive and time-consuming than correcting a massive change with numerous dependencies.

As an example, if you want to test out a new feature before deploying it to production, you can use a feature flag to activate the feature for a test environment or user while keeping it deactivated for all other environments or users. This not only makes testing safer (e.g., by gradually rolling out a feature across environments or users), but also means that you can keep everything on a single branch, simply switching a feature flag “on” or “off” without having to restart or redeploy your code.

With LaunchDarkly, you are able to manage all of your feature flags in one place. It offers SDKs for all major programming languages, such as Go, Java, and Ruby, making it easy to integrate into existing applications. Through LaunchDarkly’s central UI and dashboard, you can organize flags at various levels of a hierarchy (projects, environments, and users), segment and target users for testing, and manage or maintain flags at any stage in their lifecycle.

Chronosphere and LaunchDarkly in action  

Chronosphere uses LaunchDarkly for three primary use cases: toggling features, managing tenant-specific configuration, and acting as a control plane for external components such as our metrics agent process. To support these use cases, we integrated LaunchDarkly into the Chronosphere backend by writing a thin wrapper around the LaunchDarkly Go SDK. This wrapper enforces flag value persistence and specific user targeting, both of which we customized for our requirements and use cases.

Enforced user targeting 

Our feature flags are organized in LaunchDarkly’s flag hierarchy (described above) with one project (Chronosphere) that has two environments (RC, Production) and multiple users. In LaunchDarkly, a user is not necessarily an individual end user; rather, a user is a context for a feature flag. A user can be a person, an API, or, in our case, a tenant. With this organization, we are able to set flags for an entire environment (i.e., all of Production) or target a flag at an individual tenant. Before we look at our client, we should understand how users work in LaunchDarkly. There are two behaviors to note:

  1. You cannot create a user in the LaunchDarkly dashboard. Users are populated in LaunchDarkly by using their SDK to request flags for a specific user. We’ll see an example of this in a moment.
  2. Users are an optional field when evaluating a feature flag. If the user is not specified, the flag value of the environment will be used.

LaunchDarkly has built-in user targeting functionality; however, we took a slightly different approach to enforce our user-based targeting. To receive feature flag values for specific users, developers must provide a user struct to the LaunchDarkly SDK methods. This struct acts as a context during flag evaluation, which tells the LaunchDarkly SDK which user’s flag values to return to the caller. Since we want to serve flag values at the granularity of a tenant, we designed our client framework to ensure that the Chronosphere tenant is always reported to LaunchDarkly as a user.

Consider the following code:
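The sketch below illustrates the shape of such a call. It is an illustration under stated assumptions, not Chronosphere's code or the LaunchDarkly SDK itself: `User`, `FlagClient`, `newDemoClient`, and the flag key `new-dashboard` are stand-ins invented for this example. The SDK's own `BoolVariation` similarly takes a flag key, a user, and a default value.

```go
package main

import "fmt"

// User is a local stand-in for the SDK's user struct (hypothetical).
type User struct {
	Key string
}

// FlagClient is a hypothetical in-memory stand-in for a feature flag client.
type FlagClient struct {
	envValues  map[string]bool            // flag key -> environment-wide value
	userValues map[string]map[string]bool // flag key -> user key -> targeted value
}

// BoolVariation returns the value targeted at the user if there is one,
// then the environment-wide value, and finally the caller's default.
func (c *FlagClient) BoolVariation(flagKey string, user User, defaultVal bool) bool {
	if v, ok := c.userValues[flagKey][user.Key]; ok {
		return v
	}
	if v, ok := c.envValues[flagKey]; ok {
		return v
	}
	return defaultVal
}

// newDemoClient seeds one flag that is off for the environment but on for tenant-a.
func newDemoClient() *FlagClient {
	return &FlagClient{
		envValues:  map[string]bool{"new-dashboard": false},
		userValues: map[string]map[string]bool{"new-dashboard": {"tenant-a": true}},
	}
}

func main() {
	client := newDemoClient()

	// The tenant is passed explicitly as the user for this evaluation.
	user := User{Key: "tenant-a"}
	enabled := client.BoolVariation("new-dashboard", user, false)
	fmt.Println("new-dashboard enabled for tenant-a:", enabled)
}
```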

In this example the user is specified by name; however, it’s just as easy for a developer to pass in an empty user, as in this snippet:
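A hedged sketch of that mistake follows. The `User` and `Flag` types and the tenant name are hypothetical stand-ins rather than SDK types; the point is only that an empty user is a valid value, so nothing stops the call from compiling.

```go
package main

import "fmt"

// User is a local stand-in for an SDK user struct (hypothetical).
type User struct {
	Key string
}

// Flag is a hypothetical flag with an environment-wide value
// and per-tenant overrides keyed by user key.
type Flag struct {
	EnvValue  bool
	Overrides map[string]bool
}

// Evaluate returns the override targeted at the user if one exists,
// otherwise the environment-wide value.
func (f Flag) Evaluate(user User) bool {
	if v, ok := f.Overrides[user.Key]; ok {
		return v
	}
	return f.EnvValue
}

func main() {
	flag := Flag{EnvValue: false, Overrides: map[string]bool{"tenant-a": true}}

	// Intended call: the tenant is named, so its override applies.
	fmt.Println(flag.Evaluate(User{Key: "tenant-a"})) // true

	// Easy-to-miss mistake: an empty user compiles, but the
	// tenant's override is silently ignored.
	fmt.Println(flag.Evaluate(User{})) // false
}
```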

This code will compile, and the mistake is easy to miss in code review. With an empty user struct, LaunchDarkly will return the default flag value for the environment (i.e., Production), thus ignoring any tenant-specific overrides for this feature flag. This pattern also requires each instance of calling code to construct a user, which is inconvenient. Many issues arise if the user isn’t handled properly. For example, flag changes made in the LaunchDarkly UI for a specific tenant will not propagate to our services. This can result in a confusing scenario where our LaunchDarkly UI doesn’t match the reality of the software configuration. To avoid this issue, we designed the following interface:
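One way such an interface could look is sketched below. Every name here (`Client`, `NewClient`, `authenticate`, `ErrNoTenant`, the `evaluator` function type) is an assumption made for illustration and not the actual Chronosphere client. The idea it demonstrates is that the caller passes only a context, and the wrapper derives the user from the tenant stored in that context, refusing to evaluate when no tenant is present.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// tenantKey is the private context key under which the tenant is stored.
type tenantKey struct{}

// ErrNoTenant is returned when the context carries no authenticated tenant.
var ErrNoTenant = errors.New("featureflag: context has no authenticated tenant")

// authenticate is a demo helper standing in for auth middleware that
// attaches the tenant to the request context (hypothetical).
func authenticate(tenant string) context.Context {
	return context.WithValue(context.Background(), tenantKey{}, tenant)
}

// Client is a hypothetical tenant-aware interface: there is no user parameter.
type Client interface {
	BoolVariation(ctx context.Context, flagKey string) (bool, error)
}

// evaluator stands in for the underlying SDK call that takes an explicit user key.
type evaluator func(flagKey, userKey string) (bool, error)

type tenantClient struct {
	eval evaluator
}

// NewClient wraps an evaluator so callers can no longer choose the user themselves.
func NewClient(eval evaluator) Client {
	return &tenantClient{eval: eval}
}

// BoolVariation extracts the tenant from ctx and evaluates the flag for it.
// It fails instead of evaluating with an empty user.
func (c *tenantClient) BoolVariation(ctx context.Context, flagKey string) (bool, error) {
	tenant, ok := ctx.Value(tenantKey{}).(string)
	if !ok || tenant == "" {
		return false, ErrNoTenant
	}
	return c.eval(flagKey, tenant)
}

func main() {
	// Stand-in evaluator: the flag is on only for tenant-a.
	client := NewClient(func(flagKey, userKey string) (bool, error) {
		return userKey == "tenant-a", nil
	})

	enabled, err := client.BoolVariation(authenticate("tenant-a"), "new-dashboard")
	fmt.Println(enabled, err)

	// A context without a tenant is rejected rather than evaluated.
	_, err = client.BoolVariation(context.Background(), "new-dashboard")
	fmt.Println(err)
}
```

Returning an error for a missing tenant is a design choice of this sketch; it turns the silent empty-user mistake into a visible failure.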

You’ll notice that we removed the lduser.User variable from our internal client interface. The interface now receives an authenticated Go context to inform the client of the user. Through this approach, flag requests will always be tied to the correct user.

Persistence to Google Cloud Storage (GCS) 

In the event that LaunchDarkly is down, the client will revert to default values for enabled flags. We store a wide variety of configurations in LaunchDarkly, and therefore we never want to rely on default flag values, even in the face of an outage. It is common to persist flag values to an external datastore for this purpose. The LaunchDarkly SDK allows developers to implement the PersistentDataStore interface to send flag values to an external store. LaunchDarkly has official support for Redis, Consul, and DynamoDB out of the box. At Chronosphere, we aren’t currently operating any of the supported services in production, so we decided to implement the PersistentDataStore interface ourselves to store our data in GCS. In the event of an outage, our clients will fall back to a GCS blob to retrieve the most up-to-date flag values. This behavior is also enforced by our internal client, thus ensuring that we never return default flag values for a flag evaluation.
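The fallback behavior can be sketched in a simplified form. This is not the SDK's PersistentDataStore interface and not Chronosphere's implementation: `LiveService`, `BlobStore`, and `FallbackClient` are hypothetical in-memory stand-ins, with `BlobStore` playing the role of the GCS blob. The sketch persists every successful read and serves the persisted value during an outage, returning an error rather than a default when no value is available.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrUnavailable = errors.New("flag service unavailable")
	ErrNoValue     = errors.New("no value stored for flag")
)

// LiveService simulates the remote flag service; set Down to simulate an outage.
type LiveService struct {
	Down   bool
	Values map[string]bool
}

// Get returns the live value, or an error during an outage or for an unknown flag.
func (s *LiveService) Get(flagKey string) (bool, error) {
	if s.Down {
		return false, ErrUnavailable
	}
	v, ok := s.Values[flagKey]
	if !ok {
		return false, ErrNoValue
	}
	return v, nil
}

// BlobStore stands in for a storage object holding the last known flag values.
type BlobStore struct {
	values map[string]bool
}

func NewBlobStore() *BlobStore { return &BlobStore{values: map[string]bool{}} }

// Put records the latest known value for a flag.
func (b *BlobStore) Put(flagKey string, v bool) { b.values[flagKey] = v }

// Get returns the persisted value, or ErrNoValue if none was ever stored.
func (b *BlobStore) Get(flagKey string) (bool, error) {
	v, ok := b.values[flagKey]
	if !ok {
		return false, ErrNoValue
	}
	return v, nil
}

// FallbackClient reads from the live service, persists every successful read,
// and serves the persisted value when the live read fails.
type FallbackClient struct {
	live *LiveService
	blob *BlobStore
}

func NewFallbackClient(live *LiveService, blob *BlobStore) *FallbackClient {
	return &FallbackClient{live: live, blob: blob}
}

// BoolVariation never substitutes a hard-coded default: it returns an error
// when neither the live service nor the persisted copy has a value.
func (c *FallbackClient) BoolVariation(flagKey string) (bool, error) {
	if v, err := c.live.Get(flagKey); err == nil {
		c.blob.Put(flagKey, v)
		return v, nil
	}
	return c.blob.Get(flagKey)
}

func main() {
	live := &LiveService{Values: map[string]bool{"new-dashboard": true}}
	client := NewFallbackClient(live, NewBlobStore())

	v, err := client.BoolVariation("new-dashboard") // served live, then persisted
	fmt.Println(v, err)

	live.Down = true                               // simulate an outage
	v, err = client.BoolVariation("new-dashboard") // served from the persisted copy
	fmt.Println(v, err)
}
```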

So what’s next? 

Since embedding LaunchDarkly into our internal clients, the engineering teams at Chronosphere have seen noticeable efficiency gains. Before LaunchDarkly, engineers would need to edit static configuration files and redeploy services to change feature flag values. Now they can simply access the LaunchDarkly UI to manage all flag values in a single place, toggling specific values “on” or “off” as needed for their use cases. 

Thus far, we’ve integrated LaunchDarkly’s client in two environments across multiple users, and are actively using 20+ flags in production. As we keep growing and scaling out our processes, LaunchDarkly will continue to be an integral part of ensuring safe and efficient tests and feature releases for ourselves and for our customers.

At Chronosphere, we are building a Prometheus-native metrics monitoring solution designed for massive scale. If you’re interested in learning more, please reach out to contact@chronosphere.io or request a demo.
