Migration Dos and Don’ts
Change is in the air, with many organizations committing to changing monitoring and observability tooling in the coming years. Whether your org is looking to consolidate into a single platform, escape adversarial pricing/billing schemes, fully adopt open source instrumentation, or anything in between, how you migrate matters. After reflecting on my own experience managing major monitoring migrations as an SRE, and talking with fellow Chronauts involved in customer migrations, I have put together a list of common pitfalls that arise before and during a migration. Read on for observability migration “don’ts” … and “dos”.
Don’t skip requirements-gathering
Most organizations have a sense of why they need to ditch their current platform, but don’t always take the time to spell out what they actually need in the next one. Without a crystal-clear understanding of the requirements specific to your system, you could end up wasting time evaluating solutions or, worst case, selecting a platform that’s a bad fit and having to migrate all over again.
Do: Consider running a Start/Stop/Continue survey for everyone who uses or relies on observability data. Think beyond developers and SREs to include technical support, product managers, and engineering leaders. It can be as simple as asking these three questions:
- What is working well with $Current_Platform?
- What is not working well with $Current_Platform?
- What is missing?
Not every piece of feedback you get from colleagues will be relevant, so focus on separating the true needs from the “nice-to-haves.” The resulting requirements will serve as a consistent framework for evaluating potential solutions.
Don’t ignore the WHY
There are plenty of motivations to migrate, ranging from the practical (fleeing predatory pricing plans) to the silly (“our VP of Eng is best friends with their VP of Eng”). No matter what, some reason is driving this work, and it is very important that you write that reason down in a simple one-pager in the plainest terms possible. I call this “the Why doc,” and you can lean on it throughout the entire migration process.
Do: Advocate for your “Why” doc.
- When a grumpy developer who loved the previous platform loudly complains about the new one and starts seeding FUD (fear, uncertainty, and doubt) among their peers, refer to the reasons the previous tool had to go, which are listed in the “Why” doc.
- When you are planning the order of teams and services to migrate, and have difficulty getting managers to prioritize the work, refer to the “Why” doc.
- As you wrap up the migration and prepare to celebrate and sunset the old tool, leverage the “Why” doc to explain the value achieved.
- In two years, when a new developer joins and asks, “Why don’t we use $OLD_TOOL? I loved it at my last company,” refer them to the “Why” doc and evaluate whether those reasons still apply.
Whether or not someone is swayed by the Why doc, at the very least it can enable folks to “disagree and commit” and move your conversations and the migration effort forward.
Don’t depend on a “Lift and Shift” strategy
Lift and Shift describes a common migration approach where you “lift” exactly, or as close as possible, the existing configuration, instrumentation, monitors, and dashboards from the old system and “shift” them into the new one. Sounds straightforward enough, right?
It can be – if your organization has a clear observability strategy. When the lines of responsibility are clear across developers and operators, teams can actively manage, tune, and prune their monitors and dashboards.
The main issue with this approach is that it preserves all of the cruft and bad habits that have built up over the years. Think about some of the dashboards you scroll past with titles like “Bob’s MySQL Overview (test)” or “Alpaca Service (deprecated).” Or consider the extremes: dashboards with a single chart, or dashboards charting every possible metric under the sun. Not to mention monitors that are permanently muted, redundant, or in constant warning mode.
These are not providing value now, and they never will.
Do: Instead, try what I call “Lift, Sift, and Shift”:
- Lift – Take an inventory of existing services, infrastructure, and associated dashboards and monitors, along with their owning teams. Highlight, underline, and bold anything that is unowned or has shared ownership, and identify who will be responsible for migrating it.
- Sift – Flag any dashboards and monitors that are not providing value and not worth migrating, starting with the criteria above. Identify any assets that for whatever reason cannot be directly translated across platforms, and provide alternatives. Ask teams to sign off on the final list of assets to migrate.
- Shift – Route your telemetry to both the old and new platforms, and port over the dashboards and monitors that made the cut (see the sketch after this list). To catch any last-minute issues and increase confidence before the final cutover, run a brownout that temporarily cuts off access to the old platform, triple-check that telemetry is still streaming into the new one, and test out your alerts.
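To make the dual-write step concrete, here is a minimal sketch of an OpenTelemetry Collector configuration that sends the same metrics to both platforms during the overlap window. It assumes you are already collecting telemetry via OTLP; the endpoints, header names, and environment variables are placeholders, not real vendor values.

```yaml
# Minimal sketch: fan the same metrics out to the old and new platforms
# so dashboards and monitors can be validated side by side before cutover.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp/old_platform:
    endpoint: https://ingest.old-platform.example.com   # placeholder endpoint
    headers:
      api-key: ${env:OLD_PLATFORM_API_KEY}
  otlphttp/new_platform:
    endpoint: https://ingest.new-platform.example.com   # placeholder endpoint
    headers:
      api-key: ${env:NEW_PLATFORM_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      # Every data point goes to both backends until the brownout and
      # final cutover confirm the new platform is ready.
      exporters: [otlphttp/old_platform, otlphttp/new_platform]
```

If you run vendor-specific agents instead of the Collector, the same idea applies: keep both destinations live until the brownout gives you the confidence to drop the old one.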
Don’t start from scratch
If you read the last section and thought to yourself, “Hmm, wouldn’t it be better to just start entirely fresh in the new platform?”, think again. Don’t let the monitoring cruft cloud your judgment: amid the abandoned dashboards and flapping monitors there are expertly curated system views and trustworthy alerts.
Do: Take the time to do due diligence and audit what you have. Informal knowledge can be encoded in dashboards, knowledge that may not even be documented in a wiki. For example, there may be custom application metrics that were added to fill a visibility gap identified after an incident, or a lovingly curated dashboard with detailed description panels that help viewers interpret the data. Unless that information has been captured in an external runbook or wiki, wiping these dashboards out loses it and leaves your system less observable.
This matters most for unfamiliar services authored by team members who have long since left the company. It also matters for older parts of your infrastructure that don’t follow modern patterns and standards.
Institutional knowledge aside, it’s highly unlikely your organization is willing to invest the time and resources needed to rebuild every single monitor, dashboard, and view of the system from the ground up.
Don’t let the migration burden rest solely on SREs
Security is everyone’s responsibility, and so is observability! Often the SRE/Ops/Infra team is tapped to lead a migration, which is fine but wades into troubling waters when the lines of responsibility between software developers and SREs are undefined. Migrations shepherded and implemented by SRE teams alone — and with minimal engagement from developers — negatively impact both parties.
Why negative? When SREs do the migration on behalf of application teams, they can be seen as the “owners” of observability instead of trusted partners to consult. This can manifest as SREs acting as the first line of support answering questions in Slack, or as instrumentation and monitor-tuning tasks “thrown over the wall” to them. Meanwhile, application developers lose out on valuable time and experience learning the new platform. Even with the same telemetry, monitors, and dashboards, every observability platform is different and has its quirks; it simply takes time to learn how to navigate it, query it, and become proficient.
Do: Make your migration a cross-team effort. There are many decisions and considerations when it comes to migrating telemetry, monitors, and dashboards for applications. The best people to make decisions about application telemetry are the developers writing that code and on call for that service. Full stop.
Don’t ignore the fine print on vendor professional services
Enlisting the help of Professional Services can seem like a great way to bypass migration toil and ensure you’re maximizing all the benefits of a given platform. But your mileage will vary across vendors.
I learned this the hard way when shepherding my first migration, which was on a very tight timeline. The company I worked for used monitoring-as-code for its dashboards and monitors. Since our chosen vendor had a published Terraform provider, I assumed that translating from one vendor’s platform to another would be fairly uncomplicated. For whatever reason, our Professional Services rep was unwilling to learn Terraform and insisted the only option was for him to manually point-and-click his way through creating the hundreds of monitors and dashboards we needed. This left us in a worse state: it created more work post-migration, and managing him added undue stress while putting the migration deadline at risk.
Do: Lean on internal expertise first, but don’t write off vendor help entirely. On the flip side of my story, I’ve worked with incredible Solution Architects when migrating CI/CD platforms who brought great value in the form of personal and team learnings, helpful config tweaks, and advice tailored to our needs.
When engaging Professional Services, invest time in creating a detailed Statement of Work, and understand how you will be billed before forking over any funds. Don’t leave any room for bad assumptions, and be clear about what you and your organization need from the engagement.
Don’t gloss over the pricing model
Observability overages can be outrageously high, and it is worth analyzing what has contributed to your increased costs and data volume over the past year or so, especially if reducing cost is the primary driver for migrating.
There are positive reasons for increasing data volumes, like business growth in the form of more customers, the launch of popular new products or features, or increased system visibility from introducing telemetry like tracing or events. On the flip side, there are also suboptimal drivers: accidental misconfigurations, unmanaged cardinality, or limits on configuring data retention, to name a few.
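To show what “unmanaged cardinality” can look like in practice, here is a minimal sketch assuming Prometheus-style scraping; the job name, target, and label are hypothetical. A per-user or per-request label multiplies the series count of every metric that carries it, and dropping it at scrape time keeps that growth out of your data volume and your bill.

```yaml
# Hypothetical example: a service accidentally added a `user_id` label to its
# HTTP metrics, multiplying series counts by the number of active users.
# This scrape config drops the label before samples are stored or shipped.
scrape_configs:
  - job_name: checkout-service            # hypothetical job name
    static_configs:
      - targets: ["checkout.internal.example.com:9090"]
    metric_relabel_configs:
      - action: labeldrop                 # remove any label whose name matches
        regex: user_id
```

Whether or not you clean up issues like this before migrating, knowing they exist changes how you should model costs on any new platform.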
Do: Evaluate competing vendors’ pricing models carefully, so a lack of due diligence doesn’t land you with an equally expensive alternative. Whatever you find from researching your organization’s observability spend, ask yourself how those costs would look under a new vendor’s model. The worst outcome would be to expend all this effort moving to a new tool, get developers enabled and trained, and then wind up in the exact same position after new-customer discounts and sweetheart deals expire.
Migrating observability can be easier than you think
Migrations can be daunting, but after reading this you are now aware of the common challenges and how to navigate them. When you choose Chronosphere as your partner, you tap into our deep well of expertise in all things observability, from crafting an observability strategy all the way down to debugging complex composite PromQL queries.
If you want to go deeper on the migration topic, I recommend reading “Observability Platform Migration: A Practical Guide – 5 Steps to Upgrading Your O11y.”