Is your incident response better than an energy company’s?


Principal Developer Advocate Paige Cruz outlines her experience with Portland General Electric and how it sets a precedent for incident response communication.

Paige Cruz Principal Developer Advocate | Chronosphere

Paige Cruz is a Principal Developer Advocate at Chronosphere passionate about cultivating sustainable on-call practices and bringing folks their aha moment with observability. She started as a software engineer at New Relic before switching to Site Reliability Engineering holding the pager for InVision, Lightstep, and Weedmaps. Off-the-clock you can find her spinning yarn, swooning over alpacas, or watching trash TV on Bravo.


Portland, Oregon, has weathered many storms over the past few years that have brought down century-old trees and severed power lines, leaving thousands of people and pets in the dark.

As a resident, I have become quite familiar with Portland General Electric’s (PGE) incident response communications and, honestly, I’m impressed. Their customer communications far exceed what I see from many tech companies. How is that possible? Fewer components to their system? Highly trained responders? I suspect it is due to the information and signals they’re able to get.

It’s no secret that software engineers face information overload when diagnosing system issues. There’s no solace in assuming that at least they are being alerted to real issues: 59% say half of the alerts they receive from their current observability solution aren’t actually helpful or usable. Not only is there a deluge of signals, but many are low quality, with 40% frequently getting alerts from their observability solution without enough context to triage, according to Chronosphere’s 2023 Cloud native observability report.

Follow a power outage through the eyes of a PGE customer and answer a few questions to see how your incident communications stack up.

Ready…set…outage! 

The very first observable event from a customer’s viewpoint is that all the lights suddenly go out, fans/air systems shut off, and an eerie silence settles in.

Not too long after (within minutes) a text arrives:

This is five-star customer outreach: PGE proactively tells me it knows about the outage and that the power I care about, specifically at my address, is affected. Being proactive likely saves PGE support from being overwhelmed with texts and calls while freeing up customer time to find some flashlights and candles.

Another highlight is that PGE’s next steps are clearly outlined. When there is news to report, they’ll be back with an estimated restoration time. Lovely! As the customer, that’s really the only other information I need.

The cherry on top is their method for reporting data inaccuracies. It uses the same communication channel, so the customer (who is stressed because the power just went out) doesn’t have to log into a separate website or app to report.

Say a customer hasn’t signed up for text alerts. They’d likely navigate to the PGE website on a mobile browser, hoping their battery outlasts the outage. The Outages & Safety section is easy to spot as a permanent part of the main menu bar and gets you to information in one click.

Notice anything about the order of reporting options? They’re listed in order of ease for the customer and efficiency for PGE. Reporting through the app is trivial, followed by filing on their website; both allow PGE to quickly collect reports at scale. The final option is to report by phone, which could tie up humans needed for the response, hence showing it last.

Hark, a map!

Say the text notification didn’t quench your thirst for outage information. PGE’s outage map is there for you on demand and shows the company’s current understanding of customer impact.

Why do I love this? Knowing if an outage is widespread across thousands of customers and regions or is isolated to my local neighborhood helps calibrate customer expectations for power restoration.

Obviously if the entire city of Portland is experiencing outages, it’d be absurd to expect a quick resolution.

Commendations

  1. Location-aware visualization: This is a view that shows the overall system impact. A table view of numbers just wouldn’t be able to communicate that as effectively.
  2. Key information about data freshness: When it was last updated and the refresh rate.
  3. Actual number of customers affected: Less useful for me, but increases my confidence that PGE has insight into their system since they’ve got a real live number there.
  4. Ability to search and pinpoint your area based on phone number: Not something arcane like account ID that no one has committed to memory.

Having this map lets people keep up to date on the resolution progress, or lack thereof, without having to bug and tie up PGE employees; Nervous Nellies can check and refresh to their heart’s content.
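If you want a status view of your own that earns the same commendations, the core data model is small. Here is a minimal sketch in Python. The names (`Outage`, `StatusSnapshot`, `summarize`) and the 15-minute refresh interval are my own assumptions for illustration; none of this reflects how PGE’s map is actually built.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Outage:
    """One active outage (hypothetical fields)."""
    area: str
    customers_affected: int


@dataclass(frozen=True)
class StatusSnapshot:
    """What a customer-facing status view shows at a glance."""
    total_customers_affected: int
    areas: list[str]
    last_updated: datetime
    refresh_interval: timedelta

    def is_stale(self, now: datetime) -> bool:
        # Stale once more than one refresh interval has passed since the last update.
        return now - self.last_updated > self.refresh_interval


def summarize(
    outages: list[Outage],
    last_updated: datetime,
    refresh_interval: timedelta = timedelta(minutes=15),  # assumed value
) -> StatusSnapshot:
    """Roll individual outages up into a single snapshot."""
    return StatusSnapshot(
        total_customers_affected=sum(o.customers_affected for o in outages),
        areas=sorted({o.area for o in outages}),  # de-duplicated, stable order
        last_updated=last_updated,
        refresh_interval=refresh_interval,
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    snapshot = summarize(
        [Outage("Northeast", 120), Outage("Southeast", 30)],
        last_updated=now - timedelta(minutes=5),
    )
    print(snapshot.total_customers_affected, snapshot.areas, snapshot.is_stale(now))
```

Showing `last_updated` and the refresh interval next to the affected-customer count is what lets a reader judge how much to trust the number.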

The app worth installing

I am loath to install an app on my phone these days, but the PGE app has earned a spot in my must-haves.

The design is simple and conveys information so clearly. Let’s look at their app’s outage status page:

Look at that incident resolution timeline! No need to be a lineman to grok it. It outlines:

  1. Simple stages of an outage with key moments highlighted for customers, especially “crew dispatched,” which means help is on the way!
  2. Updating with the cause of the outage sheds more light on roughly how long to expect mitigation to take. It’s not perfect, but it’s something.

Let there be light

Yay! The lights hum back to life, all the machines start making noises, heat flows again, and PGE sends a little wrap-up text.

Imagine you’d gotten that text while still sitting in the dark. You’d probably be unhappy until you read the final sentence and could easily inform PGE of their grave error by replying STILLOUT.

What a master class in customer communication!

If you’re wondering if your incident response is better than PGE’s, here are some questions you can ask yourself:

  1. Do you proactively notify customers about impact vs. customers notifying you?
  2. Is it both easy for customers to report impact and efficient for customer support to triage?
  3. Do you regularly provide status updates and info in a customer-friendly way? (Unless your customers are devs, no need to go into detail about a database lockup.)
  4. While an incident is occurring, are you able to determine how many customers are affected?
  5. Can customers subscribe to a feed of outage notifications? Are they tailored to the features/products they use or is it all outages?

If your answer is “no” to more than one, it’s time to reassess so your organization can follow in the footsteps of a master communicator.
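To make questions 1 and 5 concrete, here is a minimal sketch in Python of tailoring proactive notifications to what each customer actually uses. The `Incident` and `Subscriber` types and the `recipients_for` function are hypothetical names for illustration, not any particular product’s API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Incident:
    """An ongoing incident and the products it touches (hypothetical model)."""
    title: str
    affected_products: frozenset[str]


@dataclass
class Subscriber:
    """A customer who opted in to notifications for specific products."""
    contact: str
    products: set[str] = field(default_factory=set)


def recipients_for(incident: Incident, subscribers: list[Subscriber]) -> list[str]:
    """Return contacts whose subscribed products overlap the incident's impact."""
    return [
        s.contact
        for s in subscribers
        if s.products & incident.affected_products  # non-empty intersection
    ]


if __name__ == "__main__":
    subscribers = [
        Subscriber("a@example.com", {"billing"}),
        Subscriber("b@example.com", {"search", "api"}),
    ]
    incident = Incident("Elevated API errors", frozenset({"api"}))
    print(recipients_for(incident, subscribers))
```

Only customers who use an affected product hear about the incident, which keeps the feed relevant instead of broadcasting every outage to everyone.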

Curious how Chronosphere can help you improve incident response? Contact us for a demo today.
