“We were experiencing constant packet loss,” explained the observability lead. “Developers could break the whole system by making some benign change that has some weird bug and it crashes the whole system.” There was an extreme noisy-neighbor problem: any change a developer made could unpredictably impair StatsD’s ability to monitor some other, seemingly unrelated application.
It was so bad, he said, that when he first joined the company no one was actually paying attention to the metrics generated by StatsD; people relied instead on workarounds like counting log lines. When the team upgraded StatsD, they discovered that a bug in the system had been causing them to lose metrics.
“We did the upgrade, and there was an instant change in the pattern,” the observability lead said. “That’s the bigger problem: The pattern changed. That means we had been flying blind for a long time. We had been seeing a zigzag, but then it became a flat line, which was more or less what we expected.” The team had suspected that something was off but hadn’t found a way to verify it.
In general, DoorDash thinks of itself as being data driven. Everyone is crazy about numbers, from the CEO down to the newest engineer. Data is used to make better decisions about technology and about the business. If teams lose observability, the entire company loses the competitive edge that data provides. Because software is a core part of DoorDash’s product, losing visibility into the application suite was simply not acceptable.