Genius Sports: Improving Development Speed While Ensuring High Availability

Genius Sports is a sports data company that collects, aggregates, and repackages sporting data, which it sells as a data pipeline. It works with sports organizations ranging from the NFL and English Premier League football to second division Azerbaijani badminton. The end customers for the data pipelines are varied as well: they include the sports’ teams and leagues themselves, sports media companies, gaming companies, and betting and trading companies.

The volume of data Genius Sports handles is extremely high. On an average Saturday afternoon, a thousand or more soccer games could be in play concurrently. Each game has hundreds or thousands of betting markets, and Genius Sports generates the odds that people bet on, which are based on a number of factors and updated many times per second. “There are millions of data points, and they’re changing every second or so when something happens in the game,” explained Luke Fieldsend, DevOps team lead at Genius Sports. “Generally, our customers need to get these changes within two seconds.”

High stakes

Avoiding service outages is extremely important for Genius Sports. If the system is down, end customers might be unable to cash out their bets after a game is over, or a media company might be unable to update advertising. Because uninterrupted service is so critical to clients, high availability is a major priority. “The cost of a service outage is very hard to estimate in concrete terms, but they are definitely quite high in terms of dollar amounts, even for an outage that lasts a couple minutes,” Fieldsend said.

“If we don’t have visibility, we can’t flag problems early,” he said. “That can have a number of knock-on effects and result in a service outage.” 

Moving to the cloud

Genius Sports had been using Graphite with a Grafana UI to monitor the services running in the data center. As the team started moving services to the cloud and adopting a cloud-native approach, the services running in AWS sent StatsD metrics to the Graphite service. This setup was struggling to keep up with the speed and scale at which Genius Sports was operating.

Each service ended up with a mix of Prometheus and Graphite metrics. Prometheus was running in the EKS clusters and collected pod-level and resource-level metrics. The application metrics, on the other hand, were being sent to Graphite in the data center. This setup was hard to maintain and caused friction for the developers.

“We were looking for a solution that would not only let us consolidate our metrics now, but also give us a path to potentially only using Prometheus in the future,” Fieldsend said. 

Looking for a solution

As Genius Sports started looking for a better monitoring solution for its cloud native workloads, Fieldsend had several factors in mind.

A hosted solution. “We didn’t want to have to roll our own, or spend an inordinate amount of time trying to self-host,” Fieldsend said. 

Grafana-based UI. Any opportunity to reduce the skills gap is ultimately good for productivity, and the team at Genius Sports had years of experience using Grafana dashboards, so a Grafana frontend was preferred. 

Easy migration. Even just a couple hundred individual services running in AWS translated into several thousand instances, so any migration would need to be highly automated.

Support for StatsD and Prometheus. The team was essentially producing hybrid metrics, and needed a way to give developers a single dashboard to access all metrics for all of the services in their catalog. 
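
To make the hybrid-metrics problem concrete, here is a minimal sketch of how one measurement looks in each of the two wire formats. The metric names and the `service` label are hypothetical and are not taken from Genius Sports' systems. The helper functions only illustrate the StatsD line format and the Prometheus text exposition format; they are not how either client library is implemented.

```python
from typing import Dict, Optional


def statsd_line(name: str, value: int, metric_type: str = "c") -> str:
    """Format a metric in the StatsD line format: <name>:<value>|<type>.

    "c" is a counter, "g" a gauge, "ms" a timer.
    """
    return f"{name}:{value}|{metric_type}"


def prometheus_line(name: str, value: int,
                    labels: Optional[Dict[str, str]] = None) -> str:
    """Format one sample in the Prometheus text exposition format."""
    if labels:
        # Labels are rendered as key="value" pairs inside braces.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"


if __name__ == "__main__":
    # Hypothetical application metric, pushed StatsD-style.
    print(statsd_line("pricing.odds_updates", 1))
    # Hypothetical equivalent exposed for a Prometheus scrape.
    print(prometheus_line("odds_updates_total", 42, {"service": "pricing"}))
```

StatsD names are flat, dot-separated strings pushed by the application, while Prometheus samples carry labels and are scraped. A single dashboard over both has to bridge those two models.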

Using Chronosphere

Genius Sports started out with a pilot of M3DB in one team, which went fairly well but still involved some management overhead. After discovering Chronosphere as a way to get a fully hosted version, the team ran a successful one-month pilot. Chronosphere has since been expanded to additional teams and is used daily by an increasing number of engineers at Genius Sports.

Since starting with Chronosphere, the teams have been able to completely eliminate their reliance on the data center appliances for monitoring. They also no longer have to worry about onboarding their services to the metrics stack; it has all been abstracted away. “That wasn’t a problem in every sprint, but there were definitely sprints where one to two days of a developer’s time went to interacting with the metrics stack,” Fieldsend said.

Service outages have always been rare at Genius Sports, since there is such a focus on preventing them in the first place. But before using Chronosphere, “We definitely had issues where we couldn’t get visibility on a problem service in production, and that held up clearing a service outage,” Fieldsend said.

Developers are able to use Chronosphere throughout the application lifecycle, reducing friction in the testing phase and allowing developers to deliver faster. In addition, the team now has access to application-level alerting, something that was elusive without consistent monitoring across environments. “Day-to-day, it’s all about supporting development and letting them go faster,” Fieldsend said. 
