Cudo: Proactively Preventing Disruptions in User Experience

From crypto mining to fog computing

Cudo started as a cryptocurrency mining platform, providing a desktop application that made it easier for users to mine the most profitable coins for their hardware, taking into account the network and other factors such as exchange rates. The original desktop app has since evolved into a dedicated operating system used by professional crypto miners rather than hobbyists mining on their home machines. Now the company is expanding into fog computing with Cudo Cloud, a cloud computing platform that lets hardware owners host workloads when they have spare capacity and lets workload owners run on any participating ISP.

Cudo’s clients tend to be companies with idle hardware, such as gaming cafes, large data centers and dedicated mining farms. As Cudo moves further toward fog computing, it expects a growing share of its customers to be data center owners.

Homegrown limitations

“We knew we were going to hit the limits of it pretty quickly,” explained Richard Poole, lead architect at Cudo Ventures, about the homegrown solution that let users monitor their infrastructure. “We just needed to release something sooner rather than later.” The monitoring system was MySQL-based, using sharded tables and date partitions so the team could drop data that was more than 30 days old. “We knew we were going to hit the limits of what our MySQL server could offer pretty quickly,” Poole said.

One of the biggest problems with the solution was that it could not aggregate any of the time series data. Data retention was also very limited: one-minute resolution was retained for an hour, 10-minute resolution for seven days and one-hour resolution for 30 days. “The really strictly defined resolution windows made the time series data not that useful for actually debugging issues,” Poole said. The lack of aggregation also made it impossible to get an overview of health and cost metrics across the entire fleet, something customers wanted but could not do.
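The retention tiers described above can be written down as a small lookup. This is only a minimal sketch to show why older data was hard to debug with: the names RETENTION_TIERS and finest_resolution are hypothetical and are not taken from Cudo’s actual MySQL schema.

```python
from datetime import timedelta
from typing import Optional

# (resolution, how long data at that resolution was kept),
# following the three tiers described in the text.
RETENTION_TIERS = [
    (timedelta(minutes=1), timedelta(hours=1)),
    (timedelta(minutes=10), timedelta(days=7)),
    (timedelta(hours=1), timedelta(days=30)),
]


def finest_resolution(age: timedelta) -> Optional[timedelta]:
    """Return the finest resolution still retained for data of this age.

    Returns None when the data is older than every tier, i.e. it has
    already been dropped.
    """
    for resolution, retention in RETENTION_TIERS:
        if age <= retention:
            return resolution
    return None
```

Under these assumptions, an incident that happened two hours ago could only be inspected at 10-minute resolution, and anything older than a week only at one-hour resolution.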

Then there was the internal monitoring, which was managed with Prometheus. “We were running pretty blind on a lot of the services,” Poole said. “We had the baseline Kubernetes metrics being exposed and we checked those only when there was an operational issue.” This setup made it practically impossible to act proactively. One issue specific to cryptocurrency is that the networks sometimes fork. When that happened, the node would stop synchronizing with the network, and until someone noticed the fork, users were unable to withdraw from the platform. With Cudo’s Prometheus-based monitoring system, the team would often first discover the fork when users started complaining.

Moving to Chronosphere

The team at Cudo had originally experimented with M3DB while building the homegrown solution for client-side monitoring, but could not keep the etcd cluster stable, which caused continual reliability issues. “M3DB ticks all the boxes for us,” Poole said. “As soon as we found out that Chronosphere was offering a hosted solution based on M3DB, we pretty much decided we were going to use that eventually.”

Responding proactively

When it comes to monitoring internal systems, Cudo’s team can now proactively prevent potential bugs from impacting end users instead of first hearing about problems when users complain.

“It gives us a good indication where we’ve got scalability issues or something isn’t engineered quite right,” Poole said. “It helps us uncover architectural issues.” The team no longer feels like it is flying blind, and it’s easy to add new metrics if needed — something that was extremely onerous previously. 

They are also able to debug much faster, reducing the customer impact of any issues. For example, when a cryptocurrency network forks, the team is alerted to the issue and is able to update the node before the issue is noticed by users, whereas before they would first learn about the problem from users. 

Cudo’s large data center and mining farm clients also have access to much better metrics now. Cudo’s software reports metrics to Chronosphere, and those metrics are then presented back to clients as part of the Cudo console. Clients now have enough granularity to use metrics for debugging individual devices, but the biggest value for these large operations is the ability to aggregate metrics and get an organizational overview. This allows them to see things like power consumption broken down by region in the data center. Most Cudo clients end up using the metric dashboard to optimize costs and monitor hardware health. “They all seem to love it,” Poole said. “It’s certainly a lot more powerful.”
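To make the kind of roll-up described above concrete, here is a minimal sketch of summing per-device power readings into a total per region. The function name, the sample tuple layout and the device and region names are all hypothetical. In a real deployment this aggregation would more likely be done by a query in the metrics backend than in application code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sample shape: (device_id, region, power draw in watts).
Sample = Tuple[str, str, float]


def power_by_region(samples: Iterable[Sample]) -> Dict[str, float]:
    """Sum per-device power readings into one total per region."""
    totals: Dict[str, float] = defaultdict(float)
    for _device_id, region, watts in samples:
        totals[region] += watts
    return dict(totals)


# Illustrative readings only; these are not real Cudo data.
example_samples = [
    ("rig-01", "hall-a", 1200.0),
    ("rig-02", "hall-a", 1150.0),
    ("rig-03", "hall-b", 900.0),
]
region_totals = power_by_region(example_samples)
```

The same per-device samples still support debugging a single machine, while the grouped totals give the fleet-wide view that the earlier MySQL-based system could not provide.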
