GPU providers: 5 keys to AI O11y, Part 3


In part three of a five-part AI blog series, learn how to shape noisy telemetry, spot idle capacity, and survive demand spikes. This guide shows GPU providers the metrics and SLOs that lower spend, raise utilization, and scale smoothly.

Dan Juengst | Enterprise Solutions Marketing | Chronosphere

Dan Juengst serves as the lead for Enterprise Solutions Marketing at Chronosphere. Dan has 20+ years of high tech experience in areas such as streaming data, observability, data analytics, DevOps, cloud computing, grid computing, and high performance computing. Dan has held senior technical and marketing positions at Confluent, Red Hat, CloudBees, CA Technologies, Sun Microsystems, SGI, and Wily Technology. Dan’s roots in technology originated in the Aerospace industry where he leveraged high performance compute grids to design rockets.


TL;DR

Learn what GPU providers do for AI teams and why observability matters. In plain terms, providers run the GPU farms that Model Builders rent to train, fine-tune, and serve models, so their business hinges on three levers:

  • Keeping costs in check
  • Keeping GPUs busy
  • Scaling smoothly when demand spikes

Good observability turns those levers into measurable, fixable outcomes by shining a light on queues and idle capacity, shaping noisy telemetry to cut waste, and staying reliable at peak load. Bottom line: the right observability helps GPU providers lower spend, raise utilization, and stay elastic, which directly improves customer experience and margins.

Introduction

This blog is part three of our five-part series explaining how different segments of the AI market approach observability. In this installment, we’ll dive into GPU Providers and explain how observability drives success for these companies. Here’s the rest of our series:

Introducing the GPU providers

Let’s continue our discussion about AI Observability (applying modern observability to AI workloads) and the 4 key segments of the AI market that require observability.

In our Part 2 post, we introduced you to the Model Builders who build, train, host, and run large language models (LLMs). One thing we noted in that discussion is that their success hinges on the farms of GPUs that serve as the powerful infrastructure for the massively parallel computations required to make LLMs work. In many cases today, the Model Builders rent those GPUs from GPU Providers. So, this Part 3 of our 5-part series on the 5 Keys to AI Observability focuses on the GPU Providers.

SIDEBAR:

GPUs, or Graphics Processing Units, are similar to CPUs except that they are designed for massively parallel mathematical computations. GPUs were originally designed for the calculations required to render 3D computer graphics, such as those used in video games. 3D graphics rendering requires millions to billions of calculations to be performed at the same time. The chip architecture of a GPU has thousands of compute cores, whereas a CPU may have only a handful to a few dozen.

LLM training, evaluation, and inference all require massively parallel computations in the same way that 3D graphics rendering does. Around 2012, AI developers began using farms of GPUs to perform these computations, a development that enabled the rapid advancement of LLMs we have seen over the past few years.

CPU | GPU
Central Processing Unit | Graphics Processing Unit
4–8 cores | 100s or 1,000s of cores
Low latency | High throughput
Good for serial processing | Good for parallel processing
Quickly processes tasks that require interactivity | Breaks jobs into separate tasks to process simultaneously
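
To make the parallel-versus-serial point concrete, here’s a minimal Python sketch (using NumPy on a CPU, purely for illustration) of the kind of work a GPU accelerates: one enormous batch of independent multiply-adds. The matrix size is arbitrary; the contrast between computing cells one at a time and expressing the whole job as a single bulk operation is the point.

```python
import time
import numpy as np

# Illustrative only: NumPy runs on the CPU here, but the shape of the work --
# a huge batch of multiply-adds with no dependencies between them --
# is exactly what a GPU's thousands of cores chew through in parallel.
n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

# Serial mindset: one output cell at a time (shown for a single row to keep it quick).
start = time.perf_counter()
row = np.array([np.dot(a[0], b[:, j]) for j in range(n)])
serial_row_time = time.perf_counter() - start

# Parallel mindset: express the whole problem as one matrix multiply and let
# the hardware (BLAS here, a GPU in production) schedule the math in bulk.
start = time.perf_counter()
c = a @ b
bulk_time = time.perf_counter() - start

print(f"One row, cell by cell: {serial_row_time:.4f}s")
print(f"All {n * n:,} output cells at once: {bulk_time:.4f}s")
```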

Why GPU providers matter to Model Builders

GPU providers keep the lights on for AI, and they matter to every Model Builder we talk to.

We spent last week with a model team getting ready to launch a big fine-tuning process on their LLM. Their questions were simple. Will our jobs start fast, or sit in a queue? Will throughput hold steady when evaluation loops hammer the cluster?

These are the moments when the relationship between Model Builders and GPU providers becomes real. GPU providers build, host, rent, and operate the GPU infrastructure that powers training, fine-tuning, and inference at scale, and most AI companies rent instead of buy.

What GPU providers actually do

GPU providers power the entire LLM lifecycle, from training to real-time inference, using extensive GPU farms in their data centers. GPUs are essential for the massively parallel, high-speed computations required at every stage.

During training, GPUs handle billions of matrix multiplications, making the process feasible.

Fine-tuning also requires significant GPU resources to adapt models for specific tasks.

Retrieval-Augmented Generation (RAG) heavily relies on GPUs for embedding generation, similarity searches, and integrating retrieved information. Finally, inference uses GPUs to process millions of calculations concurrently and at high speed, enabling scalable and responsive LLM deployment for millions of users.

The success drivers are straightforward and stubborn: GPU utilization, cost efficiency, and infrastructure elasticity.

The success drivers that shape outcomes

Let’s take a look at the business impact of each of these success drivers for GPU Providers, and at how observability helps with each.

GPU utilization

Business impact

Imagine your fine-tune window opens at 9 a.m. If the cluster is fragmented or the scheduler is backed up, your jobs slip. That slip shows up as missed milestones, unhappy Product Managers, and wasted human time. GPU Providers feel the same pain, just in revenue terms. Utilization is the difference between selling every hour and leaving money on the table, and every idle GPU is lost revenue. So providers obsess over keeping busy time high and queue time low, because those two numbers predict both customer sentiment and margin.

How observability can help

With robust observability, providers gain deep insights into cluster fragmentation and scheduler backlogs. Real-time metrics on GPU activity, queue times, and job execution allow them to identify bottlenecks immediately and optimize resource allocation. This proactive approach minimizes job slips, improves model builder satisfaction, and directly translates to increased revenue by ensuring GPUs are consistently active and productive.
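
As a concrete illustration, here’s a minimal Python sketch of pulling those two utilization signals from a Prometheus-compatible API. The endpoint URL and the scheduler queue-time metric are assumptions for the example; the DCGM utilization metric is the one NVIDIA’s exporter typically publishes, but check the names your own stack actually exposes.

```python
import requests

# Assumptions: a Prometheus-compatible API at PROM_URL, DCGM-style utilization
# metrics, and a hypothetical scheduler metric for queue wait time. Adjust the
# metric and label names to whatever your stack actually exposes.
PROM_URL = "http://prometheus.example.internal:9090"

QUERIES = {
    # Fleet-wide busy ratio: average GPU utilization across all cards.
    "gpu_busy_ratio": "avg(DCGM_FI_DEV_GPU_UTIL) / 100",
    # Hypothetical scheduler metric: p95 time a job waits before its first GPU starts.
    "queue_p95_seconds": (
        "histogram_quantile(0.95, "
        "sum(rate(scheduler_job_queue_seconds_bucket[5m])) by (le))"
    ),
}

def instant_query(promql: str) -> list:
    """Run an instant query against the Prometheus HTTP API and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    for name, promql in QUERIES.items():
        for sample in instant_query(promql):
            print(name, sample["metric"], sample["value"][1])
```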

Cost efficiency

Business impact

Like other infrastructure providers, GPU Providers operate on slim margins. The last thing they want is an observability bill that consumes budget that could be spent on hardware.

How observability can help

Observability, when managed strategically, becomes a cost-saving tool itself. By treating telemetry data like any other workload, providers can leverage observability platforms to control, shape, aggregate, and reduce the massive influx of data. This intelligent data management significantly lowers observability bills, freeing up budget that can then be reinvested into critical areas like hardware upgrades, enhanced training programs, and expanding staffing.
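
Here’s a toy sketch of that telemetry-shaping idea in Python: collapse noisy per-GPU, per-pod samples into a handful of rolled-up series keyed by the labels that actually matter for billing and capacity decisions. The sample data and label names are invented; in practice this shaping happens inside the observability platform via aggregation or recording rules rather than in application code.

```python
from collections import defaultdict
from statistics import mean

# Invented sample data: per-GPU utilization readings with high-cardinality labels.
raw_samples = [
    {"tenant": "acme",   "node": "n1", "gpu": "0", "pod": "train-abc", "util": 92.0},
    {"tenant": "acme",   "node": "n1", "gpu": "1", "pod": "train-abc", "util": 88.0},
    {"tenant": "acme",   "node": "n2", "gpu": "0", "pod": "train-def", "util": 17.0},
    {"tenant": "globex", "node": "n3", "gpu": "0", "pod": "infer-xyz", "util": 64.0},
]

def shape(samples, keep_labels=("tenant", "node")):
    """Aggregate samples by the labels worth keeping; everything else is dropped."""
    buckets = defaultdict(list)
    for s in samples:
        key = tuple(s[label] for label in keep_labels)
        buckets[key].append(s["util"])
    return {key: mean(vals) for key, vals in buckets.items()}

rolled_up = shape(raw_samples)
print(f"{len(raw_samples)} raw series -> {len(rolled_up)} stored series")
for (tenant, node), util in rolled_up.items():
    print(f"tenant={tenant} node={node} avg_gpu_util={util:.1f}")
```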

Elasticity

Business impact

AI demand is spiky. Eval bursts, traffic spikes, surprise launches, and batch pipelines all collide at the worst possible time. So the GPU Provider’s capacity must be elastic. But what about the system that monitors that capacity? If observability cannot scale alongside the surge, the provider flies blind when it matters most.

How observability can help

In the face of unpredictable AI demand, observability ensures that providers aren’t flying blind. A scalable observability solution can mirror the elasticity of the GPU infrastructure, providing continuous, high-fidelity monitoring during peak loads and sudden surges. By tracking data points per second and active time series as hard numbers, providers can confidently manage eval bursts, traffic spikes, and surprise launches, maintaining high uptime and keeping even OpenAI-class capacity effectively monitored.
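
As a hedged sketch of what those hard numbers can look like, the snippet below polls two ingest meta-metrics and reports headroom against capacity limits. The metric names are the ones stock Prometheus exposes about its own ingest; the URL and limits are invented, and other backends will have their own equivalents.

```python
import requests

# Assumptions: a Prometheus-compatible backend at PROM_URL that exposes its own
# ingest meta-metrics, and invented capacity limits to measure headroom against.
PROM_URL = "http://prometheus.example.internal:9090"
LIMITS = {"datapoints_per_second": 2_000_000, "active_time_series": 50_000_000}

QUERIES = {
    # Datapoints per second flowing into the TSDB over the last 5 minutes.
    "datapoints_per_second": "rate(prometheus_tsdb_head_samples_appended_total[5m])",
    # Active (in-memory) time series right now -- the cardinality you are paying for.
    "active_time_series": "prometheus_tsdb_head_series",
}

def scalar(promql: str) -> float:
    """Return the first value of an instant query, or 0.0 if there is no result."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    for name, promql in QUERIES.items():
        value = scalar(promql)
        headroom = 100 * (1 - value / LIMITS[name])
        print(f"{name}: {value:,.0f} ({headroom:.0f}% headroom before {LIMITS[name]:,})")
```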

NVIDIA DCGM: A key source of GPU telemetry

So, where do you get the observability metrics to understand what is happening within your GPU infrastructure? Well, NVIDIA’s Data Center GPU Manager (DCGM) is a great source of telemetry on the health of NVIDIA GPUs and their supporting infrastructure (and NVIDIA is the dominant GPU supplier today).

DCGM exposes key GPU telemetry in an open source format via its Prometheus exporter, which means the data can be ingested into any Prometheus-compatible observability platform. (A small sketch of scraping that exporter follows the metric list below.)

Key metrics exposed by NVIDIA DCGM include:

  • Utilization metrics: These measure how busy the GPU is, including:
    • GPU utilization: The percentage of time the GPU’s core compute engine is processing tasks.
    • Memory utilization: The amount of GPU VRAM in use compared to its total capacity.
    • Compute utilization: The activity level of the GPU’s CUDA cores.
    • Tensor core utilization: The usage of Tensor Cores for AI and deep learning workloads.
    • Decoder/encoder utilization: The load on the hardware video decoder (NVDEC) and encoder (NVENC).
  • Performance metrics: These provide insights into GPU performance and efficiency:
    • Temperature: Core GPU die temperature and memory temperature.
    • Clock speeds: The frequency of the GPU’s memory clock and SM (Streaming Multiprocessor) clocks.
    • PCIe throughput: The rate of data transfer between the GPU and CPU over the PCIe interface.
  • Power metrics: These track energy usage and consumption:
    • Power usage: The power consumed by the GPU in watts.
    • Total energy consumption: The total energy consumed in millijoules since the driver was last reloaded.
  • Memory metrics: Detailed information about GPU memory:
    • Framebuffer used/free: The amount of GPU VRAM currently being used and the available free memory.
    • Memory bandwidth: The data transfer rate of the GPU memory.
  • Health metrics: Indicators of the GPU’s hardware status and errors:
    • Correctable remapped rows: The count of GPU memory rows remapped due to correctable ECC errors.
    • XID errors: Errors that indicate hardware issues or failures within the GPU.
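
Here’s the scraping sketch promised above: a few lines of Python that read the Prometheus text format served by NVIDIA’s dcgm-exporter and pick out a handful of the fields just listed. The host and port are assumptions (9400 is the exporter’s usual default), so point it at wherever your exporter actually runs.

```python
import requests

# Assumption: dcgm-exporter reachable at this host/port (9400 is its usual default).
EXPORTER_URL = "http://gpu-node-01:9400/metrics"

# A few of the DCGM fields discussed above, as exposed by dcgm-exporter.
INTERESTING = (
    "DCGM_FI_DEV_GPU_UTIL",     # core utilization, percent
    "DCGM_FI_DEV_FB_USED",      # framebuffer memory used, MiB
    "DCGM_FI_DEV_GPU_TEMP",     # GPU temperature, Celsius
    "DCGM_FI_DEV_POWER_USAGE",  # power draw, watts
    "DCGM_FI_DEV_XID_ERRORS",   # most recent XID error code, if any
)

def scrape(url: str):
    """Yield (metric_name, series_with_labels, value) for the metrics we care about."""
    text = requests.get(url, timeout=10).text
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name in INTERESTING:
            yield name, series, float(value)

if __name__ == "__main__":
    for name, series, value in scrape(EXPORTER_URL):
        print(f"{series} = {value}")
```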

A story from the field

A mixed-SKU fleet looked “full” on paper yet showed unpredictable job starts.

  • The team stopped arguing about theories and instrumented the path Model Builders care about most.
  • They tracked queue latency to first start, GPU busy ratio by tenant, and the error budget for those two signals.
  • They then tied alerts to SLO error budget burn, not raw thresholds, so only revenue-relevant incidents paged the on-call.

The outcome was visible in a week. Queue times dropped, idle pockets disappeared, and the customer experience improved without buying a single extra card.
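
For readers who want to see that alerting idea in miniature, here’s a back-of-the-envelope Python sketch of multi-window error-budget burn for a queue-time SLO. The 99% objective, the event counts, and the 14.4x fast-burn threshold (a commonly cited rule of thumb) are all illustrative, not prescriptive.

```python
# Illustrative numbers only: a 99% "jobs start within the queue-time objective" SLO.
SLO_TARGET = 0.99
ERROR_BUDGET = 1 - SLO_TARGET

def burn_rate(bad_events: int, total_events: int) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    observed_error_ratio = bad_events / total_events
    return observed_error_ratio / ERROR_BUDGET

def should_page(short_burn: float, long_burn: float) -> bool:
    # Multi-window rule of thumb: page only when the burn is both fast right now
    # (short window) and sustained (long window), instead of on raw thresholds.
    return short_burn >= 14.4 and long_burn >= 14.4

# Example: in the last 5 minutes, 60 of 400 jobs missed the queue-time objective;
# over the last hour, 700 of 4,800 did.
short_burn = burn_rate(bad_events=60, total_events=400)
long_burn = burn_rate(bad_events=700, total_events=4_800)
print(f"5m burn={short_burn:.1f}x, 1h burn={long_burn:.1f}x, "
      f"page={should_page(short_burn, long_burn)}")
```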

The business math

If you are a GPU provider, the business math is just as clear. Utilization is the revenue dial. High observability costs steal budget from GPUs, training, and staff. AI workloads are spiky, so your telemetry, dashboards, and alerting must keep up when everything scales at once. Those are not opinions. They are operating constraints.

At the end of the day, GPU providers win when they keep costs in line, utilization high, and elasticity real, and the right observability choice makes each of those goals visible and repeatable.

Frequently Asked Questions

What do GPU providers actually do for Model Builders?

They build, distribute, host, rent, and provision the GPU infrastructure that powers training, fine-tuning, and inference, and most customers rent instead of buy.

Which success drivers should providers optimize first?

Start with cost efficiency, then GPU utilization, then elasticity, since these three govern budget, revenue, and customer experience.

What makes observability challenging for GPU providers?

Cardinality and cardinality drift from tenants, jobs, SKUs, and regions can explode signals. Spiky AI load breaks naive alerting. You need strong telemetry shaping and SLOs tied to utilization and queue health.

Do I need vendor-specific metrics?

Yes, but mapped to portable concepts. Track HBM, NVLink, MIG, and MPS on NVIDIA, or XGMI and HBM on AMD, then normalize to fleet-level SLOs like queue time, busy ratio, and job success.

Are mega-scale GPU builds real or hype?

Real. Public projects like xAI’s Memphis expansion target the one-million GPU tier, which pushes providers to nail power, networking, and observability from day one.

O’Reilly eBook: Cloud Native Observability

Master cloud native observability. Download O’Reilly’s Cloud Native Observability eBook now!
