Product
Solutions
Resources
Company
Download Trial Book a Demo

Code Coverage for CUDA NVIDIA: Host and GPU Device Testing Made Simple

A CUDA build compiles into two worlds at once — CPU host code and GPU device code — and most coverage tools only ever see one of them. Here's how to instrument both sides, capture coverage from inside your kernels, and merge it all into a single source-level report.

For decades, code coverage has been a CPU story. You compile, you run your tests, a tool tells you which lines and branches ran, and you fill the gaps. But more and more of the work that actually matters now happens on the GPU — image pipelines, physics solvers, signal processing, and of course the matrix kernels behind every neural network. And for most teams, that GPU code is a coverage blind spot. The kernels run millions of times a day in production, yet no one can say which lines inside them have ever been exercised by a test.

The reason is structural, not accidental. CUDA code coverage is hard because a CUDA program is really two programs fused into one binary, and the tools you already trust were built to measure only the half that runs on the CPU. This guide explains the host/device split that causes the blind spot, what "coverage" even means when thousands of threads run the same line in lockstep, and how to instrument both sides of the build so that a GPU kernel shows up in your report exactly like any host function.

In one sentence

A host-only coverage tool reports green while every line of your __global__ kernels stays completely unmeasured — real coverage means instrumenting the device side too.

Why GPU coverage is hard

When you build a normal C++ program, one compiler turns your source into one stream of machine code for one CPU. A CUDA build is not like that. The nvcc compiler driver looks at your .cu files and splits them along a line drawn by function qualifiers:

  • __host__ code is ordinary CPU code — it is handed off to your system's host compiler (gcc, clang or MSVC) and ends up as normal machine instructions.
  • __global__ functions are kernels: entry points you launch from the host with the <<<grid, block>>> syntax, which then run on the GPU.
  • __device__ functions run on the GPU too, but are called only from other device code.

The device side does not become CPU instructions at all. nvcc compiles it to PTX — a virtual instruction set — which is then assembled into SASS, the actual machine code for your specific GPU architecture. So a single .cu file fans out into two completely separate compilation paths: one CPU object, and one embedded GPU image. A coverage tool that hooks the host compiler never even sees the device path.

On top of that, the GPU executes differently. It runs threads in groups of 32 called warps under a model NVIDIA calls SIMT — single instruction, multiple thread. Every thread in a warp wants to execute the same instruction at the same time. When a branch sends some threads one way and the rest another, the warp diverges: the hardware runs each path in turn with the other threads idled. Divergence is both a performance concern and, as we'll see, a place where coverage has something important to say.

What "coverage" even means on a GPU

The good news is that the core ideas transfer cleanly. Statement, decision and condition coverage are defined in terms of source structure, and your kernel source has statements, ifs and loops just like any other code. A line of device code is "covered" when at least one thread executed it; a branch is covered when at least one thread took it true and at least one took it false. So the metrics you already know still apply — the question is how to observe them.

The twist is aggregation. On the host, a line either ran or it didn't. On the GPU, the same line might be executed by thousands of threads across many warps, and you usually want to fold all of that into a single yes/no per source line. That folding is straightforward, but it's worth being deliberate about what you're collapsing.

Divergence is the part that rewards attention. A branch that only ever diverges one way — where the same lane always takes the true side and never the false side — is an under-tested branch, and it is exactly the kind of thing that hides bugs: an edge-case lane, a boundary thread, a tail block that your test inputs never push down the other path. Treating per-thread branch outcomes as ordinary decision coverage surfaces those gaps instead of letting a busy-looking kernel mask them.

A kernel that runs a billion times a day can still have a branch that has never once been taken false. Coverage is how you find it before production does.

The host side and the device side

It helps to picture a CUDA program as two halves that talk across a boundary. The host side is your main(), your setup, your memory allocations, your error handling, and the launch statements that fire kernels. The device side is the kernels themselves plus the device functions they call. Both halves live in the same source files; both are part of the same build; both can contain bugs.

The launch configuration — the grid and block dimensions you pass in the triple-chevron — is host code, and it decides how many threads run your kernel and in what shape. That matters for coverage because the launch parameters determine which threads even exist to exercise device branches. A test that launches a single block may never reach the code paths that only the last partial block triggers.

This is why a host-only tool gives such a misleading picture. It happily instruments the host half: it sees main(), sees the allocations, sees the line where you write kernel<<<...>>>(), and counts that launch line as covered. But it has no visibility past the launch. Every statement and branch inside the kernel is invisible to it. You can have a report that reads 90% while the half of your codebase doing the actual computation sits at zero — uncounted, not failing, just absent.

Instrumenting both sides of a CUDA build

Real GPU coverage means instrumenting both compilation targets. On the host side this is familiar territory: the host code is instrumented during its normal compile, the same way any C++ coverage works. The interesting work is on the device side, and it comes with two constraints that ordinary instrumentation never has to think about.

First, you have to inject coverage probes into the device code itself, so that as threads execute kernel statements and branches they record what they hit. Because thousands of threads touch the same probes at once, those updates have to be thread-safe — typically atomic increments or per-thread/per-warp counters that are reduced afterward — so that concurrent lanes don't lose each other's data.

Second, and more fundamentally, the GPU has no file system. Host coverage tools traditionally just write a .gcda file when the program exits. Device code cannot do that. It has no notion of opening a file or a path. So coverage data has to be accumulated into a buffer in GPU device memory, and then, after the kernel finishes, that buffer is copied back across the PCIe bus to the host — the same boundary your input and output data cross — where it can finally be written out and merged. The flow looks like this:

  1. Allocate a coverage buffer in device memory before launch.
  2. As kernels run, threads update their hit counters in that buffer.
  3. After the kernels complete, copy the buffer back to host memory.
  4. On the host, fold the per-thread data into per-line and per-branch results.
  5. Write the device results out alongside the host coverage data.
Mind the boundary

Device coverage data only becomes visible once it crosses back to the host. If a kernel crashes or you skip the copy-back, the data sits stranded in GPU memory — so the instrumentation has to manage that round trip reliably, not the test author.

One unified report

Collecting host and device data is only half the win. The other half is bringing them together so that you read one report, not two. The goal is a single source-level view where a kernel that sits at 60% coverage is just as visible, and just as actionable, as a host function at 60%. The same red/amber/green color-coding, the same per-file rollup, the same gap list — whether the line in question runs on a CPU or on ten thousand GPU threads.

That unification is what turns GPU coverage from a curiosity into something a team will actually act on. When the device code lives in the same report as everything else, it stops being a separate, scary, "we'll get to it later" category. An uncovered kernel branch shows up in the same list as an uncovered host branch, gets triaged the same way, and gets a test written for it the same way. The blind spot closes because the GPU stops being a special case in your tooling.

Coverage for CUDA with RKTracer

This is exactly what RKTracer is built to do, and it does it without you touching your source. There are no special macros to add to your kernels, no annotations, no separate device build. You prefix your existing build command — whether that's a direct nvcc invocation or a full make — and RKTracer auto-detects the CUDA toolchain, instruments both the host and the device code during compilation, and wires up the device-memory buffer and copy-back for you. When your tests run, both halves are measured; when they finish, rkresults emits one HTML (or XML) report covering the whole program.

terminal — host + device coverage in one pass
# Prefix your normal CUDA build — no source edits, no separate device build
$ rktracer make image_pipeline

  toolchain: nvcc 12.4 — host + device instrumented
   host: 24 files     device: 11 kernels — source unmodified

$ ctest # run your existing tests; kernels launch as usual
$ rkresults --report html

   Host statement   96.4%
   Device statement 88.1%
   Device decision  72.0%  (3 kernel branches never taken false)

RKTracer reports host and CUDA device coverage side by side and flags the exact kernel branches your tests never exercised.

Because GPU work rarely lives alone, it's worth seeing how this compares to the host-only tools most teams reach for first:

CapabilityHost-only coverage toolRKTracer for CUDA
Host (CPU) codeMeasuredMeasured
Device kernel codeInvisibleMeasured
Per-kernel branch / decisionNot collectedCollected
Source changes requiredSometimesNone
Unified host + device reportNoOne report

And when a gap is real rather than unreachable, RKTracer's AI test generation proposes the tests that close it — including the launch configurations and inputs needed to drive an under-tested kernel branch down its missing path. If you want the broader picture of what RKTracer measures on accelerators, the GPU & CUDA coverage page goes deeper.

In CI and across targets

None of this needs a parallel pipeline. Because RKTracer just prefixes the build, GPU coverage drops into the same CI job as your host coverage: build with the prefix, run the tests, emit the report, and fail the gate if coverage slips below your threshold. The XML output feeds the same dashboards you already use for CPU code, so the GPU numbers show up next to everything else instead of in a separate silo.

The deeper benefit is one model across every place your code runs. The same tool, the same metrics — statement, decision, condition, MC/DC and multi-condition — and the same report format cover host builds, embedded targets with or without a file system, GPUs and CUDA, and simulators or emulators. You don't learn a new coverage story per platform; the GPU simply joins the one you already have.

The mental model to keep

  • A CUDA build is two programs in one — instrument the device side or you measure nothing inside your kernels.
  • The same statement, decision and condition metrics apply per-thread; fold them into one yes/no per source line.
  • The GPU has no file system, so coverage data lives in device memory and is copied back across the bus.
  • Merge host and device into one report so a 60% kernel is as visible as a 60% host function.

Bringing the GPU into your coverage story

GPU code has been a blind spot for so long that many teams have stopped noticing the hole. But the kernels are where the heavy lifting happens, and "we never measured it" is not a comfortable answer when one of those kernels has a branch that has never been taken. The barrier was never the metrics — statement, decision and condition coverage mean exactly what they always meant. The barrier was reaching inside a build that compiles to two targets at once and getting data back from a device with no file system.

Once that's solved, the GPU stops being special. You prefix your build, run your tests, and read one report where host and device sit side by side. The blind spot closes, the kernel that was quietly running untested code gets a test, and your coverage story finally covers all the code you actually ship — not just the half that runs on the CPU.

AR
Arjun Rao
Developer Relations, RKValidate

Arjun helps teams instrument GPU/CUDA and embedded builds, getting real coverage out of code that runs everywhere from bare-metal targets to the accelerators in their data center.

Keep reading
Embedded

Cross-Compiler Coverage: Why Host Numbers Lie

Coverage measured on a host build can diverge sharply from the target — here's why, and how to measure on the hardware that ships.

Read more
Code Coverage

Everything You Need to Know About Code Coverage

Statement, branch, decision, condition, MC/DC and multi-condition — what each metric really proves.

Read more
Code Coverage

Tools for Code Coverage: A Practical Guide

A tour of the code-coverage tooling landscape and how to choose one for your language, platform and CI.

Read more

Bring your GPU code into coverage

See RKTracer measure host and CUDA device coverage in one report — no source changes, just prefix your build. Book a 30-minute demo or run the free trial today.