How NVIDIA Tests CUDA and GPU Skills in Its Interviews

Updated · techinterview.org

The NVIDIA loop that trips people up isn’t the LeetCode round. It’s the moment an engineer drops a working CUDA kernel in front of you, says it reaches 30% of an H100’s memory bandwidth, and asks why. There’s no trick answer. They want to watch you reason about how the hardware actually moves data, and whether you reach for a profiler or start guessing.

That one question captures what the company screens for. Fluency in C++ and Python is table stakes. What separates an offer from a polite rejection is whether you think in warps, memory hierarchies, and interconnect bandwidth the way most engineers think in for-loops.

The shape of the loop

Most candidates in 2026 report four to six rounds over four to eight weeks. A recruiter screen comes first, then a technical phone screen with an engineer who spends ten minutes on your background before handing you an open-ended problem. The onsite is three to five back-to-back sessions, each 45 to 60 minutes, usually a mix of CUDA or systems coding, GPU performance reasoning, a design discussion sized to the team, and one behavioral round.

Team matters more here than at most companies. A CUDA libraries team will push on kernel-level detail. An inference team wants to know how you’d serve a model under a latency budget. A driver or compiler team cares about C++ internals and undefined behavior. The recruiter usually tells you which org you’re interviewing with, so prep to that org, not to a generic “NVIDIA” blob.

One thing to brace for: the open-ended questions rarely have a single right answer, and the follow-ups keep coming. You optimize the kernel, they ask what breaks on the next architecture. You give a number, they ask how you measured it. The whole format is built to find the edge of what you actually understand.

The memory-bandwidth question, in detail

Memory coalescing is the idea they probe most, because it’s where most real kernels leave performance on the floor. When the 32 threads of a warp issue a load, the hardware looks at which memory segments the 32 addresses fall into. If the threads touch consecutive addresses, those accesses collapse into a few wide memory transactions: 32 consecutive 4-byte floats span 128 bytes, or four 32-byte sectors. If thread 0 reads address 0, thread 1 reads address 128, and so on, every thread triggers its own transaction and effective bandwidth craters.

The classic setup is a strided access pattern hiding inside innocent-looking indexing:

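// Coalesced: consecutive threads hit consecutive floats
// (kernel names and the bounds parameter n are illustrative)
__global__ void scale_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // guard the tail of the array
}

// Strided: each thread jumps by `stride`, killing coalescing
__global__ void scale_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i] * 2.0f;  // same work, scattered addresses
}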
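
To put a number on the difference, here is a minimal host-side sketch in plain C++ (so it runs without a GPU) that counts how many 32-byte sectors one warp touches for a given stride. The function name and the 32-byte sector size are assumptions of this model, not something a profiler reports under that name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical host-side model: counts the distinct 32-byte sectors that one
// warp of 32 threads touches when lane t loads the 4-byte float at index
// t * stride. The 32-byte sector size is an assumption of this model.
std::size_t sectors_per_warp_load(std::size_t stride) {
    assert(stride >= 1);
    constexpr std::size_t kWarpSize = 32;
    constexpr std::size_t kSectorBytes = 32;
    std::set<std::size_t> sectors;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t byte_address = lane * stride * sizeof(float);
        sectors.insert(byte_address / kSectorBytes);
    }
    return sectors.size();
}
```

With a stride of 1 the warp touches 4 sectors, with a stride of 2 it touches 8, and from a stride of 8 upward every lane lands in its own sector, so the same 128 bytes of useful data cost 32 sectors of traffic.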

The transpose is the canonical interview version. A naive transpose reads its input coalesced but writes its output strided (or the reverse), so one side of the copy always wastes bandwidth. The expected answer is to stage the tile in shared memory: read a 32×32 tile coalesced from global memory, transpose it inside shared memory, where strided access is cheap as long as you avoid bank conflicts, then write it back coalesced. If you’ve never written that kernel, write it once before the interview. It comes up by name.
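
The indexing is the part people get wrong at the whiteboard, so it helps to see it once. The following is a host-side sketch in plain C++, not the CUDA kernel itself: the local tile array stands in for the shared-memory tile, and the function name is hypothetical. Each of the two phases walks the global arrays row by row, which is what makes both the read and the write coalesced in the real kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 32;

// Host-side sketch of the tiled transpose. `in` is rows x cols, row-major;
// the result is cols x rows, row-major. The local `tile` stands in for the
// kernel's shared-memory tile; the extra column mirrors the [32][33] padding
// a real kernel would use to avoid bank conflicts.
std::vector<float> transpose_tiled(const std::vector<float>& in,
                                   std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<float> out(rows * cols);
    float tile[kTile][kTile + 1];
    for (std::size_t r0 = 0; r0 < rows; r0 += kTile) {
        for (std::size_t c0 = 0; c0 < cols; c0 += kTile) {
            // Phase 1: read the tile row by row (contiguous reads of `in`).
            for (std::size_t y = 0; y < kTile && r0 + y < rows; ++y)
                for (std::size_t x = 0; x < kTile && c0 + x < cols; ++x)
                    tile[y][x] = in[(r0 + y) * cols + (c0 + x)];
            // Phase 2: write row by row into `out` (contiguous writes),
            // reading the tile by column; the transpose happens here.
            for (std::size_t y = 0; y < kTile && c0 + y < cols; ++y)
                for (std::size_t x = 0; x < kTile && r0 + x < rows; ++x)
                    out[(c0 + y) * rows + (r0 + x)] = tile[x][y];
        }
    }
    return out;
}
```

In the CUDA version the two inner loop pairs become the two halves of the kernel, separated by a __syncthreads() so that no thread reads the tile before it is fully written.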

When they hand you the slow kernel, the move they want is not a guess. It’s profile first. Run Nsight Compute (ncu), look at the memory chart, check whether the bottleneck is DRAM throughput or L2, and read the sectors-per-request number that tells you directly whether your loads are coalesced. Saying “I’d check the profiler before changing code” already puts you ahead of most candidates.
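
When the follow-up is “how did you measure that 30%,” the arithmetic is simple enough to do out loud. A minimal sketch, assuming you already have the bytes the kernel moves and a measured kernel time; the function name is hypothetical and the peak bandwidth is a parameter you supply from the spec sheet of the part you are actually on.

```cpp
#include <cassert>

// Fraction of peak DRAM bandwidth a kernel achieved.
// bytes_moved:           total bytes read plus written by the kernel
// elapsed_seconds:       measured kernel time
// peak_bytes_per_second: the device's peak, supplied by the caller
double bandwidth_fraction(double bytes_moved, double elapsed_seconds,
                          double peak_bytes_per_second) {
    assert(elapsed_seconds > 0.0 && peak_bytes_per_second > 0.0);
    return (bytes_moved / elapsed_seconds) / peak_bytes_per_second;
}
```

As an illustration, a kernel that moves 2 GB in 2 ms sustains 1 TB/s. Against an assumed peak of 3.35 TB/s, that works out to roughly 30%.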

Warps, divergence, and occupancy

A warp is 32 threads executing in lockstep. When threads in the same warp take different branches of an if, the hardware runs both paths and masks off the inactive lanes, so a divergent two-way branch can cost you up to 2x on that stretch of code. The fix interviewers listen for is restructuring so divergence happens at warp boundaries: branch on threadIdx.x / 32, not on something that splits a single warp down the middle.
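
A quick way to check a branch condition before touching the profiler is to ask how many warps actually disagree on it. This is a host-side sketch in plain C++ with a hypothetical function name; it only models which lanes take the branch, not what the branch costs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Counts the warps in one block whose 32 threads do not all agree on a
// branch predicate. `takes_branch` maps a thread index to the branch taken.
std::size_t divergent_warps(std::size_t threads_per_block,
                            const std::function<bool(std::size_t)>& takes_branch) {
    constexpr std::size_t kWarpSize = 32;
    assert(threads_per_block % kWarpSize == 0);
    std::size_t divergent = 0;
    for (std::size_t base = 0; base < threads_per_block; base += kWarpSize) {
        const bool first = takes_branch(base);
        for (std::size_t lane = 1; lane < kWarpSize; ++lane) {
            if (takes_branch(base + lane) != first) {
                ++divergent;  // at least two lanes disagree in this warp
                break;
            }
        }
    }
    return divergent;
}
```

For a 256-thread block, branching on whether the thread index is even makes all 8 warps divergent, while branching on whether the warp index (thread index / 32) is even makes none of them divergent.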

Occupancy is the other lever, and it’s where people overcorrect. Occupancy is the ratio of active warps to the maximum a streaming multiprocessor can hold, bounded by whichever per-SM resource runs out first: the register file, shared memory, or the limit on resident blocks and warps. The common mistake is treating 100% occupancy as the goal. It isn’t. Plenty of fast kernels run at 50% occupancy because each thread uses more registers to keep data in fast storage. The honest answer to “how much occupancy do you need” is “enough to hide memory latency,” and you find that by measuring, not by maximizing a single number.
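
The “whichever resource runs out first” logic is just a minimum over a few divisions. Here is a simplified sketch in plain C++: the struct and function names are hypothetical, the default limits are illustrative assumptions and not values read from any device, and allocation granularity is ignored, so treat the result as an estimate and not as what the profiler will report.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Per-SM limits. The defaults are illustrative assumptions; on real hardware
// query them from the device (for example with cudaGetDeviceProperties).
struct SmLimits {
    std::size_t max_warps = 64;
    std::size_t max_blocks = 32;
    std::size_t registers = 65536;
    std::size_t shared_bytes = 100 * 1024;
};

struct KernelUsage {
    std::size_t threads_per_block;
    std::size_t registers_per_thread;
    std::size_t shared_bytes_per_block;
};

// Simplified theoretical occupancy: resident warps / max warps per SM.
// Ignores register and shared-memory allocation granularity.
double theoretical_occupancy(const KernelUsage& k, const SmLimits& sm = SmLimits{}) {
    constexpr std::size_t kWarpSize = 32;
    assert(k.threads_per_block > 0 && k.registers_per_thread > 0);
    const std::size_t warps_per_block = (k.threads_per_block + kWarpSize - 1) / kWarpSize;
    // How many blocks fit under each limit taken on its own.
    const std::size_t by_warps = sm.max_warps / warps_per_block;
    const std::size_t by_regs =
        sm.registers / (k.registers_per_thread * warps_per_block * kWarpSize);
    const std::size_t by_smem = k.shared_bytes_per_block == 0
        ? sm.max_blocks
        : sm.shared_bytes / k.shared_bytes_per_block;
    // The tightest limit decides how many blocks are resident.
    const std::size_t blocks = std::min({sm.max_blocks, by_warps, by_regs, by_smem});
    return static_cast<double>(blocks * warps_per_block) /
           static_cast<double>(sm.max_warps);
}
```

Under these assumed limits, a 256-thread block at 32 registers per thread reaches full occupancy, and doubling to 64 registers per thread halves it. That is the tradeoff the paragraph above describes: the 50% kernel may still be the faster one.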

Expect questions phrased close to these:

  • “A kernel reaches 30% of peak memory bandwidth on an H100. How do you find out why?”
  • “Your reduction is slower at 64 threads per block than at 256. What’s going on?”
  • “This branch sits in the inner loop. How much is it costing you, and how would you know?”

Shared memory and the bank-conflict trap

Shared memory sits on the SM and has roughly 100x lower latency than uncached global memory, which is why staging data there is the first optimization most kernels need. The catch is that it’s split into 32 banks, each 4 bytes wide, with consecutive 4-byte words assigned to consecutive banks. If two threads in a warp hit different words in the same bank, those accesses serialize. If they hit the same word, the hardware broadcasts and there’s no conflict.

The interview version is a shared-memory array of width 32 read down a column, so thread t touches tile[t][c]. Every thread in a warp lands on the same bank, so a 32-way conflict serializes the whole access. The one-line fix is padding the array to width 33, which shifts each row by one bank and spreads the accesses out. If you can explain why __shared__ float tile[32][33] beats tile[32][32], you’ve shown you understand the memory system at a level most candidates can’t.
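
The padding trick is easy to verify with modular arithmetic. A minimal host-side sketch in plain C++, assuming 32 banks of 4 bytes each so that a float’s bank is its word index modulo 32; the function name is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Worst-case number of threads in a warp that land on the same bank when
// thread t reads tile[t][column] from a float tile with `row_width` columns.
// Model assumption: 32 banks, each 4 bytes wide, so bank = word index % 32.
std::size_t column_read_conflict_degree(std::size_t row_width, std::size_t column) {
    constexpr std::size_t kWarpSize = 32;
    constexpr std::size_t kBanks = 32;
    assert(column < row_width);
    std::array<std::size_t, kBanks> hits{};
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t word_index = t * row_width + column;
        ++hits[word_index % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a row width of 32 the word index is 32t + c, which is always c modulo 32, so all 32 threads share one bank. With a row width of 33 it is 33t + c, which is (t + c) modulo 32, so every thread gets a different bank and the conflict degree drops to 1.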

The parallelism design question

For infrastructure and inference teams, the design round is rarely about load balancers. It’s about splitting a model that no longer fits on one GPU. A representative prompt: “You have a 70-billion-parameter model and nodes with 8 H100s connected by NVLink. How do you train it across 1,024 GPUs?” There’s no single answer, and the interviewer is testing whether you know the tradeoffs between the parallelism strategies and when each one stops working.

| Strategy | What it splits | Main cost | When it’s the answer |
|---|---|---|---|
| Data parallel | The batch; every GPU holds the full model | Gradient all-reduce each step; model must fit on one GPU | Model fits in memory and you want raw throughput |
| Tensor parallel | Individual matmuls across GPUs | All-reduce inside every layer; needs NVLink-class interconnect | A layer is too big for one GPU and you want low latency |
| Pipeline parallel | The layers, into sequential stages | Pipeline bubbles; needs microbatching to stay busy | A very deep model spread across many nodes |
| Sequence parallel | The sequence dimension of attention | Extra communication around attention | Context length is the thing that won’t fit |

The strong answer combines them: tensor parallel within a node where NVLink is fast, pipeline parallel across nodes where the interconnect is slower, and data parallel on top for throughput. That’s roughly how the large training runs are actually wired, and naming the interconnect boundary as the reason tensor parallelism stays inside the node is the detail that lands.
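
It helps to back the combined answer with numbers. The sketch below, in plain C++ with hypothetical names, factors the GPU count into tensor, pipeline, and data degrees and estimates model state per GPU. The bytes-per-parameter figure is an input: 16 is a common rule of thumb for mixed-precision Adam (weights, gradients, and optimizer state together), used here as an assumption. Activations and any sharding of optimizer state are ignored.

```cpp
#include <cassert>
#include <cstddef>

struct ParallelLayout {
    std::size_t tensor;    // tensor-parallel degree (kept inside a node)
    std::size_t pipeline;  // pipeline stages (spread across nodes)
    std::size_t data;      // data-parallel replicas
};

// Data-parallel degree is whatever is left after the tensor and pipeline splits.
ParallelLayout make_layout(std::size_t total_gpus, std::size_t tensor,
                           std::size_t pipeline) {
    assert(tensor > 0 && pipeline > 0 && total_gpus % (tensor * pipeline) == 0);
    return ParallelLayout{tensor, pipeline, total_gpus / (tensor * pipeline)};
}

// Rough bytes of weights, gradients, and optimizer state per GPU, assuming
// `bytes_per_param` covers all three and the model is split evenly over the
// tensor x pipeline ranks. Activations are not counted.
double model_state_bytes_per_gpu(double params, double bytes_per_param,
                                 const ParallelLayout& layout) {
    return params * bytes_per_param /
           static_cast<double>(layout.tensor * layout.pipeline);
}
```

For the prompt above, tensor parallel 8 and pipeline parallel 4 on 1,024 GPUs leave 32 data-parallel replicas, and 70 billion parameters at an assumed 16 bytes each come to about 35 GB of model state per GPU. Whether that leaves enough room for activations is exactly the kind of follow-up to expect.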

For an inference team, the same muscle shows up as a serving question: continuous batching, KV-cache memory math, and where tail latency hides when you pack requests of different lengths into one batch. If you’ve read how vLLM or TensorRT-LLM handle batching, say so and explain the mechanism, not the brand name.
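
The KV-cache memory math is worth being able to write down cold. A minimal sketch in plain C++, with hypothetical struct and function names; the shape used in the note below is an illustrative 70B-class configuration, not a statement about any particular model.

```cpp
#include <cassert>
#include <cstddef>

struct AttentionShape {
    std::size_t layers;
    std::size_t kv_heads;           // number of key/value heads per layer
    std::size_t head_dim;
    std::size_t bytes_per_element;  // 2 for fp16 or bf16
};

// Bytes of KV cache one token occupies: one key vector and one value vector
// per KV head, per layer.
std::size_t kv_bytes_per_token(const AttentionShape& s) {
    return 2 * s.layers * s.kv_heads * s.head_dim * s.bytes_per_element;
}

// How many tokens, summed over all in-flight requests, fit in a cache budget.
std::size_t max_cached_tokens(std::size_t budget_bytes, const AttentionShape& s) {
    const std::size_t per_token = kv_bytes_per_token(s);
    assert(per_token > 0);
    return budget_bytes / per_token;
}
```

With an assumed shape of 80 layers, 8 KV heads, a head dimension of 128, and 2-byte elements, each token costs 327,680 bytes (320 KiB), so a 20 GiB cache budget holds 65,536 tokens across the whole batch. That single number is what bounds how many long requests continuous batching can keep in flight.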

The C++ they expect you to know

Driver, compiler, and library teams treat C++ as a first-class skill, not a formality. Move semantics and when a move actually happens, RAII for managing CUDA streams and device memory, what makes code exhibit undefined behavior, and what virtual dispatch costs you in a hot loop are all fair game. A common pattern is a small class that owns a device allocation; they want to see you free it in the destructor and either delete or correctly implement the copy operations so you don’t double-free. Rusty C++ is the quiet reason a lot of strong GPU candidates stall in these loops.
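
Here is one way that owning class might look. It is a sketch, not a reference implementation: fake_device_alloc and fake_device_free are hypothetical stand-ins so the example runs on a host, and in real code they would wrap cudaMalloc and cudaFree with error checking. The class name and the allocation counter are likewise illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Hypothetical stand-ins for the device allocator so the sketch runs on a
// host. Real code would call cudaMalloc and cudaFree here.
inline int& live_allocations() { static int count = 0; return count; }
inline void* fake_device_alloc(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p != nullptr) ++live_allocations();
    return p;
}
inline void fake_device_free(void* p) { --live_allocations(); std::free(p); }

class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t bytes)
        : ptr_(fake_device_alloc(bytes)), bytes_(bytes) {
        assert(ptr_ != nullptr);
    }
    ~DeviceBuffer() { reset(); }

    // Copying would leave two owners of one allocation, so it is disabled.
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    DeviceBuffer(DeviceBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          bytes_(std::exchange(other.bytes_, 0)) {}
    DeviceBuffer& operator=(DeviceBuffer&& other) noexcept {
        if (this != &other) {
            reset();  // release what we currently own before taking over
            ptr_ = std::exchange(other.ptr_, nullptr);
            bytes_ = std::exchange(other.bytes_, 0);
        }
        return *this;
    }

    void* get() const noexcept { return ptr_; }
    std::size_t size() const noexcept { return bytes_; }

private:
    void reset() noexcept {
        if (ptr_ != nullptr) {
            fake_device_free(ptr_);
            ptr_ = nullptr;
            bytes_ = 0;
        }
    }
    void* ptr_ = nullptr;
    std::size_t bytes_ = 0;
};
```

Deleting the copy operations is the simplest defensible choice. A deep copy is also valid, but it hides a device-to-device transfer behind an assignment, which is usually not what you want in a hot path.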

Python shows up too, mostly for ML-adjacent roles, but it’s used to check that you can wire up a test, drive a kernel from PyTorch, or reason about where the GIL and host-device transfers cost you time. Nobody’s asking you to implement a red-black tree in Python here.

How to actually prepare

Write four kernels by hand: a vector add, a transpose, a reduction, and a simple tiled matmul. Profile each with Nsight Compute until you can read the memory and occupancy charts without thinking. Those four cover coalescing, shared memory, bank conflicts, divergence, and occupancy, which is most of what the GPU rounds test. Then read enough about tensor and pipeline parallelism that you can sketch on a whiteboard how a 70B model maps onto a node of 8 GPUs.

The candidates who clear this loop aren’t the ones who memorized definitions. They’re the ones who, handed a slow kernel, reach for a profiler, read the numbers, and can name which level of the memory hierarchy is the bottleneck and why. Build that reflex and the questions stop feeling like trivia.
