How Together AI Interviews Backend and Infra Engineers

Updated · techinterview.org

Most product companies want to know whether you can ship a feature. Together AI wants to know whether you understand where a GPU’s time actually goes. They run one of the larger open-model inference clouds, hosting a couple hundred open-source models on shared accelerators, and the interview is shaped by that economics problem. A bad batching decision there isn’t a slow page load; it’s a six-figure monthly GPU bill that buys you nothing.

That framing matters because candidates walk in expecting a generic FAANG loop and get something with more hardware texture. The coding rounds are normal enough. The design and specialist rounds are where the company is really probing, and they reward people who have thought hard about memory bandwidth, attention internals, and multi-tenant scheduling rather than people who have only read about them.

The shape of the loop

The process usually opens with a recruiter screen, about half an hour, mostly logistics and a sanity check on why you want to work on inference infrastructure specifically. Saying you want to work in AI lands flat. Saying you have opinions about why throughput-optimized serving and latency-optimized serving pull in opposite directions lands much better.

After that comes a technical phone screen or an online assessment, then an onsite loop of roughly four to five rounds. End to end it tends to run two to six weeks, faster if a team has an urgent opening. The exact mix shifts by org and by the year you interview, so treat any fixed script as approximate. What stays constant is the split between general engineering ability and a deep technical block tied to the role you applied for.

The coding rounds

Expect two coding rounds, sixty minutes each, at a difficulty that sits around LeetCode medium with a couple of hard edges. People who have interviewed at Nvidia core teams or frontier-lab inference groups describe a similar bar. The problems themselves are often standard: a graph traversal, a sliding-window string problem, something with a heap. The differentiator is how you talk about complexity and memory while you solve them.

Interviewers on infra teams care about the constant factors, not only the big-O. If you write an O(n) solution that thrashes cache or allocates inside a hot loop, you’ll get asked about it. Narrate the allocation behavior. Mention when you’d switch from a hash map to a sorted array because the working set fits in L2. That kind of commentary signals you actually think about the machine, which is the whole job here.

A reasonable warm-up before the loop:

  • “Merge k sorted streams of token logprobs and keep the running top-p set.”
  • “Given request arrival times and durations, compute the maximum number of concurrent requests on one GPU.”
  • “Parse a batch of variable-length sequences into fixed buckets to minimize padding waste.”

None of those require exotic algorithms. They’re chosen because the framing maps onto serving problems, and a candidate who recognizes that connection without being told tends to do well in the later rounds too.
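As an illustration of how ordinary these are, the concurrency question reduces to a sweep over start and end events. The sketch below assumes requests arrive as (arrival_time, duration) pairs and that a request ending at time t does not overlap one starting at t; the function name and input format are made up for the example.

```python
def max_concurrent(requests):
    """Peak number of overlapping requests, given (arrival_time, duration) pairs."""
    events = []
    for arrival, duration in requests:
        events.append((arrival, 1))              # request starts
        events.append((arrival + duration, -1))  # request ends
    # Tuples sort by time first, then by delta, so at equal times an end (-1)
    # is processed before a start (+1) and back-to-back requests do not overlap.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


print(max_concurrent([(0, 5), (1, 2), (2, 4), (5, 1)]))  # prints 3
```

The sort makes this O(n log n) with one list of 2n tuples, which is the kind of allocation detail worth saying out loud in the round.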

The system design round, where it gets specific

This is the round that separates Together AI from a generic backend interview. The prompts are AI-infrastructure problems with real constraints, not “design Twitter.” A few that match what people report being asked:

  • “Design the inference service that hosts 200 open models on shared GPU capacity. How do you decide what stays resident and what gets evicted?”
  • “A customer wants a dedicated endpoint with tight tail-latency guarantees. What changes versus the shared pool?”
  • “Design multi-tenant LoRA serving so one customer’s adapter can’t starve another’s.”
  • “Requests have wildly different prompt and output lengths. How does your batching stay efficient without head-of-line blocking?”
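For the last prompt, a toy model helps show why head-of-line blocking hurts. The sketch below counts decode steps only, under simplifying assumptions that are not from any real scheduler: every request is already queued, prefill is free, and KV-cache memory never limits the batch. Both function names are hypothetical.

```python
import heapq


def static_batch_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batch_steps(output_lens, batch_size):
    """Decode steps when a freed slot is handed to the next queued request."""
    slots = []  # min-heap of the step at which each busy slot becomes free
    for n in output_lens:
        # Use an empty slot if one exists, otherwise wait for the earliest finisher.
        start = 0 if len(slots) < batch_size else heapq.heappop(slots)
        heapq.heappush(slots, start + n)
    return max(slots) if slots else 0


lens = [100, 5, 5, 5, 100, 5, 5, 5]  # two long generations mixed with short ones
print(static_batch_steps(lens, batch_size=4))      # prints 200
print(continuous_batch_steps(lens, batch_size=4))  # prints 105
```

In the static case the three short requests in each batch sit idle behind the long one; in the continuous case their slots are reused as soon as they finish, so the second long request overlaps the first.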

The model loading and eviction question is the one people underprepare. Cold-loading a 70B model into GPU memory takes real seconds, so you can’t just page models in on demand the way you’d treat a web cache. You end up reasoning about keeping popular models warm, predicting demand, sharing base weights across LoRA adapters, and what happens to in-flight requests when you evict. Strong answers reach for continuous batching and talk about the KV cache as the actual scarce resource, because at long context lengths the KV cache, not the weights, is what eats your VRAM.
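A back-of-envelope calculation makes the KV-cache claim concrete. The shape below (80 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumed configuration for a 70B-class model with grouped-query attention, and the context length and concurrency are picked only for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed model shape and workload; substitute the real config you are sizing.
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
context_len = 32_768
concurrent_seqs = 16

kv_total_gib = per_token * context_len * concurrent_seqs / 2**30
weights_gib = 70e9 * 2 / 2**30  # 70B parameters at 2 bytes each

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for {concurrent_seqs} x {context_len} tokens: {kv_total_gib:.0f} GiB")
print(f"16-bit weights: {weights_gib:.0f} GiB")
```

Under these assumptions each token costs 320 KiB of cache, sixteen full-length sequences need 160 GiB, and the weights are about 130 GiB, so the cache overtakes the weights well before the batch gets large.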

You don’t need to have built this before. You do need to reason about it from first principles and arrive at the tradeoffs honestly. If you claim exactly-once delivery for a streaming token API, expect a follow-up that pokes at where your guarantee breaks under a mid-stream GPU failure.

What each track actually tests

The specialist block depends heavily on which role you’re in. The technical rounds for an inference-engine hire look almost nothing like the ones for an API backend hire. Roughly:

Track                        | What the technical rounds lean on
Inference engine             | CUDA and Triton kernels, memory bandwidth, attention internals, profiling with Nsight
Serving / platform           | request batching, scheduling, multi-tenant GPU isolation, autoscaling under bursty load
Backend / API                | distributed systems, queues, rate limiting, idempotency, usage metering and billing
Training / fine-tuning infra | data pipelines, LoRA adapter management, checkpointing, throughput per dollar
Applied research             | a take-home implementing or optimizing a method, plus discussion of recent papers

If you applied to the inference-engine track and can’t talk about why a fused attention kernel beats the naive version, or why memory bandwidth rather than FLOPs is the ceiling for decoding, that’s a problem. The company employs the people who wrote FlashAttention, so the bar on this material is set by folks who think about it at the warp level.
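The bandwidth-versus-FLOPs point can be checked with a rough roofline estimate. The sketch below assumes about 2 FLOPs per parameter per generated token, weights read once per decode step for the whole batch, and round-number hardware peaks that are placeholders rather than datasheet values. It ignores KV-cache reads, which only push decoding further toward the bandwidth side.

```python
def decode_intensity(batch_size, bytes_per_param=2):
    """Approximate FLOPs per byte of weight traffic for one decode step."""
    # About 2 FLOPs per parameter per sequence; weights are read once per step.
    return 2 * batch_size / bytes_per_param


# Assumed round numbers for one data-center GPU (not datasheet figures).
PEAK_FLOPS = 1.0e15      # ~1 PFLOP/s of dense 16-bit math
PEAK_BANDWIDTH = 3.0e12  # ~3 TB/s of HBM bandwidth
ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity needed to become compute-bound

for batch in (1, 8, 64, 512):
    intensity = decode_intensity(batch)
    bound = "compute" if intensity >= ridge else "bandwidth"
    print(f"batch {batch:>3}: {intensity:6.1f} FLOP/byte -> {bound}-bound")
```

With these placeholder peaks the ridge sits near 333 FLOP/byte, so decoding stays bandwidth-bound until the batch reaches the hundreds, which is why batching policy and kernel memory traffic dominate the conversation.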

The CUDA and take-home tracks

For the lowest-level roles, a technical round may involve writing or optimizing an actual kernel. A common shape is a take-home that hands you a slow kernel and asks you to make it faster, then a follow-up call where you defend your changes. They’re watching whether you profile before you optimize, whether you understand occupancy and coalesced memory access, and whether you can explain why your change helped instead of just reporting that it did.

A concrete version of the warm-up many people do on their own:

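// Naive row-wise softmax: each thread reads global memory repeatedly.
// Assumes one block per row, e.g. a launch like softmax_rows<<<n, 1>>>(x, y, n, d).
// The interview question is usually "why is this bandwidth-bound,
// and what's the first change you'd make?"
#include <math.h>  // INFINITY

__global__ void softmax_rows(const float* x, float* y, int n, int d) {
    int row = blockIdx.x;
    if (row >= n) return;                    // guard in case the grid is larger than n
    const float* xr = x + (size_t)row * d;   // size_t avoids int overflow for large n*d
    float* yr = y + (size_t)row * d;
    float m = -INFINITY;                     // pass 1: row max, for numerical stability
    for (int j = 0; j < d; ++j) m = fmaxf(m, xr[j]);
    float s = 0.f;                           // pass 2: sum of shifted exponentials
    for (int j = 0; j < d; ++j) s += expf(xr[j] - m);
    for (int j = 0; j < d; ++j)              // pass 3: normalize and write the output
        yr[j] = expf(xr[j] - m) / s;
}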

The expected discussion is that the kernel makes three passes over global memory, each a serial loop with no cooperation between threads, when a block could load the row once into shared memory (assuming the row fits) and then run a block-level reduction for the max and another for the sum. If you can sketch that and estimate the bandwidth you’d save, you’re speaking the language the interviewer wants to hear.
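The bandwidth estimate itself is short arithmetic. The sketch below counts only bytes requested from global memory, assumes 4-byte floats, and uses a hypothetical 4096 by 4096 input; in practice caches may absorb some of the repeated reads, so treat the naive figure as an upper bound.

```python
def softmax_traffic_bytes(n, d, reads_of_x, elem_bytes=4):
    """Global-memory bytes moved: `reads_of_x` passes over x plus one write of y."""
    return (reads_of_x + 1) * n * d * elem_bytes


n, d = 4096, 4096  # hypothetical problem size
naive = softmax_traffic_bytes(n, d, reads_of_x=3)  # max pass, sum pass, normalize pass
fused = softmax_traffic_bytes(n, d, reads_of_x=1)  # row staged once in shared memory

print(f"naive: {naive / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB")
print(f"traffic saved: {1 - fused / naive:.0%}")
```

For this size the count drops from 256 MiB to 128 MiB, a 50% reduction, and on a bandwidth-bound kernel that ratio is a reasonable first guess at the speedup ceiling from this change alone.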

For research-leaning roles, the take-home runs four to eight hours on a realistic problem, often reproducing or extending something from a recent paper on speculative decoding or quantization. They care more about your judgment and your writeup than about a leaderboard score. A clean explanation of what you tried, what failed, and why beats a marginally better number with no reasoning attached.

The behavioral round and culture read

There’s a behavioral component, usually STAR-format, and it’s not a formality. Together AI is a startup that scaled fast, so they screen for people who can operate without a thick layer of process. Expect questions about a time you owned an ambiguous problem end to end, a time you disagreed with a technical direction, and how you handled an incident in production. Specifics win. “We had a latency regression after a kernel change, I bisected it to a register-spill issue and rolled back within the hour” tells them more than any amount of values talk.

They also read for genuine interest in open models. The company’s whole thesis is that open-source models plus good infrastructure can compete with closed frontier APIs, and people who clearly care about that mission tend to get the nod over equally skilled candidates who treat it as just another inference job.

How to prep without wasting time

If you have two weeks, split them. Keep your data-structures fundamentals sharp so the coding rounds are a non-event, but spend the larger share on the serving stack. Read how continuous batching works in vLLM, understand what PagedAttention solved and why KV-cache fragmentation was the problem it attacked, and be able to explain speculative decoding to someone who’s never heard of it. Know roughly what a single H100 can and can’t do for a 7B versus a 70B model, because order-of-magnitude estimates come up constantly in the design round.
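One way to practice those order-of-magnitude estimates is to script them. The sketch below assumes an 80 GB card with roughly 3.35 TB/s of memory bandwidth (approximate figures for one H100 variant; check the datasheet for the part you have in mind) and treats batch-1 decoding as reading every weight once per token. The function name and parameters are illustrative.

```python
def fits_and_decode_bound(params_billion, bytes_per_param, hbm_gb=80, bandwidth_tb_s=3.35):
    """Return (weights_gb, fits_in_memory, single-stream tokens/s upper bound)."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes each = GB
    fits = weights_gb < hbm_gb
    # Batch-1 decode streams all weights once per token, so bandwidth divided
    # by weight bytes bounds tokens per second from above.
    tokens_per_s = bandwidth_tb_s * 1000 / weights_gb
    return weights_gb, fits, tokens_per_s


for name, params, width in (("7B, 16-bit", 7, 2), ("70B, 16-bit", 70, 2), ("70B, 8-bit", 70, 1)):
    gb, fits, tps = fits_and_decode_bound(params, width)
    print(f"{name}: {gb} GB of weights, fits on one card: {fits}, <= {tps:.0f} tok/s per stream")
```

Under these assumptions a 7B model leaves most of the card free for KV cache, a 16-bit 70B model does not fit on one card at all, and an 8-bit 70B model fits but leaves only about 10 GB for everything else, which is the kind of conclusion the design round expects you to reach unprompted.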

The candidates who struggle are usually strong generalist engineers who never built the mental model for where GPU time and memory go. The ones who do well can stand at a whiteboard, get handed “design serving for 200 models on shared hardware,” and reason their way to KV-cache pressure and model-eviction policy without being led there. That instinct is what the entire loop is built to find.
