NLP Interview Questions: Tokenization, Embeddings, and BERT

NLP interview questions appear across ML engineer, data scientist, and applied researcher roles at companies like Google, Meta, OpenAI, and Cohere. They test whether you understand language modeling beyond the API level — tokenization trade-offs, how embeddings encode meaning, and the architectural choices that made BERT and its descendants the dominant paradigm.

What Every Interviewer Is Testing

  • Do you know why subword tokenization exists, not just that it does?
  • Can you explain the difference between word2vec, GloVe, and contextual embeddings without buzzwords?
  • Do you understand BERT’s pre-training objectives and why they enable transfer learning?
  • Can you walk through fine-tuning BERT for a downstream task end-to-end?

Tokenization

Q: Why do modern NLP models use subword tokenization instead of word-level or character-level?

Word-level tokenization has an unbounded vocabulary problem: rare words, typos, and new terms become [UNK] tokens and lose all semantic signal. A production model trained on Wikipedia has no representation for “COVID-19” when it emerges.

Character-level tokenization solves OOV but creates excessively long sequences. “transformer” becomes 11 tokens; attention cost scales quadratically with sequence length.

Subword tokenization hits the middle ground. Common words are single tokens; rare words are split into recognizable pieces.
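
BPE's core merge loop can be sketched in a few lines of plain Python. This is a toy learner for illustration, not a production tokenizer; the word counts are made up:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: learn merge rules from a word-frequency dict.

    `words` maps words to counts; each word starts as a tuple of
    characters. Repeatedly merge the most frequent adjacent pair.
    Returns the learned merges (in order) and the final corpus."""
    corpus = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges({"low": 5, "lower": 2, "lowest": 2}, 3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The frequent stem "low" becomes a single symbol after two merges, while the rarer suffixes stay split into smaller pieces, which is exactly the common-words-whole, rare-words-in-pieces behavior described above.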

The three dominant algorithms:

| Algorithm | Used by | Key idea |
|---|---|---|
| BPE (Byte-Pair Encoding) | GPT-2/3/4, RoBERTa | Iteratively merge the most frequent byte/char pair |
| WordPiece | BERT, DistilBERT | Merge the pair that most increases the likelihood of the training corpus |
| SentencePiece / Unigram | T5, LLaMA, mBERT | Language-model-based; handles any script, no whitespace dependency |

Interview follow-up: “How does tokenization affect multilingual models?” Languages like Chinese are written without whitespace, so whitespace pre-tokenization fails; SentencePiece operates on raw text and handles this cleanly, while WordPiece requires language-specific preprocessing.

Q: What is tokenization fragmentation and why does it matter for LLMs?

“tokenization” → [“token”, “ization”] — two tokens, not one. Numbers like “12345” often fragment into [“12”, “34”, “5”]. This means:

  • Arithmetic is hard for LLMs because digit tokens carry no ordinal meaning
  • Token count != word count; roughly 1,000 tokens ≈ 750 words for English prose, and code typically needs even more tokens per word
  • Prompt engineering must account for boundary effects near context limits
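
Fragmentation falls directly out of how the vocabulary is matched. Below is a minimal sketch of WordPiece-style greedy longest-match splitting, using a tiny hypothetical vocabulary chosen to reproduce the examples above:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style).

    Non-initial pieces carry the '##' continuation prefix. Returns
    ['[UNK]'] if no prefix of the remainder is in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:  # longest match wins
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary fragment, for illustration only
vocab = {"token", "##ization", "12", "##34", "##5"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("12345", vocab))         # ['12', '##34', '##5']
```

Note how "12345" splits at arbitrary boundaries determined by which digit chunks happen to be in the vocabulary, which is why digit tokens carry no ordinal meaning.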

Embeddings

Q: What is the difference between word2vec, GloVe, and contextual embeddings?

word2vec (2013, Google): Predicts surrounding words (Skip-gram) or predicts center word from context (CBOW). Produces a single static vector per word. “bank” has one vector regardless of context.

GloVe (2014, Stanford): Factorizes the global word co-occurrence matrix. Produces static vectors that capture global corpus statistics. Slightly better on analogy tasks (“king – man + woman = queen”).
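
The analogy arithmetic itself is just vector addition plus a nearest-neighbor search over cosine similarity. A toy sketch with hand-picked 3-d vectors (real word2vec/GloVe vectors are learned and typically 100-300-dimensional, but the arithmetic is identical):

```python
import math

# Toy 3-d static vectors, hand-picked so the "gender" and "royalty"
# directions are roughly linear. These are illustrative, not learned.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.2],
    "woman": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# king - man + woman
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest neighbor among the remaining words (input word excluded,
# as in the standard analogy-evaluation protocol)
best = max((w for w in vectors if w != "king"),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```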

Contextual embeddings (ELMo, BERT, GPT): Each token gets a different vector depending on its sentence context. “bank” in “river bank” vs “savings bank” gets different embeddings. This is the key breakthrough.

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# "bank" in two contexts
sentences = [
    "I deposited money at the bank.",
    "We fished from the river bank."
]

for sent in sentences:
    inputs = tokenizer(sent, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: [1, seq_len, 768]
    hidden_states = outputs.last_hidden_state
    # Find "bank" token position
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    bank_idx = tokens.index('bank')
    bank_embedding = hidden_states[0, bank_idx, :]  # Different for each sentence!
    print(f"Context: {sent}")
    print(f"Bank embedding norm: {bank_embedding.norm():.4f}")
    print(f"First 5 dims: {bank_embedding[:5].numpy()}\n")

BERT Architecture and Pre-training

Q: Explain BERT’s pre-training objectives. Why do they enable transfer learning?

BERT uses two pre-training tasks on unlabeled text:

1. Masked Language Modeling (MLM):
Randomly mask 15% of tokens. Predict the masked tokens from surrounding context (both left and right — bidirectional). This forces the model to learn rich contextual representations.

The 15% masking is split:

  • 80% replaced with [MASK]
  • 10% replaced with a random token
  • 10% kept unchanged (prevents model from only learning [MASK]-specific representations)
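
The 80/10/10 corruption scheme can be sketched directly. This is a simplified per-sequence version (real implementations work on batches, and later BERT releases mask whole words rather than individual subwords):

```python
import random

def apply_mlm_masking(token_ids, vocab_size, mask_id, rng, mask_rate=0.15):
    """Sketch of BERT's MLM corruption: pick ~15% of positions as
    prediction targets, then 80% -> [MASK], 10% -> random token,
    10% -> left unchanged. Returns (corrupted_ids, target_positions)."""
    corrupted = list(token_ids)
    targets = []
    for i in range(len(corrupted)):
        if rng.random() < mask_rate:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id          # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random
            # else: 10% kept unchanged (still a prediction target)
    return corrupted, targets

rng = random.Random(0)
ids = list(range(100, 120))
corrupted, targets = apply_mlm_masking(ids, vocab_size=30522, mask_id=103, rng=rng)
print(targets)
```

The "kept unchanged" branch matters: the token is unmodified but the model is still trained to predict it, so representations stay useful for real (non-[MASK]) tokens at fine-tuning time.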

2. Next Sentence Prediction (NSP):
Given two sentences A and B, predict whether B actually follows A in the corpus. Enables tasks requiring sentence-pair reasoning (QA, NLI, paraphrase detection).

Note: Later models (RoBERTa, ALBERT) showed NSP is not always helpful and dropped it. RoBERTa’s dynamic masking + larger batches + more data improved BERT’s performance significantly without NSP.

Why this enables transfer learning: Pre-training on massive text (BookCorpus + Wikipedia) develops general linguistic knowledge — syntax, semantics, coreference, world facts. Fine-tuning on a small labeled dataset adapts just the top layers for the specific task.

Q: Walk me through fine-tuning BERT for text classification.

from transformers import BertForSequenceClassification, BertTokenizer
from torch.utils.data import DataLoader, Dataset
import torch
from torch.optim import AdamW

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.encodings = tokenizer(texts, truncation=True, padding=True,
                                   max_length=max_len, return_tensors='pt')
        self.labels = torch.tensor(labels)

    def __len__(self): return len(self.labels)

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.encodings.items()}, self.labels[idx]

# Load pre-trained model with classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Use differential learning rates:
# - Lower LR for pre-trained layers (preserve learned representations)
# - Higher LR for classification head (learn task-specific features)
optimizer = AdamW([
    {'params': model.bert.parameters(), 'lr': 2e-5},
    {'params': model.classifier.parameters(), 'lr': 1e-4}
])

# Toy data so the loop runs end-to-end (replace with a real dataset)
train_texts = ["great movie", "terrible plot"]
train_labels = [1, 0]
train_loader = DataLoader(SentimentDataset(train_texts, train_labels, tokenizer),
                          batch_size=2, shuffle=True)

# Training loop
model.train()
for batch_inputs, batch_labels in train_loader:
    optimizer.zero_grad()
    outputs = model(**batch_inputs, labels=batch_labels)
    outputs.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Prevent exploding gradients
    optimizer.step()

Q: How does BERT differ from GPT? When would you choose one over the other?

| Property | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Attention | Bidirectional (sees full context) | Causal (left-to-right only) |
| Pre-training | MLM + NSP | Next-token prediction (autoregressive) |
| Best for | Classification, NER, QA (understanding) | Text generation, summarization, code |
| Fine-tuning | Add task head, fine-tune encoder | Prompt engineering or SFT |

Use BERT when you need representations for classification or retrieval. Use GPT when you need generation. Use encoder-decoder (T5, BART) when you need both (translation, summarization).

Common NLP Interview Questions at Big Tech

Q: How would you handle out-of-vocabulary tokens in production?

  • Use subword tokenization (BPE/WordPiece) — true OOV is rare
  • For domain-specific corpora (legal, medical, code), retrain or extend the tokenizer vocabulary
  • For named entities, consider character-level fallback or BIO tagging

Q: What is the attention mask in transformer inputs and why is it needed?

Padding makes batches uniform-length, but padded positions should not influence computations. The attention mask is a binary tensor (1 = real token, 0 = padding) used to set padded positions' attention scores to a large negative value (effectively -∞) before the softmax, so their attention weights come out as effectively zero.
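
A minimal sketch of the masking step, in plain Python for one attention row (real implementations add the mask as a tensor of large negative values to the full score matrix):

```python
import math

def masked_softmax(scores, mask):
    """Apply a padding mask before softmax: masked positions get a large
    negative score so their probability collapses to ~0."""
    NEG_INF = -1e9
    masked = [s if m == 1 else NEG_INF for s, m in zip(scores, mask)]
    mx = max(masked)                       # subtract max for stability
    exps = [math.exp(s - mx) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# One query's attention scores over 4 key positions; last position is padding
scores = [2.0, 1.0, 0.5, 3.0]
mask = [1, 1, 1, 0]
weights = masked_softmax(scores, mask)
print([round(w, 4) for w in weights])  # padding position gets ~0 weight
```

Without the mask, the padded position's score of 3.0 would dominate the softmax; with it, the weight mass is redistributed over real tokens only.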

Q: How do you handle very long documents that exceed BERT’s 512-token limit?

  • Sliding window: split document into overlapping 512-token chunks; aggregate predictions (max, mean, or trained aggregator)
  • Hierarchical approach: encode sentences with BERT, then apply another model over sentence embeddings
  • Long-context models: Longformer (sparse attention, 4096 tokens), BigBird (random + sliding + global attention, 4096 tokens)
  • Retrieval-first: extract the relevant passage before encoding (same as RAG retrieval)
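
The sliding-window split is only a few lines. The `window` and `stride` values below are illustrative; a real implementation must also reserve room for the [CLS] and [SEP] special tokens within the 512 budget:

```python
def sliding_window_chunks(token_ids, window=512, stride=384):
    """Split a long token sequence into overlapping chunks.

    stride < window gives an overlap of (window - stride) tokens, so
    context near chunk boundaries appears in two chunks. Per-chunk
    predictions are aggregated afterwards (max, mean, or learned)."""
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this chunk already covers the end of the document
        start += stride
    return chunks

doc = list(range(1000))  # stand-in for 1000 token ids
chunks = sliding_window_chunks(doc, window=512, stride=384)
print([len(c) for c in chunks])  # [512, 512, 232]
```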

Q: What is Named Entity Recognition (NER) and how is it typically formulated?

NER labels each token with its entity type. Standard formulation: token classification with BIO tagging scheme.

  • B-PER: beginning of a person name
  • I-PER: inside a person name
  • O: outside any named entity

Fine-tune BERT with a linear classification head over each token’s hidden state. Add a CRF (Conditional Random Field) layer on top to enforce valid tag transitions (e.g., I-PER may only follow B-PER or I-PER, never B-LOC or O).
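
Decoding BIO tags back into entity spans is a common follow-up exercise. A sketch, using a lenient treatment of orphan I- tags (treating them as starting a new span, one common decoding choice):

```python
def bio_to_spans(tags):
    """Collect (entity_type, start, end) spans from a BIO tag sequence,
    with `end` exclusive. An I-X that does not continue a matching run
    is leniently treated as starting a new span."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != etype):
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = None, None
        # a matching I-X simply extends the current span
    return spans

tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tags))  # [('PER', 0, 2), ('LOC', 3, 4)]
```

A CRF layer makes the orphan-I case rare by construction; with plain per-token classification, the decoder has to pick a convention like the one above.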

Depth Levels

Junior: Explain the difference between word2vec and BERT. What is tokenization?

Senior: Walk through BERT fine-tuning. Discuss gradient flow through the transformer. Handle the 512-token limit problem.

Staff: Design an NLP system at scale (distributed inference, latency vs. accuracy trade-offs). Compare BERT, RoBERTa, DeBERTa. Discuss catastrophic forgetting in continual learning.

Related ML Topics

  • How Transformer Models Work — BERT is a transformer encoder; understanding self-attention and positional encoding explains why bidirectional context works for NLP understanding tasks
  • Embeddings and Vector Databases — BERT produces contextual embeddings stored in vector DBs for semantic search; understand the difference from static word2vec embeddings
  • What is RAG? — RAG retrieval uses BERT-family encoder embeddings for passage retrieval; the retriever is often fine-tuned with contrastive learning (DPR)
  • Fine-tuning LLMs vs Training from Scratch — BERT fine-tuning is the classic example of transfer learning; the same SFT concepts extend to GPT-scale models
  • ML System Design: Build a Spam Classifier — fine-tuned BERT is used for high-accuracy spam detection; understand the latency/accuracy trade-off vs. TF-IDF + LightGBM