NLP powers search engines, chatbots, content moderation, machine translation, and document understanding. With the rise of LLMs, NLP has been transformed — but understanding the traditional NLP pipeline, task taxonomy, and evaluation metrics remains essential for ML interviews. This guide covers NLP tasks from classification to generation with both traditional and modern approaches.
Text Classification
Text classification assigns a label to a text input. Examples: spam detection (spam/not spam), sentiment analysis (positive/negative/neutral), topic classification (sports/politics/technology), and content moderation (safe/toxic/violent).

Traditional approach: (1) Feature extraction — TF-IDF (Term Frequency-Inverse Document Frequency) converts text to a numerical vector. TF measures word frequency in the document. IDF downweights common words (“the,” “is”). The TF-IDF vector is sparse and high-dimensional (vocabulary size). (2) Classifier — logistic regression, SVM, or random forest on the TF-IDF features. Fast to train and interpret. A good baseline.

Modern approach: (1) Fine-tune a pre-trained language model (BERT, RoBERTa, or a smaller model like DistilBERT). Add a classification head (linear layer) on top of the [CLS] token embedding. Fine-tune on labeled data with cross-entropy loss. (2) Zero-shot classification with LLMs — no training data needed. Prompt: “Classify the following text as positive, negative, or neutral: {text}.” The LLM generates the label. Accuracy depends on the model and prompt quality. Good for rapid prototyping, low-resource languages, and evolving label sets.

When to use which: TF-IDF + logistic regression for simple tasks with clear keywords (spam with “buy now,” “free”). BERT fine-tuning for nuanced classification requiring understanding (sarcasm detection, legal document classification). LLM zero-shot when no labeled data is available, or the label set changes frequently.
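The TF-IDF weighting in the traditional pipeline is easy to sketch directly. A minimal pure-Python version, using one common variant (raw-count TF normalized by document length, smoothed IDF); `tfidf_vectors` is an illustrative name, and a real system would use a library implementation such as scikit-learn's TfidfVectorizer:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (as dicts) for tokenized documents.

    TF = term count / document length; IDF = log((1+N)/(1+df)) + 1 (smoothed).
    """
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: (c / len(doc)) * idf[w] for w, c in tf.items()})
    return vectors

docs = [
    "buy now free offer".split(),
    "meeting notes attached".split(),
    "free prize click now".split(),
]
vecs = tfidf_vectors(docs)
# "free" and "now" occur in 2 of 3 documents, so IDF downweights them
# relative to words like "meeting" that occur in only one document.
```

The resulting dict-of-weights vectors are exactly the sparse, vocabulary-sized representation described above; a linear classifier on top of them is the classic baseline.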
Named Entity Recognition (NER)
NER identifies and classifies named entities in text: persons, organizations, locations, dates, monetary amounts, and domain-specific entities (drug names, gene names). Example: “Apple announced a $3 billion acquisition of Beats in 2014.” -> [Apple: ORG], [$3 billion: MONEY], [Beats: ORG], [2014: DATE].

Traditional approach: CRF (Conditional Random Field) with hand-crafted features: capitalization, POS tags, and gazetteers (lists of known entities).

Modern approach: (1) BERT + token classification — each token gets a label (B-PER, I-PER, O for beginning, inside, and outside of an entity). Fine-tune BERT with a linear classification layer per token. (2) LLM extraction — prompt: “Extract all company names from the following text: {text}.” The LLM outputs structured entities. Good for zero-shot but may miss edge cases or hallucinate entities.

Evaluation: (1) Exact-match F1 — the entity span and type must match exactly. Strict but standard. (2) Partial match — credit for overlapping spans (useful during development).

Challenges: nested entities (“New York University” — is “New York” a location?), ambiguity (“Apple” — company or fruit?), and domain adaptation (medical NER requires different entity types than financial NER).
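The B-I-O tagging scheme above can be decoded into entity spans with a short helper. A sketch, assuming a lenient treatment of stray I- tags; `bio_to_spans` is an illustrative name, and toolkits like seqeval implement stricter variants:

```python
def bio_to_spans(tokens, tags):
    """Decode B-I-O tags into (text, type, start, end) entity spans.

    A stray I- tag with no open entity starts a new one (a lenient
    decoding choice; strict evaluators treat it as an error).
    """
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last entity
        boundary = tag == "O" or tag.startswith("B-") or tag[2:] != etype
        if boundary and start is not None:
            spans.append((" ".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

tokens = ["Apple", "announced", "a", "deal", "with", "Beats", "in", "2014"]
tags   = ["B-ORG", "O", "O", "O", "O", "B-ORG", "O", "B-DATE"]
spans = bio_to_spans(tokens, tags)
# [('Apple', 'ORG', 0, 1), ('Beats', 'ORG', 5, 6), ('2014', 'DATE', 7, 8)]
```

Exact-match F1 then compares these (span, type) tuples between prediction and gold: a predicted span counts only if both the boundaries and the entity type match.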
Question Answering
QA systems answer questions based on a given context or knowledge base. Types:

(1) Extractive QA — the answer is a span extracted from a given passage. Input: question + context paragraph. Output: start and end positions of the answer span in the context. Architecture: BERT fine-tuned on SQuAD. Two output heads: one predicts the start position, another the end position. The span between them is the answer.

(2) Abstractive QA (generative) — the answer is generated, not extracted. It may paraphrase or synthesize information from multiple passages. Architecture: T5, BART, or an LLM with the context in the prompt. More flexible, but may hallucinate.

(3) Open-domain QA — the system must find the relevant information (it is not given a specific passage). Architecture: retriever + reader. The retriever (BM25 or dense retrieval with embeddings) finds relevant passages from a knowledge base; the reader (extractive or generative) answers from the retrieved passages. This is the RAG architecture applied to QA.

Evaluation: (1) Exact Match (EM) — the predicted answer exactly matches the ground truth. (2) F1 — token-level overlap between prediction and ground truth. More lenient than EM (partial credit for partially correct answers). SQuAD 2.0 includes unanswerable questions — the model must also learn to say “I do not know.”
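The EM and token-level F1 metrics can be sketched in a few lines. A simplified version of the SQuAD-style evaluation (lowercase, strip punctuation and English articles, then compare tokens):

```python
import re
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(truth))

def token_f1(pred, truth):
    """Harmonic mean of token-level precision and recall."""
    p, t = normalize(pred).split(), normalize(truth).split()
    overlap = sum((Counter(p) & Counter(t)).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(t)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(token_f1("in Paris France", "Paris"), 2))   # 0.5
```

Note how F1 gives partial credit: “in Paris France” shares one token with the gold answer “Paris” (precision 1/3, recall 1), while EM would score it 0.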
Text Summarization
Summarization condenses long text into a shorter version that preserves the key information.

(1) Extractive summarization — select the most important sentences from the original text. No new text is generated. Algorithm: score each sentence by importance (position, keyword overlap, TF-IDF), then select the top-K sentences. Simple and faithful (no hallucination — all text comes from the original).

(2) Abstractive summarization — generate new text that captures the key points. It may paraphrase, merge information, and produce sentences not in the original. Architecture: T5, BART, Pegasus (pre-trained specifically for summarization), or LLMs with a summarization prompt. More natural, but may hallucinate or misrepresent.

Evaluation: (1) ROUGE — measures n-gram overlap between the generated summary and reference summaries. ROUGE-1 (unigram), ROUGE-2 (bigram), ROUGE-L (longest common subsequence). Higher = more overlap with the reference. (2) BERTScore — semantic similarity between the generated and reference summaries using BERT embeddings. Better than ROUGE at capturing paraphrase quality. (3) Faithfulness metrics — does the summary contain claims not in the source? Critical for news and medical summarization.

Modern approach: LLMs produce high-quality abstractive summaries. The main challenge is faithfulness — ensuring the summary does not add information absent from the source. Techniques: constrained generation (only allow words/phrases from the source), and entailment-based verification (check each summary claim against the source).
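The extractive recipe (score sentences, take the top-K, keep original order) can be sketched in pure Python. The IDF-weighted average used for scoring here is a simple stand-in for the position and keyword features a production scorer would combine; `extractive_summary` is an illustrative name:

```python
import math
from collections import Counter

def extractive_summary(sentences, k=2):
    """Select the top-k sentences by average word informativeness (IDF-like),
    then return them in original document order (keeps the summary faithful:
    every output sentence is verbatim from the source)."""
    n = len(sentences)
    tokenized = [s.lower().split() for s in sentences]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    def score(toks):
        return sum(math.log(1 + n / df[w]) for w in toks) / len(toks) if toks else 0.0
    top = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

doc = [
    "The company reported record quarterly revenue of 5 billion dollars.",
    "Shares rose in after-hours trading.",
    "The CEO also discussed the weather during the call.",
    "Revenue growth was driven by strong cloud demand.",
]
summary = extractive_summary(doc, k=2)
```

Because every selected sentence is copied verbatim from the source, this approach cannot hallucinate — the trade-off against abstractive methods described above.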
NLP Evaluation and Practical Considerations
Common pitfalls in NLP evaluation:

(1) Data leakage — test examples appearing in the training set (especially for web-scraped data, where the same text may appear on multiple sites). Always deduplicate train/test splits.

(2) Label imbalance — in spam detection, 95% of emails may be non-spam, so accuracy is misleading (95% by always predicting non-spam). Use F1, precision, recall, or AUC-ROC instead.

(3) Domain shift — a model trained on product reviews may perform poorly on movie reviews. Always evaluate on in-domain test data.

(4) Annotation quality — NLP labels often require subjective judgment (is this comment toxic?). Inter-annotator agreement (Cohen’s kappa) measures label quality. Low agreement = noisy labels = an unreliable model.

Practical model selection for production: (1) Latency-sensitive (< 50 ms) — DistilBERT or smaller fine-tuned models, quantized (INT8) for faster inference. (2) Quality-sensitive (accuracy is everything) — a large fine-tuned model (RoBERTa-large, DeBERTa-v3) or an LLM. (3) Zero-shot (no training data) — an LLM with a well-crafted prompt. (4) Multilingual — mBERT, XLM-RoBERTa, or multilingual LLMs.

The trend: LLMs are replacing task-specific fine-tuned models for many NLP tasks, especially when labeled data is scarce, tasks change frequently, or a single model must handle multiple tasks. In production, always evaluate on a held-out set that matches the production data distribution, and monitor performance over time for data drift.
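Cohen's kappa for two annotators is straightforward to compute directly: observed agreement, corrected by the agreement expected from each annotator's label frequencies alone. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement two
    annotators would reach by chance given their label frequencies.
    """
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[label] * cb[label] for label in ca) / n ** 2
    return (observed - expected) / (1 - expected)

# Two annotators labeling the same six comments for toxicity:
a = ["toxic", "ok", "ok", "toxic", "ok", "ok"]
b = ["toxic", "ok", "toxic", "toxic", "ok", "ok"]
print(round(cohens_kappa(a, b), 2))  # 0.67
```

Here the annotators agree on 5 of 6 items (83%), but because chance agreement is 50% given their label distributions, kappa is only 0.67 — a more honest picture of label quality than raw agreement.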