Computer Vision Interview Questions: CNNs, Object Detection, and Transfer Learning

Computer vision is one of the most interview-tested areas of ML, especially at companies with physical products, autonomous systems, or large image/video platforms. This guide covers the core concepts — CNNs, object detection, transfer learning — and the specific questions you’ll encounter at FAANG and AI-focused companies.

What the Interviewer Is Testing

At the junior level: can you explain convolutions and why pooling works? At the senior level: can you design an object detection pipeline, explain anchor boxes vs anchor-free approaches, and diagnose why a model trained on one domain fails on another?

Convolutional Neural Networks (CNNs): The Foundation

How Convolution Works

A convolutional layer slides a small filter (kernel) across the input image, computing a dot product at each position. A 3×3 filter detects a specific local pattern (edge, color gradient) regardless of where it appears in the image. This translation equivariance is the key property: shifting the input shifts the feature map by the same amount, so the same filter detects a horizontal edge at position (10, 20) and at position (100, 200).

import torch
import torch.nn as nn

# One convolutional layer: 3 input channels (RGB), 64 output feature maps, 3×3 kernel
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# Input: (batch, 3, 224, 224) — 224×224 RGB image
# Output: (batch, 64, 224, 224) — 64 feature maps, same spatial size (padding=1)

# Parameters: 3 × 64 × 3 × 3 + 64 (bias) = 1,792 — tiny!
# Compare to a fully connected layer: 3×224×224 × 64 = 9.6M parameters
print(sum(p.numel() for p in conv.parameters()))  # 1,792

Parameter sharing is the other key property: all spatial positions use the same filter weights. This is why CNNs are so parameter-efficient relative to fully connected networks on image data.

Pooling

Max pooling takes the maximum value in each local region (typically 2×2), reducing spatial dimensions by 2×. This achieves two things: reduces computation in subsequent layers, and provides minor translation invariance (a feature detected 1 pixel to the left still passes through max pooling).
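Both effects are easy to demonstrate. The sketch below shows a 2×2 max pool halving the spatial dimensions, and a one-pixel shift that stays within a pooling window leaving the output unchanged (the tensor values are illustrative):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Halves spatial dimensions: (1, 64, 224, 224) -> (1, 64, 112, 112)
x = torch.randn(1, 64, 224, 224)
print(pool(x).shape)  # torch.Size([1, 64, 112, 112])

# Minor translation invariance: a peak at (0, 0) or (0, 1) lands in the
# same 2×2 window, so the pooled value is identical
feat = torch.zeros(1, 1, 4, 4)
feat[0, 0, 0, 0] = 9.0
shifted = torch.roll(feat, shifts=1, dims=3)  # peak moves one pixel right
print(pool(feat)[0, 0, 0, 0].item(), pool(shifted)[0, 0, 0, 0].item())  # 9.0 9.0
```

A shift that crosses a window boundary does change the output, which is why the invariance is only approximate and accumulates over multiple pooling layers.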

CNN Architecture Evolution

Architecture             | Year | Key Innovation                                                                          | Top-1 ImageNet
LeNet-5                  | 1998 | First successful CNN for digit recognition                                              | —
AlexNet                  | 2012 | Deep CNN on GPU, ReLU, dropout — ignited the deep learning era                          | 63.3%
VGG-16                   | 2014 | Deeper with 3×3 convolutions only; simple and widely used for transfer learning         | 71.5%
ResNet-50                | 2015 | Residual connections — enables training 100+ layer networks without vanishing gradients | 76.1%
EfficientNet-B7          | 2019 | Neural architecture search; compound scaling of depth/width/resolution                  | 84.3%
Vision Transformer (ViT) | 2020 | Transformer applied to image patches; surpasses CNNs at large scale                     | 88.5%+

ResNet is the most important architecture to know in interviews. The residual connection — adding the input directly to the output of a block — is why deep networks became trainable: gradients flow directly through the shortcut, bypassing the potentially vanishing nonlinear path.

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.relu(out + residual)  # skip connection
        return out

Object Detection: Localizing and Classifying Multiple Objects

Classification: is there a cat in this image? Detection: where exactly is the cat (bounding box), and what else is in the image?

Anchor-Based Detection (YOLO, R-CNN family)

Anchor boxes: Pre-defined bounding boxes of different aspect ratios and scales placed at each position in the feature map. The detector predicts offsets (Δx, Δy, Δw, Δh) from the anchor to the true box, plus a class probability. Anchors are manually designed for common object shapes in the dataset.
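The decoding step can be sketched with the standard R-CNN box parameterization: center offsets scaled by anchor size, and log-space width/height scaling (the function name and example values here are illustrative):

```python
import torch

def decode_boxes(anchors, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to anchors in (x1, y1, x2, y2)
    format, using the R-CNN parameterization."""
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    ax = anchors[:, 0] + 0.5 * aw
    ay = anchors[:, 1] + 0.5 * ah

    dx, dy, dw, dh = deltas.unbind(dim=1)
    cx = ax + dx * aw        # shift center, scaled by anchor size
    cy = ay + dy * ah
    w = aw * torch.exp(dw)   # log-space scaling keeps width/height positive
    h = ah * torch.exp(dh)

    return torch.stack(
        [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=1)

anchors = torch.tensor([[10., 10., 50., 30.]])  # one 40×20 anchor
deltas = torch.zeros(1, 4)                      # zero offsets -> box == anchor
print(decode_boxes(anchors, deltas))            # tensor([[10., 10., 50., 30.]])
```

The exp() on width/height is worth mentioning in an interview: it guarantees positive box sizes and makes the regression target scale-relative.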

R-CNN family:

  • R-CNN (2014): Selective search → 2,000 region proposals → CNN per region → SVM classification. Slow (47 seconds per image).
  • Fast R-CNN (2015): One forward pass over the entire image → RoI pooling extracts features for each proposal. Roughly 2s per image, most of it spent on selective-search proposal generation.
  • Faster R-CNN (2015): Region Proposal Network (RPN) replaces selective search — proposals generated by the neural network itself. 0.2s per image, end-to-end trainable.

YOLO (You Only Look Once, v1 2015 → YOLOv8 2023, YOLO11 2024): Single forward pass predicts all bounding boxes and classes simultaneously. Divides the image into an S×S grid; each cell predicts B boxes. Extremely fast (a few milliseconds per image on a modern GPU), somewhat less accurate than Faster R-CNN on small objects. YOLO is the go-to for real-time detection (autonomous vehicles, video streams).

Anchor-Free Detection (FCOS, CenterNet)

Modern detectors have moved away from anchors. FCOS predicts (l, t, r, b) — the distances from each feature-map point to the four sides of the object’s bounding box. No anchor design required, simpler training. DETR (2020) uses a transformer encoder-decoder to predict a fixed set of boxes directly — no anchors, no NMS.
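For intuition, decoding FCOS-style per-point predictions back to a box is plain arithmetic on the point’s coordinates (a sketch; the names and values are illustrative):

```python
import torch

def fcos_decode(points, ltrb):
    """Recover (x1, y1, x2, y2) boxes from per-point (l, t, r, b) distances."""
    x, y = points.unbind(dim=1)
    l, t, r, b = ltrb.unbind(dim=1)
    return torch.stack([x - l, y - t, x + r, y + b], dim=1)

points = torch.tensor([[32., 32.]])       # one feature-map location
ltrb = torch.tensor([[10., 5., 20., 15.]])
print(fcos_decode(points, ltrb))          # tensor([[22., 27., 52., 47.]])
```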

Key Detection Concepts for Interviews

IoU (Intersection over Union): Measures overlap between predicted and ground-truth boxes. IoU = area of intersection / area of union. Range 0–1. Threshold for “correct detection” typically 0.5.
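The computation is simple enough that interviewers sometimes ask for it on the spot; a plain-Python sketch for boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # clamp: no overlap -> 0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.333... (intersection 50, union 150)
```

The max(0, ...) clamp is the detail most candidates forget: without it, disjoint boxes produce a negative "intersection".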

Non-Maximum Suppression (NMS): When multiple predicted boxes overlap the same object, NMS keeps only the highest-confidence box and suppresses others with IoU above a threshold. Essential post-processing step.
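A greedy implementation makes the algorithm concrete (a sketch with an inline IoU helper so it is self-contained; values are illustrative):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS on boxes in (x1, y1, x2, y2) format; returns kept indices."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[i], boxes[best]) <= iou_threshold]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]  (box 1 overlaps box 0 with IoU ≈ 0.68)
```

In production you would use a batched, class-aware version (e.g. torchvision.ops.nms), but this greedy loop is exactly what interviewers expect you to whiteboard.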

mAP (mean Average Precision): The standard object detection metric. Compute precision-recall curve for each class, take area under curve (AP), average across all classes. COCO mAP uses IoU thresholds from 0.5 to 0.95.
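A sketch of the per-class AP computation, assuming detections are already sorted by descending confidence and matched to ground truth (TP/FP flags); the function name and numbers are illustrative:

```python
def average_precision(tp_flags, num_gt):
    """AP from detections sorted by descending confidence.
    tp_flags[i] is True if detection i matches a previously unmatched
    ground-truth box (IoU above threshold). AP = area under the
    precision-recall curve, accumulated with the rectangle rule."""
    ap, tp, fp = 0.0, 0, 0
    prev_recall = 0.0
    for flag in tp_flags:
        tp += flag
        fp += not flag
        recall = tp / num_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)  # width × height rectangle
        prev_recall = recall
    return ap

# 5 detections over 3 ground-truth objects, TP pattern T, T, F, T, F:
print(average_precision([True, True, False, True, False], num_gt=3))  # ≈ 0.917
```

COCO mAP averages this quantity over classes and over IoU thresholds 0.5, 0.55, …, 0.95 (also interpolating precision, which this sketch omits).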

Transfer Learning for Computer Vision

Never train a vision model from scratch unless you have millions of images. Use a pretrained backbone (ResNet, EfficientNet, ViT trained on ImageNet-21K or CLIP) and fine-tune:

import torchvision.models as models

# Load pretrained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace final classification layer for your task (e.g., 10 classes)
backbone.fc = nn.Linear(2048, 10)

# Fine-tuning strategy:
# 1. Freeze backbone initially, train only the new head (fast, few epochs)
# 2. Unfreeze backbone with low learning rate (1e-5) for full fine-tuning

for param in backbone.parameters():
    param.requires_grad = False           # freeze backbone

for param in backbone.fc.parameters():
    param.requires_grad = True            # only train the head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
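Step 2 of the strategy is usually implemented with per-parameter-group learning rates rather than a single global rate. A toy sketch (the model and module names here are stand-ins, not the ResNet from the snippet above):

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained backbone plus a freshly initialized head
model = nn.Sequential()
model.add_module("body", nn.Linear(8, 8))   # "pretrained" layers
model.add_module("head", nn.Linear(8, 2))   # new classifier head

optimizer = torch.optim.Adam([
    {"params": model.body.parameters(), "lr": 1e-5},  # tiny steps: avoid
                                                      # catastrophic forgetting
    {"params": model.head.parameters(), "lr": 1e-4},  # larger steps for the head
])
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.0001]
```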

Data Augmentation

Essential for CV — prevents overfitting, improves generalization to new viewpoints and lighting:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Advanced: MixUp (blend two images), CutMix (replace a patch with another image's patch)
# AugMix, RandAugment for state-of-the-art augmentation pipelines

Common Interview Questions and Strong Answers

Q: Why does pooling provide translation invariance?
A: Max pooling returns the same value whether a feature fires at position (i, j) or (i+1, j) within the pooling window. Multiple pooling layers accumulate this invariance so small position shifts don’t change the final feature representation.

Q: What problem does batch normalization solve?
A: The original motivation was internal covariate shift — as weights update, the distribution of each layer’s inputs changes, making training unstable. BatchNorm normalizes activations to zero mean and unit variance within each batch, stabilizing training, allowing much higher learning rates, and adding a mild regularization effect. A strong answer also notes that later analysis (Santurkar et al., 2018) attributes the benefit mainly to a smoother optimization landscape rather than to reduced covariate shift.

Q: When would you use a Vision Transformer vs a CNN?
A: ViTs outperform CNNs when training data is large (ImageNet-21K or JFT-300M scale) because attention captures global relationships that convolutions miss. CNNs have better inductive bias (local connectivity, translation equivariance) and outperform ViTs with limited data or when fine-tuning on small downstream datasets.

Q: How do you handle class imbalance in object detection?
A: Focal Loss (Lin et al., 2017 — RetinaNet) down-weights the loss on easy background examples, focusing training on hard foreground objects. Standard for single-stage detectors. Alternatively, oversample rare classes or use weighted sampling during training.
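A sketch of the α-balanced binary focal loss from the RetinaNet paper (the example logits are illustrative):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al., 2017). The (1 - p_t)^gamma factor
    shrinks the loss on well-classified (mostly background) examples."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# An easy negative (logit -4, confident background) contributes far less
# than a hard positive (logit -1, misclassified foreground):
logits = torch.tensor([-4.0, -1.0])
targets = torch.tensor([0.0, 1.0])
print(focal_loss(logits, targets))
```

With γ = 0 and α = 0.5 this reduces to (scaled) cross-entropy, which is a useful sanity check to mention.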

Interview Follow-ups

  • How would you build a real-time pedestrian detection system for a dashcam with a 10W power budget?
  • Your model achieves 95% accuracy in the lab but 70% in production. What do you investigate first?
  • How do you detect objects that are much smaller than the anchor box sizes you defined?
  • Explain how CLIP (Contrastive Language-Image Pretraining) works and what makes it useful for zero-shot classification.

Related ML Topics

  • How Transformer Models Work — Vision Transformers (ViT) apply the same self-attention mechanism to image patches; CLIP uses a dual-encoder transformer for image-text alignment
  • How Does Backpropagation Work? — gradient flow through convolutional layers follows the same chain rule; vanishing gradients in deep CNNs are solved by residual connections (ResNet)
  • Overfitting and Regularization — transfer learning from ImageNet acts as a regularizer; fine-tuning only the top layers prevents catastrophic forgetting on small datasets
  • Feature Selection and Dimensionality Reduction — conv feature maps as learned feature extractors; Global Average Pooling reduces spatial dimensions before the classifier head
  • Handling Imbalanced Datasets — object detection datasets are heavily imbalanced (background >> foreground); focal loss (RetinaNet) addresses this without hard negative mining