Computer vision is one of the most mature and widely deployed areas of machine learning — powering autonomous vehicles, medical imaging, content moderation, and augmented reality. Understanding CNN architectures, object detection, and transfer learning is essential for ML interviews at companies like Tesla, Google, Meta, and any startup working with visual data. This guide covers the key concepts with interview-level depth.
Convolutional Neural Networks (CNNs)
CNNs exploit the spatial structure of images through three key operations:

1. Convolution — a filter (kernel, typically 3×3 or 5×5) slides across the image, computing a dot product at each position. Each filter detects a specific feature: edges, textures, colors. Early layers detect simple features (edges, corners); deeper layers combine these into complex features (eyes, wheels, text). A layer with 64 filters produces 64 feature maps.

2. Pooling — reduces spatial dimensions. Max pooling (2×2, stride 2) takes the maximum value in each 2×2 region, halving width and height. This provides translation invariance (a feature is detected regardless of its exact position), fewer parameters downstream (smaller feature maps), and a larger receptive field (deeper layers “see” more of the input).

3. Fully connected layers — after the convolutions compress the image into a compact feature representation, fully connected layers map those features to output classes. For ImageNet classification: a 2048-dimensional feature vector -> 1000 class probabilities via softmax. Modern CNNs replace fully connected layers with global average pooling (averaging each feature map to a single value) — fewer parameters, less overfitting.
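The three stages above can be sketched in a few lines of PyTorch. This is a minimal illustrative network, not a benchmarked architecture: the layer sizes and class count are placeholders.

```python
import torch
import torch.nn as nn

# Minimal sketch of the three CNN stages: convolution, pooling, and a
# classification head after global average pooling. Sizes are illustrative.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # 64 filters -> 64 feature maps
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)       # halves H and W
        self.gap = nn.AdaptiveAvgPool2d(1)                      # global average pooling
        self.head = nn.Linear(64, num_classes)                  # features -> class logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(x))   # (N, 64, 32, 32) on a 32x32 input
        x = self.pool(x)               # (N, 64, 16, 16): spatial dims halved
        x = self.gap(x).flatten(1)     # (N, 64): one value per feature map
        return self.head(x)            # (N, num_classes)

x = torch.randn(1, 3, 32, 32)
print(TinyCNN()(x).shape)  # torch.Size([1, 10])
```

Note how global average pooling collapses each 16×16 feature map to a single value, so the head has only 64×10 weights regardless of input resolution.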
Key Architectures: ResNet, EfficientNet, Vision Transformer
ResNet (2015): introduced residual connections (skip connections). Each block computes F(x) + x instead of just F(x), so gradients flow directly through the skip connection, enabling training of very deep networks (50, 101, 152 layers). Without residuals, networks deeper than ~20 layers suffer from vanishing gradients and degradation. ResNet-50 achieves 76% top-1 accuracy on ImageNet with 25M parameters.

EfficientNet (2019): uses compound scaling — simultaneously scaling network width (more filters), depth (more layers), and input resolution (larger images) with a fixed ratio. EfficientNet-B7 achieves 84% top-1 on ImageNet while being 8.4x smaller than the best previous model, making the family efficient for deployment on mobile/edge devices.

Vision Transformer (ViT, 2020): applies the Transformer architecture to images. The image is split into 16×16 patches, each patch is linearly embedded (like a token), and self-attention across all patches captures global relationships (unlike CNNs, whose early layers have limited receptive fields). ViT-L achieves 88%+ on ImageNet when pre-trained on large datasets, but requires more data than CNNs because it lacks the inductive bias of spatial locality.

For interviews: know ResNet (the workhorse), understand why residual connections matter, and know that ViT is the modern alternative for large-scale pre-training.
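A residual block is short enough to write out. This is a minimal sketch of the F(x) + x idea, not the exact block from the ResNet paper (which also handles channel/stride changes on the skip path):

```python
import torch
import torch.nn as nn

# Minimal residual block: output is relu(F(x) + x). The identity path gives
# gradients a direct route through the network, which is what makes very
# deep stacks of these blocks trainable.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.f(x) + x)  # skip connection: F(x) + x

x = torch.randn(2, 64, 8, 8)
print(ResidualBlock(64)(x).shape)  # torch.Size([2, 64, 8, 8]): shape preserved
```

Because the block preserves shape, dozens of them can be stacked; a worst-case block can learn F(x) ≈ 0 and act as the identity, which is why depth no longer degrades accuracy.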
Object Detection: YOLO and Two-Stage Detectors
Object detection: locate AND classify multiple objects in an image. Output: bounding boxes (x, y, width, height) with class labels and confidence scores. Two approaches:

1. Two-stage detectors (R-CNN family) — Stage 1 proposes regions likely to contain objects (Region Proposal Network); Stage 2 classifies each proposed region and refines its bounding box. Faster R-CNN runs at ~5-15 FPS with high accuracy — better for applications where accuracy matters more than speed (medical imaging, satellite analysis).

2. One-stage detectors (YOLO, SSD) — process the entire image in a single forward pass. The image is divided into a grid, and each grid cell predicts bounding boxes and class probabilities. YOLOv8 runs at 30-100+ FPS: slightly lower accuracy than two-stage detectors but much faster — best for real-time applications (autonomous driving, video surveillance, AR).

YOLO architecture: the image passes through a CNN backbone (CSPDarknet), producing feature maps at multiple scales (to detect small, medium, and large objects). Each cell in a feature map predicts bounding box coordinates (x, y, w, h), an objectness score (is there an object?), and class probabilities. Non-Maximum Suppression (NMS) removes duplicate detections: among overlapping boxes for the same object, keep the highest-confidence box and remove boxes with IoU > 0.5.

For interviews: understand the speed/accuracy tradeoff between one-stage and two-stage detectors, and know what IoU (Intersection over Union) and NMS are.
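IoU and greedy NMS are common whiteboard questions, so here is a plain-Python sketch of both. The box coordinates and scores below are hypothetical values for illustration; boxes are (x1, y1, x2, y2).

```python
# Sketch of IoU and greedy NMS. Production code would use a vectorized
# version (e.g. torchvision.ops.nms), but the logic is the same.

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-confidence box, drop overlaps above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 (IoU ~0.68) and is suppressed
```

The same IoU function also underlies evaluation: a prediction counts as correct only if its IoU with a ground-truth box exceeds a threshold, and mAP averages precision over those thresholds.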
Image Segmentation
Segmentation classifies every pixel in the image. Three types:

1. Semantic segmentation — label each pixel with a class (road, car, pedestrian, sky). All cars share the same class; individual instances are not distinguished. Architecture: U-Net (encoder-decoder with skip connections). The encoder compresses the image to a feature representation, the decoder upsamples back to the original resolution, and skip connections preserve fine-grained spatial detail. Used in: medical imaging (segmenting tumors in MRI), autonomous driving (separating road from obstacles).

2. Instance segmentation — distinguish individual objects of the same class: each car gets a unique mask. Architecture: Mask R-CNN (extends Faster R-CNN with a segmentation head), which predicts a pixel-level mask for each detected object in addition to its bounding box and class. Used in: robotics (grasping individual objects), photo editing (selecting and modifying individual people).

3. Panoptic segmentation — combines semantic and instance segmentation: every pixel is labeled with a class AND an instance ID, handling both “stuff” (road, sky — no instances) and “things” (cars, people — individual instances).

For interviews: know the difference between semantic, instance, and panoptic segmentation; know U-Net for semantic and Mask R-CNN for instance segmentation.
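The U-Net encoder-decoder-skip pattern can be shown with a single downsample/upsample level. This toy sketch only illustrates the shape flow; a real U-Net stacks several such levels with many more channels:

```python
import torch
import torch.nn as nn

# Toy U-Net-style network with one encoder level, one decoder level, and
# one skip connection. Layer sizes are illustrative placeholders.
class TinySegNet(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.enc = nn.Conv2d(3, 16, 3, padding=1)
        self.down = nn.MaxPool2d(2)                            # encoder: halve H, W
        self.mid = nn.Conv2d(16, 16, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")  # decoder: restore H, W
        self.dec = nn.Conv2d(32, num_classes, 3, padding=1)    # 32 = upsampled + skip channels

    def forward(self, x):
        e = torch.relu(self.enc(x))             # (N, 16, H, W), saved for the skip
        m = torch.relu(self.mid(self.down(e)))  # (N, 16, H/2, W/2)
        u = self.up(m)                          # (N, 16, H, W)
        return self.dec(torch.cat([u, e], 1))   # per-pixel class logits: (N, C, H, W)

x = torch.randn(1, 3, 32, 32)
print(TinySegNet()(x).shape)  # torch.Size([1, 4, 32, 32]): one logit vector per pixel
```

The concatenation of the upsampled features with the saved encoder features is the skip connection: it reinjects the fine spatial detail that pooling discarded.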
Transfer Learning and Fine-Tuning
Training a CNN from scratch requires millions of labeled images and days of GPU time. Transfer learning: take a model pre-trained on a large dataset (ImageNet: 1.2M images across 1000 classes in the standard ILSVRC subset) and adapt it to your specific task. Two strategies:

1. Feature extraction — freeze the pre-trained CNN layers (do not update their weights), replace the final classification layer with a new one for your classes, and train only the new layer. The pre-trained layers serve as a fixed feature extractor. Works well when your dataset is small (< 10K images) and your domain is similar to ImageNet (natural images).

2. Fine-tuning — unfreeze some or all pre-trained layers and train the entire network on your data with a small learning rate (1e-5 to 1e-4, to avoid destroying the pre-trained features). Unfreeze gradually: start by training only the last few layers, then unfreeze more. Works well when your dataset is moderate (10K-100K images) or your domain differs from ImageNet (medical images, satellite imagery).

Pre-trained models: torchvision provides ResNet, EfficientNet, and ViT pre-trained on ImageNet; Hugging Face provides CLIP, DINOv2, and SAM (Segment Anything Model) pre-trained on massive datasets. For most computer vision tasks in production, start with a pre-trained model and fine-tune. Training from scratch is rarely justified unless you have a unique modality (radar, infrared) not covered by existing pre-trained models.