AI/ML Interview: Data Preprocessing — Feature Engineering, Missing Data, Encoding, Scaling, Imbalanced Classes

Data preprocessing and feature engineering, not model architecture, are widely credited with the bulk of ML project success. Understanding how to clean, transform, and prepare data is essential for ML interviews. Interviewers test this because it reveals practical experience: anyone can call model.fit(); knowing how to handle missing values, encode categoricals, and deal with class imbalance separates experienced engineers from tutorial followers.

Handling Missing Data

Missing data is ubiquitous in real datasets. Types: (1) MCAR (Missing Completely at Random) — the probability of missingness is independent of any variable. Safe to drop or impute. (2) MAR (Missing at Random) — missingness depends on observed variables but not on the missing value itself. Example: income is missing more often for younger respondents (age is observed, income is missing). Impute using correlated variables. (3) MNAR (Missing Not at Random) — missingness depends on the missing value itself. Example: people with very high income are less likely to report it. Difficult to handle — imputation may introduce bias. Strategies: (1) Drop rows — if only a small fraction of rows are affected and the data is MCAR. (2) Drop columns — if a column has more than 50% missing values, it may not provide enough signal. (3) Mean/median imputation — replace missing values with the column mean or median (continuous) or mode (categorical). Simple, fast. Downside: reduces variance, may bias the model. (4) Model-based imputation — use a model (KNN, regression, or iterative imputation like MICE) to predict missing values from other features. More accurate but complex and computationally expensive. (5) Missing indicator — add a binary column is_missing_{feature}. The model can learn patterns associated with missingness. Combine with imputation (impute the value AND add the indicator). Best practice: understand why the data is missing before choosing a strategy, and add missing indicators for features where missingness is informative.
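Strategies (3) and (5) combine naturally in one step. A minimal sketch using sklearn's SimpleImputer on a toy matrix (the data here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing values marked as np.nan.
X_train = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [np.nan, 30.0],
                    [4.0, 40.0]])

# Median imputation plus a binary is_missing indicator.
# add_indicator=True appends one indicator column for each feature
# that contained missing values in the training data.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X_train)

# Column 0 median is 2.0 (of 1, 2, 4); column 1 median is 30.0 (of 10, 30, 40).
print(X_imputed)
```

The result has four columns: the two imputed features followed by their indicator columns, so the model sees both the filled-in value and the fact that it was missing.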

Feature Encoding

ML models require numeric inputs. Categorical features must be encoded: (1) Label encoding — assign integers: red=0, blue=1, green=2. Problem: implies an ordering (green > blue > red) that does not exist. Use only for: ordinal categories (low=0, medium=1, high=2) and tree-based models (which split on individual values, not ordering). (2) One-hot encoding — create a binary column per category. red -> [1,0,0], blue -> [0,1,0]. No implied ordering. Problem: high-cardinality features (city with 10,000 values) create 10,000 columns (curse of dimensionality). Use for: low-cardinality features (a modest number of categories). (3) Target encoding — replace each category with the mean of the target for that category. Example: city=NYC -> 0.85 (85% conversion rate in NYC). Handles high cardinality without dimensionality explosion. Problem: data leakage if not done carefully (the target variable is used to create a feature). Use cross-validation-based target encoding or add regularization. (4) Embedding — for very high cardinality (user IDs, product IDs): learn a dense embedding vector per category during model training. Standard for deep learning (word embeddings, entity embeddings). (5) Frequency encoding — replace each category with its frequency in the dataset. Simple, handles high cardinality, no leakage.
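Techniques (2) and (5) can be sketched in a few lines of pandas (the toy column here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red", "red"]})

# One-hot encoding: one binary column per category, no implied ordering.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Frequency encoding: map each category to its relative frequency.
# red appears 3/6 times -> 0.5, blue 2/6, green 1/6.
freq = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freq)

print(one_hot.columns.tolist())
print(df["color_freq"].tolist())
```

Note that for frequency encoding in a real pipeline, `freq` should be computed on the training split only and then mapped onto the test split, for the same fit/transform reasons as scaling.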

Feature Scaling

Many models (linear regression, SVM, KNN, neural networks) are sensitive to feature scale. If feature A ranges [0, 1] and feature B ranges [0, 1000000], the model will disproportionately weight feature B. Scaling methods: (1) StandardScaler (z-score normalization): x_scaled = (x - mean) / std. Results in mean=0, std=1. Best for: features that are approximately normally distributed. Standard choice for most situations. (2) MinMaxScaler: x_scaled = (x - min) / (max - min). Results in range [0, 1]. Best for: features with a known bounded range and for neural networks, where sigmoid/tanh activations work best with inputs in a small bounded range. Sensitive to outliers (a single extreme value compresses all other values). (3) RobustScaler: x_scaled = (x - median) / IQR. Uses the median and interquartile range instead of mean and std, so it is robust to outliers. Best when outliers are present and should not dominate the scaling. (4) Log transform: x_scaled = log(1 + x). For skewed distributions (income, page views). Compresses the long tail, making the distribution more symmetric. Tree-based models (Random Forest, XGBoost, LightGBM) are NOT sensitive to feature scale — they split on individual feature values, so scaling does not help and may slightly hurt (by adding unnecessary preprocessing). Always scale for: linear models, SVM, KNN, neural networks. Skip scaling for: tree-based models (harmless but unnecessary). Critical: fit the scaler on training data only, then transform both train and test — fitting on the full dataset leaks test information into the preprocessing.
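The fit-on-train-only pattern looks like this with sklearn (synthetic data for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 1))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 1))

# Fit on the training split only, then transform both splits with the
# training statistics. Fitting on the full dataset would leak test
# information into the preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(), X_train_scaled.std())  # ~0.0, ~1.0
```

The test split ends up only approximately standardized, which is expected: it was transformed with the training mean and std, not its own.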

Handling Imbalanced Classes

In fraud detection (0.1% fraud), disease diagnosis (1% positive), and ad click prediction (2% CTR), the positive class is rare. A model predicting "negative" for everything achieves 99.9% accuracy but is useless. Strategies: (1) Resampling — oversampling: duplicate minority class examples (or generate synthetic examples with SMOTE — Synthetic Minority Over-sampling Technique). Undersampling: remove majority class examples. Combined: SMOTE + Tomek links (generate synthetic positives, remove borderline negatives). Apply ONLY to training data — never resample the validation/test set. (2) Class weights — assign higher weight to the minority class in the loss function. class_weight="balanced" in sklearn. The model is penalized more for misclassifying the minority class. Equivalent to oversampling without actually duplicating data. (3) Threshold adjustment — train with the default threshold (0.5), then adjust the classification threshold to optimize for the desired metric (F1, precision at a specific recall). Lowering the threshold increases recall (catches more positives) at the cost of precision. (4) Anomaly detection approach — for extreme imbalance (< 0.01% positive), train a one-class model on the majority class. Positives are detected as anomalies. Evaluation: always use AUC-PR (not AUC-ROC) and F1 (not accuracy) for imbalanced problems.
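Strategies (2) and (3) can be combined in a short sklearn sketch. The dataset is synthetic (roughly 5% positives), and the 0.3 threshold is an arbitrary illustration, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes minority-class errors more heavily,
# equivalent to oversampling without duplicating rows.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Threshold adjustment: lowering the threshold below 0.5 trades
# precision for recall on the positive class.
proba = clf.predict_proba(X_te)[:, 1]
n_pos = (y_te == 1).sum()
recall_at_05 = ((proba >= 0.5) & (y_te == 1)).sum() / n_pos
recall_at_03 = ((proba >= 0.3) & (y_te == 1)).sum() / n_pos
print(recall_at_05, recall_at_03)
```

Lowering the threshold can only keep or grow the set of predicted positives, so recall at 0.3 is never below recall at 0.5; precision typically moves the other way.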

Feature Engineering Best Practices

Feature engineering creates informative features from raw data: (1) Aggregation features — for user behavior: total purchases in last 30 days, average order value, days since last login, number of sessions per week. For time-series: rolling averages, rolling standard deviations, lag features (value at t-1, t-7). (2) Interaction features — product of two features: income * credit_score, or ratio: price / area (price per square foot). Domain knowledge guides which interactions are meaningful. (3) Date/time features — extract: day of week, hour of day, month, is_weekend, is_holiday, days_until_event. Cyclical encoding: encode hour as sin(2*pi*hour/24) and cos(2*pi*hour/24) so hour 23 is close to hour 0. (4) Text features — TF-IDF, word count, character count, sentiment score, presence of specific keywords. For deep learning: token embeddings. (5) Geospatial features — distance to nearest store, population density, neighborhood average income. Feature selection: after engineering many features, select the most informative: correlation analysis (remove highly correlated features), mutual information (feature relevance to target), and feature importance from a trained tree model. Remove features that add noise without signal. The best feature engineers combine domain knowledge (understanding what drives the business outcome) with data exploration (what patterns exist in the data).
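The cyclical encoding from item (3) is easy to verify numerically — after the sin/cos transform, hour 23 really does sit next to hour 0:

```python
import numpy as np

# Cyclical encoding of hour-of-day: hour 23 ends up close to hour 0,
# which a raw 0-23 integer feature cannot express.
hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

def encoded_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return np.hypot(hour_sin[h1] - hour_sin[h2], hour_cos[h1] - hour_cos[h2])

# Hour 23 is exactly as close to hour 0 as hour 0 is to hour 1,
# while hour 12 is maximally far from hour 0.
print(encoded_distance(23, 0), encoded_distance(0, 1), encoded_distance(0, 12))
```

The same trick applies to any cyclical feature: day of week (period 7), month (period 12), wind direction (period 360).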
