Feature selection and dimensionality reduction are how you fight the curse of dimensionality — the phenomenon where models trained on high-dimensional data overfit, train slowly, and become impossible to interpret. This post covers the full toolkit: filter methods, wrapper methods, embedded methods, and PCA.
What the Interviewer Is Testing
At the junior level: do you know PCA? At the senior level: can you choose between filter/wrapper/embedded methods given a constraint (interpretability, compute budget, model type)? Can you explain why PCA breaks in certain scenarios and what to use instead?
Why Reduce Features?
- Overfitting: More features → more parameters → more chance to memorize noise. 1,000 features and 500 samples is a disaster.
- Training time: Many algorithms scale O(p²) or O(p³) with feature count. Halving features can 4× or 8× training speed.
- Interpretability: A model with 5 features is explainable. A model with 500 is not.
- Collinearity: Highly correlated features don’t add information but add noise and instability to linear models.
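To make the collinearity point concrete, here is a small sketch on synthetic data (illustrative only) showing how a near-duplicate feature inflates the condition number of the design matrix, which is exactly what makes linear-model coefficients unstable:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # 5 independent features

# Append a near-copy of feature 0 (tiny noise added)
X_dup = np.column_stack([X, X[:, 0] + 1e-3 * rng.normal(size=500)])

# The condition number of X^T X governs how much noise gets
# amplified into the fitted coefficients
cond_indep = np.linalg.cond(X.T @ X)
cond_collinear = np.linalg.cond(X_dup.T @ X_dup)
print(f"independent: {cond_indep:.1f}, with near-duplicate: {cond_collinear:.1f}")
```

The near-duplicate column blows the condition number up by several orders of magnitude, so tiny changes in the training data can flip the signs and magnitudes of the two correlated coefficients.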
Three Families of Feature Selection
1. Filter Methods (Model-Free)
Evaluate each feature’s relevance to the target using a statistical measure, independently of any model. Fast, but miss feature interactions.
Correlation (Pearson): For regression targets. Drop features whose |r| with the target is below a threshold, and drop one of any feature pair whose mutual |r| exceeds ~0.95 (collinearity).
```python
import pandas as pd
import numpy as np

corr_matrix = df.corr().abs()

# Features most correlated with the target
target_corr = corr_matrix['target'].sort_values(ascending=False)
print(target_corr)

# Remove collinear features (scan the upper triangle)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
df_reduced = df.drop(columns=to_drop)
```
Mutual Information: Captures non-linear relationships between feature and target without assuming linearity. Works for both classification (mutual_info_classif) and regression (mutual_info_regression).
```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)
selected_features = X.columns[selector.get_support()]
```
Chi-squared (χ²): For classification with non-negative features (count or frequency data). Tests independence between feature and target.
```python
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=20)
X_selected = selector.fit_transform(X_train_nonneg, y_train)
```
Variance threshold: Remove features with near-zero variance; they carry almost no information. A feature that takes the same value for 99% of samples is nearly useless.
```python
from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=0.01)
X_reduced = sel.fit_transform(X)
```
2. Wrapper Methods (Model-Dependent)
Train a model on different feature subsets and keep the subset with the best validation score. Accounts for feature interactions, but expensive: O(2^p) subsets in the brute-force case.
Recursive Feature Elimination (RFE): Train the model, rank features by importance (coefficient magnitude for linear models, feature_importances_ for trees), remove the weakest features (`step` at a time), and repeat until the target number of features is reached.
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=20, step=10)
rfe.fit(X_train, y_train)
selected_features = X.columns[rfe.support_]
print(f"Selected: {list(selected_features)}")
```
RFECV: RFE with cross-validation to automatically find the optimal number of features.
```python
from sklearn.feature_selection import RFECV

rfecv = RFECV(estimator, cv=5, scoring='roc_auc', min_features_to_select=5)
rfecv.fit(X_train, y_train)
print(f"Optimal features: {rfecv.n_features_}")
```
3. Embedded Methods (Built into Model Training)
Feature selection happens as part of model training. Best of both worlds: accounts for interactions, efficient.
L1 (Lasso) regularization: The L1 penalty drives many feature weights to exactly zero; the non-zero features are the selected set. Adjust the regularization strength (alpha in sklearn's Lasso, or its inverse C in L1-penalized LogisticRegression) to control how many features survive.
```python
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

# LassoCV finds the optimal alpha via cross-validation
lasso = LassoCV(cv=5).fit(X_train, y_train)
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X_train)
print(f"Features remaining: {X_selected.shape[1]}")
```
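LassoCV targets regression. For classification, the same L1 selection idea works with an L1-penalized LogisticRegression. A minimal sketch on synthetic data (the dataset and parameter values are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=42)

# Smaller C = stronger L1 penalty = fewer surviving features
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
selector = SelectFromModel(clf, prefit=True)
X_selected = selector.transform(X)
print(f"Features remaining: {X_selected.shape[1]} of {X.shape[1]}")
```

Note that only some solvers (liblinear, saga) support the L1 penalty.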
Tree-based feature importance: Random forests and gradient boosting report feature importance (mean decrease in impurity, or permutation importance for more reliable estimates).
```python
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.nlargest(20).plot(kind='barh')
plt.title('Top 20 Features by Random Forest Importance')

# Prefer permutation importance — more reliable, accounts for collinearity
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=42)
perm_importances = pd.Series(result.importances_mean, index=X.columns)
```
Dimensionality Reduction: PCA
Feature selection removes features. Dimensionality reduction creates new features (components) that are linear combinations of the originals. PCA is the workhorse.
How PCA works: Find the directions (principal components) of maximum variance in the data. Project all samples onto these components. The first component explains the most variance, the second the next most (orthogonal to first), and so on.
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# CRITICAL: scale before PCA — PCA is variance-based, dominated by large-scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

pca = PCA()
pca.fit(X_scaled)

# Cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Components for 95% variance: {n_components_95}")

# Fit with chosen n_components
pca = PCA(n_components=n_components_95)
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced from {X_train.shape[1]} to {X_pca.shape[1]} dimensions")
```
When to use PCA:
- High-dimensional data where many features are correlated (images, embeddings, sensor data)
- When interpretability of individual features is not required
- As a preprocessing step before k-means clustering or linear models
When NOT to use PCA:
- When you need to explain which original features matter (stakeholders asking “why did the model flag this user?”)
- For tree-based models — they don’t benefit from PCA; their built-in feature importance is better
- When features are on very different scales AND you cannot standardize (e.g., binary + continuous + count features mixed — PCA will distort)
- For sparse data (text, one-hot encoded categoricals) — use TruncatedSVD (LSA) instead, which doesn’t center the data and preserves sparsity
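For sparse inputs, the TruncatedSVD replacement looks like this (a sketch on a few toy documents; the corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat", "the dog barked", "cats and dogs", "quiet evening walk"]
X_sparse = TfidfVectorizer().fit_transform(docs)  # scipy sparse matrix

# TruncatedSVD accepts sparse input directly: no centering, sparsity preserved
svd = TruncatedSVD(n_components=2, random_state=42)
X_lsa = svd.fit_transform(X_sparse)
print(X_lsa.shape)  # (4, 2)
```

On real text this is the classic LSA pipeline: TF-IDF followed by TruncatedSVD with a few hundred components.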
t-SNE and UMAP: For Visualization Only
t-SNE and UMAP are non-linear dimensionality reduction methods that preserve local structure — great for visualizing high-dimensional clusters in 2D or 3D. Do not use them as input features for downstream models. They are non-parametric (cannot transform new data without refitting), non-deterministic (different runs give different results), and their distances are not meaningful across clusters.
```python
from sklearn.manifold import TSNE

X_2d = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10')
plt.title('t-SNE visualization — for EDA only, not modeling')
```
Choosing the Right Method
| Scenario | Method |
|---|---|
| Quick baseline, any model | Filter (mutual info or variance threshold) |
| Linear model, interpretability needed | Lasso (L1) embedded selection |
| Tree model | Permutation importance + threshold |
| Small feature set, compute available | RFECV (wrapper) |
| Correlated continuous features, no interpretability needed | PCA |
| Sparse high-dim (text, one-hot) | TruncatedSVD or L1 |
| Visualization only | t-SNE or UMAP |
Common Interview Mistakes
- Applying PCA before train/test split (leakage — test set statistics influence the PCA)
- Forgetting to scale before PCA (unscaled PCA is dominated by high-variance features)
- Using t-SNE components as model features
- Applying feature selection to the entire dataset before cross-validation (leakage — use Pipeline)
- Choosing PCA when stakeholders need feature explanations
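To make the leakage points concrete: wrap any selection or PCA step in a Pipeline so that each CV fold fits it on its own training split only. A minimal sketch on synthetic data (illustrative dataset and parameters):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=40, random_state=42)

# The selector is refit inside each fold, so no test-fold statistics leak in
pipe = Pipeline([
    ('select', SelectKBest(mutual_info_classif, k=10)),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(f"Mean AUC: {scores.mean():.3f}")
```

Running SelectKBest once on the full dataset and then cross-validating the classifier would report an optimistically biased score.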
Related ML Topics
- Overfitting and Regularization — L1 (Lasso) is simultaneously a regularizer and a feature selector
- Cross-Validation Strategies — always apply feature selection inside CV folds to prevent leakage
- Bias-Variance Tradeoff — reducing features reduces variance; too few features increases bias
- Classification Metrics — feature selection changes which metric the model optimizes; choose your metric first