Responsible AI is increasingly tested in ML interviews as companies face regulatory pressure, public scrutiny, and genuine ethical concerns about AI systems. Understanding fairness metrics, bias detection, explainability techniques, and privacy-preserving ML is essential for senior ML roles. This guide covers the technical and practical aspects of building AI systems that are fair, interpretable, and safe.
Bias in ML Systems
Bias in ML: the model systematically produces unfair outcomes for certain groups. Sources: (1) Training data bias — the data reflects historical discrimination. A resume screening model trained on past hiring decisions learns that male candidates are preferred (because historical hiring favored men), and perpetuates that bias. (2) Representation bias — some groups are underrepresented in the training data. A facial recognition model trained mostly on light-skinned faces performs poorly on darker-skinned faces. (3) Label bias — the labels themselves are biased. Criminal recidivism prediction trained on arrest data learns policing patterns, not crime: arrest rates are higher in over-policed communities regardless of underlying crime rates. (4) Feature bias — using features correlated with protected attributes. ZIP code correlates with race, so using ZIP code for credit scoring can discriminate by race even without explicitly using race. (5) Measurement bias — the metric itself is biased. Using “time to resolution” as a customer service quality metric disadvantages customers with complex issues, who may be concentrated in certain demographics. Detection: analyze model performance across demographic subgroups. If accuracy is 95% for Group A but 75% for Group B, the model performs unequally and should be audited for bias. Compare false positive and false negative rates across groups. A model that flags 20% of Group B applications as fraudulent but only 5% of Group A has a disparate impact.
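The subgroup audit described above can be sketched in a few lines. This is a minimal illustration on synthetic labels, predictions, and a group attribute (all hypothetical), not a production fairness audit:

```python
# Sketch: detecting disparate performance across demographic subgroups.
# All data here is synthetic and illustrative.
import numpy as np

def subgroup_rates(y_true, y_pred, groups):
    """Per-group accuracy, false positive rate, and false negative rate."""
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        acc = float(np.mean(yt == yp))
        fpr = float(np.mean(yp[yt == 0] == 1)) if np.any(yt == 0) else float("nan")
        fnr = float(np.mean(yp[yt == 1] == 0)) if np.any(yt == 1) else float("nan")
        report[str(g)] = {"accuracy": acc, "fpr": fpr, "fnr": fnr}
    return report

# Toy example: group "B" gets flagged far more often than group "A".
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(subgroup_rates(y_true, y_pred, groups))
```

In practice the same loop runs over every metric you care about (TPR, FPR, PPV), and large gaps between groups trigger a deeper investigation of the data sources listed above.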
Fairness Metrics
Multiple fairness definitions exist — and they are mathematically incompatible (you cannot satisfy all simultaneously). Key metrics: (1) Demographic parity — the probability of a positive outcome is equal across groups. P(Y=1 | Group=A) = P(Y=1 | Group=B). “The same percentage of each group is approved.” (2) Equal opportunity — the true positive rate is equal across groups. P(Y_pred=1 | Y_true=1, Group=A) = P(Y_pred=1 | Y_true=1, Group=B). “Among qualified applicants, the same percentage is approved from each group.” (3) Equalized odds — both true positive rate AND false positive rate are equal across groups. Stricter than equal opportunity. (4) Predictive parity — the positive predictive value is equal across groups. P(Y_true=1 | Y_pred=1, Group=A) = P(Y_true=1 | Y_pred=1, Group=B). “Among approved applicants, the same percentage from each group actually succeeds.” Impossibility theorem (Chouldechova, 2017): if base rates differ between groups (e.g., different default rates for different demographics) and the classifier is imperfect, you cannot simultaneously achieve equal false positive rates AND equal false negative rates AND equal positive predictive values. You must choose which fairness criterion matters most for the application. For hiring: equal opportunity (qualified candidates should have equal chances). For criminal justice: equalized odds (minimize both wrongful convictions AND missed offenders equally across groups). In interviews: know that fairness metrics conflict and that the choice depends on the application and stakeholder values.
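The definitions above translate directly into code. The sketch below computes three of the metrics from their formulas on synthetic data (the labels, predictions, and group assignments are invented for illustration) and shows how two metrics can disagree on the same predictions:

```python
# Sketch: group-fairness metrics computed directly from their definitions.
import numpy as np

def demographic_parity(y_pred, groups):
    # P(Y_pred=1 | group): the approval rate per group
    return {str(g): float(np.mean(y_pred[groups == g])) for g in np.unique(groups)}

def equal_opportunity(y_true, y_pred, groups):
    # TPR per group: P(Y_pred=1 | Y_true=1, group)
    return {str(g): float(np.mean(y_pred[(groups == g) & (y_true == 1)]))
            for g in np.unique(groups)}

def predictive_parity(y_true, y_pred, groups):
    # PPV per group: P(Y_true=1 | Y_pred=1, group)
    return {str(g): float(np.mean(y_true[(groups == g) & (y_pred == 1)]))
            for g in np.unique(groups)}

# Toy data: approval rates are equal (demographic parity holds), yet
# qualified applicants in group A are approved only half as often as in
# group B (equal opportunity fails). The metrics really do conflict.
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(demographic_parity(y_pred, groups))         # {'A': 0.25, 'B': 0.25}
print(equal_opportunity(y_true, y_pred, groups))  # {'A': 0.5, 'B': 1.0}
```

The toy example makes the impossibility theorem concrete: fixing one metric (equal approval rates) does not fix, and can even mask, a gap in another.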
Model Explainability and Interpretability
Explainability: understanding WHY a model made a specific prediction. Critical for: regulatory compliance (GDPR “right to explanation”), debugging (why did the model reject this application?), and trust (users trust models they can understand). Techniques: (1) SHAP (SHapley Additive exPlanations) — for each prediction, compute the contribution of each feature. Based on game theory (Shapley values). Feature X contributed +$5000 to the predicted house price. Works for any model (model-agnostic). The standard for feature importance explanation. (2) LIME (Local Interpretable Model-agnostic Explanations) — approximate the complex model locally (around a specific prediction) with a simple, interpretable model (linear regression). The simple model explanation applies to that specific prediction. (3) Attention visualization — for Transformer models: visualize which input tokens the model attends to when making a prediction. Shows what the model “looks at.” Caveat: attention weights do not always reflect true feature importance. (4) Integrated Gradients — for neural networks: compute the gradient of the output with respect to each input feature, integrated along a path from a baseline to the input. Attributes the prediction to input features. (5) Inherently interpretable models — linear regression (coefficients are directly interpretable), decision trees (follow the decision path), and rule-based systems. Trade accuracy for interpretability. For interviews: know SHAP (the standard), understand the accuracy-interpretability tradeoff, and know that attention visualization has limitations.
Privacy-Preserving ML
ML models can leak training data: membership inference (determine if a specific person was in the training data), model inversion (reconstruct training inputs from model outputs), and training data extraction (GPT-2 memorized and regurgitated training text). Privacy techniques: (1) Differential privacy (DP) — add calibrated noise during training so the model provably cannot reveal information about any single training example. Formally: for any two datasets differing by one example, the model outputs are nearly indistinguishable (within an epsilon privacy budget). DP-SGD (Differentially Private Stochastic Gradient Descent): clip per-example gradients and add Gaussian noise. The privacy budget epsilon controls the noise-accuracy tradeoff. Lower epsilon = more privacy = more noise = lower model accuracy. Apple, Google, and the US Census use DP. (2) Federated learning — train the model across multiple devices without centralizing the data. Each device trains a local model on its data. Only the model updates (gradients) are sent to the server, not the raw data. The server aggregates updates and distributes the improved model. Google Gboard (keyboard prediction) uses federated learning: your typing data never leaves your phone. (3) Secure multi-party computation — multiple parties jointly train a model without revealing their individual data to each other. Computationally expensive but provably secure. Used in healthcare (hospitals collaborate on a model without sharing patient data). For interviews: know differential privacy conceptually (noise for privacy, epsilon controls the tradeoff) and federated learning (training without centralizing data).
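The DP-SGD recipe described above (clip per-example gradients, add Gaussian noise) fits in a short function. This is a pure-NumPy illustration of one update step with made-up gradients; real training would use a DP library such as Opacus, and the epsilon accounting is omitted here:

```python
# Sketch of one DP-SGD step: clip each per-example gradient to L2 norm
# at most clip_norm (bounding any single example's influence), sum,
# then add Gaussian noise scaled by clip_norm and a noise multiplier.
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=np.random.default_rng(0)):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    noisy_mean = (total + noise) / len(per_example_grads)
    return params - lr * noisy_mean

# Illustrative call: two synthetic per-example gradients (norms 5 and 10),
# both clipped to norm 1 before aggregation.
grads = [np.array([3.0, 4.0, 0.0]), np.array([0.0, 0.0, 10.0])]
params = dp_sgd_step(np.zeros(3), grads)
```

Raising noise_multiplier buys a smaller epsilon (more privacy) at the cost of noisier updates and lower accuracy, which is the tradeoff the article describes.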
AI Safety and Alignment
AI safety ensures AI systems behave as intended and do not cause harm. Key concerns: (1) Alignment — the AI system's goals match human intentions. RLHF (covered in our RL guide) is the primary alignment technique for LLMs. The challenge: specifying human values precisely enough for the AI to optimize. Reward hacking: the AI finds loopholes in the reward function that maximize reward without achieving the intended goal. (2) Robustness — the AI performs well on inputs it was not specifically trained on. Adversarial examples: small, imperceptible input perturbations cause dramatic output changes (a stop sign with a small sticker is classified as a speed limit sign). Adversarial training: include adversarial examples in training to improve robustness. (3) Hallucination mitigation — LLMs confidently generate false information. Techniques: RAG (ground responses in retrieved facts), confidence calibration (the model knows when it is uncertain), and factual consistency checking (verify generated claims against a knowledge base). (4) Red teaming — systematically probe the AI for failure modes: can it be manipulated into generating harmful content? Does it follow safety guidelines under adversarial prompting? Companies hire red teams to attack their AI systems before public release. For interviews: know RLHF for alignment, understand adversarial robustness, and recognize that AI safety is an active research area with no complete solutions.
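The adversarial-example idea in (2) is easy to demonstrate with the fast gradient sign method (FGSM, Goodfellow et al.): perturb each input dimension by epsilon in the direction that increases the loss. The sketch below applies it to a toy logistic-regression model; the weights and input are synthetic, not a real classifier:

```python
# Sketch: FGSM adversarial perturbation on a toy logistic-regression model.
import numpy as np

def fgsm(x, y, w, b, epsilon=0.1):
    # Logistic regression: p = sigmoid(w.x + b). The cross-entropy loss
    # gradient with respect to the input is (p - y) * w; stepping along
    # its sign increases the loss for the true label y.
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    grad_x = (p - y) * w
    return x + epsilon * np.sign(grad_x)

# Synthetic weights and input: the perturbed input lowers the model's
# confidence in the true label even though each coordinate moved by
# only epsilon = 0.1.
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.5, 0.2, -0.1])
x_adv = fgsm(x, y=1, w=w, b=b)
```

Adversarial training reuses exactly this machinery: generate x_adv during training and include it (with the correct label) in the loss, so the model learns to resist the perturbation.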