Machine Learning in Data Science Interviews

Overview
Machine learning questions are often the toughest part of a data science interview, and for good reason. This post presents five example problems, offers some general comments on machine learning, and suggests topics to study on both the theory and application sides.
5 challenging problems
- Say we are using a Gaussian Mixture Model (GMM) for anomaly detection on fraudulent transactions, classifying incoming transactions into K classes. Write out the model setup mathematically and describe how to evaluate the posterior probabilities and the log likelihood. How can we determine whether a new transaction should be deemed fraudulent?
- Suppose you are running a linear discriminant analysis (LDA) model on some data with K classes. Describe mathematically how you would project centroids to some L < K-1 dimensional subspace.
- Describe the idea and mathematical formulation of kernel smoothing. How do you compute the kernel regression estimator?
- Say we are running a probabilistic linear regression which does a good job modeling the underlying relationship between some y and x. Now assume all inputs have some noise ε added, which is independent of the training data. What is the new objective function? How do you compute it?
- What is the loss function used in k-means clustering for k clusters and n sample points? Compute the update formula using 1) batch gradient descent, 2) stochastic gradient descent for the cluster mean for cluster k using a learning rate ε.
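As a starting point for the last problem, here is a minimal sketch in Python with NumPy. It assumes the usual k-means objective (the sum of squared distances from each point to its assigned cluster mean) and holds the assignments fixed during each gradient step. The function names `assign`, `kmeans_loss`, `batch_step`, and `sgd_step`, and the parameter `lr` standing in for the learning rate ε, are illustrative choices rather than any library's API, and this is one possible setup rather than the expected interview answer.

```python
import numpy as np

def assign(X, centers):
    """Return the index of the nearest center for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_loss(X, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

def batch_step(X, centers, labels, k, lr):
    """One batch gradient step on center k, with assignments held fixed.

    The gradient of the loss with respect to center k is
    -2 * sum over points in cluster k of (x_i - mu_k).
    """
    members = X[labels == k]
    grad = -2.0 * (members - centers[k]).sum(axis=0)
    new_centers = centers.copy()
    new_centers[k] = centers[k] - lr * grad
    return new_centers

def sgd_step(x, centers, lr):
    """One stochastic step: move the center nearest to x toward x."""
    k = int(((centers - x) ** 2).sum(axis=1).argmin())
    new_centers = centers.copy()
    new_centers[k] = centers[k] + 2.0 * lr * (x - centers[k])
    return new_centers, k

# Tiny hypothetical dataset: two well-separated groups of two points each.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
centers = np.array([[0.0, 1.0], [9.0, 9.0]])
labels = assign(X, centers)

print("loss before:", kmeans_loss(X, centers, labels))
updated = batch_step(X, centers, labels, k=0, lr=0.25)
print("loss after one batch step on cluster 0:", kmeans_loss(X, updated, labels))
```

With a learning rate of 1/(2 n_k), where n_k is the number of points assigned to cluster k, the batch step lands exactly on the mean of that cluster, which recovers the standard k-means update. The stochastic version instead nudges a single center toward one sample at a time.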
General comments
Machine learning is not applicable to every data science role, since data science is a broad field, but for the roles where it is relevant, it is an important area of study with both depth and breadth. Regardless of role, though, I think any data scientist or aspiring data scientist benefits from studying machine learning, for three main reasons:
- It sits at the intersection of mathematics, computer science, and statistics. Each is a rich field of study, and together they are the building blocks for a lifetime of solid knowledge.
- It offers a mix of theory and application: there is an endless number of things to be curious about, both in how models work and in how they are used in practice.
- Just as software has disrupted the business landscape worldwide, machine learning has already begun to do the same. It will likely continue to help existing businesses improve tremendously and lead to an amazing array of future businesses.
Studying machine learning
There are two main areas of focus in machine learning: theory and application. Theory covers the mathematical underpinnings of models: why and how they work the way they do. Application covers the real-world use cases in which technology at scale leverages those models. Both are equally important to study and become well versed in.
For theory, there is a plethora of textbooks and online resources at your disposal. Understanding the math behind the various algorithms and frameworks is essential for seeing patterns in how models are tuned and how they operate. It leads to a better understanding of applications and to familiarity with how models can be adapted. Being comfortable with the technical details also helps with the theoretical side of machine learning interview questions.
For application, there is an endless number of practical projects to explore, assuming you can get sufficient data of interest (Kaggle and the Registry of Open Data on AWS provide many datasets). If a project is not feasible, it is still worth learning how companies use machine learning in a substantial capacity (for example, read about how Netflix uses recommender systems). This will reinforce your theoretical understanding and give you insight into how some of the world's premier businesses use machine learning to generate large amounts of value at scale.
Topics to study
The list below is by no means exhaustive, but it gives a bird's-eye view of topics worth looking into:
- General understanding of models: bias-variance tradeoff, assessing model fit, sampling, cross-validation, etc.
- Linear models: regression (simple and multiple), model selection, shrinkage methods
- Classification: linear methods (LDA, logistic regression, etc.) and nonlinear methods (SVMs, decision trees, etc.)
- Neural networks: back-propagation, CNNs, RNNs, LSTMs, etc.
- Unsupervised learning: clustering (K-means), PCA, factor analysis, etc.
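To make a couple of these topics concrete, here is a small sketch that combines a shrinkage method (ridge regression) with k-fold cross-validation for model selection, using only NumPy. The data is synthetic, and the names `ridge_fit` and `kfold_mse`, the candidate penalties in `lambdas`, and the choice of five folds are all assumptions made for illustration. In practice a library such as scikit-learn would usually handle this.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y, no intercept."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_mse(X, y, lam, n_folds=5, seed=0):
    """Average held-out mean squared error across n_folds folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errors))

# Synthetic regression problem with a known (hypothetical) weight vector.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=60)

# Score each candidate penalty by cross-validated error and keep the best.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: kfold_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
print("cross-validated MSE per penalty:", scores)
print("selected penalty:", best_lam)
```

Larger penalties shrink the weights toward zero, trading variance for bias. Cross-validation gives an estimate of where that tradeoff is best for the data at hand, which ties the first two topics in the list together.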
When learning about each topic, be sure to immerse yourself in the application side of things: many interesting companies use the methods mentioned above for core parts of their businesses. For example, many businesses rely on fraud detection. Which models can be used there? What about recommendation engines? Serving ads? Customer lifetime value analysis?
Note: many other areas (reinforcement learning, computer vision, game theory topics, etc.) are left out of the above list for the sake of brevity.
Conclusion
Hopefully, this article gave you a couple of useful pointers for interviews. If you are interested in more questions (and answers), make sure to subscribe!
Are you interviewing for data science jobs or are you trying to hone your data science skills? Check out our newsletter, Data Science Prep, to get data science interview questions straight to your inbox.