[Machine Learning] Interview Questions - Part 5

Authored by Tony Feng

Created on Oct 10th, 2022

Last Modified on Oct 13th, 2022

Intro

This post contains a collection of questions for machine learning interviews.


Questions

1) Feature Engineering Techniques for Machine Learning

  • Feature selection
    • Correlation matrix with heatmap
    • Statistical tests
  • Exploratory Data Analysis
  • Handling imbalanced data
    • Under-sampling majority class
    • Over-Sampling Minority class
    • Data augmentation
  • Imputation (handling missing values)
    • Impute numerical features with the mean or median
    • Impute categorical features with the mode
    • Drop rows that contain NA values
    • Drop entire features that contain NA values
  • Handling Outliers
    • Removal
    • Replacing values
  • Feature scaling
    • Standardization
    • Normalization
  • Encoding
    • Label Encoding (text -> number)
    • One-Hot Encoding (categorical -> numerical)
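
Below is a minimal sketch tying several of these techniques together (median/mode imputation, standardization, one-hot encoding) with scikit-learn. The column names and toy data are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0],
    "income": [50_000.0, 60_000.0, np.nan],
    "city":   ["NYC", "LA", np.nan],
})

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),         # fill NAs with the median
    ("scale", StandardScaler()),                          # standardization
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # fill NAs with the mode
    ("encode", OneHotEncoder(handle_unknown="ignore")),   # categorical -> numerical
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, ["age", "income"]),
    ("cat", categorical_pipe, ["city"]),
])

print(preprocess.fit_transform(df))
```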

2) Is logistic regression a concave or convex function? Does it have a global optimum?

The error function minimized in logistic regression is a convex function, so it has a global optimum.

The prediction of logistic regression is non-linear (due to the sigmoid transform). Plugging this prediction into a squared-error (MSE) loss yields a non-convex function with many local minima, which makes it difficult to find the global minimum. Using the negative log-likelihood from MLE as the cost function instead yields a convex objective.
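
Concretely, the convex objective is the negative log-likelihood (binary cross-entropy) over m training pairs (x_i, y_i):

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \log \sigma(\theta^\top x_i) + (1 - y_i)\log\big(1 - \sigma(\theta^\top x_i)\big) \Big],
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Each summand is convex in θ, so the sum is convex and gradient descent converges to the global optimum.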


3) KNN vs. K-means

KNN (k-nearest neighbors) is a supervised classification algorithm that labels a new data point according to the labels of its k closest data points. (The classes are already created.)

k-means clustering is an unsupervised algorithm that gathers and groups unlabeled data into k clusters. (It creates the classes.)
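
A minimal scikit-learn sketch contrasting the two on hypothetical toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [8, 8], [8, 9]], dtype=float)
y = np.array([0, 0, 1, 1])                  # labels exist -> supervised

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2.0, 2.0]]))            # labels a new point via its 3 nearest neighbors

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no labels -> unsupervised
print(km.labels_)                           # cluster assignments it discovered
```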

4) How to detect overfitting?

  • Empirical way: performance on the training set » performance on the testing set
  • Intuitive way: prefer simple models with fewer coefficients over complex models (Occam’s Razor principle)
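
A quick sketch of the empirical check, comparing training vs. testing accuracy with scikit-learn (the model, dataset, and the 0.1 gap threshold are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unpruned decision tree tends to memorize the training set
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(train_acc, test_acc)

if train_acc - test_acc > 0.1:   # large train/test gap -> likely overfitting
    print("Likely overfitting")
```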

5) What will happen if we increase the batch of SGD (e.g. from 16 to full data size)?

  • Each update is slower and more computationally expensive
  • The gradient estimate is more accurate and stable
  • Convergence is slower, since there are fewer updates per epoch
  • With less gradient noise, optimization might get trapped in local minima
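
A minimal NumPy sketch of mini-batch SGD on linear regression, where batch_size is the knob in question (16 vs. len(X) for full-batch); the data and hyperparameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=1000)

def sgd(batch_size, lr=0.1, epochs=20):
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(X))               # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the batch
            w -= lr * grad                          # larger batch -> fewer, less noisy steps
    return w

print(sgd(batch_size=16))       # many noisy updates per epoch
print(sgd(batch_size=len(X)))   # one exact full-batch update per epoch
```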

6) Activation Functions

  • Linear or Identity Activation Function
  • Non-linear Activation Function
    • Sigmoid: range (0, 1), differentiable, monotonic
    • Tanh: range (-1, 1), differentiable, monotonic
    • ReLU: range [0, inf), monotonic
    • Leaky ReLU: range (-inf, inf), monotonic

A neural network without a non-linear activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, enabling the network to learn and perform more complex tasks.
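
A minimal NumPy sketch of the non-linear activations listed above:

```python
import numpy as np

def sigmoid(x):                   # range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                      # range (-1, 1)
    return np.tanh(x)

def relu(x):                      # range [0, inf)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):    # range (-inf, inf); alpha is the usual small slope
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```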

Activation Functions in Neural Networks

7) Optimization Algorithms

Two common tools for improving gradient descent are an accumulated sum of past gradients (the first moment) and an accumulated sum of past squared gradients (the second moment).

The momentum method uses the first moment with a decay rate to gain speed.

AdaGrad uses the second moment with no decay to deal with sparse features; because the accumulator only grows, its effective learning rate keeps shrinking.

RMSProp adds a decay rate to the second moment, which keeps the effective learning rate from shrinking as aggressively as AdaGrad's.

Adam uses both the first and second moments (with bias correction) and is generally the best default choice.
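
A minimal NumPy sketch of the four update rules on a single parameter vector; the hyperparameters (lr, beta, eps) are the common defaults, not tuned values:

```python
import numpy as np

def momentum_step(w, g, m, lr=0.01, beta=0.9):
    m = beta * m + g                       # decayed sum of gradients (first moment)
    return w - lr * m, m

def adagrad_step(w, g, v, lr=0.01, eps=1e-8):
    v = v + g**2                           # undecayed sum of squared gradients
    return w - lr * g / (np.sqrt(v) + eps), v

def rmsprop_step(w, g, v, lr=0.01, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * g**2       # decayed second moment
    return w - lr * g / (np.sqrt(v) + eps), v

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g              # decayed first moment
    v = b2 * v + (1 - b2) * g**2           # decayed second moment
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)  # bias correction
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
g = np.array([0.1, -0.3])                  # a hypothetical gradient
w, m, v = adam_step(w, g, m, v, t=1)
print(w)
```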

8) Define LSTM

LSTM is short for Long Short-Term Memory. It is derived from the recurrent neural network (RNN) and is designed to address the long-term dependency problem by maintaining a cell state that controls what to remember and what to forget.

Generally, it has three key components:

  • Gates (forget, input/update & output)
  • Tanh(x) (values between -1 and 1)
  • Sigmoid(x) (values between 0 and 1)
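
For reference, the standard gate equations, where [h_{t-1}, x_t] denotes concatenation and ⊙ is element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$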

9) What is Confusion Matrix?

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data.

The rows represent the actual classes the outcomes should have been, while the columns represent the predictions we have made. Using this table, it is easy to see which predictions are wrong.
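
A minimal sketch with scikit-learn; the label vectors are hypothetical:

```python
from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 1]   # rows: actual classes
y_pred   = [1, 0, 0, 1, 0, 1]   # columns: predicted classes
print(confusion_matrix(y_actual, y_pred))
# [[2 0]
#  [1 3]]  -> one positive example was wrongly predicted as negative
```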

10) Explain SVM

Support Vector Machine (SVM) can be used for both regression and classification tasks, but it is most widely used for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points.

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes.

Support vectors are the data points closest to the hyperplane; they influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier.

SVM can handle highly non-linear problems using the kernel trick, which implicitly maps the input vectors to higher-dimensional feature spaces. The “trick” is that kernel methods represent the data only through a set of pairwise similarity comparisons between the original data observations x (with the original coordinates in the lower-dimensional space), instead of explicitly computing the coordinates of the data in the higher-dimensional space.
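
A minimal scikit-learn sketch of the kernel trick: an RBF kernel separates XOR-labeled toy points that no linear hyperplane can:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])             # XOR labels: not linearly separable

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)  # implicit high-dimensional mapping
print(clf.predict(X))                  # recovers [0 1 1 0]
print(len(clf.support_))               # here all four points act as support vectors
```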

