[Machine Learning] Interview Questions - Part 2

Authored by Tony Feng

Created on Oct 7th, 2022

Last Modified on Oct 7th, 2022

Intro

The post contains a collection of questions for machine learning interviews.


Questions

1) Why do ensembles typically have higher scores than individual models?

An ensemble combines multiple models to produce a single prediction. The key idea is that the errors of one model are compensated by the correct predictions of the other models, so the ensemble's score is higher than that of any individual model.

We need diverse models to create an ensemble. Diversity can be achieved by:

  • Using different ML algorithms.
  • Using different subsets of the data for training. (Bagging)
  • Giving a different weight to each of the samples of the training set. (Boosting)

Engineers need to find a balance between execution time and accuracy.
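The error-compensation idea can be sketched in a few lines of Python. The three classifiers and their predictions below are hypothetical: each model is wrong on a different sample, and a majority vote corrects all three individual errors.

```python
# Toy illustration: three weak classifiers, each wrong on a different
# sample, combined by majority vote (all predictions are made up).
y_true  = [1, 0, 1, 1, 0, 1]
preds_a = [1, 0, 1, 1, 0, 0]  # wrong on sample 5
preds_b = [1, 0, 0, 1, 0, 1]  # wrong on sample 2
preds_c = [0, 0, 1, 1, 0, 1]  # wrong on sample 0

def accuracy(pred, true):
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def majority_vote(*model_preds):
    # For each sample, predict the class chosen by most models.
    return [round(sum(votes) / len(votes)) for votes in zip(*model_preds)]

ensemble = majority_vote(preds_a, preds_b, preds_c)
```

Each individual model scores 5/6, while the ensemble recovers the full label vector because no sample is misclassified by a majority of the models.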

2) What is an imbalanced dataset? Can you list some ways to deal with it?

An imbalanced dataset is one whose classes appear in very unequal proportions. There are several options for dealing with imbalanced datasets:

  • Resampling the dataset: undersampling the majority classes and oversampling the minority classes
  • Applying data augmentation to the minority classes
  • Using (stratified) cross-validation
  • Choosing a suitable algorithm, e.g., random forest
  • Using appropriate metrics, e.g., F-score or the confusion matrix, which describe model performance on an imbalanced dataset better than plain accuracy.
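Random oversampling, the simplest resampling option, can be sketched in pure Python. The 90/10 class split below is illustrative.

```python
import random

random.seed(0)

# Hypothetical imbalanced dataset: 90 negatives (label 0), 10 positives (label 1).
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]

def oversample_minority(samples, minority_label):
    """Duplicate random minority samples until the classes are balanced."""
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
    return samples + extra

balanced = oversample_minority(data, minority_label=1)
```

In practice, duplicating minority samples risks overfitting to them; data augmentation or synthetic sampling is often preferred for that reason.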

3) What is data augmentation? Can you give some examples?

Data augmentation is a technique for generating new data by modifying existing data in such a way that the target is not changed, or is changed in a known way. For images, common examples include flipping, rotation, cropping, scaling, and adding noise; for text, synonym replacement and back-translation.
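As a toy sketch, flipping an image horizontally produces a new training example whose class label (e.g., "cat") is unchanged. The nested lists below stand in for pixel arrays; the functions and values are illustrative.

```python
# A toy 2x3 "image" as nested lists standing in for a pixel array.
image = [[1, 2, 3],
         [4, 5, 6]]

def hflip(img):
    """Mirror each row left-to-right; the class label stays the same."""
    return [list(reversed(row)) for row in img]

def add_noise(img, eps=0.1):
    """Perturb pixel values slightly (a fixed offset here, for determinism)."""
    return [[p + eps for p in row] for row in img]

flipped = hflip(image)      # [[3, 2, 1], [6, 5, 4]]
noisy = add_noise(image)    # each pixel shifted by 0.1
```

Both transformed images, paired with the original label, count as new training data.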

4) Explain how the AUC - ROC curve works

The ROC, or Receiver Operating Characteristic, curve is a graphical representation of the trade-off between the true positive rate and the false positive rate at various classification thresholds.

AUC stands for Area Under the ROC Curve. It measures the two-dimensional area under the entire ROC curve and ranges from 0 to 1; an excellent model has an AUC near 1, indicating a good measure of separability. AUC is scale-invariant: it measures how well predictions are ranked rather than their absolute values. AUC is also classification-threshold-invariant: it measures the quality of the model's predictions irrespective of which classification threshold is chosen.

AUC is not preferable when we need well-calibrated probability outputs. Further, AUC is not a useful metric when there are wide disparities in the cost of false negatives vs. false positives and we need to minimize one type of classification error. For example, in email spam detection, you likely want to prioritize minimizing false positives (legitimate mail marked as spam) even if that results in a significant increase in false negatives.
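AUC can be computed directly as the probability that a randomly chosen positive is ranked above a randomly chosen negative. The small sketch below (scores are made up) also demonstrates the scale-invariance: a monotone transform of the scores leaves the AUC unchanged.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
a1 = auc(scores, labels)                        # 0.75
a2 = auc([10 * s + 3 for s in scores], labels)  # same ranking, same AUC
```

This pairwise definition is O(P·N) and only illustrative; real implementations sort the scores once instead of enumerating all pairs.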

5) What is Precision/Recall/F1-score?

Precision (positive predictive value) is the fraction of relevant instances among the retrieved instances.

$$ Precision = \frac{TP}{TP + FP} $$

Recall (sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.

$$ Recall = \frac{TP}{TP + FN} $$

F1-score is the harmonic mean of precision and recall. It takes both false positives and false negatives into account.

$$ F1 = \frac{2PR}{P + R} $$
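The three formulas above can be checked with a short sketch. The confusion-matrix counts (TP=8, FP=2, FN=4) are made up for illustration.

```python
def precision(tp, fp):
    # Of everything predicted positive, how much really is positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything actually positive, how much did we retrieve?
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts.
p = precision(8, 2)   # 0.8
r = recall(8, 4)      # 8/12 = 2/3
score = f1(p, r)      # 8/11
```

Note that the harmonic mean punishes imbalance: if either precision or recall is near zero, F1 is near zero regardless of the other value.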

6) Define Learning Rate

The learning rate is a hyper-parameter used to govern the pace at which an algorithm updates or learns the values of a parameter estimate. In other words, it controls how much we adjust the weights of our network with respect to the loss gradient.

When the training loss fluctuates, we may reduce the learning rate; otherwise, SGD jumps too far and misses the region near a local minimum.
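A minimal sketch on the one-dimensional function f(x) = x² illustrates the effect (the learning rates and step count are arbitrary): a small learning rate converges toward the minimum at 0, while a too-large one overshoots and diverges.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 with fixed-step gradient descent."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # the gradient of x**2 is 2x
    return x

small = gradient_descent(lr=0.1)   # shrinks by a factor 0.8 per step
large = gradient_descent(lr=1.1)   # each step overshoots; |x| grows by 1.2
```

With lr=0.1 the iterate contracts toward 0; with lr=1.1 each update flips the sign and grows the magnitude, exactly the "jumps too far" failure mode described above.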

7) The differences between Batch Gradient Descent and Stochastic Gradient Descent

Batch Gradient Descent computes the gradient over the whole training set at each step, which makes it very slow and computationally expensive on large training sets. However, it works well on convex or relatively smooth error surfaces, and it scales well with the number of features.

Stochastic Gradient Descent (SGD) picks a random instance of the training data at each step and computes the gradient from it alone, making each step much faster. It can also escape shallow local minima more easily, and the added noise can improve generalization error. However, its updates have higher variance, so the loss fluctuates on its way to the minimum.

Batch GD: Batch size = Size of training set
Stochastic GD: Batch size = 1
Mini-Batch GD: 1 < Batch size < Size of training set
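A toy comparison on a one-parameter linear model y = w·x makes the difference concrete (the data, learning rate, and step count are illustrative). Batch GD averages the gradient over all samples; SGD uses one random sample per step.

```python
import random

random.seed(0)

# Toy data generated from y = 3 * x; the true weight is w = 3.
data = [(x, 3 * x) for x in [0.5, 1.0, 1.5, 2.0]]

def batch_step(w, samples, lr):
    """One batch GD step: squared-error gradient averaged over the whole set."""
    grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
    return w - lr * grad

def sgd_step(w, samples, lr):
    """One SGD step: gradient from a single randomly chosen sample."""
    x, y = random.choice(samples)
    return w - lr * 2 * (w * x - y) * x

w_batch, w_sgd = 0.0, 0.0
for _ in range(200):
    w_batch = batch_step(w_batch, data, lr=0.1)
    w_sgd = sgd_step(w_sgd, data, lr=0.1)
```

On this noise-free problem both reach w ≈ 3, but each SGD step touches one sample instead of four; the per-step cost gap is what makes SGD practical on large datasets.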

8) Epoch vs. Batch vs. Iteration

Epoch: It’s the number of times the whole training dataset is passed through the model.
Batch: It’s the number of examples processed together in one pass.
Iteration: It’s the number of batches required to complete one epoch.
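The relationship between the three terms reduces to simple arithmetic (the dataset size, batch size, and epoch count below are arbitrary):

```python
import math

dataset_size = 1000   # training examples
batch_size = 32       # examples per batch
epochs = 10           # full passes over the dataset

# Iterations per epoch: number of batches needed to see every example once
# (the last batch may be smaller, hence the ceiling).
iterations_per_epoch = math.ceil(dataset_size / batch_size)
total_iterations = iterations_per_epoch * epochs
```

Here one epoch takes 32 iterations, so 10 epochs perform 320 weight updates in total.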

9) What is gradient vanishing? What is gradient explosion?

As we add more and more hidden layers, backpropagation becomes less and less effective at passing gradient information to the lower layers. The error signal may become vanishingly small by the time it reaches the layers close to the input: as information is passed back, the gradients shrink relative to the weights of the network and eventually vanish. Conversely, if the gradients grow larger (or even become NaN) as backpropagation progresses, the error gradients accumulate and explode, producing huge weight updates.

For sigmoid, it saturates at zero for large negative values and at one for large positive values. The same applies to the Tanh function that saturates at -1 and 1. ReLU returns the input if the input value is positive, and it returns 0 if the input is negative.

ReLU can mitigate gradient vanishing but may cause gradient explosion. The output of ReLU is unbounded in the positive domain by design, which means the activations (and hence the gradients) can, in some situations, continue to grow in size.
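A back-of-the-envelope sketch shows why depth matters here: the sigmoid's derivative never exceeds 0.25, and backpropagation multiplies roughly one such factor per layer, so the product shrinks geometrically with depth; repeated factors above 1 grow geometrically instead.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)   # maximized at x = 0, where it equals 0.25

# Vanishing: even the best case of 0.25 per layer decays geometrically.
grad_10_layers = 0.25 ** 10    # ~9.5e-7 after only 10 layers

# Exploding: a repeated factor above 1 grows geometrically instead.
grad_50_layers = 1.5 ** 50     # ~6.4e8 after 50 layers
```

The per-layer factors here are simplified stand-ins for the Jacobian terms in real backpropagation, but the geometric growth/decay is the same mechanism.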

10) What’s the difference between boosting and bagging?

Boosting and bagging are similar in that they are both ensembling techniques, where multiple weak learners (classifiers/regressors that are only slightly better than random guessing) are combined (through averaging or majority vote) to form a single strong learner that can make accurate predictions.

Bagging means that you take bootstrap samples (with replacement) of your dataset and each sample trains a (potentially) weak learner. Boosting, on the other hand, uses all data to train each learner, but instances that were misclassified by the previous learners are given more weight so that subsequent learners give more focus to them during training.
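Both mechanisms can be sketched without any ML library. The bootstrap sampler below is the core of bagging; the reweighting step is a loose, AdaBoost-style illustration of boosting (the misclassified set and the factor of 2 are made up).

```python
import random

random.seed(0)
dataset = list(range(10))   # stand-in for 10 training samples

# Bagging: each learner trains on an independent bootstrap sample,
# drawn with replacement and the same size as the original set.
def bootstrap_sample(data):
    return [random.choice(data) for _ in data]

bags = [bootstrap_sample(dataset) for _ in range(3)]   # 3 learners

# Boosting: one shared dataset with per-sample weights; samples the
# previous learner got wrong are up-weighted for the next learner.
weights = [1.0 / len(dataset)] * len(dataset)
misclassified = {2, 7}   # hypothetical errors of the previous learner
weights = [w * (2.0 if i in misclassified else 1.0)
           for i, w in enumerate(weights)]
total = sum(weights)
weights = [w / total for w in weights]   # renormalize to a distribution
```

The contrast is visible in the code: bagging's learners never see the same dataset, while boosting's learners all see the full dataset but through a shifting weight distribution.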

MIT License
Last updated on Oct 13, 2022 20:52 EDT