Authored by Tony Feng
Created on Oct 9th, 2022
Last Modified on Oct 9th, 2022
Intro
This post contains a collection of questions for machine learning interviews.
Questions
1) Why are small convolution kernels such as 3x3 better than larger ones?
First, several stacked smaller kernels can cover the same receptive field as one large kernel, capturing the same spatial context with fewer parameters and computations.
Secondly, stacking smaller kernels means more layers and therefore more activation functions in between, so the CNN learns a more discriminative mapping function.
Also, small kernels reduce the spatial dimensions slowly, which allows the network to be deeper, whereas large kernels shrink the feature maps very quickly.
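A quick back-of-the-envelope parameter count makes the first point concrete (a sketch; the channel count is an arbitrary assumption). Two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution but use fewer weights and add an extra non-linearity in between:

```python
c = 64  # assumed number of input and output channels (illustrative only)

params_5x5 = 5 * 5 * c * c              # one 5x5 conv layer: 102,400 weights
params_3x3_stack = 2 * (3 * 3 * c * c)  # two stacked 3x3 conv layers: 73,728 weights (~28% fewer)

print(params_5x5, params_3x3_stack)
```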
2) The difference in dropout between training and testing
Dropout is a simple way to prevent a neural network from overfitting by making neurons output ‘wrong’ values on purpose. It is a random process of disabling neurons in a layer with probability p.
At test time, however, we do not apply dropout to the test data. But that means each neuron in the next layer receives more active connections, and therefore larger total activations, during inference than it did during training. For example, if you use a dropout rate of 50%, dropping two out of four neurons in a layer during training, the neurons in the next layer will receive roughly twice the activation during inference and thus become overexcited. To correct this over-activation at inference time, you multiply the weights of that layer by the retention probability (1 – p) to scale the activations back down.
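A minimal NumPy sketch of this behavior, assuming p is the drop probability as above (many frameworks instead use "inverted dropout", which divides by 1 – p during training so no scaling is needed at test time):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # drop probability

def dropout_train(a, p):
    # Each unit is kept with probability (1 - p); dropped units output 0.
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test(a, p):
    # No units are dropped at test time; scale by the retention probability instead.
    return a * (1.0 - p)

a = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(a, p))  # some units zeroed out (depends on the random mask)
print(dropout_test(a, p))   # [0.5 1.  1.5 2. ]
```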
3) How does inference time vary with the depth of the decision tree?
Inference time is O(depth), since a test point only has to travel from the root to a leaf of the decision tree.
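A minimal sketch of that root-to-leaf traversal (the node layout below is an assumption for illustration, not any particular library's representation):

```python
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the feature to split on (internal nodes)
        self.threshold = threshold  # split threshold (internal nodes)
        self.left = left            # subtree for x[feature] <= threshold
        self.right = right          # subtree for x[feature] > threshold
        self.value = value          # prediction stored at a leaf

def predict(node, x):
    # Walk from the root to a leaf: one comparison per level, i.e. O(depth).
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Example: a depth-2 tree.
tree = Node(feature=0, threshold=0.5,
            left=Node(value="A"),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value="B"),
                       right=Node(value="C")))
print(predict(tree, [0.8, 1.5]))  # "B"
```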
4) How does increasing the depth of a decision tree affect its predictions?
- Longer inference time
- To some degree, it reduces bias and prevents underfitting, but it may also lead to overfitting (more leaf nodes).
5) Pros and Cons of decision tree
Advantages:
- Once the tree is constructed, the training data does not need to be stored. Instead, we can simply store splitting conditions.
- Inference is very fast, as test inputs simply need to traverse down the tree to a leaf.
- No distance metric or feature scaling is needed, because splits are based on per-feature thresholds.
Disadvantages:
- Trees fail to deal with linear relationships.
- Trees are quite unstable. A few changes in the training dataset can create a completely different tree.
- The number of terminal nodes increases quickly with depth.
6) What is the purpose of random restarts?
Random restart means restarting at a new random state after a pre-defined number of steps, which can turn a local search algorithm into one with global search capability. It helps in non-convex optimization by alleviating the problem of getting trapped in local minima or flat regions.
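A minimal sketch of random-restart hill climbing on a bumpy one-dimensional objective (the objective function, step size, and restart count are all illustrative assumptions):

```python
import math
import random

def objective(x):
    # A bumpy function with many local maxima; the global maximum is near x = 0.
    return -x * x + 5 * math.cos(3 * x)

def hill_climb(x0, step=0.05, iters=200):
    # Plain local search: move to a better neighbor until none exists.
    x = x0
    for _ in range(iters):
        best = max((x - step, x + step), key=objective)
        if objective(best) <= objective(x):
            break  # stuck at a local maximum
        x = best
    return x

def hill_climb_with_restarts(n_restarts=20):
    # Restart from a fresh random point each time and keep the best local optimum found.
    starts = [random.uniform(-5.0, 5.0) for _ in range(n_restarts)]
    return max((hill_climb(s) for s in starts), key=objective)

print(hill_climb(4.0))             # often stuck at a local maximum far from x = 0
print(hill_climb_with_restarts())  # much more likely to end up near the global maximum
```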
7) What is the bias-variance decomposition?
The bias-variance decomposition is a way of analyzing a learning algorithm's expected generalization error on a particular problem. The bias-variance trade-off is key to understanding the errors and accuracy of supervised machine learning algorithms.
$$\text{Generalization Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
Here, the irreducible error is caused by elements outside our control, such as statistical noise in the observations. No matter how good our model is, the data will always contain some noise that cannot be removed.
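For squared loss, with targets $y = f(x) + \epsilon$ where the noise $\epsilon$ has variance $\sigma^2$ and the expectation is taken over random training sets, the decomposition can be written out as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$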
A formal explanation can be found here.
8) Differences between gradient-boosting decision tree and random forest?
The main differences lie in how the trees are trained (independently vs. sequentially) and in how their predictions are combined.
A random forest is a bagging (bootstrap aggregating) method that builds decision trees independently and in parallel, using different sub-samples of the training data. It then takes the majority vote of the weak learners to produce a final prediction. A random forest can be used for both regression and classification problems. The rationale is that although a single tree may be inaccurate, the collective decisions of a bunch of trees are likely to be right most of the time. It is the bagging, random feature selection, and averaging in random forests that reduce variance. However, a random forest is hard to interpret, since each classification decision or regression output has multiple decision paths.
A gradient-boosting decision tree uses the idea of boosting: weak predictors are built sequentially, and each new predictor uses information from the previously built ones to improve the model. Generally, gradient boosting can achieve better performance than random forests with properly tuned parameters. However, gradient boosting may not be a good choice if the data is very noisy, as it can overfit. It also tends to be harder to tune than a random forest.
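A short scikit-learn sketch contrasting the two ensembles (the dataset and hyperparameters are illustrative assumptions, not tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: trees are built independently on bootstrap samples and their votes are averaged.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Boosting: shallow trees are built sequentially, each fitting the errors of the ensemble so far.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                                random_state=0).fit(X_train, y_train)

print("random forest accuracy:    ", rf.score(X_test, y_test))
print("gradient boosting accuracy:", gb.score(X_test, y_test))
```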
9) What’s the range of cosine distance?
Cosine similarity measures the similarity between two vectors by the cosine of the angle between them, so it ranges from -1 to 1 (or from 0 to 1 when the vectors have only non-negative components, as with term-frequency vectors). Cosine distance is defined as 1 – cosine similarity, so it ranges from 0 to 2 (0 to 1 for non-negative vectors). When two vectors P1 and P2 point in the same direction, cos_sim = 1 and the distance is 0; when the angle between them is 90 degrees, cos_sim = 0 and the distance is 1; when they point in opposite directions, cos_sim = -1 and the distance is 2.
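A minimal NumPy sketch that shows the three reference cases (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)

a = np.array([1.0, 0.0])
print(cosine_distance(a, np.array([2.0, 0.0])))   # 0.0  (same direction)
print(cosine_distance(a, np.array([0.0, 3.0])))   # 1.0  (orthogonal)
print(cosine_distance(a, np.array([-1.0, 0.0])))  # 2.0  (opposite direction)
```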
10) Differences between online training and offline training?
| | Online ML | Offline/Batch ML |
|---|---|---|
| Complexity | More complex because the model keeps evolving over time. | Less complex because the model is fed with more consistent data sets periodically. |
| Computational power | More computational power because the continuous feed of data leads to continuous refinement. | Less computational power because data is delivered in batches. |
| Use in production | Harder to implement and control because the production model changes in real time. | Easier to implement because offline learning gives engineers more time to perfect the model before deployment. |
| Applications | Weather forecasting, stock prices | Big data software |
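A rough scikit-learn sketch of the two modes (the models, data, and chunking scheme are illustrative assumptions): batch learning fits once on all the data, while online learning updates the model incrementally with partial_fit as new chunks arrive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Offline / batch learning: the whole dataset is available up front and the model is fit once.
batch_model = LogisticRegression(max_iter=1000).fit(X, y)

# Online learning: data arrives as a stream; the model is refined incrementally chunk by chunk.
online_model = SGDClassifier(random_state=0)
classes = np.unique(y)
for X_chunk, y_chunk in zip(np.array_split(X, 50), np.array_split(y, 50)):
    online_model.partial_fit(X_chunk, y_chunk, classes=classes)

print("batch accuracy :", batch_model.score(X, y))
print("online accuracy:", online_model.score(X, y))
```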