[Machine Learning] Topic 1: Linear Regression

Authored by Tony Feng

Created on May 2nd, 2022

Last Modified on May 5th, 2022

Intro

Linear Regression is studied as a model for understanding the relationship between input and output numerical variables. It is classified as a supervised learning problem, whose goal is to learn a function/hypothesis $ h: X \rarr Y $ so that $ h(x) $ is a “good” predictor for the corresponding value of $ y $.

Notations

Dataset

$ x^{(i)} $ denotes the input variables or features, and $ y^{(i)} $ denotes the output variable that we’re trying to predict, where the superscript $ i $ denotes the $ i^{th} $ training sample.
A pair $ (x^{(i)}, y^{(i)}) $ denotes a single training example.
$ x^{(i)}_j $ represents the $ j^{th} $ feature of the $ i^{th} $ training sample.
$ m $ denotes the number of training examples, while $ n $ denotes the number of features.

Hypothesis Function

$ h_{\theta}(x) $ denotes the hypothesis function, which is a linear function of the features $ x $ and $ x_0=1 $.

Cost Function

$ J_{\theta} $ denotes the cost function that we want to minimize by finding the optimal set of parameters $ \theta $.

Hypothesis Function

Initial Representation

Given a set of data, let’s say we want to approximate $ y $ as a linear function of $ x $:

$$ h_{\theta}(x) = {\theta}_0 + {\theta}_1x_1 + {\theta}_2x_2 + … + {\theta}_nx_n $$

, where the $ {\theta}_j $’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $ X $ to $ Y $.

Representation Improvement

To make it simplied, we introduce an intercept term $ x_0=1 $ such that:

$$ h(x) = \sum_{j=0}^{n} {\theta}_jx_j = {\theta}^Tx $$

, where on the right-hand side above we are viewing $ {\theta} $ and $ x $ both as vectors.

Cost Function

Error Interpretation

The hypothesis may not capture unmodeled effects or random noise, so we can say:

$$ y^{(i)} = {\theta}^Tx^{(i)} + {\epsilon}^{(i)} $$

Let’s further assume that $ {\epsilon}^{(i)} $ follows a Gaussian distribution (Normal distribution) with mean $ {\mu}=0 $ and variance $ {\sigma}^2 $. This could be expressed as $ {\epsilon}^{(i)} {\sim} N(0, {\sigma}^2)$. This implies that the probability density of $ {\epsilon}^{(i)} $ follows a Gaussian density function as follows:

$$ P({\epsilon}^{(i)}) = \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{({\epsilon}^{(i)})^2}{2 \sigma^{2}}) $$

Another assumption we’re going to make is that the $ {\epsilon}^{(i)} $ error terms are IID, which in statistics stands for independently and identically distributed. Under these set of assumptions, the probability density of $ y^{(i)} $ given $ x^{(i)} $ and parameterized by $ {\theta} $:

$$ P(y^{(i)} \mid x^{(i)} ; \theta)=\frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$

In other words, given $ x^{(i)} $ and $ {\theta} $, the random variable $ y^{(i)} $ is sampled from a Gaussian Distribution $ N({\theta}^Tx^{(i)}, {\sigma}^{2}) $.

Likelihood Function

Given the input features matrix (design matrix) $ X $ (All $ x^{(i)} $ and $ {\theta} $), the probability distribution of $ \vec{y} $ (All $ y^{(i)} $’s) is given by $ P(\vec{y}∣X;{\theta}) $. The likelikhood of the parameters $ L({\theta}) $ is defined as:

$$ L(\theta) =P(\vec{y} \mid X ; \theta) =\prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)} ; \theta) $$

Now, we add details into the above equation, $$ L(\theta)=\prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$

The principal of Maximum Likelihood Estimation (MLE) states that we should choose $ {\theta} $ so as to make the data as high probability as possible. The derivations will be a bit simpler if we instead maximize the log likelihood $ {\ell(\theta)} $:

$$ {\ell(\theta)}=ln(\prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}})) $$

According to the product rule of logarithms,

$$ \ell(\theta)=\sum_{i=1}^{m} ln \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$

Again, the log product rule shows us:

$$ \ell(\theta)=\sum_{i=1}^{m} \left[ln \frac{1}{\sqrt{2 \pi} \sigma} + ln(exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}})\right] $$

This could be expanded as:

$$ \ell(\theta)= m {\cdot} ln \frac{1}{\sqrt{2 \pi} \sigma} - \frac{1}{\sigma^{2}} {\cdot} \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$

Cost Function Formation

From the probability interpretation above, we could regard $ \ell(\theta) $ as:

$$ \ell(\theta)= C_1 - C_2 {\cdot} \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$

, where $ C_1 $ and $ C_2 $ are constant. Hence, maximizing $ \ell(\theta) $ is equivalent to minimizing $ J(\theta) $:

$$ J(\theta)= \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$

, where $ 1/2 $ is a constant that helps cancel 2 in derivative of the function.

The formation above shows that choosing the value of $ \theta $ to minimize the least squares error cost function, is equivalent to finding the MLE for the parameters $ \theta $ under the set of assumptions that the error terms $ {\epsilon}^{(i)} $ are Gaussian and IID.

Cost Function Minimization

Least Mean Square

To find a set of parameters, one way is to find the optimum that minimizes the cost function. Therefore, we need $ \nabla_{\theta} J(\theta)= 0 $. More specifically, $ \frac{\partial J}{\partial \theta_{1}} = 0, \frac{\partial J}{\partial \theta_{2}} = 0, \frac{\partial J}{\partial \theta_{3}} = 0, …, \frac{\partial J}{\partial \theta_{n}} = 0 $.

However, this may generate $ n $ equations and is computationally expensive if there are many features in the dataset. We need to find other ways.

Gradient Descent

We can start with some “initial guess” for $ {\theta} $, and that repeatedly changes $ {\theta} $ to make $ J(\theta) $ smaller, until hopefully we converge to a value of $ {\theta} $ that minimizes $ J(\theta) $. For $ j = 0, 1, 2, …, n $, we simultaneously update the $ {\theta} $ and this process will be kept until convergence:

$$ \theta _{j} \gets \theta _{j} - \alpha \cdot \frac{\partial J}{\partial \theta _{j}} $$

Here, $ \alpha$ is the learning rate, a hyperparameter to control the searching speed. Now, we put all the equations together and we coud see:

$$ \theta_{j}\gets \theta_{j}-\alpha \cdot \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}) x_{j}^{(i)} $$

This is the Stochastic Gradient Descent strategy, because we update the parameters according to the gradient of the error with respect to a single training example only. Beside, there are other strategies, such as Batch GD, Mini-batch GD, etc, which will be discussed in my future posts.

Normal Equation

In this method, we will minimize $ J(\theta) $ by explicitly taking its derivatives with respect to the $ {\theta}_j $’s, and setting them to zero.

Given a training set, define the design matrix $ X $ to be the $ m×n $, and let $ y $ to be the $ m×1 $ vector containing all the target values from the training set. We can obtain:

$$ J(\theta)= \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} = \frac{1}{2} || X\theta-y || ^2 = \frac{1}{2} (X \theta-y)^{T}(X \theta-y) $$

Now, we expand the equation:

$$ J(\theta)= \frac{1}{2} (\theta^{T} X^{T} X \theta-\theta^{T} X^{T} y-y^{T} X \theta + y^{T} y) $$

Since $ \theta^{T} X^{T} y $ and $ y^{T} X \theta $ are scalars, so they are equivalent. We can now simplify the equation as:

$$ J(\theta)= \frac{1}{2} (\theta^{T} X^{T} X \theta-2\theta^{T} X^{T} y + y^{T} y) $$

Accoding to the rules of derivative of matrix, $ \frac{d B^{T} A B}{d B}=(A+A^{\mathrm{T}}) B $ and $ \frac{d B^{T} A}{d B}=A $, we can find its derivatives with respect to $ \theta $:

$$ \nabla_{\theta} J(\theta)= \frac{1}{2} ((X^TX + (X^TX)^T)\theta - 2X^Ty) = \frac{1}{2} (2X^TX\theta - 2X^Ty) $$

Now, we set $ \nabla_{\theta} J(\theta) $ to zero:

$$ X^TX\theta = X^Ty $$

Thus, the value of $ \theta $ that minimizes $ J(\theta) $ is given in closed form by the equation:

$$ \theta=(X^{T} X)^{-1} X^{T} y $$

This approach requires the matrix to be inversible. Usually, if some features have linear relations or the size of training samples is smaller than the number of features, $ X^TX $ is not inversible.

Intro

Notations

Dataset

Hypothesis Function

Cost Function

Hypothesis Function

Initial Representation

Representation Improvement

Cost Function

Error Interpretation

Likelihood Function

Cost Function Formation

Cost Function Minimization

Least Mean Square

Gradient Descent

Normal Equation

Reference