[Machine Learning] Topic 1: Linear Regression

Authored by Tony Feng

Created on May 2nd, 2022

Last Modified on May 5th, 2022

Intro

Linear Regression is studied as a model for understanding the relationship between input and output numerical variables. It is classified as a supervised learning problem, whose goal is to learn a function/hypothesis $ h: X \rarr Y $ so that $ h(x) $ is a “good” predictor for the corresponding value of $ y $.


Notations

Dataset

  • $ x^{(i)} $ denotes the input variables or features, and $ y^{(i)} $ denotes the output variable that we’re trying to predict, where the superscript $ i $ denotes the $ i^{th} $ training sample.
  • A pair $ (x^{(i)}, y^{(i)}) $ denotes a single training example.
  • $ x^{(i)}_j $ represents the $ j^{th} $ feature of the $ i^{th} $ training sample.
  • $ m $ denotes the number of training examples, while $ n $ denotes the number of features.

Hypothesis Function

  • $ h_{\theta}(x) $ denotes the hypothesis function, which is a linear function of the features $ x $ and $ x_0=1 $.

Cost Function

  • $ J_{\theta} $ denotes the cost function that we want to minimize by finding the optimal set of parameters $ \theta $.

Hypothesis Function

Initial Representation

Given a set of data, let’s say we want to approximate $ y $ as a linear function of $ x $:

$$ h_{\theta}(x) = {\theta}_0 + {\theta}_1x_1 + {\theta}_2x_2 + … + {\theta}_nx_n $$

, where the $ {\theta}_j $’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $ X $ to $ Y $.

Representation Improvement

To make it simplied, we introduce an intercept term $ x_0=1 $ such that:

$$ h(x) = \sum_{j=0}^{n} {\theta}_jx_j = {\theta}^Tx $$

, where on the right-hand side above we are viewing $ {\theta} $ and $ x $ both as vectors.


Cost Function

Error Interpretation

The hypothesis may not capture unmodeled effects or random noise, so we can say:

$$ y^{(i)} = {\theta}^Tx^{(i)} + {\epsilon}^{(i)} $$

Let’s further assume that $ {\epsilon}^{(i)} $ follows a Gaussian distribution (Normal distribution) with mean $ {\mu}=0 $ and variance $ {\sigma}^2 $. This could be expressed as $ {\epsilon}^{(i)} {\sim} N(0, {\sigma}^2)$. This implies that the probability density of $ {\epsilon}^{(i)} $ follows a Gaussian density function as follows:

$$ P({\epsilon}^{(i)}) = \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{({\epsilon}^{(i)})^2}{2 \sigma^{2}}) $$

Another assumption we’re going to make is that the $ {\epsilon}^{(i)} $ error terms are IID, which in statistics stands for independently and identically distributed. Under these set of assumptions, the probability density of $ y^{(i)} $ given $ x^{(i)} $ and parameterized by $ {\theta} $:

$$ P(y^{(i)} \mid x^{(i)} ; \theta)=\frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$

In other words, given $ x^{(i)} $ and $ {\theta} $, the random variable $ y^{(i)} $ is sampled from a Gaussian Distribution $ N({\theta}^Tx^{(i)}, {\sigma}^{2}) $.

Likelihood Function

Given the input features matrix (design matrix) $ X $ (All $ x^{(i)} $ and $ {\theta} $), the probability distribution of $ \vec{y} $ (All $ y^{(i)} $’s) is given by $ P(\vec{y}∣X;{\theta}) $. The likelikhood of the parameters $ L({\theta}) $ is defined as:

$$ L(\theta) =P(\vec{y} \mid X ; \theta) =\prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)} ; \theta) $$

Now, we add details into the above equation, $$ L(\theta)=\prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$

The principal of Maximum Likelihood Estimation (MLE) states that we should choose $ {\theta} $ so as to make the data as high probability as possible. The derivations will be a bit simpler if we instead maximize the log likelihood $ {\ell(\theta)} $:

$$ {\ell(\theta)}=ln(\prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}})) $$

According to the product rule of logarithms,

$$ \ell(\theta)=\sum_{i=1}^{m} ln \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$

Again, the log product rule shows us:

$$ \ell(\theta)=\sum_{i=1}^{m} \left[ln \frac{1}{\sqrt{2 \pi} \sigma} + ln(exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}})\right] $$

This could be expanded as:

$$ \ell(\theta)= m {\cdot} ln \frac{1}{\sqrt{2 \pi} \sigma} - \frac{1}{\sigma^{2}} {\cdot} \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$

Cost Function Formation

From the probability interpretation above, we could regard $ \ell(\theta) $ as:

$$ \ell(\theta)= C_1 - C_2 {\cdot} \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$

, where $ C_1 $ and $ C_2 $ are constant. Hence, maximizing $ \ell(\theta) $ is equivalent to minimizing $ J(\theta) $:

$$ J(\theta)= \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$

, where $ 1/2 $ is a constant that helps cancel 2 in derivative of the function.

The formation above shows that choosing the value of $ \theta $ to minimize the least squares error cost function, is equivalent to finding the MLE for the parameters $ \theta $ under the set of assumptions that the error terms $ {\epsilon}^{(i)} $ are Gaussian and IID.


Cost Function Minimization

Least Mean Square

To find a set of parameters, one way is to find the optimum that minimizes the cost function. Therefore, we need $ \nabla_{\theta} J(\theta)= 0 $. More specifically, $ \frac{\partial J}{\partial \theta_{1}} = 0, \frac{\partial J}{\partial \theta_{2}} = 0, \frac{\partial J}{\partial \theta_{3}} = 0, …, \frac{\partial J}{\partial \theta_{n}} = 0 $.

However, this may generate $ n $ equations and is computationally expensive if there are many features in the dataset. We need to find other ways.

Gradient Descent

We can start with some “initial guess” for $ {\theta} $, and that repeatedly changes $ {\theta} $ to make $ J(\theta) $ smaller, until hopefully we converge to a value of $ {\theta} $ that minimizes $ J(\theta) $. For $ j = 0, 1, 2, …, n $, we simultaneously update the $ {\theta} $ and this process will be kept until convergence:

$$ \theta _{j} \gets \theta _{j} - \alpha \cdot \frac{\partial J}{\partial \theta _{j}} $$

Here, $ \alpha$ is the learning rate, a hyperparameter to control the searching speed. Now, we put all the equations together and we coud see:

$$ \theta_{j}\gets \theta_{j}-\alpha \cdot \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}) x_{j}^{(i)} $$

This is the Stochastic Gradient Descent strategy, because we update the parameters according to the gradient of the error with respect to a single training example only. Beside, there are other strategies, such as Batch GD, Mini-batch GD, etc, which will be discussed in my future posts.


Normal Equation

In this method, we will minimize $ J(\theta) $ by explicitly taking its derivatives with respect to the $ {\theta}_j $’s, and setting them to zero.

Given a training set, define the design matrix $ X $ to be the $ m×n $, and let $ y $ to be the $ m×1 $ vector containing all the target values from the training set. We can obtain:

$$ J(\theta)= \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} = \frac{1}{2} || X\theta-y || ^2 = \frac{1}{2} (X \theta-y)^{T}(X \theta-y) $$

Now, we expand the equation:

$$ J(\theta)= \frac{1}{2} (\theta^{T} X^{T} X \theta-\theta^{T} X^{T} y-y^{T} X \theta + y^{T} y) $$

Since $ \theta^{T} X^{T} y $ and $ y^{T} X \theta $ are scalars, so they are equivalent. We can now simplify the equation as:

$$ J(\theta)= \frac{1}{2} (\theta^{T} X^{T} X \theta-2\theta^{T} X^{T} y + y^{T} y) $$

Accoding to the rules of derivative of matrix, $ \frac{d B^{T} A B}{d B}=(A+A^{\mathrm{T}}) B $ and $ \frac{d B^{T} A}{d B}=A $, we can find its derivatives with respect to $ \theta $:

$$ \nabla_{\theta} J(\theta)= \frac{1}{2} ((X^TX + (X^TX)^T)\theta - 2X^Ty) = \frac{1}{2} (2X^TX\theta - 2X^Ty) $$

Now, we set $ \nabla_{\theta} J(\theta) $ to zero:

$$ X^TX\theta = X^Ty $$

Thus, the value of $ \theta $ that minimizes $ J(\theta) $ is given in closed form by the equation:

$$ \theta=(X^{T} X)^{-1} X^{T} y $$

This approach requires the matrix to be inversible. Usually, if some features have linear relations or the size of training samples is smaller than the number of features, $ X^TX $ is not inversible.


Reference


MIT License
Last updated on May 06, 2022 11:55 EDT
Built with Hugo
Theme Stack designed by Jimmy