Authored by Tony Feng
Created on May 2nd, 2022
Last Modified on May 5th, 2022
Intro
Linear Regression is studied as a model for understanding the relationship between input and output numerical variables. It is classified as a supervised learning problem, whose goal is to learn a function/hypothesis $ h: X \rarr Y $ so that $ h(x) $ is a “good” predictor for the corresponding value of $ y $.
Notations
Dataset
- $ x^{(i)} $ denotes the input variables or features, and $ y^{(i)} $ denotes the output variable that we’re trying to predict, where the superscript $ i $ denotes the $ i^{th} $ training sample.
- A pair $ (x^{(i)}, y^{(i)}) $ denotes a single training example.
- $ x^{(i)}_j $ represents the $ j^{th} $ feature of the $ i^{th} $ training sample.
- $ m $ denotes the number of training examples, while $ n $ denotes the number of features.
Hypothesis Function
- $ h_{\theta}(x) $ denotes the hypothesis function, which is a linear function of the features $ x $ and $ x_0=1 $.
Cost Function
- $ J_{\theta} $ denotes the cost function that we want to minimize by finding the optimal set of parameters $ \theta $.
Hypothesis Function
Initial Representation
Given a set of data, let’s say we want to approximate $ y $ as a linear function of $ x $:
$$ h_{\theta}(x) = {\theta}_0 + {\theta}_1x_1 + {\theta}_2x_2 + … + {\theta}_nx_n $$
, where the $ {\theta}_j $’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $ X $ to $ Y $.
Representation Improvement
To make it simplied, we introduce an intercept term $ x_0=1 $ such that:
$$ h(x) = \sum_{j=0}^{n} {\theta}_jx_j = {\theta}^Tx $$
, where on the right-hand side above we are viewing $ {\theta} $ and $ x $ both as vectors.
Cost Function
Error Interpretation
The hypothesis may not capture unmodeled effects or random noise, so we can say:
$$ y^{(i)} = {\theta}^Tx^{(i)} + {\epsilon}^{(i)} $$
Let’s further assume that $ {\epsilon}^{(i)} $ follows a Gaussian distribution (Normal distribution) with mean $ {\mu}=0 $ and variance $ {\sigma}^2 $. This could be expressed as $ {\epsilon}^{(i)} {\sim} N(0, {\sigma}^2)$. This implies that the probability density of $ {\epsilon}^{(i)} $ follows a Gaussian density function as follows:
$$ P({\epsilon}^{(i)}) = \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{({\epsilon}^{(i)})^2}{2 \sigma^{2}}) $$
Another assumption we’re going to make is that the $ {\epsilon}^{(i)} $ error terms are IID, which in statistics stands for independently and identically distributed. Under these set of assumptions, the probability density of $ y^{(i)} $ given $ x^{(i)} $ and parameterized by $ {\theta} $:
$$ P(y^{(i)} \mid x^{(i)} ; \theta)=\frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$
In other words, given $ x^{(i)} $ and $ {\theta} $, the random variable $ y^{(i)} $ is sampled from a Gaussian Distribution $ N({\theta}^Tx^{(i)}, {\sigma}^{2}) $.
Likelihood Function
Given the input features matrix (design matrix) $ X $ (All $ x^{(i)} $ and $ {\theta} $), the probability distribution of $ \vec{y} $ (All $ y^{(i)} $’s) is given by $ P(\vec{y}∣X;{\theta}) $. The likelikhood of the parameters $ L({\theta}) $ is defined as:
$$ L(\theta) =P(\vec{y} \mid X ; \theta) =\prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)} ; \theta) $$
Now, we add details into the above equation, $$ L(\theta)=\prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$
The principal of Maximum Likelihood Estimation (MLE) states that we should choose $ {\theta} $ so as to make the data as high probability as possible. The derivations will be a bit simpler if we instead maximize the log likelihood $ {\ell(\theta)} $:
$$ {\ell(\theta)}=ln(\prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}})) $$
According to the product rule of logarithms,
$$ \ell(\theta)=\sum_{i=1}^{m} ln \frac{1}{\sqrt{2 \pi} \sigma} exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}}) $$
Again, the log product rule shows us:
$$ \ell(\theta)=\sum_{i=1}^{m} \left[ln \frac{1}{\sqrt{2 \pi} \sigma} + ln(exp (-\frac{(y^{(i)}-\theta^{T} x^{(i)})^{2}}{2 \sigma^{2}})\right] $$
This could be expanded as:
$$ \ell(\theta)= m {\cdot} ln \frac{1}{\sqrt{2 \pi} \sigma} - \frac{1}{\sigma^{2}} {\cdot} \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$
Cost Function Formation
From the probability interpretation above, we could regard $ \ell(\theta) $ as:
$$ \ell(\theta)= C_1 - C_2 {\cdot} \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$
, where $ C_1 $ and $ C_2 $ are constant. Hence, maximizing $ \ell(\theta) $ is equivalent to minimizing $ J(\theta) $:
$$ J(\theta)= \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} $$
, where $ 1/2 $ is a constant that helps cancel 2 in derivative of the function.
The formation above shows that choosing the value of $ \theta $ to minimize the least squares error cost function, is equivalent to finding the MLE for the parameters $ \theta $ under the set of assumptions that the error terms $ {\epsilon}^{(i)} $ are Gaussian and IID.
Cost Function Minimization
Least Mean Square
To find a set of parameters, one way is to find the optimum that minimizes the cost function. Therefore, we need $ \nabla_{\theta} J(\theta)= 0 $. More specifically, $ \frac{\partial J}{\partial \theta_{1}} = 0, \frac{\partial J}{\partial \theta_{2}} = 0, \frac{\partial J}{\partial \theta_{3}} = 0, …, \frac{\partial J}{\partial \theta_{n}} = 0 $.
However, this may generate $ n $ equations and is computationally expensive if there are many features in the dataset. We need to find other ways.
Gradient Descent
We can start with some “initial guess” for $ {\theta} $, and that repeatedly changes $ {\theta} $ to make $ J(\theta) $ smaller, until hopefully we converge to a value of $ {\theta} $ that minimizes $ J(\theta) $. For $ j = 0, 1, 2, …, n $, we simultaneously update the $ {\theta} $ and this process will be kept until convergence:
$$ \theta _{j} \gets \theta _{j} - \alpha \cdot \frac{\partial J}{\partial \theta _{j}} $$
Here, $ \alpha$ is the learning rate, a hyperparameter to control the searching speed. Now, we put all the equations together and we coud see:
$$ \theta_{j}\gets \theta_{j}-\alpha \cdot \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}) x_{j}^{(i)} $$
This is the Stochastic Gradient Descent strategy, because we update the parameters according to the gradient of the error with respect to a single training example only. Beside, there are other strategies, such as Batch GD, Mini-batch GD, etc, which will be discussed in my future posts.
Normal Equation
In this method, we will minimize $ J(\theta) $ by explicitly taking its derivatives with respect to the $ {\theta}_j $’s, and setting them to zero.
Given a training set, define the design matrix $ X $ to be the $ m×n $, and let $ y $ to be the $ m×1 $ vector containing all the target values from the training set. We can obtain:
$$ J(\theta)= \frac{1}{2} {\cdot} \sum_{i=1}^{m} (y^{(i)}-\theta^{T} x^{(i)})^{2} = \frac{1}{2} || X\theta-y || ^2 = \frac{1}{2} (X \theta-y)^{T}(X \theta-y) $$
Now, we expand the equation:
$$ J(\theta)= \frac{1}{2} (\theta^{T} X^{T} X \theta-\theta^{T} X^{T} y-y^{T} X \theta + y^{T} y) $$
Since $ \theta^{T} X^{T} y $ and $ y^{T} X \theta $ are scalars, so they are equivalent. We can now simplify the equation as:
$$ J(\theta)= \frac{1}{2} (\theta^{T} X^{T} X \theta-2\theta^{T} X^{T} y + y^{T} y) $$
Accoding to the rules of derivative of matrix, $ \frac{d B^{T} A B}{d B}=(A+A^{\mathrm{T}}) B $ and $ \frac{d B^{T} A}{d B}=A $, we can find its derivatives with respect to $ \theta $:
$$ \nabla_{\theta} J(\theta)= \frac{1}{2} ((X^TX + (X^TX)^T)\theta - 2X^Ty) = \frac{1}{2} (2X^TX\theta - 2X^Ty) $$
Now, we set $ \nabla_{\theta} J(\theta) $ to zero:
$$ X^TX\theta = X^Ty $$
Thus, the value of $ \theta $ that minimizes $ J(\theta) $ is given in closed form by the equation:
$$ \theta=(X^{T} X)^{-1} X^{T} y $$
This approach requires the matrix to be inversible. Usually, if some features have linear relations or the size of training samples is smaller than the number of features, $ X^TX $ is not inversible.
Reference
- Stanford CS229 ML Notes
- Machine Learning Regression Explained
- Linear Regression for Machine Learning
- Logistic Regression — Detailed Overview
- ML Fundamentals: What Is Cost Function?
- Coursera ML Andrew Ng Notes