[Machine Learning] Topic 2: Logistic Regression

Authored by Tony Feng

Created on May 6th, 2022

Last Modified on May 7th, 2022

Intro

Logistic regression models the probability of a discrete outcome given an input variable. The most common form models a binary outcome. It can be viewed as an extension of the linear regression model to classification problems.


Hypothesis Representation

Logistic Function

The linear output $\theta^{T}x$ can be $ \ll 0 $ or $ \gg 1 $, which is unnatural for a binary classification problem with labels 0 or 1. Instead of fitting a straight line or hyperplane, the logistic regression model therefore uses the logistic function to squeeze the output of a linear equation into the interval $(0, 1)$.

$$ g(z) = \frac{1}{1+e^{-z}} $$

The derivative of the sigmoid function is:

$$ \frac{\partial}{\partial z} g(z) = \frac{1}{(1+e^{-z})} \cdot(1-\frac{1}{(1+e^{-z})}) = g(z)\cdot(1-g(z)) $$
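As a quick illustration, here is a minimal Python sketch (the helper names `sigmoid` and `sigmoid_grad` are my own) that checks this identity numerically against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Sanity check against a central finite difference at z = 0.7.
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid_grad(z), numeric)  # both are approximately 0.2217
```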

The function $ g(z) $ starts off close to 0, then rises and asymptotically approaches 1 as $z$ increases. Since we want our hypothesis to output values between 0 and 1, we can compose $ g(z) $ with the linear model $ \theta^{T}x $:

$$ h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}} $$

If $ h_{\theta}(x) \geq 0.5 $, we predict $ y = 1 $; otherwise, we predict $ y = 0 $.
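A small Python sketch of this hypothesis and decision rule, assuming `X` is a design matrix with one example per row (the names `hypothesis` and `predict` are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x), computed row-wise for a design matrix X."""
    return sigmoid(X @ theta)

def predict(theta, X):
    """Predict y = 1 where h_theta(x) >= 0.5, otherwise y = 0."""
    return (hypothesis(theta, X) >= 0.5).astype(int)
```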

Estimated Probability

In the logistic regression model, the output of $ h_{\theta}(x) $ represents the estimated probability that $ y = 1 $ (and $ 1-h_{\theta}(x) $ the probability that $ y = 0 $). Thus, for a single training example $ (x^{(i)},y^{(i)}) $, we have:

$$ P(y^{(i)}=1 \mid x^{(i)} ; \theta)=h_{\theta}(x^{(i)}) $$

$$ P(y^{(i)}=0 \mid x^{(i)} ; \theta)=1-h_{\theta}(x^{(i)}) $$

Those can be written more compactly as a single equation as follows:

$$ P(y^{(i)} \mid x^{(i)} ; \theta)=(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}} $$
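To see why the compact form works, the following hypothetical helper evaluates it for both label values:

```python
def bernoulli_prob(y, h):
    """P(y | x; theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    return (h ** y) * ((1.0 - h) ** (1 - y))

h = 0.75                     # suppose h_theta(x) = 0.75 for some example
print(bernoulli_prob(1, h))  # 0.75 -> reduces to h when y = 1
print(bernoulli_prob(0, h))  # 0.25 -> reduces to 1 - h when y = 0
```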


Cost Function

Maximum Likelihood Estimation

Assuming that the $m$ training examples were generated independently, we can express the likelihood of the parameters $L(\theta)$ as:

$$ L(\theta) =\prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)} ; \theta) =\prod_{i=1}^{m}(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}} $$

We can simplify the calculation by taking the logarithm:

$$ \ell(\theta) =\log L(\theta) =\sum_{i=1}^{m} (y^{(i)} \log (h(x^{(i)})) + (1-y^{(i)}) \log (1-h(x^{(i)}))) $$

Cost Function Formation

Usually, in machine learning, we prefer a minimization problem, so we negate the log-likelihood to convert the ascent problem into a descent problem. Therefore,

$$ J(\theta) = - \ell(\theta) = - \sum_{i=1}^{m} (y^{(i)} \log (h(x^{(i)})) + (1-y^{(i)}) \log (1-h(x^{(i)}))) $$
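A direct translation of this cost into Python (a sketch only; in practice the cost is often averaged over $m$ and the predictions clipped for numerical stability, but this follows the formula above literally):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]."""
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```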

Update Rules

Now, let’s take derivatives to derive the gradient descent rule:

$$ \frac{\partial}{\partial \theta_{j}} J(\theta)=- \sum_{i=1}^{m}(y^{(i)} \frac{1}{h_{\theta}(x^{(i)})} \frac{\partial}{\partial \theta_{j}} h_{\theta}(x^{(i)})-(1-y^{(i)}) \frac{1}{1-h_{\theta}(x^{(i)})} \frac{\partial}{\partial \theta_{j}} h_{\theta}(x^{(i)})) $$

Because $ h_{\theta}(x) = g(\theta^{T}x) $, we have:

$$ \frac{\partial}{\partial \theta_{j}} J(\theta)=- \sum_{i=1}^{m}(y^{(i)} \frac{1}{g(\theta^{\mathrm{T}} x^{(i)})}-(1-y^{(i)}) \frac{1}{1-g(\theta^{\mathrm{T}} x^{(i)})}) \frac{\partial}{\partial \theta_{j}} g(\theta^{\mathrm{T}} x^{(i)}) $$

Previously, we derived $ \frac{\partial}{\partial z} g(z) = g(z)\cdot(1-g(z)) $ and $ \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)} = x_{j}^{(i)} $. Writing $ z = \theta^{\mathrm{T}} x^{(i)} $ for brevity,

$$ \frac{\partial}{\partial \theta_{j}} J(\theta)=-\sum_{i=1}^{m}(y^{(i)} \frac{1}{g(z)}-(1-y^{(i)}) \frac{1}{1-g(z)}) g(z)(1-g(z)) x_{j}^{(i)} $$

Expanding the bracket, $ y^{(i)}(1-g(z)) - (1-y^{(i)})g(z) = y^{(i)} - g(z) $, so finally we have

$$ \frac{\partial}{\partial \theta_{j}} J(\theta)=\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}) x_{j}^{(i)} $$

Thus, the gradient descent rule is

$$ \theta_{j}\gets \theta_{j}-\alpha \cdot \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}) x_{j}^{(i)} $$
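Putting the pieces together, here is a minimal batch gradient descent sketch in Python (the function name, `alpha`, `n_iters`, and the toy data are illustrative choices, not values from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent for logistic regression.

    X is an (m, n) design matrix (include a column of ones for the intercept),
    y is an (m,) vector of 0/1 labels; alpha and n_iters are illustrative.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)   # h_theta(x^(i)) for every example
        grad = X.T @ (h - y)     # sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
        theta -= alpha * grad    # theta_j <- theta_j - alpha * gradient
    return theta

# Toy usage: four 1-D points with an intercept column.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(sigmoid(X @ theta) >= 0.5)  # [False False  True  True]
```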

If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because $ h_{\theta}\left(x^{(i)}\right) $ is now defined as a non-linear function of $ \theta^{T} x^{(i)} $.

It might be a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. This is actually not a coincidence, and is in fact a general property of a much bigger class of algorithms called generalized linear models.

