In the previous post, we predicted values with a simple linear regression model. Let’s think about some different situations.
- Spam (Y/N)
- Fraudulent (Y/N)
- Tumor (Y/N)
When the output is discrete, as in these examples, linear regression cannot predict the correct answer. Then how can we deal with the discrete case?
What is Logistic regression?
In statistics, logistic regression(logit regression, or logit model) is a regression model where the dependent variable (DV) is categorical. - [Wikipedia][1]
It is used to classify categorical variables such as Pass/Fail, Win/Lose, and Alive/Dead.
\[y\in \{0,1\} \begin{cases} 0, & \text{"Negative class"} \\ 1, & \text{"Positive class"} \end{cases}\] \[\text{Want } 0\le h_{\theta}(x) \le 1\\ h_\theta(x) = g(\theta^Tx)\\ g(z) = \frac{1}{1+e^{-z}}\ (\text{Sigmoid function})\] $h_\theta(x) = \text{estimated probability that } y=1 \text{ on input } x$ $= P(y=1\vert x;\theta)$
$P(y=0\vert x; \theta) = 1 - P(y=1\vert x;\theta)$
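A minimal NumPy sketch of the sigmoid and the hypothesis (the $\theta$ and $x$ values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1 given x."""
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])  # hypothetical parameters
x = np.array([1.0, 0.3, 0.8])       # x_0 = 1 plus two features
print(hypothesis(theta, x))         # always lands in (0, 1)
```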
Decision boundary
Decision boundary
: the set of points that separates the region where the hypothesis predicts $y=1$ from the region where it predicts $y=0$.
\(\begin{cases} \text{if }h_\theta(x) \ge 0.5 & y=1\\ \text{if }h_\theta(x) \lt 0.5 & y=0\\ \end{cases}\) $g(z)\ge 0.5$ when $z\ge0$.
$h_\theta(x) = g(\theta^T x) \ge 0.5$ whenever $\theta^Tx \ge 0$
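In code, this threshold rule is a one-liner on top of the hypothetical `hypothesis` helper sketched above:

```python
def predict(theta, x):
    """Predict y = 1 when h_theta(x) >= 0.5, equivalently when theta^T x >= 0."""
    return 1 if hypothesis(theta, x) >= 0.5 else 0
```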
Linear boundary
\[h_\theta(x) = g(\theta_0 + \theta_1x_1 +\theta_2x_2)\\ g(z) = \frac{1}{1+e^{-z}}\]Predict “$y=1$” if $\theta_0 + \theta_1x_1 +\theta_2x_2 = \theta^Tx \ge 0$
$\theta^Tx=0$ is the decision boundary.
Non-linear boundary
\[h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2)\\ g(z) = \frac{1}{1+e^{-z}}\] \[\theta = \begin{bmatrix} -1\\0\\0\\1\\1\\ \end{bmatrix}\]Predict “$y=1$” if $-1 + x_1^2 + x_2^2 = \theta^Tx \ge 0$, i.e. if $x_1^2 + x_2^2 \ge 1$.
$\theta^Tx=0$ is the decision boundary.
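To make the circular boundary concrete, a small sketch that uses the $\theta$ above and assumes the feature vector is $[1, x_1, x_2, x_1^2, x_2^2]$:

```python
import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def predict_nonlinear(x1, x2):
    """Predict y = 1 when theta^T x >= 0 with features [1, x1, x2, x1^2, x2^2]."""
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return 1 if theta @ features >= 0 else 0

print(predict_nonlinear(0.5, 0.5))  # inside the unit circle  -> 0
print(predict_nonlinear(1.5, 0.0))  # outside the unit circle -> 1
```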
Cost function
Training set
: {$(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \cdots, (x^{(m)},y^{(m)})$}
m examples
: $x = \begin{bmatrix}x_0 \\ x_1\\ \vdots \\ x_n \end{bmatrix}$, $x_0=1$, $y \in \{ 0,1 \}$
$h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$
\[Cost(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) &\text{if } y = 1\\ -\log(1-h_\theta(x)) &\text{if } y = 0\\ \end{cases}\]This can be combined into the single equation below.
$Cost(h_\theta(x),y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$
The cost is small when the prediction is right and grows large when it is wrong.
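A quick numerical sketch of that behaviour (the probability values are made up):

```python
import numpy as np

def cost_single(h, y):
    """Cost(h_theta(x), y) = -y*log(h) - (1-y)*log(1-h) for one example."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

print(cost_single(0.99, 1))  # confident and correct -> cost ~ 0.01
print(cost_single(0.01, 1))  # confident but wrong   -> cost ~ 4.6
```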
\[J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\]If we used the squared-error cost from linear regression, Cost($h_\theta(x),y$) = $\frac{1}{2}(h_\theta(x) - y)^2$, then because $h_\theta(x)$ is non-linear for logistic regression, $J(\theta)$ would be a non-convex function with many local optima.
To avoid those local optima, the log-based cost is used instead. \(Cost(h_\theta(x),y) = \begin{cases} -\log(h_\theta (x)) & if\ y=1\\ -\log(1-h_\theta(x)) & if\ y=0\\ \end{cases}\)
$y$ is always $0$ or $1$, so only one of the two terms in the combined equation is nonzero.
Gradient Descent Algorithm for Logistic Regression
\(J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta (x^{(i)}))\right]\) To fit parameters $\theta$:
\[h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\\ \frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\]To make a prediction given a new $x$: output $h_\theta(x)$
\[\min_\theta J(\theta):\\ \begin{align} Repeat\ \{ \\ &\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}\\ \} \end{align}\]
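A minimal batch gradient descent sketch of this update, assuming `X` is an $m \times (n+1)$ matrix with a leading column of ones and `y` is a 0/1 vector; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Repeat: theta_j := theta_j - (alpha/m) * sum((h - y) * x_j)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)          # h_theta(x) for every example
        gradient = X.T @ (h - y) / m    # dJ/dtheta_j for every j
        theta -= alpha * gradient
    return theta
```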
Advanced optimization
Cost function $J(\theta)$. We want $\min_\theta J(\theta)$.
Given $\theta$, we have code that can compute:
- $J(\theta)$
- $\frac{dJ(\theta)}{d\theta_j}$
There are several algorithms for optimization; a sketch using one of them follows the lists below.
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Advantages (of the last three over gradient descent)
- No need to manually pick $\alpha$
- Often faster than gradient descent
Disadvantage
- More complex
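As an illustration of handing $J(\theta)$ and its gradient to a library optimizer instead of tuning $\alpha$ by hand, a sketch using SciPy's L-BFGS implementation on a tiny made-up dataset:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# Toy data: a column of ones (x_0 = 1) plus one feature, labels 0/1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

result = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y),
                  jac=gradient, method='L-BFGS-B')
print(result.x)  # fitted theta; no learning rate was picked manually
```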
Multiclass classification
What if the problem is not binary? How can we deal with it using logistic regression?
For example:
- Email (Work/Friends/Family)
- Weather (Sunny/Cloudy/Rain/Snow)
Just think of it as three separate binary classifications (a sketch follows the list below), like
- Email (Work or not / Friends or not / Family or not)
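A minimal one-vs-all sketch of this idea: train one binary classifier per class and predict the class whose classifier gives the highest probability. The class names and the `train_binary` hook are illustrative, not from the original post; `train_binary` could be, for example, the `gradient_descent` sketch above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

classes = ["Work", "Friends", "Family"]  # hypothetical email classes

def one_vs_all(X, labels, train_binary):
    """Train one 'this class or not' logistic classifier per class."""
    thetas = []
    for c in classes:
        y_binary = (np.array(labels) == c).astype(float)  # Work or not, etc.
        thetas.append(train_binary(X, y_binary))
    return np.array(thetas)

def predict_class(thetas, x):
    """Pick the class whose classifier reports the highest probability."""
    probs = sigmoid(thetas @ x)
    return classes[int(np.argmax(probs))]
```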
Softmax
In the two-class case, we use the equation below to predict $\hat{y}$.
\(H_L(x)=Wx\\ z=H_L(x)\\ g(z)=\frac{1}{1+e^{-z}}\\ H_R(x)=g(H_L(x))\) We can write this equation in matrix form. \(\begin{bmatrix} w_1&w_2&w_3 \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} = \begin{bmatrix} w_1x_1+w_2x_2+w_3x_3 \end{bmatrix}\)
Then we stack several such $H_R(x)$ to predict multinomial values.
\[\begin{bmatrix} w_{A1}&w_{A2}&w_{A3}\\ w_{B1}&w_{B2}&w_{B3}\\ w_{C1}&w_{C2}&w_{C3} \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} = \begin{bmatrix} w_{A1}x_1+w_{A2}x_2+w_{A3}x_3\\ w_{B1}x_1+w_{B2}x_2+w_{B3}x_3\\ w_{C1}x_1+w_{C2}x_2+w_{C3}x_3 \end{bmatrix} \\ = \begin{bmatrix} \hat{Y}_A\\ \hat{Y}_B\\ \hat{Y}_C \end{bmatrix}\]The prediction values $\hat{Y}$ can be converted by the softmax function into the range 0~1 (and they sum to 1).
\[S(y_i) = \frac{e^{y_i}}{\sum_je^{y_j}}\]
Cost function - Cross entropy
When the prediction is right, there is little or no penalty. On the other hand, when the prediction is wrong, the cost function gives a big penalty.
\(Y=L\ (\text{label})\\ D(S,L)=-\sum_iL_i \log(S_i)\) It is the same as the equation used in logistic regression. \(D(S,L)=-\sum_iL_i \log(S_i) \\ = -y\log(H(x))-(1-y)\log(1-H(x))\)
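A NumPy sketch of the softmax and cross-entropy formulas above, with a made-up weight matrix $W$ and a one-hot label $L$:

```python
import numpy as np

def softmax(scores):
    """S(y_i) = e^{y_i} / sum_j e^{y_j}; subtracting the max keeps exp() stable."""
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

def cross_entropy(S, L):
    """D(S, L) = -sum_i L_i * log(S_i)."""
    return -np.sum(L * np.log(S))

W = np.array([[ 1.0, 0.2, -0.5],   # hypothetical weights for classes A, B, C
              [ 0.3, 1.5,  0.1],
              [-0.4, 0.0,  2.0]])
x = np.array([1.0, 0.5, 0.2])
S = softmax(W @ x)                 # scores -> probabilities in (0, 1) that sum to 1
L = np.array([1.0, 0.0, 0.0])      # one-hot label: the true class is A
print(S, cross_entropy(S, L))      # the penalty shrinks as S[0] approaches 1
```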