
In the previous post, we predicted continuous values using a simple linear regression model. Now let's think about different situations.

  • SPAM(Y/N)
  • Fraudulent(Y/N)
  • Tumor(Y/N)

When the output is discrete, as in the examples above, linear regression cannot predict the correct answer. So how can we deal with the discrete case?

What is Logistic regression?

In statistics, logistic regression(logit regression, or logit model) is a regression model where the dependent variable (DV) is categorical. - [Wikipedia][1]

It is used to classify categorical variables such as Pass/Fail, Win/Lose, and Alive/Dead.

\[y\in \{0,1\} \begin{cases} 0, & \text{Negative class} \\ 1, & \text{Positive class} \end{cases}\] \[\text{Want } 0 \le h_{\theta}(x) \le 1\\ h_\theta(x) = g(\theta^Tx)\\ g(z) = \frac{1}{1+e^{-z}}\]

$g(z)= \frac{1}{1+e^{-z}}(\text{ Sigmoid function})$ $h_\theta(x) = \text{estimated probability that y=1 on input }x$ $= P(y=1\vert x;\theta)$

$P(y=0\vert x; \theta) = 1 - P(y=1\vert x;\theta)$
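
As a quick illustration, here is a minimal NumPy sketch of the sigmoid function and the hypothesis $h_\theta(x) = g(\theta^Tx)$ (the values of `theta` and `x` below are made up for demonstration):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)): maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x): estimated probability that y = 1 given x
    return sigmoid(np.dot(theta, x))

theta = np.array([-1.0, 0.5])   # illustrative parameters
x = np.array([1.0, 4.0])        # x_0 = 1 (bias term)
print(hypothesis(theta, x))     # ~0.73, read as P(y=1 | x; theta)
```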

Decision boundary

The decision boundary is the line (or surface) that separates the region where the hypothesis predicts $y=1$ from the region where it predicts $y=0$.

\(\begin{cases} \text{if }h_\theta(x) \ge 0.5 & y=1\\ \text{if }h_\theta(x) \lt 0.5 & y=0\\ \end{cases}\) $g(z)\ge 0.5$ when, $z\ge0$.

$h_\theta(x) = g(\theta^T x) \ge 0.5$ whenever $\theta^Tx \ge 0$

Linear boundary

\[h_\theta(x) = g(\theta_0 + \theta_1x_1 +\theta_2x_2)\\ g(z) = \frac{1}{1+e^{-z}}\]

Predict “$y=1$” if $\theta_0 + \theta_1x_1 +\theta_2x_2 = \theta^Tx \ge 0$

$\theta^Tx=0$ is the decision boundary.

Non-linear boundary

\[h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2)\\ g(z) = \frac{1}{1+e^{-z}}\] \[\theta = \begin{bmatrix} -1\\0\\0\\1\\1\\ \end{bmatrix}\]

Predict “$y=1$” if $-1 + x_1^2+x_2^2= \theta^Tx \ge 0$

$\theta^Tx=0$ is the decision boundary (here the circle $x_1^2 + x_2^2 = 1$).
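
A small NumPy sketch of this non-linear boundary, using the $\theta$ above (the sample points are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x1, x2):
    # Feature vector [1, x1, x2, x1^2, x2^2], matching theta = [-1, 0, 0, 1, 1]
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return 1 if sigmoid(features @ theta) >= 0.5 else 0

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])
print(predict(theta, 0.5, 0.5))  # 0: inside the circle x1^2 + x2^2 = 1
print(predict(theta, 1.5, 0.5))  # 1: outside the circle
```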

Cost function

Training set: {$(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \cdots, (x^{(m)},y^{(m)})$}

m examples: $x = \begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, $x_0=1$, $y \in \{ 0,1 \}$

$h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$

\[Cost(h_\theta(x),y) = \begin{cases} -log(h_\theta(x)) &\text{if } y = 1\\ -log(1-h_\theta(x)) &\text{if } y = 0\\ \end{cases}\]

These two cases can be combined into the single equation below.

$Cost(h_\theta(x),y) = -ylog(h_\theta(x)) - (1-y)log(1-h_\theta(x))$

The cost is close to zero when the prediction matches the label and grows large when the prediction is wrong.
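
To see this behavior numerically, here is a tiny Python check (the probability values are arbitrary examples):

```python
import numpy as np

def cost(h, y):
    # Combined form: -y*log(h) - (1-y)*log(1-h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

print(cost(0.99, 1))  # ~0.01: confident and correct -> tiny cost
print(cost(0.01, 1))  # ~4.61: confident and wrong   -> large cost
print(cost(0.01, 0))  # ~0.01: confident and correct -> tiny cost
```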

\[J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\]

If we reused the squared-error cost from linear regression, Cost($h_\theta(x),y$) = $\frac{1}{2}(h_\theta(x) - y)^2$, then, because $h_\theta(x)$ is non-linear, $J(\theta)$ would be non-convex with many local optima.

To avoid the local optima, a log-based cost is used instead, which makes $J(\theta)$ convex: \(Cost(h_\theta(x),y) = \begin{cases} -log(h_\theta (x)) & if\ y=1\\ -log(1-h_\theta(x)) & if\ y=0\\ \end{cases}\)

Note that $y$ is always $0$ or $1$.

Gradient Descent Algorithm for Logistic Regression

\(J(\theta) = -\frac{1}{m}[\sum_{i=1}^m y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta (x^{(i)}))]\) To fit parameters $\theta$:

\[h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\\ \frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\]

To make a prediction given new $x$: output $h_\theta(x)$

\[\min_\theta J(\theta):\\ \begin{align} Repeat\ \{ \\ &\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}\\ \} \end{align}\]
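
A vectorized NumPy sketch of this update rule (`X`, `y`, the learning rate, and the iteration count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, theta, alpha=0.1, iterations=1000):
    # X: (m, n+1) design matrix with a leading column of ones; y: (m,) labels
    m = len(y)
    for _ in range(iterations):
        h = sigmoid(X @ theta)          # h_theta(x^(i)) for every example
        grad = (X.T @ (h - y)) / m      # (1/m) * sum_i (h - y) * x_j^(i)
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta
```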

Advanced optimization

Cost function: $J(\theta)$. Want $\min_\theta J(\theta)$.

Given $\theta$, we have code that can compute:

  • $J(\theta)$
  • $\frac{\partial J(\theta)}{\partial\theta_j}$

There are some algorithms for optimization (a usage sketch follows this list):

  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS

Advantages of the last three:

  • No need to manually pick $\alpha$
  • Often faster than gradient descent

Disadvantage:

  • More complex
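
As a hedged sketch of how one of these optimizers can be used in practice, SciPy's `minimize` accepts a function returning both $J(\theta)$ and its gradient (the toy data below is made up):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    # Returns J(theta) and its gradient for logistic regression
    m = len(y)
    h = sigmoid(X @ theta)
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return cost, grad

# Toy data: 4 examples, bias column + 1 feature (not linearly separable)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

result = minimize(cost_and_grad, x0=np.zeros(2), args=(X, y),
                  jac=True, method='L-BFGS-B')
print(result.x)  # fitted theta; no learning rate alpha had to be chosen
```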

Multiclass classification

What if the problem is not binary? How can we deal with it using logistic regression?

For example

  • Email (Work/Friends/Family)
  • Weather (Sunny/Cloudy/Rain/Snow)

Treat it as several binary (one-vs-all) classification problems, for example

  • Email (Work or not / Friends or not / Family or not)

\[y \in \{0,1,\dots,n\}\\ h_\theta^{(0)}(x)=P(y=0\vert x;\theta)\\ h_\theta^{(1)}(x)=P(y=1\vert x;\theta)\\ \vdots\\ h_\theta^{(n)}(x)=P(y=n\vert x;\theta)\\ \hat{y} = \arg\max_i h_\theta^{(i)}(x)\]
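
A minimal NumPy sketch of the one-vs-all prediction step, assuming one trained parameter vector per class is stored in the rows of `all_theta` (the numbers are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(all_theta, x):
    # all_theta: (num_classes, n+1), one trained classifier per class
    # x: (n+1,) feature vector with x_0 = 1
    probs = sigmoid(all_theta @ x)   # h_theta^(i)(x) for every class i
    return int(np.argmax(probs))     # class with the highest probability

all_theta = np.array([[ 1.0, -2.0],   # class 0
                      [-1.0,  0.5],   # class 1
                      [-3.0,  1.5]])  # class 2
x = np.array([1.0, 2.5])
print(predict_one_vs_all(all_theta, x))  # 2
```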

Softmax

In the two-class case, we used the equations below to predict $\hat{y}$.

\(H_L(x)=Wx\\ z=H_L(x)\\ g(z)=\frac{1}{1+e^{-z}}\\ H_R(x)=g(H_L(x))\) We can write this as a matrix expression. \(\begin{bmatrix} w_1&w_2&w_3 \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} = \begin{bmatrix} w_1x_1+w_2x_2+w_3x_3 \end{bmatrix}\)

To predict multiple classes, we stack several such hypotheses, one per class:

\[\begin{bmatrix} w_{A1}&w_{A2}&w_{A3}\\ w_{B1}&w_{B2}&w_{B3}\\ w_{C1}&w_{C2}&w_{C3} \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} = \begin{bmatrix} w_{A1}x_1+w_{A2}x_2+w_{A3}x_3\\ w_{B1}x_1+w_{B2}x_2+w_{B3}x_3\\ w_{C1}x_1+w_{C2}x_2+w_{C3}x_3 \end{bmatrix} \\ = \begin{bmatrix} \hat{Y}_A\\ \hat{Y}_B\\ \hat{Y}_C \end{bmatrix}\]

The softmax function converts the scores $\hat{Y}$ into probabilities in the range 0–1 that sum to 1.

\[S(y_i) = \frac{e^{y_i}}{\sum_je^{y_j}}\]
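
A minimal NumPy sketch of the softmax function (the score vector is an arbitrary example; subtracting the maximum is a common numerical-stability trick):

```python
import numpy as np

def softmax(y):
    # S(y_i) = e^(y_i) / sum_j e^(y_j)
    exp_y = np.exp(y - np.max(y))   # shift scores for numerical stability
    return exp_y / np.sum(exp_y)

scores = np.array([2.0, 1.0, 0.1])   # e.g. [Y_A, Y_B, Y_C]
print(softmax(scores))               # ~[0.66, 0.24, 0.10], sums to 1
```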

Cost function - Cross entropy

When the prediction is right, the penalty is small or zero. On the other hand, when the prediction is wrong, the cost function gives a large penalty.

\(L = Y\ \text{(the one-hot label)}\\ D(S,L)=-\sum_iL_i log(S_i)\) It is the same as the cost used in binary logistic regression: \(D(S,L)=-\sum_iL_i log(S_i) = -ylog(H(x))-(1-y)log(1-H(x))\)
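
A short NumPy sketch of the cross-entropy between a softmax output `S` and a one-hot label `L` (the vectors are made-up examples):

```python
import numpy as np

def cross_entropy(S, L):
    # D(S, L) = -sum_i L_i * log(S_i)
    return -np.sum(L * np.log(S))

L = np.array([0.0, 1.0, 0.0])                        # one-hot label: class B
print(cross_entropy(np.array([0.1, 0.8, 0.1]), L))   # ~0.22: good prediction
print(cross_entropy(np.array([0.8, 0.1, 0.1]), L))   # ~2.30: bad prediction
```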

Python code

Python code with NumPy
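
A minimal sketch of binary logistic regression with NumPy, trained by the gradient descent rule above (the toy data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iterations=5000):
    # X: (m, n+1) with a leading column of ones; y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        theta -= alpha * (X.T @ (h - y)) / m   # gradient descent step
    return theta

def predict(theta, X):
    return (sigmoid(X @ theta) >= 0.5).astype(int)

# Toy data: pass/fail as a function of hours studied
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones_like(hours), hours])
y = np.array([0, 0, 0, 1, 1, 1])

theta = train(X, y)
print(predict(theta, X))  # should roughly reproduce y
```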

Implementation with TensorFlow
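
One way to express the same model with TensorFlow's Keras API; this is only a sketch, and the layer, optimizer, and epoch choices here are assumptions:

```python
import numpy as np
import tensorflow as tf

# Same toy data as the NumPy sketch above (hours studied -> pass/fail)
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]], dtype=np.float32)
passed = np.array([0, 0, 0, 1, 1, 1], dtype=np.float32)

# Logistic regression is a single dense unit with a sigmoid activation
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss='binary_crossentropy')   # -y*log(h) - (1-y)*log(1-h)

model.fit(hours, passed, epochs=2000, verbose=0)
print(model.predict(hours).round())  # should roughly reproduce the labels
```

Keras's `binary_crossentropy` is the same cost $-ylog(H(x))-(1-y)log(1-H(x))$ derived earlier; the only difference is that the framework computes the gradients for us.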
