Derivative of the cost function for logistic regression

The reason is the following. We use the notation:

$$\theta x^i:=\theta_0+\theta_1 x^i_1+\dots+\theta_p x^i_p.$$

Then

$$\log h_\theta(x^i)=\log\frac{1}{1+e^{-\theta x^i} }=-\log \left( 1+e^{-\theta x^i} \right),$$ $$\log\left(1- h_\theta(x^i)\right)=\log\left(1-\frac{1}{1+e^{-\theta x^i} }\right)=\log\frac{e^{-\theta x^i}}{1+e^{-\theta x^i}}=\log \left(e^{-\theta x^i} \right)-\log \left( 1+e^{-\theta x^i} \right)=-\theta x^i-\log \left( 1+e^{-\theta x^i} \right).$$ [For the second line we wrote $1 = \frac{1+e^{-\theta x^i}}{1+e^{-\theta x^i}}$, so the numerator becomes $1+e^{-\theta x^i}-1=e^{-\theta x^i}$, and then used $\log(x/y) = \log(x) - \log(y)$.]
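As a quick sanity check, both identities can be verified numerically. Below is a minimal sketch using NumPy; `theta_x` is a hypothetical test value standing in for the scalar product $\theta x^i$:

```python
import numpy as np

def h(theta_x):
    """Sigmoid h_theta(x) as a function of the scalar theta·x."""
    return 1.0 / (1.0 + np.exp(-theta_x))

theta_x = 0.7  # arbitrary test value for theta·x^i

# log h_theta(x) should equal -log(1 + e^{-theta·x})
lhs1 = np.log(h(theta_x))
rhs1 = -np.log1p(np.exp(-theta_x))

# log(1 - h_theta(x)) should equal -theta·x - log(1 + e^{-theta·x})
lhs2 = np.log(1.0 - h(theta_x))
rhs2 = -theta_x - np.log1p(np.exp(-theta_x))

print(np.isclose(lhs1, rhs1), np.isclose(lhs2, rhs2))  # True True
```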

Since our original cost function is of the form

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right],$$

Plugging in the two simplified expressions above, we obtain $$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[-y^i\log \left( 1+e^{-\theta x^i}\right) + (1-y^i)\left(-\theta x^i-\log \left( 1+e^{-\theta x^i} \right)\right)\right],$$ which can be simplified to: $$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\theta x^i-\log\left(1+e^{-\theta x^i}\right)\right]=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\log\left(1+e^{\theta x^i}\right)\right],\qquad(*)
$$

where the second equality follows from

$$-\theta x^i-\log\left(1+e^{-\theta x^i}\right)= -\left[ \log e^{\theta x^i}+ \log\left(1+e^{-\theta x^i} \right) \right]=-\log\left(e^{\theta x^i}\left(1+e^{-\theta x^i}\right)\right)=-\log\left(1+e^{\theta x^i}\right). $$ [Here we used $\theta x^i = \log e^{\theta x^i}$ and $\log(x) + \log(y) = \log(xy)$.]
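As a quick check that nothing was lost in the algebra, the simplified form $(*)$ can be compared numerically against the original cross-entropy cost. This is a sketch on synthetic data, assuming a design matrix `X` whose first column is all ones so that $\theta_0$ acts as the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])  # first column = 1 for theta_0
y = rng.integers(0, 2, size=m)
theta = rng.normal(size=p + 1)

z = X @ theta                   # theta·x^i for every example i
h = 1.0 / (1.0 + np.exp(-z))    # h_theta(x^i)

# Original cross-entropy form of J(theta)
J_original = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Simplified form (*): J(theta) = -(1/m) * sum_i [ y^i * theta·x^i - log(1 + e^{theta·x^i}) ]
J_simplified = -np.mean(y * z - np.log1p(np.exp(z)))

print(np.isclose(J_original, J_simplified))  # True
```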

All you need now is to compute the partial derivatives of $(*)$ with respect to $\theta_j$. Since $$\frac{\partial}{\partial \theta_j}\,y^i\theta x^i=y^ix^i_j, $$ $$\frac{\partial}{\partial \theta_j}\log\left(1+e^{\theta x^i}\right)=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}=x^i_jh_\theta(x^i)$$ (the last equality follows by multiplying numerator and denominator by $e^{-\theta x^i}$),

the claimed formula for the gradient follows:

$$\frac{\partial}{\partial \theta_j}J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[y^ix^i_j-x^i_jh_\theta(x^i)\right]=\frac{1}{m}\sum_{i=1}^m \left(h_\theta(x^i)-y^i\right)x^i_j.$$
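To double-check this, the analytic gradient can be compared against a central-difference approximation of $J(\theta)$. Again a sketch on synthetic data; `cost` and `grad` are hypothetical helper names, not from the derivation above:

```python
import numpy as np

def cost(theta, X, y):
    """Cross-entropy cost J(theta) in the simplified form (*)."""
    z = X @ theta
    return -np.mean(y * z - np.log1p(np.exp(z)))

def grad(theta, X, y):
    """Analytic gradient: component j is (1/m) * sum_i (h_theta(x^i) - y^i) * x^i_j."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - y) / len(y)

rng = np.random.default_rng(1)
m, p = 40, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])
y = rng.integers(0, 2, size=m).astype(float)
theta = rng.normal(size=p + 1)

# Central-difference approximation of each partial derivative
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(p + 1)
])

print(np.allclose(grad(theta, X, y), numeric, atol=1e-6))  # True
```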


@pedro-lopes, it is called the chain rule: $$(u(v))' = u'(v) \cdot v'.$$ For example: $$y = \sin(3x - 5),\qquad u(v) = \sin(v),\qquad v = 3x - 5,$$ $$y' = \cos(3x - 5) \cdot (3 - 0) = 3\cos(3x-5).$$

Regarding: $$\frac{\partial}{\partial \theta_j}\log\left(1+e^{\theta x^i}\right)=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}},$$ take $$u(v) = \log(v),\qquad v = 1+e^{\theta x^i}.$$ Then $$\frac{\partial}{\partial \theta_j}\log\left(1+e^{\theta x^i}\right) = \frac{1}{1+e^{\theta x^i}} \cdot \frac{\partial}{\partial \theta_j}\left(1+e^{\theta x^i}\right) = \frac{1}{1+e^{\theta x^i}} \cdot \left(0 + x^i_je^{\theta x^i}\right) = \frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}. $$ Note that $$(\log x)' = \frac{1}{x}.$$ Hope that answers your question!
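For completeness, the chain-rule result can be confirmed with a finite-difference check. A minimal sketch, treating $\theta$ and $x$ as single scalars with arbitrary test values:

```python
import numpy as np

x = 2.0      # hypothetical feature value
theta = 0.3  # hypothetical parameter value

f = lambda t: np.log(1.0 + np.exp(t * x))                      # f(theta) = log(1 + e^{theta*x})
analytic = x * np.exp(theta * x) / (1.0 + np.exp(theta * x))   # chain-rule derivative

eps = 1e-6
numeric = (f(theta + eps) - f(theta - eps)) / (2 * eps)        # central difference

print(np.isclose(analytic, numeric))  # True
```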


We have, \begin{align*} L(\theta) &= -\frac{1}{m}\sum\limits_{i=1}^{m}\left[y_i \log P(y_i|x_i,\theta) + (1-y_i) \log\left(1 - P(y_i|x_i,\theta)\right)\right], \\ h_\theta(x_i) &= P(y_i|x_i,\theta) = P(y_i=1|x_i,\theta) = \frac{1}{1+\exp\left(-\sum\limits_k \theta_k x_i^k \right)}. \end{align*}

Then, \begin{align*} \log\left(P(y_i|x_i,\theta)\right)=\log\left(P(y_i=1|x_i,\theta)\right) &=-\log\left(1+\exp\left(-\sum\limits_k \theta_k x_i^k \right) \right) \\ \Rightarrow \frac{\partial }{\partial \theta_j} \log P(y_i|x_i,\theta) =\frac{x_i^j\exp\left(-\sum\limits_k \theta_k x_i^k\right)}{1+\exp\left(-\sum\limits_k \theta_k x_i^k\right)} &= x_i^j\left(1-P(y_i|x_i,\theta)\right) \end{align*} and \begin{align*} \log\left(1-P(y_i|x_i,\theta)\right)=\log\left(1-P(y_i=1|x_i,\theta)\right) &=-\sum\limits_k \theta_k x_i^k -\log\left(1+\exp\left(-\sum\limits_k \theta_k x_i^k \right) \right) \\ \Rightarrow \frac{\partial }{\partial \theta_j} \log\left(1 - P(y_i|x_i,\theta)\right) &= -x_i^j + x_i^j\left(1-P(y_i|x_i,\theta)\right) = -x_i^j\,P(y_i|x_i,\theta). \\ \end{align*}

Hence,

\begin{align*} \frac{\partial }{\partial \theta_j} L(\theta) &= -\frac{1}{m}\sum\limits_{i=1}^{m}\left[y_i\,\frac{\partial }{\partial \theta_j} \log P(y_i|x_i,\theta) + (1-y_i)\,\frac{\partial }{\partial \theta_j} \log\left(1 - P(y_i|x_i,\theta)\right)\right] \\ &=-\frac{1}{m}\sum\limits_{i=1}^{m}\left[y_i\,x_i^j\left(1-P(y_i|x_i,\theta)\right) - (1-y_i)\,x_i^j\,P(y_i|x_i,\theta)\right] \\ &=-\frac{1}{m}\sum\limits_{i=1}^{m}\left[y_i\,x_i^j - x_i^j\,P(y_i|x_i,\theta)\right] \\ &=\frac{1}{m}\sum\limits_{i=1}^{m}\left(P(y_i|x_i,\theta)-y_i\right)x_i^j \end{align*} (Proved)
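The last line translates directly into a vectorized implementation: stacking the $x_i$ as rows of a matrix `X`, the gradient is $\frac{1}{m}X^\top(P-y)$. A sketch in NumPy (the helper name `cost_and_grad` and the synthetic data are illustrative, not from the answer above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Cross-entropy cost L(theta) and its gradient (1/m) * X^T (P - y)."""
    m = len(y)
    P = sigmoid(X @ theta)                   # P(y_i = 1 | x_i, theta) for every i
    L = -np.mean(y * np.log(P) + (1 - y) * np.log(1 - P))
    grad = X.T @ (P - y) / m                 # component j is (1/m) * sum_i (P_i - y_i) x_i^j
    return L, grad

# Tiny usage example: a few gradient-descent steps on synthetic data
rng = np.random.default_rng(2)
m, p = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])  # leading column of ones for theta_0
true_theta = np.array([-0.5, 1.0, 2.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=m)).astype(float)

theta = np.zeros(p + 1)
for _ in range(500):
    _, g = cost_and_grad(theta, X, y)
    theta -= 0.1 * g

print(theta)  # roughly approaches true_theta
```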