Partial derivative in gradient descent for two variables

The answer above is a good one, but I thought I'd add some more "layman's" terms that helped me better understand the concepts of partial derivatives. The answers I've seen here and in the Coursera forums don't discuss the chain rule, which is important to know if you're going to understand what this is doing...


It's helpful for me to think of partial derivatives this way: treat the variable you're differentiating with respect to as the variable, and every other term as just a number. Other key concepts that are helpful:

  • For "regular derivatives" of a simple form like $F(x) = cx^n$ , the derivative is simply $F'(x) = cn \times x^{n-1}$
  • The derivative of a constant (a number) is 0.
  • Summations pass through the derivative (the derivative of a sum is the sum of the derivatives), so the summation sign doesn't change; just copy it down in place as you differentiate.

Also, it should be mentioned that the chain rule is being used. The chain rule says (in clunky layman's terms) that for $g(f(x))$, you take the derivative of the outer function $g$, treating $f(x)$ as the variable, and then multiply by the derivative of $f(x)$. For our cost function, think of it this way:

$$ g(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^2 \tag{1}$$

$$ f(\theta_0, \theta_1)^{(i)} = \theta_0 + \theta_{1}x^{(i)} - y^{(i)} \tag{2}$$

To show I'm not pulling funny business, sub in the definition of $f(\theta_0, \theta_1)^{(i)}$ into the definition of $g(\theta_0, \theta_1)$ and you get:

$$ g(f(\theta_0, \theta_1)^{(i)}) = \frac{1}{2m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right)^2 \tag{3}$$

This is, indeed, our entire cost function.
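Before differentiating it, the chain rule itself can be sanity-checked numerically. A minimal sketch (the function $g(u) = u^2$ with $f(x) = 3x + 1$, and the evaluation point, are arbitrary choices for illustration):

```python
# Numerically verify the chain rule on g(f(x)) with g(u) = u**2, f(x) = 3x + 1.
# d/dx g(f(x)) should equal g'(f(x)) * f'(x) = 2*(3x + 1) * 3.

def composed(x):
    return (3 * x + 1) ** 2

def chain_rule(x):
    return 2 * (3 * x + 1) * 3  # g'(f(x)) * f'(x)

x = 1.5
h = 1e-6
numeric = (composed(x + h) - composed(x - h)) / (2 * h)  # central difference
print(numeric, chain_rule(x))  # both ≈ 33.0
```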

Thus, the partial derivatives work like this:

$$ \frac{\partial}{\partial \theta_0} g(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^2 = 2 \times \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^{2-1} = \tag{4}$$

$$\frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)}$$

In other words, just treat $f(\theta_0, \theta_1)^{(i)}$ like a variable and you have the simple derivative $\frac{d}{dx}\left(\frac{1}{2m} x^2\right) = \frac{1}{m}x$

$$ \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_0} (\theta_0 + \theta_{1}x^{(i)} - y^{(i)}) \tag{5}$$

And $\theta_1, x$, and $y$ are just "a number" since we're taking the derivative with respect to $\theta_0$, so the partial of $f(\theta_0, \theta_1)^{(i)}$ becomes:

$$ \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} (\theta_0 + [a \ number][a \ number]^{(i)} - [a \ number]^{(i)}) = \frac{\partial}{\partial \theta_0} \theta_0 = 1 \tag{6}$$

So, using the chain rule, we have:

$$ \frac{\partial}{\partial \theta_0} g(f(\theta_0, \theta_1)^{(i)}) = \frac{\partial}{\partial \theta_0} g(\theta_0, \theta_1) \frac{\partial}{\partial \theta_0}f(\theta_0, \theta_1)^{(i)} \tag{7}$$

And subbing in the partials of $g(\theta_0, \theta_1)$ and $f(\theta_0, \theta_1)^{(i)}$ from above, we have:

$$ \frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \frac{\partial}{\partial \theta_0}f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right) \times 1 = \tag{8}$$

$$ \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right)$$


What about the derivative with respect to $\theta_1$?

Our term $g(\theta_0, \theta_1)$ is identical, so we just need to take the derivative of $f(\theta_0, \theta_1)^{(i)}$, this time treating $\theta_1$ as the variable and the other terms as "just a number." That goes like this:

$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} (\theta_0 + \theta_{1}x^{(i)} - y^{(i)}) \tag{9}$$

$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} ([a \ number] + \theta_{1}[a \ number, x^{(i)}] - [a \ number]) \tag{10}$$

Note that the "just a number", $x^{(i)}$, is important in this case because the derivative of $c \times x$ (where $c$ is some number) is $\frac{d}{dx}(c \times x^1) = c \times 1 \times x^{(1-1=0)} = c \times 1 \times 1 = c$, so the number will carry through. In this case that number is $x^{(i)}$ so we need to keep it. Thus, our derivative is:

$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = 0 + \frac{\partial}{\partial \theta_1}\left(\theta_{1}x^{(i)}\right) - 0 = 1 \times \theta_1^{(1-1=0)} x^{(i)} = 1 \times 1 \times x^{(i)} = x^{(i)} \tag{11}$$

Thus, the entire answer becomes:

$$ \frac{\partial}{\partial \theta_1} g(f(\theta_0, \theta_1)^{(i)}) = \frac{\partial}{\partial \theta_1} g(\theta_0, \theta_1) \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \tag{12}$$

$$\frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \frac{\partial}{\partial \theta_1}f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right) x^{(i)}$$
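Both results are easy to check numerically. A minimal sketch in plain Python, with made-up data and a made-up point $(\theta_0, \theta_1)$, comparing the two formulas above against central finite differences of the cost:

```python
# Finite-difference check of dJ/dtheta0 and dJ/dtheta1 for
# J = (1/2m) * sum((theta0 + theta1*x_i - y_i)**2), using made-up data.

xs = [1.0, 2.0, 3.0]
ys = [2.0, 2.5, 3.5]
m = len(xs)

def J(t0, t1):
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def grad(t0, t1):
    # The formulas derived above.
    d0 = sum(t0 + t1 * x - y for x, y in zip(xs, ys)) / m
    d1 = sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m
    return d0, d1

t0, t1, h = 0.5, 0.8, 1e-6
num0 = (J(t0 + h, t1) - J(t0 - h, t1)) / (2 * h)  # central differences
num1 = (J(t0, t1 + h) - J(t0, t1 - h)) / (2 * h)
print(grad(t0, t1), (num0, num1))  # the two pairs agree closely
```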


A quick addition per @Hugo's comment below. Let's ignore the fact that we're dealing with vectors at all, which drops the summation and the ${}^{(i)}$ superscripts. We can also more easily use real numbers this way.

$\require{cancel}$

Let's say $x = 2$ and $y = 4$.

So, for part 1 you have:

$$\frac{\partial}{\partial \theta_0} (\theta_0 + \theta_{1}x - y)$$

Filling in the values for $x$ and $y$, we have:

$$\frac{\partial}{\partial \theta_0} (\theta_0 + 2\theta_{1} - 4)$$

We only care about $\theta_0$, so $\theta_1$ is treated like a constant (any number, so let's just say it's 6).

$$\frac{\partial}{\partial \theta_0} (\theta_0 + (2 \times 6) - 4) = \frac{\partial}{\partial \theta_0} (\theta_0 + \cancel{8}) = 1$$

Using the same values, let's look at the $\theta_1$ case (same starting point with $x$ and $y$ values input):

$$\frac{\partial}{\partial \theta_1} (\theta_0 + 2\theta_{1} - 4)$$

In this case we do care about $\theta_1$, but $\theta_0$ is treated as a constant; we'll do the same as above and use 6 for its value:

$$\frac{\partial}{\partial \theta_1} (6 + 2\theta_{1} - 4) = \frac{\partial}{\partial \theta_1} (2\theta_{1} + \cancel{2}) = 2 = x$$

The answer is 2 because we ended up with $2\theta_1$ and we had that because $x = 2$.

Hopefully this clarifies a bit why in the first instance (wrt $\theta_0$) I wrote "just a number," and in the second case (wrt $\theta_1$) I wrote "just a number, $x^{(i)}$." While it's true that $x^{(i)}$ is still "just a number," since it's attached to the variable of interest in the second case its value will carry through, which is why we end up with $x^{(i)}$ in the result.
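The same single-sample case can be checked numerically. A minimal sketch using the same $x = 2$, $y = 4$, and the arbitrary 6 from above:

```python
# Single-sample check: for f(t0, t1) = t0 + 2*t1 - 4 (x = 2, y = 4),
# df/dt0 should be 1 and df/dt1 should be x = 2, at any point.

def f(t0, t1):
    return t0 + 2 * t1 - 4

h = 1e-6
t0, t1 = 6.0, 6.0  # the arbitrary "6" used in the text above
d_t0 = (f(t0 + h, t1) - f(t0 - h, t1)) / (2 * h)  # central differences
d_t1 = (f(t0, t1 + h) - f(t0, t1 - h)) / (2 * h)
print(round(d_t0, 6), round(d_t1, 6))  # 1.0 and 2.0
```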


Despite the popularity of the top answer, it has some major errors. The most fundamental problem is that $g(f^{(i)}(\theta_0, \theta_1))$ isn't even defined, much less equal to the original function. The focus on the chain rule as a crucial component is correct, but the actual derivation is not right at all.

So I'll give a correct derivation, followed by my own attempt to get across some intuition about what's going on with partial derivatives, and ending with a brief mention of a cleaner derivation using more sophisticated methods. That said, if you don't know some basic differential calculus already (at least through the chain rule), you realistically aren't going to be able to truly follow any derivation; go learn that first, from literally any calculus resource you can find, if you really want to know.

For completeness, the properties of the derivative that we need are that for any constant $c$ and functions $f(x)$ and $g(x)$, $$\frac{d}{dx} c = 0, \ \frac{d}{dx} x = 1,$$ $$\frac{d}{dx} [c\cdot f(x)] = c\cdot\frac{df}{dx} \ \ \ \text{(linearity)},$$ $$\frac{d}{dx}[f(x)+g(x)] = \frac{df}{dx} + \frac{dg}{dx} \ \ \ \text{(linearity)},$$ $$\frac{d}{dx}[f(x)]^2 = 2f(x)\cdot\frac{df}{dx} \ \ \ \text{(chain rule)}.$$

Taking partial derivatives works essentially the same way, except that the notation $\frac{\partial}{\partial x}f(x,y)$ means we take the derivative by treating $x$ as a variable and $y$ as a constant using the same rules listed above (and vice versa for $\frac{\partial}{\partial y}f(x,y)$).


Derivation

We have

$$h_\theta(x_i) = \theta_0 + \theta_1 x_i$$

and

$$ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x_i)-y_i)^2. $$

We first compute

$$\frac{\partial}{\partial\theta_0}h_\theta(x_i)=\frac{\partial}{\partial\theta_0}(\theta_0 + \theta_1 x_i)=\frac{\partial}{\partial\theta_0}\theta_0 + \frac{\partial}{\partial\theta_0}\theta_1 x_i =1+0=1,$$

$$\frac{\partial}{\partial\theta_1}h_\theta(x_i) =\frac{\partial}{\partial\theta_1}(\theta_0 + \theta_1 x_i)=\frac{\partial}{\partial\theta_1}\theta_0 + \frac{\partial}{\partial\theta_1}\theta_1 x_i =0+x_i=x_i,$$

which we will use later. Now we want to compute the partial derivatives of $J(\theta_0, \theta_1)$. We can actually do both at once since, for $j = 0, 1,$

$$\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j}\left[\frac{1}{2m} \sum_{i=1}^m (h_\theta(x_i)-y_i)^2\right]$$

$$= \frac{1}{2m} \sum_{i=1}^m \frac{\partial}{\partial\theta_j}(h_\theta(x_i)-y_i)^2 \ \text{(by linearity of the derivative)}$$

$$= \frac{1}{2m} \sum_{i=1}^m 2(h_\theta(x_i)-y_i)\frac{\partial}{\partial\theta_j}(h_\theta(x_i)-y_i) \ \text{(by the chain rule)}$$

$$= \frac{1}{2m}\cdot 2\sum_{i=1}^m (h_\theta(x_i)-y_i)\left[\frac{\partial}{\partial\theta_j}h_\theta(x_i)-\frac{\partial}{\partial\theta_j}y_i\right]$$

$$= \frac{1}{m}\sum_{i=1}^m (h_\theta(x_i)-y_i)\left[\frac{\partial}{\partial\theta_j}h_\theta(x_i)-0\right]$$

$$=\frac{1}{m} \sum_{i=1}^m (h_\theta(x_i)-y_i)\frac{\partial}{\partial\theta_j}h_\theta(x_i).$$

Finally substituting for $\frac{\partial}{\partial\theta_j}h_\theta(x_i)$ gives us

$$\frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x_i)-y_i),$$ $$\frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x_i)-y_i)x_i.$$
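These two partials are exactly what the gradient descent update uses. A minimal sketch in plain Python (the data, learning rate, and iteration count are illustrative choices):

```python
# Gradient descent for simple linear regression using the partials
# derived above. Data is generated from y = 1 + 2x, so the thetas
# should converge toward (1, 2).

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m = len(xs)

theta0, theta1, alpha = 0.0, 0.0, 0.05
for _ in range(5000):
    errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errs) / m                             # dJ/dtheta0
    grad1 = sum(e * x for e, x in zip(errs, xs)) / m  # dJ/dtheta1
    theta0 -= alpha * grad0  # simultaneous update of both parameters
    theta1 -= alpha * grad1
print(theta0, theta1)  # close to (1, 2)
```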


Intuition for partial derivatives

So what are partial derivatives anyway? In one variable, we can assign a single number to a function $f(x)$ to best describe the rate at which that function is changing at a given value of $x$; this is precisely the derivative $\frac{df}{dx}$ of $f$ at that point. We would like to do something similar with functions of several variables, say $g(x,y)$, but we immediately run into a problem. In one variable, we can only change the independent variable in two directions, forward and backward, and the change in $f$ is equal and opposite in these two cases. (For example, if $f$ is increasing at a rate of 2 per unit increase in $x$, then it's decreasing at a rate of 2 per unit decrease in $x$.)

With more variables we suddenly have infinitely many different directions in which we can move from a given point and we may have different rates of change depending on which direction we choose. So a single number will no longer capture how a multi-variable function is changing at a given point. However, there are certain specific directions that are easy (well, easier) and natural to work with: the ones that run parallel to the coordinate axes of our independent variables. The resulting rates of change are called partial derivatives. (For example, $g(x,y)$ has partial derivatives $\frac{\partial g}{\partial x}$ and $\frac{\partial g}{\partial y}$ from moving parallel to the $x$- and $y$-axes, respectively.) Even though there are infinitely many different directions one can go in, it turns out that these partial derivatives give us enough information to compute the rate of change for any other direction. (Strictly speaking, this is a slight white lie. There are functions where all the partial derivatives exist at a point, but the function is not considered differentiable at that point. This happens when the graph is not sufficiently "smooth" there.)

In particular, the gradient $\nabla g = (\frac{\partial g}{\partial x}, \frac{\partial g}{\partial y})$ specifies the direction in which $g$ increases most rapidly at a given point, and $-\nabla g = (-\frac{\partial g}{\partial x}, -\frac{\partial g}{\partial y})$ gives the direction in which $g$ decreases most rapidly; this latter direction is the one we want for gradient descent. This makes sense in this context, because we want to decrease the cost, ideally as quickly as possible.
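To make the steepest-descent claim concrete, here is a small numeric sketch (the function $g(x,y) = x^2 + 3y^2$ and the base point are arbitrary choices): among many unit directions, a small fixed-length step along $-\nabla g$ decreases $g$ the most.

```python
import math

# For g(x, y) = x**2 + 3*y**2, compare the decrease in g from taking a
# small fixed-length step in many directions; -grad(g) should win.

def g(x, y):
    return x ** 2 + 3 * y ** 2

def grad(x, y):
    return (2 * x, 6 * y)

x0, y0, step = 1.0, 1.0, 1e-3
gx, gy = grad(x0, y0)
norm = math.hypot(gx, gy)
best_dir = (-gx / norm, -gy / norm)  # unit vector along -grad

decreases = {}
for k in range(360):                 # sweep unit directions, one per degree
    a = math.radians(k)
    d = (math.cos(a), math.sin(a))
    decreases[d] = g(x0, y0) - g(x0 + step * d[0], y0 + step * d[1])

best = max(decreases, key=decreases.get)
print(best, best_dir)  # the winning direction is (approximately) -grad/|grad|
```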


A higher level approach

For the interested, there is a way to view $J$ as a simple composition, namely

$$J(\mathbf{\theta}) = \frac{1}{2m} \|\mathbf{h_\theta}(\mathbf{x})-\mathbf{y}\|^2 = \frac{1}{2m} \|X\mathbf{\theta}-\mathbf{y}\|^2.$$

Note that $\mathbf{\theta}$, $\mathbf{h_\theta}(\mathbf{x})$, $\mathbf{x}$, and $\mathbf{y}$, are now vectors. Using more advanced notions of the derivative (i.e. the total derivative or Jacobian), the multivariable chain rule, and a tiny bit of linear algebra, one can actually differentiate this directly to get

$$\frac{\partial J}{\partial\mathbf{\theta}} = \frac{1}{m}(X\mathbf{\theta}-\mathbf{y})^\top X.$$

The transpose of this is the gradient $\nabla_\theta J = \frac{1}{m}X^\top (X\mathbf{\theta}-\mathbf{y})$. Setting this gradient equal to $\mathbf{0}$ and solving for $\mathbf{\theta}$ is in fact exactly how one derives the explicit formula for linear regression.
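In the two-parameter case, solving $\nabla_\theta J = \mathbf{0}$ by hand gives the familiar least-squares formulas $\theta_1 = \operatorname{cov}(x,y)/\operatorname{var}(x)$ and $\theta_0 = \bar{y} - \theta_1\bar{x}$. A minimal sketch with made-up data (pure Python rather than a linear-algebra library):

```python
# Solve the normal equations for simple linear regression directly:
# theta1 = cov(x, y) / var(x), theta0 = mean(y) - theta1 * mean(x).

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]  # roughly y = 1 + 2x plus noise
m = len(xs)

x_bar = sum(xs) / m
y_bar = sum(ys) / m
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
var = sum((x - x_bar) ** 2 for x in xs)

theta1 = cov / var
theta0 = y_bar - theta1 * x_bar
print(theta0, theta1)  # close to the generating line y = 1 + 2x
```

At this solution both components of the gradient vanish: the residuals sum to zero and are orthogonal to the inputs.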


> conceptually I understand what a derivative represents.

So let us start from that. Consider a function $\theta\mapsto F(\theta)$ of a parameter $\theta$, defined at least on an interval $(\theta_*-\varepsilon,\theta_*+\varepsilon)$ around the point $\theta_*$. Then the derivative of $F$ at $\theta_*$, when it exists, is the number $$ F'(\theta_*)=\lim\limits_{\theta\to\theta_*}\frac{F(\theta)-F(\theta_*)}{\theta-\theta_*}. $$ Less formally, you want $F(\theta)-F(\theta_*)-F'(\theta_*)(\theta-\theta_*)$ to be small with respect to $\theta-\theta_*$ when $\theta$ is close to $\theta_*$.

One can also do this with a function of several parameters, fixing every parameter except one. The result is called a partial derivative. In your setting, $J$ depends on two parameters, hence one can fix the second one to $\theta_1$ and consider the function $F:\theta\mapsto J(\theta,\theta_1)$. If $F$ has a derivative $F'(\theta_0)$ at a point $\theta_0$, its value is denoted by $\dfrac{\partial}{\partial \theta_0}J(\theta_0,\theta_1)$.

Or, one can fix the first parameter to $\theta_0$ and consider the function $G:\theta\mapsto J(\theta_0,\theta)$. If $G$ has a derivative $G'(\theta_1)$ at a point $\theta_1$, its value is denoted by $\dfrac{\partial}{\partial \theta_1}J(\theta_0,\theta_1)$.

You consider a function $J$ that is a linear combination of functions $K:(\theta_0,\theta_1)\mapsto(\theta_0+a\theta_1-b)^2$. Since derivatives and partial derivatives are linear functionals of the function, one can consider each function $K$ separately. And since the derivative of $t\mapsto t^2$ is $t\mapsto2t$, one sees that $\dfrac{\partial}{\partial \theta_0}K(\theta_0,\theta_1)=2(\theta_0+a\theta_1-b)$ and $\dfrac{\partial}{\partial \theta_1}K(\theta_0,\theta_1)=2a(\theta_0+a\theta_1-b)$.
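These two formulas for $K$ can be verified numerically; a minimal sketch with arbitrary values of $a$, $b$, $\theta_0$, $\theta_1$:

```python
# Check dK/dtheta0 = 2*(t0 + a*t1 - b) and dK/dtheta1 = 2*a*(t0 + a*t1 - b)
# for K(t0, t1) = (t0 + a*t1 - b)**2, via central differences.

a, b = 3.0, 1.0          # arbitrary constants
t0, t1, h = 0.7, -0.2, 1e-6

def K(u0, u1):
    return (u0 + a * u1 - b) ** 2

formula0 = 2 * (t0 + a * t1 - b)
formula1 = 2 * a * (t0 + a * t1 - b)
num0 = (K(t0 + h, t1) - K(t0 - h, t1)) / (2 * h)
num1 = (K(t0, t1 + h) - K(t0, t1 - h)) / (2 * h)
print(formula0, num0, formula1, num1)  # each pair agrees
```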