Derivative as a linear transformation

I think it may be easier for you to start by thinking about the one-dimensional case; so take a differentiable function $f \colon \mathbb{R} \to \mathbb{R}$ and assume that you want to differentiate it at the point $0$. Let us also assume that $f(0) = 0$; if this is not true, you can translate your axes so that it becomes true.

You probably already know that $f'(0)$ is the slope of the straight line tangent to $f$ and passing through $(0, 0)$:

[Figure: derivative of a function at the origin]

(the solid line is the function $f$, the dashed line is the tangent line at the point $(0, 0)$)

This is a perfectly legitimate way to think about a derivative. But there is another one, which is helpful for understanding differentiation in more complex cases.

Instead of thinking of the derivative of a function at a point as a number, we want to think of it as another function (note that when I say "function" I don't mean the derivative function $f'$ as a whole; I'm speaking of the derivative at a single point as a function in its own right). In particular, I have the function $f$ near the point $0$ and I want to find another function which is linear and is the "best possible approximation" of $f$ near $0$. Which function is this? It is the one that maps $$ x \mapsto f'(0) \cdot x . $$

This whole function is what your book calls $[f'(0)]$. Let me stress again that this is not just a number; it is a whole function. It is the dashed line in the picture above. So, while $f'(0)$ is just the slope of the dashed line, $[f'(0)]$ is the dashed line itself (or, more precisely, the function whose graph is the dashed line).
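To make this concrete, here is a small worked example (my own, just for illustration): take $f(x) = \sin x$, which satisfies $f(0) = 0$. Then $$ f'(0) = \cos 0 = 1, \qquad [f'(0)] \colon x \mapsto 1 \cdot x = x , $$ so the best linear approximation of $\sin$ near $0$ is the identity map; this is just the familiar approximation $\sin x \approx x$ for small $x$.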

The whole point of differentiation is this: taking functions that are not necessarily linear and "making them linear", with respect to one point of the domain. Why do we take derivatives? Because linear functions are much more beautiful and their behavior is much easier to understand (of course, if the initial function $f$ is already linear, then we do not have to do much: its derivative, taken at any point, will be the function $f$ itself).

Now, this was the easy case: a function from $\mathbb{R}$ to $\mathbb{R}$. The point is that the same reasoning works in more complex situations. For example, if you have a nonlinear function $f \colon \mathbb{R}^n \to \mathbb{R}^m$, you can perfectly well define its derivative at a certain point $a$ as the function $\mathbb{R}^n \to \mathbb{R}^m$ that is linear and best approximates $f$ near $a$. Of course you still have to understand whether it exists and is unique and, even then, you may want to know how to make computations with it, but at least for the definition this is the simplest way to think about a derivative.
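For instance (an example of mine, not from your book): take $f \colon \mathbb{R}^2 \to \mathbb{R}^2$, $f(x, y) = (x^2, xy)$, and the point $a = (1, 2)$. The linear map that best approximates $f$ near $a$ is $$ (h, k) \mapsto (2h,\; 2h + k) , $$ since $f(1+h,\, 2+k) - f(1, 2) = (2h + h^2,\; 2h + k + hk)$ and the quadratic terms $h^2$ and $hk$ are negligible for small $(h, k)$.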

About your question on the correspondence between vectors and linear functions: if you are still not confident enough with linear algebra, do as before and think in just one dimension. In one dimension, the statement reads:

There is a correspondence between real numbers and linear functions $\mathbb{R} \to \mathbb{R}$.

What is it? If you have a real number $m$, you can make a function out of it just by taking the linear function $x \mapsto mx$. If you have a linear function, you just take its slope and you have found a number. These two operations are inverse to each other.
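Written out explicitly (in notation I am introducing just here), the correspondence is the pair of maps $$ m \mapsto \ell_m , \quad \ell_m(x) := mx , \qquad \text{and} \qquad \ell \mapsto \ell(1) . $$ Indeed $\ell_m(1) = m$, and for any linear $\ell$ we have $\ell(x) = x\,\ell(1)$, so each map undoes the other.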

This is the same thing that we did above with differentiation: if you have a function and you know its derivative as a number, you can easily construct its derivative as a function in the way I suggested above. If you have the function, then the number is nothing other than its slope.

When you know more about vector spaces (and I suggest you get to them soon), you will see that everything I just wrote works just the same in that setting.

Why do we think of derivatives this way?

I would mention a couple of reasons.

The first one is practical: numbers are easy to work with as long as you have just one dimension. They become a bit more complicated in more dimensions, but are still feasible. In more complicated contexts, though, they would become a real nightmare: on Riemannian manifolds you wouldn't know how to choose them; in infinite dimensions you would have infinitely many of them (and this would be the least of the technical complications). Linear maps, instead, are always pretty easy to define and usually behave the right way. In those contexts it is much better to work with them.

The other reason is more theoretical, and I already sketched it above. The point is: "Why do we take derivatives? What are they good for?" My view is that derivatives are mostly a way to make difficult things easier. Arbitrary smooth functions can be very complicated. Linear functions are instead very easy to understand. You can compose them, you can describe them easily, you know their behavior. So having a "magic wand" that takes a smooth function and turns it into a linear function that locally shares some of the features of the original function is really desirable (with reference to the picture: the wand takes the solid line and turns it into the dashed line). Differentiation is this magic wand.


To be more precise, the function $g(r) := [f'(a)]r$ is a linear transformation from $\mathbb{R}$ to $Y$. It maps a small perturbation $r$ around $a$ to approximately $f(a+r) - f(a)$. Since $g$ is characterized by $[f'(a)]$, as a shorthand we can say that $[f'(a)]$ is a linear transformation. When $a \in \mathbb{R}^n$ for $n > 1$, there will be an analogous linear transformation, and $f'(a)$ will be the matrix that characterizes it.
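As a quick sanity check (my own example, with $Y = \mathbb{R}$): take $f(x) = x^2$ and $a = 3$, so $[f'(3)] = 6$ and $g(r) = 6r$. Indeed $$ f(3 + r) - f(3) = (3+r)^2 - 9 = 6r + r^2 = g(r) + O(r^2) , $$ so $g$ captures exactly the part of the change in $f$ that is linear in the perturbation $r$.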


As soon as you go beyond elementary one-variable calculus you should always consider the derivative of a map $f:\>N\to M$ at a point $p\in N$ as a linear map $$df(p):\quad T_p\to T_{f(p)}$$ from the tangent space at $p$ to the tangent space at $f(p)$. When $N$ (or $M$) are (subsets of) real vector spaces we have the natural embedding $$\exp:\quad T_p\to N,\qquad X\mapsto p+X$$ which is not always made explicit. In this spirit one simply writes $$f(p+X)-f(p)=df(p).X+o(|X|)\qquad(X\to0)\ .$$ Depending on the exact circumstances the derivative $df(p)$ may be encoded in various ways.

When $N\subset{\mathbb R}^n$ and $M={\mathbb R}^m$ we have canonical bases in the tangent spaces of both $N$ and $M$, and the natural encoding of $df(p)$ is then the Jacobian matrix $\bigl[f_{i,k}(p)\bigr]$, where $$f_{i,k}(p):={\partial f_i\over\partial x_k}\Biggr|_p\qquad(1\leq i\leq m, \ 1\leq k\leq n)\ .$$
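Written out in full (rows indexed by $i$, columns by $k$), this is the familiar $m\times n$ matrix $$\bigl[f_{i,k}(p)\bigr]=\begin{pmatrix}{\partial f_1\over\partial x_1} & \cdots & {\partial f_1\over\partial x_n}\\ \vdots & & \vdots\\ {\partial f_m\over\partial x_1} & \cdots & {\partial f_m\over\partial x_n}\end{pmatrix}\Biggr|_p\ ,$$ which acts on a column vector $X\in T_p$ by the usual matrix-vector product $df(p).X$.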

When the range $M={\mathbb R}$ and a scalar product is defined in $T_p$ then $df(p)$ can be encoded as the gradient $\nabla f(p)\in T_p$. I won't go into this here.

In the case of your example we have $N={\mathbb R}$. This implies that at every point $p\in N$ we have a natural basis of $T_p$, namely the vector $e_1=(1)$ (a $1$-tuple), and $df(p)$ as a map is completely determined by its value on this vector. It is customary to write $$df(p).e_1=:f'(p)\in T_{f(p)}=Y\ .$$ Therefore we have in this case $$\eqalign{f(p+X)-f(p)&=f(p+Xe_1)-f(p)\cr &=df(p).(Xe_1)+o(|X|)\cr &=X\>f'(p)+o(|X|)\qquad\qquad(X\to0)\ .\cr}$$
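For a concrete instance (my own example): let $f:\>{\mathbb R}\to{\mathbb R}^2$, $f(t)=(\cos t,\sin t)$. Then $f'(p)=(-\sin p,\cos p)\in T_{f(p)}={\mathbb R}^2$, and $df(p)$ is the linear map $X\mapsto X\,(-\sin p,\cos p)$ taking the $1$-dimensional increment $X$ to a tangent vector of the circle at $f(p)$.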