How can we think and/or write rigorously about integration by substitution?

From the perspective of an elementary calculus student (by which I mean that it can supposedly be made rigorous later on, but isn't in introductory classes), the $du$, $dx$ stuff is absolute nonsense and I will never understand why it continues to be used by so many professors. It really is the one glaring hole in most otherwise rigorous calculus courses. Really bizarre.

Anyway, the real story is told using composition of functions. Where $f$ is an integrable function on $[a, b]$, I'll write $\int_a^b f$ for the integral of $f$ over $[a, b]$, since it really is something determined by the function itself; there are no "variables" (whatever that could mean) anywhere.

Then we have:

$$\int^b_a(f\circ\phi)\phi'=\int^{\phi(b)}_{\phi(a)}f$$

There are appropriate assumptions that need to be made about $f$ and $\phi$, which are better explained on Wikipedia.

Thus for example let's say we want to integrate

$$\int^2_1\frac {2x} {1+x^2}$$

Well, defining $f(x)=\frac 1 x$ and $\phi(x)=1+x^2$, the integrand is precisely $(f\circ\phi)\phi'$, so the integral is equal to

$$\int^{\phi(2)}_{\phi(1)}\frac 1 x=\int^{5}_{2}\frac 1 x$$
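As a sanity check, the identity can be tested numerically for this example. The following is only a sketch: `simpson` is a hand-rolled helper written for illustration (composite Simpson's rule), and the names `f`, `phi` and `dphi` simply mirror the $f$, $\phi$ and $\phi'$ above. Both sides should agree with each other and with $\ln 5 - \ln 2$, the value of $\int_2^5 \frac 1 x$.

```python
import math

def simpson(g, lo, hi, n=1000):
    """Composite Simpson's rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for k in range(1, n):
        # Interior nodes alternate between weight 4 (odd k) and 2 (even k).
        total += (4 if k % 2 else 2) * g(lo + k * h)
    return total * h / 3

f = lambda u: 1 / u          # outer function
phi = lambda x: 1 + x * x    # substitution function
dphi = lambda x: 2 * x       # its derivative

# Left side: integral of (f o phi) * phi' over [1, 2].
lhs = simpson(lambda x: f(phi(x)) * dphi(x), 1, 2)
# Right side: integral of f over [phi(1), phi(2)] = [2, 5].
rhs = simpson(f, phi(1), phi(2))
exact = math.log(5) - math.log(2)

print(lhs, rhs, exact)
```

The three printed numbers should coincide up to the accuracy of the quadrature.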

Note: I can say from personal experience that thinking with this approach is much slower than the "multiply both sides by $dx$" approach that most of your classmates will be using. I recommend practicing thinking with function composition a lot until you can do it quickly and fluently.


This baffled me too when I first came across it; Jack M gave a good answer about what really is behind it. For the practicing mathematician, the "symbolic approach" via variables, differentials and so on is just a mental shorthand that works thanks to cleverly chosen notation. Bear in mind that the chain rule can be read in two directions: one where you see the functions $\varphi$ and $f$ at once, as in $\int_a^b x\cos(x^2+2)\, dx$, and the other where you need to "compute" your $\varphi$ in some way, as in $\int_a^b \cos(x^2+2)\, dx$ (here you cannot apply the chain rule directly by reading it off the formula; you need to rearrange a little).

So what are you doing here "mathematically"? You just compute your substitution function $\varphi$! This can be done with all this "magical" differential quotient stuff. Suppose you have an integral of the form $$ \int_a^b f(\varphi(x))\, dx, $$ which, as Jack M pointed out, has nothing to do with the variable $x$ but is a function of functions (sometimes called a functional); the variables are in this sense just "notational conventions" that provide these mental shorthands for the chain rule. Okay, suppose $\varphi$ is invertible, and let $\psi := \varphi^{-1}$. Then $$ \int_a^b f(\varphi(x))\, dx = \int_{\varphi(a)}^{\varphi(b)} \psi'(x) f(\varphi(\psi(x)))\, dx = \int_{\varphi(a)}^{\varphi(b)} \psi'(x) f(x)\, dx. $$ The new limits are $\varphi(a)$ and $\varphi(b)$ because $\psi(\varphi(a)) = a$ and $\psi(\varphi(b)) = b$. This is just the chain rule, with $\psi$ as the substitution function; you can easily state it in the language Jack M uses, and what I call the integration variable ($x$ or $t$ or whatever) doesn't matter!

But how do you compute $\psi$? Just as you would compute the inverse of $\varphi$: write $y = \varphi(x)$, solve for $x$, then rename. You will see that these are exactly the steps that are "hidden" in the "symbolic-differential application" of the chain rule. For my example, $$ t = x^2 + 2 \mbox{ which has inverse } x = \sqrt{t - 2} $$ (this only works if, for example, $0 < a < b$, where the function is indeed invertible!). Now compute its derivative, plug in, and you get $$ \int_{\varphi(a)}^{\varphi(b)} \psi'(t)\cdot \cos(t)\, dt = \int_{a^2+2}^{b^2+2} \left( \frac{1}{2\sqrt{t-2}} \right) \cdot \cos(t)\, dt = \int_{a^2+2}^{b^2+2} \frac{\cos(x)}{2\sqrt{x-2}}\, dx. $$ But what are you doing when you apply the symbolic method? You put $t = x^2 + 2$, then compute $dt / dx = 2x$ to get $dx = dt / (2x)$. If you plug this in you still have $x$ in it, so you solve for $x$ to get $x = \sqrt{t - 2}$. Do you see how the computation of the inverse function, and the plugging in of its derivative, are contained in these steps? Of course, the precise assumptions are hidden. To make this more rigorous, first your substitution function must be invertible, and then what is at work here is the rule for differentiating an inverse function, $(\varphi^{-1})' = 1/(\varphi' \circ \varphi^{-1})$ (do you see how this formula is hidden in the above steps?), which fits nicely with this "differential shorthand notation" too, and that is what makes this "mental shorthand" work.
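A quick numerical check of the substitution $t = x^2 + 2$ may help here. This is a sketch under assumptions: the endpoints $a = 1$, $b = 2$ are chosen only for illustration (so that $0 < a < b$), `simpson` is a hand-rolled Simpson's rule helper, and the $t$-integral is taken over $[\varphi(a), \varphi(b)] = [a^2+2,\, b^2+2]$, the image of $[a, b]$ under $\varphi$.

```python
import math

def simpson(g, lo, hi, n=2000):
    """Composite Simpson's rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(lo + k * h)
    return total * h / 3

a, b = 1.0, 2.0                                  # assumed endpoints, 0 < a < b

phi = lambda x: x * x + 2                        # t = phi(x)
psi = lambda t: math.sqrt(t - 2)                 # inverse of phi on x > 0
dpsi = lambda t: 1 / (2 * math.sqrt(t - 2))      # derivative of psi

# Original integral in x.
lhs = simpson(lambda x: math.cos(phi(x)), a, b)
# Transformed integral in t, over [phi(a), phi(b)].
rhs = simpson(lambda t: dpsi(t) * math.cos(t), phi(a), phi(b))

print(lhs, rhs)
```

The same script can be used to check the inverse-function rule pointwise: `dpsi(t)` should equal `1 / (2 * psi(t))`, which is $1/\varphi'(\psi(t))$.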


ADDED: To be clear, (1) I can see why such (perhaps sloppy) notation can be confusing; it's quite natural that many find it confusing at first, and I used to as well; and (2) nonetheless I believe that there is value in this type of abuse of notation. But yes, it demands explanation. I will first explain in generality (which you probably know, but for others who might ask the same questions) and then get to your specific questions.

$dx, dt, dy$. There are two crucial points to be made about this kind of notation.

First, this style of notation is referring to the intuition of variables and infinitesimals.

Let's say you come across something like the following argument:

Consider the unit circle. This is defined by $$x^2 + y^2 = 1$$ By differentiating, we can see that the following relation holds on the unit circle: $$2 x dx + 2 y dy = 0$$ Therefore, the circle has the property that blah blah.

Note how the argument uses $dx, dy$ liberally as if they are well defined actual quantities. When a mathematician is reading the middle part of that argument, he might intuitively interpret it as follows (let's take aside its possible formal meanings for a moment):

If $(x,y)$ is an arbitrary point on the unit circle and if $(x+dx, y+dy)$ is another point on the circle extremely close to $(x,y)$, then the four numbers $x, y, dx, dy$ have the property that $2 x dx + 2 y dy$ is extremely close to $0$ (compared to $dx,dy$).

Actually, he could be visualizing $(x,y)$ as continuously changing for the duration of one second, and maybe divide the duration of one second into a million steps (the duration of each step is one microsecond). For example, he could imagine the value of $(x,y)$ as changing from $(0,1)$ (the north pole) to $(0, -1)$ (the south pole) while going down the right side of the circle. For each step, if $(x,y)$ is the position of the moving point at that moment and if $(x+dx, y+dy)$ is the position at the next moment (the next step), then the four numbers $x, y, dx, dy$ have the property that $2 x dx + 2 y dy$ is extremely close to $0$.
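This picture can be tried out directly. The sketch below assumes one particular movement, a constant-speed walk $(\sin\theta, \cos\theta)$ with $\theta$ going from $0$ to $\pi$, which is only one of many ways to go down the right side of the circle. It records how large $2x\,dx + 2y\,dy$ gets relative to the length of the step $(dx, dy)$.

```python
import math

N = 1_000_000  # number of steps, as in the "million steps" picture

def point(k):
    """Position after k steps, moving from (0, 1) to (0, -1) down the right side."""
    theta = math.pi * k / N
    return math.sin(theta), math.cos(theta)

worst_ratio = 0.0
x, y = point(0)
for k in range(1, N + 1):
    nx, ny = point(k)
    dx, dy = nx - x, ny - y
    # Size of 2x dx + 2y dy compared with the size of the step itself.
    residual = abs(2 * x * dx + 2 * y * dy)
    worst_ratio = max(worst_ratio, residual / math.hypot(dx, dy))
    x, y = nx, ny

print(worst_ratio)
```

The printed ratio is tiny, and it shrinks further as `N` grows, which is the precise sense of "extremely close to $0$ (compared to $dx, dy$)".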

As for its possible formal meanings, depending on his preference or context, the mathematician could for example take its meaning to be a statement about differential forms, or a statement about parametrized curves (representing the circle), or maybe just a geometric statement about tangents to the circle.

For the substitution in your post, the starting relation is $u = \sin t$. You are supposed to imagine the value of the pair $(t, u)$ continuously changing from $(a, \sin a)$ to $(b, \sin b)$ along the sine curve, say for the duration of one second. It does not matter if $b$ is smaller than $a$ or not. It does not matter whether $u$ happens to take some specific value, say 0, more than once between the initial moment $t = a$ and the last moment $t = b$. It does not matter if the movement is at constant speed or not. We could even change direction at some moment and then change direction again at another moment. The only thing that matters is that you can imagine some movement along the curve that starts from $(a, \sin a)$ and finishes at $(b, \sin b)$. Now divide it into a million steps. For each step, if $(t,u)$ is the position of the moving point at that moment and if $(t+dt, u+du)$ is the position at the next moment, then the four quantities $t,u,dt,du$ have the property that $du$ is extremely close to $\cos t \, dt$.

Second, it is a good thing to use an argument that uses $dx, dt, dy$ liberally as if they are well defined things, as long as the user (of the argument) and the reader can easily come up with a way to convert the argument into a rigorous argument free of such liberal use. In most cases this can be done by using the notion of parametrized (differentiable) curves and/or the Riemann–Stieltjes integral. For the case in your question, conversion is much simpler, as demonstrated in some other answer, or as you noted in your post. Of course, it's also worth noting that one can come to nonsensical conclusions by reasoning about infinitesimals in an unregulated way, as demonstrated with the famous "proof" of $1 = \sqrt{2}$. Hence the "as long as" clause.

To answer your question about the notation $\int_{t=a}^{t=b} u \, du$: again, imagine the value of $(t,u)$ changing from $t=a$ to $t=b$ along the curve for the duration of one second. Divide the duration into a million steps.

Pretend that the expression $$\int_{t=a}^{t=b} u \, du$$ just means the result you get when you sum $u \, du$ during this one second. Think of this as a sum of a million terms: one term for each moment. Yes, this expression only makes sense when a specific substitution like $u = \sin t$ is agreed upon beforehand. The formal meaning of the expression should be obvious from the intuition.

The expression $$[\frac12 u^2]_{t=a}^{t=b}$$ means the value of $\frac12 u^2$ at the last moment minus that at the first moment. This expression too only makes sense when a substitution is agreed upon.

We can also pretend that the expression $$\int_a^b \sin t \cos t \, dt$$ means a sum of a million terms: one term for each moment. For each moment, $\sin t \cos t \, dt$ is added. $\sin t \cos t \, dt$ is (almost) equal to $u \, du$ at each moment, and that is the intuition behind: $$\int_a^b \sin t \cos t \, dt = \int_{t=a}^{t=b} u \, du$$

At each moment, $u\, du$ is (almost) equal to the value of $\frac12 u^2$ at the next moment minus the value of $\frac12 u^2$ at that moment. This is usually expressed as $$d(\frac12 u^2) = u\, du$$ and you can take its formal meaning to be that $\frac12 u^2$ is an antiderivative of $u$ (as a function of $u$), and so we can write (by the formal meaning, or by the intuition): $$\int_{t=a}^{t=b} u \, du = [\frac12 u^2]_{t=a}^{t=b}$$
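The "sum of a million terms" reading can be carried out literally. The following sketch assumes the endpoints $a = 0$, $b = 2$ (chosen so that $u = \sin t$ is not monotone along the way) and a movement at constant speed in $t$; both are illustrative choices, not part of the argument.

```python
import math

a, b = 0.0, 2.0     # assumed endpoints; sin t rises and then falls on [0, 2]
N = 1_000_000       # one term per "moment"
dt = (b - a) / N

sum_t = 0.0         # running sum of sin t * cos t * dt
sum_u = 0.0         # running sum of u * du
for k in range(N):
    t = a + k * dt
    u = math.sin(t)
    du = math.sin(t + dt) - u      # change in u over this step
    sum_t += math.sin(t) * math.cos(t) * dt
    sum_u += u * du

# Value of u^2 / 2 at the last moment minus its value at the first moment.
boundary = 0.5 * math.sin(b) ** 2 - 0.5 * math.sin(a) ** 2

print(sum_t, sum_u, boundary)
```

The three numbers agree up to an error that shrinks as `N` grows, even though $u$ passes through some values twice during the movement.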