What is the intuitive interpretation of the transpose compared to the inverse?

We usually see matrices as linear transformations. The inverse of $A$, when it exists, simply "reverses" what $A$ does as a function. The transpose comes from a different point of view.

Suppose we have vector spaces $X,Y$ and a linear map $A:X\to Y$. For many reasons it is useful to look at the linear functionals on these spaces; this gives the dual $$ X^*=\{f:X\to\mathbb R:\ f\ \text{is linear}\}, $$ and correspondingly $Y^*$. The map $A$ induces a natural map $A^*:Y^*\to X^*$ by $$ (A^*g)(x)=g(Ax). $$ In the particular case $X=\mathbb R^n$, $Y=\mathbb R^m$, one can check that $X^*\cong X$ and $Y^*\cong Y$, in the sense that every linear functional $f:\mathbb R^n\to\mathbb R$ is of the form $f(x)=y^Tx$ for some fixed $y\in\mathbb R^n$. In this situation $A$ is an $m\times n$ matrix, and the matrix of $A^*$ is the transpose of $A$.
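A quick numerical sanity check of this (a sketch with numpy; the matrix $A$ and the vector $w$ representing the functional $g$ are arbitrary examples): if $g(y)=w^Ty$, then $(A^*g)(x)=w^TAx=(A^Tw)^Tx$, so $A^*g$ is represented by $A^Tw$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, n))   # A : R^n -> R^m
w = rng.standard_normal(m)        # represents the functional g(y) = w^T y on R^m
x = rng.standard_normal(n)

g = lambda y: w @ y
A_star_g = lambda x: g(A @ x)     # the induced functional (A* g)(x) = g(Ax)

# A* g should be the functional represented by the vector A^T w
assert np.isclose(A_star_g(x), (A.T @ w) @ x)
```

Running the check over a basis of $\mathbb R^n$ (instead of one random $x$) would pin down the representing vector uniquely, but a random $x$ already makes the point.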


Something weird is going on here. I'm assuming $g: \mathbb R^m \to \mathbb R$, and say $A$ is an $m\times n$ matrix. Let $a: \mathbb R^n \to \mathbb R^m$, $x \mapsto Ax + b$, be the corresponding affine transformation, so that $f = g \circ a$. The chain rule says $Df(x) = Dg(a(x))\, Da(x)$.

The Jacobian of $g$ is $\nabla g$, a $1\times m$ matrix (a row vector), while the Jacobian of $a$ is $A$, an $m \times n$ matrix. The dimensions all agree: this makes $\nabla f$ a $1\times n$ matrix, which matches the fact that the derivative of $f$ is a linear map $\mathbb R^n \to \mathbb R$.
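The shape bookkeeping above can be verified numerically (a sketch; the particular $g(y)=\sum_i \sin y_i$ below is just an example, chosen because its gradient is easy to write down):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

a = lambda x: A @ x + b          # affine map R^n -> R^m
g = lambda y: np.sin(y).sum()    # example g : R^m -> R
grad_g = lambda y: np.cos(y)     # its gradient, an m-vector
f = lambda x: g(a(x))

# Chain rule: Df(x) = Dg(a(x)) Da(x) = (1 x m row) @ (m x n matrix) = 1 x n row
row = grad_g(a(x)) @ A
assert row.shape == (n,)

# Central finite differences on f agree with the chain-rule row vector
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(row, fd, atol=1e-5)
```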

So what I suspect is happening is some identification of $\mathbb R^n$ with its dual space under the Euclidean inner product; that is, you're realizing the gradient as a column vector instead of a row vector. The transpose is precisely the way this is done. If $T: V \to W$ is a linear transformation, then its adjoint is $T^\dagger: W^* \to V^*$. But under the Euclidean inner product, you can identify $\mathbb R^n \cong (\mathbb R^n)^*$, so $$ (\nabla g(a(x)) A)^T = A^T [\nabla g(a(x))]^T = A^T \nabla g(a(x))$$ where we're abusing notation by identifying the row vector $\nabla g$ with the column vector $\nabla g$. This hidden identification is likely what is confusing you.


Notice, using the chain rule, that $$D_pf(v)=\langle\nabla g(Ap+b),Av\rangle=\langle A^T\nabla g(Ap+b),v\rangle.$$ Comparing with $D_pf(v)=\langle\nabla f(p),v\rangle$ gives $\nabla f(p)=A^T\nabla g(Ap+b)$.
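The second equality is just the defining property $\langle u, Av\rangle = \langle A^T u, v\rangle$ of the transpose with respect to the Euclidean inner product, which is easy to check numerically (a sketch; $g(y)=\sum_i \sin y_i$ is again an arbitrary example with a known gradient):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
p = rng.standard_normal(n)
v = rng.standard_normal(n)

grad_g = lambda y: np.cos(y)   # gradient of the example g(y) = sum(sin(y))

u = grad_g(A @ p + b)          # the vector \nabla g(Ap+b) in R^m
# <u, Av> = <A^T u, v>: moving A across the inner product transposes it
assert np.isclose(u @ (A @ v), (A.T @ u) @ v)
```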