How were modular forms discovered?

You don't need the language of group theory to talk about some aspects of groups. For example, number theorists going back to Fermat were studying the group of units mod $m$ (including things like the order of a unit mod $m$) and geometers were studying groups of motions in space long before anyone defined a "group".

The first modular forms (of level $4$, not level $1$) were found by Gauss in his work on the arithmetic-geometric mean around 1800, and it would take until the end of the 19th century for the term "modular form" to be introduced, in 1890. I provided links for this near the end of my answer here, but I don't want to discuss this further.

Instead I want to show how one reason for interest in modular forms in the 19th century was the construction of nonconstant meromorphic functions on Riemann surfaces of higher genus. This will explain why someone might care about functions satisfying a transformation rule like $f((a\tau+b)/(c\tau+d)) = (c\tau+d)^kf(\tau)$.

Compact Riemann surfaces of genus 1 arise in the form $\mathbf C/L$ where $L$ is a discrete subgroup of $\mathbf C$. Meromorphic functions on $\mathbf C/L$ are the elliptic functions, which in various forms were studied throughout much of the 19th century (by Jacobi, Weierstrass, et al.). We'd like to find a similar story for compact Riemann surfaces of genus greater than $1$. By letting a discrete group $Γ \subset {\rm SL}_2(\mathbf R)$ act on the upper half-plane $\mathfrak h$ by linear fractional transformations, the coset space $\Gamma\setminus\mathfrak h$ will be a compact Riemann surface (with finitely many points missing) and the higher-genus Riemann surfaces arise in this way. A natural collection of discrete subgroups of ${\rm SL}_2(\mathbf R)$ is ${\rm SL}_2(\mathbf Z)$ and its finite-index subgroups, such as the congruence subgroups. When I write $\Gamma$ below, you could consider it to be one of these groups of integral matrices.

How can we create nonconstant meromorphic functions on $\Gamma\setminus\mathfrak h$? That is the same thing as creating nonconstant meromorphic functions $F : \mathfrak h \rightarrow \mathbf C$ that are $\Gamma$-invariant: $F(\gamma\tau) = F(\tau)$ for all $\gamma \in \Gamma$ and $\tau \in \mathfrak h$. If we aren't clever enough to be able to write down $\Gamma$-invariant functions directly, we can still make progress by finding some non-invariant functions as long as they are non-invariant in the same way: if $$ f\left(\frac{aτ + b}{cτ + d}\right) = (cτ + d)^kf(τ) $$ and $$ g\left(\frac{aτ + b}{cτ + d}\right) = (cτ + d)^kg(τ) $$ for some "weight" $k \in \mathbf Z$ and all $(\begin{smallmatrix} a& b\\c &d\end{smallmatrix}) ∈ \Gamma$ and $\tau ∈ \mathfrak h$, then the ratio $f(τ)/g(τ)$ is $\Gamma$-invariant: $$ \frac{f((a\tau+b)/(c\tau+d))}{g((a\tau+b)/(c\tau+d))} = \frac{(cτ + d)^kf(τ)}{(cτ + d)^kg(τ)} = \frac{f(\tau)}{g(\tau)}. $$ As long as $f$ and $g$ are not constant multiples of each other (essentially, the space of modular forms $f$ and $g$ live in is more than one-dimensional), the ratio $f/g$ will be a nonconstant $\Gamma$-invariant function.

But why should we use fudge factors of the form $(cτ + d)^k$? Suppose for a function $f : \mathfrak h \rightarrow \mathbf C$ that $f(γτ)$ and $f(τ)$ are always related by a nonvanishing factor determined by $γ ∈ Γ$ and $τ ∈ \mathfrak h$ alone, not by $f$: $$ f(γτ) = j(γ, τ)f(τ) $$ for some function $j : \Gamma × \mathfrak h \rightarrow \mathbf C^\times$. If also $g(γτ) = j(γ, τ)g(τ)$ then $f(γτ)/g(γτ) = f(\tau)/g(\tau)$, so $f/g$ is $\Gamma$-invariant. We want to figure out how anyone might imagine using the fudge factor $j(\gamma,\tau) = (c\tau+d)^k$, where $γ = (\begin{smallmatrix} a& b\\ c& d\end{smallmatrix})$. Since $(γ_1γ_2)τ = γ_1(γ_2τ)$, we have $f((γ_1γ_2)τ) = f(γ_1(γ_2τ))$. This turns the above displayed equation into $$ j(γ_1γ_2,τ)f(τ) = j(γ_1,γ_2τ)f(γ_2τ). $$ Since $f(γ_2τ) = j(γ_2,τ)f(τ)$, the above equation holds if and only if the "cocycle condition" $$j(γ_1γ_2, τ) = j(γ_1,γ_2τ)j(γ_2,τ)$$ holds, which looks a lot like the chain rule $(f_1 ◦ f_2)'(x) = f_1'(f_2(x))f_2'(x)$. This suggests a natural example of the function $j(\gamma,\tau)$ using differentiation: when $γ = (\begin{smallmatrix} a& b\\ c& d\end{smallmatrix})$ and $\tau \in \mathfrak h$, set $$ j(γ, τ ) := \left(\frac{aτ + b}{cτ + d}\right)' = \frac{a(cτ + d) − c(aτ + b)}{(cτ + d)^2} =\frac{ad − bc}{(cτ + d)^2}, $$ and for $γ ∈ {\rm SL}_2(\mathbf R)$ this says $j(γ, τ) = 1/(cτ + d)^2$. When $j(γ,τ)$ satisfies the cocycle condition, so does $j(γ,τ)^m$ for each $m \in \mathbf Z$, which motivates the consideration of the transformation rule for modular forms with factors $(cτ + d)^k$, at least for even $k$ (use $k = -2m$).

The top answer at https://math.stackexchange.com/questions/312515/what-is-the-intuition-between-1-cocycles-group-cohomology shows how the same cocycle condition arises when trying to write down line bundles on an elliptic curve over $\mathbf C$.


An introduction that emphasises the historical development is Elliptic and Modular Functions from Gauss to Dedekind to Hecke by Ranjan Roy. The first chapter, The basic modular forms of the nineteenth century, describes much of the early history.

The modular form of weight one in the OP was indeed published by Klein in 1878, but it appeared already in a 1858 paper by Hermite and was implicit in work by Gauss. (Roy writes extensively on Gauss' early contributions to modular forms in chapter 2.)