VC dimension, fat-shattering dimension, and other complexity measures of a class of BV functions

I think the family $\mathcal{H}_\alpha$ is ill-suited for your purposes, because it is too rich.

For example, in the case $\mathcal{X} = [0,1]$ (or any atomless probability measure space) one can find arbitrarily many $h_\tau\in\mathcal{H}_\alpha$ such that any two of them are at $L^1$ distance $\alpha/2$. This means that $\mathcal{N}(\varepsilon,\mathcal{H}_\alpha,L^1([0,1])) = \infty$ as soon as $\varepsilon< \alpha/4$, since an $\varepsilon$-ball can then contain at most one of the $h_\tau$.

To prove this explicitly, one can use Hadamard matrices of size $N$ (this is certainly conceptual overkill and could be replaced by a random choice, or by a classical non-compactness argument in functional analysis, but it seems the simplest way to proceed). Divide $[0,1]$ into $N$ equal intervals, and for each sequence $\tau=(\tau_i)_{1\le i\le N} \in \{-1,1\}^N$ define $h_\tau$ to be $\frac{1+\tau_i\alpha}2$ on the $i$-th interval. If $\tau,\tau'$ are two distinct rows of a Hadamard matrix, they differ on exactly half their entries, so that $\lVert h_\tau-h_{\tau'}\rVert_{L^1([0,1])}=\frac\alpha2$. There exist arbitrarily large Hadamard matrices (Sylvester's construction gives one of every size $2^k$), and we are done.
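If it helps, here is a small numerical sanity check of this construction (my own sketch, not part of the argument above): it builds a Sylvester Hadamard matrix, so $N$ is a power of $2$, turns each row $\tau$ into the step function $h_\tau$, and verifies that all pairwise $L^1$ distances equal $\alpha/2$. The names `sylvester_hadamard`, `alpha` and `k` are mine.

```python
import itertools
import numpy as np

def sylvester_hadamard(k: int) -> np.ndarray:
    """Return the 2^k x 2^k Sylvester Hadamard matrix with +/-1 entries."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

alpha = 0.3
k = 4                      # N = 2^k = 16 functions on 16 equal subintervals
H = sylvester_hadamard(k)
N = H.shape[0]

# h_tau is constant on each of the N intervals of length 1/N, so its L^1
# distance to h_tau' is the mean of |h_tau - h_tau'| over the N pieces.
for tau, tau_prime in itertools.combinations(H, 2):
    h = (1 + tau * alpha) / 2
    h_prime = (1 + tau_prime * alpha) / 2
    l1_dist = np.mean(np.abs(h - h_prime))
    assert np.isclose(l1_dist, alpha / 2), l1_dist

print(f"all {N * (N - 1) // 2} pairs are at L^1 distance alpha/2 = {alpha / 2}")
```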

Note that if you stay at the level of precision $\varepsilon\simeq \alpha$, then your family is essentially reduced to a point (any two $h\in\mathcal{H}_\alpha$ are indistinguishable at this scale).

In order not to end on a negative claim, let me suggest that you instead use another definition of ``essentially constant'', through a more refined measure of the variation of a function. The problem with your condition is that it does not see any of the geometry of $\mathcal{X}$ (the class $\mathcal{H}_\alpha$ essentially depends only on the cardinality of $\mathcal{X}$), and you cannot expect anything from it (unless possibly you capture some geometry through the metric chosen on the space of functions under consideration; $L^1$ will not do, as the cubes $[0,1]^{n}$ are all isomorphic as measure spaces when endowed with their Lebesgue measures).

There are many natural choices:

  • Hölder functions, when $\mathcal{X}$ is a metric space (the particular case of Lipschitz functions is the most common),

  • As mentioned by Aryeh Kontorovich, BV functions (this notion is quite simple in dimension $1$, but significantly more intricate in higher dimensions), and for a rougher notion, functions of bounded $p$-variation (in dimension $1$, one replaces $\lvert f(x_{i+1})-f(x_i)\rvert$ by $\lvert f(x_{i+1})-f(x_i)\rvert^p$, where $1/p<1$ plays the same role as the Hölder exponent; see the displayed formula after this list),

  • smooth ($\mathcal{C}^k$) functions when $\mathcal{X}$ is a domain in $\mathbb{R}^n$ or a manifold, and the many available variations: Sobolev, Besov, etc.
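
For concreteness, the $1$-dimensional notion of bounded $p$-variation alluded to above can be written as follows (my own addition; conventions differ slightly between sources, e.g. whether a $1/p$-th power is taken): $$ V_p(f) = \sup_{0=x_0<x_1<\ldots<x_n=1}\sum_{i=1}^n\lvert f(x_i)-f(x_{i-1})\rvert^p, $$ and one asks that $V_p(f)<\infty$; for $p=1$ this is the total variation $V(f)$ used below.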


Since the OP mentions functions of bounded variation in the title, let us take that literal definition. A function $f:[0,1]\to\mathbb{R}$ has total variation $$ V(f) = \sup_{0=x_0<x_1<\ldots<x_n=1}\sum_{i=1}^n|f(x_i)-f(x_{i-1})|. $$ For functions with an integrable derivative, $V(f)=\int_0^1|f'(x)|\,dx$. Let $F_v$ be the collection of all $f:[0,1]\to\mathbb{R}$ with $V(f)\le v$. It is known (Anthony and Bartlett, Neural Network Learning (1999), Theorem 11.12) that the fat-shattering dimension of $F_v$ at scale $\gamma$ is $$ 1+\left\lfloor \frac{v}{2\gamma} \right\rfloor.$$
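
To make the formula concrete, here is a small sketch (my own illustration, not from Anthony and Bartlett; the grid size, the test function, and the helper names `total_variation` and `fat_shattering_bound` are mine): it estimates $V(f)$ on a fine grid and evaluates the bound $1+\lfloor v/(2\gamma)\rfloor$ quoted above.

```python
import numpy as np

def total_variation(f, n_grid: int = 100_000) -> float:
    """Approximate V(f) on [0,1] by summing |f(x_{i+1}) - f(x_i)| over a fine grid."""
    x = np.linspace(0.0, 1.0, n_grid)
    return float(np.sum(np.abs(np.diff(f(x)))))

def fat_shattering_bound(v: float, gamma: float) -> int:
    """Fat-shattering dimension of F_v = {f : V(f) <= v} at scale gamma."""
    return 1 + int(np.floor(v / (2 * gamma)))

f = lambda x: np.sin(4 * np.pi * x)   # V(f) = int_0^1 |f'(x)| dx = 8
v = total_variation(f)
print(f"estimated V(f) = {v:.3f}")    # close to 8
for gamma in (1.0, 0.5, 0.1):
    print(f"gamma = {gamma}: fat-shattering dimension of F_v is {fat_shattering_bound(v, gamma)}")
```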

(The notion of the fat-shattering dimension -- the one most appropriate for learnability of continuous function classes -- is given ibid. in Definition 11.11.)

Of course, covering numbers and fat-shattering are intimately related. For example, Theorem 12.7 ibid. shows how to bound the $L_\infty$ covering numbers in terms of the fat-shattering dimension (read the whole chapter!).

Finally, I take issue with "a function whose output doesn't vary much cannot be a good classifier". Linear classifiers/regressors are as smooth as can be and yet have an excellent track record. Conversely, functions that vary too rapidly are prone to overfitting.