Is a data set really a set?

Arthur is right; the term "data set" usually means multiset. For example, a bivariate data set just means a multiset of elements of $\mathbb{R}^2$. Furthermore, if $X$ is a set, I like to write $\mathbb{N}\langle X \rangle$ for the set of all multisets in $X$. Therefore, $\mathbb{N}\langle \mathbb{R}^2\rangle$ is notation for the collection of all bivariate data sets. The remainder of my answer will address the question:

What is a multiset?

Informally, a multiset in $X$ is like a finite subset of $X$, except that repetitions are allowed. (Order still doesn't matter.) For example, the following are multisets in $\mathbb{N}$: $$\{1,2\} \quad \{2,1\} \quad \{2,1,1\}$$ The first two are equal, but the last one is distinct from the other two. It's sometimes clearer to write multisets using the notation of linear combinations: $$\{a,b,b\} = \{a\}+\{b\}+\{b\} = \{a\}+2\{b\}$$
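As a rough illustration of the informal description, here is a small Python sketch. It assumes `collections.Counter` as a stand-in for a multiset; that choice, and the variable names, are illustrative and not part of the definitions that follow.

```python
from collections import Counter

# A Counter records how many times each element occurs;
# the order in which elements are listed does not affect equality.
m1 = Counter([1, 2])
m2 = Counter([2, 1])
m3 = Counter([2, 1, 1])

same_first_two = (m1 == m2)   # order is ignored
last_differs = (m1 != m3)     # repetitions are not ignored

# Linear-combination notation: {a, b, b} = {a} + {b} + {b}
lhs = Counter(["a", "b", "b"])
rhs = Counter(["a"]) + Counter(["b"]) + Counter(["b"])
```

Here `+` on counters adds multiplicities element by element, which matches the sum $\{a\}+\{b\}+\{b\}$ above.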

Here are a few different formalizations, starting from the most concrete and ending with the most abstract:

Let $X$ denote a set. Then:

Definition 0. A multiset in $X$ is a finitely-supported function $X \rightarrow \mathbb{N}$.

(The support of $f : X \rightarrow \mathbb{N}$ is defined to be $\{x \in X \mid f(x) \neq 0\}$, and $f$ is said to be finitely-supported iff this set is finite.)

More abstractly:

Definition 1. A multiset in $X$ is an element of the $\mathbb{N}$-module freely generated by $X$.

(This explains why the notation of linear-combinations works. It also explains why $\mathbb{N}\langle X\rangle$ is good notation.)

To see how and why Definition 0 works, just interpret $\{a\}$ as a function $X \rightarrow \mathbb{N}$ for each $a \in X$ as follows: $$\{a\}(b) = [a=b]$$

(See also: Iverson bracket. I usually avoid Kronecker delta notation because it's a less versatile formalization and you can do a lot less with it; therefore, I think the mathematical community should phase out its usage.)

Now observe that the set of finitely-supported functions $X \rightarrow \mathbb{N}$ forms an $\mathbb{N}$-module under the pointwise operations. This allows us to add up elements of the form $\{a\}$ however we please, essentially building complicated multisets from simpler "atoms." We can say more: the set $$\{\{a\} : a \in X\}$$ is a basis for the set of all finitely-supported functions $X \rightarrow \mathbb{N}$, which explains the equivalence with Definition 1. In fact, it is the only basis.
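The construction above can be sketched in Python, again assuming `collections.Counter` models a finitely-supported function $X \rightarrow \mathbb{N}$. The helper names `atom`, `scale`, `support`, and `from_atoms` are hypothetical, chosen only for this illustration.

```python
from collections import Counter

def atom(a):
    """The multiset {a}: it sends b to [a = b] (1 if a == b, else 0)."""
    return Counter({a: 1})

def scale(n, m):
    """Pointwise multiplication of the multiset m by a natural number n."""
    return Counter({x: n * k for x, k in m.items() if n * k > 0})

def support(m):
    """The set of elements with non-zero multiplicity."""
    return {x for x, k in m.items() if k != 0}

def from_atoms(m):
    """Rebuild m as a sum of scaled atoms, one term per element of the support."""
    total = Counter()
    for x in support(m):
        total = total + scale(m[x], atom(x))
    return total

m = Counter({"a": 1, "b": 2})   # the multiset {a} + 2{b}
rebuilt = from_atoms(m)
```

Rebuilding `m` from its atoms returns the same multiset, which is the sense in which the atoms $\{a\}$ form a basis: the coefficients are just the multiplicities.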

For the more advanced reader:

First, some comments of a general nature. If you've only considered modules over rings, the uniqueness of a basis might come as a bit of a shock. This is all possible because $\mathbb{N}$ is not a ring, of course. Another case of this is $\mathbb{B}$-modules, where $\mathbb{B} = \{0,1\}$ has multiplication given by logical AND and addition given by logical OR. So in particular, $1+1 = 1$, in contrast to the ring $\mathbb{Z}/2\mathbb{Z}$, which has $1+1 = 0$. Anyway, a $\mathbb{B}$-module turns out to be the same thing as a unital semilattice, and the $\mathbb{B}$-module freely generated by $X$ is just $\mathcal{P}_{\mathrm{fin}}(X)$, the collection of all finite subsets of $X$. The singletons provide the unique basis.
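To make the contrast with multisets concrete, here is a small Python sketch of $\mathcal{P}_{\mathrm{fin}}(X)$ as a $\mathbb{B}$-module, assuming finite subsets are modelled by `frozenset` and addition by union. The names `b_add` and `b_scale` are hypothetical.

```python
from collections import Counter

def b_add(s, t):
    """Addition in the free B-module: union of finite subsets."""
    return s | t

def b_scale(c, s):
    """Scalar multiplication by c in B = {False, True}."""
    return s if c else frozenset()

zero = frozenset()              # the unit of the semilattice
a = frozenset({"a"})
b = frozenset({"b"})

idempotent = (b_add(a, a) == a)               # 1 + 1 = 1 in B
unital = (b_add(a, zero) == a)
commutative = (b_add(a, b) == b_add(b, a))

# Contrast with the N-module of multisets, where {a} + {a} = 2{a}.
doubled = Counter(["a"]) + Counter(["a"])
```

Adding a singleton to itself changes nothing here, whereas in the multiset case the multiplicity doubles.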

Moving beyond semirings, another place where basis-uniqueness occurs is in the context of barycentric algebras. In this case, the free algebras are simplices, which explains how we're able to speak of the vertices of a simplex.

On another topic altogether, we can also try to categorify:

Definition 2. A categorified multiset in $X$ is a finitely-supported function $X \rightarrow \mathbf{FinSet}$.

(In other words, a categorified multiset is an $X$-indexed family of finite sets such that all but finitely many of those sets are empty.)

More abstractly:

Definition 3. A categorified multiset in $X$ is an object of the finite-coproduct category freely generated by $X$.

I'd also add that there's a definition that seems not to fit the above scheme:

Definition 4. A categorified multiset in $X$ is an object $M$ of the slice category $\mathbf{Set}/X$ such that the underlying set of $M$ is finite.
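One way to read Definition 4 concretely: an object of $\mathbf{Set}/X$ with finite underlying set is a finite set $M$ together with a function $p : M \rightarrow X$. The Python sketch below assumes $p$ is stored as a `dict`; the labels `obs1`, `obs2`, `obs3` and the helper names `fibers` and `decategorify` are hypothetical. It shows how taking fibers gives an $X$-indexed family of finite sets (as in Definition 2), and how counting each fiber gives an ordinary multiset (as in Definition 0).

```python
from collections import Counter

# A finite set M = {obs1, obs2, obs3} with a map p: M -> X.
p = {"obs1": "a", "obs2": "b", "obs3": "b"}

def fibers(p):
    """Group the elements of M by their image in X."""
    out = {}
    for m, x in p.items():
        out.setdefault(x, set()).add(m)
    return out

def decategorify(p):
    """Forget the elements of each fiber and keep only its size."""
    return Counter(p.values())

family = fibers(p)
multiset = decategorify(p)
```

Elements of $X$ that are not hit by $p$ have empty fibers, so the family is finitely supported automatically.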


A "data set" in statistics does indeed allow repetitions and in that sense is different from a "set" in set theory.

It wouldn't make much sense otherwise: for instance, if you take the average daily temperature of each day for a year, there are only going to be a couple of dozen distinct values (or a few dozen, in Fahrenheit), and if the repeats were discarded, the mean, standard deviation, and so on would no longer describe the year's data.

Depending on the context, a "data set" is either an ordered series of values (thus, an $n$-tuple in disguise, as you say) or a collection of values, some of which may be repeated, with no ordering implied - so that $\{1,2,2\} = \{2,1,2\} = \{2,2,1\}$. I suppose that if you were desperate then you could consider the latter as a map from the space of possible values to the set of natural numbers.
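The two readings can be compared in a short Python sketch. It assumes tuples for the ordered reading and `collections.Counter` for the "map from values to natural numbers" reading; `mean_from_counts` is a hypothetical helper, and `Fraction` is used only to keep the arithmetic exact.

```python
from collections import Counter
from fractions import Fraction

ordered = (1, 2, 2)
reordered = (2, 1, 2)

tuples_differ = (ordered != reordered)                      # order matters for tuples
multisets_agree = (Counter(ordered) == Counter(reordered))  # but not for multisets

def mean_from_counts(counts):
    """Mean of a multiset given as a map value -> multiplicity."""
    n = sum(counts.values())
    return Fraction(sum(x * k for x, k in counts.items()), n)

multiset_mean = mean_from_counts(Counter(ordered))       # (1 + 2 + 2) / 3
set_mean = mean_from_counts(Counter(set(ordered)))       # (1 + 2) / 2
```

The two means differ, which is why discarding repetitions would change the statistics.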


In most cases the data set will be a true set if you view it as a set of observations. In the example of temperatures, there are only a few different temperatures, but each one corresponds to a different day. Your data set consists of ordered pairs (day, temperature on that day) and no pair is repeated. The only way you get repetition is if you observe the same data more than once. If you have a data set of the number of legs on horses, your observations are (horse name, number of legs). If you have a repeat, you have observed the same horse more than once, so you might want to delete one observation. Alternatively, you might worry that a horse had lost a leg, in which case your data is (horse name, date observed, number of legs) and again you won't have any duplicates.
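A small Python sketch of this point of view, using made-up observations (the day labels and temperatures are hypothetical): the observations form an honest set of pairs, and repetitions only appear once the day is forgotten.

```python
from collections import Counter

# Each observation is a (day, temperature) pair; no pair is repeated.
observations = {("Mon", 20), ("Tue", 22), ("Wed", 20)}

# Forgetting the day gives a multiset of temperatures.
temperatures = Counter(t for _, t in observations)

# Forgetting the multiplicities as well gives only the distinct values.
distinct_temperatures = {t for _, t in observations}
```

Here the set of observations has three elements, while only two distinct temperatures occur, one of them twice.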