Does `COUNT` discard duplicates?

Either your professor made a mistake or you misunderstood what she said. In the context of relational DBMSes, as implemented by various vendors, the aggregate function COUNT(<expression>) returns the number of non-NULL values of <expression> in the result set (or a group).

COUNT(*) is a special case: it returns the number of rows in the result set or group, not the number of values of anything. It is equivalent to COUNT(<constant expression>) for any non-NULL constant, such as COUNT(1).

Many databases also support COUNT(DISTINCT <expression>), which returns the number of distinct non-NULL values of <expression>.

COUNT does count duplicates in every DBMS I'm aware of. However:

> Is there any reason for a professor to teach this behaviour?

Yes, there is a reason. In the original relational theory, which underlies all modern relational DBMSes, a relation is a set in the mathematical sense of the word. That means no relation can contain duplicates at all, and this applies to every intermediate relation, not just your “tables”.

Following this principle, you could say that the result of SELECT length FROM product already contains only two rows, so the corresponding COUNT returns 2, not 3.

For example, in Rel DBMS, using the relation given in the question and Tutorial D syntax:

SUMMARIZE product {length} BY {}: {c := COUNT()}

gives:

[Image: Rel result]
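The contrast between SQL's duplicate-preserving result and the set-based reading can also be checked in an ordinary SQL engine. The following is a minimal sketch using Python's built-in sqlite3 module. The product rows are made up for illustration (three products, two sharing a length), since the question's actual data is not reproduced here.

```python
import sqlite3

# Hypothetical data: three products, two of them with the same length.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("a", 10), ("b", 10), ("c", 20)],
)

# SQL keeps duplicates in the projection, so COUNT sees three values.
bag_count = conn.execute("SELECT COUNT(length) FROM product").fetchone()[0]

# Removing duplicates first mimics a projection in the set-based model.
set_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT length FROM product)"
).fetchone()[0]

print(bag_count, set_count)  # 3 2
```

In SQL you have to ask for duplicate removal explicitly (DISTINCT), whereas in a strictly set-based system it happens as part of every projection.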

If your professor is talking about SQL, the statement is wrong. COUNT(x) returns the number of rows where x IS NOT NULL, including duplicates. COUNT(*) or COUNT([constant]) is a special case that counts the rows, even those where every column is NULL. Duplicates are always counted unless you specify COUNT(DISTINCT x). Example:

with t(x,y) as ( values (null,null),(null,1),(1,null),(1,1) )
select count(*)           -- 4
     , count(1)           -- 4
     , count(distinct 1)  -- 1
     , count(x)           -- 2
     , count(distinct x)  -- 1
from t

COUNT(distinct *) is invalid AFAIK.

As a side note, NULL introduces some unintuitive behaviour. For example, with the same t:

SELECT SUM(x) + SUM(y), SUM(x + y) FROM t
4, 2

i.e.

SUM(x) + SUM(y) <> SUM(x + y)
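The reason is that SUM skips NULLs column by column, while x + y is NULL whenever either operand is NULL, so SUM(x + y) only sees the row (1, 1). A quick way to check this is sketched below with Python's built-in sqlite3 module. Using SQLite is an assumption here, since no particular DBMS is named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
# Same four rows as the example above; None is stored as NULL.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(None, None), (None, 1), (1, None), (1, 1)],
)

sum_of_sums, sum_of_row_sums = conn.execute(
    "SELECT SUM(x) + SUM(y), SUM(x + y) FROM t"
).fetchone()

# SUM(x) = 2 and SUM(y) = 2, but only one row has a non-NULL x + y.
print(sum_of_sums, sum_of_row_sums)  # 4 2
```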

If he or she is talking about a relational system as described in, for example, the book Databases, Types, and the Relational Model: The Third Manifesto by C. J. Date and Hugh Darwen, the statement would be correct.

Say that we have the relation:

STUDENTS = Relation(["StudentId", "Name"]
                    , [{"StudentId":'S1', "Name":'Anne'},
                       {"StudentId":'S2', "Name":'Anne'},
                       {"StudentId":'S3', "Name":'Cindy'},
                     ])

SELECT COUNT(NAME) FROM STUDENTS

corresponds to:

COUNT(STUDENTS.project(['Name']))

i.e.

COUNT( Relation(["Name"]
               , [{"Name":'Anne'},
                  {"Name":'Cindy'},
                ]) )

which would return 2.
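The same effect can be reproduced without any relational library by modelling a relation as a plain Python set of tuples. This is only a sketch of the idea. The `project_name` helper is hypothetical and is not the API used in the snippets above.

```python
# A relation modelled as a set of (StudentId, Name) tuples.
students = {("S1", "Anne"), ("S2", "Anne"), ("S3", "Cindy")}


def project_name(relation):
    """Project onto Name; the result is a set, so duplicates collapse."""
    return {(name,) for (_student_id, name) in relation}


names = project_name(students)

# Three tuples in the relation, but only two in its projection onto Name.
print(len(students), len(names))  # 3 2
```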
