Why can't you hear music well over a telephone line?

The hint given by the interviewer is a red herring. The limitation you're hearing has been part of the phone network since long before digital sampling had any part in the telephone system. And it applies even in a local phone call where the signal is never digitized.

It is related to the fact that the connection from a land-line phone in your house or office back to the "central office" of the phone company is essentially a continuous connection through a pair of wires. There are typically no active circuits such as amplifiers, repeaters, digitizers, or other electronics involved.

Given the technology of 100 years ago when the phone network was first designed, a connection of this length could really only carry a very limited bandwidth. The engineers who designed the network did numerous experiments to determine just what frequencies needed to be conveyed for people to understand each other's regular speech, and designed the network only to be sure those frequencies were transmitted. They didn't add any costly components to the system if they weren't needed to achieve this goal.

For example, they might have used passive filters to "emphasize" high frequencies in circuits that were a bit longer than average (and so naturally tend to cut out the high frequencies), or to cut off high frequencies in circuits that were shorter than average, so that all users got, as far as possible, the same quality of connection.

Later, when they started using multiplexing to connect multiple voice circuits through a single wire (for inter-city connections, for example), the limited bandwidth allowed them to carry more connections on a single wire, and at that point the bandwidth limitation would have been deliberately enforced by filtering to ensure that conversations didn't leak into each other as cross-talk.

Finally, when digital sampling and digital transmission were introduced into the network, the sampling theorem limitations discussed in the other answers came into play. Fortuitously, the bandwidth limitations introduced in the early days of analog telephone networks allowed digitization to be done at really low bitrates without degrading the signal quality below what it had been all along, and again this allowed more conversations to be carried on a given wire in the network.
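As a rough illustration of how low those bitrates are, here is a minimal sketch that multiplies sample rate by bits per sample. The numbers are assumptions not stated in this answer: 8 kHz sampling with 8 bits per sample (the G.711-style figures for a digital voice channel) and, for comparison, 16-bit stereo CD audio at 44.1 kHz. The function name `pcm_bitrate` is made up for this example.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw (uncompressed) PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Assumed figures: one G.711-style telephone voice channel.
phone = pcm_bitrate(8_000, 8)

# Assumed figures: CD audio, 16-bit samples, two channels.
cd = pcm_bitrate(44_100, 16, channels=2)

print(phone)       # 64000
print(cd)          # 1411200
print(cd / phone)  # 22.05
```

Under these assumptions, one CD-quality stereo stream would occupy the capacity of about 22 telephone calls.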

Edit

I want to summarize with a key point that I previously posted in a comment on another answer:

The digital sampling rate (and later, compression methods) used in digital telephony was chosen to match the characteristics of the analog phone network, not the other way around.


According to Wikipedia, the frequency range of the plain old telephone service is 300Hz to 3.4kHz. So any music you listen to will be missing both the low frequencies and the high frequencies. If you remember back to the last time you heard hold music on the phone, you'll probably remember that it sounded a bit muffled, but I have to say that it's still recognisable, i.e. you can identify what music is being played. I'd be annoyed if my Hi-Fi sounded like that, but the music isn't totally mangled.
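To get a feel for what a 300Hz to 3.4kHz band does to a musical note, the sketch below lists which harmonics of a note fall inside the band. It is an idealized picture: it assumes a perfectly sharp band edge and a note made of exact integer harmonics, and the function name `harmonics_in_band` is hypothetical.

```python
def harmonics_in_band(fundamental_hz, low_hz=300.0, high_hz=3400.0):
    """Return the harmonic numbers n for which n * fundamental_hz
    lies inside the band [low_hz, high_hz]."""
    kept = []
    n = 1
    while n * fundamental_hz <= high_hz:
        if n * fundamental_hz >= low_hz:
            kept.append(n)
        n += 1
    return kept


bass = harmonics_in_band(110.0)    # a low A (110 Hz)
print(bass[0], bass[-1])           # 3 30

treble = harmonics_in_band(440.0)  # concert A (440 Hz)
print(treble)                      # [1, 2, 3, 4, 5, 6, 7]
```

For the low note, the fundamental and second harmonic (110Hz and 220Hz) fall below the band, and everything above the 30th harmonic is cut off. For the higher note the fundamental survives, but only the first seven harmonics do.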

In my youth I used to be a Hi-Fi enthusiast, and the manufacturers' technical specs would boast that their equipment had a flat frequency spectrum from around 20Hz to 20kHz. The problem with reproducing this in a phone system is that, as DisplayName mentions in their answer, to carry a frequency $f$ over a digital network requires a sampling frequency of at least $2f$; otherwise you get aliasing. Providing bandwidth costs money and reduces call capacity (i.e. fewer calls per optic fibre), so phone backbones use a sampling frequency of only 8kHz, and hence the highest permissible frequency is 4kHz. The upper limit is a bit lower than this because it's hard to engineer audio filters with very sharp cutoffs. The 3.4kHz limit I mentioned above is presumably to ensure that no frequency near 4kHz gets through.
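The arithmetic in the paragraph above can be written out as a small sketch. The helper names `nyquist_limit` and `min_sample_rate` are made up for illustration; the figures are the ones quoted in this answer.

```python
def nyquist_limit(sample_rate_hz):
    """Highest frequency that can be represented at this sample rate."""
    return sample_rate_hz / 2


def min_sample_rate(max_signal_hz):
    """Lowest sample rate that can carry frequencies up to max_signal_hz."""
    return 2 * max_signal_hz


phone_limit = nyquist_limit(8_000)   # 8kHz telephone sampling
guard_band = phone_limit - 3_400     # room left for the filter roll-off
hifi_rate = min_sample_rate(20_000)  # what a 20kHz Hi-Fi range would need

print(phone_limit)  # 4000.0
print(guard_band)   # 600.0
print(hifi_rate)    # 40000
```

So the 3.4kHz upper limit leaves about 600Hz for the filter to roll off before the 4kHz limit, while carrying the full Hi-Fi range would need a sampling frequency at least five times the one the phone network uses.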

Whether such a large frequency range is required for music playback is debatable. At a recent hearing checkup I was told I cannot hear anything above 12kHz (too many Black Sabbath gigs in my youth) but music on my Hi-Fi still sounds fine to me.


Have a look into the Nyquist theorem. The sampling frequency needs to be at least double the highest frequency in the signal being sampled. That's why the CD samples at 44.1kHz: the human ear can hear up to ca. 20kHz.

Wikipedia Nyquist-Shannon Theorem

What do we hear instead if we do listen to (originally) 5 Hz to 20 kHz music through the phone? Is everything above 8 kHz simply gone or is there another effect? E.g., will 14 kHz be audible somehow (but differently) at 7 kHz?

Or in other words: "What is happening to the frequencies that are above the Nyquist threshold?"

The frequencies are missing. As simple as that. Not present. What our brain does instead is fill in what should be there, based on experience. So when you talk to somebody you know over the phone, your brain adds what must be there. Still, I noticed that the first time I did this my brain gave me the real info (lacking frequencies) and only later learned that it can just fake the rest, based on its knowledge of the other person's voice. See Wikipedia:CELP, which uses a similar approach for audio compression.
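The frequencies are simply missing because they are filtered out before sampling. For comparison, here is a minimal sketch of what would happen with no such filter, assuming a pure 14 kHz tone sampled at 8 kHz. The function names are hypothetical. The tone would not show up at 7 kHz; it would fold down to 2 kHz, and its samples would be indistinguishable from those of a real 2 kHz tone.

```python
import math


def alias_frequency(tone_hz, sample_rate_hz):
    """Frequency in [0, sample_rate_hz / 2] at which a pure tone appears
    when it is sampled with no anti-aliasing filter in front."""
    nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
    return abs(tone_hz - nearest_multiple)


def sample_cosine(tone_hz, sample_rate_hz, count):
    """First `count` samples of a unit-amplitude cosine at tone_hz."""
    return [math.cos(2 * math.pi * tone_hz * n / sample_rate_hz)
            for n in range(count)]


fs = 8_000
print(alias_frequency(14_000, fs))  # 2000

# The sampled 14 kHz tone matches a sampled 2 kHz tone, sample for sample.
original = sample_cosine(14_000, fs, 16)
folded = sample_cosine(2_000, fs, 16)
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, folded))
print(same)  # True
```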

If you want to know more about the reasons for the 8kHz sampling rate, you can again use Wikipedia: see Wikipedia:PSTN; the standard used is G.711. Also, Sampling Frequency and Human Speech, which I have not read yet, goes into what you need as a minimum for human speech, including graphs and explanations. Finally, you can look into Wikipedia:MP3 in order to understand psychoacoustics. Hint: a beat, for example, masks things that come after it, so that stuff can be dropped since you don't hear it, among other nice things. :D