How is a human voice unique?

A "pure tone" is a sound that has a single sine function as its pressure profile. The human voice is not a pure tone; it is a superposition of many different sine waves with different frequencies and different amplitudes. Here is an image illustrating how many sine waves of different frequencies can combine to make a more complicated waveform like the human voice:

Wave superposition image example

(image credit)

Thus a human voice has many more parameters than just a single amplitude and frequency. It has many amplitudes, one for each of many different frequencies (along with a phase for each as well). Furthermore, these amplitudes change over time as the human voice makes different sounds.

This picture, for example, is a "spectrogram" of a human voice.

Human voice spectrogram image

(image credit: By Dvortygirl, Mysid - FFT'd in baudline; original sound by DvortygirlThis file was derived from:En-us-it's all Greek to me.ogg, CC BY-SA 3.0 )

The x-axis is time, the y-axis is frequency, and the intensity indicates the amplitude of each frequency component at each point in time. A pure tone would show up as a single solid horizontal line. You can see that the human voice is made of many many frequency components of various amplitudes.

This is the same reason a violin, oboe, and piano sound different even when they play "the same note". The musical terminology for the specific balance of different frequency components is known as "timbre".

See the Wikipedia article for further reading.


Here is an image of the waveforms of three people saying the word "ramen." The first two are actually the same person on different occasions, having therefore the same pitch to his voice. The third is a woman saying the same word "ramen". I have altered the durations of the clips so that they all take up the same amount of time overall.

enter image description here

(click to expand).

If you look very closely, there is an initial segment of less-turbulence (R) morphing into a segment with a lot of turbulence (A) morphing into what's essentially a pure frequency (M) with, in the man's case, an overtone; followed by another rougher patch (E), followed by another "more pure" note (N), which seems to be very similar if a bit softer, more drawn-out, and possibly with a higher overtone in each case.

One thing that's very noticeable is that the woman's voice goes up-and-down a lot more, which manifests as her voice's higher pitch.

Another thing is this "turbulence" stuff: this stuff, and any sort of "noise," is a lot of different frequencies happening at once. Your ear actually has a part called the "cochlea" which appears to have little hairs that each are at a slightly different resonance frequency due to their different locations in the organ -- so different frequencies vibrate different hairs in your ears! It's the whole pattern of how these hairs vibrate together which makes the difference between the "a" sounds in Dad and Father, which are very different vowel-sounds (at least in American English!).

In general then there are not two pure numbers which distinguish a pure sound (its frequency and amplitude) but there are instead two functions of frequency which distinguish a pure sound. The first function is the amplitude as a function of frequency -- any pure sound is going to have a bunch of different components at different frequencies! -- and the second parameter is called the phase of the different frequencies. The two numbers are only going to distinguish two sine waves that start out in-phase, but very few of the sounds you hear are sine waves and very few of the sounds you hear are perfectly in phase.

Since a phase is best represented as an angle with such periodic and quasi-periodic waveforms, the natural description of a sound is actually in terms of a function which assigns every frequency a 2D scaled rotation matrix where the rotation angle is the phase and the scale-factor is the scale; it's in 2D because you only need one angle. Such scaled rotation matrices are also known as complex numbers and this function is called the sound's Fourier transform, defined as: $$y[f] = \mathcal F_{t \to f}~y(t) = \int_{-\infty}^\infty dt~ e^{-2\pi i f t} y(t).$$ It turns out that this has a very cool property which is its "inverse transform"$$y(t) = \mathcal F^{-1}_{f\to t}~ y[f] = \int_{-\infty}^\infty dt~ e^{+2\pi i f t} y[f].$$First off, the fact that this exists at all means that both pictures are 100% equivalent for any time-signal: we can always analyze what it looks like in the frequency domain. Second, the fact that its inverse takes almost exactly the same form allows us to reuse our fast Fourier transform tricks to build a fast inverse-Fourier transform, so these are used all the time in signal processing.

Each human voice contains a different baseline pitch, a different accent (mapping of words to actual sounds!), a different phase profile, some different choices of harmonics. It's a testament to how powerful our brain is, and how long it takes us to learn a language, that we can even recognize that two different people from different places are saying the same word! But there are obviously some patterns, like the simpler "more pure" natures of the M and N sounds above, which our brain can "latch on" to in order to group together common sounds. So it's not impossible, it's just very difficult.


As stated by the others, a sound is made up by sinus-waves of different frequencies. The tuning you hear, is determined by the lowest frequency (fundamental). The other frequencies are multiples of that ground frequency and are called overtones.

Summarising what is shown below: the amount in which the different overtones are present, determine the colour of the sound and makes the difference between your voice and mine, between a piano and a saxophone.

As an example, I examined two a's (440 Hz). One produced by a tuning-fork, the other played on an oboe (human speech is a bit more complex, but qualitatively, it's the same).

Below, the two recorded sound are displayed simultaniously: Yellow: sound of a tuning fork / White: sound of an oboe

Performing a fourier transform (looking at which frequencies are present in the sound) on the tuning-fork sound, the result is below: one frequency is very dominant: 440 Hz, the other frequencies hardly have any influence (notice the dB-scale and thus logarithmic scale on the y-axis).

Fourier Transform Tuning fork

The same analysis on the oboe sound reveals much more: Several peaks at 440 Hz, 880 Hz, 1320 Hz, ... (2x, 3x, 4x, ... 440 Hz) As you can see, the tuning you hear (440 Hz), is not the frequency that is most present in the sound (often, the first peak is the highest, but the pattern you see below is what gives the oboe its particular sound). Your hearing is trained to perceive the series of peaks as a whole and recognize the ground frequency as the pitch.

Fourier Transform Oboe