Why do we need perspective division?

I mean why do we need that?

In layman terms: To make perspective distortion work. In a perspective projection matrix, the Z coordinate gets "mixed" into the W output component. So the smaller the value of the Z coordinate, i.e. the closer to the origin, the more things get scaled up, i.e. bigger on screen.


Some details that complement the general answers:

The idea is to project a point (x,y,z) on screen to have (xs,ys,d). The next figure shows this for the y coordinate.

enter image description here

We know from school that

tan(alpha) = ys / d = y / z

This means that the projection is computed as

ys = d*y/z = y /w

w = z / d

This is enough to apply a projection. However in OpenGL, you want (xs,ys,zs) to be normalized device coordinates in [-1,1] and yes this has something to do with clipping.

The extrema values for (xs,ys,zs) represent the unit cube and everything outside it will be clipped. So a projection matrix usually takes into consideration the clipping limits (Frustum) to make a single transformation that, with the perspective division, simultaneously apply a projection and transform the projected coordinates along with the z to normalized device coordinates.