To calculate world coordinates from screen coordinates with OpenCV

To understand the calculation, it helps to first read a little about the pinhole camera model and simple perspective projection. For a quick glimpse, check this. I'll try to update with more.

So, let's start with the opposite direction, which describes how a camera works: projecting a 3d point in the world coordinate system to a 2d point in our image. According to the camera model:

P_screen = I * P_world

or (using homogeneous coordinates, where the equality holds up to the scale factor z_world)

z_world * | x_screen | = I * | x_world |
          | y_screen |       | y_world |
          |    1     |       | z_world |
                             |    1    |

where

I = | f_x    0    c_x    0 | 
    |  0    f_y   c_y    0 |
    |  0     0     1     0 |

is the 3x4 intrinsics matrix, with f being the focal length (in pixels) and c the principal point (the pixel where the optical axis meets the image plane).

If you multiply this out and divide by the third component (z_world), you get:

x_screen = (x_world/z_world)*f_x + c_x
y_screen = (y_world/z_world)*f_y + c_y

But, you want to do the reverse, so your answer is:

x_world = (x_screen - c_x) * z_world / f_x
y_world = (y_screen - c_y) * z_world / f_y

z_world is the depth that the Kinect returns, and f and c are known from your intrinsics calibration, so for every pixel you apply the equations above to get the actual world coordinates.

Edit 1 (why the above corresponds to world coordinates and what the extrinsics we get during calibration are):

First, check this one; it explains the various coordinate systems very well.

Your 3d coordinate systems are: Object ---> World ---> Camera. There is a transformation that takes you from object coordinate system to world and another one that takes you from world to camera (the extrinsics you refer to). Usually you assume that:

  • Either the Object system corresponds with the World system,
  • or, the Camera system corresponds with the World system

1. While capturing an object with the Kinect

When you use the Kinect to capture an object, what is returned to you from the sensor is the distance from the camera. That means that the z coordinate is already in camera coordinates. By converting x and y using the equations above, you get the point in camera coordinates.

Now, the world coordinate system is defined by you. One common approach is to assume that the camera is located at (0,0,0) of the world coordinate system, with its axes aligned to the world axes. In that case, the extrinsics matrix is the identity matrix, and the camera coordinates you found are also the world coordinates.

Sidenote: Because the Kinect returns the z in camera coordinates, there is also no need for a transformation from the object coordinate system to the world coordinate system. Let's say, for example, that you had a different camera that captured faces and, for each point, returned the distance from the nose (which you considered to be the center of the object coordinate system). In that case, since the values returned would be in the object coordinate system, we would indeed need a rotation and translation to bring them to the camera coordinate system.

2. While calibrating the camera

I suppose you are calibrating the camera with OpenCV, using a calibration board in various poses. The usual way is to assume that the board is stationary and the camera is moving, rather than the opposite (the transformation is the same in both cases). That means that now the world coordinate system corresponds to the object coordinate system. This way, for every frame, we find the checkerboard corners and assign them 3d coordinates, doing something like:

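std::vector<cv::Point3f> objectCorners;

// One 3d point per corner; the board lies in the z = 0 plane.
for (int i = 0; i < noOfCornersInHeight; i++)
{
    for (int j = 0; j < noOfCornersInWidth; j++)
    {
        // Corner (i, j) is offset by i and j squares from the origin corner.
        objectCorners.push_back(cv::Point3f(float(i * squareSize), float(j * squareSize), 0.0f));
    }
}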

where noOfCornersInWidth, noOfCornersInHeight and squareSize depend on your calibration board. If for example noOfCornersInWidth = 4, noOfCornersInHeight = 3 and squareSize = 100, we get the 3d points

(0  ,0,0)  (0  ,100,0)  (0  ,200,0)    (0  ,300,0)
(100,0,0)  (100,100,0)  (100,200,0)    (100,300,0)
(200,0,0)  (200,100,0)  (200,200,0)    (200,300,0)

So, here our coordinates are actually in the object coordinate system. (We have arbitrarily assumed that the upper left corner of the board is (0,0,0), and the coordinates of the remaining corners are expressed relative to it.) So here we do need the rotation matrix and translation vector to take us from the object (world) system to the camera system. These are the extrinsics that OpenCV returns for each frame.

To sum up in the Kinect case:

  • Camera and World coordinate systems are considered the same, so there is no need for extrinsics.
  • There is no need for an Object to World (Camera) transformation, since the value the Kinect returns is already in the Camera system.

Edit 2 (On the coordinate system used):

This is a convention and I think it depends also on which drivers you use and the kind of data you get back. Check for example that, that and that one.

Sidenote: It would help you a lot if you visualized a point cloud and played a little bit with it. You can save your points in a 3d object format (e.g. ply or obj) and then just import it into a program like Meshlab (very easy to use).