How does screenshotting work?

But I'm thinking that the screenshot itself happens way before the data has been turned into pixels?

It happens before the data has been turned into physical pixels (if there are any), but it happens after the data has been turned into pixel values, i.e. a bitmap image.

For example, if a program is displaying text or vector graphics or 3D visuals, the screenshot process doesn't care about that at all, it only cares about the resulting image after those graphics have been rendered into a bitmap.

However, the screenshot is taken directly from the OS memory, or at worst, read back from the GPU memory – it is not captured from the actual VGA or HDMI signals.

In other words, how does screenshotting work? Does it "freeze" pixels? Is it the graphics card or its driver that does the work? Some other hardware or software component?

It depends on the OS that you're asking about. Generally, the core graphics system (the same one that lets apps put windows on screen, such as GDI on Windows or X11 on Linux) will keep an in-memory copy of all pixels on screen (i.e. the framebuffer), so that they can be re-sent to the GPU whenever needed. So it simply provides functions that let programs retrieve that copy.

For example, on Windows there are the GetDC() and GetWindowDC() functions, which give a program a device context it can copy the screen's pixels out of (e.g. with BitBlt()). On Linux, the X11 system has somewhat similar methods such as XGetImage(). These hand the program a bitmap image that's already held somewhere in system RAM, without any special hardware involvement.

(Although in some cases, e.g. with GNOME on Linux, the window manager actually uses the GPU to composite the screen's contents – so in order to take a screenshot, it first has to read the data back from GPU memory into system RAM.)
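The mechanism those APIs expose can be sketched abstractly: the "screen" is just an array of pixel values in RAM, and a screenshot is a plain copy of that array. This is a toy model – the function names here are illustrative, not real OS APIs:

```python
# Toy model of a framebuffer screenshot: the "screen" is just an
# in-memory array of pixel values, and a screenshot copies it.
# (Illustrative only -- real systems expose this via GetDC()/BitBlt()
# on Windows or XGetImage() on X11.)

WIDTH, HEIGHT = 4, 3  # tiny 4x3 "screen"

# The graphics system's in-RAM copy of the screen: one (R, G, B) per pixel.
framebuffer = [[(0, 0, 0) for _ in range(WIDTH)] for _ in range(HEIGHT)]

def draw_pixel(x, y, color):
    """What happens when an app paints: the RAM copy gets updated."""
    framebuffer[y][x] = color

def take_screenshot():
    """A screenshot is just a copy of that RAM bitmap -- no hardware involved."""
    return [row[:] for row in framebuffer]

draw_pixel(1, 1, (255, 0, 0))
shot = take_screenshot()

# The shot reflects what was on "screen" at copy time...
assert shot[1][1] == (255, 0, 0)

# ...and later drawing doesn't retroactively change the captured copy.
draw_pixel(1, 1, (0, 255, 0))
assert shot[1][1] == (255, 0, 0)
```

The key point the sketch makes is the last pair of assertions: the screenshot "freezes" pixels only in the sense that it's a copy taken at one moment, while the live framebuffer keeps changing.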

As a side note, there can be some differences between what's in the framebuffer and what's actually being displayed. For example, many video games will produce very dark screenshots because they use the GPU's gamma correction feature to adjust the image brightness, and this correction is only applied as a last step when producing the video signal – so screenshots will only capture the uncorrected, dark-looking image. (Unless the game actually overrides the whole OS screenshot function with its own.)
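That gamma effect can be shown numerically. This is a toy model with invented values, assuming the display pipeline applies a simple power-law gamma ramp at scan-out:

```python
# Why some games screenshot dark (toy model, invented numbers).
# The game stores *uncorrected* pixel values in the framebuffer and asks the
# GPU to brighten them with a gamma ramp applied only while producing the
# video signal. A screenshot copies the framebuffer, so it misses that step.

def scanout_gamma(value, gamma=2.2):
    """Brightening the display hardware applies when generating the signal."""
    return round(255 * (value / 255) ** (1 / gamma))

stored = 40                        # dark value the game wrote to the framebuffer
on_screen = scanout_gamma(stored)  # what the monitor actually shows: brighter
screenshot = stored                # what a screenshot captures: still dark

print(on_screen, screenshot)
assert on_screen > screenshot      # the displayed image is brighter than the capture
```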

One way of looking at the difference between a screenshot and a camera photo of the screen is to compare the results of the two.

A screenshot is the equivalent of the computer taking the full-screen image in digital form and saving it as a file. The digital information is as precise as it can be, limited only by the display adapter's capability. If you have a 4K-capable card and display, your screen capture will be a pixel-perfect 4K image.

A camera snapshot of a screen, on the other hand, is a digital-to-analog-to-digital conversion. The first digital stage is the aforementioned information coming from the display adapter. The analog portion is the transmission of light from the display to your eyes and/or camera, and the final digital stage is the conversion of that light back to digital data by the camera's sensor.

There is going to be a substantial difference in quality between the image produced by the camera and the screen capture. The camera degrades the image further by passing the "signal", in the form of light, through lenses with their own aberrations and losses.
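The lossiness of that round trip can be sketched with a toy simulation. The numbers below are made up (a real camera's exposure and response curves are far more complex); the point is only that the re-digitized values drift from the originals, while a screenshot is a bit-exact copy:

```python
# Toy model of photographing a screen: digital -> analog -> digital.
# Exact numbers are invented; the point is that the round trip is lossy,
# while a screenshot copies the original values exactly.

def emit_light(pixel_value):
    """Screen converts the digital value to analog light (normalized 0..1)."""
    return pixel_value / 255

def camera_capture(light, exposure=0.9):
    """Camera sensor re-digitizes the light, with its own exposure/response."""
    return min(255, round(255 * light * exposure))

original = [0, 64, 128, 200, 255]
photographed = [camera_capture(emit_light(v)) for v in original]
screenshotted = list(original)  # bit-exact copy from RAM

print(photographed)   # values drift from the original
print(screenshotted)  # identical to the original
```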

A camera reads data from a light sensor and stores that data in RAM or other storage; a video camera, as opposed to a still one, does this continuously. The "raw" data from the sensor may not be compatible with the format needed by a display device, such as a PC graphics card or the LCD on the camera, so if the device with the camera needs to display what the camera is seeing, a conversion from the camera format to the display device format is needed.
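To make that conversion step concrete: real sensors typically use a Bayer mosaic, where each photosite records only one of red, green, or blue, so raw data must be "demosaiced" into full RGB pixels before display. This sketch collapses one 2x2 RGGB block into a single pixel by averaging the two green samples – a drastic simplification of a real camera pipeline, with invented sensor readings:

```python
# Toy conversion from camera RAW to a displayable RGB pixel.
# A real sensor's Bayer mosaic records one of R, G, B per photosite;
# this sketch "demosaics" one 2x2 RGGB block by averaging the two green
# samples. Real pipelines interpolate across neighboring blocks instead.

def demosaic_rggb_block(r, g1, g2, b):
    """Collapse a 2x2 RGGB sensor block into a single RGB pixel."""
    return (r, (g1 + g2) // 2, b)

raw_block = (200, 120, 124, 90)   # invented sensor readings: R, G, G, B
pixel = demosaic_rggb_block(*raw_block)
print(pixel)                      # a display-ready (R, G, B) triple
```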

A screenshot is an export of data that already exists in RAM being used by a video card or eventually destined for a display device. Typically this data is in the format a PC graphics card or other display device expects. When it's captured, it has to be converted from this format to a well-known image format.

So the main difference is one of data flow:

Camera -> RAW data -> capture (copy) to storage or RAM -> display device binary format -> display device video RAM -> display device (if what camera is seeing should be directly displayed)

Camera -> RAW data -> capture (copy) to temp storage or RAM -> convert from there to JPEG, etc. (if what camera is seeing should be saved to file)

Display device -> display device video RAM -> display device binary format -> capture (copy) to other system RAM -> convert from there to BMP, JPEG, etc. (saving what display device is using to generate picture to file)
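The final step of that last flow – converting raw pixel data already in RAM into a well-known file format – can be sketched as a minimal, uncompressed 24-bit BMP writer. This is a bare-bones illustration of the conversion step (real screenshot tools use image libraries, and usually emit PNG or JPEG instead):

```python
import struct

# Minimal sketch of the last flow above: raw pixel data already in RAM is
# wrapped into a well-known file format (here an uncompressed 24-bit BMP).

def write_bmp(path, width, height, pixels):
    """pixels: list of rows, each a list of (r, g, b) tuples, top row first."""
    row_size = (width * 3 + 3) & ~3          # BMP rows are padded to 4 bytes
    image_size = row_size * height
    file_size = 14 + 40 + image_size         # file header + info header + data
    with open(path, "wb") as f:
        # BITMAPFILEHEADER: magic, file size, reserved, offset to pixel data
        f.write(struct.pack("<2sIHHI", b"BM", file_size, 0, 0, 54))
        # BITMAPINFOHEADER: size, dims, planes, 24 bpp, no compression
        f.write(struct.pack("<IiiHHIIiiII", 40, width, height, 1, 24,
                            0, image_size, 2835, 2835, 0, 0))
        for row in reversed(pixels):          # BMP stores rows bottom-up
            for r, g, b in row:
                f.write(bytes((b, g, r)))     # ...and channels as BGR
            f.write(b"\x00" * (row_size - width * 3))

# A 2x2 "framebuffer": red, green / blue, white
fb = [[(255, 0, 0), (0, 255, 0)],
      [(0, 0, 255), (255, 255, 255)]]
write_bmp("shot.bmp", 2, 2, fb)
```

Note the two quirks the comments call out: BMP stores rows bottom-up and channels as BGR, which mirrors the point above that display-device formats rarely match file formats byte-for-byte.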