How can MS-DOS and other text mode programs display double-width CJK characters?

The normal "80x25 characters" mode is actually 720x350 pixels on the original monochrome adapter (meaning that each character cell is 9 pixels wide by 14 pixels high). Double-width character modes ("40x25") can either simply stretch this to the larger width by doubling each pixel column, which halves the required amount of video content memory (40x25 cells at two bytes each is 2000 bytes instead of 4000), or keep an identical amount of video content memory and use additional glyph memory to increase the character cells to 18x14 pixels.

Fairly early on (I think it was done when EGA was introduced), support for user-defined character glyphs was added to the IBM PC's text display mode.

The normal text mode of the IBM PC is simply a sequential 4000 bytes of video content RAM at a particular address (80x25 cells at two bytes each). Each cell is read as one byte describing the character to be displayed followed by one byte of character attributes (originally blinking, bold, underline etc.; later re-used for foreground and background colors and blinking/highlight, hence the limitation to 16 colors in text mode). The actual glyph to be displayed for each character byte value is stored elsewhere.
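To make that layout concrete, here is a minimal sketch of poking a character cell directly, assuming a 16-bit DOS compiler in the Turbo C mould (whose dos.h provides MK_FP) and the color adapter's buffer at segment B800h:

```c
#include <dos.h>

/* Write one character cell directly into the 80x25 color text buffer.
   Assumes a 16-bit DOS compiler such as Turbo C. */
void put_cell(int row, int col, unsigned char ch, unsigned char attr)
{
    /* Each cell is two bytes: character code, then attribute. */
    unsigned char far *cell =
        (unsigned char far *)MK_FP(0xB800, (row * 80 + col) * 2);
    cell[0] = ch;    /* which glyph to display                */
    cell[1] = attr;  /* foreground/background color, blinking */
}

int main(void)
{
    put_cell(0, 0, 'A', 0x1E);  /* yellow 'A' on blue, top left corner */
    return 0;
}
```

The monochrome adapter's buffer lives at segment B000h instead, which is why some software needs to be told which adapter it is talking to.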

This means that as long as you can make do with 256 distinct glyphs on the screen at any one time, and each glyph can be represented as a 9x14 one-bit bitmap (the hardware actually stores eight columns per glyph and synthesizes the ninth), you can simply replace the glyphs in memory to make the characters appear differently. This is part of what mode con codepage select did on DOS, and it is relatively trivial.
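As a sketch of what such a glyph replacement can look like, the EGA/VGA BIOS exposes a "load user-defined font" service as INT 10h, AX=1100h. Assuming a Borland-style DOS compiler (whose dos.h provides intr(), struct REGPACK, and FP_SEG/FP_OFF), and with a made-up 8x14 "solid box" glyph:

```c
#include <dos.h>

/* Hypothetical replacement glyph: an 8x14 box, one byte per scan line. */
static unsigned char box_glyph[14] = {
    0xFF, 0x81, 0x81, 0x81, 0x81, 0x81, 0x81,
    0x81, 0x81, 0x81, 0x81, 0x81, 0x81, 0xFF
};

/* Replace the bitmap for one character code via INT 10h, AX=1100h. */
void load_glyph(unsigned char ch)
{
    struct REGPACK r;

    r.r_ax = 0x1100;             /* load user-defined font             */
    r.r_bx = 14 << 8;            /* BH = bytes per glyph, BL = block 0 */
    r.r_cx = 1;                  /* replace a single character         */
    r.r_dx = ch;                 /* starting at this character code    */
    r.r_es = FP_SEG(box_glyph);  /* ES:BP -> glyph bitmap              */
    r.r_bp = FP_OFF(box_glyph);
    intr(0x10, &r);
}
```

After load_glyph('A'), every 'A' already on screen, and every 'A' written afterwards, displays as the box instead.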

If you need more than 256 distinct glyphs but can live with a reduced number of glyphs on screen, you can go with a 40x25 scheme with double-width (18 pixels wide) glyphs. Assuming that the total amount of video content RAM is fixed, and assuming that you can increase the glyph bitmap memory, you can move to using two bytes out of every four to represent one on-screen glyph, giving you access to 2^16 = 65,536 different glyphs (including the blank glyph). If you feel daring, you could even repurpose the second attribute byte, which would give you access to 2^24 ≈ 16.7M different glyphs. Both of these approaches rely on special software support, but the hardware and firmware portion should be pretty easy to do. 65,536 glyphs at 18x14 one-bit pixels (252 bits, or 31.5 bytes, per glyph) works out to about 2 MiB, a sizeable but not insurmountable amount of memory at the time. 256 glyphs at 18x14 one-bit pixels is about 8 KiB, which was absolutely reasonable even in the first half of the 1980s when the EGA was developed and introduced.
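Written out as code, the cell layout of this hypothetical scheme could look like the following. To be clear, nothing here corresponds to real hardware or a real DOS API; it is just the four-bytes-per-glyph arrangement described above, under the same Turbo C style assumptions as before:

```c
#include <dos.h>

/* Hypothetical 40x25 double-width scheme: each on-screen glyph occupies
   four bytes, and the two character bytes together form a 16-bit glyph
   index into the (enlarged) glyph bitmap memory. */
void put_wide_glyph(int row, int col, unsigned glyph_index, unsigned char attr)
{
    unsigned char far *cell =
        (unsigned char far *)MK_FP(0xB800, (row * 40 + col) * 4);

    cell[0] = (unsigned char)(glyph_index >> 8);   /* high byte of index */
    cell[1] = attr;
    cell[2] = (unsigned char)(glyph_index & 0xFF); /* low byte of index  */
    cell[3] = attr;  /* in the "daring" 2^24 variant, this byte would
                        carry eight more index bits instead              */
}
```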

Basic US English needs at least 62 dedicated glyphs (digits 0-9 and letters A-Z in upper and lower case), so at 8 bits per glyph you have something like 180-190 glyphs left to play with (256 − 62 = 194, before punctuation) if you also want to be able to display US English text at the same time. If you can live without simultaneous US English support, which you might choose to do in a resource-constrained environment such as the early IBM PC architecture, you have access to the full number of glyphs.

With some trickery you could probably mix and match the two schemes, too.

I don't know how it was actually done, but both of these are viable schemes for getting particularly limited-character-count "fancy" alphabets onto a plain IBM PC screen in text mode, and they are just what I can come up with sitting in front of Stack Exchange for a moment. It's perfectly possible that there are additional graphics modes that make this easier in practice.

Also, keep in mind the distinction between text mode and displaying text in a graphics mode. If you are in a graphics mode, perhaps through VESA, which is pretty universally supported, you're on your own as far as drawing character glyphs goes, but you also have a lot more freedom in how to draw them. For example, I'm pretty sure the text-based parts of Windows NT (the product family Windows XP belongs to) use a graphics mode to display text, including the Windows NT 4.0 boot screen and BSODs.
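As a minimal sketch of what "being on your own" means in a graphics mode: switch to the classic 320x200x256 VGA mode 13h via the BIOS and plot a glyph bitmap pixel by pixel. Again this assumes a 16-bit DOS compiler such as Turbo C (dos.h for int86/MK_FP, conio.h for getch), and the glyph itself is a made-up 8x14 box:

```c
#include <dos.h>
#include <conio.h>

static const unsigned char box_glyph[14] = {
    0xFF, 0x81, 0x81, 0x81, 0x81, 0x81, 0x81,
    0x81, 0x81, 0x81, 0x81, 0x81, 0x81, 0xFF
};

static void set_mode(unsigned char mode)
{
    union REGS r;
    r.h.ah = 0x00;   /* BIOS "set video mode" service */
    r.h.al = mode;
    int86(0x10, &r, &r);
}

/* Plot an 8x14 one-bit bitmap into the mode 13h framebuffer at A000h. */
static void draw_glyph(int x, int y, const unsigned char *bitmap,
                       unsigned char color)
{
    unsigned char far *vram = (unsigned char far *)MK_FP(0xA000, 0);
    int row, bit;

    for (row = 0; row < 14; row++)
        for (bit = 0; bit < 8; bit++)
            if (bitmap[row] & (0x80 >> bit))  /* is this pixel set? */
                vram[(y + row) * 320 + (x + bit)] = color;
}

int main(void)
{
    set_mode(0x13);                      /* 320x200, 256 colors   */
    draw_glyph(10, 10, box_glyph, 15);   /* white box, top left   */
    getch();                             /* wait for a keypress   */
    set_mode(0x03);                      /* back to 80x25 text    */
    return 0;
}
```

The upside is obvious: nothing ties you to 9- or 18-pixel-wide cells or to a fixed glyph count any more, at the cost of doing all the drawing yourself.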