Efficient raster sampling of billions of polygons (bounding boxes)

Your timeit includes the numpy import, which adds overhead to the measurement. Why not run the code over a subset of the bounding boxes, time just that loop, and multiply up to estimate the total running time?
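For example (a sketch with a made-up `burn_bbox` step standing in for your real rasterization code - the sizes and counts below are placeholders), time a 10,000-box subset and scale up:

```python
import timeit

import numpy as np

# Hypothetical stand-in for the real per-box rasterization step:
# burn one bounding box (row/col extents) into a counts raster.
def burn_bbox(raster, r0, r1, c0, c1):
    raster[r0:r1, c0:c1] += 1

rng = np.random.default_rng(0)
raster = np.zeros((1000, 1000), dtype=np.int32)

# Random subset of boxes stored as (row0, height, col0, width).
n_subset = 10_000
boxes = rng.integers(0, 500, size=(n_subset, 4))

def run_subset():
    for r0, dr, c0, dc in boxes:
        burn_bbox(raster, r0, r0 + dr, c0, c0 + dc)

# Time only the loop itself - imports and setup are excluded.
seconds = timeit.timeit(run_subset, number=1)

# Extrapolate to the full workload (say 5 billion boxes).
n_total = 5_000_000_000
estimate_hours = seconds * (n_total / n_subset) / 3600
print(f"{seconds:.4f} s for {n_subset:,} boxes -> ~{estimate_hours:.1f} h total")
```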

A single-process solution is inherently serial, and with such a relatively simple operation you might not squeeze any significant optimization out of an already simple algorithm. You could instead divide the work up in a sort of manual map-reduce (I know you have a "no map-reduce" caveat) and run as many instances as you have cores. Mosaicking/merging the n output rasters (the reduce step) is a trivially fast operation, and this will probably be less painful to code than a multi-threaded solution.
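A minimal sketch of that manual map-reduce, assuming boxes are already in raster row/column coordinates (all names and sizes here are made up); in practice each `rasterize_chunk` call would be one independent process or machine working on its own slice of the input:

```python
import numpy as np

H, W = 1000, 1000  # hypothetical output raster size

def rasterize_chunk(boxes):
    # "Map" step: burn one chunk of boxes into its own blank raster.
    # Run one of these per core (or per machine) on a slice of the input.
    raster = np.zeros((H, W), dtype=np.int32)
    for r0, r1, c0, c1 in boxes:
        raster[r0:r1, c0:c1] += 1
    return raster

def merge_rasters(partials):
    # "Reduce" step: merging n count rasters is a cheap elementwise sum.
    # For presence/absence output use np.maximum.reduce(partials) instead.
    return np.sum(partials, axis=0)

# Demo on toy data: split the boxes 4 ways, rasterize each, merge.
rng = np.random.default_rng(0)
rows = np.sort(rng.integers(0, H, size=(1000, 2)), axis=1)
cols = np.sort(rng.integers(0, W, size=(1000, 2)), axis=1)
boxes = np.column_stack([rows, cols])  # (r0, r1, c0, c1) per row

chunks = np.array_split(boxes, 4)
partials = [rasterize_chunk(c) for c in chunks]
merged = merge_rasters(partials)
```

Because addition commutes, the merged result is identical to rasterizing all boxes in one pass - only the wall-clock time changes.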

Alternatively (or additionally), you could write a program to combine certain bounding boxes such as overlapping or nested ones - this would require a spatial index. If you don't have one, you may find creating one beneficial, especially if you end up locally parallelizing the main algorithm.
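As an illustration of what that combining step might look like, here is a hand-rolled sketch that drops boxes fully nested inside another, using a coarse grid as the spatial index (the function name and cell size are my assumptions; for production, a library R-tree such as the `rtree` package or Shapely's `STRtree` would be a better choice):

```python
from collections import defaultdict

def drop_nested_boxes(boxes, cell=100):
    """Remove boxes fully contained in another box. Boxes are
    (xmin, ymin, xmax, ymax) tuples; `cell` is the grid cell size."""
    # Largest boxes first: a box can only be nested inside a bigger one.
    order = sorted(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]),
                   reverse=True)
    grid = defaultdict(list)  # (gx, gy) -> kept boxes covering that cell
    kept = []
    for x0, y0, x1, y1 in order:
        # Any container must cover this box's lower-left corner, so we
        # only need to check kept boxes registered in that one cell.
        cx, cy = int(x0 // cell), int(y0 // cell)
        if any(k[0] <= x0 and k[1] <= y0 and k[2] >= x1 and k[3] >= y1
               for k in grid[(cx, cy)]):
            continue  # nested inside an already-kept box: drop it
        kept.append((x0, y0, x1, y1))
        # Register the kept box in every grid cell it covers.
        for gx in range(int(x0 // cell), int(x1 // cell) + 1):
            for gy in range(int(y0 // cell), int(y1 // cell) + 1):
                grid[(gx, gy)].append((x0, y0, x1, y1))
    return kept

kept = drop_nested_boxes([(0, 0, 10, 10), (2, 2, 5, 5), (20, 20, 30, 30)])
print(kept)  # the nested (2, 2, 5, 5) box is gone
```

Merging (rather than just dropping) overlapping boxes changes the counts in the overlap, so it only applies if your output is presence/absence rather than a density surface.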

Also, don't dismiss multi-computer parallelization out of hand. If your best estimate is over a year, add up how much your time will cost running the single-computer version and weigh that against hiring some cloud-compute time. As @whuber says, 1024 GPUs will chomp through the data so quickly that it'll cost you next to nothing, even if you spend a week getting your head round CUDA. If it's your boss prohibiting you from trying it on more than one computer, do the cost analysis and hand him some hard numbers - he can then weigh the value of the data against the value of your time.
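That cost analysis can be a few lines of arithmetic - every number below is a deliberately made-up placeholder to swap for your own estimates:

```python
# Back-of-envelope cost comparison. All inputs are hypothetical:
# plug in your own runtime estimate, loaded wage, and cloud pricing.
single_machine_days = 400        # assumed serial runtime estimate
hourly_wage = 40.0               # assumed cost of your time, $/h
babysit_hours_per_day = 0.5      # assumed time spent minding the job

cloud_instances = 8              # assumed fleet size
cloud_speedup = 200              # assumed effective overall speedup
cloud_rate = 3.0                 # assumed $/instance-hour
porting_hours = 40               # assumed week getting your head round CUDA

serial_cost = single_machine_days * babysit_hours_per_day * hourly_wage
cloud_hours = single_machine_days * 24 / cloud_speedup
cloud_cost = (cloud_hours * cloud_instances * cloud_rate
              + porting_hours * hourly_wage)

print(f"single machine: ~${serial_cost:,.0f} of your time "
      f"over {single_machine_days} days")
print(f"cloud:          ~${cloud_cost:,.0f}, done in "
      f"{cloud_hours / 24:.1f} days")
```

Even when the dollar totals are close, finishing in days rather than a year is usually worth something on its own.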