What database technologies do big search engines use?

Pigeons.

The heart of Google's search technology is PigeonRank™, a system for ranking web pages developed by Google founders Larry Page and Sergey Brin at Stanford University:

Building upon the breakthrough work of B. F. Skinner, Page and Brin reasoned that low-cost pigeon clusters (PCs) could be used to compute the relative value of web pages faster than human editors or machine-based algorithms. And while Google has dozens of engineers working to improve every aspect of our service on a daily basis, PigeonRank continues to provide the basis for all of our web search tools.

Why Google's patented PigeonRank™ works so well

PigeonRank's success relies primarily on the superior trainability of the domestic pigeon (Columba livia) and its unique capacity to recognize objects regardless of spatial orientation. The common gray pigeon can easily distinguish among items displaying only the minutest differences, an ability that enables it to select relevant web sites from among thousands of similar pages.

By collecting flocks of pigeons in dense clusters, Google is able to process search queries at speeds superior to traditional search engines, which typically rely on birds of prey, brooding hens or slow-moving waterfowl to do their relevance rankings.

When a search query is submitted to Google, it is routed to a data coop where monitors flash result pages at blazing speeds. When a relevant result is observed by one of the pigeons in the cluster, it strikes a rubber-coated steel bar with its beak, which assigns the page a PigeonRank value of one. For each peck, the PigeonRank increases. Those pages receiving the most pecks are returned at the top of the user's results page, with the other results displayed in pecking order.
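(For the record, PigeonRank was Google's April Fools' joke from 2002. Taken literally, though, the scheme above is nothing more than ranking by peck count; a throwaway Python sketch, with made-up page names:)

```python
from collections import Counter

def pecking_order(pecks):
    """Rank pages by total pecks, highest first - a literal reading
    of the (tongue-in-cheek) PigeonRank scheme described above."""
    tally = Counter(pecks)                  # page -> number of pecks
    return [page for page, _ in tally.most_common()]

# each element is one beak-strike on the rubber-coated steel bar
pecks = ["pigeons.example", "maps.example", "pigeons.example"]
print(pecking_order(pecks))  # ['pigeons.example', 'maps.example']
```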


I am sure there is a combination of things:

  • serious hardware
  • lots of it - data is distributed and replicated across many nodes and different data centers

    • (actually in the Google case at least I believe they have thousands and thousands of really low-end servers)
  • a lot of common queries' results are cached - notice how Google pre-populates suggested searches for phrases you know you've never typed before; it's predicting what you might search for and hoping the result is already pre-computed and cached somewhere. In most cases it is - there aren't many searches you could come up with on Google today that nobody has asked before you. When a genuinely new phrase does arrive, they probably fall back on something like free-text search - and I'd expect keywords are extracted semantically when a page is first crawled, rather than hunted for in the document after you've searched. Of course, those caches have to be periodically invalidated, PageRank re-calculated, and the new cached results redistributed across the cache tier - I'm sure there's a lot of serious engineering behind that (a toy caching sketch follows this list).
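None of us outside Google knows their actual cache design, but the caching idea in that last bullet is easy to sketch. Here's a minimal, illustrative Python toy - the `QueryCache` class, the TTL values, and the stand-in search results are all invented for the example:

```python
import time

class QueryCache:
    """Toy query-result cache with time-based invalidation.

    Illustrative only: a real search cache is distributed across many
    machines and invalidated when the underlying index is rebuilt.
    """

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # normalized query -> (timestamp, results)

    def _key(self, query):
        return query.lower().strip()

    def get(self, query):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None                      # miss: run the real search
        stored_at, results = entry
        if time.time() - stored_at > self.ttl:
            return None                      # stale: treat as a miss
        return results

    def put(self, query, results):
        self._store[self._key(query)] = (time.time(), results)


cache = QueryCache(ttl_seconds=600)
query = "database technologies"
if (results := cache.get(query)) is None:
    results = ["result-1", "result-2"]       # stand-in for the real search
    cache.put(query, results)
```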

It's important to bear in mind a couple of things about Google:

  • Their DB is the proprietary BigTable - it was custom-designed BY GOOGLE to fit their needs exactly

  • Their proprietary DB is built on top of their proprietary file system - the Google File System - which was designed, again BY GOOGLE, to be easily expandable using commodity hardware. As Aaron mentioned in his answer, they have a large number of average servers instead of a small number of very powerful ones (a toy chunk-replication sketch follows this list).
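As a rough illustration of that "lots of cheap machines" design, here's a toy, GFS-flavoured placement routine in Python. The 64 MB chunk size and 3-way replication match what Google's GFS paper describes; everything else (the function name, the hash-based spread) is invented for the sketch:

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024   # the GFS paper describes 64 MB chunks
REPLICAS = 3                    # and a default of 3 replicas per chunk

def place_chunks(file_size, servers, replicas=REPLICAS):
    """Toy GFS-flavoured placement: split a file into fixed-size chunks
    and assign each chunk to `replicas` distinct commodity servers.

    Illustrative only - the real master also weighs disk usage,
    rack locality, and recent load when picking chunkservers.
    """
    n_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    placement = {}
    for chunk_id in range(n_chunks):
        # deterministic pseudo-random spread across the fleet
        h = int(hashlib.md5(str(chunk_id).encode()).hexdigest(), 16)
        start = h % len(servers)
        placement[chunk_id] = [servers[(start + i) % len(servers)]
                               for i in range(replicas)]
    return placement

servers = [f"chunkserver-{i:03d}" for i in range(100)]
plan = place_chunks(200 * 1024 * 1024, servers)   # 200 MB -> 4 chunks
print(plan[0])                                     # 3 replica hosts for chunk 0
```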

They store individual tables split across multiple machines to make access quicker - their software knows which data lives on which machine, so instead of thrashing through a disk to locate it, a client can go straight to the server holding the relevant info.
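That last point - the software knowing which machine holds which data - is essentially a location lookup over key ranges. A minimal Python sketch of the idea (in real BigTable, clients resolve tablet locations through a metadata hierarchy and cache them; the hard-coded split keys and server names here are purely illustrative):

```python
import bisect

class TabletRouter:
    """Toy BigTable-flavoured router: map a row key straight to the
    server holding that key range, instead of asking every machine.

    Illustrative only - real clients look tablet locations up through
    a metadata hierarchy and cache them, not a hard-coded table.
    """

    def __init__(self, split_keys, servers):
        # split_keys are sorted range boundaries; range i lives on servers[i]
        assert len(servers) == len(split_keys) + 1
        self.split_keys = split_keys
        self.servers = servers

    def locate(self, row_key):
        # binary-search the boundaries: O(log n), no disk thrashing
        idx = bisect.bisect_right(self.split_keys, row_key)
        return self.servers[idx]

router = TabletRouter(split_keys=["g", "p"],
                      servers=["tablet-a", "tablet-b", "tablet-c"])
print(router.locate("example.com"))  # -> tablet-a (keys before "g")
```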