How to get tens of millions of pages indexed by Google bot?

Some potential strategies:

  • Google Webmaster Tools allows you to request an increased crawl rate. Try doing that if you haven't already.
  • Take another look at your navigation architecture to see if you can't improve access to more of your content. Look at it from a user's perspective: If it's hard for a user to find a specific piece of information, it may be hard for search engines as well.
  • Make sure you don't have duplicate content because of inconsistent URL parameters or improper use of slashes. By eliminating duplicate content, you cut down on the time Googlebot spends crawling something it has already indexed (see the URL-normalization sketch after this list).
  • Use related content links and in-site linking within your content whenever possible.
  • Randomize some of your links. A sidebar with random internal content is a great pattern to use.
  • Use dates and other microformats.
  • Use RSS feeds wherever possible. RSS feeds will function much the same as a sitemap (in fact, Webmaster Tools allows you to submit a feed as a sitemap).
  • Regarding sitemaps, see this question. A rough sitemap-index sketch also follows at the end of this list.
  • Find ways to get external links to your content. This may accelerate the process of it getting indexed. If it's appropriate to the type of content, making it easy to share socially or through email will help with this.
  • Provide an API to incentivize use of your data and external links to it. You can require an attribution link as a condition of using the data.
  • Embrace the community. If you reach out to the right people in the right way, you'll get external links via blogs and Twitter.
  • Look for ways to create a community around your data. Find a way to make it social. APIs, mashups, and social widgets all help, but so do a blog, community showcases, forums, and gaming mechanics (also, see this video).
  • Prioritize which content you have indexed. With that much data, not all of it is going to be absolutely vital. Make a strategic decision as to which content is most important, e.g., which will be most popular, which has the best chance at ROI, or which will be the most useful, and make sure that content is indexed first.
  • Do a detailed analysis of what your competitors are doing to get their content indexed. Look at their site architecture, their navigation, their external links, etc.
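
On the duplicate-content point above: here is a minimal Python sketch of one way to normalize URLs before generating links or sitemaps, so that the same page is always referenced in a single form. The parameter names and rules here (lowercase host, dropped tracking parameters, stripped trailing slash) are assumptions to adapt to your own URL scheme; you would also emit the same canonical form in a rel="canonical" tag.

    # Minimal sketch: normalize URLs so one page has exactly one canonical form.
    # The tracking-parameter list and slash handling are assumptions; adjust
    # them to match your own site.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}

    def canonicalize(url: str) -> str:
        scheme, netloc, path, query, _fragment = urlsplit(url)
        netloc = netloc.lower()
        # Drop parameters that make the same content look like different URLs.
        kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
                if k not in TRACKING_PARAMS]
        kept.sort()                      # stable parameter order
        path = path.rstrip("/") or "/"   # consistent trailing-slash handling
        return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

    print(canonicalize("https://Example.com/widgets/?utm_source=x&color=red"))
    # -> https://example.com/widgets?color=red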
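On the sitemap point: with tens of millions of URLs you will run into the 50,000-URL limit per sitemap file, so you need a sitemap index that points at many smaller sitemaps. Below is a rough Python sketch, assuming your URLs can be streamed from a flat file (urls.txt, example.com, and the output file names are placeholders); the resulting sitemap-index.xml is what you submit in Webmaster Tools / Search Console.

    # Rough sketch: split a large URL list into 50,000-URL sitemap files plus
    # one sitemap index. File names and the site root are placeholders.
    from itertools import islice

    MAX_URLS = 50_000
    BASE = "https://example.com"   # assumed site root

    def write_sitemaps(url_file: str = "urls.txt") -> None:
        sitemap_names = []
        with open(url_file) as f:
            part = 0
            while True:
                chunk = list(islice(f, MAX_URLS))
                if not chunk:
                    break
                part += 1
                name = f"sitemap-{part}.xml"
                sitemap_names.append(name)
                with open(name, "w") as out:
                    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
                    out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
                    for url in chunk:
                        out.write(f"  <url><loc>{url.strip()}</loc></url>\n")
                    out.write("</urlset>\n")
        # The index file is the one you actually submit.
        with open("sitemap-index.xml", "w") as out:
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            out.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for name in sitemap_names:
                out.write(f"  <sitemap><loc>{BASE}/{name}</loc></sitemap>\n")
            out.write("</sitemapindex>\n")

    write_sitemaps()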

Finally, I should say this: SEO and indexing are only small parts of running a business site. Don't lose focus on ROI for the sake of SEO. Even if you get a lot of traffic from Google, it doesn't matter if you can't convert it. SEO is important, but it needs to be kept in perspective.

Edit:

As an addendum to your use case: you might consider offering reviews or testimonials for each person or business. Also, giving out user badges like StackOverflow does could entice at least some people to link to their own profile on your site. That would encourage some outside linking to your deep pages, which could mean getting indexed quicker.


How to get tens of millions of pages indexed by Google bot?

It won't happen overnight. However, I guarantee that you would see more of your pages spidered sooner if inbound links to deep content (particularly sitemap pages or directory indexes that point to yet deeper content) were added from similarly large sites that have been around for a while.

Will an older domain be sufficient to get 100,000 pages indexed per day?

Doubtful, unless you're talking about an older domain that has had a significant amount of activity on it (i.e. accumulated content and inbound links) over the years.

Are there any SEO consultants who specialize in aiding the indexing process itself?

When you pose the question that way, I'm sure you'll find plenty of SEOs who will loudly proclaim "yes!" But, at the end of the day, Virtuosi Media's suggestions are as good as any advice you'll get from them (to say nothing of the potentially bad advice).

From the sound of it, you should consider utilizing business development and public relations channels to build your site's ranking at this point: get more links to your content (preferably by partnering with an existing site that offers regionally targeted content to link into your regionally divided content, for example), get more people browsing your site (some will have the Google Toolbar installed, so their traffic may work toward page discovery), and, if possible, get your business talked about in the news or in communities of people who have a need for it (if you plan to charge for certain services, consider advertising a free trial period to draw interest).


There are two possible options I know of that may be of some assistance.

One: A little trick I tried with a website that had three million pages, and which worked surprisingly well, was what my colleague coined a "crawl loop." You may have to adapt the idea a bit to make it fit your site.

Basically, we picked a day when we didn't think we would be getting much traffic (Christmas), copied a list of every single link on our site, and pasted every single one into a PHP file that was called on every single webpage (the sidebar PHP file).

We then proceeded to go to Google Search Console (formerly Google Webmaster Tools) and told Google to fetch a URL and crawl every single link on that URL's page.

Since you have so many links, and the pages those link to also have an abundance of links, Googlebot goes into a bit of a loop and crawls the site much more quickly. I was skeptical at first, but it worked like a charm.

Before you do this, make sure you have an extremely efficient database setup and a very powerful server; otherwise it could either overload the server or hurt your SEO due to slow page load times.
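
To make the idea concrete, here is a rough sketch of generating that link dump offline, assuming your URLs can be read from a flat file (urls.txt and crawl-links.html are placeholder names). The original trick used a PHP sidebar include; writing the fragment out as a static file ahead of time keeps the per-request cost down.

    # Rough sketch: build a static HTML fragment containing every internal link,
    # to be pulled in by the sitewide sidebar include. "urls.txt" and
    # "crawl-links.html" are placeholders; generating the file offline avoids
    # running a huge query on every page view.
    import html

    def build_link_dump(url_file: str = "urls.txt", out_file: str = "crawl-links.html") -> None:
        with open(url_file) as f, open(out_file, "w") as out:
            out.write("<ul>\n")
            for line in f:
                url = line.strip()
                if url:
                    out.write(f'  <li><a href="{html.escape(url)}">{html.escape(url)}</a></li>\n')
            out.write("</ul>\n")

    build_link_dump()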

If that isn't an option for you, you can always look into Google's Cloud Console APIs. They include a Search Console API, so you could write a script to either add each webpage as its own website instance in Search Console or have Google fetch every single one of your URLs.

The APIs can get complicated extremely quickly, but they are an amazing tool when used right.
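
For what it's worth, here is a sketch of what such a script might look like using the google-api-python-client library and the Search Console (Webmasters v3) API. The service-account file, site URL, and sitemap path are placeholders, and the service account's email has to be added as a user of the property in Search Console before these calls will succeed.

    # Sketch: add a property and submit a sitemap through the Search Console
    # (Webmasters v3) API. File names and URLs are placeholders.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    SCOPES = ["https://www.googleapis.com/auth/webmasters"]
    creds = service_account.Credentials.from_service_account_file(
        "service-account.json", scopes=SCOPES)
    service = build("webmasters", "v3", credentials=creds)

    site = "https://example.com/"

    # Add the property (it still needs to be verified separately) ...
    service.sites().add(siteUrl=site).execute()

    # ... then submit the sitemap index so Google discovers every URL in it.
    service.sitemaps().submit(
        siteUrl=site,
        feedpath="https://example.com/sitemap-index.xml").execute()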

Good luck!