Open source command line tools for indexing a large number of text files

The best thing you could do is load the text files into a MySQL database and use its full-text search (FULLTEXT indexes). This gives very fast searches, with results ranked by how well they match the query.

Interfacing the MySQL database with other systems, such as a website for document searching, would then be simple enough.
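
As a rough illustration of the approach described above, here is a minimal Python sketch that prepares the SQL involved. The table name `documents`, its columns, and the helper names `iter_inserts` and `search_query` are all made up for this example. It only builds statements and parameters and does not connect to a server, so it assumes you pass them to whatever MySQL driver you use.

```python
import os

# Hypothetical schema: table and column names are illustrative only.
# MyISAM is named explicitly because older MySQL versions (such as 5.0)
# only support FULLTEXT indexes on MyISAM tables.
CREATE_TABLE = """
CREATE TABLE documents (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    path VARCHAR(255) NOT NULL,
    body LONGTEXT NOT NULL,
    FULLTEXT (body)
) ENGINE=MyISAM
"""

# %s placeholders assume a driver that uses that parameter style.
INSERT = "INSERT INTO documents (path, body) VALUES (%s, %s)"

# MATCH ... AGAINST returns a relevance score, used here for ranking.
SEARCH = (
    "SELECT path, MATCH(body) AGAINST (%s) AS score "
    "FROM documents WHERE MATCH(body) AGAINST (%s) "
    "ORDER BY score DESC"
)


def iter_inserts(root):
    """Yield (sql, params) pairs, one per file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Undecodable bytes are replaced rather than raising an error.
            with open(path, encoding="utf-8", errors="replace") as handle:
                yield INSERT, (path, handle.read())


def search_query(terms):
    """Return the (sql, params) pair for a ranked full-text search."""
    return SEARCH, (terms, terms)
```

With a DB-API style driver you would run `CREATE_TABLE` once, call `cursor.execute(sql, params)` for each pair from `iter_inserts("/path/to/files")`, and then execute the pair returned by `search_query("some words")`.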

Useful resources:

  • MySQL basics: http://news.softpedia.com/news/MySQL-Basic-Usage-Guide-37081.shtml
  • How to use full text searching: http://devzone.zend.com/article/1304
  • MySQL Full Text Searching manual: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

If you want to search for files by file name:

The standard Unix tool for this is locate. A cron job (running updatedb) periodically builds a database of file names, and locate then searches that database instead of walking the file system.

It's part of most Linux distributions (usually package "locate" or "mlocate").
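
To make the idea concrete, here is a small Python sketch of the same two-step pattern: build a file-name database once, then search it cheaply. It is only an illustration and not how locate itself stores its data; the function names `build_db` and `search_db` and the one-path-per-line file format are assumptions made for this example.

```python
import os


def build_db(root, db_path):
    """Walk root and write every file path to db_path, one per line.

    Returns the number of paths written. Paths containing newlines are
    not handled by this simple format.
    """
    count = 0
    with open(db_path, "w", encoding="utf-8", errors="surrogateescape") as db:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                db.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count


def search_db(db_path, pattern):
    """Return every stored path that contains pattern as a substring."""
    with open(db_path, encoding="utf-8", errors="surrogateescape") as db:
        return [line.rstrip("\n") for line in db if pattern in line]
```

Calling `build_db` from a scheduled job plays the role of the cron step, and `search_db(db_path, "report")` plays the role of the lookup. Keep the database file outside the directory being indexed, otherwise it lists itself.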

If you want to search for files by content:

There are a variety of search engines available that will index documents for you (some even support formats other than plain text, e.g. word processor documents). Examples would be Beagle and Google Desktop Search. There's a fairly exhaustive list on Wikipedia:

http://en.wikipedia.org/wiki/List_of_search_engines#Desktop_search_engines

Edit:

If you don't want a search engine that runs in the background or automatically indexes all your files, you can probably still use a desktop search engine. Most of them let you control the indexing process, so you can start the indexing manually and specify which directories to index and where to put the index file.
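
For a sense of what such a manually triggered index does, here is a minimal Python sketch of an inverted index built from directories you choose and saved to a file you choose. It is not taken from any of the tools mentioned; the function names `build_index` and `search`, the JSON index format, and the simple word-splitting rule are all assumptions for illustration.

```python
import json
import os
import re
from collections import defaultdict

# Simplistic tokenizer: runs of letters, digits and underscores.
WORD = re.compile(r"\w+")


def build_index(directories, index_path):
    """Index every file under the given directories into index_path."""
    index = defaultdict(set)
    for directory in directories:
        for dirpath, _dirnames, filenames in os.walk(directory):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Undecodable bytes are replaced rather than raising.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for word in WORD.findall(handle.read().lower()):
                        index[word].add(path)
    with open(index_path, "w", encoding="utf-8") as out:
        json.dump({word: sorted(paths) for word, paths in index.items()}, out)


def search(index_path, query):
    """Return the sorted paths of files containing every word in query."""
    with open(index_path, encoding="utf-8") as handle:
        index = json.load(handle)
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)
```

Running `build_index(["/path/to/docs"], "/path/to/docs.index.json")` by hand corresponds to starting the indexing manually, and `search` only ever reads the saved index file. A real tool would add ranking, stemming, and incremental updates, which this sketch leaves out.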


I found what I was looking for. Swish++ can index a directory of files (not just text), and is basically a set of command-line tools (index++ to build the index, search++ to query it). It appears to be a rewrite of Swish-e.