Python file indexing and searching

Lupy has been retired, and its developers recommend PyLucene instead. PyLucene's mailing list may be quiet, but the project is definitely supported. In fact, it recently became an official Apache subproject.

You may also want to look at a newer contender: Whoosh. It's similar to Lucene, but implemented in pure Python.
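
To give a feel for what Lucene-style libraries do under the hood, here is a rough sketch of an inverted index using only the standard library. It is not the Whoosh or PyLucene API; `tokenize`, `build_index`, and `search` are names made up for illustration, and the tokenizer is deliberately naive (no stemming, stop words, or ranking).

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical, deliberately simple tokenizer: runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def build_index(paths):
    """Map each term to the set of file paths whose contents include it."""
    index = defaultdict(set)
    for path in paths:
        # Undecodable bytes are skipped so one odd file does not abort the run.
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        for term in tokenize(text):
            index[term].add(str(path))
    return index


def search(index, query):
    """Return the sorted paths that contain every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return []
    # Copy the first posting set so the index itself is never modified.
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)
```

You would call it along the lines of `index = build_index(Path("docs").rglob("*.txt"))` followed by `search(index, "python lucene")`, where the `docs` directory is just a placeholder. A real library adds on-disk storage, relevance scoring, and incremental updates on top of this idea, which is why Whoosh or PyLucene is still the better choice for anything beyond a toy.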


I haven't done indexing before, but the following may be helpful:

  1. pyIndex - http://rgaucher.info/beta/pyIndex/ - a file indexing library for Python
  2. http://www.xml.com/pub/a/ws/2003/05/13/email.html - a script for searching Outlook email using Python and Lucene
  3. http://gadfly.sourceforge.net/ - Aaron Watters's Gadfly database (I think you can use this one for indexing, but I haven't used it myself.)

As far as using HDF files goes, I have heard of a module called h5py, which is a Python interface to HDF5.

I hope this helps.


I'd suggest Sphinx (the full-text search engine). It's very actively developed, has many more features, and seems faster than Lucene.