Is there an automatic process for creating an index?

I suggest you look at the script make-index.py (and related files) in the scripts folder of the download page at the Stacks Project (http://www.math.columbia.edu/algebraic_geometry/stacks-git/). The index it generates isn't really ideal, but their strategy will at least give you some idea of how to get started. They seem to take the approach that, in a gigantic math textbook, the things which most deserve to be in the index are the italicized words or phrases in each definition environment. In my experience using math books, the most common reason I look something up in the index is to learn its definition, so this seems appropriate, although maybe not for books in other subjects. However, you might be able to use the Stacks Project script as a guide to automating the creation of an index which suits your own needs, even if they are very different.
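
To make the idea concrete, here is a minimal sketch of that kind of extraction. It is not taken from make-index.py; it simply assumes that definitions live in an environment called definition and that the defined terms are marked with \emph or \textit, and the function name defined_terms is made up for illustration.

```python
import re

# Assumption: definitions are wrapped in \begin{definition} ... \end{definition}.
DEFINITION = re.compile(r"\\begin\{definition\}(.*?)\\end\{definition\}", re.DOTALL)

# Assumption: defined terms are marked with \emph{...} or \textit{...}
# and contain no nested braces.
EMPHASIS = re.compile(r"\\(?:emph|textit)\{([^{}]*)\}")


def defined_terms(tex):
    """Return the emphasized phrases found inside definition environments."""
    terms = []
    for body in DEFINITION.findall(tex):
        terms.extend(EMPHASIS.findall(body))
    return terms


if __name__ == "__main__":
    sample = (
        r"\begin{definition} A \emph{stack} is a certain kind of sheaf. "
        r"\end{definition} Some \emph{other} remark."
    )
    # Only the emphasized phrase inside the definition is reported.
    for term in defined_terms(sample):
        print(term)
```

Emphasized text outside a definition is ignored on purpose, so the output is a list of candidate entries that you would still want to review by hand.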


As others have mentioned, trying to automate this task would be close to impossible. But if you want some very rough hints about which words to index, this is something I would try (note: it requires some scripting):

Use detex or a similar tool to strip the TeX markup, and then write a small script that counts the number of times each word is used in the document. The top words in the list will probably be useless words like a, the, is, etc. But after those, you might be able to find a few promising words.
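
The counting step could look roughly like the sketch below. It assumes the markup has already been stripped (for instance by detex) and that the plain text is passed in as a string; the stop-word list and the function name candidate_words are illustrative placeholders, not part of any existing tool.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; extend it for real use.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "of", "and", "to", "in", "that",
    "we", "for", "it", "be", "on", "with", "as", "by", "this",
}


def candidate_words(text, top=20, min_length=3):
    """Return the `top` most frequent words that are not stop words."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter(
        w for w in words if len(w) >= min_length and w not in STOP_WORDS
    )
    return counts.most_common(top)


if __name__ == "__main__":
    import sys

    # Reads already-detexed text from standard input.
    for word, count in candidate_words(sys.stdin.read()):
        print(f"{count:6d}  {word}")
```

Assuming the script is saved as count_words.py (a made-up name), something like `detex book.tex | python count_words.py` would print the most frequent remaining words, which you can then skim for index candidates.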


In addition to what Juan A. Navarro suggested, I'd say that words which occur in chapter and section titles are likely candidates for indexing. For example, if section 2.3 is entitled "The Virasoro Algebra", then that's probably a sufficiently important topic that other occurrences of "Virasoro algebra" should be indexed. You could write a script (in your favourite scripting language) to pull out the arguments to \section commands and the like, throw out the prepositions and articles, and sort the remainder by frequency. How your script will know that the words Virasoro and algebra go together . . . well, either you call import skynet and live with the consequences, or you do some manual work with its output.
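
A rough sketch of such a script follows. It assumes the standard sectioning commands (\chapter, \section, \subsection, \subsubsection), titles without nested braces, and a small hand-written list of articles and prepositions; the names heading_titles and heading_words are invented for this example.

```python
import re
from collections import Counter

# Matches \section{...}, \section*{...}, \section[short]{...} and the
# analogous chapter/subsection forms. Assumption: no nested braces in titles.
HEADING = re.compile(
    r"\\(?:chapter|section|subsection|subsubsection)\*?(?:\[[^\]]*\])?\{([^{}]*)\}"
)

# Hypothetical list of articles and prepositions to discard.
SMALL_WORDS = {"a", "an", "the", "of", "on", "in", "to", "for", "and", "with", "by"}


def heading_titles(tex):
    """Return the titles passed to sectioning commands, in document order."""
    return HEADING.findall(tex)


def heading_words(tex):
    """Return (word, count) pairs for heading words, most frequent first."""
    counts = Counter()
    for title in heading_titles(tex):
        for word in re.findall(r"[A-Za-z]+", title.lower()):
            if word not in SMALL_WORDS:
                counts[word] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = (
        r"\section{The Virasoro Algebra} Some text. "
        r"\subsection*{Representations of the Virasoro Algebra}"
    )
    print(heading_titles(sample))
    print(heading_words(sample))
```

Keeping the full titles from heading_titles next to the word counts makes the manual step easier: you can see at a glance that "virasoro" and "algebra" came from the same heading.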

Other things to check could include words which are capitalized when not at the beginning of a sentence and words set in emphatic type.
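
The capitalization check can be approximated as below. This is only a sketch: it works on text with the markup already stripped, splits sentences crudely on ., ! and ? followed by whitespace, and will be fooled by abbreviations; the function name midsentence_capitals is made up.

```python
import re
from collections import Counter


def midsentence_capitals(text):
    """Count capitalized words that are not the first word of a sentence."""
    counts = Counter()
    # Crude sentence split: punctuation followed by whitespace.
    for sentence in re.split(r"[.!?]\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        # Skip the first word, which is capitalized regardless.
        for word in words[1:]:
            if word[0].isupper():
                counts[word] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = (
        "The Virasoro algebra is infinite. "
        "It extends the Witt algebra. Virasoro found it."
    )
    print(midsentence_capitals(sample))
```

Proper names that only ever appear at the start of a sentence are missed by this check, so it works best as a supplement to the other heuristics.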