How to generate an automatic index (concordance) in a large file?

The output file should look like this: http://www.gurt-der-wahrheit.org/files/konkordanz_schlachter_1951_A4.pdf

Sure, it is possible. How about this?

[image: page 2 of the generated concordance]

The complete concordance, of which the above is page 2, was generated by the following file (compile with lualatex rather than pdflatex):

\documentclass[a4paper]{article}
\usepackage{luatex85} % a4paper doesn't seem to take effect otherwise
\usepackage[margin=1cm, top=0.5cm, footskip=0.5cm]{geometry}
\usepackage{fontspec}
\setmainfont{Arial}

\begin{document}
\parindent=0pt \twocolumn \scriptsize
\pretolerance=-1 \sloppy  % Skip the first pass, and avoid overfull boxes
\spaceskip=\fontdimen2\font plus 2\fontdimen3\font minus \fontdimen4\font % Fewer underfull warnings, by allowing more stretch
\directlua{dofile('concordance.lua')}
\directlua{words, locations = concordance('bibschl.txt', 'latin1')}
\directlua{printConcordance(words, locations, {minLength=2, otherExclusions={'Aaron','RECHT','zorn'}})}
\end{document}

where concordance.lua is what generates the concordance (for each word, find all the places where it occurs) and typesets individual entries (bold keys, semicolons separating the locations, etc.):

function concordance(filename, encoding)
   -- Given a file that has the following structure:
   --         <blank line>
   --         BookName<space>ChapterNumber
   --         VerseNumber<space>Verse
   --         VerseNumber<space>Verse
   --         ...
   --         <blank line>
   --         BookName<space>ChapterNumber
   --         ...
   -- (Each verse itself is a sequence of space-separated words, ignoring case and trailing punctuation.)
   -- Returns two tables: (1) the words, in sorted order, and (2) mapping words to locations (book, chapter, verse)
   local readBookNext = true  -- Whether the *next* line contains Book & Chapter
   local currentBook = ''
   local currentChapter = 0
   local concordanceTable = {}
   for line in io.lines(filename) do
      line = makeUTF8(line, encoding)  -- Just in case encoding='latin1'
      if line == '' then readBookNext = true
      elseif readBookNext then
         currentBook, currentChapter = string.match(line, '^(.*) ([0-9]*)$')
         readBookNext = false
      else
         local verseNumber, verse = string.match(line, '^([0-9]*) (.*)$')
         for word in string.gmatch(verse, '%S+') do
            addWordToConcordance(word, currentBook, currentChapter, verseNumber, concordanceTable)
         end
      end
   end
   local keys = {}
   for word, _ in pairs(concordanceTable) do table.insert(keys, word) end
   table.sort(keys)
   return keys, concordanceTable
end

local badFirsts = {['¶'] = true, ['«'] = true, ['-'] = true, ['<']=true, ['(']=true, [',']=true}
local badLasts = {['.']=true, [',']=true, [':']=true, ['!']=true, [';']=true, ['?']=true, ['»']=true, [')']=true, ["'"]=true, ['>']=true, ['`']=true}
function addWordToConcordance(origWord, book, chapter, verse, concordanceTable)
   -- In `concordanceTable`, adds (book, chapter, verse) to the entry for word
   local word = unicode.utf8.upper(origWord)
   while badFirsts[unicode.utf8.sub(word, 1, 1)] do word = unicode.utf8.sub(word, 2) end  -- Strip leading punctuation
   while badLasts[unicode.utf8.sub(word, -1)] do word = unicode.utf8.sub(word, 1, -2) end -- Strip trailing punctuation
   if string.match(word, '^[0-9-]*B?$') then return end  -- Ignore empty words and words like "42-3" or "29-39B"
   local list = concordanceTable[word] or {}
   table.insert(list, {book=book, chapter=chapter, verse=verse})
   concordanceTable[word] = list
end

function makeUTF8(line, encoding)
   -- Converts text `line` from latin1 (ISO-8859-1) to UTF-8, if necessary.
   if encoding == 'utf8' or encoding == nil then
      return line
   elseif encoding == 'latin1' then
      local utf8Line = ''
      for c in string.gmatch(line, '.') do utf8Line = utf8Line .. unicode.utf8.char(string.byte(c)) end
      return utf8Line
   else
      error(string.format('Unknown encoding "%s"', encoding))
   end
end

--------------------------------------------------------------------------------
-- Above are functions that generate the concordance; below are functions for injecting that into TeX
--------------------------------------------------------------------------------
function printConcordance(words, locations, options)
   options = options or {}
   local includeThreshold = options['includeThreshold'] or 300 -- Words that occur too often are dropped
   local breakThreshold = options['breakThreshold'] or 1000    -- A "paragraph" break is added after enough entries
   local minLength = options['minLength'] or 1                 -- Words shorter than this length are dropped
   local otherExclusions = {}                                  -- Words in this table are dropped
   for _, ex in ipairs(options['otherExclusions'] or {}) do
      otherExclusions[unicode.utf8.upper(ex)] = true
   end
   local numPrinted = breakThreshold + 100  -- more than breakThreshold: we want a “break” before the first word
   for _, word in ipairs(words) do
      local n = #locations[word]
      if n > includeThreshold then
         print(string.format('Dropping word %s (occurs %d times)', word, n))
      elseif unicode.utf8.len(word) < minLength then
         print(string.format('Dropping word %s (its length %d is less than %d)', word, unicode.utf8.len(word), minLength))
      elseif otherExclusions[word] ~= nil then
         print(string.format('Dropping word %s (it was specified as an exclusion)', word))
      else
         -- Glue between entries, emitted only for words that are actually printed
         tex.print([[\hskip 1.5\fontdimen2\font plus 5\fontdimen3\font minus \fontdimen4\font]])
         if numPrinted > breakThreshold then
            tex.print(string.format([[\par\underline{\textbf{%s}}\par]], word))
            numPrinted = 0
         else
            tex.print(string.format([[\textbf{%s}]], word))
         end
         numPrinted = numPrinted + n
         for i, v in ipairs(locations[word]) do
            if i > 1 then tex.sprint('; ') end
            tex.sprint(string.format('%s%s:%s', abbrev(v.book), v.chapter, v.verse))
         end
      end
   end
end

-- Abbreviations that aren't just first 3 letters
local knownBooks = {Richter='Ri', Ruth='Rt', Hiob='Hi', Psalmen='Ps', Hohelied='Hld', Klagelieder='Klg',
                    Amos='Am', Zephania='Zef', Matthäus='Mt', Lukas='Lk', Apostelgeschichte='Apg', Philemon='Phm'}
function abbrev(book)
   if knownBooks[book] ~= nil then return knownBooks[book] end
   -- First 3 letters, but if 2nd letter is a space then ignore it.
   if string.sub(book, 2, 2) == ' ' then
      knownBooks[book] = unicode.utf8.sub(book, 1, 1) .. unicode.utf8.sub(book, 3, 4)
   else
      knownBooks[book] = unicode.utf8.sub(book, 1, 3)
   end
   return knownBooks[book]
end

To change what TeX sees, you can change the printConcordance function. It has an options parameter through which you can control various things:

  • includeThreshold for dropping words that occur too frequently. (For example, “UND” occurs over 42000 times and you surely don't want to index it; what about “DAVID” which occurs 862 times?) The default threshold is set at 300 occurrences.
  • minLength for “restrictions from minimal word length”, as mentioned in the question.
  • otherExclusions for “restrictions to specific words to exclude”, also as mentioned in the question.
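
For example, the \directlua call in the driver file could pass stricter values (the numbers below are only illustrative, not recommendations). Note that exclusions are upper-cased internally, so their case does not matter:

\directlua{printConcordance(words, locations,
    {includeThreshold=500, breakThreshold=800, minLength=3,
     otherExclusions={'UND','DER','DIE','DAS'}})}

(Lua comments cannot be used inside \directlua here, because line ends become spaces and a -- would comment out the rest of the argument.)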

And of course you can edit the function yourself to change the behaviour further. When you compile the above file, the terminal output will tell you which words are being dropped, and why. On my laptop it takes about 5 seconds to generate the concordance, and about 25 seconds to typeset it, so the whole run takes about 30 seconds total.


Because you asked for links to other programs: there is a solution, albeit a commercial one. First, create your PDF file using LaTeX. Second, run that file through PDF Index Generator: https://www.pdfindexgenerator.com/. You will be able to edit the list of words so as to remove common words, and you will then get a file which you can append to your LaTeX-generated PDF. One caveat is that the generated list will not be in the same format — in terms of fonts and styles — as your LaTeX-generated text. To match them, you would have to extract the concordance text from the PDF and then append it to your LaTeX source. I have no connection with PDF Index Generator other than being a user.

A freeware list of common words (“stop words”) for German and other languages can be found at: https://github.com/Alir3z4/stop-words
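
Such a list plugs straight into the otherExclusions option of printConcordance above. Here is a minimal sketch, assuming the stop words are saved as stopwords-de.txt with one word per line in UTF-8 (the file name and the helper function are my own illustration, not part of the setup above):

-- Hypothetical helper: read one stop word per line into a list
-- suitable for the otherExclusions option. printConcordance
-- upper-cases the exclusions itself, so the file's case does not matter.
function readStopWords(filename)
   local stops = {}
   for line in io.lines(filename) do
      if line ~= '' then table.insert(stops, line) end
   end
   return stops
end

and then, in the document:

\directlua{printConcordance(words, locations, {otherExclusions=readStopWords('stopwords-de.txt')})}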