Space-efficient in-memory structure for sorted text supporting prefix searches

You can find a scientific paper connected to your problem here (citation of the authors: "Experiments show that our index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip, bzip or ppmdi." - but unfortunately the paper is payment only). I'm not sure how difficult these ideas are to implement. The authors of this paper have a website where you can find also implementations (under "Index Collection") of various compressed index algorithms.

If you want to go on with your approach, make sure to check out the websites about Crit-bit trees and Radix tree.

Since there are only 1.1 million chunks, you can index a chunk using 24 bits instead of 32 bits and save space there.

You could also compress the chunks. Perhaps Huffman coding is a good choice. I would also try the following strategy: instead of using a character as a symbol to encode, you should encode character transitions. So instead of looking at the probability of a character appearing, look at the probability of the transition in a Markov chain where the state is the current character.

Space-efficient in-memory structure for sorted text supporting prefix searches

Tags:

C#

.Net

Algorithm

Prefix

Trie

Related

Recent Posts