How to maintain a dictionary in a heap in Python?

Using a heap is a good solution, with time complexity O(n log k), where n is the number of entries in the dictionary and k is 10 here.

The trick for keeping each key attached to its value is to wrap every entry in a small class and define the magic methods __lt__() and __gt__(), which override the < and > operators:

import heapq

class CompareWord:
    def __init__(self, word, value):
        self.word = word
        self.value = value

    def __lt__(self, other):   # overrides the < operator
        return self.value < other.value

    def __gt__(self, other):   # overrides the > operator
        return self.value > other.value

    def getWord(self):
        return self.word

def findKGreaterValues(compare_dict, k):
    min_heap = []
    for word in compare_dict:
        heapq.heappush(min_heap, CompareWord(word, compare_dict[word]))
        if len(min_heap) > k:
            heapq.heappop(min_heap)   # drop the smallest, keeping the k largest

    answer = []
    for compare_word_obj in min_heap:
        answer.append(compare_word_obj.getWord())

    return answer
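
For example, with a hypothetical word-count dictionary (not from the original question), the function can be called like this:

word_counts = {'apple': 12, 'banana': 7, 'cherry': 25, 'date': 3, 'fig': 18}
print(findKGreaterValues(word_counts, 3))
# Prints the three words with the largest counts, e.g. ['apple', 'fig', 'cherry'],
# in heap order rather than sorted order.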

Using heapq you probably want to do something like this:

import heapq

heap = [(-value, key) for key, value in the_dict.items()]
largest = heapq.nsmallest(10, heap)
largest = [(key, -value) for value, key in largest]

Note that since heapq implements only a min-heap, it's better to invert the values so that bigger values become smaller.
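
If you don't need to keep the negated list around, heapq.nlargest with a key function gives the same top-10 pairs without negating the values by hand (a small alternative sketch, using the same the_dict as above):

import heapq
from operator import itemgetter

# Take the 10 (key, value) pairs with the largest values directly.
largest = heapq.nlargest(10, the_dict.items(), key=itemgetter(1))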

The advantage of this solution is small for small dictionaries, for example:

>>> import random
>>> import itertools as it
>>> def key_generator():
...     characters = [chr(random.randint(65, 90)) for x in range(100)]
...     for i in it.count():
...             yield ''.join(random.sample(characters, 3))
... 
>>> the_dict = dict((key, random.randint(-500, 500)) for key, _ in zip(key_generator(), range(3000)))
>>> def with_heapq(the_dict):
...     items = [(-value, key) for key, value in the_dict.items()]
...     smallest = heapq.nsmallest(10, items)
...     return [-value for value, key in smallest]
... 
>>> def with_sorted(the_dict):
...     return sorted(the_dict.items(), key=(lambda x: x[1]), reverse=True)[:10]
... 
>>> import timeit
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
0.9220538139343262
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
1.2792410850524902

With 3000 values it's only slightly faster than the sorted version, which is O(n log n) instead of O(n + m log n) (with m = 10 here). If we increase the size of the dict to 10000, the heapq version becomes even faster:

>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
2.436316967010498
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
3.585728168487549

The timings probably also depend on the machine you are running on, so you should profile which solution works best in your case. If efficiency is not critical, I'd suggest using the sorted version because it's simpler.
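
For instance, one way to profile both functions on your own data is timeit.repeat, taking the best of a few runs (a sketch, assuming with_heapq and with_sorted are defined as above):

import timeit

for fn in (with_heapq, with_sorted):
    # Best of three runs of 1000 calls each, to reduce machine noise.
    best = min(timeit.repeat(lambda: fn(the_dict), repeat=3, number=1000))
    print(fn.__name__, best)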


To get the top 10 elements, assuming that the number is the second element of each item:

from operator import itemgetter

topten = sorted(mydict.items(), key=itemgetter(1), reverse=True)[:10]

If you want to sort by value and then by key, just change it to key=itemgetter(1, 0).
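
A tiny illustration of that tie-breaking (the counts dict here is made up):

from operator import itemgetter

counts = {'pear': 7, 'plum': 7, 'fig': 3}
print(sorted(counts.items(), key=itemgetter(1, 0), reverse=True))
# [('plum', 7), ('pear', 7), ('fig', 3)] -- equal values are ordered by key, descending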

As for a data structure, a heap sounds like what you would want. Just keep them as tuples, and compare the number term.
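
A minimal sketch of that tuple-based idea (the name top_k and the k=10 default are my own, not from the answer): keep (value, key) tuples in a size-bounded min-heap, so the smallest of the current top k is always at the root and gets replaced whenever a bigger value arrives.

import heapq

def top_k(the_dict, k=10):
    heap = []  # min-heap of (value, key) tuples, at most k entries
    for key, value in the_dict.items():
        if len(heap) < k:
            heapq.heappush(heap, (value, key))
        else:
            # Pushes the new tuple and pops the smallest; the root is only
            # replaced when the new value is bigger than it.
            heapq.heappushpop(heap, (value, key))
    return sorted(heap, reverse=True)  # largest value first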