Data structure: insert, remove, contains, get random element, all in O(1)

Consider a data structure composed of a hashtable H and an array A. The hashtable keys are the elements in the data structure, and the values are their positions in the array.

  1. insert(value): Append value to A and let i be its index in A. Set H[value]=i.
  2. remove(value): Replace the cell holding value in A with the last element of A. Let d be the last element of A, and let i be H[value], the index in A of the value to be removed. Set A[i]=d and H[d]=i, shrink A by one, and remove value from H. (If value happens to be the last element, the swap is a harmless no-op and the removal still works.)
  3. contains(value): Return H.contains(value).
  4. getRandomElement(): Let r=random(current size of A), a uniformly random index into A, and return A[r].

Since the array needs to grow automatically, insertion is amortized O(1) rather than strictly O(1), but I guess that's OK.
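A minimal Python sketch of the above, assuming nothing beyond the description (the class and method names are mine, and Python's `dict` and `list` stand in for H and A):

```python
import random

class RandomizedSet:
    """Array + hash map: O(1) insert, remove, contains, getRandomElement."""

    def __init__(self):
        self.a = []   # A: the elements, in arbitrary order
        self.h = {}   # H: element -> its index in self.a

    def insert(self, value):
        if value in self.h:           # already present; nothing to do
            return False
        self.h[value] = len(self.a)   # its index will be the current length
        self.a.append(value)          # amortized O(1) append
        return True

    def remove(self, value):
        i = self.h.get(value)
        if i is None:
            return False
        d = self.a[-1]                # last element fills the hole
        self.a[i] = d
        self.h[d] = i
        self.a.pop()                  # shrink the array by one
        del self.h[value]             # correct even when value was the last element
        return True

    def contains(self, value):
        return value in self.h

    def get_random_element(self):
        return self.a[random.randrange(len(self.a))]
```

The key point is that A gives O(1) uniform random access while H gives O(1) location of any value inside A, and every mutation keeps the two in sync.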


O(1) lookup implies a hashed data structure.

By comparison:

  • O(1) insert/delete with O(N) lookup implies a linked list.
  • O(1) insert, O(N) delete, and O(N) lookup implies an array-backed list.
  • O(log N) insert/delete/lookup implies a tree or heap.

You might not like this answer, because they're probably looking for a clever solution, but sometimes it pays to stick to your guns... A hash table already satisfies the requirements - probably better overall than anything else will (albeit in amortised constant time, and with different compromises from other solutions).

The requirement that's tricky is the "random element" selection: in a hash table, you would need to scan or probe for such an element.

For closed hashing / open addressing, the chance of any given bucket being occupied is size() / capacity(), but crucially this ratio is typically kept in a constant range by the hash-table implementation (e.g. the table may be kept larger than its current contents by anywhere from 1.2x to ~10x, depending on performance/memory tuning). The expected number of buckets to examine before hitting an occupied one is the reciprocal, capacity() / size(), so on average roughly 1.2 to 10 buckets - totally independent of the total size of the container; amortised O(1).

I can imagine two simple approaches (and a great many more fiddly ones); both are sketched in code after this list:

  • search linearly from a random bucket

    • consider empty/value-holding buckets à la "--AC-----B--D": the first "random" selection is arguably fair even though it favours B (the long empty run before B makes it the most likely pick), because before the table was populated B was no more likely than any other element to end up favoured; but if you're doing repeated "random" selections over the same contents, having B repeatedly favoured may be undesirable (nothing in the question demands uniform probabilities, though)
  • try random buckets repeatedly until you find a populated one

    • visits "only" capacity() / size() buckets on average (as above) - but in practice more expensive, because random number generation is relatively costly, and the worst case is unbounded (however improbable)...
      • a faster compromise would be to use a list of pre-generated random offsets from the initial randomly selected bucket, taking each modulo the bucket count
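For illustration, a rough Python sketch of both ideas over a plain bucket array; the `buckets` list (with `None` marking empty slots) is just a stand-in for the table's internal storage, since a real implementation would do this inside the hash table itself:

```python
import random

def random_by_linear_scan(buckets):
    """Approach 1: pick a random bucket, then scan forward (wrapping around)
    until an occupied one is found. Cheap, but on repeated calls it favours
    elements preceded by long runs of empty buckets."""
    n = len(buckets)
    start = random.randrange(n)
    for step in range(n):
        value = buckets[(start + step) % n]
        if value is not None:
            return value
    raise IndexError("empty table")

def random_by_rejection(buckets):
    """Approach 2: probe random buckets until an occupied one is found.
    Uniform over the stored elements, with ~capacity/size probes expected,
    but no hard bound on the (improbably bad) worst case."""
    if all(b is None for b in buckets):
        raise IndexError("empty table")
    n = len(buckets)
    while True:
        value = buckets[random.randrange(n)]
        if value is not None:
            return value

# The "--AC-----B--D" layout from the first bullet:
buckets = [None, None, 'A', 'C', None, None, None, None, None, 'B', None, None, 'D']
print(random_by_linear_scan(buckets))  # biased: 'B' comes up most often
print(random_by_rejection(buckets))    # uniform over A, C, B, D
```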

Not a great solution, but it may still be a better overall compromise than the memory and performance overhead of maintaining a second index array at all times.