unordered_map excess calls to hash function
Firstly, a couple of observations:
The unordered map is both a hash table, and a singlylinked list.
See here that
begin
returns aniterator
which models LegacyForwardIterator.Inserting an entry into the map requires updating both the hash table and the linked list.
Secondly, a couple of notes on these containers' implementation decisions:
For singlylinked lists, it's common to have a sentinel node which doesn't contain any data (for something like
Node<T>
, it'll still have aT
, just defaultinitialized). We only want it for itsnext
pointer, because it helps keep list operations regular (ie, we don't have to write insertatthehead and insertafternode as different special cases).For hash tables (assuming linkedlist buckets, since it's required by the standard) we can either use
Node table[N]
(so each bucket has its own sentinel preallocated) orNode* table[N]
.In this case, since we're actually using
Node<T>
and don't know the size ofT
, it seems reasonable to store a pointer for each bucket.For a hash table which is also a singlylinked list, it makes sense to use the perbucket list as (part of) the allelements list. Otherwise we'd need to store two pointers per node,
next_in_bucket
andnext_in_list
.This means that the "sentinel" (onebeforethebeginning) node pointed to by a bucket is actually the last node of the previous bucket ... except for the bucket at the front of the list, when it really is the overall list sentinel.
The comments in the code say
/* ... * The nonempty buckets contain the node before the first node in the * bucket. This design makes it possible to implement something like a * std::forward_list::insert_after on container insertion and * std::forward_list::erase_after on container erase * calls. _M_before_begin is equivalent to * std::forward_list::before_begin. Empty buckets contain * nullptr. Note that one of the nonempty buckets contains * &_M_before_begin which is not a dereferenceable node so the * node pointer in a bucket shall never be dereferenced, only its * next node can be.
(the sentinel is
_M_before_begin
in this code)
So, when we add an element to an alreadypopulated bucket, the steps are roughly
void insert_to_non_empty_bucket(Node *n, Key k) {
Node *sentinel = table[k];
n>next = sentinel>next;
sentinel>next = n;
}
Note again that we don't know or care whether the sentinel here is the last element of the previous bucket, or the overall list sentinel. The code is the same either way (which was one of the reasons for using a sentinel in the first place).
However, when we add the first element to an empty bucket (and it is not the only nonempty bucket), we have one extra step: we need to update the sentinel pointer for the next bucket, to point to our new node. Otherwise we'd have two buckets both pointing to the list sentinel.
void insert_to_empty_bucket(Node *n, Key k) {
Node *sentinel = &list_sentinel; // ie, &_M_before_begin
n>next = sentinel>next;
sentinel>next = n;
// update the *next* bucket in the table
table[n>next>key] = n;
}
Finally: in this implementation, Node
doesn't cache the key, so there is no n>next>key
. There is actually a trait controlling this, but it's clearly false in this case, which means that final line has to recompute the hash in order to update the next bucket.
NB. just to clarify, when I say previous bucket or next bucket, I'm just talking about position in the list, where buckets appear in reverse order of when they became nonempty. It doesn't have anything to do with position in the table, or imply any intrinsic ordering.
As others pointed out, an unordered map, which is just a form of a hash table, is in libstdc++ implemented basically just as a single ("global") linked list. Plus, there is an array of buckets that point into this list. What is important is that the pointer stored in bucket[i]
does not point to the first node that belongs to this bucket (according to hash function mapping), but to its predecessor in the global list instead. The reason is obvious — when you add an item into the singlylinked list, you need to update its predecessor. Here, when you need to insert an element into some bucket, you need to update the predecessor of the first node of this bucket.
However, the very first node of the global linked list does not have any predecessor. To make things unified, there is a sentinel node that plays this role. In libstdc++, it's a member variable _M_before_begin
.
Let's assume that we have a hash table with keys A
and B
that belong to bucket[0]
and a key C
that belongs to bucket[1]
. It may, for instance, look like as follows:
global linked list buckets[]
 
_M_before_begin < bucket[0]

v
node_with_key_A

v
node_with_key_B < bucket[1]

v
node_with_key_C

x
Now, when a new key, say D
, is added into an empty bucket, say bucket[2]
, libstdc++ inserts it at the beginning of the global linked list.
Therefore, the situation after this insertion is as follows:
global linked list buckets[]
 
_M_before_begin < bucket[2]

v
node_with_key_D < bucket[0]

v
node_with_key_A

v
node_with_key_B < bucket[1]

v
node_with_key_C

x
Note that bucket[0]
that corresponds with node_with_key_A
pointed to by _M_before_begin
needs to be updated. And, since, as again pointed to by others, libstdc++ does not cache hash values by default, the only option how to find a bucket index for node_with_key_A
is to trigger a hash function.
Note that basically I just said the same as others, but wanted to add some illustrations that may help.
Another consequence of this approach is that hash function might be called during lookup: https://godbolt.org/z/K6qhWc. The reason is that the first element for some bucket is known, but not the last one. Therefore, the hash function for node keys needs to be resolved to find out whether a node still belongs to the actual bucket during the linked list traversal.
I can't explain why it is done that way, but it doesn't fit in a comment, so I leave it here in the answer section. You have two parts in the stdlib (10.1.0) upon insertion of an element:
__hash_code __code = __h>_M_hash_code(__k);
Which calculates the hash value of the element to insert __k
.
And later on this part of the code:
{
// The bucket is empty, the new node is inserted at the
// beginning of the singlylinked list and the bucket will
// contain _M_before_begin pointer.
__node>_M_nxt = _M_before_begin._M_nxt;
_M_before_begin._M_nxt = __node;
if (__node>_M_nxt)
// We must update former begin bucket that is pointing to
// _M_before_begin.
_M_buckets[_M_bucket_index(__node>_M_next())] = __node;
_M_buckets[__bkt] = &_M_before_begin;
}
Where _M_bucket_index
calculates the hash for __node>_M_next()
, __node
referes to the node created for __k
.
Maybe that helps you or someone else to explain it further.