How to generate a time-ordered uid in Python?

In the article you've linked too, the cassandra.util.uuid_from_time(time_arg, node=None, clock_seq=None)[source] seems to be exactly what you're looking for.

def uuid_from_time(time_arg, node=None, clock_seq=None):
    """
    Converts a datetime or timestamp to a type 1 :class:`uuid.UUID`.

    :param time_arg:
      The time to use for the timestamp portion of the UUID.
      This can either be a :class:`datetime` object or a timestamp
      in seconds (as returned from :meth:`time.time()`).
    :type datetime: :class:`datetime` or timestamp

    :param node:
      None integer for the UUID (up to 48 bits). If not specified, this
      field is randomized.
    :type node: long

    :param clock_seq:
      Clock sequence field for the UUID (up to 14 bits). If not specified,
      a random sequence is generated.
    :type clock_seq: int

    :rtype: :class:`uuid.UUID`

    """
    if hasattr(time_arg, 'utctimetuple'):
        seconds = int(calendar.timegm(time_arg.utctimetuple()))
        microseconds = (seconds * 1e6) + time_arg.time().microsecond
    else:
        microseconds = int(time_arg * 1e6)

    # 0x01b21dd213814000 is the number of 100-ns intervals between the
    # UUID epoch 1582-10-15 00:00:00 and the Unix epoch 1970-01-01 00:00:00.
    intervals = int(microseconds * 10) + 0x01b21dd213814000

    time_low = intervals & 0xffffffff
    time_mid = (intervals >> 32) & 0xffff
    time_hi_version = (intervals >> 48) & 0x0fff

    if clock_seq is None:
        clock_seq = random.getrandbits(14)
    else:
        if clock_seq > 0x3fff:
            raise ValueError('clock_seq is out of range (need a 14-bit value)')

    clock_seq_low = clock_seq & 0xff
    clock_seq_hi_variant = 0x80 | ((clock_seq >> 8) & 0x3f)

    if node is None:
        node = random.getrandbits(48)

    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node), version=1)

There's nothing Cassandra specific to a Type 1 UUID...


Why uuid.uuid1 is not sequential

uuid.uuid1(node=None, clock_seq=None) is effectively:

  • 60 bits of timestamp (representing number of 100-ns intervals after 1582-10-15 00:00:00)
  • 14 bits of "clock sequence"
  • 48 bits of "Node info" (generated from network card's mac-address or from hostname or from RNG).

If you don't provide any arguments, then System function is called to generate uuid. In that case:

  • It's unclear if "clock sequence" is sequential or random.
  • It's unclear if it's safe to be used in multiple processes (can clock_seq be repeated in different processes or not?). In Python 3.7 this info is now available.

If you provide clock_seq or node, then "pure python implementation is used". IN this case even with "fixed value" for clock_seq:

  • timestamp part is guaranteed to be sequential for all the calls in current process even in threaded execution.
  • clock_seq part is randomly generated. But that is not critical annymore because timestamp is sequential and unique.
  • It's NOT safe for multiple processes (processes that call uuid1 with the same clock_seq, node might return conflicting values if called during the "same 100-ns time interval")

Solution that reuses uuid.uuid1

It's easy to see, that you can make uuid1 sequential by providing clock_seq or node arguments (to use python implementation).

import time

from uuid import uuid1, getnode

_my_clock_seq = getrandbits(14)
_my_node = getnode()


def sequential_uuid(node=None):
    return uuid1(node=node, clock_seq=_my_clock_seq)
    # .hex attribute of this value is 32-characters long string


def alt_sequential_uuid(clock_seq=None):
    return uuid1(node=_my_node, clock_seq=clock_seq)


if __name__ == '__main__':
    from itertools import count
    old_n = uuid1()  # "Native"
    old_s = sequential_uuid()  # Sequential

    native_conflict_index = None

    t_0 = time.time()

    for x in count():
        new_n = uuid1()
        new_s = sequential_uuid()

        if old_n > new_n and not native_conflict_index:
            native_conflict_index = x

        if old_s >= new_s:
            print("OOops: non-sequential results for `sequential_uuid()`")
            break

        if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
            print('No issues for `sequential_uuid()`')
            break

        old_n = new_n
        old_s = new_s

    print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')

Multiple processes issues

BUT if you are running some parallel processes on the same machine, then:

  • node which defaults to uuid.get_node() will be the same for all the processes;
  • clock_seq has small chance to be the same for some processes (chance of 1/16384)

That might lead to conflicts! That is general concern for using uuid.uuid1 in parallel processes on the same machine unless you have access to SafeUUID from Python3.7.

If you make sure to also set node to unique value for each parallel process that runs this code, then conflicts should not happen.

Even if you are using SafeUUID, and set unique node, it's still possible to have non-sequential (but unique) ids if they are generated in different processes.

If some lock-related overhead is acceptable, then you can store clock_seq in some external atomic storage (for example in "locked" file) and increment it with each call: this allows to have same value for node on all parallel processes and also will make id-s sequential. For cases when all parallel processes are subprocesses created using multiprocessing: clock_seq can be "shared" using multiprocessing.Value

As a result you always have to remember:

  • If you are running multiple processes on the same machine, then you must:

    • Ensure uniqueness of node. The problem for this solution: you can't be sure to have sequential ids from different processes generated during the same 100-ns interval. But this is very "light" operation executed once on process startup and achieved by: "adding" something to default node, e.g. int(time.time()*1e9) - 0x118494406d1cc000, or by adding some counter from machine-level atomic db.

    • Ensure "machine-level atomic clock_seq" and the same node for all processes on one machine. That way you'll have some overhead for "locking" clock_seq, but id-s are guaranteed to be sequential even if generated in different processes during the same 100-ns interval (unless you are calling uuid from several threads in the same process).

  • For processes on different machines:

    • either you have to use some "global counter service";

    • or it's not possible to have sequential ids generated on different machines during the same 100-ns interval.

Reducing size of the id

General approach to generate UUIDs is quite simple, so it's easy to implement something similar from scratch, and for example use less bits for node_info part:

import time
from random import getrandbits

_my_clock_seq = getrandbits(14)
_last_timestamp_part = 0
_used_clock_seq = 0


timestamp_multiplier = 1e7  # I'd recommend to use this value

# Next values are enough up to year 2116:
if timestamp_multiplier == 1e9:
    time_bits = 62  # Up to year 2116, also reduces chances for non-sequential id-s generated in different processes
elif timestamp_multiplier == 1e8:
    time_bits = 60  # up to year 2335
elif timestamp_multiplier == 1e7:
    time_bits = 56  # Up to year 2198.
else:
    raise ValueError('Please calculate and set time_bits')

time_mask = 2**time_bits - 1

seq_bits = 16
seq_mask = 2**seq_bits - 1

node_bits = 12
node_mask = 2**node_bits - 1

max_hex_len = len(hex(2**(node_bits+seq_bits+time_bits) - 1)) - 2  # 21

_default_node_number = getrandbits(node_bits)  # or `uuid.getnode() & node_mask`


def sequential_uuid(node_number=None):
    """Return 21-characters long hex string that is sequential and unique for each call in current process.

    Results from different processes may "overlap" but are guaranteed to
    be unique if `node_number` is different in each process.

    """
    global _my_clock_seq
    global _last_timestamp_part
    global _used_clock_seq
    if node_number is None:
        node_number = _default_node_number
    if not 0 <= node_number <= node_mask:
        raise ValueError("Node number out of range")

    timestamp_part = int(time.time() * timestamp_multiplier) & time_mask
    _my_clock_seq = (_my_clock_seq + 1) & seq_mask

    if _last_timestamp_part >= timestamp_part:
        timestamp_part = _last_timestamp_part
        if _used_clock_seq == _my_clock_seq:
            timestamp_part = (timestamp_part + 1) & time_mask
    else:
        _used_clock_seq = _my_clock_seq

    _last_timestamp_part = timestamp_part

    return hex(
        (timestamp_part << (node_bits+seq_bits))
        |
        (_my_clock_seq << (node_bits))
        |
        node_number
    )[2:]

Notes:

  • Maybe it's better to simply store integer value (not hex-string) in the database
  • If you are storing it as text/char, then its better to convert integer to base64-string instead of converting it to hex-string. That way it will be shorter (21 chars hex-string → 16 chars b64-encoded string):
from base64 import b64encode

total_bits = time_bits+seq_bits+node_bits
total_bytes = total_bits // 8 + 1 * bool(total_bits % 8)

def int_to_b64(int_value):
    return b64encode(int_value.to_bytes(total_bytes, 'big'))

Collision chances

  • Single process: collisions not possible
  • Multiple processes with manually set unique clock_seq or unique node in each process: collisions not possible
  • Multiple processes with randomly set node (48-bits, "fixed" in time):

    • Chance to have the node collision in several processes:

      • in 2 processes out of 10000: ~0.000018%
      • in 2 processes out of 100000: 0.0018%
    • Chance to have single collision of the id per second in 2 processes with the "colliding" node:

      • for "timestamp" interval of 100-ns (default for uuid.uuid1 , and in my code when timestamp_multiplier == 1e7): proportional to 3.72e-19 * avg_call_frequency²

      • for "timestamp" interval of 10-ns (timestamp_multiplier == 1e8): proportional to 3.72e-21 * avg_call_frequency²