Why is the commute function called twice when changing a ref in Clojure?

I've done some experiments in order to understand how commute works. I would like to break my explanation into 3 parts:

  • Compare and Set Semantics
  • alter
  • commute

Compare and Set Semantics

I think Clojure for the Brave and True has explained it quite well:

swap! implements "compare-and-set" semantics, meaning it does the following internally:

  1. It reads the current state of the atom
  2. It then applies the update function to that state
  3. Next, it checks whether the value it read in step 1 is identical to the atom's current value
  4. If it is, then swap! updates the atom to refer to the result of step 2
  5. If it isn't, then swap! retries, going through the process again with step 1.

swap! is for atom, but knowing it will help us to understand alter and commute, because they got use similar method to update ref.

Unlike atom, ref modification (through alter, commute, ref-set) has to be wrapped inside a transaction. When transaction start (or retry), it will capture a snapshot of all containing ref (because alter needs it). ref will be modified only when the transaction is committed.

alter

In a transaction, all ref that will be modified by alter form a group. If any one of the ref in the group fails its alteration, transaction will be retried. Basically alter does the following:

  1. Compare its altering ref to the snapshot captured by transaction. If they look different, retry the transaction; else
  2. Create a new state from the snapshot using function provided.
  3. Compare ref to the snapshot again. If they look different, retry the transaction; else
  4. Try to write-lock the ref, don't let anybody modify it until end of this transaction trial. If failed (ref has already been locked), wait for a while (e.g. 100 ms), then retry the transaction.
  5. Tell transaction to update this ref into new state when it perform commission.

Let's demonstrate a smooth alteration. First, we are going to create a thread t1 to alter 3 counters c1, c2 and c3 with slow-inc.

(ns testing.core)

(def start (atom 0)) ; Record start time.

(def c1 (ref 0)) ; Counter 1
(def c2 (ref 0)) ; Counter 2
(def c3 (ref 0)) ; Counter 3

(defn milliTime 
  "Get current time in millisecond."
  []
  (int (/ (System/nanoTime) 1000000)))

(defn lap 
  "Get elapse time since 'start' in millisecond."
  []
  (- (milliTime) @start))

(defn slow-inc
  "Slow increment, takes 1 second."
  [x x-name]
  (println "slow-inc beg" x-name ":" x "|" (lap) "ms")
  (Thread/sleep 1000)
  (println "slow-inc end" x-name ":" (inc x) "|" (lap) "ms")
  (inc x))

(defn fast-inc
  "Fast increment. The value it prints is incremented."
  [x x-name]
  (println "fast-inc    " x-name ":" (inc x) "|" (lap) "ms")
  (inc x))

(defn -main
  []
  ;; Initialize c1, c2, c3 and start.
  (dosync (ref-set c1 0) 
          (ref-set c2 0)
          (ref-set c3 0))
  (reset! start (milliTime))

  ;; Start two new threads simultaneously.
  (let [t1 (future
             (dosync
               (println "transaction start   |" (lap) "ms")
               (alter c1 slow-inc "c1")
               (alter c2 slow-inc "c2")
               (alter c3 slow-inc "c3")
               (println "transaction end     |" (lap) "ms")))
        t2 (future)]

    ;; Dereference all of them (wait until all 2 threads finish).
    @t1 @t2 

    ;; Print final counters' values.
    (println "c1 :" @c1)
    (println "c2 :" @c2)
    (println "c3 :" @c3)))

And we got this:

transaction start   | 3 ms    ; 1st try
slow-inc beg c1 : 0 | 8 ms
slow-inc end c1 : 1 | 1008 ms
slow-inc beg c2 : 0 | 1009 ms
slow-inc end c2 : 1 | 2010 ms
slow-inc beg c3 : 0 | 2010 ms
slow-inc end c3 : 1 | 3011 ms
transaction end     | 3012 ms
c1 : 1
c2 : 1
c3 : 1

Smooth procedure. No surprise.

Let's see what will happen if a ref (let's say, c3) is modified before its alteration ((alter c3 ...)). We'll modify it during alteration of c1. Edit the let's binding of t2 into:

t2 (future
     (Thread/sleep 900) ; Increment at 900 ms
     (dosync (alter c3 fast-inc "c3")))

Result:

transaction start   | 2 ms    ; 1st try
slow-inc beg c1 : 0 | 7 ms
fast-inc     c3 : 1 | 904 ms  ; c3 being modified in thread t2
slow-inc end c1 : 1 | 1008 ms
slow-inc beg c2 : 0 | 1009 ms
slow-inc end c2 : 1 | 2010 ms
transaction start   | 2011 ms ; 2nd try
slow-inc beg c1 : 0 | 2011 ms
slow-inc end c1 : 1 | 3012 ms
slow-inc beg c2 : 0 | 3013 ms
slow-inc end c2 : 1 | 4014 ms
slow-inc beg c3 : 1 | 4015 ms
slow-inc end c3 : 2 | 5016 ms
transaction end     | 5016 ms
c1 : 1
c2 : 1
c3 : 2

As you can see, in step 1 of 1st-try-(alter c3 ...), it realize c3 (val = 1) looks different to the snapshot (val = 0) captured by transaction, so it retry the transaction.

Now, what if a ref (let's say, c1) was modified during its alteration ((alter c1 ...))? We will modify c1 on thread t2. Edit the let's binding of t2 into:

t2 (future
     (Thread/sleep 900) ; Increment at 900 ms
     (dosync (alter c1 fast-inc "c1")))

Result:

transaction start   | 3 ms    ; 1st try
slow-inc beg c1 : 0 | 8 ms
fast-inc     c1 : 1 | 904 ms  ; c1 being modified in thread t2
slow-inc end c1 : 1 | 1008 ms
transaction start   | 1009 ms ; 2nd try
slow-inc beg c1 : 1 | 1009 ms
slow-inc end c1 : 2 | 2010 ms
slow-inc beg c2 : 0 | 2011 ms
slow-inc end c2 : 1 | 3011 ms
slow-inc beg c3 : 0 | 3012 ms
slow-inc end c3 : 1 | 4013 ms
transaction end     | 4014 ms
c1 : 2
c2 : 1
c3 : 1

This time, in step 3 of 1st-try-(alter c1 ...), it find out that ref has been modified, so it call for a transaction retry.

Now, let's try to modify a ref (let's say, c1) after its alteration ((alter c1 ...)). We'll modify it during alteration of c2.

t2 (future
     (Thread/sleep 1600) ; Increment at 1600 ms
     (dosync (alter c1 fast-inc "c1")))

Result:

transaction start   | 3 ms    ; 1st try
slow-inc beg c1 : 0 | 8 ms
slow-inc end c1 : 1 | 1009 ms
slow-inc beg c2 : 0 | 1010 ms
fast-inc     c1 : 1 | 1604 ms ; try to modify c1 in thread t2, but failed
fast-inc     c1 : 1 | 1705 ms ; keep trying...
fast-inc     c1 : 1 | 1806 ms
fast-inc     c1 : 1 | 1908 ms
fast-inc     c1 : 1 | 2009 ms
slow-inc end c2 : 1 | 2011 ms
slow-inc beg c3 : 0 | 2012 ms
fast-inc     c1 : 1 | 2110 ms ; still trying...
fast-inc     c1 : 1 | 2211 ms
fast-inc     c1 : 1 | 2312 ms
fast-inc     c1 : 1 | 2413 ms
fast-inc     c1 : 1 | 2514 ms
fast-inc     c1 : 1 | 2615 ms
fast-inc     c1 : 1 | 2716 ms
fast-inc     c1 : 1 | 2817 ms
fast-inc     c1 : 1 | 2918 ms ; and trying....
slow-inc end c3 : 1 | 3012 ms
transaction end     | 3013 ms ; 1st try ended, transaction committed.
fast-inc     c1 : 2 | 3014 ms ; finally c1 modified successfully
c1 : 2
c2 : 1
c3 : 1

Since 1st-try-(alter c1 ...) has locked c1 (step 4), no one can modify c1, not until this round of transaction trial ended.

That's it for alter.

So, what if we don't want c1, c2, c3 all group together? Let's say I want to retry the transaction only when c1 or c3 fail their alteration (being modified by other thread during the transaction). I don't care the state of c2. There's no need to retry the transaction if c2 is modified during the transaction, so that I can save some time. How do we achieve that? Yes, through commute.

commute

Basically, commute does the following:

  1. Run the function provided using ref directly (not from snapshot) but don't do anything with the result.
  2. Ask the transaction to call real-commute with the same arguments before transaction commits. (real-commute is just a made up name by me.)

I'm don't actually know why commute has to run step 1. It seems to me step 2 alone is enough. real-commute does the following:

  1. Read-and-write-lock the ref until end of this transaction trial if it hasn't been locked, else retry the transaction.
  2. Create a new state from ref using given function.
  3. Tell transaction to update this ref into new state when it perform commission.

Let's examine it. Edit the let's binding into:

t1 (future
     (dosync
       (println "transaction start   |" (lap) "ms")
       (alter c1 slow-inc "c1")
       (commute c2 slow-inc "c2") ; changed to commute
       (alter c3 slow-inc "c3")
       (println "transaction end     |" (lap) "ms")))
t2 (future)

Result:

transaction start   | 3 ms
slow-inc beg c1 : 0 | 7 ms    ; called by alter
slow-inc end c1 : 1 | 1008 ms
slow-inc beg c2 : 0 | 1009 ms ; called by commute
slow-inc end c2 : 1 | 2009 ms
slow-inc beg c3 : 0 | 2010 ms ; called by alter
slow-inc end c3 : 1 | 3011 ms
transaction end     | 3012 ms
slow-inc beg c2 : 0 | 3012 ms ; called by real-commute
slow-inc end c2 : 1 | 4012 ms
c1 : 1
c2 : 1
c3 : 1

So slow-inc get called twice if you use commute, once by commute and once by real-commute before transaction commit. The first commute didn't do anything with slow-inc's result.

slow-inc can get called more than twice. For example, let's try to modify c3 on thread t2:

t2 (future
     (Thread/sleep 500) ; modify c3 at 500 ms
     (dosync (alter c3 fast-inc "c3")))

Result:

transaction start   | 2 ms
slow-inc beg c1 : 0 | 8 ms
fast-inc     c3 : 1 | 504 ms  ; c3 modified at thread t2
slow-inc end c1 : 1 | 1008 ms
slow-inc beg c2 : 0 | 1009 ms ; 1st time
slow-inc end c2 : 1 | 2010 ms
transaction start   | 2012 ms
slow-inc beg c1 : 0 | 2012 ms
slow-inc end c1 : 1 | 3013 ms
slow-inc beg c2 : 0 | 3014 ms ; 2nd time
slow-inc end c2 : 1 | 4015 ms
slow-inc beg c3 : 1 | 4016 ms
slow-inc end c3 : 2 | 5016 ms
transaction end     | 5017 ms
slow-inc beg c2 : 0 | 5017 ms ; 3rd time
slow-inc end c2 : 1 | 6018 ms
c1 : 1
c2 : 1
c3 : 2

In first trial of transaction, after (commute c2 ...) has been evaluated, (alter c3 ...) found out that c3 was different from the snapshot, thus triggered transaction to retry. If (alter c3 ...) is before (commute c2 ...), then retrial will be triggered before evaluation or (commute c2 ..). So, placing all commutes after all alters may save you some time.

Let's see what will happen if we modify c2 in thread t2 while transaction in t1 is being evaluated.

t2 (future
     (Thread/sleep 500) ; before evaluation of (commute c2 ...)
     (dosync (alter c2 fast-inc "c2"))
     (Thread/sleep 1000) ; during evaluation of (commute c2 ...)
     (dosync (alter c2 fast-inc "c2"))
     (Thread/sleep 1000) ; after evaluation of (commute c2 ...)
     (dosync (alter c2 fast-inc "c2")))

Result:

transaction start   | 3 ms
slow-inc beg c1 : 0 | 9 ms
fast-inc     c2 : 1 | 504 ms  ; before
slow-inc end c1 : 1 | 1009 ms
slow-inc beg c2 : 1 | 1010 ms
fast-inc     c2 : 2 | 1506 ms ; during
slow-inc end c2 : 2 | 2011 ms
slow-inc beg c3 : 0 | 2012 ms
fast-inc     c2 : 3 | 2508 ms ; after
slow-inc end c3 : 1 | 3013 ms
transaction end     | 3013 ms
slow-inc beg c2 : 3 | 3014 ms
slow-inc end c2 : 4 | 4014 ms
c1 : 1
c2 : 4
c3 : 1

As you can see, no transaction retrial, and c2 still updated to our expected value (4), thanks to real-commute.

Now I want to demonstrate the effect of step 1 in real-commute: its ref is read-and-write-locked. First, to confirm it is read-locked:

t2 (future
     (Thread/sleep 3500) ; during real-commute
     (println "try to read c2:" @c2 "  |" (lap) "ms"))

Result:

transaction start   | 3 ms
slow-inc beg c1 : 0 | 9 ms
slow-inc end c1 : 1 | 1010 ms
slow-inc beg c2 : 0 | 1010 ms
slow-inc end c2 : 1 | 2011 ms
slow-inc beg c3 : 0 | 2012 ms
slow-inc end c3 : 1 | 3012 ms
transaction end     | 3013 ms
slow-inc beg c2 : 0 | 3013 ms
slow-inc end c2 : 1 | 4014 ms
try to read c2: 1   | 4015 ms ; got printed after transaction trial ended
c1 : 1
c2 : 1
c3 : 1

The @c2 is blocked until c2 got unlocked. That's why println got evaluated after 4000 ms even though our order is sleep for 3500 ms.

Since commute and alter need to read their ref to perform given function, they will be blocked until their ref is unlocked too. You can try to replace (println ...) with (alter c2 fast-inc "c2"). The effect should be same as this example.

So, in order to confirm it is write-locked, we can use ref-set:

t2 (future
     (Thread/sleep 3500) ; during real-commute
     (dosync (ref-set c2 (fast-inc 9 " 8"))))

Result:

transaction start   | 3 ms
slow-inc beg c1 : 0 | 8 ms
slow-inc end c1 : 1 | 1008 ms
slow-inc beg c2 : 0 | 1010 ms
slow-inc end c2 : 1 | 2011 ms
slow-inc beg c3 : 0 | 2012 ms
slow-inc end c3 : 1 | 3013 ms
transaction end     | 3014 ms
slow-inc beg c2 : 0 | 3014 ms
fast-inc      8 : 9 | 3504 ms ; try to ref-set but failed
fast-inc      8 : 9 | 3605 ms ; try again...
fast-inc      8 : 9 | 3706 ms
fast-inc      8 : 9 | 3807 ms
fast-inc      8 : 9 | 3908 ms
fast-inc      8 : 9 | 4009 ms
slow-inc end c2 : 1 | 4015 ms
fast-inc      8 : 9 | 4016 ms ; finally success, c2 ref-set to 9
c1 : 1
c2 : 9
c3 : 1

From here you can also guess what ref-set does:

  • If its ref has been write-locked, retry the transaction after some time (e.g. 100 ms); else tell the transaction to update this ref into given value when it perform commission.

real-commute can fail too, when its ref has been locked on step 1. Unlike alter or ref-set, it doesn't wait for some time before retrying the transaction. This may cause problem if the ref is locked for too long. For example, we'll try to modify c1 after its alteration, using commute:

t2 (future
     (Thread/sleep 2500) ; during alteration of c3
     (dosync (commute c1 fast-inc "c1")))

Result:

transaction start   | 3 ms
slow-inc beg c1 : 0 | 8 ms
slow-inc end c1 : 1 | 1008 ms
slow-inc beg c2 : 0 | 1010 ms
slow-inc end c2 : 1 | 2011 ms
slow-inc beg c3 : 0 | 2012 ms
fast-inc     c1 : 1 | 2506 ms
fast-inc     c1 : 1 | 2506 ms
fast-inc     c1 : 1 | 2506 ms
...

Exception in thread "main" java.util.concurrent.ExecutionException:
  java.lang.RuntimeException: Transaction failed after reaching retry
  limit, compiling: ...

Recall that c1 is write-locked by alter after its alteration, therefore real-commute keep failing and keep retrying the transaction. Without buffer time, it reached transaction retrial limit and boomed.

Note

commute helps to improve concurrency by letting user reduce ref that will cause transaction retrial, with the cost of calling given function at least twice to update its ref. There are some scenarios which commute can be slower than alter. For example, when the only thing to do in a transaction is update a ref, commute cost more than alter:

(def c (ref 0)) ; counter

(defn slow-inc
  [x]
  (Thread/sleep 1000)
  (inc x))

(defn add-2
  "Create two threads to slow-inc c simultaneously with func.
  func can be alter or commute."
  [func]
  (let [t1 (future (dosync (func c slow-inc)))
        t2 (future (dosync (func c slow-inc)))]
    @t1 @t2))

(defn -main
  [& args]
  (dosync (ref-set c 0))
  (time (add-2 alter))
  (dosync (ref-set c 0))
  (time (add-2 commute)))

Result:

"Elapsed time: 2003.239891 msecs" ; alter
"Elapsed time: 4001.073448 msecs" ; commute

Here are the procedures of alter:

  • 0 ms: t1's alter started.
  • 1 ms: t2's alter started.
  • 1000 ms: t1's alter successfully, t1 committed, c became 1.
  • 1001 ms: t2's alter found out that c is different from its snapshot (step 2), retry transaction.
  • 2001 ms: t2's alter successfully, t2 committed, c became 2.

And procedures of commute:

  • 0 ms: t1's commute started.
  • 1 ms: t2's commute started.
  • 1000 ms: t1's real-commute started. c is locked.
  • 1001 ms: t2's real-commute started. It found out that c has been locked, so it retry the transaction (step 1).
  • 1002 ms: t2's commute started, but c is locked, so it is blocked.
  • 2000 ms: t1's real-commute ended, transaction committed. c became 1. t2 is unblocked.
  • 3002 ms: t2's real-commute started.
  • 4002 ms: t2's real-commute ended, transaction committed. c became 2.

That's why commute is slower than alter in this example.

This may look contradict to example of commute from clojuredocs.org. The key difference is, in his example, the delay (100 ms) happened in body of transaction, but the delay happened in slow-inc in my example. This difference cause his real-commute phase run very fast, thus reducing the locking time and blocking time. Less locking time means less probability of retrial. That's why in his example, commute is faster than alter. Change his inc into slow-inc and you will get same observation as mine.

That's all.


I figured it out.

This happens because commute functions are always executed twice.

Commute allows for more potential concurrency than alter because it doesn't lock the identity for the entire duration of the transaction.

Instead, it reads the identity's value once at the beginning of the transaction and when the commute operation is called it returns the commute function applied to THIS VALUE.

It's entirely possible that this value is now out of date because some other thread could have changed it some time between the start of the transaction and the execute of the commute function.

However, integrity is maintained because the commute function is executed AGAIN at commit time when it actually modifies the ref.

This site has a very clear explanation of the difference: http://squirrel.pl/blog/2010/07/13/clojure-alter-vs-commute/

In fact, when commute is called it instantly returns result of running the function on the ref. At the very end of transaction it performs the calculation again, this time synchronously (like alter) updating the ref. That’s why eventually the counter value is 51 even though the last thread printed 45.

So be careful if your commute function has side-effects because they will be executed twice!!

Tags:

Clojure

Stm