Is volatile expensive?

Generally speaking, on most modern processors a volatile load is comparable to a normal load. A volatile store is about 1/3 the time of a montior-enter/monitor-exit. This is seen on systems that are cache coherent.

To answer the OP's question, volatile writes are expensive while the reads usually are not.

Does this mean that volatile read operations can be done without a explicit cache invalidation on x86, and is as fast as a normal variable read (disregarding the reordering contraints of volatile)?

Yes, sometimes when validating a field the CPU may not even hit main memory, instead spy on other thread caches and get the value from there (very general explanation).

However, I second Neil's suggestion that if you have a field accessed by multiple threads you shold wrap it as an AtomicReference. Being an AtomicReference it executes roughly the same throughput for reads/writes but also is more obvious that the field will be accessed and modified by multiple threads.

Edit to answer OP's edit:

Cache coherence is a bit of a complicated protocol, but in short: CPU's will share a common cache line that is attached to main memory. If a CPU loads memory and no other CPU had it that CPU will assume it is the most up to date value. If another CPU tries to load the same memory location the already loaded CPU will be aware of this and actually share the cached reference to the requesting CPU - now the request CPU has a copy of that memory in its CPU cache. (It never had to look in main memory for the reference)

There is quite a bit more of protocol involved but this gives an idea of what is going on. Also to answer your other question, with the absence of multiple processors, volatile reads/writes can in fact be faster then with multiple processors. There are some applications that would in fact run faster concurrently with a single CPU then multiple.


On Intel an un-contended volatile read is quite cheap. If we consider the following simple case:

public static long l;

public static void run() {        
    if (l == -1)
        System.exit(-1);

    if (l == -2)
        System.exit(-1);
}

Using Java 7's ability to print assembly code the run method looks something like:

# {method} 'run2' '()V' in 'Test2'
#           [sp+0x10]  (sp of caller)
0xb396ce80: mov    %eax,-0x3000(%esp)
0xb396ce87: push   %ebp
0xb396ce88: sub    $0x8,%esp          ;*synchronization entry
                                    ; - Test2::run2@-1 (line 33)
0xb396ce8e: mov    $0xffffffff,%ecx
0xb396ce93: mov    $0xffffffff,%ebx
0xb396ce98: mov    $0x6fa2b2f0,%esi   ;   {oop('Test2')}
0xb396ce9d: mov    0x150(%esi),%ebp
0xb396cea3: mov    0x154(%esi),%edi   ;*getstatic l
                                    ; - Test2::run@0 (line 33)
0xb396cea9: cmp    %ecx,%ebp
0xb396ceab: jne    0xb396ceaf
0xb396cead: cmp    %ebx,%edi
0xb396ceaf: je     0xb396cece         ;*getstatic l
                                    ; - Test2::run@14 (line 37)
0xb396ceb1: mov    $0xfffffffe,%ecx
0xb396ceb6: mov    $0xffffffff,%ebx
0xb396cebb: cmp    %ecx,%ebp
0xb396cebd: jne    0xb396cec1
0xb396cebf: cmp    %ebx,%edi
0xb396cec1: je     0xb396ceeb         ;*return
                                    ; - Test2::run@28 (line 40)
0xb396cec3: add    $0x8,%esp
0xb396cec6: pop    %ebp
0xb396cec7: test   %eax,0xb7732000    ;   {poll_return}
;... lines removed

If you look at the 2 references to getstatic, the first involves a load from memory, the second skips the load as the value is reused from the register(s) it is already loaded into (long is 64 bit and on my 32 bit laptop it uses 2 registers).

If we make the l variable volatile the resulting assembly is different.

# {method} 'run2' '()V' in 'Test2'
#           [sp+0x10]  (sp of caller)
0xb3ab9340: mov    %eax,-0x3000(%esp)
0xb3ab9347: push   %ebp
0xb3ab9348: sub    $0x8,%esp          ;*synchronization entry
                                    ; - Test2::run2@-1 (line 32)
0xb3ab934e: mov    $0xffffffff,%ecx
0xb3ab9353: mov    $0xffffffff,%ebx
0xb3ab9358: mov    $0x150,%ebp
0xb3ab935d: movsd  0x6fb7b2f0(%ebp),%xmm0  ;   {oop('Test2')}
0xb3ab9365: movd   %xmm0,%eax
0xb3ab9369: psrlq  $0x20,%xmm0
0xb3ab936e: movd   %xmm0,%edx         ;*getstatic l
                                    ; - Test2::run@0 (line 32)
0xb3ab9372: cmp    %ecx,%eax
0xb3ab9374: jne    0xb3ab9378
0xb3ab9376: cmp    %ebx,%edx
0xb3ab9378: je     0xb3ab93ac
0xb3ab937a: mov    $0xfffffffe,%ecx
0xb3ab937f: mov    $0xffffffff,%ebx
0xb3ab9384: movsd  0x6fb7b2f0(%ebp),%xmm0  ;   {oop('Test2')}
0xb3ab938c: movd   %xmm0,%ebp
0xb3ab9390: psrlq  $0x20,%xmm0
0xb3ab9395: movd   %xmm0,%edi         ;*getstatic l
                                    ; - Test2::run@14 (line 36)
0xb3ab9399: cmp    %ecx,%ebp
0xb3ab939b: jne    0xb3ab939f
0xb3ab939d: cmp    %ebx,%edi
0xb3ab939f: je     0xb3ab93ba         ;*return
;... lines removed

In this case both of the getstatic references to the variable l involves a load from memory, i.e. the value can not be kept in a register across multiple volatile reads. To ensure that there is an atomic read the value is read from main memory into an MMX register movsd 0x6fb7b2f0(%ebp),%xmm0 making the read operation a single instruction (from the previous example we saw that 64bit value would normally require two 32bit reads on a 32bit system).

So the overall cost of a volatile read will roughly equivalent of a memory load and can be as cheap as a L1 cache access. However if another core is writing to the volatile variable, the cache-line will be invalidated requiring a main memory or perhaps an L3 cache access. The actual cost will depend heavily on the CPU architecture. Even between Intel and AMD the cache coherency protocols are different.


In the words of the Java Memory Model (as defined for Java 5+ in JSR 133), any operation -- read or write -- on a volatile variable creates a happens-before relationship with respect to any other operation on the same variable. This means that the compiler and JIT are forced to avoid certain optimisations such as reordering instructions within the thread or performing operations only within the local cache.

Since some optimisations are not available, the resulting code is necessarily slower that it would have been, though probably not by very much.

Nevertheless you shouldn't make a variable volatile unless you know that it will be accessed from multiple threads outside of synchronized blocks. Even then you should consider whether volatile is the best choice versus synchronized, AtomicReference and its friends, the explicit Lock classes, etc.