What does multicore assembly language look like?

This isn't a direct answer to the question, but it's an answer to a question that appears in the comments. Essentially, the question is what support the hardware gives to multi-core operation: the ability to run multiple software threads at truly the same time, without software context-switching between them. (Sometimes called an SMP, symmetric multiprocessing, system.)

Nicholas Flynt had it right, at least regarding x86. In a multi-core environment (Hyper-threading, multi-core or multi-processor), the bootstrap core (usually hardware thread (aka logical core) 0 in core 0 in processor 0) starts up fetching code from address 0xfffffff0. All the other cores (hardware threads) start up in a special sleep state called Wait-for-SIPI (WFS). As part of its initialization, the primary core sends a special inter-processor interrupt (IPI) over the APIC called a SIPI (Startup IPI) to each core that is in WFS. The SIPI contains the address from which that core should start fetching code.

This mechanism allows each core to execute code from a different address. All that's needed is software support for each hardware core to set up its own tables and messaging queues.

The OS uses those to do the actual multi-threaded scheduling of software tasks. (A normal OS only has to bring up other cores once, at bootup, unless you're hot-plugging CPUs, e.g. in a virtual machine. This is separate from starting or migrating software threads onto those cores. Each core is running the kernel, which spends its time calling a sleep function to wait for an interrupt if there isn't anything else for it to be doing.)

As far as the actual assembly is concerned, as Nicholas wrote, there's no difference between the assembly for a single-threaded and a multi-threaded application. Each core has its own register set (execution context), so writing:

mov edx, 0

will only update EDX for the currently running thread. There's no way to modify EDX on another processor using a single assembly instruction. You need some sort of system call to ask the OS to tell another thread to run code that will update its own EDX.
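
In userland, the portable shape of that request is to hand the OS a piece of code to run on another thread; that thread's instructions then update its own registers as it executes. A minimal C++ sketch of the idea (the names here are purely illustrative, not any particular OS API):

#include <thread>

int result;

int main() {
    // The OS may schedule this thread on another core; the zero is
    // materialized in that core's own registers, not in ours.
    std::thread other([] { result = 0; });
    // Joining synchronizes, so reading `result` afterwards is safe.
    other.join();
    return result;
}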


As I understand it, each "core" is a complete processor, with its own register set. Basically, the BIOS starts you off with one core running, and then the operating system can "start" other cores by initializing them and pointing them at the code to run, etc.

Synchronization is done by the OS. Generally, each processor is running a different process for the OS, so the multi-threading functionality of the operating system is in charge of deciding which process gets to touch which memory, and what to do in the case of a memory collision.


Intel x86 minimal runnable baremetal example

Runnable bare metal example with all required boilerplate. All major parts are covered below.

Tested on Ubuntu 15.10 with QEMU 2.3.0, and on Lenovo ThinkPad T400 real hardware.

The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015 covers SMP in chapters 8, 9 and 10.

Table 8-1. "Broadcast INIT-SIPI-SIPI Sequence and Choice of Timeouts" contains an example that basically just works:

MOV ESI, ICR_LOW    ; Load address of ICR low dword into ESI.
MOV EAX, 000C4500H  ; Load ICR encoding for broadcast INIT IPI
                    ; to all APs into EAX.
MOV [ESI], EAX      ; Broadcast INIT IPI to all APs
; 10-millisecond delay loop.
MOV EAX, 000C46XXH  ; Load ICR encoding for broadcast SIPI IPI
                    ; to all APs into EAX, where XX is the startup
                    ; vector (explained in point 5 below).
MOV [ESI], EAX      ; Broadcast SIPI IPI to all APs
; 200-microsecond delay loop
MOV [ESI], EAX      ; Broadcast second SIPI IPI to all APs
                    ; Waits for the timer interrupt until the timer expires

On that code:

  1. Most operating systems will make most of those operations impossible from ring 3 (user programs).

    So you need to write your own kernel to play freely with it: a userland Linux program will not work.

  2. At first, a single processor runs, called the bootstrap processor (BSP).

    It must wake up the other ones, called Application Processors (APs), through special interrupts called Inter-Processor Interrupts (IPIs).

    Those interrupts can be issued by programming the Advanced Programmable Interrupt Controller (APIC) through its Interrupt Command Register (ICR).

    The format of the ICR is documented at: 10.6 "ISSUING INTERPROCESSOR INTERRUPTS"

    The IPI happens as soon as we write to the ICR.

  3. ICR_LOW is defined at 8.4.4 "MP Initialization Example" as:

    ICR_LOW EQU 0FEE00300H
    

    The magic value 0FEE00300H is the memory address of the ICR, as documented in Table 10-1 "Local APIC Register Address Map".

  4. The simplest possible method is used in the example: it sets up the ICR to send broadcast IPIs which are delivered to all other processors except the current one.

    But it is also possible, and recommended by some, to get information about the processors through special data structures set up by the BIOS, like ACPI tables or Intel's MP configuration table, and only wake up the ones you need one by one.

  5. XX in 000C46XXH encodes the address of the first instruction that the processor will execute as:

    CS = XX * 0x100
    IP = 0
    

    Remember that real mode multiplies the CS value by 0x10 when forming the physical address, so the actual memory address of the first instruction is:

    XX * 0x1000
    

    So if for example XX == 1, the processor will start at 0x1000.

    We must then ensure that there is 16-bit real mode code to be run at that memory location, e.g. with:

    cld
    /* Copy the 16-bit startup payload to 0x1000, where the APs start. */
    mov $init_len, %ecx
    mov $init, %esi
    mov $0x1000, %edi
    rep movsb
    
    .code16
    init:
        xor %ax, %ax
        mov %ax, %ds
        /* Do stuff. */
        hlt
    .equ init_len, . - init
    

    Using a linker script is another possibility.

  6. The delay loops are an annoying part to get working: there is no super simple way to do such sleeps precisely.

    Possible methods include:

    • PIT (used in my example)
    • HPET
    • calibrate the time of a busy loop with one of the above, and use that instead (a sketch of the calibration idea follows this list)

    Related: How to display a number on the screen and sleep for one second with DOS x86 assembly?

  7. I think the initial processor needs to be in protected mode for this to work, as we write to address 0FEE00300H, which is too high to reach with 16-bit addressing.

  8. To communicate between processors, we can use a spinlock on the main core, and modify the lock from the second core (a userland sketch of the idea also follows this list).

    We should ensure that memory write back is done, e.g. through wbinvd.
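
For the delay loops of point 6, here is a hedged sketch of the "calibrate a busy loop" idea, written as userland C++ with std::chrono standing in for the reference timer (on bare metal you would read the PIT or HPET instead; all names are illustrative):

#include <chrono>
#include <cstdint>

volatile uint64_t sink;  // volatile so the compiler keeps the loop

// Measure how many iterations of a trivial loop fit in one millisecond,
// against a reference clock.
uint64_t loops_per_ms() {
    using namespace std::chrono;
    const uint64_t probe = 10000000;
    auto t0 = steady_clock::now();
    for (uint64_t i = 0; i < probe; ++i)
        sink = i;
    auto us = duration_cast<microseconds>(steady_clock::now() - t0).count();
    return probe * 1000 / us;
}

// Spin for approximately `ms` milliseconds using the calibration.
void delay_ms(uint64_t ms, uint64_t lpms) {
    for (uint64_t i = 0; i < ms * lpms; ++i)
        sink = i;
}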
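
And for point 8, a userland C++ analogue of the cross-core spinlock (std::atomic hides the barrier and LOCK-prefix details that bare metal code would have to spell out itself):

#include <atomic>
#include <thread>

std::atomic_flag lock = ATOMIC_FLAG_INIT;
long counter;

void work() {
    for (int i = 0; i < 100000; ++i) {
        while (lock.test_and_set(std::memory_order_acquire)) {}  // spin
        ++counter;                              // critical section
        lock.clear(std::memory_order_release);  // unlock
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // counter == 200000: the spinlock serialized the increments.
}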

Shared state between processors

8.7.1 "State of the Logical Processors" says:

The following features are part of the architectural state of logical processors within Intel 64 or IA-32 processors supporting Intel Hyper-Threading Technology. The features can be subdivided into three groups:

  • Duplicated for each logical processor
  • Shared by logical processors in a physical processor
  • Shared or duplicated, depending on the implementation

The following features are duplicated for each logical processor:

  • General purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, and EBP)
  • Segment registers (CS, DS, SS, ES, FS, and GS)
  • EFLAGS and EIP registers. Note that the CS and EIP/RIP registers for each logical processor point to the instruction stream for the thread being executed by the logical processor.
  • x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)
  • MMX registers (MM0 through MM7)
  • XMM registers (XMM0 through XMM7) and the MXCSR register
  • Control registers and system table pointer registers (GDTR, LDTR, IDTR, task register)
  • Debug registers (DR0, DR1, DR2, DR3, DR6, DR7) and the debug control MSRs
  • Machine check global status (IA32_MCG_STATUS) and machine check capability (IA32_MCG_CAP) MSRs
  • Thermal clock modulation and ACPI Power management control MSRs
  • Time stamp counter MSRs
  • Most of the other MSR registers, including the page attribute table (PAT). See the exceptions below.
  • Local APIC registers.
  • Additional general purpose registers (R8-R15), XMM registers (XMM8-XMM15), control register, IA32_EFER on Intel 64 processors.

The following features are shared by logical processors:

  • Memory type range registers (MTRRs)

Whether the following features are shared or duplicated is implementation-specific:

  • IA32_MISC_ENABLE MSR (MSR address 1A0H)
  • Machine check architecture (MCA) MSRs (except for the IA32_MCG_STATUS and IA32_MCG_CAP MSRs)
  • Performance monitoring control and counter MSRs

Cache sharing is discussed at:

  • How are cache memories shared in multicore Intel CPUs?
  • http://stackoverflow.com/questions/4802565/multiple-threads-and-cpu-cache
  • Can multiple CPU's / cores access the same RAM simultaneously?

Intel hyperthreads have greater cache and pipeline sharing than separate cores: https://superuser.com/questions/133082/hyper-threading-and-dual-core-whats-the-difference/995858#995858

Linux kernel 4.2

The main initialization action seems to be at arch/x86/kernel/smpboot.c.

ARM minimal runnable baremetal example

Here I provide a minimal runnable ARMv8 aarch64 example for QEMU:

.global mystart
mystart:
    /* Reset spinlock. */
    mov x0, #0
    ldr x1, =spinlock
    str x0, [x1]

    /* Read cpu id into x1.
     * TODO: cores beyond 4th?
     * Mnemonic: Main Processor ID Register
     */
    mrs x1, mpidr_el1
    ands x1, x1, 3
    beq cpu0_only
cpu1_only:
    /* Only CPU 1 reaches this point and sets the spinlock. */
    mov x0, 1
    ldr x1, =spinlock
    str x0, [x1]
    /* Ensure that CPU 0 sees the write right now.
     * Optional, but could save some useless CPU 1 loops.
     */
    dmb sy
    /* Wake up CPU 0 if it is sleeping on wfe.
     * Optional, but could save power on a real system.
     */
    sev
cpu1_sleep_forever:
    /* Hint CPU 1 to enter low power mode.
     * Optional, but could save power on a real system.
     */
    wfe
    b cpu1_sleep_forever
cpu0_only:
    /* Only CPU 0 reaches this point. */

    /* Wake up CPU 1 from initial sleep!
     * See: https://github.com/cirosantilli/linux-kernel-module-cheat#psci
     */
    /* PSCI function identifier: CPU_ON. */
    ldr w0, =0xc4000003
    /* Argument 1: target_cpu */
    mov x1, 1
    /* Argument 2: entry_point_address */
    ldr x2, =cpu1_only
    /* Argument 3: context_id */
    mov x3, 0
    /* Unused hvc args: the Linux kernel zeroes them,
     * but I don't think it is required.
     */
    hvc 0

spinlock_start:
    ldr x0, spinlock
    /* Hint CPU 0 to enter low power mode. */
    wfe
    cbz x0, spinlock_start

    /* Semihost exit. */
    mov x1, 0x26
    movk x1, 2, lsl 16
    str x1, [sp, 0]
    mov x0, 0
    str x0, [sp, 8]
    mov x1, sp
    mov w0, 0x18
    hlt 0xf000

spinlock:
    .skip 8

GitHub upstream.

Assemble and run:

aarch64-linux-gnu-gcc \
  -mcpu=cortex-a57 \
  -nostdlib \
  -nostartfiles \
  -Wl,--section-start=.text=0x40000000 \
  -Wl,-N \
  -o aarch64.elf \
  -T link.ld \
  aarch64.S \
;
qemu-system-aarch64 \
  -machine virt \
  -cpu cortex-a57 \
  -d in_asm \
  -kernel aarch64.elf \
  -nographic \
  -semihosting \
  -smp 2 \
;

In this example, we put CPU 0 in a spinlock loop, and it only exits when CPU 1 releases the spinlock.

After the spinlock, CPU 0 then does a semihost exit call which makes QEMU quit.

If you start QEMU with just one CPU with -smp 1, then the simulation just hangs forever on the spinlock.

CPU 1 is woken up with the PSCI interface, more details at: ARM: Start/Wakeup/Bringup the other CPU cores/APs and pass execution start address?

The upstream version also has a few tweaks to make it work on gem5, so you can experiment with performance characteristics as well.

I haven't tested it on real hardware, so I'm not sure how portable it is. The following Raspberry Pi bibliography might be of interest:

  • https://github.com/bztsrc/raspi3-tutorial/tree/a3f069b794aeebef633dbe1af3610784d55a0efa/02_multicorec
  • https://github.com/dwelch67/raspberrypi/tree/a09771a1d5a0b53d8e7a461948dc226c5467aeec/multi00
  • https://github.com/LdB-ECM/Raspberry-Pi/blob/3b628a2c113b3997ffdb408db03093b2953e4961/Multicore/SmartStart64.S
  • https://github.com/LdB-ECM/Raspberry-Pi/blob/3b628a2c113b3997ffdb408db03093b2953e4961/Multicore/SmartStart32.S

This document provides some guidance on using ARM synchronization primitives which you can then use to do fun things with multiple cores: http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/DHT0008A_arm_synchronization_primitives.pdf

Tested on Ubuntu 18.10, GCC 8.2.0, Binutils 2.31.1, QEMU 2.12.0.

Next steps for more convenient programmability

The previous examples wake up a secondary CPU and do basic memory synchronization with dedicated instructions, which is a good start.

But to make multicore systems as convenient to program as, e.g., POSIX pthreads, you would also need to go into the following more involved topics:

  • set up interrupts and run a timer that periodically decides which thread will run now. This is known as preemptive multithreading.

    Such a system also needs to save and restore thread registers as they are started and stopped (the ucontext sketch after this list illustrates that mechanism in userland).

    It is also possible to have non-preemptive multitasking systems, but those might require you to modify your code so that every thread yields (e.g. with a pthread_yield implementation), and it becomes harder to balance workloads.

    Here are some simplistic bare metal timer examples:

    • x86 PIT
  • deal with memory conflicts. Notably, each thread will need a unique stack if you want to code in C or other high level languages.

    You could just limit threads to have a fixed maximum stack size, but the nicer way to deal with this is with paging which allows for efficient "unlimited size" stacks.

    Here is a naive aarch64 baremetal example that would blow up if the stack grows too deep.
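
As a hedged userland illustration of both points (saving and restoring a register context, and giving each task its own stack), here is a minimal cooperative-switch sketch using POSIX ucontext; a preemptive kernel would perform the same save/restore from its timer interrupt handler instead:

#include <cstdio>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[16384];  // the task's private stack

static void task() {
    std::puts("task: running on its own stack");
    // Save this task's registers and restore main's.
    swapcontext(&task_ctx, &main_ctx);
    std::puts("task: resumed");
}

int main() {
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof(task_stack);
    task_ctx.uc_link = &main_ctx;  // where to go when task() returns
    makecontext(&task_ctx, task, 0);
    swapcontext(&main_ctx, &task_ctx);  // switch into the task
    std::puts("main: back in main");
    swapcontext(&main_ctx, &task_ctx);  // resume the task until it ends
    std::puts("main: task finished");
}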

Those are some good reasons to use the Linux kernel or some other operating system :-)

Userland memory synchronization primitives

Although thread start / stop / management is generally beyond userland's scope, userland threads can still use assembly instructions to synchronize memory accesses without resorting to potentially more expensive system calls.

You should of course prefer using libraries that portably wrap these low level primitives. The C++ standard itself has made great advances on the <mutex> and <atomic> headers, and in particular with std::memory_order. I'm not sure if it covers all possible memory semantics achievable, but it just might.
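
For instance, here is a sketch of the common release/acquire publication pattern with std::atomic (the payload and names are just illustrative):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int payload;  // plain non-atomic data, published via `ready`

int main() {
    std::thread producer([] {
        payload = 42;
        ready.store(true, std::memory_order_release);  // publish
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) {}  // spin
        assert(payload == 42);  // acquire makes the write visible
    });
    producer.join();
    consumer.join();
}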

The more subtle semantics are particularly relevant in the context of lock free data structures, which can offer performance benefits in certain cases. To implement those, you will likely have to learn a bit about the different types of memory barriers: https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/

Boost for example has some lock free container implementations at: https://www.boost.org/doc/libs/1_63_0/doc/html/lockfree.html

Such userland instructions also appear to be used to implement the Linux futex system call, which is one of the main synchronization primitives in Linux. man futex 4.15 reads:

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

The syscall name itself stands for "Fast Userspace muTEX".
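
As a hedged illustration of that description, here is a minimal Linux-only C++ sketch that drives futex(2) directly (glibc provides no wrapper, so it goes through syscall(2)):

#include <atomic>
#include <thread>

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

std::atomic<int> futex_word{0};

long futex(void *addr, int op, int val) {
    // The remaining futex arguments are unused for WAIT/WAKE here.
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

int main() {
    std::thread waker([] {
        futex_word.store(1);
        futex(&futex_word, FUTEX_WAKE, 1);  // wake at most one waiter
    });
    // Sleep in the kernel only while the word still holds 0:
    // FUTEX_WAIT returns immediately if the value has already changed.
    while (futex_word.load() == 0)
        futex(&futex_word, FUTEX_WAIT, 0);
    waker.join();
}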

Here is a minimal useless C++ x86_64 / aarch64 example with inline assembly that illustrates basic usage of such instructions mostly for fun:

main.cpp

#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

std::atomic_ulong my_atomic_ulong(0);
unsigned long my_non_atomic_ulong = 0;
#if defined(__x86_64__) || defined(__aarch64__)
unsigned long my_arch_atomic_ulong = 0;
unsigned long my_arch_non_atomic_ulong = 0;
#endif
size_t niters;

void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
        my_atomic_ulong++;
        my_non_atomic_ulong++;
#if defined(__x86_64__)
        __asm__ __volatile__ (
            "incq %0;"
            : "+m" (my_arch_non_atomic_ulong)
            :
            :
        );
        // https://github.com/cirosantilli/linux-kernel-module-cheat#x86-lock-prefix
        __asm__ __volatile__ (
            "lock;"
            "incq %0;"
            : "+m" (my_arch_atomic_ulong)
            :
            :
        );
#elif defined(__aarch64__)
        __asm__ __volatile__ (
            "add %0, %0, 1;"
            : "+r" (my_arch_non_atomic_ulong)
            :
            :
        );
        // https://github.com/cirosantilli/linux-kernel-module-cheat#arm-lse
        __asm__ __volatile__ (
            "ldadd %[inc], xzr, [%[addr]];"
            : "=m" (my_arch_atomic_ulong)
            : [inc] "r" (1),
              [addr] "r" (&my_arch_atomic_ulong)
            :
        );
#endif
    }
}

int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10000;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    assert(my_atomic_ulong.load() == nthreads * niters);
    // We can also use the atomics directly through `operator T` conversion.
    assert(my_atomic_ulong == my_atomic_ulong.load());
    std::cout << "my_non_atomic_ulong " << my_non_atomic_ulong << std::endl;
#if defined(__x86_64__) || defined(__aarch64__)
    assert(my_arch_atomic_ulong == nthreads * niters);
    std::cout << "my_arch_non_atomic_ulong " << my_arch_non_atomic_ulong << std::endl;
#endif
}

GitHub upstream.
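
To build and run it, something like the following should work (typical flags, not prescriptive; on aarch64 you likely also need -march=armv8.1-a so the assembler accepts LDADD):

g++ -std=c++11 -Wall -Wextra -pthread -o main.out main.cpp
./main.out 2 10000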

Possible output:

my_non_atomic_ulong 15264
my_arch_non_atomic_ulong 15267

From this we see that the x86 LOCK prefix / aarch64 LDADD instruction made the addition atomic: without it we have race conditions on many of the adds, and the total count at the end is less than the synchronized 20000 (nthreads * niters == 2 * 10000 by default).

See also:

  • x86
    • LOCK What does the "lock" instruction mean in x86 assembly?
    • PAUSE How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
  • ARM
    • LDXR/STXR, LDAXR/STLXR: ARM64: LDXR/STXR vs LDAXR/STLXR
    • LDADD and other atomic v8.1 load modify store instructions: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801g/alc1476202791033.html
    • WFE / SEV: WFE instruction handling in ARM
  • What exactly is std::atomic?

Tested in Ubuntu 19.04 amd64 and with QEMU aarch64 user mode.