How does a microcontroller boot and start up, step by step?

1) The compiled binary is written to prom/flash, yes. USB, serial, i2c, jtag, etc.: which one depends on what that device supports, and it is irrelevant for understanding the boot process.

2) This is typically not true for a microcontroller, the primary use case is to have instructions in rom/flash and data in ram. No matter what the architecture. for a non-microcontroller, your pc, your laptop, your server, the program is copied from non-volatile (disk) to ram then run from there. Some microcontrollers let you use ram as well, even ones that claim harvard even though it appears to violate the definition. There is nothing about harvard that prevents you from mapping ram into the instruction side, you just need to have a mechanism to get the instructions there after power is up (which violates the definition, but harvard systems would have to do that to be useful other than as microcontrollers).

3) sort of.

Each cpu "boots" in a deterministic, as designed, way. The most common way is a vector table where the address of the first instructions to run after powering up is in the reset vector, a location that the hardware reads and then uses that address to start running. The other general way is to have the processor start executing without a vector table at some well known address. Sometimes the chip will have "straps", some pins that you can tie high or low before releasing reset, that the logic uses to boot different ways. You have to separate the cpu itself, the processor core, from the rest of the system. Understand how the cpu operates, and then understand that the chip/system designers have setup address decoders around the outside of the cpu so that some part of the cpu's address space communicates with a flash, and some with ram and some with peripherals (uart, i2c, spi, gpio, etc). You can take that same cpu core if you wish, and wrap it differently. This is what you get when you buy something arm or mips based. arm and mips make cpu cores, which chip people buy and wrap their own stuff around, and for various reasons they dont make that stuff compatible from brand to brand. Which is why you can rarely ask a generic arm question when it comes to anything outside the core.

A microcontroller attempts to be a system on a chip, so its non-volatile memory (flash/rom), volatile (sram), and cpu are all on the same chip along with a mixture of peripherals. But the chip is designed internally such that the flash is mapped into the address space of the cpu that matches the boot characteristics of that cpu. If for example the cpu has a reset vector at address 0xFFFC, then there needs to be flash/rom that responds to that address that we can program via 1), along with enough flash/rom in the address space for useful programs. A chip designer may choose to have 0x1000 bytes of flash starting at 0xF000 in order to satisfy those requirements. And perhaps they put some amount of ram at a lower address or maybe 0x0000, and the peripherals somewhere in the middle.

Another architecture of cpu might start executing at address zero, so they would need to do the opposite, place the flash so that it answers to an address range around zero. say 0x0000 to 0x0FFF for example. and then put some ram elsewhere.
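The two layouts above can be modeled as a small table-driven address decoder. This is only a sketch: the flash ranges come from the text, while the ram and peripheral ranges, the type names and the function name decode are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What answers a given address. NONE means nothing is mapped there. */
enum region { NONE, FLASH, RAM, PERIPH };

struct range {
    uint32_t first;     /* first address the block answers to */
    uint32_t last;      /* last address, inclusive */
    enum region what;
};

/* Layout for a cpu that reads its reset vector at 0xFFFC.
   The flash range is from the text; ram and peripheral ranges are made up. */
const struct range map_vector_high[] = {
    { 0x0000, 0x03FF, RAM    },
    { 0x8000, 0x80FF, PERIPH },
    { 0xF000, 0xFFFF, FLASH  },
};

/* Layout for a cpu that starts executing at address zero.
   The flash range is from the text; the ram range is made up. */
const struct range map_start_zero[] = {
    { 0x0000, 0x0FFF, FLASH },
    { 0x2000, 0x23FF, RAM   },
};

/* Model of the address decoder: find which block answers this address. */
enum region decode(const struct range *map, size_t n, uint32_t addr)
{
    assert(map != NULL);
    for (size_t i = 0; i < n; i++) {
        if (addr >= map[i].first && addr <= map[i].last)
            return map[i].what;
    }
    return NONE;
}
```

With the first map, decode(map_vector_high, 3, 0xFFFC) lands in flash, which is exactly what that cpu needs in order to find its reset vector.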

The chip designers know how the cpu boots and they have placed non-volatile storage there (flash/rom). It is then up to the software folks to write the boot code to match the well known behavior of that cpu. You have to place the reset vector address in the reset vector and your boot code at the address you defined in the reset vector. The toolchain can help you greatly here. Sometimes, esp with point and click ides or other sandboxes, they may do most of the work for you, and all you do is call apis in a high level language (C).

But, however it is done, the program loaded into the flash/rom has to match the hardwired boot behavior of the cpu. Before the C portion of your program, main() and on if you use main as your entry point, some things have to be done. A C programmer assumes that when they declare a variable with an initial value, that actually works. Well, variables, other than const ones, are in ram, but if you have one with an initial value that initial value has to be in non-volatile memory. So this is the .data segment and the C bootstrap needs to copy .data stuff from flash to ram (where is usually determined for you by the toolchain). Global variables that you declare without an initial value are assumed to be zero before your program starts, although you should really not assume that, and thankfully some compilers are starting to warn about uninitialized variables. This is the .bss segment, and the C bootstrap zeros that out in ram; the content, zeros, does not have to be stored in non-volatile memory, but the starting address and how much does. Again the toolchain helps you greatly here. And lastly the bare minimum is you need to setup a stack pointer, as C programs expect to be able to have local variables and call other functions. Then maybe some other chip specific stuff is done, or we let the rest of the chip specific stuff happen in C.
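The two ram preparation steps just described can be sketched in C. This is a minimal sketch and not any particular toolchain's startup code: the function names copy_data and zero_bss are made up, and on a real target the pointers and lengths would come from symbols defined in the linker script, whose names are toolchain and project specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy the initial values of .data from their stored copy in flash
   to the place in ram where the program expects to find them. */
void copy_data(uint32_t *ram_dst, const uint32_t *flash_src, size_t words)
{
    assert(ram_dst != NULL && flash_src != NULL);
    for (size_t i = 0; i < words; i++)
        ram_dst[i] = flash_src[i];
}

/* Zero .bss. Only its start and length have to be known,
   the zeros themselves are not stored in flash. */
void zero_bss(uint32_t *ram_dst, size_t words)
{
    assert(ram_dst != NULL);
    for (size_t i = 0; i < words; i++)
        ram_dst[i] = 0;
}
```

Both loops run before any C code that relies on globals, which is why they usually sit in the reset handler ahead of the call to the C entry point.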

The cortex-m series cores from arm will do some of this for you, the stack pointer is in the vector table, there is a reset vector to point at the code to be run after reset, so that other than whatever you have to do to generate the vector table (which you usually use asm for anyway) you can go pure C without asm. now you dont get your .data copied over nor your .bss zeroed so you have to do that yourself if you want to try to go without asm on something cortex-m based. The bigger feature is not the reset vector but interrupt vectors where the hardware follows the arms recommended C calling convention and preserves registers for you, and uses the correct return for that vector, so that you dont have to wrap the right asm around each handler (or have toolchain specific directives for your target to have the toolchain wrap it for you).

Chip specific stuff may be for example, microcontrollers are often used in battery based systems, so low power so some come out of reset with most of the peripherals turned off, and you have to turn each of these sub systems on so you can use them. Uarts, gpios, etc. Often a low-ish clock speed is used, straight from a crystal or internal oscillator. And your system design may show that you need a faster clock, so you initialize that. your clock may be too fast for the flash or ram so you may have needed to change the wait states before upping the clock. Might need to setup the uart, or usb or other interfaces. then your application can do its thing.
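The ordering matters more than the details, so here is a sketch against a pretend chip. Everything in it is hypothetical: the struct fake_chip stands in for real registers, the UART enable bit is made up, and the rule of one wait state per 24 MHz is an invented limit. A real chip's reference manual defines its own registers and numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a chip's clock and flash control registers (hypothetical). */
struct fake_chip {
    uint32_t periph_enable;   /* one bit per peripheral, all off at reset */
    uint32_t flash_waits;     /* flash wait states */
    uint32_t clock_hz;        /* current cpu clock */
};

#define FAKE_UART_BIT (1u << 0)   /* hypothetical bit assignment */

/* Hypothetical limit: one extra wait state per 24 MHz of cpu clock. */
uint32_t waits_needed(uint32_t hz)
{
    assert(hz > 0);
    return (hz - 1u) / 24000000u;
}

/* Raise the clock: slow the flash down first, then speed the cpu up. */
void raise_clock(struct fake_chip *chip, uint32_t hz)
{
    assert(chip != NULL && hz >= chip->clock_hz);
    chip->flash_waits = waits_needed(hz);   /* safe at the old, slower clock */
    chip->clock_hz = hz;                    /* now the flash can keep up */
}

/* Peripherals come out of reset disabled, turn on only what is used. */
void enable_uart(struct fake_chip *chip)
{
    assert(chip != NULL);
    chip->periph_enable |= FAKE_UART_BIT;
}
```

Doing the two writes in raise_clock the other way around would have the cpu fetching from flash faster than the flash can answer, even if only for a moment.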

A computer desktop, laptop, server, and a microcontroller are no different in how they boot/work. Except that they are not mostly on one chip. The bios program is often on a separate chip flash/rom from the cpu. Although recently x86 cpus are pulling more and more of what used to be support chips into the same package (pcie controllers, etc) but you still have most of your ram and rom off chip, but it is still a system and it still works exactly the same at a high level. The cpu boot process is well known, the board designers place the flash/rom in the address space where the cpu boots. that program (part of the BIOS on an x86 pc) does all the things mentioned above, it starts up various peripherals, it initializes dram, enumerates the pcie buses, and so on. Is often quite configurable by the user based on bios settings or what we used to call cmos settings, because at the time that is what tech was used. Doesnt matter, there are user settings that you can go and change to tell the bios boot code how to vary what it does.

different folks will use different terminology. a chip boots, that is the first code that runs. sometimes called bootstrap. a bootloader with the word loader often means that if you dont do anything to interfere it is a bootstrap which takes you from generic booting into something larger, your application or operating system. but the loader part implies that you can interrupt the boot process and then maybe load other test programs. if you have ever used uboot for example on an embedded linux system, you can hit a key and stop the normal boot then you can download a test kernel into ram and boot it instead of the one that is on flash, or you can download your own programs, or you can download the new kernel then have the bootloader write it to flash so that next time you boot it runs the new stuff. but bootloader as a term is often used for any kind of booting even if it doesnt have a loader portion to it.

As far as the cpu itself, the core processor, which doesnt know ram from flash from peripherals. There is no notion of bootloader, operating system, application. It is just a sequence of instructions that are fed into the cpu to be executed. These are software terms to distinguish different programming tasks from one another. Software concepts from one another.

Some microcontrollers have a separate bootloader provided by the chip vendor in a separate flash or separate area of flash that you might not be able to modify. In this case there is often a pin or set of pins (I call them straps) that if you tie them high or low before reset is released you are telling the logic and/or that bootloader what to do, for example one strap combination may tell the chip to run that bootloader and wait on the uart for data to be programmed into the flash. Set the straps the other way and your program boots, not the chip vendor's bootloader, allowing for field programming of the chip or recovering from your program crashing. Sometimes it is just pure logic that allows you to program the flash. This is quite common these days, but if you go way back you did need/want your own bootloader for the same reasons (of course go too far back and you pulled the eeprom/prom/rom chip out of the socket and replaced it with another or reprogrammed it in a fixture). And you can still have your own bootloader if you want, even if there are hardware ways to field program (avr/arduino).

The reason why most microcontrollers have much more flash than ram is that the primary use case is to run the program directly from flash, and only have enough ram to cover stack and variables. Although in some cases you can run programs from ram which you have to compile right and store in flash then copy before calling.

EDIT

flash.s

.cpu cortex-m0
.thumb

.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.word hang
.word hang

.thumb_func
reset:
    bl notmain
    b hang

.thumb_func
hang:   b .

notmain.c

int notmain ( void )
{
    unsigned int x=1;
    unsigned int y;
    y = x + 1;

    return(0);
}

flash.ld

MEMORY
{
    bob : ORIGIN = 0x00000000, LENGTH = 0x1000
    ted : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > bob
    .rodata : { *(.rodata*) } > bob
    .bss : { *(.bss*) } > ted
    .data : { *(.data*) } > ted AT > bob
}

So this is an example for a cortex-m0, the cortex-ms all work the same as far as this example goes. The particular chip, for this example, has application flash at address 0x00000000 in the arm address space and ram at 0x20000000.

The way a cortex-m boots is the 32 bit word at address 0x0000 is the value used to initialize the stack pointer. I dont need much stack for this example so 0x20001000 will suffice, obviously there has to be ram below that address (the way the arm pushes, it subtracts first then stores, so if you set 0x20001000 the first item on the stack is at address 0x20000FFC; you dont have to use 0x20000FFC). The 32 bit word at address 0x0004 is the address of the reset handler, basically the first code that runs after a reset. Then there are more interrupt and event handlers that are specific to that cortex m core and chip, possibly as many as 128 or 256. If you dont use them then you dont need to setup the table for them, I threw in a few for demonstration purposes. But you would have to make sure you have the right vector at the address hardcoded in the logic for a particular interrupt/event (for example a uart rx data interrupt, or a gpio pin state change interrupt, as well as the undefined instructions, data aborts and such).
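To make the reset behavior and the stack arithmetic concrete, here is a small host-side model. It is a sketch only: struct core and the function names are made up, and it models nothing beyond the two vector table reads and a subtract-then-store push.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two registers that matter right after reset. */
struct core {
    uint32_t sp;
    uint32_t pc;
};

/* What the hardware does at reset: load sp from word 0 and pc from
   word 1 of the vector table. */
void cortex_m_reset(struct core *c, const uint32_t *vector_table)
{
    assert(c != NULL && vector_table != NULL);
    c->sp = vector_table[0];
    c->pc = vector_table[1] & ~1u;   /* lsbit selects thumb, it is not part of the address */
}

/* Full descending stack: subtract first, then store.
   Returns the address the pushed word lands at. */
uint32_t push_word(struct core *c)
{
    assert(c != NULL && c->sp >= 4u);
    c->sp -= 4u;
    return c->sp;
}
```

Fed the table from flash.s (0x20001000 followed by the reset vector), the first push lands at 0x20000FFC, four bytes below the initial stack pointer.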

I do not need to deal with .data nor .bss in this example because I know already there is nothing in those segments by looking at the code. If there were I would deal with it, and will in a second.

So the stack is setup, check, .data taken care of, check, .bss, check, so the C bootstrap stuff is done, can branch to the entry function for C. Because some compilers will add extra junk if they see the function main() and on the way to main, I dont use that exact name, I used notmain() here as my C entry point. So the reset handler calls notmain() then if/when notmain() returns it goes to hang which is just an infinite loop, possibly poorly named.

I firmly believe in mastering the tools, many folks dont, but what you will find is that each bare metal developer does his/her own thing, because of the near complete freedom, not remotely as constrained as you would be making apps or web pages. They again do their own thing. I prefer to have my own bootstrap code and linker script. Others rely on the toolchain, or play in the vendors sandbox where most of the work is done by someone else (and if something breaks you are in a world of hurt, and with bare metal things break often and in dramatic ways).

So assembling, compiling and linking with gnu tools I get:

00000000 <_start>:
   0:   20001000    andcs   r1, r0, r0
   4:   00000015    andeq   r0, r0, r5, lsl r0
   8:   0000001b    andeq   r0, r0, fp, lsl r0
   c:   0000001b    andeq   r0, r0, fp, lsl r0
  10:   0000001b    andeq   r0, r0, fp, lsl r0

00000014 <reset>:
  14:   f000 f802   bl  1c <notmain>
  18:   e7ff        b.n 1a <hang>

0000001a <hang>:
  1a:   e7fe        b.n 1a <hang>

0000001c <notmain>:
  1c:   2000        movs    r0, #0
  1e:   4770        bx  lr

So how does the bootloader know where stuff is? Because the toolchain did the work. In the first case the assembler generated the code for flash.s, and by doing so knows where the labels are (labels are just addresses just like function names or variable names, etc) so I didnt have to count bytes and fill in the vector table manually, I used a label name and the assembler did it for me. Now you ask, if reset is address 0x14 why did the assembler put 0x15 in the vector table. Well this is a cortex-m and it boots and only runs in thumb mode. With ARM when you branch to an address, if branching to thumb mode the lsbit needs to be set, if arm mode then clear. So here you always need that bit set. I know the tools, and by putting .thumb_func before a label, if that label is used as it is in the vector table or for branching to or whatever, the toolchain knows to set the lsbit. So it has here 0x14|1 = 0x15. Likewise for hang. Now the disassembler doesnt show 0x1D for the call to notmain() but dont worry, the tools have correctly built the instruction.
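The lsbit rule is small enough to write down. A sketch, with made-up function names, of what the assembler effectively does when a .thumb_func label is used as data:

```c
#include <assert.h>
#include <stdint.h>

/* Value to store in a vector table for a thumb function at label_addr.
   Thumb instructions are halfword aligned, so the lsbit of the real
   address is always zero and is free to carry the thumb flag. */
uint32_t thumb_vector(uint32_t label_addr)
{
    assert((label_addr & 1u) == 0u);
    return label_addr | 1u;
}

/* Check that a vector table entry would be accepted by a thumb only core. */
int is_thumb_vector(uint32_t entry)
{
    return (entry & 1u) != 0u;
}
```

For the listing above, thumb_vector(0x14) gives 0x15 for reset and thumb_vector(0x1a) gives 0x1b for hang, matching the words in the table.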

Now that code in notmain, those local variables are, not used, they are dead code. The compiler even comments on that fact by saying y is set but not used.

Note the address space, these things all start at address 0x0000 and go from there so the vector table is properly placed, the .text or program space is also properly placed. How I got flash.s in front of notmain.c's code is by knowing the tools; a common mistake is to not get that right and crash and burn hard. IMO you have to disassemble to make sure things are placed right before you boot the first time, once you have things in the right place you dont necessarily have to check every time. Just for new projects or if they hang.

Now something that surprises some folks is that there is no reason whatsoever to expect any two compilers to produce the same output from the same input. Or even the same compiler with different settings. Using clang, the llvm compiler I get these two outputs with and without optimization

llvm/clang optimized

00000000 <_start>:
   0:   20001000    andcs   r1, r0, r0
   4:   00000015    andeq   r0, r0, r5, lsl r0
   8:   0000001b    andeq   r0, r0, fp, lsl r0
   c:   0000001b    andeq   r0, r0, fp, lsl r0
  10:   0000001b    andeq   r0, r0, fp, lsl r0

00000014 <reset>:
  14:   f000 f802   bl  1c <notmain>
  18:   e7ff        b.n 1a <hang>

0000001a <hang>:
  1a:   e7fe        b.n 1a <hang>

0000001c <notmain>:
  1c:   2000        movs    r0, #0
  1e:   4770        bx  lr

not optimized

00000000 <_start>:
   0:   20001000    andcs   r1, r0, r0
   4:   00000015    andeq   r0, r0, r5, lsl r0
   8:   0000001b    andeq   r0, r0, fp, lsl r0
   c:   0000001b    andeq   r0, r0, fp, lsl r0
  10:   0000001b    andeq   r0, r0, fp, lsl r0

00000014 <reset>:
  14:   f000 f802   bl  1c <notmain>
  18:   e7ff        b.n 1a <hang>

0000001a <hang>:
  1a:   e7fe        b.n 1a <hang>

0000001c <notmain>:
  1c:   b082        sub sp, #8
  1e:   2001        movs    r0, #1
  20:   9001        str r0, [sp, #4]
  22:   2002        movs    r0, #2
  24:   9000        str r0, [sp, #0]
  26:   2000        movs    r0, #0
  28:   b002        add sp, #8
  2a:   4770        bx  lr

so "not optimized" is a lie, the compiler did optimize out the addition, but it did allocate two items on the stack for the variables. since these are local variables they are in ram but on the stack, not at fixed addresses, will see with globals that that changes. But the compiler realized that it could compute y at compile time and there was no reason to compute it at run time, so it simply placed a 1 in the stack space allocated for x and a 2 in the stack space allocated for y. the compiler "allocates" this space with internal tables, it declares stack plus 0 for variable y and stack plus 4 for variable x. the compiler can do whatever it wants so long as the code it implements conforms to the C standard or expectations of a C programmer. There is no reason why the compiler has to leave x at stack + 4 for the duration of the function, it could move it around as much as it wants, but remember humans make compilers and humans have to debug compilers and you have to balance maintenance and debugging with performance, and very often you will see that compiler generated code tends to setup a stack frame once and keep everything relative to the stack pointer throughout the function.

If I add a function dummy in assembler

.thumb_func
.globl dummy
dummy:
    bx lr

and then call it

void dummy ( unsigned int );
int notmain ( void )
{
    unsigned int x=1;
    unsigned int y;
    y = x + 1;
    dummy(y);
    return(0);
}

the output changes

00000000 <_start>:
   0:   20001000    andcs   r1, r0, r0
   4:   00000015    andeq   r0, r0, r5, lsl r0
   8:   0000001b    andeq   r0, r0, fp, lsl r0
   c:   0000001b    andeq   r0, r0, fp, lsl r0
  10:   0000001b    andeq   r0, r0, fp, lsl r0

00000014 <reset>:
  14:   f000 f804   bl  20 <notmain>
  18:   e7ff        b.n 1a <hang>

0000001a <hang>:
  1a:   e7fe        b.n 1a <hang>

0000001c <dummy>:
  1c:   4770        bx  lr
    ...

00000020 <notmain>:
  20:   b510        push    {r4, lr}
  22:   2002        movs    r0, #2
  24:   f7ff fffa   bl  1c <dummy>
  28:   2000        movs    r0, #0
  2a:   bc10        pop {r4}
  2c:   bc02        pop {r1}
  2e:   4708        bx  r1

now that we have nested function calls, the notmain function needs to preserve its return address, because the nested call clobbers it. this is because the arm uses a register for returns, if it used the stack like say an x86 or some others well...it would still use the stack but differently. Now you ask why did it push r4? Well, the calling convention not long ago changed to keep the stack aligned on 64 bit (two word) boundaries instead of 32 bit, one word boundaries. So they need to push something to keep the stack aligned, so the compiler arbitrarily chose r4 for some reason, doesnt matter why. The saved r4 goes back into r4 and the return address is popped into r1; popping the return address into r4 would be a bug though, as per the calling convention for this target we dont clobber r4 on a function call, we can clobber r0 through r3. r0 is the return value. Looks like it is doing a tail optimization maybe, I dont know, for some reason it popped the return address into r1 and used bx r1 instead of returning with a single pop.
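The alignment argument is just arithmetic, so here is a sketch of it. The function names are made up, and it assumes the stack pointer starts at the 0x20001000 from the vector table and that each pushed register takes four bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Stack pointer after pushing nregs 32 bit registers (full descending). */
uint32_t sp_after_push(uint32_t sp, unsigned nregs)
{
    assert(sp >= 4u * nregs);
    return sp - 4u * nregs;
}

/* True if sp sits on a 64 bit (two word) boundary. */
int is_aligned8(uint32_t sp)
{
    return (sp & 7u) == 0u;
}
```

Pushing only lr would leave the stack pointer at 0x20000FFC, which is word aligned but not two-word aligned; pushing {r4, lr} leaves it at 0x20000FF8, which is why a second register gets pushed along.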

But we see that the x and y math is optimized to a hardcoded value of 2 being passed to the dummy function (dummy was specifically coded in a separate file, in this case asm, so that the compiler wouldnt optimize the function call out completely, if I had a dummy function that simply returned in C in notmain.c the optimizer would have removed the x, y, and dummy function call because they are all dead/useless code).

Also note that because the flash.s code got larger, notmain is elsewhere, and the toolchain has taken care of patching up all the addresses for us so we dont have to do that manually.
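You can check that patching by hand. The sketch below decodes the two halfwords of a thumb bl as I read the ARMv6-M encoding (the S, J1, J2, imm10 and imm11 fields are from the ARM architecture manual); the function name thumb_bl_target is made up, and the code assumes the halfwords really are a bl.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the branch target of a 32 bit thumb bl located at addr.
   hw1 and hw2 are the first and second halfwords as shown in the
   disassembly, for example f000 f802. */
uint32_t thumb_bl_target(uint32_t addr, uint16_t hw1, uint16_t hw2)
{
    assert((hw1 & 0xF800u) == 0xF000u);   /* first halfword of bl */
    assert((hw2 & 0xD000u) == 0xD000u);   /* second halfword of bl */

    uint32_t s     = (hw1 >> 10) & 1u;
    uint32_t imm10 = hw1 & 0x3FFu;
    uint32_t j1    = (hw2 >> 13) & 1u;
    uint32_t j2    = (hw2 >> 11) & 1u;
    uint32_t imm11 = hw2 & 0x7FFu;
    uint32_t i1    = (~(j1 ^ s)) & 1u;
    uint32_t i2    = (~(j2 ^ s)) & 1u;

    /* offset = SignExtend(S:I1:I2:imm10:imm11:0) */
    uint32_t imm = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1);
    if (s)
        imm |= 0xFE000000u;               /* sign extend from bit 24 */

    /* the pc reads as the address of the instruction plus 4 */
    return addr + 4u + imm;
}
```

Before dummy was added, the bl at 0x14 was f000 f802 and decodes to 0x1c; afterwards it is f000 f804 and decodes to 0x20, the new address of notmain.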

unoptimized clang for reference

00000020 <notmain>:
  20:   b580        push    {r7, lr}
  22:   af00        add r7, sp, #0
  24:   b082        sub sp, #8
  26:   2001        movs    r0, #1
  28:   9001        str r0, [sp, #4]
  2a:   2002        movs    r0, #2
  2c:   9000        str r0, [sp, #0]
  2e:   f7ff fff5   bl  1c <dummy>
  32:   2000        movs    r0, #0
  34:   b002        add sp, #8
  36:   bd80        pop {r7, pc}

optimized clang

00000020 <notmain>:
  20:   b580        push    {r7, lr}
  22:   af00        add r7, sp, #0
  24:   2002        movs    r0, #2
  26:   f7ff fff9   bl  1c <dummy>
  2a:   2000        movs    r0, #0
  2c:   bd80        pop {r7, pc}

that compiler author chose to use r7 as the dummy register to align the stack, also it is creating a frame pointer using r7 even though it doesnt have anything in the stack frame. basically that instruction could have been optimized out. but it used the pop to return, not three instructions; that was probably on me, I bet I could get gcc to do that with the right command line options (specifying the processor).

this should mostly answer the rest of your questions

void dummy ( unsigned int );
unsigned int x=1;
unsigned int y;
int notmain ( void )
{
    y = x + 1;
    dummy(y);
    return(0);
}

I have globals now. so they go in either .data or .bss if they dont get optimized out.

before we look at the final output lets look at the intermediate object

00000000 <notmain>:
   0:   b510        push    {r4, lr}
   2:   4b05        ldr r3, [pc, #20]   ; (18 <notmain+0x18>)
   4:   6818        ldr r0, [r3, #0]
   6:   4b05        ldr r3, [pc, #20]   ; (1c <notmain+0x1c>)
   8:   3001        adds    r0, #1
   a:   6018        str r0, [r3, #0]
   c:   f7ff fffe   bl  0 <dummy>
  10:   2000        movs    r0, #0
  12:   bc10        pop {r4}
  14:   bc02        pop {r1}
  16:   4708        bx  r1
    ...

Disassembly of section .data:
00000000 <x>:
   0:   00000001    andeq   r0, r0, r1

now there is info missing from this but it gives an idea of what is going on. the linker is the one that takes objects and links them together with information provided to it (in this case flash.ld) that tells it where .text and .data and such go. the compiler does not know such things, it can only focus on the code it is presented; for any external it has to leave a hole for the linker to fill in the connection. For any data it has to leave a way to link those things together, so the addresses for everything are zero based here simply because the compiler and this disassembler dont know. there is other info not shown here that the linker uses to place things. the code here is position independent enough so the linker can do its job.
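Filling in those holes can be pictured like this. It is a heavily simplified model and not the real ELF relocation format: struct reloc and apply_relocs are made-up names, and the only kind of hole modeled is a 32 bit word that must end up holding the address of a symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One hole the compiler left: the word at word_index must end up
   holding the final address of some symbol. */
struct reloc {
    size_t   word_index;    /* which word of the section to patch */
    uint32_t symbol_addr;   /* address the linker assigned to the symbol */
};

/* Patch every hole. Whatever the compiler left in the word (zero here)
   is treated as an offset added to the symbol address. */
void apply_relocs(uint32_t *section, size_t words,
                  const struct reloc *relocs, size_t n)
{
    assert(section != NULL && relocs != NULL);
    for (size_t i = 0; i < n; i++) {
        assert(relocs[i].word_index < words);
        section[relocs[i].word_index] += relocs[i].symbol_addr;
    }
}
```

In the object file the two words after the code in notmain are zero; once the linker has decided where x and y live, those are the words it fills in with their ram addresses.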

we then see at least a disassembly of the linked output

00000020 <notmain>:
  20:   b510        push    {r4, lr}
  22:   4b05        ldr r3, [pc, #20]   ; (38 <notmain+0x18>)
  24:   6818        ldr r0, [r3, #0]
  26:   4b05        ldr r3, [pc, #20]   ; (3c <notmain+0x1c>)
  28:   3001        adds    r0, #1
  2a:   6018        str r0, [r3, #0]
  2c:   f7ff fff6   bl  1c <dummy>
  30:   2000        movs    r0, #0
  32:   bc10        pop {r4}
  34:   bc02        pop {r1}
  36:   4708        bx  r1
  38:   20000004    andcs   r0, r0, r4
  3c:   20000000    andcs   r0, r0, r0

Disassembly of section .bss:

20000000 <y>:
20000000:   00000000    andeq   r0, r0, r0

Disassembly of section .data:

20000004 <x>:
20000004:   00000001    andeq   r0, r0, r1

the compiler has basically asked for two 32 bit variables in ram. One is in .bss because I didnt initialize it so it is assumed to init as zero. the other is .data because I did initialize it on declaration.

Now because these are global variables it is assumed that other functions can modify them. the compiler makes no assumptions as to when notmain can be called so it cannot optimize the y = x + 1 math with what it can see, so it has to do that at runtime. It has to read x from ram, add one, and save the result back to y.

Now clearly this code wont work. Why? because my bootstrap as shown here does not prepare the ram before calling notmain, so whatever garbage was in 0x20000000 and 0x20000004 when the chip woke up is what will be used for y and x.

Not going to show that here. you can read my even longer winded rambling on .data and .bss and why I dont ever need them in my bare metal code, but if you feel you have to and want to master the tools rather than hoping someone else did it right...

https://github.com/dwelch67/raspberrypi/tree/master/bssdata

linker scripts, and bootstraps are somewhat compiler specific so everything you learn about one version of one compiler could get tossed on the next version or with some other compiler, yet another reason why I dont put a ton of effort into .data and .bss preparation just to be this lazy:

unsigned int x=1;

I would much rather do this

unsigned int x;
...
x = 1;

and let the compiler put it in .text for me. Sometimes it saves flash that way sometimes it burns more. It is most definitely much easier to program and port from toolchain version or one compiler to another. Much more reliable, less error prone. Yep, does not conform to the C standard.

now what if we make these static globals?

void dummy ( unsigned int );
static unsigned int x=1;
static unsigned int y;
int notmain ( void )
{
    y = x + 1;
    dummy(y);
    return(0);
}

well

00000020 <notmain>:
  20:   b510        push    {r4, lr}
  22:   2002        movs    r0, #2
  24:   f7ff fffa   bl  1c <dummy>
  28:   2000        movs    r0, #0
  2a:   bc10        pop {r4}
  2c:   bc02        pop {r1}
  2e:   4708        bx  r1

obviously those variables cannot be modified by other code, so the compiler can now at compile time optimize out the dead code, like it did before.

unoptimized

00000020 <notmain>:
  20:   b580        push    {r7, lr}
  22:   af00        add r7, sp, #0
  24:   4804        ldr r0, [pc, #16]   ; (38 <notmain+0x18>)
  26:   6800        ldr r0, [r0, #0]
  28:   1c40        adds    r0, r0, #1
  2a:   4904        ldr r1, [pc, #16]   ; (3c <notmain+0x1c>)
  2c:   6008        str r0, [r1, #0]
  2e:   f7ff fff5   bl  1c <dummy>
  32:   2000        movs    r0, #0
  34:   bd80        pop {r7, pc}
  36:   46c0        nop         ; (mov r8, r8)
  38:   20000004    andcs   r0, r0, r4
  3c:   20000000    andcs   r0, r0, r0

this compiler which used the stack for locals, now uses ram for globals and this code as written is broken because I didnt handle .data nor .bss properly.

and one last thing that we cant see in the disassembly.

:1000000000100020150000001B0000001B00000075
:100010001B00000000F004F8FFE7FEE77047000057
:1000200080B500AF04480068401C04490860FFF731
:10003000F5FF002080BDC046040000200000002025
:08004000E0FFFF7F010000005A
:0400480078563412A0
:00000001FF

I changed x to be pre-init with 0x12345678. My linker script (this is for gnu ld) has this ted at bob thing. That tells the linker I want the final place to be in the ted address space, but store it in the binary in the bob address space and someone will move it for you. And we can see that happened. This is intel hex format, and we can see the 0x12345678

:0400480078563412A0

is in the flash address space of the binary.
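If you want to read those records without squinting, a small parser helps. This is a sketch that assumes the common record layout (byte count, 16 bit address, record type, data, checksum, with all bytes summing to zero modulo 256); struct hex_record and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct hex_record {
    uint8_t  count;       /* number of data bytes */
    uint16_t address;     /* 16 bit address field */
    uint8_t  type;        /* 0 is data, 1 is end of file */
    uint8_t  data[255];
};

/* Value of one hex digit, or -1 if c is not a hex digit. */
int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Parse one record. Returns 0 on success, -1 on bad syntax or checksum. */
int parse_hex_record(const char *line, struct hex_record *rec)
{
    uint8_t bytes[260];
    size_t n = 0;

    assert(line != NULL && rec != NULL);
    if (line[0] != ':')
        return -1;
    for (const char *p = line + 1; *p != '\0' && *p != '\r' && *p != '\n'; p += 2) {
        int hi = hex_nibble(p[0]);
        int lo = hex_nibble(p[1]);   /* a trailing odd digit fails here */
        if (hi < 0 || lo < 0 || n >= sizeof bytes)
            return -1;
        bytes[n++] = (uint8_t)((hi << 4) | lo);
    }
    if (n < 5 || n != (size_t)bytes[0] + 5u)
        return -1;

    uint8_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (uint8_t)(sum + bytes[i]);
    if (sum != 0)
        return -1;

    rec->count = bytes[0];
    rec->address = (uint16_t)((bytes[1] << 8) | bytes[2]);
    rec->type = bytes[3];
    for (size_t i = 0; i < rec->count; i++)
        rec->data[i] = bytes[4 + i];
    return 0;
}

/* Assemble a little endian 32 bit word from four data bytes. */
uint32_t le32(const uint8_t *b)
{
    assert(b != NULL);
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

Parsing the record above gives four data bytes at address 0x0048, and le32 on those bytes gives 0x12345678, so the initial value of x is stored in flash right after the code.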

readelf also shows this

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  EXIDX          0x010040 0x00000040 0x00000040 0x00008 0x00008 R   0x4
  LOAD           0x010000 0x00000000 0x00000000 0x00048 0x00048 R E 0x10000
  LOAD           0x020004 0x20000004 0x00000048 0x00004 0x00004 RW  0x10000
  LOAD           0x030000 0x20000000 0x20000000 0x00000 0x00004 RW  0x10000
  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RWE 0x10

the LOAD line where the virtual address is 0x20000004 and the physical is 0x48


This answer is going to focus more on the boot process. First, a correction -- writes to flash are done after the MCU (or at least part of it) has already started up. On some MCUs (usually the more advanced ones), the CPU itself might operate the serial ports and write to the flash registers. So writing and executing the program are different processes. I'm going to assume that the program has already been written to flash.

Here's the basic boot process. I'll name some common variations, but mostly I'm keeping this simple.

  1. Reset: There are two basic types. The first is a power-on reset, which is internally generated while the supply voltages are ramping up. The second is an external pin toggle. Regardless, the reset forces all of the flip-flops in the MCU to a predetermined state.

  2. Extra hardware initialization: Extra time and/or clock cycles may be needed before the CPU starts running. For example, in the TI MCUs I work on, there's an internal configuration scan chain that gets loaded.

  3. CPU boot: The CPU fetches its first instruction from a special address called the reset vector. This address is determined when the CPU is designed. From there, it's just normal program execution.

    The CPU repeats three basic steps over and over:

    • Fetch: Read an instruction (8-, 16-, or 32-bit value) from the address stored in the program counter (PC) register, then increment the PC.
    • Decode: Convert the binary instruction into a set of values for the CPU's internal control signals.
    • Execute: Carry out the instruction -- add two registers, read from or write to memory, branch (change the PC), or whatever.

    (It's actually more complicated than this. CPUs are usually pipelined, which means they can be doing each of the above steps on different instructions at the same time. Each of the above steps may have multiple pipeline stages. Then there's parallel pipelines, branch prediction, and all the fancy computer architecture stuff that makes those Intel CPUs take a billion transistors to design.)

    You might be wondering how the fetch works. The CPU has a bus consisting of address (out) and data (in/out) signals. To do a fetch, the CPU sets its address lines to the value in the program counter, then sends a clock over the bus. The address is decoded to enable a memory. The memory receives the clock and address, and puts the value at that address on the data lines. The CPU receives this value. Data reads and writes are similar, except the address comes from the instruction or a value in a general-purpose register, not the PC.

    CPUs with a von Neumann architecture have a single bus that's used for both instructions and data. CPUs with a Harvard architecture have one bus for instructions and one for data. In real MCUs, both of these buses may be connected to the same memories, so it's often (but not always) something you don't have to worry about.

    Back to the boot process. After reset, the PC is loaded with a starting value called the reset vector. This can be built into the hardware, or (in ARM Cortex-M CPUs) it can be read out of memory automatically. The CPU fetches the instruction from the reset vector and starts looping through the steps above. At this point, the CPU is executing normally.

  4. Boot loader: There's often some low-level setup that needs to be done to make the rest of the MCU operational. This can include things like clearing RAMs and loading manufacturing trim settings for analog components. There may also be an option to load code from an external source such as a serial port or external memory. The MCU may include a boot ROM that contains a small program to do these things. In this case, the CPU reset vector points to the boot ROM's address space. This is basically normal code, it's just provided by the manufacturer so you don't have to write it yourself. :-) In a PC, the BIOS is the equivalent of the boot ROM.

  5. C environment setup: C expects to have a stack (RAM area for storing state during function calls) and initialized memory locations for global variables. These are the .stack, .data, and .bss sections that Dwelch is talking about. Initialized global variables have their initialization values copied from flash to RAM at this step. Uninitialized global variables have RAM addresses that are close together, so the whole block of memory can be initialized to zero very easily. The stack doesn't need to be initialized (although it can be) -- all you really need to do is set the CPU's stack pointer register so it points to an assigned region in RAM.

  6. Main function: Once the C environment is set up, the C loader calls the main() function. That's where your application code normally begins. If you want, you can leave out the standard library, skip the C environment setup, and write your own code to call main(). Some MCUs might let you write your own boot loader, and then you can do all the low-level setup on your own.
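The fetch, decode and execute loop from step 3 can be sketched with a toy CPU. Nothing here corresponds to a real instruction set: the opcodes, the two-byte instruction format, the 256-byte memory and the reset vector location at 0xFE are all invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented opcodes. Every instruction is an opcode byte plus an operand byte. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADDI = 2, OP_JMP = 3 };

#define RESET_VECTOR 0xFEu   /* invented: this byte holds the start address */

struct toy_cpu {
    uint8_t pc;       /* program counter */
    uint8_t acc;      /* single accumulator register */
    int     halted;
};

/* Reset: load the PC from the reset vector. */
void toy_reset(struct toy_cpu *cpu, const uint8_t *mem)
{
    assert(cpu != NULL && mem != NULL);
    cpu->pc = mem[RESET_VECTOR];
    cpu->acc = 0;
    cpu->halted = 0;
}

/* One trip through fetch, decode, execute. mem must hold 256 bytes. */
void toy_step(struct toy_cpu *cpu, const uint8_t *mem)
{
    assert(cpu != NULL && mem != NULL);

    /* Fetch: read the instruction at the PC, then increment the PC. */
    uint8_t opcode  = mem[cpu->pc];
    uint8_t operand = mem[(uint8_t)(cpu->pc + 1u)];
    cpu->pc = (uint8_t)(cpu->pc + 2u);

    /* Decode and execute. */
    switch (opcode) {
    case OP_LOADI: cpu->acc = operand; break;
    case OP_ADDI:  cpu->acc = (uint8_t)(cpu->acc + operand); break;
    case OP_JMP:   cpu->pc = operand; break;   /* a branch just changes the PC */
    default:       cpu->halted = 1; break;
    }
}

/* Reset, then step until halted or max_steps is reached. Returns steps taken. */
int toy_run(struct toy_cpu *cpu, const uint8_t *mem, int max_steps)
{
    int steps = 0;
    toy_reset(cpu, mem);
    while (!cpu->halted && steps < max_steps) {
        toy_step(cpu, mem);
        steps++;
    }
    return steps;
}
```

A real CPU does the same thing in hardware, only pipelined and with a much larger decode step.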

Miscellaneous stuff: Many MCUs will let you execute code out of RAM for better performance. This is usually set up in the linker configuration. The linker assigns two addresses to every function -- a load address, which is where the code is first stored (typically flash), and a run address, which is the address loaded into the PC to execute the function (flash or RAM). To execute code out of RAM, you write code to make the CPU copy the function code from its load address in flash to its run address in RAM, then call the function at the run address. The linker can define global variables to help with this. But executing code out of RAM is optional in MCUs. You'd normally only do it if you really need high performance or if you want to rewrite the flash.