How to lay out B-Tree data on disk?

UPDATE(archived version of oracle index internals): http://web.archive.org/web/20161221112438/http://www.toadworld.com/platforms/oracle/w/wiki/11001.oracle-b-tree-index-from-the-concept-to-internals


OLD (the original link does not exist anymore): some info about oracle index internals: http://www.toadworld.com/platforms/oracle/w/wiki/11001.oracle-b-tree-index-from-the-concept-to-internals

Notes:

Databases do not directly implement indexes based on B-tree but on a variant called B+ tree. Which according to wikipedia:

A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves.

Databases work, in general, with block-oriented storage and b+ tree is more suited then a b-tree for this.

The blocks are fixed size and are left with some free space to accommodate future changes in value or key size.

A block can be either a leaf(holds actual data) or branch(holds the pointers to the leaf nodes)

A toy model how writing to disk can be implemented (for a block size 10k for arithmetic simplification):

  1. a file of 10G is created on disk(it has 1000 of blocks)
  2. first block is assigned as root and the next free one as a leaf and a list of leaf addresses is put in root
  3. new data inserted, the current leaf node is filled with values until a threshold is reached
  4. data continue to be inserted, the next free ones are allocated as leaf blocks and the list of leaf nodes is updated
    1. after many inserts, the current root node needs children, so the next free block is allocated as branch node, it copies the list from root and now the root will maintains only a list of intermediary nodes.
    2. if node block needs to be split, the next free block is allocated as branch node, added into root list, and list of leaf nodes is split between initial and new branch node

When the information is read from a big index: can go following:

  1. read first/root block (seek(0), read(10k)) which points to the a child which is located in block 900
  2. read block 900 (seek(900*10k), read(10K)) which points to a child which located in block 5000
  3. read block 5000 (seek(5000*10k), read(10K)) which points to the leaf node located in block 190
  4. read block 190 (seek(190*10k), read(10K)) and extract the interested value from it

a really large index can be split on multiple files, then the address of block will be something as (filename_id, address_relative_to_this_file)