As part of continuing the work on my AArch64 kernel, I’ve reached the point where it makes sense to introduce a minimal loader. Until now, I’ve been running the kernel directly, but going forward, I want a more realistic setup — one where a small bootloader is responsible for loading the kernel image, passing the device tree blob (DTB), and jumping into the kernel’s entry point.
The goal for this post is to build that loader. It won’t do much beyond mapping the ELF file, forwarding the DTB, and transferring control — just enough to set the stage for the next step: parsing the DTB from within the kernel. This approach mirrors how a real boot process would look, and gives the kernel a more defined interface to start from.
What is a Bootloader?
A bootloader is a small program that runs when a device powers on, performing initial hardware setup and loading the operating system (OS) or firmware into memory. It initializes hardware components, such as memory, and then hands control over to the main OS or application.
Put simply: a bootloader is the little piece of software that gets everything ready so the kernel can run. It loads the kernel into the right place, passes along any needed information, and then steps aside. Typical tasks a bootloader might perform include:
- Print startup messages.
- Check presence of PCI.
- Get memory map from BIOS.
- Enable graphics mode.
My Early Mental Model
Before going into technical details, I want to share what I initially thought a bootloader was — and what I believed it should do.
My understanding of the bootloader formed while planning the next steps of this project. As far as I knew, a bootloader was responsible for:
- Testing peripherals.
- Initializing the RAM.
- Initializing basic devices (UART, PCIe, USB, etc.).
- Loading the kernel.
Looking back, this mental model isn’t entirely wrong — but it’s incomplete. My biggest uncertainty was around hardware initialization. If the bootloader can initialize peripherals, why would the kernel need to do it again?
The short answer is: the kernel must have its own view of the hardware and cannot rely on the bootloader’s configuration. It needs the original hardware “blueprint” to re-initialize devices with its own drivers and manage them throughout the system’s lifetime.
Expanding on that, the kernel re-initializes hardware because:
- Bugs & Incompleteness: The bootloader’s drivers (e.g., UART) might be buggy or configured in ways unsuitable for a full OS.
- Ownership: The kernel must have exclusive control over each device, which means resetting and configuring it from scratch.
- Full Hardware Picture: The kernel needs to know everything about the hardware — timers, interrupt controllers, power units, and so on — to properly manage the system.
Kernel Responsibilities
Once the bootloader hands off control, the kernel takes over and performs a very different set of tasks:
- Take Control and Re-initialize: It reads the information passed by the bootloader and immediately reconfigures hardware using its own drivers.
- Interrupts: The kernel sets up its own interrupt vector table and enables interrupts once it’s ready to handle them.
- Virtual Memory: Unlike the bootloader, which runs entirely in physical memory, the kernel sets up page tables to enable virtual memory and protection.
- Core Subsystems: It starts the scheduler, memory manager, and device drivers (timers, storage, etc.).
- User Space: Once the core is ready, it launches the first user-space process (often called init), which starts the rest of the system.
- Ongoing Management: From then on, the kernel handles all system calls, interrupts, and resource management until shutdown.
Understanding the Device Tree
After talking about my early mental model and understanding that the kernel needs a complete view of the hardware, the next question naturally arises: how does the kernel actually get that information? This is where the device tree comes in. Instead of hardcoding details about every possible piece of hardware, a bootloader can provide the kernel with a structured description of the system. This allows the kernel to discover and configure devices in a uniform way, without relying on assumptions about what the bootloader has already set up.
A device tree is a data structure that describes the hardware layout of a system. It uses a tree-like format made up of nodes, each representing a device (or part of a device), along with properties that define its characteristics.
The device tree is especially useful for describing hardware that cannot be discovered automatically — such as memory-mapped peripherals or on-chip controllers. Each node has one parent (except for the root node), forming a hierarchical view of the system.
A “device” in this context might be:
- A physical hardware component, such as a UART.
- A subcomponent within a larger device, like a random-number generator inside a TPM.
- A virtual or emulated device, such as a communication interface to a peripheral on another CPU.
- A function implemented by firmware or a higher-privilege execution level.
Device tree descriptions are meant to be OS-agnostic. They should describe hardware as it exists, not how a specific kernel or project uses it. When the system boots, the bootloader loads the device tree into memory and passes a pointer to it to the kernel. The kernel then parses this data structure to learn what hardware is available and how to initialize it.
Device Tree Blob
The Device Tree Blob (DTB) is the binary, compiled version of the device tree source. It’s what the bootloader actually loads into memory and passes to the kernel at boot time.
The DTB allows platform-specific hardware information to be separated from the kernel’s source code. Instead of embedding hardware descriptions directly into the kernel, the kernel can remain generic — it just parses the DTB and configures the system accordingly.
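To make this a bit more concrete, every DTB starts with a small fixed header. Below is a rough Rust sketch of that header, with field names taken from the devicetree specification (each field is a big-endian u32, so a little-endian AArch64 kernel has to byte-swap them); this is only an illustration of what the kernel will eventually parse, not code from the repository:
#[repr(C)]
struct FdtHeader {
    magic: u32,             // 0xd00dfeed, stored big-endian
    totalsize: u32,         // Total size of the DTB in bytes
    off_dt_struct: u32,     // Offset of the structure block (nodes and properties)
    off_dt_strings: u32,    // Offset of the strings block (property names)
    off_mem_rsvmap: u32,    // Offset of the memory reservation block
    version: u32,
    last_comp_version: u32,
    boot_cpuid_phys: u32,
    size_dt_strings: u32,
    size_dt_struct: u32,
}

// Sanity-check the pointer handed over at boot: a valid DTB starts with the magic value.
fn looks_like_dtb(addr: usize) -> bool {
    let header = unsafe { &*(addr as *const FdtHeader) };
    u32::from_be(header.magic) == 0xd00d_feed
}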
Building the Bootloader
Before getting into the core logic, the build system needs a small adjustment so both the bootloader and kernel can be combined into a single image. The bootloader is built as a raw .bin file, since QEMU changes its boot behavior depending on the type of image provided:
- For guests using the Linux kernel boot protocol (any non-ELF file passed to the -kernel option), the address of the Device Tree Blob (DTB) is passed in a register — r2 for 32-bit guests or x0 for 64-bit guests.
- For guests booting as “bare-metal” (any other kind of boot), the DTB is placed at the start of RAM (0x4000_0000).
Makefile Changes
For now, I’m going with a monolithic build — a single binary that includes both the bootloader and the kernel, linked one after another. This approach keeps the boot process simple and avoids having to deal with storage or file loading yet. It’s also a convenient setup under QEMU, where I can easily control where each part of the image is loaded in memory.
To support this setup, the Makefile gained a few new pieces. It now detects whether the bootloader submodule is present and, if so, builds it automatically before linking everything together. The resulting binary (bootloader.bin) is then concatenated with the kernel ELF to produce a single file, combined.bin.
ifeq ($(BOOTLOADER_EXISTS),yes)
# Create combined blob: bootloader.bin + kernel.elf (with alignment for safe struct access)
# Align to 4096 bytes (page size) - reasonable tradeoff between space and alignment
KERNEL_ALIGN := 4096
$(COMBINED_BLOB): $(BOOTLOADER_BIN) $(KERNEL_ELF)
@echo "Creating combined blob: bootloader + kernel (aligned for struct safety)..."
@echo " Bootloader: $(BOOTLOADER_BIN) (loaded at 0x40080000 by QEMU)"
@cp $(BOOTLOADER_BIN) $(COMBINED_BLOB)
# Pad to next $(KERNEL_ALIGN)-byte boundary
@truncate -s %$(KERNEL_ALIGN) $(COMBINED_BLOB)
@KERNEL_OFFSET=$$(stat -c%s $(COMBINED_BLOB)); \
KERNEL_ADDR=$$(printf "0x%x" $$((0x40080000 + KERNEL_OFFSET))); \
echo " Bootloader padded to: 0x$$(printf %x $$KERNEL_OFFSET) (aligned to $(KERNEL_ALIGN) bytes)"; \
echo " Kernel ELF: $(KERNEL_ELF) at offset 0x$$(printf %x $$KERNEL_OFFSET) (runtime addr: $$KERNEL_ADDR)"
@cat $(KERNEL_ELF) >> $(COMBINED_BLOB)
@echo -n " Combined blob size: "
@ls -lh $(COMBINED_BLOB) | awk '{print $$5}'
@echo "Blob created successfully!"
# Build blob (depends on bootloader and kernel)
blob: $(COMBINED_BLOB)
@echo "Blob build complete"
# Run the combined blob
run-blob: $(COMBINED_BLOB) $(DTB_FILE)
@echo "Running combined blob (bootloader will load kernel)..."
$(QEMU) -machine virt,gic-version=3,virtualization=on -cpu cortex-a57 -serial stdio \
-kernel $(COMBINED_BLOB) -dtb $(DTB_FILE) -m 1G
If the bootloader is not present, the kernel is built as a single object, as in previous posts. The bootloader source lives in a separate repository; this kernel repository just pulls in the resulting binary and links it as part of the build.
Additionally, I have added a new QEMU flag: virtualization=on. With it, the bootloader is started at EL2 and can then drop to EL1, simulating a real boot sequence. If virtualization is not enabled, the bootloader is also prepared to do its job at EL1 and then jump to the kernel.
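As a quick sketch of how the bootloader can tell which case it is in, the current exception level can be read from the CurrentEL system register, which encodes the EL in bits [3:2]. The helper below is illustrative rather than the exact code in the repository:
// Returns 1, 2, or 3 depending on the exception level we were entered at.
fn current_el() -> u64 {
    let el: u64;
    unsafe { core::arch::asm!("mrs {}, CurrentEL", out(reg) el) };
    (el >> 2) & 0b11
}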
Linking and Memory Layout
Since the bootloader now takes control first, the kernel can’t assume it starts at the same physical address as before. The linker script (linker.ld) was updated to reflect that. The main change is the load address, which now matches the region where the bootloader places the kernel in memory (for instance, 0x50000000).
ENTRY(_start)
MEMORY {
/* Kernel loaded at 0x50000000 by bootloader (from ELF parsing) */
RAM : ORIGIN = 0x50000000, LENGTH = 16M
}
SECTIONS {
. = ORIGIN(RAM);
__kernel_start = .;
.text :
{
__text_start = .;
*(.text.boot)
*(.text.*)
__text_end = .;
} > RAM
__data_start = .;
.rodata : ALIGN(4K)
{
__rodata_start = .;
*(.rodata.*)
__rodata_end = .;
} > RAM
.data : ALIGN(4K)
{
*(.data .data.*)
} > RAM
.bss : ALIGN(4K)
{
_bss_start = .;
*(.bss .bss.*)
*(COMMON)
_bss_end = .;
} > RAM
__data_end = .;
__kernel_end = .;
. = . + 64K;
. = ALIGN(64K);
stack_top = .;
If we now take a look at the bootloader’s linker script, it has a similar structure; only the load address changes:
MEMORY {
/* Bootloader loaded at 0x40080000. Qemu will jump there when loading a .bin */
RAM : ORIGIN = 0x40080000, LENGTH = 1M
}
First Stage
Following Linux Kernel convention, the bootloader’s first assembly file is named head.S – representing the ‘head’ of execution. This file contains the _start symbol where QEMU transfers control after loading the bootloader. In this first stage, the bootloader will do a set of basic tasks we have covered in previous posts:
- Mask some interrupts.
- Set up a stack.
- Set up exception vectors.
- Set up the UART for early prints.
The idea here is simply to establish a minimal and predictable execution environment so the next stage can focus on loading and preparing the kernel.
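As a rough idea of the kind of early output this stage enables (and which the Rust code later in this post relies on as uart::println), a minimal write-only routine could look like the sketch below. It assumes QEMU’s virt machine, where the PL011 UART data register sits at 0x0900_0000, and ignores FIFO status since early boot output is tiny:
// Minimal early-print sketch for QEMU's virt machine (assumed PL011 at 0x0900_0000).
const UART0_DR: *mut u32 = 0x0900_0000 as *mut u32;

pub fn println(msg: &[u8]) {
    for &byte in msg {
        unsafe { core::ptr::write_volatile(UART0_DR, byte as u32) };
    }
    unsafe { core::ptr::write_volatile(UART0_DR, b'\r' as u32) };
    unsafe { core::ptr::write_volatile(UART0_DR, b'\n' as u32) };
}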
Second Stage
Once the minimal runtime environment is ready, the bootloader moves into its second stage: loading the kernel. This step involves parsing the kernel’s binary image, which is stored in ELF (Executable and Linkable Format) format, and placing its contents into the correct physical memory locations. Rather than treating the kernel as a simple flat binary, using ELF provides a structured and portable way to describe exactly what needs to be loaded and where.
Before diving into the actual parsing, it’s worth noting that the Rust ecosystem already provides crates like elf for handling ELF files in a higher-level way. For this walkthrough, however, I’ll be implementing a minimal loader from scratch to understand the fundamentals. Later on, I plan to switch to the elf crate to simplify the implementation once the fundamentals are clear.
First, let’s understand the file format we’re about to parse.
A Quick Look at the ELF Format
Before we can load our kernel, we need to understand the structure of the binary we’re working with. The ELF format is used widely across Unix-like systems and was designed to be flexible for linkers and debuggers, yet efficient enough for a boot-time loader.
For this project, we only care about one type: ET_EXEC (an executable file). This is a statically linked ELF where all code and data have already been placed at their final physical addresses by the linker.
An ELF file is essentially a container describing:
- What code and data exist.
- How they should be laid out in memory.
- Where execution should begin.
The very first structure in the file is the ELF header, located at offset 0. This header tells us what kind of binary we’re dealing with — whether it’s relocatable, dynamically linked, or a standalone executable; whether it targets a 32- or 64-bit architecture; and how the rest of the file is organized. It also includes global information such as byte ordering, ABI identifiers, and offsets to other important tables within the file. For our purposes, only a handful of fields truly matter. Once we verify that the file is a 64-bit, little-endian ELF executable for Arm64, we care primarily about:
- e_entry: The virtual address where execution should begin.
- e_phoff: The byte offset from the start of the file to the Program Header Table.
- e_phnum: How many entries are in the Program Header Table.
- e_phentsize: The size (in bytes) of each individual program header entry.
Together, these fields are the only global metadata the loader truly needs to locate and interpret the program headers.
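For reference, the whole 64-bit ELF header fits in a small #[repr(C)] struct. The sketch below follows the field order from the ELF specification and is roughly what the Elf64Ehdr type used later in the loader looks like (the actual definition in the repository may differ slightly):
#[repr(C)]
struct Elf64Ehdr {
    e_ident: [u8; 16],   // Magic, class, endianness, OS ABI, ...
    e_type: u16,         // ET_EXEC for our statically linked kernel
    e_machine: u16,      // EM_AARCH64
    e_version: u32,
    e_entry: u64,        // Entry point address
    e_phoff: u64,        // Offset of the Program Header Table
    e_shoff: u64,        // Offset of the Section Header Table (unused here)
    e_flags: u32,
    e_ehsize: u16,
    e_phentsize: u16,    // Size of one program header entry
    e_phnum: u16,        // Number of program header entries
    e_shentsize: u16,
    e_shnum: u16,
    e_shstrndx: u16,
}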
Following the offset in e_phoff leads us to the Program Header Table. This is the single most important structure for a loader. Unlike the Section Header Table (which we’ll get to in a moment), program headers describe the runtime view of the file. They tell the operating system (or in our case, the bootloader) how to create a process image in memory. Each entry describes a “segment.”
We only care about entries with the type PT_LOAD. These are the segments that actually need to be loaded into memory. For each PT_LOAD segment, we look at the following fields:
- p_offset: Where the segment’s data is stored inside the file (relative to the file start).
- p_vaddr: The physical memory address where the segment should be copied.
- p_filesz: How many bytes to copy from the file.
- p_memsz: How many bytes the segment should occupy in memory.
If p_memsz is larger than p_filesz, it means the segment has a .bss section. This is uninitialized data (like static variables) that the kernel expects to be zero. Our bootloader must copy the p_filesz bytes and then “zero out” the remaining (p_memsz - p_filesz) bytes.
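The program header entry itself is equally compact. Here is a sketch of the Elf64Phdr struct the loader works with, again following the layout from the ELF specification:
#[repr(C)]
struct Elf64Phdr {
    p_type: u32,    // PT_LOAD is the only type we act on
    p_flags: u32,   // Read/write/execute permissions (ignored by this loader)
    p_offset: u64,  // Offset of the segment data within the file
    p_vaddr: u64,   // Address the segment should be copied to
    p_paddr: u64,   // Physical address (same as p_vaddr for our kernel)
    p_filesz: u64,  // Bytes to copy from the file
    p_memsz: u64,   // Bytes the segment occupies in memory (>= p_filesz)
    p_align: u64,
}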
You might also be familiar with the Section Header Table, which describes logical groupings like .text, .rodata, .data, and .bss. These are crucial for linkers and debuggers.
The Program Header Table already merges sections into runtime segments. This means the linker did the heavy lifting — the bootloader only needs to obey PT_LOAD entries. We can safely (and correctly) ignore section headers entirely.
Implementing the ELF Loader
Now that we know what to look for, let’s walk through the implementation. The process has two parts:
- Assembly: Find the memory address of the kernel ELF (which is linked right after our bootloader) and pass it to our Rust function.
- Rust: Parse the ELF data at that address and copy the segments.
The process begins in our assembly startup code, right after we’ve set up the stack. We need to find the address where the kernel ELF binary is loaded in memory. We’ve linked it to be placed immediately after the bootloader itself, but we need to ensure it’s on a proper page-aligned boundary.
This AArch64 assembly calculates the first KERNEL_ALIGN boundary after the __bootloader_end symbol. It stores the result in the x0 register, which is the first argument register in the Arm64 calling convention, and then calls our Rust function.
/* Calculate kernel ELF address. Kernel starts immediately after
* bootloader binary
*/
adr x1, __bootloader_end
adr x2, __bootloader_start
sub x1, x1, x2
add x0, x1, x2
mov x1, #KERNEL_ALIGN
add x0, x0, x1 /* Start + one alignment unit */
neg x1, x1
and x0, x0, x1 /* Round up to aligned boundary */
bl load_kernel
The bl load_kernel instruction jumps to our Rust code, passing the aligned kernel base address in x0. The Rust function will do its work and, if successful, return the kernel’s entry point address (which we will then jump to).
The load_kernel function is the entry point called from assembly. It wraps the main logic, load_elf, which performs the steps we just discussed in the theory section.
#[unsafe(no_mangle)]
pub extern "C" fn load_kernel(elf_base: usize) -> usize {
return load_elf(elf_base);
}
The first thing our load_elf function does is call check_elf_header. This is a critical sanity check. It peers into the first few bytes of the file and confirms that we’re dealing with a binary we can actually load. It checks the magic number (to ensure it’s an ELF file), and then verifies it’s a 64-bit, little-endian executable built for the AArch64 architecture.
fn check_elf_header(header: &Elf64Ehdr) -> bool {
// Validate Magic
if header.e_ident[0..4] != ELFMAG {
uart::println(b"Not an ELF file!");
return false;
}
// Validate Bitness
if header.e_ident[EI_CLASS] != ELFCLASS64 as u8 {
uart::println(b"Not a 64-bit ELF!");
return false;
}
// Validate Endianness
if header.e_ident[EI_DATA] != ELFDATA2LSB as u8 {
uart::println(b"Invalid endianness!");
return false;
}
// Validate OS ABI
if header.e_ident[EI_OSABI] != ELFOSABI_SYSV as u8 {
uart::println(b"Invalid OS ABI!");
return false;
}
// Validate Type
if header.e_type != ET_EXEC as u16 {
uart::println(b"Invalid type!");
return false;
}
// Validate Machine
if header.e_machine != EM_AARCH64 as u16 {
uart::println(b"Invalid machine!");
return false;
}
return true;
}
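The constants used in these checks come straight from the ELF specification. For illustration, they could be defined as follows (the integer widths in the repository may differ, which is why the checks above cast with as):
// ELF identification values (per the ELF specification).
const ELFMAG: [u8; 4] = [0x7f, b'E', b'L', b'F'];
const EI_CLASS: usize = 4;   // Index of the class byte in e_ident
const EI_DATA: usize = 5;    // Index of the endianness byte
const EI_OSABI: usize = 7;   // Index of the OS ABI byte
const ELFCLASS64: u8 = 2;    // 64-bit objects
const ELFDATA2LSB: u8 = 1;   // Little-endian
const ELFOSABI_SYSV: u8 = 0; // System V ("none")
const ET_EXEC: u16 = 2;      // Statically linked executable
const EM_AARCH64: u16 = 183; // Arm 64-bit architecture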
With the header validated, the load_elf function gets to the heart of the matter. It begins by finding the Program Header Table using the e_phoff offset from the header. Then, it simply iterates through every entry in that table, one by one.
For each program header, it asks one simple question: is the type PT_LOAD? That type marks segments containing data that actually needs to be loaded into memory. If it is a loadable segment, we:
- Copy the data. We move p_filesz bytes from the kernel file’s location to its final destination in physical memory. This single operation places all the kernel’s executable code (.text) and initialized data (.data) into their final locations.
- Clear the BSS. We check if the segment’s in-memory size (p_memsz) is greater than its file size (p_filesz). If it is, this signals a .bss section.
// Parse program headers
phdr_base = elf_base + header.e_phoff as usize;
for i in 0..header.e_phnum {
let phdr = unsafe {
&*((phdr_base + i as usize * mem::size_of::<Elf64Phdr>()) as *const Elf64Phdr)
};
if phdr.p_type == PT_LOAD as u32 {
// PT_LOAD
// Copy segment from ELF to target address
let src = elf_base + phdr.p_offset as usize;
let dst = phdr.p_vaddr as usize;
let size = phdr.p_filesz as usize;
unsafe {
ptr::copy_nonoverlapping(src as *const u8, dst as *mut u8, size);
}
// Zero out BSS if memsz > filesz
if phdr.p_memsz > phdr.p_filesz {
let bss_start = dst + size;
let bss_size = (phdr.p_memsz - phdr.p_filesz) as usize;
unsafe {
ptr::write_bytes(bss_start as *mut u8, 0, bss_size);
}
}
}
}
Once the loop finishes, our job is done. The kernel is now fully laid out in physical memory, ready to run. The load_elf function completes by returning header.e_entry — the address of the kernel’s very first instruction. This return value is passed back to our assembly code, which is waiting to make the final jump and hand over control.
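Putting it together, the overall shape of load_elf is roughly the sketch below. It omits the logging and error handling of the real function, and returning 0 on failure is an assumption of this illustration rather than the repository’s actual convention:
fn load_elf(elf_base: usize) -> usize {
    // Interpret the start of the blob as an ELF header.
    let header = unsafe { &*(elf_base as *const Elf64Ehdr) };

    if !check_elf_header(header) {
        return 0; // Invalid image: nothing sensible to jump to.
    }

    // ... iterate over the program headers and copy every PT_LOAD segment,
    // exactly as in the loop shown above ...

    // Hand the entry point back to the assembly caller.
    header.e_entry as usize
}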
Final stage
With the kernel image now fully loaded and ready in memory, the bootloader enters its final stage. The main goal here is to transition the CPU from the higher privilege level used during early initialization (EL2) down to the kernel’s expected execution level (EL1). This step mirrors what a real firmware or hypervisor would do before handing off control to an operating system.
To achieve this, the bootloader performs three key actions:
- Configure the HCR_EL2 register: set the execution state for EL1 to AArch64 and disable hypervisor calls (HVC).
- Configure the SPSR_EL2 register: mask all interrupts and set the M field so that, on return, the CPU runs in AArch64 at EL1 and handles exceptions using SP_EL1.
- Jump to the kernel: place the kernel’s entry point in ELR_EL2, preparing the CPU to transfer execution there.
The following assembly snippet illustrates these steps:
ENTRY(switch_to_elx)
switch_elx x2, 1f, 2f, 3f
1:
msr elr_el3, x1
b end
2:
msr elr_el2, x1
mrs x2, hcr_el2
/* Execution state for EL1 is AArch64 & Disable hypervisor calls */
ldr x3, =(HCR_EL2_RW | HCR_EL2_HCD)
orr x2, x2, x3
msr hcr_el2, x2
mrs x2, spsr_el2
/* Mask all exceptions. Set execution state to AArch64 & Handle exceptions
* with SP_EL1 */
ldr x3, =(SPSR_EL_DEBUG_MASK | SPSR_EL_SERR_MASK | SPSR_EL_IRQ_MASK | SPSR_EL_FIQ_MASK | \
SPSR_EL_M_AARCH64 | SPSR_EL_M_EL1)
orr x2, x2, x3
msr spsr_el2, x2
b end
3:
/* To be implemented */
b .
end:
eret
ENDPROC(switch_to_elx)
Once the CPU registers are configured, the bootloader executes an eret instruction, returning to the lower privilege level and jumping directly to the kernel’s entry point. From this moment, the kernel takes full control.
At this point, the bootloader’s job is done. Its only purpose was to prepare a clean, controlled environment and pass execution to the kernel safely.
Next steps
With this minimal bootloader in place, the kernel now boots in a more realistic environment — one where it’s handed control from a previous stage, receives a valid DTB pointer, and can begin execution at EL1 just like on real hardware. Even though the loader itself is small, it establishes a clear contract between the boot process and the kernel, which will make future development much cleaner.
In the next post, I’ll move into the kernel side and start parsing the DTB. This will let the kernel discover available hardware directly from the boot information, rather than relying on hardcoded addresses or assumptions. The bootloader code can be found here, and you can follow the ongoing kernel development on my GitHub.