In the previous post we gave an in-depth overview of how virtual memory works, without focusing on any particular architecture. In this post we will discuss how virtual memory is implemented on the x86_64 architecture by building a Linux kernel module. At the end, you will find a link to download the source code.
Virtual memory in x86_64
On x86_64, “Paging cannot be enabled before the processor is switched to protected mode”. This post does not cover the boot process; we will focus only on the phase in which virtual memory is enabled.
Enabling paging
Paging is enabled by performing the following actions, assuming the processor is already in protected mode (CR0.PE=1).
- Disable paging if it has been enabled before: CR0.PG=0.
- Enable PAE (Physical Address Extension): CR4.PAE=1.
- Set LME (Long Mode Enable): IA32_EFER.LME=1.
- Load the cr3 register with the physical address of the top-level paging structure (the PML4 table, which Linux calls the PGD).
- Enable paging: CR0.PG=1.
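To make the register bits concrete, here is a minimal sketch that replays the sequence on plain variables instead of real control registers. The bit positions are the documented ones; the struct sim_regs and the function sim_enable_long_mode_paging are hypothetical names invented for this illustration. Real code would run in ring 0 and use mov to CRn and wrmsr instead.

```c
#include <stdint.h>

/* Documented bit positions in CR0, CR4 and IA32_EFER. */
#define CR0_PE   (1ULL << 0)   /* Protection Enable */
#define CR0_PG   (1ULL << 31)  /* Paging */
#define CR4_PAE  (1ULL << 5)   /* Physical Address Extension */
#define EFER_LME (1ULL << 8)   /* Long Mode Enable */

/* Hypothetical stand-in for the real registers (illustration only). */
struct sim_regs {
    uint64_t cr0, cr3, cr4, efer;
};

/*
 * Replays the enabling sequence on the simulated registers.
 * Returns 0 on success, -1 if a precondition is not met.
 */
static int sim_enable_long_mode_paging(struct sim_regs *r, uint64_t pml4_phys)
{
    if (!(r->cr0 & CR0_PE))
        return -1;              /* paging requires protected mode first */
    if (pml4_phys & 0xfffULL)
        return -1;              /* top-level table must be 4KiB aligned */

    r->cr0  &= ~CR0_PG;         /* disable paging while reconfiguring */
    r->cr4  |= CR4_PAE;         /* enable PAE */
    r->efer |= EFER_LME;        /* set Long Mode Enable */
    r->cr3   = pml4_phys;       /* point cr3 at the top-level table */
    r->cr0  |= CR0_PG;          /* enable paging */
    return 0;
}
```

Calling the function on a register set without CR0_PE fails, which mirrors the requirement that protected mode comes first.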
Once paging has been enabled, you cannot switch from 4-level paging to 5-level paging (and vice-versa) directly. The same is true for switching to legacy 32-bit paging. You must first disable paging by clearing CR0.PG before making changes. Failure to do so will result in a General Protection Exception (#GP).

Once the paging configuration is set up, we end up with one of two possibilities: 4-level or 5-level paging. In this post we will focus on 4-level paging, because 5-level paging works the same way with one more level of indirection. In addition, because the computer used to test the module has 4-level paging and a 4KiB page size, the module will need to be adjusted to work properly on a system with 5-level paging or with a different 4-level paging configuration.
Page Table Layout
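In the Linux kernel we find four levels of page tables when 4-level paging is in use (recent kernels also define a P4D level between the PGD and the PUD, which is folded away in this configuration). A table consists of an array of entries of type pXX_t, each wrapping a pXXval_t.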
- Page Global Directory (PGD) – pgd_t/pgdval_t.
- Page Upper Directory (PUD) – pud_t/pudval_t.
- Page Middle Directory (PMD) – pmd_t/pmdval_t.
- Page Table Entry (PTE) – pte_t/pteval_t.
In fact, these datatypes are wrappers around fundamental architecture-dependent types. For instance, pteval_t => unsigned long.
A virtual address is then simply a set of indices into these tables, plus an offset into the final page. In the typical case of a 4KiB page size, each of the PGD, PUD, PMD, and PTE tables contains 512 entries, and since each entry is 8 bytes wide, a table occupies 4KiB, which is exactly one page of memory. The number of entries per table is defined by the PTRS_PER_Pxx preprocessor constants. All these macros and typedefs can be found in arch/x86/include/asm/pgtable_types.h and arch/x86/include/asm/pgtable_64_types.h.
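The size arithmetic can be checked with a small userspace sketch. The sketch_ names below are invented for this illustration and only mimic the wrapper style of the kernel types; they are not the kernel definitions.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE      4096u
#define SKETCH_PTRS_PER_TABLE 512u

/* Raw entry value and its single-member wrapper, in the spirit of
 * pteval_t and pte_t (hypothetical names, not the kernel's). */
typedef uint64_t sketch_pteval_t;
typedef struct { sketch_pteval_t pte; } sketch_pte_t;

/* One table at any level: 512 entries of 8 bytes each. */
typedef struct {
    sketch_pte_t entries[SKETCH_PTRS_PER_TABLE];
} sketch_table_t;

/* 512 * 8 bytes == 4096 bytes, so a table fills exactly one 4KiB page. */
_Static_assert(sizeof(sketch_pte_t) == 8, "entry must be 8 bytes");
_Static_assert(sizeof(sketch_table_t) == SKETCH_PAGE_SIZE,
               "table must occupy exactly one page");

/* Unwraps an entry to its raw value. */
static sketch_pteval_t sketch_pte_val(sketch_pte_t e)
{
    return e.pte;
}
```

Wrapping the raw value in a struct means the compiler rejects accidental mixing of entries from different levels, which is one reason to prefer the wrapper over a bare integer.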
In the following section, the process to obtain the physical address from a virtual one is explained.
Translation mechanism
When a virtual address arrives at the MMU, it is first split into several parts. The leftmost part, bits 63:48, is not used for translation: with 4-level paging only 48 of the 64 bits are translated, and the upper 16 bits must be a sign extension of bit 47. The sign-extended part also serves to identify the kind of address. If it contains only 1s, the address belongs to kernel space; if it contains only 0s, the address belongs to user space. This divides the address space as follows:
- User space: 0x0000000000000000 to 0x00007fffffffffff.
- Unused hole: 0x0000800000000000 to 0xffff7fffffffffff.
- Kernel space: 0xffff800000000000 to 0xffffffffffffffff.
All addresses in kernel space or in user space are called canonical addresses. There is a non-canonical hole between these two regions, and any access to an address in that hole causes a #GP. Together, the two canonical regions (kernel space and user space) span exactly 2^48 bytes (256TiB).
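The canonical check can be expressed in a few lines. This is a minimal sketch for 4-level paging only; the function names are made up for this example.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * With 48-bit virtual addresses, bits 63:47 (17 bits) must be all 0s
 * or all 1s for the address to be canonical.
 */
static bool is_canonical48(uint64_t vaddr)
{
    uint64_t upper = vaddr >> 47;   /* keep bits 63:47 */
    return upper == 0 || upper == 0x1ffff;
}

/* A canonical address with the top bit set lies in the kernel half. */
static bool is_kernel_half(uint64_t vaddr)
{
    return is_canonical48(vaddr) && (vaddr >> 63) != 0;
}
```

The boundaries of the three regions listed above are exactly the points where this check flips.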
The remaining parts are used in the translation as follows:
- The cr3 register stores the physical address of the top-level paging structure.
- Bits 47:39 of the given linear address store an index into the PGD.
- Bits 38:30 store an index into the PUD.
- Bits 29:21 store an index into the PMD.
- Bits 20:12 store an index into the page table (PTE level).
- Bits 11:0 provide the byte offset into the physical page.
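The bit ranges above translate directly into shifts and masks. The following is a small sketch for 4-level paging with 4KiB pages; struct va_parts and split_vaddr are illustrative names, not taken from the module.

```c
#include <stdint.h>

/* The five fields of a 48-bit virtual address (hypothetical struct). */
struct va_parts {
    unsigned pgd;     /* bits 47:39 */
    unsigned pud;     /* bits 38:30 */
    unsigned pmd;     /* bits 29:21 */
    unsigned pte;     /* bits 20:12 */
    unsigned offset;  /* bits 11:0  */
};

/* Each index is 9 bits wide (512 entries); the offset is 12 bits. */
static struct va_parts split_vaddr(uint64_t vaddr)
{
    struct va_parts p;

    p.pgd    = (unsigned)((vaddr >> 39) & 0x1ff);
    p.pud    = (unsigned)((vaddr >> 30) & 0x1ff);
    p.pmd    = (unsigned)((vaddr >> 21) & 0x1ff);
    p.pte    = (unsigned)((vaddr >> 12) & 0x1ff);
    p.offset = (unsigned)(vaddr & 0xfff);
    return p;
}
```

For instance, the first kernel-space address 0xffff800000000000 has PGD index 256 and zero everywhere else, which shows that the kernel occupies the upper half of the PGD.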
The process is described in the following picture.

NOTE: In Intel’s manual the terms are slightly different. For example, what Linux calls the PGD receives the name PML4T there.
While walking each level, the MMU also checks the permission bits and other control bits before moving on to the next one.

As can be observed, some of these bits are present at more than one level. Those bits have the same meaning across all levels. Other bits, by contrast, only appear at certain levels. The meaning of each bit is described below:
- Present (P) bit: This bit indicates whether the page-translation table or physical page is loaded in physical memory. When the P bit is cleared to 0, the table or physical page is not loaded in physical memory. When the P bit is set to 1, the table or physical page is loaded in physical memory. Software clears this bit to 0 to indicate a page table or physical page is not loaded in physical memory. A page-fault exception (#PF) occurs if an attempt is made to access a table or page when the P bit is 0. System software is responsible for loading the missing table or page into memory and setting the P bit to 1.
- Read/Write (R/W) bit: This bit controls read/write access to all physical pages mapped by the table entry. When the R/W bit is cleared to 0, access is restricted to read-only. When the R/W bit is set to 1, both read and write access is allowed.
- User/Supervisor (U/S) bit: This bit controls user (CPL 3) access to all physical pages mapped by the table entry. When the U/S bit is cleared to 0, access is restricted to supervisor level (CPL 0, 1, 2). When the U/S bit is set to 1, both user and supervisor access is allowed.
- Page-Level Writethrough (PWT) bit: This bit indicates whether the page-translation table or physical page to which this entry points has a writeback or writethrough caching policy. When the PWT bit is cleared to 0, the table or physical page has a writeback caching policy. When the PWT bit is set to 1, the table or physical page has a writethrough caching policy.
- Page-Level Cache Disable (PCD) bit: This bit indicates whether the page-translation table or physical page to which this entry points is cacheable. When the PCD bit is cleared to 0, the table or physical page is cacheable. When the PCD bit is set to 1, the table or physical page is not cacheable.
- Accessed (A) bit: This bit indicates whether the page-translation table or physical page to which this entry points has been accessed. The A bit is set to 1 by the processor the first time the table or physical page is either read from or written to. The A bit is never cleared by the processor. Instead, software must clear this bit to 0 when it needs to track the frequency of table or physical-page accesses.
- Dirty (D) bit: This bit is only present in the lowest level of the page-translation hierarchy. It indicates whether the physical page to which this entry points has been written. The D bit is set to 1 by the processor the first time there is a write to the physical page. The D bit is never cleared by the processor. Instead, software must clear this bit to 0 when it needs to track the frequency of physical-page writes.
- Page Size (PS) bit: This bit is present in page-directory entries and long-mode page-directory-pointer entries. When the PS bit is set in the page-directory-pointer entry (PDPE) or page-directory entry (PDE), that entry is the lowest level of the page-translation hierarchy. When the PS bit is cleared to 0 in all levels above PTE, the lowest level of the page-translation hierarchy is the page-table entry (PTE), and the physical-page size is 4 Kbytes. The physical-page size is determined as follows.
- If EFER.LMA=1 and PDPE.PS=1, the physical-page size is 1 Gbyte.
- If CR4.PAE=0 and PDE.PS=1, the physical-page size is 4 Mbytes.
- If CR4.PAE=1 and PDE.PS=1, the physical-page size is 2 Mbytes.
- Global Page (G) bit: This bit is only present in the lowest level of the page-translation hierarchy. It indicates the physical page is a global page. The TLB entry for a global page (G=1) is not invalidated when cr3 is loaded during a task switch. Use of the G bit requires the page-global enable bit in cr4 to be set to 1 (CR4.PGE=1).
- Available to Software (AVL) bit: These bits are not interpreted by the processor and are available for use by system software.
- Page-Attribute Table (PAT) bit: The PAT bit is the high-order bit of a 3-bit index into the PAT register. The other two bits involved in forming the index are the PCD and PWT bits. Not all processors support the PAT bit by implementing the PAT registers. This bit is only present in the lowest level of the page-translation hierarchy, as follows:
- If the lowest level is a PTE (PDE.PS=0), PAT occupies bit 7.
- If the lowest level is a PDE (PDE.PS=1) or PDPE (PDPE.PS=1), PAT occupies bit 12.
- Memory Protection Key (MPK) bits: When Memory Protection Keys are enabled (CR4.PKE=1), this 4-bit field selects the memory protection key for the physical page mapped by this entry. Ignored if memory protection keys are disabled (CR4.PKE=0).
- No Execute (NX) bit: The NX bit can only be set when the no-execute page-protection feature is enabled by setting EFER.NXE to 1. If EFER.NXE=0, the NX bit is treated as reserved. In this case, a page-fault exception (#PF) occurs if the NX bit is not cleared to 0.
- Reserved bits: Software should clear all reserved bits to 0. If the processor is in long mode, or if page-size and physical-address extensions are enabled in legacy mode, a page-fault exception (#PF) occurs if reserved bits are not cleared to 0.
- MBZ bit: Must be zero.
- IGN bit: Ignored.
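As an illustration of how these bits can be read from a raw 64-bit entry, here is a minimal sketch. The bit positions are the documented ones for long mode (bit 7 is PS at the upper levels and PAT in a PTE); the ENT_ macro names and helper functions are assumptions made for this example and are not the kernel's _PAGE_* macros.

```c
#include <stdint.h>
#include <stdbool.h>

#define ENT_P    (1ULL << 0)   /* Present */
#define ENT_RW   (1ULL << 1)   /* Read/Write */
#define ENT_US   (1ULL << 2)   /* User/Supervisor */
#define ENT_PWT  (1ULL << 3)   /* Page-Level Writethrough */
#define ENT_PCD  (1ULL << 4)   /* Page-Level Cache Disable */
#define ENT_A    (1ULL << 5)   /* Accessed */
#define ENT_D    (1ULL << 6)   /* Dirty (lowest level only) */
#define ENT_PS   (1ULL << 7)   /* Page Size (PDE/PDPE); PAT in a PTE */
#define ENT_G    (1ULL << 8)   /* Global (lowest level only) */
#define ENT_NX   (1ULL << 63)  /* No Execute */

/* Bits 51:12 hold the physical address of the next table or page. */
#define ENT_ADDR_MASK 0x000ffffffffff000ULL

/* Tests whether a single flag bit is set in the entry. */
static bool ent_has(uint64_t entry, uint64_t flag)
{
    return (entry & flag) != 0;
}

/* Physical address of the next-level table or of the 4KiB page. */
static uint64_t ent_phys(uint64_t entry)
{
    return entry & ENT_ADDR_MASK;
}

/* Memory protection key, bits 62:59 (meaningful only if CR4.PKE=1). */
static unsigned ent_pkey(uint64_t entry)
{
    return (unsigned)((entry >> 59) & 0xf);
}
```

A value such as 0x8000000012345067 decodes as present, writable, user-accessible, accessed, dirty and non-executable, pointing at physical address 0x12345000.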
Module overview
The Linux kernel module has been written on an x86_64 machine running a 5.15 kernel. It takes three arguments:
- pid: PID of the process to inspect.
- vaddr: Virtual address to transform.
- value: Expected value in vaddr location.
The idea is to run the test program provided with the source code and then pass the corresponding arguments to the module. It is also possible to modify the module to translate kernel virtual addresses.
The module defines the function virt_to_phys. This function performs the whole page walk while printing information about each level; for example, it first prints the physical address contained in cr3. The page walk is implemented in a “raw” form, i.e. without helper functions such as pgd_offset and without datatypes such as pgd_t. It was written this way to serve as an academic resource, since I think it makes the navigation easier to follow. A second source file is included, which defines the functions that print information about the bits at each level. In addition, two Makefiles are provided to compile the module and the test program. All the code is hosted on my GitHub, in a repo called x86page-walker.
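To give a feel for what a “raw” walk looks like, here is a userspace sketch that walks four levels over a small simulated physical memory. It is not the module's virt_to_phys: the sim_ names, the frame array, and the simplifications (4KiB pages only, no PS handling, no permission checks) are all assumptions made for this example.

```c
#include <stdint.h>

#define SIM_ENTRIES   512
#define SIM_FRAMES    8
#define SIM_P_BIT     (1ULL << 0)
#define SIM_ADDR_MASK 0x000ffffffffff000ULL

/* Simulated physical memory: frame n sits at "physical" address n * 4096. */
static uint64_t sim_mem[SIM_FRAMES][SIM_ENTRIES];

/*
 * Walks PGD -> PUD -> PMD -> PTE using only shifts and masks.
 * Returns 0 and stores the physical address in *paddr on success,
 * or -1 if an entry is not present or points outside simulated memory.
 */
static int sim_walk(uint64_t cr3, uint64_t vaddr, uint64_t *paddr)
{
    static const int shifts[4] = { 39, 30, 21, 12 };
    uint64_t table = cr3 & SIM_ADDR_MASK;

    for (int level = 0; level < 4; level++) {
        uint64_t frame = table >> 12;
        uint64_t index = (vaddr >> shifts[level]) & 0x1ff;
        uint64_t entry;

        if (frame >= SIM_FRAMES)
            return -1;                  /* outside simulated memory */
        entry = sim_mem[frame][index];
        if (!(entry & SIM_P_BIT))
            return -1;                  /* would raise #PF on real hardware */
        table = entry & SIM_ADDR_MASK;  /* next table, or the final page */
    }
    *paddr = table | (vaddr & 0xfff);   /* add the 12-bit page offset */
    return 0;
}
```

To try it, fill sim_mem so that each level points to the next frame (with the P bit set) and call sim_walk with cr3 set to the address of the frame that holds the top-level table.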
Conclusions
Over the course of the virtual memory series, we have learned:
- What virtual memory is and what problems it solves.
- How virtual memory is implemented in an efficient way.
- How the address translation mechanism works.
- How to speed up translation mechanism by implementing a TLB.
- How to create a driver to understand the page translation process.